Thursday, June 14, 2012

ConcurrentHashMap

This article is based on Java 1.6 code. In the 1.7 version, the 'Unsafe' class is used to write directly to memory locations.


Data Structure

ConcurrentHashMap's main motive is to allow concurrent access to the map. Hashtable offers synchronized access to the map, but the entire map is locked to perform any sort of operation.

In ConcurrentHashMap, the basic strategy is to subdivide the table among segments so that the lock is applied only to a segment rather than the entire table. Each segment manages its own internal hash table whose size is a power of two >= (capacity / number of segments).

Locking is applied only for updates. In the case of retrievals, full concurrency is allowed; retrievals reflect the results of the most recently completed update operations.

If we assume four threads are going to work concurrently on a map of initial capacity 32, we would want the table to be partitioned into four segments, each managing a hash table of capacity 8.
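This partitioning can be sketched with the same power-of-two rounding the 1.6 constructor performs internally (the variable names here are illustrative, not the JDK's):

```java
// Illustrative sketch: deriving segment count and per-segment capacity
// for initialCapacity = 32 and an expected concurrency level of 4.
int concurrencyLevel = 4;
int initialCapacity = 32;

// Round the number of segments up to a power of two >= concurrencyLevel.
int ssize = 1;
while (ssize < concurrencyLevel)
    ssize <<= 1;                       // ssize = 4 segments

// Each segment's table is a power of two >= initialCapacity / ssize.
int c = initialCapacity / ssize;
if (c * ssize < initialCapacity)
    ++c;
int cap = 1;
while (cap < c)
    cap <<= 1;                         // cap = 8 buckets per segment

System.out.println(ssize + " segments, each with " + cap + " buckets");
```

With these inputs the sketch yields 4 segments of 8 buckets each, matching the example above.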



ConcurrentHashMap Data Structure would be:


[Figure: ConcurrentHashMap data structure]




For comparison's sake, the figure below shows the data structure of a HashMap.

[Figure: HashMap data structure]



Segment

The collection maintains a list of 16 segments by default, each of which is used to guard (or lock on) a single bucket of the map. This effectively means that 16 threads can modify the collection at a single time (as long as they’re all working on different buckets). This level of concurrency can be increased using the optional concurrencyLevel constructor argument.

public ConcurrentHashMap(int initialCapacity,
                         float loadFactor, int concurrencyLevel)
The maximum number of segments is 2^16. The greater the concurrency level, the greater the number of segments and the smaller the hash table each segment manages. Using a significantly higher value than you need can waste space and time, and a significantly lower value can lead to thread contention.
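For example, an application expecting roughly four concurrent writers might construct the map like this (a usage sketch; the key/value types and sizes are arbitrary):

```java
import java.util.concurrent.ConcurrentHashMap;

// initialCapacity = 32, loadFactor = 0.75, concurrencyLevel = 4:
// internally this yields 4 segments, each managing a table of 8 buckets.
ConcurrentHashMap<String, Integer> map =
        new ConcurrentHashMap<String, Integer>(32, 0.75f, 4);

map.put("visits", 1);
```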

Put Entry

New Entry

ConcurrentHashMap doesn't allow NULL values. The key cannot be NULL, as it is used to locate the segment index and then the actual bucket within the segment. The key's hash is re-hashed to strengthen the hash value and defend against poor-quality hash functions. This is important because the hash is used to locate both the segment and the hash table index: the upper bits select the segment index and the lower bits select the table index within the segment. This is why the re-hashing function differs from HashMap's; the hash bits need to be spread better to regularize both segment and index locations.

    private static int hash(int h) {
        // Spread bits to regularize both segment and index locations,
        // using variant of single-word Wang/Jenkins hash.
        h += (h <<  15) ^ 0xffffcd7d;
        h ^= (h >>> 10);
        h += (h <<   3);
        h ^= (h >>>  6);
        h += (h <<   2) + (h << 14);
        return h ^ (h >>> 16);
    }
[Figure: locating the segment and bucket for a new entry]
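The upper-bits/lower-bits split can be sketched as plain statements. The mixing steps follow the Wang/Jenkins-style re-hash used by the 1.6 hash() above; the segment count, shift, and table length below are illustrative values for a default 16-segment map:

```java
int segments = 16;                 // default concurrency level
int segmentShift = 32 - 4;         // 16 = 2^4 segments -> top 4 bits of hash
int tableLength = 8;               // per-segment table size (example value)

int h = "someKey".hashCode();
// Strengthen the hash (same Wang/Jenkins-style mixing as 1.6's hash()).
h += (h << 15) ^ 0xffffcd7d;
h ^= (h >>> 10);
h += (h << 3);
h ^= (h >>> 6);
h += (h << 2) + (h << 14);
h ^= (h >>> 16);

int segmentIndex = (h >>> segmentShift) & (segments - 1); // upper 4 bits -> 0..15
int bucketIndex  = h & (tableLength - 1);                 // lower 3 bits -> 0..7
```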





Put If Absent

ConcurrentMap offers a new method, putIfAbsent(), which is similar to put except that the value is not overridden if the key already exists. It is equivalent to

if (!map.containsKey(key))
    return map.put(key, value);
else
    return map.get(key);
except that the action is performed atomically.

containsKey is not synchronized, so the above code may cause unexpected behavior. Consider the following scenario:
  1. Thread A tries to remove a key: it calls remove(key) and acquires the lock.
  2. Thread B executes the code above and calls containsKey(key). The key still exists because Thread A has not yet released the lock, so the old value is returned.
  3. Thread A removes the key and releases the lock.

This scenario shows that the code is not atomic: it returned a value for a key it believed existed but which, by then, didn't.
It is also slow: containsKey has to locate the segment and traverse the table to find the key, and put must then repeat the same traversal to locate the bucket and store the value. putIfAbsent is a bit faster because it avoids this double traversal. It goes through the same internal flow as put, but if the key already exists it skips overriding the old value and simply returns it.
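A quick usage sketch of the atomic behavior:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

ConcurrentMap<String, Integer> map = new ConcurrentHashMap<String, Integer>();

map.put("a", 1);
Integer prev = map.putIfAbsent("a", 2);   // key exists: value left untouched
// prev == 1 and map.get("a") is still 1

Integer none = map.putIfAbsent("b", 3);   // key absent: value inserted
// none == null and map.get("b") is 3
```

The return value mirrors put: the previous value if the key was present, or null if the mapping was newly created.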

Class Diagram

Below class diagram shows that ConcurrentHashMap is made up of Segments, each of which in turn is made up of HashEntries.


[Figure: ConcurrentHashMap class diagram]



Remove Entry

HashEntry is an immutable class, so its next link can't be adjusted. All entries following the removed node stay in the list, but all preceding ones need to be cloned.

HashEntry<K,V> newFirst = e.next;
for (HashEntry<K,V> p = first; p != e; p = p.next)
    newFirst = new HashEntry<K,V>(p.key, p.hash, newFirst, p.value);
tab[index] = newFirst;
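The clone-the-prefix idea can be demonstrated with a toy immutable list, a simplified stand-in for HashEntry (the Node class and remove method here are made up for illustration):

```java
// Toy immutable node: once built, its next link never changes,
// mirroring HashEntry in the 1.6 implementation.
final class Node {
    final String key;
    final Node next;
    Node(String key, Node next) { this.key = key; this.next = next; }
}

Node remove(Node first, String key) {
    Node e = first;
    while (e != null && !e.key.equals(key))
        e = e.next;
    if (e == null)
        return first;                  // key not found: list unchanged
    Node newFirst = e.next;            // everything after e is reused as-is
    for (Node p = first; p != e; p = p.next)
        newFirst = new Node(p.key, newFirst);  // clone the prefix
    return newFirst;                   // note: the cloned prefix ends up reversed
}
```

Removing "b" from a -> b -> c yields a list containing a and c, where the nodes after the removed one are shared with the old list; readers still traversing the old chain continue to see a consistent snapshot.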
[Figure: removing an entry by cloning the preceding nodes]




Re-sizing

Since power-of-two expansion is used, the elements from each bin must either stay at the same index or move with a power-of-two offset.

int idx = e.hash & sizeMask;

For a capacity of 16 elements, sizeMask would be 1111, so the lower 4 bits of the hash form the index.
When the capacity is doubled, sizeMask becomes 11111 and the lower 5 bits of the hash form the index.

So the index stays the same if the 5th bit is 0; otherwise the index is incremented by 2^4 (16).
The trailing consecutive sequence of nodes can be re-used, instead of creating new nodes unnecessarily, if they all point to the same slot after the re-hashing. In the example below, the trailing nodes in blue hues all end up at the same index.
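The stay-or-move-by-2^4 rule can be checked directly (the hash value 22 is chosen purely for illustration):

```java
int oldCapacity = 16;                      // sizeMask = 1111
int newCapacity = 32;                      // sizeMask = 11111

int hash = 22;                             // binary 10110: the 5th bit is 1
int oldIndex = hash & (oldCapacity - 1);   // 10110 & 01111 = 00110 = 6
int newIndex = hash & (newCapacity - 1);   // 10110 & 11111 = 10110 = 22

// The index moved by exactly the old capacity: 6 + 16 = 22.
```

Had the 5th bit been 0 (say, hash = 6), both masks would have produced the same index.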







20 comments:

  1. Waooo..... Finally i read something useful. Thanks a Lot.

    ReplyDelete
  2. Hi Sathis,

    Nice. It'd be good if you can explain diagram also.

    Thanks
    PC

    ReplyDelete
  3. where do you get these information? Really very nice post..Thanks :)

    ReplyDelete
    Replies
    1. Hi Harsh,
      Thanks for visiting!
      I went through the java source code and tried to reveal the internal data structures used. In general, I am passionate about going through various open sources and extract the data structures and algorithms used.
      I am currently working on kahadb (activemq file persistence database) and H2 (java SQL database), so stay around for future posts.
      Ram Satish

      Delete
  4. Very Nice article...really nice work

    need to explain diagrams little bit.

    ReplyDelete
    Replies
    1. Hi Pavan,

      Thanks! I agree with you. There has been some changes to the implementation of ConcurrentHashMap in Java7, though the data structure is more or less same. I will be posting the revised one soon.

      Ram Satish

      Delete
  5. Nice work Ram!

    Keep it up.. \m/

    ReplyDelete
  6. I have one doubt while removing entry from the concHashMap. I think it does not create any clone it simply finds the node and set its pred node's next to point to the next node in the list. This what I found in jdk7. Could you please clean my doubt.

    ReplyDelete
    Replies
    1. Hi,
      The post is actually based on 1.6 ver# (Added the ver# now in the beginning). In 1.6, cloning is done. whereas In 1.7#, 'Unsafe' class is used to directly write to memory location (the previous entry's next reference is updated to the next entry of the node being removed and this is done atomically directly so no need of cloning).
      Thanks
      Ram Satish

      Delete
  7. Hi Ram

    I have a Question If two threads try to update a ConcurrenthashMap parallely. But One has raise the limit to 75% and other raised the same to 100% . So rehashing will be done at the Bucket's level . How the locking will behave in the Map . Will it provides the lock at the Buckets level or it will acquire lock at the Map level.

    ReplyDelete
    Replies
    1. Hi Gaurav,
      Sorry for delay in my response. I guess by 'bucket' you mean segment.
      Locking is always done at the segment level and NOT at the Map level. If both the threads try to access the same segment, locking will come into picture else there won't be any locking. If the threshold is reached, the segments will be expanded and re-hashing will happen to the entries that belong to that segment.
      Hope this answers your question.
      Thanks
      Ram Satish

      Delete
  8. " The collection maintains a list of 16 segments by default, each of which is used to guard (or lock on) a single bucket of the map. ". By default the size is 16 and there are 16 segments which are locking the buckets in the map. Correct me if i am wrong here...

    As said in the article, if 4 threads access my map, the size of 32 is split into 4 segments where each segment is holding a hashtable of size 8. At this point there are only 4 segments and what about the buckets state?. Who gaurds the each bucket in the hashtable?.

    I am bit confused on these lines?.

    ReplyDelete
    Replies
    1. Hi Kevin
      Once the segment is identified, the lock needs to be obtained. Unless it obtains the lock, it won't be updating the bucket.
      So to your question my answer is - The lock guards each bucket in the segment's hashtable.
      Thanks
      Ram Satish

      Delete
  9. Hi Ram,

    This is really nice article. Can you explain on how you choose the lower and higher bits from "0110 0111 1100 1010 1100 0111 1001 1011". how come the table's index is 1?.

    ReplyDelete
    Replies
    1. Since the segments length is 16, the upper 4 bits of the hash will be used to derive to a segment index.
      16 = 2 to the power of 4 so 4 bits will be used for index calculation and these 4 bits will come from the upper bits of hash.
      To identify the upper 4 bits of hash,we need to right shift the hash bits by 32 - 4 = 28.
      By default, we have 16 segments, and each segment must have minimum of 2 elements so in its hash table.
      The hashtable index = hash & (hashtable length -1) thus only the last lower bit will come in picture to identify the bucket index as table length is 2.

      Delete
  10. Awesome Work, even daigram itself is enough to get clear picture of the concept. Thanks a tons !

    ReplyDelete