Thursday, June 14, 2012

HashMap

Internal working of HashMap


A hash map is a data structure that uses a hash function to map keys to their associated values. This article describes the internal workings of a HashMap. The main source of the article is java source code.

For Q&A, visit HashMap Q&A.

Data Structure

In HashMap, the data structure is based on array and linked list. An entry finds its location in the array based on its hash value. If an array element is already occupied, the new entry replaces the old entry and the old entry is linked to the new one. Below is an example of the entries in the data structure.




Hash Map Data Structure
HashMap Data Structure



Put Entry

When a new entry is assigned to an index, the existing entry to the index, if any, becomes the next entry to the new entry.


Hash Map New Entry
Adding an entry to HashMap


Entry

An Entry consists of key, value, hash and the next entry sharing the index.


Hash Map Entry Class diagram
Entry Class diagram

Capacity

The default initial capacity MUST be a power-of-two. Even if the caller provides a capacity c which is 2x-1 > c < 2x, the capacity is converted to 2x to create the internal array.

Hash Map Capacity
Capacity Adjustment to match power-of-two



If the capacity of array is in power-of-two, the hash code can be easily converted to the index based on a simple AND operation:

index =  hashCode & (length-1)

Here length = 2x.

In Hashtable, the capacity doesn't need to be in power of 2 and the internal array is created for whatever capacity is provided but the index calculation is based on modulus operand:

index = (hashCode & 0x7FFFFFFF) % length.
Since the capacity is in power of 2, a simple AND operation is able to return us the index.

Lets see how it works using an example.

Suppose the hash=311 and length=16, applying the modulus operation we get:
hash = 16 * 19 + 7 (length*quotient + index), thus index = 7.

Let us convert the hash to binary.
311 = 1 0011 0111 in binary
= 1 0011 0111 = 1 0011 0000 + 0111
= 24*1 + 25*1 + 26*0 + 27*0 + 28*1 + 0111
= 24(20*1 + 21*1 + 22*0 + 23*0 + 24*1) + 0111
= 24(10011) + 0111

This is in the same form as the length*quotient + index, here index = 0111, thus if length=2x, index is nothing but the lower x bits of hash code.

We can obtain these lower bits by doing & operation.

index = 0111 = 1 0111 0111 & 1111

Thus index = hash & (2length - 1).
public void testFindModulusByAND() {
        int n = 311;
        int d = 1 << 4;
        int m = 7;
        assertEquals(m, n & (d -1));
    }

Re-hashing

Since HashMap uses power-of-two length hash table, the lower bits form the index. If the hash function is poor, it may lead to different hash codes but with same lower bits.

If a hash code results into the same index, we encounter collisions. In order to avoid this, HashMap applies a supplemental hash function to a given hashCode. For example, hash codes 375 and 311, has the same lower bits so they return the same index.

375 = 1 0111 0111
311 = 1 0011 0111

The index would be the lower four bits 0111=7.
HashMap applies the below hash function to refine the original hash code.
 h ^= (h >>> 20) ^ (h >>> 12);
 h ^= (h >>> 7) ^ (h >>> 4);

When we apply it on 311, we get 294=1 0010 0110, index is 6.
When we apply it on 375, we get 354=1 0110 0010, index is 2.

Collision

If two different key objects return the same hashcode, it will result in collision as the bucket location would be same. The new entry will replace the previous entry and the next reference of the new entry will point to the previous entry. For hash tables, this means that the second table insertion will be slower as the links have to be adjusted.




Hash Map Collision
Hashmap Collision




Get Entry

The key is converted to its hash value and then HashMap's internal hash function is applied on it to get the actual hash code. This hash code is then used to identify the bucket.

The Object sitting in the bucket is of type Entry so it knows the key, value, hash and its next entry, if any. There may be more than one object stored in same bucket, in such case, the entry will be a linked list where next entry will be its next node.

We must traverse through the list to identify the Object for the associated key. If the entry's hash matches with key's hash value and its key matches with the key in context, it means we have found the correct node.
  1. e.hash == hash
  2. e.key == key || key.equals(k)
It is important that the key used in HashMap should be immutable, final object with proper equals() and hashcode() implementation.

NULL Key

HashMap accepts Null key, index 0 is reserved for the Null key entry.


HashMap Null Key
HashMap with a NullKey 

Re-sizing

Threshold


Threshold defines the next size value at which to resize the hash table. If the size of the map exceeds a given threshold defined by load-factor e.g. if load factor is .75, it will act to re-size the map once it filled 75%. Java HashMap does that by creating another new bucket array of size twice of previous size of HashMap. The new entry can occupy any index based on the hash value.

In the below diagram, the emphasis is on the relation between size, threshold and expansion of capacity. In order to simplify the diagram, it is assumed the new entries occupy array elements in serial order.




Hash Map Threshold
HashMap Threshold Mechanics





All the entries are transferred from current table to new Table.



HashMap Re-sizing




This process is called rehashing because it also applies hash function to find new bucket location.

Transfer of elements

Whenever we have to re-size the hash table, the elements from the old array will have to be transferred to the new expanded array. Since the capacity is doubled, the elements based on the hash key may either stay at same index, or move with a power of two offset.



HashMap Transfer of entries
Re-hashing of HashMap

Iterator

Since the hash value determines the bucket location, the iterator doesn't promises any order. Iterator returns the first not null element in the array starting with index=0.
Below example puts keys in order "Hello","!", "How","are","you" to a HashMap object but iterator returns in completely different order. It returns elements from the first available entry, its next entry in the linked list, if any, else the next available entry in the table till it reaches the end of the table.



Hash Map Iterator
Order of elements returned by the iterator



Iterator Example

    public void testMapItr() {
        Map m = new HashMap();
        m.put("Hello", null);
        m.put("!", null);
        m.put("How", null);
        m.put("are", null);
        m.put("you", null);
        Iterator itr = m.keySet().iterator();
        while (itr.hasNext()) {
            String key = itr.next();
            int hash = hash(key.hashCode());
            System.out.println("table[" + (hash & 15) + "]=" + key);
        }
    }
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }


Result:
table[3]=are
table[3]=!
table[10]=you
table[11]=How
table[14]=Hello

More Java Maps

There are more articles on Java Maps that you may want to read here:
  1. Concurrent HashMap
  2. Linked HashMap
  3. TreeMap

Home

11 comments:

  1. Nice!

    the following code actually is generating same index which is an example for collision,

    public class ItemTest {
    public static void main(String[] args) {

    Map m = new HashMap();
    m.put("Hello", null);
    m.put("How", null);
    m.put("are", null);
    m.put("you!", null);

    Iterator itr = m.keySet().iterator();
    while (itr.hasNext()) {
    String key = (String)itr.next();
    int hash = key.hashCode();
    System.out.println("table[" + (hash & 15) + "]=" + key);
    }
    }
    }


    output :

    table[4]=are
    table[2]=you!
    table[0]=How
    table[2]=Hello


    look at index 2

    ReplyDelete
    Replies
    1. Hi there!
      Thanks! I realized I missed out an additional hash function method in my code.
      HashMap applies a supplement hash function on top of key's hash code to further strengthen it since table length is always a power of two. See http://javaopensourcecode.blogspot.in/2012/11/hashmap-q.html for more details.

      I have modified your code, applied supplement hash code, now the collision disappears.
      public class ItemTest {
      public static void main(String[] args) {

      Map m = new HashMap();
      m.put("Hello", null);
      m.put("How", null);
      m.put("are", null);
      m.put("you!", null);

      Iterator itr = m.keySet().iterator();
      while (itr.hasNext()) {
      String key = (String)itr.next();
      int hash = key.hashCode();
      System.out.println("table[" + (hash & 15) + "]=" + key);
      }
      }
      static int hash(int h) {
      h ^= (h >>> 20) ^ (h >>> 12);
      }
      }

      Hash Table[3]=are
      Hash Table[5]=you!
      Hash Table[11]=How
      Hash Table[14]=Hello


      Thanks
      Ram Satish

      Delete
    2. Hi Ram,
      I hope the above two comments are giving wrong idea and are wrong in the way the index is calculated.

      Collision will not happen in HashMap as it is autodealt by the API (API has written a custom hashfunction) There is no need for us to write a hash function in your class.

      You will override hashCode() of Object class only if your class needs to be serialised or placed in any Collections ( to get the unique id)

      The initial program that anonymous has written is actually quite misleading as to calculate the index where the key is stored, the formula used by the anonymous person is

      System.out.println("table[" + (hash & 15) + "]=" + key); which hash*15 that is wrong the hash there is not the hashCode of the key it has to be Map'shashof the key's hashCode

      Please use the following if yo want

      index=(hashFunctionInHashMap) & (MayEntryTableLength-1)

      If we go into details of proving this

      Open the HashMap API and you will notice that to calculate index while you put an Entry, check put method of HashMap there the formula used by HashMap for this is
      int index= (customMap'sHashvalue) & (EntryTableslength-1);


      where customMap'sHashvalue is calculated using function where you need to pass on the hashCode of the Key you are placing in the Entry.

      static int hash(int h) {
      // This function ensures that hashCodes that differ only by
      // constant multiples at each bit position have a bounded
      // number of collisions (approximately 8 at default load factor).
      h ^= (h >>> 20) ^ (h >>> 12);
      return h ^ (h >>> 7) ^ (h >>> 4);
      }

      In the above examples, the hashcode of the keys put in hashcode "Hello" , "How" , "are" and "you!" can be easily obtained by calling the hashCode method on String class as they are String objects

      so, "Hello".hashCode() will give you the value -> 69609650
      "How".hashCode() will give you the value -> 72752
      "are".hashCode() will give you the value -> 96852
      "you!".hashCode() will give you the value -> 3715042


      and EntryTable length by default starts with 16 (power of 2)

      now we calculate custom hash prepared by HashMap for the string hashCode are

      Map's hash for "Hello" = hash(69609650 ) will give you the value -> 74203374
      Map's hash for "How" = hash(72752 ) will give you the value -> 69595
      Map's hash for "are" = hash(96852 ) will give you the value -> 93971
      Map's hash for "you!" = hash(3715042 ) will give you the value -> 3889141


      so index for "Hello" is (74203374 ) & (16-1) = 14
      and index for "How" is (69595 ) & (16-1) = 11
      and index for "are" is (93971 ) & (16-1) = 3
      and index for "you!" is (3889141 ) & (16-1) = 5


      so, Hello will be placed as Map.Entry at index 14
      and How will be placed as Map.Entry at index 11
      and are will be placed as Map.Entry at index 3
      and you! will be placed as Map.Entry at index 5


      which you can see by debugging the program you have written ( Note the above index positions will change depending on the size of the Map)


      So, HashMap internally takes care of collision, no need to write the anotherhash


      You can use debugger Eclipse to confirm the index positions (Look inside map's entrySet->this$0->entrySet table->indexes

      Program modification is posted in the next comment



      Delete
    3. Hi,
      You are right. Java takes care of the hashing. It applies a supplementary hash to strengthen it so that there will be less collisions. The main reason behind this is the table length is always in power-of-two.
      In my iterator example, I used re-hashing only to derive at the table index, the way you did.
      More about why HashMap uses power-of-two table length, you can find in HashMap Q&A post - http://javaopensourcecode.blogspot.hk/2012/11/hashmap-q.html

      Thanks
      Ram

      Delete
  2. The above program to be corrected as follows:

    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;

    @SuppressWarnings("all")
    public class Test extends HashMap{

    public static void main(String[] args) {

    HashMap m = new HashMap();
    m.put("Hello", null);
    m.put("How", null);
    m.put("are", null);
    m.put("you!", null);

    Iterator itr = m.keySet().iterator();
    while (itr.hasNext()) {
    String key = (String)itr.next();
    System.out.println("table[" + (getHashMapsHash(key.hashCode()) & 15) + "]=" + key);
    }

    }
    // Note I had to use the following method from HashMap's class hash method, because though this is marked static in HashMap, I cannot use it with class name like HashMap.hash(key.hashCode), because the method is marked with access modifier default


    static int getHashMapsHash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
    }

    }


    ------output:

    table[3]=are
    table[5]=you!
    table[11]=How
    table[14]=Hello

    ReplyDelete
  3. Sorry , the
    class declaration should be

    public class Test{

    not
    public class Test extends HashMap{


    typo error( was trying something) My Apologies

    ReplyDelete
    Replies
    1. Thanks for your comments. If you have one more put using key "!", there will be a collision at 3rd element.
      m.put("!", null);

      Also, please do visit my other post on Q&A HashMap which has more details on the internals.
      http://javaopensourcecode.blogspot.hk/2012/11/hashmap-q.html

      Delete
  4. Hi I was reading about HashMap and came across this Diagram:

    http://en.wikipedia.org/wiki/File:Hash_table_5_0_1_1_1_1_0_LL.svg

    Which made things a bit confusing, Please correct me if I am wrong:

    Under the Sub-Title "put entry" , in the Diagram showing Entry(in orange) being added
    The Entry being added (Shown in orange) shouldn't it be added to the array(i.e.table) instead of the LinkedList
    and the "previous entry" that would be the element residing in the array would shift to become part of LinkedList .i.e it would become the overflowing entry

    ReplyDelete
    Replies
    1. Your understanding is correct. I guess the diagram can be more re-fined. I took the liberty of simplifying it as the entry object itself is a linked list. Pls. see the paragraph under title "entry". It has a next pointer which points to the next entry. The horizontal boxes in white are the array elements. When the orange entry gets added, it finds that an entry (previous) already exists so it simply shifts this entry by pointing its (orange entry's) next entry to it (the previous one) and the orange entry becomes the first element of the linked list. The orange entry now sits in the array element. The way I have shown in the diagram might be confusing, the array element is an entry object which has (key, value) pair and a next entry. You can assume the orange entry to be within the white box.

      Delete
    2. Could you please explain the diagram from start be avoid confusion...because after going through ur comments I became more confused.I couldn't find what is wrong and what is right explained above.

      It is very much detailed explanation. thanks a lot.Really really helpful.Thanks

      Delete
    3. Please go through the QA on HashMap.
      http://javaopensourcecode.blogspot.com/2012/11/hashmap-q.html

      If you have any specific questions pls. feel free to ask, meantime I will also try to make the post better.

      Delete