Chapter 18. Dictionaries, Hash-Tables and Sets

Note that the get accessor of the indexer V this[K] of the Dictionary<K, V> class in .NET throws an exception (KeyNotFoundException) if the given key does not exist in the dictionary. There are two major implementations of the IDictionary<K, V> interface in the .NET Framework: Dictionary<K, V> and SortedDictionary<K, V>. A hash table is created by calling one of the Dictionary<K, V> constructors.
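The difference between reading through the indexer and reading with TryGetValue() can be sketched as follows. The class LookupDemo and its method names are hypothetical and exist only for this illustration; Dictionary<K, V>, its indexer, TryGetValue() and KeyNotFoundException are the standard .NET members.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper class, used only to illustrate the two ways of reading.
public static class LookupDemo
{
    // Reads through the indexer and reports whether the lookup threw.
    public static bool IndexerThrows(Dictionary<string, string> book, string name)
    {
        try
        {
            string number = book[name];   // get accessor of the indexer
            return false;
        }
        catch (KeyNotFoundException)
        {
            return true;                  // the key was not in the dictionary
        }
    }

    // Reads with TryGetValue(), which signals a missing key without throwing.
    public static string FindOrDefault(
        Dictionary<string, string> book, string name, string fallback)
    {
        string number;
        return book.TryGetValue(name, out number) ? number : fallback;
    }
}
```

When a missing key is an expected situation rather than an error, TryGetValue() (or a preceding ContainsKey() check) is usually the more convenient choice.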

Hashing and Hash-Functions

If the capacity of the table changes while you are working with it, it is highly likely that the order of the pairs will change as well. The GetHashCode() method is inherited from the class Object, which is the root of the hierarchy of .NET classes. The implementation of GetHashCode() in the class Object does not guarantee that the result is unique.

In this case, the value of the number itself is used as the hash code. This implementation is complex, but keep in mind what it is trying to achieve: different hash codes for different input strings, as far as that is possible. The authors of the above code use a small trick (unsafe code) to work directly with the low-level representation of the string in memory.

We leave it to the reader to look at other implementations of the GetHashCode() method in some of the most commonly used .NET types such as int, DateTime, long, float, and double. We have already explained that relying on the implementation inherited from the class Object is not an acceptable solution.

If we use objects of the class that implements GetHashCode() as shown above as keys in a hash table, the table will perform very poorly, because every new element is inserted into the same place. To avoid this behavior, we need a hash function that distributes the keys evenly among the possible values of the hash code. Using the constant 42 as the hash code therefore turns our hash table into a linear list and makes it very inefficient.
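The effect of a constant hash code can be made visible by counting how often Equals() is called while keys are inserted into a standard Dictionary<K, V>. The following is only a sketch: ConstantHashKey and ConstantHashDemo are hypothetical names introduced for this illustration, and the exact number of comparisons depends on the internals of the dictionary.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical key type whose hash code is always the constant 42.
public class ConstantHashKey
{
    public static int EqualsCalls;   // counts how often Equals() is invoked

    public int Id { get; private set; }

    public ConstantHashKey(int id) { Id = id; }

    public override bool Equals(object obj)
    {
        EqualsCalls++;
        ConstantHashKey other = obj as ConstantHashKey;
        return other != null && other.Id == Id;
    }

    // Every key lands in the same chain of the hash table.
    public override int GetHashCode() { return 42; }
}

public static class ConstantHashDemo
{
    // Inserts n distinct keys and returns how many Equals() calls that took.
    public static int CountComparisons(int n)
    {
        ConstantHashKey.EqualsCalls = 0;
        var table = new Dictionary<ConstantHashKey, int>();
        for (int i = 0; i < n; i++)
        {
            table[new ConstantHashKey(i)] = i;
        }
        return ConstantHashKey.EqualsCalls;
    }
}
```

Because each new key has to be compared with the keys already stored in the same chain, the number of comparisons grows much faster than the number of inserted keys, which is exactly the behavior of a linear list.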

First, we need to choose which fields of the class will participate in the implementation of the Equals(object) method. This is necessary because whenever Equals() returns true for two objects, GetHashCode() must return the same value for both of them. Consequently, the fields that do not participate in Equals() should not participate in GetHashCode() either.

After choosing which fields will participate in the calculation of GetHashCode(), we need to obtain an int value from each of them. If the field is an array or a collection, we take the hash code of each of its elements.

If the field is bool, for true we take 1, and for false we take 0 (or directly call the GetHashCode() method on bool). If the field is of type int, byte, short, char, we can convert it to int with the cast operator (int) (or we could directly call GetHashCode()). If the field is of type long, float or double, we could use the result from their own implementations of GetHashCode().

If the field is not a primitive type, we can call the GetHashCode() method of the object. Finally, we sum all the int values obtained, and before each addition we multiply the intermediate result by a prime number (for example 83), ignoring any overflow of the type int. The result is a hash code that is very well distributed over the range of all 32-bit values.

We can expect collisions to be rare with a hash code calculated in this way, since any change in one of the fields involved in GetHashCode() causes a large change in the hash code, which reduces the chance of a collision.
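The steps described above can be sketched as a few small helper methods. HashSteps and its members are hypothetical names chosen for this illustration; the prime 83 is the example value from the text, and null references are mapped to 0 as an assumption of the sketch.

```csharp
using System;

// Hypothetical helpers that mirror the steps described above.
public static class HashSteps
{
    public const int Prime = 83;

    // bool: 1 for true, 0 for false.
    public static int FromBool(bool value)
    {
        return value ? 1 : 0;
    }

    // Non-primitive field: use the object's own GetHashCode() (0 for null).
    public static int FromObject(object value)
    {
        return value == null ? 0 : value.GetHashCode();
    }

    // Array field: combine the values of all elements (0 for null).
    public static int FromArray(int[] values)
    {
        return values == null ? 0 : Combine(values);
    }

    // Multiply the running result by the prime before each addition,
    // deliberately ignoring int overflow.
    public static int Combine(params int[] parts)
    {
        unchecked
        {
            int result = 0;
            foreach (int part in parts)
            {
                result = result * Prime + part;
            }
            return result;
        }
    }
}
```

For example, HashSteps.Combine(1, 2) evaluates to (0 * 83 + 1) * 83 + 2 = 85, while swapping the two values gives 167, so the order of the fields influences the result.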

Point3D.cs public class Point3D
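The full listing of Point3D is not reproduced here. As a stand-in, the following minimal sketch shows what such a class could look like when Equals() and GetHashCode() are written according to the rules above. It assumes three coordinates of type double named X, Y and Z and the prime 83; these details are assumptions of the sketch, not necessarily those of the original listing.

```csharp
using System;

// Sketch only: the coordinate names and types are assumptions.
public class Point3D
{
    public double X { get; private set; }
    public double Y { get; private set; }
    public double Z { get; private set; }

    public Point3D(double x, double y, double z)
    {
        X = x;
        Y = y;
        Z = z;
    }

    // All three coordinates participate in the comparison.
    public override bool Equals(object obj)
    {
        if (ReferenceEquals(this, obj))
        {
            return true;
        }
        Point3D other = obj as Point3D;
        if (other == null)
        {
            return false;
        }
        return X.Equals(other.X) && Y.Equals(other.Y) && Z.Equals(other.Z);
    }

    // Exactly the fields used in Equals() participate in the hash code.
    public override int GetHashCode()
    {
        const int Prime = 83;
        unchecked   // overflow of int is intentionally ignored
        {
            int result = 1;
            result = result * Prime + X.GetHashCode();
            result = result * Prime + Y.GetHashCode();
            result = result * Prime + Z.GetHashCode();
            return result;
        }
    }
}
```

Two points with equal coordinates are equal and produce the same hash code, so either of them can be used to find a value stored under the other in a dictionary.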

This implementation is incomparably better than returning a constant, just one of the fields, or their sum. One of the most important things we have learned so far is that a class must properly implement GetHashCode() and Equals() if its instances are to be used as keys in a dictionary. But what should we do if we want to use a class that we cannot inherit or change?
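One option that .NET offers for this situation is to pass an IEqualityComparer<T> to the dictionary constructor, so that the comparison and hashing logic lives outside the key class. The sketch below assumes a hypothetical class LegacyPoint that we are not allowed to modify; LegacyPointComparer is likewise a name invented for this example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical class that we can neither change nor inherit from.
public sealed class LegacyPoint
{
    public int X;
    public int Y;
}

// External definition of equality and hashing for LegacyPoint.
public sealed class LegacyPointComparer : IEqualityComparer<LegacyPoint>
{
    public bool Equals(LegacyPoint a, LegacyPoint b)
    {
        if (object.ReferenceEquals(a, b))
        {
            return true;
        }
        if (a == null || b == null)
        {
            return false;
        }
        return a.X == b.X && a.Y == b.Y;
    }

    public int GetHashCode(LegacyPoint point)
    {
        unchecked   // overflow of int is intentionally ignored
        {
            return point.X * 83 + point.Y;
        }
    }
}
```

A dictionary created with new Dictionary<LegacyPoint, string>(new LegacyPointComparer()) then treats two LegacyPoint objects with the same coordinates as the same key.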

Resolving the Collision Problem

Implementation of a Dictionary with Hash-Table and Chaining

Two properties are defined: one to access the key (Key) and another to access the value (Value). This is a good idea because the objects kept inside the dictionary implementation will be the same as those we return from, for example, a method that retrieves all pairs in the dictionary. If the key already exists, the value is replaced with the new value and the old value is returned.

In the above interface, as well as in the previous class, we use generics (template types), through which we define the type parameters for the keys (K) and the values (V). Our interface IDictionary<K, V> is very similar to the standard .NET interface System.Collections.Generic.IDictionary<K, V>, but it is simplified and describes only the most important operations of the "dictionary" data structure. It extends IEnumerable<KeyValuePair<K, V>>, thus the dictionary can be easily traversed with a simple foreach loop.

Get(K key)

Set(K key, V value)

Set replaces an existing value or creates a new key-value pair if the key does not exist.
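The pair class and the interface described in this section could look roughly like the sketch below. Only Key, Value, Get and Set are taken from the text; Remove, Count and Clear, the behavior of Get for a missing key, and the namespace DictionarySketch are assumptions added so that the sketch forms a usable whole without clashing with the standard .NET types of the same names.

```csharp
namespace DictionarySketch
{
    using System.Collections.Generic;

    // A key together with the value stored under it.
    public class KeyValuePair<K, V>
    {
        public K Key { get; private set; }
        public V Value { get; private set; }

        public KeyValuePair(K key, V value)
        {
            Key = key;
            Value = value;
        }
    }

    // Simplified dictionary interface; it can be traversed with foreach.
    public interface IDictionary<K, V> : IEnumerable<KeyValuePair<K, V>>
    {
        // Returns the value for the key (assumption: default(V) if absent).
        V Get(K key);

        // Replaces or adds a value; returns the old value if there was one.
        V Set(K key, V value);

        // Assumed members, not confirmed by the text above.
        bool Remove(K key);
        int Count { get; }
        void Clear();
    }
}
```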

The next thing we look at is the actual implementation of the hash table with chaining. When the hash table is instantiated, the constructor initializes an array of lists that will hold our objects of type KeyValuePair<K, V>. It provides a kind of convenience for the methods that add elements and resize the hash table.

That is why we need to transfer all the elements of the old table, not by just copying the chains, but by adding all the KeyValuePair<K, V> objects again to the newly created internal table of chains. To make it possible to iterate over the hash table elements in foreach loops, we have implemented the IEnumerable<KeyValuePair<K, V>> interface, whose GetEnumerator() method returns an iterator (IEnumerator<KeyValuePair<K, V>>) over the elements of the hash table. Now let's give an example of how we can use our implementation of the hash table and its iterator.
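The ideas described above (an array of chains, expansion by re-adding every pair, and an iterator) can be put together in a compact sketch. This is not the implementation from the chapter: the class name ChainedHashTable, the default capacity and load factor, and the use of the built-in System.Collections.Generic.KeyValuePair<K, V> struct instead of our own pair class are assumptions made to keep the example short. Null keys are not supported.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public class ChainedHashTable<K, V> : IEnumerable<KeyValuePair<K, V>>
{
    private List<KeyValuePair<K, V>>[] table;   // one chain per slot
    private readonly double loadFactor;
    private int count;

    public ChainedHashTable(int capacity = 16, double loadFactor = 0.75)
    {
        table = new List<KeyValuePair<K, V>>[capacity];
        this.loadFactor = loadFactor;
    }

    public int Count { get { return count; } }
    public int Capacity { get { return table.Length; } }

    // Clears the sign bit so that the index is never negative.
    private static int IndexFor(K key, int length)
    {
        return (key.GetHashCode() & 0x7FFFFFFF) % length;
    }

    // Returns the value for the key, or default(V) if the key is absent.
    public V Get(K key)
    {
        List<KeyValuePair<K, V>> chain = table[IndexFor(key, table.Length)];
        if (chain != null)
            foreach (KeyValuePair<K, V> pair in chain)
                if (pair.Key.Equals(key)) return pair.Value;
        return default(V);
    }

    // Replaces or adds a value; returns the old value or default(V).
    public V Set(K key, V value)
    {
        if (count >= table.Length * loadFactor) Expand();
        int index = IndexFor(key, table.Length);
        if (table[index] == null) table[index] = new List<KeyValuePair<K, V>>();
        List<KeyValuePair<K, V>> chain = table[index];
        for (int i = 0; i < chain.Count; i++)
        {
            if (chain[i].Key.Equals(key))
            {
                V old = chain[i].Value;
                chain[i] = new KeyValuePair<K, V>(key, value);
                return old;
            }
        }
        chain.Add(new KeyValuePair<K, V>(key, value));
        count++;
        return default(V);
    }

    // Doubles the capacity and re-adds every pair, because the index
    // of a key depends on the current capacity.
    private void Expand()
    {
        List<KeyValuePair<K, V>>[] old = table;
        table = new List<KeyValuePair<K, V>>[old.Length * 2];
        foreach (List<KeyValuePair<K, V>> chain in old)
        {
            if (chain == null) continue;
            foreach (KeyValuePair<K, V> pair in chain)
            {
                int index = IndexFor(pair.Key, table.Length);
                if (table[index] == null) table[index] = new List<KeyValuePair<K, V>>();
                table[index].Add(pair);
            }
        }
    }

    public IEnumerator<KeyValuePair<K, V>> GetEnumerator()
    {
        foreach (List<KeyValuePair<K, V>> chain in table)
            if (chain != null)
                foreach (KeyValuePair<K, V> pair in chain)
                    yield return pair;
    }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}
```

In this sketch the load check runs before the key is looked up, so overwriting an existing key can occasionally trigger an expansion slightly earlier than strictly necessary.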

We want to test that the hash table handles collisions and expansion properly, so when creating it we deliberately set the initial capacity to 3 and the load factor to 0.9, which ensures that the table will be resized soon after we put some elements into it. First we put an element, read it, overwrite its value and read it again. Then we add a new element that causes a collision, read it, and read the first element again. After that we add an element that causes the hash table to expand its internal array, and so on.

Open Addressing Methods for Collision Resolution

For a hash table whose capacity is a power of two (2^n), it is best to choose c1 and c2 equal to 0.5. This method is better than linear and quadratic probing because every subsequent probe depends on the value of the key and not on the position within the hash table. This makes sense because the position of a given key depends on the current capacity of the hash table.
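The choice c1 = c2 = 0.5 for quadratic probing can be checked with a small sketch. It assumes the common form of the probe function, (hash + c1*i + c2*i*i) mod capacity, which for c1 = c2 = 0.5 simplifies to hash + i(i+1)/2; the class and method names are hypothetical, and the hash value is assumed to be non-negative.

```csharp
using System;

// Hypothetical helper illustrating quadratic probing with c1 = c2 = 0.5.
public static class QuadraticProbing
{
    // Cell visited on the given attempt (0, 1, 2, ...).
    public static int Probe(int hash, int attempt, int capacity)
    {
        // 0.5 * i + 0.5 * i * i is the i-th triangular number i * (i + 1) / 2.
        long offset = (long)attempt * (attempt + 1) / 2;
        return (int)((hash + offset) % capacity);
    }

    // All cells visited during the first 'capacity' attempts.
    public static int[] ProbeSequence(int hash, int capacity)
    {
        var sequence = new int[capacity];
        for (int i = 0; i < capacity; i++)
        {
            sequence[i] = Probe(hash, i, capacity);
        }
        return sequence;
    }
}
```

For a capacity of 8 and an initial hash of 0, the sequence is 0, 1, 3, 6, 2, 7, 5, 4: every cell is visited exactly once, which is why these constants suit capacities that are powers of two.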

The main idea of this method is to use two hash functions instead of one. In the latter case, the entire hash table is rebuilt with a larger size and new hash functions. If at least one of the two hash functions points to a free cell, there is no problem.

Suppose that both hash functions point to occupied cells and we have randomly chosen one of them. Using three different hash functions instead of two can raise the practical upper bound on the load factor to above 0.9.

The "Set" Data Structure

Let's assume that the cell from which the new element tries to "kick out" an element is the cell containing H. In its simplest version, this method provides constant-time access to its elements even in the worst case, but only under the condition that the load factor is less than 0.5. Nevertheless, this method is still not widely adopted in the industry and is not used internally in the .NET Framework.

The main obstacle is the need for two hash functions, which means that the System.Object class would need to introduce two GetHashCode() methods.
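The "kicking out" mechanism with two hash functions can be sketched for a set of int values. This is a simplified illustration under several assumptions: the class name CuckooSet, both hash functions, the limit of 32 displacements, and the rebuild strategy (doubling the capacity while keeping the same two functions, rather than choosing new ones) are all choices made for this example only.

```csharp
using System;

// Hypothetical sketch of a set of ints with two candidate cells per key.
public class CuckooSet
{
    private const int MaxKicks = 32;   // assumed limit before rebuilding
    private int?[] cells;
    private int count;

    public CuckooSet(int capacity = 16) { cells = new int?[capacity]; }

    public int Count { get { return count; } }

    private int Hash1(int key) { return (key & 0x7FFFFFFF) % cells.Length; }

    private int Hash2(int key)
    {
        unchecked
        {
            uint mixed = (uint)key * 2654435761u;   // multiplicative mixing
            return (int)(mixed >> 16) % cells.Length;
        }
    }

    // A key can only be in one of its two cells: constant-time lookup.
    public bool Contains(int key)
    {
        return cells[Hash1(key)] == key || cells[Hash2(key)] == key;
    }

    public void Add(int key)
    {
        if (Contains(key)) return;
        if ((count + 1) * 2 >= cells.Length) Rebuild();   // load factor < 0.5
        Place(key);
        count++;
    }

    private void Place(int key)
    {
        int h1 = Hash1(key), h2 = Hash2(key);
        if (cells[h1] == null) { cells[h1] = key; return; }
        if (cells[h2] == null) { cells[h2] = key; return; }
        int current = key, index = h1;
        for (int kick = 0; kick < MaxKicks; kick++)
        {
            // Kick out the occupant and put the current key in its place.
            int evicted = cells[index].Value;
            cells[index] = current;
            current = evicted;
            // The evicted key moves to its other candidate cell.
            index = Hash1(current) == index ? Hash2(current) : Hash1(current);
            if (cells[index] == null) { cells[index] = current; return; }
        }
        // Probably a cycle: rebuild with a larger size and try again.
        Rebuild();
        Place(current);
    }

    private void Rebuild()
    {
        int?[] old = cells;
        cells = new int?[old.Length * 2];
        foreach (int? value in old)
            if (value.HasValue) Place(value.Value);
    }
}
```

A lookup inspects at most two cells regardless of how many elements are stored; the cost of that guarantee is paid on insertion, where displacements and occasional rebuilds may occur.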

The Abstract Data Structure "Set"

The elements of the set cannot be accessed by any key, as is the case with dictionaries. This is why, in practice, we access all the elements of a given set at once while iterating, using the foreach loop construct.

Set Implementations in .NET Framework

As we have already mentioned, the hash-table implementation of a set in .NET is the HashSet<T> class. This is because, as we have already explained, an element can be present at most once in a given set. The standard .NET class SortedSet<T> is a set implemented by a balanced search tree (red-black tree).

We recall that in .NET this usually means that the objects are instances of a class that implements IComparable<T>.

Brad Pitt likes these bands: Accept, Blind Guardian, Dio, Dream Theater, Iron Maiden, Judas Priest, Kiss, Kreator, Manowar, Megadeth.
Angelina Jolie likes these bands: Accept, Dio, Grave Digger, Iron Maiden, Manowar, Megadeth, Metallica, Running Wild, Slayer.
Does Brad Pitt like Angelina Jolie.

All the bands that Brad Pitt or Angelina Jolie like: Accept, Blind Guardian, Dio, Dream Theater, Grave Digger, Iron Maiden, Judas Priest, Kiss, Kreator, Manowar, Megadeth, Metallica, Running Wild, Slayer. As we can see, the elements of a SortedSet<T> are always kept in sorted order, unlike those of a HashSet<T>.
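Output like the one above can be produced with the set operations of SortedSet<T>. The sketch below is not the program from the chapter: the class BandsDemo and its methods are hypothetical, and the ordinal string comparer is an assumption made so that the ordering does not depend on the current culture.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical demo of union and intersection with SortedSet<string>.
public static class BandsDemo
{
    public static SortedSet<string> BradPittBands()
    {
        return new SortedSet<string>(StringComparer.Ordinal)
        {
            "Manowar", "Blind Guardian", "Dio", "Kiss", "Dream Theater",
            "Megadeth", "Judas Priest", "Kreator", "Iron Maiden", "Accept"
        };
    }

    public static SortedSet<string> AngelinaJolieBands()
    {
        return new SortedSet<string>(StringComparer.Ordinal)
        {
            "Iron Maiden", "Dio", "Accept", "Slayer", "Megadeth",
            "Running Wild", "Grave Digger", "Metallica", "Manowar"
        };
    }

    // Bands liked by at least one of the two.
    public static SortedSet<string> Union(SortedSet<string> a, SortedSet<string> b)
    {
        var result = new SortedSet<string>(a, StringComparer.Ordinal);
        result.UnionWith(b);
        return result;
    }

    // Bands liked by both.
    public static SortedSet<string> Intersection(SortedSet<string> a, SortedSet<string> b)
    {
        var result = new SortedSet<string>(a, StringComparer.Ordinal);
        result.IntersectWith(b);
        return result;
    }
}
```

Although the bands are added in arbitrary order, iterating over any of these sets, for example with string.Join(", ", set), yields them in ascending order.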

Exercises

Define a TreeMultiSet class that keeps its elements in ascending order and allows duplicate elements. Implement the operations: adding an element, finding the number of occurrences of an element, deleting an element, an iterator, finding the min / max element, and deleting the min / max element. Write a program using the HashSet class that returns the number of buses that have arrived and departed within a given interval (start, end).

We call a "lucky sub-sequence within P" every sub-sequence of integers of P with a sum equal to N. Imagine we have a sequence S containing all the lucky sub-sequences of P, kept in descending order by their length.

Solutions and Guidelines