Universal Hashing

7 min readJul 10, 2022

Ever heard about hashing? Hashing is one of the most important data structure in computer science. So in this article I would like to cover universal hashing in detail.
Also, If you are reading this article you should be having some sort of a previous exposure with hashing as well.

Introduction

Hashing is the process of converting a given key into another value. Here we use hash function to do this transformation. This hash function can be either hash code map, compression map or combination of together.

Generating a hash value for some key associated with some data or data record will improve the efficiency of insert, delete and search. Generally by using hash we can access data in constant time which is the complexity of O(1).

Hash function has 2 fundamental properties and their performance will be depended on those,

1. The hash function must be computable in constant time (i.e., independent of the size of the hash table)

2. The hash function must distribute its items uniformly among the array slots.

Now let us see in details how this complexity is depending on the hash function that is used to generate the hash value.

Suppose the hash function type compression map. Then it generates an output hash the finite range for any valid input integer key in the infinite range. This may possibly cause to generate same hash values for some distinct input keys. This phenomenon is known as collision.

Let h(x) be a hash function. Let U = {0, 1, 2, …, m-1} be the universe of keys.
And let x, y be the members of U where x != y.
Then, if h(x) = h(y) then its collision.

We can’t avoid collision. But we can reduce it. Its again depend on the corresponding hash function that perform the transformation.

So, as we can see if we choose single hash function it may cause collision for some keys. So, such collisions are resolved using collision resolving techniques (CRT) such as separate chaining, linear probing, quadratic probing and double hashing. So those CRTs may spread and store the collided hash in consecutive locations at the best case or very distant locations at the worst case. Hence the best case searching will be O(1), worst case searching will be O(n) likewise average case is O(n) as well, which contradicts the main goal of hashing. (because in hashing we expect to have constant time search.)

So, we introduce new hashing technique known as universal hashing.

Universal hashing

In Universal hashing we avoid the scenario presented above. In universal hashing the hash function is selected randomly from a carefully designed family of functions. This guarantees no single input key would result the worst-case scenario mentioned above. Also, since the hash function choose randomly for each key the transformation and resulting hash values are different.

In short universal hashing use randomization to avoid linear complexity at the worst case for any arbitrary values taken from the key universe.

Definition 1

A family H of hash functions is universal, if for any x != y, the number of hash functions h in H for which h(x) = h(y) is at most |H| / M where M is the hash table size.

This means that if we chose a hash function h randomly form the family of functions H then the collision between any two distinct keys x and y is at most the probability of 1/M.

Which means,

So, this guarantees that we have hash functions that perform searching in constant time which is O(1).

Definition 2

A family H of hash functions is k-universal, if for any x1 != y1, x2 != y2, …, xk != yk and the number of hash functions h in H for which h(x1) = h(y1), h(x2) = h(y2), …, h(xk) = h(yk) is at most |H| / M^k

This means, if we chose a hash function h randomly from the family of functions H then the collision between all the distinct keys x1, x2, …., xk, y1, y2, …, yk is at most the probability of 1 / M ^k.

Which means,

So, again this guarantees that we have hash functions for each distinct elements that perform searching in constant time which is O(1).

Family of hash functions

Now let’s design a simple family of hash functions. In order to do that let’s general compression map H(x) = hk(x) mod M where M is the hash table size.

Now let hk(x) = (ax + b) mod p where p is a prime and a, b are integers such that 1 <= a <= p-1 and 0 <= b <= p-1

Then let the family of hash function is defined as,

H = {Ha,b(x) = ((ax + b) mod p) mod M where <= a <= p-1 and 0 <= b <= p-1}

For example, in this family of hash function H can be having following random functions,

H3,8(x) = ((3x + 8) mod p) mod M

H10, 0(x) = ((10x) mod p) mod M

H1, 1(x) = ((x + 1) mod p) mod M

Likewise there are total (p — 1)p random functions for this family of hash functions H.

Theorem

The hash family H = {Ha,b(x) = ((ax + b) mod p) mod M where <= a <= p-1 and 0 <= b <= p-1} is universal.

Proof :-

Assume x, y be distinct values(keys) with x > y such that Ha,b(x) = Ha,b(y)

Which means both x, y yields the same hash value. Which means a collision

From the definition of family of hash functions,

Ha,b(x) = ((ax + b) mod p) mod M

Then, ((ax + b) mod p) mod M = ((ay + b) mod p) mod M ; By the right cancelation rule

(ax + b) mod p = (ay + b) mod p

Then by subtracting,

(ax + b) — (ay + b) mod p = 0 mod p

a(x — y) = 0 mod p

This means, p | a(x — y)

Hence p | a or p | (x — y)

But p is a prime and 1 <= a <= p -1 hence this is a contradiction.

Hence p | a can’t be true.

Also, x > y then x — y > 1 also x — y < p — 1 hence this is a contradiction.

Hence p | x — y can’t be true

Therefore, the very first assumption must be false. Hence proof by the contradiction

Ha,b(x) != Ha,b(y)

Hence for any x belongs to Integer Ha,b(x) yields a unique hash.

Hence H is a universal hash family. Proof completed.

For example, this family of hash function H is universal.

H3,8(x) = ((3x + 8) mod p) mod M

H10, 0(x) = ((10x) mod p) mod M

H1, 1(x) = ((x + 1) mod p) mod M

Which means each function is unique and yields a unique hash values.

Note that a, b are the terms which adds the randomness and mod M is a compression map.

Constructing a universal family of hash functions

Step 1 — Condition

Select m as a prime number

Step 2 — Pre-processing part

Make a tuple of keys such that k = <k1, k2, .. , kn> where ki belongs to {0, 1, …, m-1}

Step 3 — The randomness

Add a tuple a = <a1, a2, … an> where ai belongs to {0, 1, …, m-1}

Step 4 — Generate hash function

Step 5 — Generate the universal hash family

For each ai != aj where i != j there exist a universal hash family such that,

Example — Hashing IPv4 Addresses

Note that each IPv4 address consist 32bits group as octet by octet (octet is 8 bits)

Hence IP address range from,

00000000.00000000.00000000.00000000 to 11111111.11111111.11111111.11111111

Then this can be simplified for decimals

0.0.0.0 to 255.255.255.255

Then let any arbitrary IP address be x1.x2.x3.x4

Then,

Step 1: Let m = 997 be a prime

Step 2: Let x = <x1, x2, x3, x4> where 0 <= x1, x2, x3, x4 <= 255

Step 3: Let a = <a1, a2, a3, a4> where 0 <= a1, a2, a3, a4 <= 997–1

Step 4: then a hash function is,

Step 5: Then for each distinct a, The set of

yields the universal family of hash functions.

Alright. So that about universal hashing. And if you get some knowledge from my article give a clap and share with your friends as well.