Hash Table(散列表)

Hashing

SIR (Search, Insert and Remove) in $O(1)$ time

Efficient implementations of the ORDERED DICTIONARY:

|                  | Search(S, k)               | Insert(S, x)               | Remove(S, x)               |
| ---------------- | -------------------------- | -------------------------- | -------------------------- |
| BinarySearchTree | $O(h)$ in worst case       | $O(h)$ in worst case       | $O(h)$ in worst case       |
| Treap            | $O(\log n)$ in expectation | $O(\log n)$ in expectation | $O(\log n)$ in expectation |
| RB-Tree          | $O(\log n)$ in worst case  | $O(\log n)$ in worst case  | $O(\log n)$ in worst case  |
| SkipList         | $O(\log n)$ in expectation | $O(\log n)$ in expectation | $O(\log n)$ in expectation |

If we only care about the Search, Insert and Remove operations, can we be faster?

Assume keys are distinct integers from the universe $U = \left\lbrace 0, 1, \cdots, m-1 \right\rbrace$. Just allocate an array of size $m = |U|$, and Search, Insert and Remove can all be done in $O(1)$ time.
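As a concrete illustration, here is a minimal direct-address table in Python (the class and the `x.key` convention are illustrative, mirroring the pseudocode style used later):

```python
# A direct-address table: one slot per possible key in U = {0, 1, ..., m-1}.
class DirectAddressTable:
    def __init__(self, m: int):
        self.slots = [None] * m      # T[0..m-1], all initially empty

    def search(self, k: int):        # O(1)
        return self.slots[k]

    def insert(self, x):             # x is an item with integer field x.key
        self.slots[x.key] = x        # O(1)

    def remove(self, x):
        self.slots[x.key] = None     # O(1)
```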

Problem: What if keys are not integers, e.g., strings?

The real problem is that the universe $U$ can be very large; sometimes the space complexity is unacceptable.

Hash function

We are given a huge universe $U$ of possible keys but a much smaller number $n$ of actual keys, and we only want to spend $m \approx n$ (i.e., $m \ll |U|$) space while still supporting very fast SIR.

Hash function(散列函数/哈希函数) $h\colon U \to [m]$ maps keys from the universe $U$ to buckets of a hash table $T[0 \cdots m-1]$.

Computing the hash function should take $O(1)$ time for every key.

And we want to avoid collisions. That is, we want $h$ to map distinct keys to distinct indices.

Two distinct keys $k_1, k_2$ collide(冲突) if $h(k_1) = h(k_2)$.

However, this is impossible since $m \ll |U|$ (pigeonhole principle). Therefore collisions are unavoidable, and we have to cope with them.

Hashing with Chaining

Chaining(链接法) is a simple way to resolve collisions. We store all the keys that hash to the same bucket in a LINKEDLIST.

Each bucket $i$ stores a pointer to a LINKEDLIST $L_i$, and all keys that are hashed to index $i$ go to $L_i$.

Space Cost:

  • $\Theta(m)$ for pointers;
  • $\Theta(n)$ for actual elements.

Operations (a Python sketch follows the list):

  • Search(k): where k is a key.
    • Compute h(k) and go through the corresponding list to search for the item with key k.
  • Insert(x): where x is a pointer to an item.
    • Compute h(x.key) and insert x at the head of the corresponding list.
  • Remove(x): where x is a pointer to an item.
    • Simply unlink x from its LINKEDLIST ($O(1)$ given a doubly linked list).
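Here is a minimal Python sketch of these operations, with the simplifying assumptions that keys are integers and that Python lists stand in for the LINKEDLISTs:

```python
class ChainedHashTable:
    def __init__(self, m: int):
        self.m = m
        self.buckets = [[] for _ in range(m)]    # chains L_0, ..., L_{m-1}

    def _h(self, k: int) -> int:
        return k % self.m                        # placeholder hash function

    def search(self, k: int):
        for item in self.buckets[self._h(k)]:    # walk the chain
            if item == k:
                return item
        return None

    def insert(self, k: int):
        self.buckets[self._h(k)].insert(0, k)    # insert at the head

    def remove(self, k: int):
        # scans the chain here; O(1) if we held a doubly-linked-list node
        self.buckets[self._h(k)].remove(k)
```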

Search can cost $\Theta(n)$ in the worst case, if all keys hash to the same bucket. In that case the hash table with chaining degenerates into a single LINKEDLIST.

Performance

Simple Uniform Hashing Assumption (SUH)

  • Every key is equally likely to map to every bucket.
  • Keys are mapped independently.

Each key goes to a randomly chosen bucket, so if there are enough buckets, no bucket will hold too many keys.

Consider a hash table containing $m$ buckets and storing $n$ keys.

Define the load factor(负载因子) $\alpha = \dfrac{n}{m}$.

Intuitively, Search will cost $O(1 + \alpha)$ time on average: $O(1)$ for computing the hash value and $O(\alpha)$ for traversing the LINKEDLIST.

The expected cost of an unsuccessful search is $\Theta(1 + \alpha)$: computing the hash value, plus traversing the entire LINKEDLIST in a bucket.

The expected cost of a successful search is $\Theta(1 + \alpha)$ too: computing the hash value, plus traversing the LINKEDLIST in a bucket until the key is found.

Let $C_i$ be the cost of finding the $i$-th inserted element $x_i$. We want to compute $\dfrac{1}{n} \displaystyle\sum_{i=1}^{n} \mathbb{E}[C_i]$.

Let $X_{ij}$ be an indicator random variable (IRV) taking value 1 if and only if $h(x_i.\mathrm{key}) = h(x_j.\mathrm{key})$. Since we insert at the head, a successful search for $x_i$ examines $x_i$ itself plus every later-inserted element in the same bucket, so $C_i = 1 + \sum_{j=i+1}^{n} X_{ij}$.

$$
\begin{aligned}
\dfrac{1}{n} \sum_{i=1}^{n} \mathbb{E}[C_i] &= \dfrac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left[ 1 + \sum_{j=i+1}^{n}X_{ij} \right] \\
&= \dfrac{1}{n} \sum_{i=1}^{n} \left( 1 + \sum_{j=i+1}^{n} \mathbb{E}[X_{ij}] \right)\\
&= \dfrac{1}{n} \sum_{i=1}^{n} \left( 1 + \sum_{j=i+1}^{n}\dfrac{1}{m} \right)\\
&= \dfrac{1}{n} \sum_{i=1}^{n} \left( 1 + \dfrac{n-i}{m} \right)\\
&= \dfrac{1}{n} \left(n + \dfrac{n(n-1)}{2m}\right)\\
&= 1 + \dfrac{\alpha}{2} - \dfrac{\alpha}{2n}\\
&= \Theta(1 + \alpha)
\end{aligned}
$$

So, what is the expected maximum cost of Search, i.e., the expected length of the longest LINKEDLIST? This is equivalent to the Max-Load problem (the balls-into-bins problem).

If $m = \Theta(n)$, the answer is $\Theta\left(\dfrac{\log n}{\log \log n}\right)$.

Here's a hand-wavy proof on Stack Overflow: Maximum load when placing N balls in N bins.

However, SUH often doesn't hold: keys are not that random (they usually have patterns), and patterns in keys can induce patterns in hash values.

And once $h$ is fixed and known, one can find a set of bad keys that all hash to the same value.

Designs of Hash Functions

Some bad hash functions

Assume keys are English words.

One bucket for each letter (i.e., 26 buckets). The hash function $h(w)$ is the first letter of the word $w$. This is not uniform, since words starting with 'a' are more common than words starting with 'z'.

One bucket for each number in $[26 \times 50]$. Another hash function $h(w)$ is the sum of the indices of the letters in $w$. This is not uniform, since most words are short, so small sums are far more common.

The Division Method

This is a common technique when designing hash functions.

Hash function

$$h(k) = k \bmod m$$

Two keys $k_1, k_2$ collide if $k_1 \equiv k_2 \pmod m$.

The key is to pick an appropriate $m$.

Say we want to store $n$ keys.

Here's an idea: let $r = \left\lceil \lg n \right\rceil$ and set $m = 2^r$. Then computing $h(k)$ is very fast: `h(k) = k - ((k >> r) << r)`, or equivalently `k & (m - 1)`.

But this uses only the rightmost $r$ bits of the input key. For example, if all input keys are even, we use at most half of the buckets.

In general, we want $m$ to be a prime number.

Assume $m$ is a composite number, and key $k$ and $m$ have a common divisor $d$. Then $h(k)$ is also divisible by $d$, since $(k \bmod m) + \left\lfloor \dfrac{k}{m} \right\rfloor \cdot m = k$. If all input keys are divisible by $d$, we use at most a $\dfrac{1}{d}$ fraction of the buckets; e.g., if $m = 12$ and every key is a multiple of 3, only buckets $0, 3, 6, 9$ are ever used.

Rule of thumb(经验法则)

Pick $m$ to be a prime number not too close to a power of $2$ (or $10$).
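As a sketch, the division method for string keys might look as follows (the radix-128 encoding and the prime $m = 701$ are illustrative choices, not fixed by the method):

```python
M = 701   # a prime not too close to a power of 2 (or 10)

def string_to_int(s: str) -> int:
    # interpret the string as an integer written in radix 128 (assumes ASCII)
    k = 0
    for ch in s:
        k = k * 128 + ord(ch)
    return k

def h(s: str) -> int:
    return string_to_int(s) % M
```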

The Multiplication Method

Here's another common technique.

Assume the key length is at most $w$ bits. Fix the table size $m = 2^r$ for some $r \le w$, and fix a constant $0 < A < 2^w$.

Hash function

$$h(k) = (Ak \bmod 2^w) \gg (w-r)$$

This is faster than the division method, since multiplication and bit-shifting are faster than division.
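A minimal sketch, assuming $w = 32$-bit keys, $r = 14$, and Knuth's golden-ratio-based constant for $A$ (all three are illustrative choices):

```python
w, r = 32, 14            # key width in bits; table size m = 2^r = 16384
A = 2654435769           # about 2^32 * (sqrt(5) - 1) / 2, a popular choice

def h(k: int) -> int:
    # keep the low w bits of A*k (i.e., mod 2^w), then take the top r of them
    return ((A * k) & ((1 << w) - 1)) >> (w - r)
```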

Universal Hashing(全域散列)

However, once the hash function $h$ is fixed and known, there must exist a set of bad keys that all hash to the same value. Such adversarial input will result in poor performance.

The solution is to use randomization.

Pick a random hash function $h$ when the hash table is first built. Once chosen, $h$ is fixed throughout the entire execution. Since $h$ is randomly chosen, no input is always bad.

A collection of hash functions $\mathscr{H}$ is universal if for any distinct keys $x \ne y$, at most $\dfrac{|\mathscr{H}|}{m}$ hash functions in $\mathscr{H}$ lead to $h(x) = h(y)$. Therefore $\Pr\limits_{h \in \mathscr{H}}[h(x) = h(y)] \le \dfrac{1}{m}$ for all $x \ne y$.

Source of uncertainty:

  • SUH: randomness of input.
  • Universal Hashing: choice of function $h$ (and potentially randomness of input).

Performance of hashing with chaining

Let $L_{h(k)}$ be the length of the list at index $h(k)$; we want to compute $\mathbb{E}[L_{h(k)}]$.

Claims:

  1. If key $k$ is not in table $T$, then $\mathbb{E}[L_{h(k)}] \le \alpha$.
    • For any key $l$, define an IRV $X_{kl}$ taking value 1 if and only if $h(k) = h(l)$. Then we have

    $$
    \begin{aligned}
    \mathbb{E}[L_{h(k)}] &= \mathbb{E}\left[ \sum_{l \in T, l \ne k} X_{kl} \right] \\
    &= \sum_{l \in T, l \ne k} \mathbb{E}[X_{kl}] \\
    &\le n \cdot \dfrac{1}{m}\\
    &= \alpha
    \end{aligned}
    $$

  2. If key $k$ is in table $T$, then $\mathbb{E}[L_{h(k)}] \le 1 + \alpha$.
    • We have

    $$
    \begin{aligned}
    \mathbb{E}[L_{h(k)}] &= 1 + \mathbb{E}\left[ \sum_{l \in T, l \ne k} X_{kl} \right] \\
    &\le 1 + (n-1) \cdot \dfrac{1}{m}\\
    &< 1 + \alpha
    \end{aligned}
    $$

If the hash table is not overloaded, i.e., $\alpha = O(1)$, SIR can be done in $O(1)$ expected time.

Typical Universal Hash Family

Proposed by Carter and Wegman in 1977.

Find a prime $p$ larger than the maximum possible key value.

Let $\Z_p = \left\lbrace 0, 1, \dots, p - 1 \right\rbrace$ and $\Z_p^{*} = \left\lbrace 1, 2, \dots, p - 1 \right\rbrace$.

Define $h_{ab}(k) = ((ak + b) \bmod p) \bmod m$; then

$$\mathscr{H}_{pm} = \left\lbrace h_{ab} \mid a \in \Z_p^{*}, b \in \Z_p \right\rbrace$$

is a universal hash family.
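A sketch of drawing a random member of $\mathscr{H}_{pm}$ at table-construction time (the prime $p = 2^{31} - 1$ and $m = 701$ are illustrative; $p$ just has to exceed the largest possible key):

```python
import random

p = 2**31 - 1    # a Mersenne prime; assumes every key is smaller than p
m = 701          # number of buckets

def random_hash():
    """Pick h_ab uniformly at random from the family H_pm."""
    a = random.randint(1, p - 1)      # a in Z_p^*
    b = random.randint(0, p - 1)      # b in Z_p
    return lambda k: ((a * k + b) % p) % m

h = random_hash()    # chosen once, then fixed for the table's lifetime
```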

We want to prove that for all $k \ne l$, where $k, l \in \Z_p$, $\Pr\limits_{h \in \mathscr{H}_{pm}}[h(k) = h(l)] \le \dfrac{1}{m}$.

Let $r = (ak + b) \bmod p$ and $s = (al + b) \bmod p$.

Claim 1

$r \ne s$

Proof.

We have $r - s \equiv a(k-l) \pmod p$. Since $p$ is prime, $\Z_p$ has no zero divisors; as $a \not\equiv 0 \pmod p$ and $k - l \not\equiv 0 \pmod p$, it follows that $a(k-l) \not\equiv 0 \pmod p$.

So $r - s \not\equiv 0 \pmod p$, i.e., $r \ne s$.

Claim 2

Fix $k, l$; there is a 1-to-1 mapping between $(a, b)$ pairs and $(r, s)$ pairs.

Proof.

This involves the concept of the modular multiplicative inverse; see the Discrete Mathematics notes for a refresher.

Recall $r - s \equiv a(k-l) \pmod p$; then we get $a = (r-s)(k-l)^{-1} \bmod p$ and $b = (r - ak) \bmod p$. Here $(k-l)^{-1}$ is the modular multiplicative inverse of $k-l$, which exists and is unique since $p$ is prime.

$$
\begin{aligned}
r - s \equiv a(k-l) \pmod p &\iff a \equiv (r-s)(k-l)^{-1} \pmod p \\
&\iff a = (r-s)(k-l)^{-1} \bmod p
\end{aligned}
$$

Then we get a unique $(a, b)$ given distinct $(r, s)$.

There are $p(p-1)$ pairs of distinct $(r, s)$ and $p(p-1)$ pairs of $(a, b)$. As a result, there is a 1-to-1 mapping between $(a, b)$ pairs and $(r, s)$ pairs.

Thus, for any given pair of distinct inputs $k, l$, if we pick $(a, b)$ uniformly at random from $\Z_p^{*} \times \Z_p$, the resulting pair $(r, s)$ is equally likely to be any pair of distinct values modulo $p$.

Therefore, the probability that distinct keys $k, l$ collide equals the probability that $r \equiv s \pmod m$.

Lemma

For a fixed $r$, the number of values $s \ne r$ in $\Z_p$ with $s \equiv r \pmod m$ is at most $\left\lceil \frac{p}{m} \right\rceil - 1$. Hence

$$
\begin{aligned}
\Pr_{h \in \mathscr{H}_{pm}}[h(k) = h(l)] &= \Pr\left[r \equiv s \pmod m \mid (r, s) \text{ uniformly random distinct pair in } \Z_p\right] \\
&\le \dfrac{\left\lceil \frac{p}{m} \right\rceil - 1}{p - 1}\\
&\le \dfrac{\frac{p+m-1}{m}-1}{p-1}\\
&= \dfrac{1}{m}
\end{aligned}
$$

Open Addressing

See also the Probability Theory notes.

Here's another way to resolve collisions besides chaining.

Basic idea: On collision, probe a sequence of buckets until an empty one is found.

We redefine $h\colon U \times [m] \to [m]$, where the first $[m]$ is the probe number and the second is the table index.

```
HashInsert(T, k):
    i := 0
    repeat
        j := h(k, i)
        if T[j] == NULL or T[j] == DEL
            T[j] := k
            return j
        else i := i + 1
    until i == m
    error "overflow"
```
```
HashSearch(T, k):
    i := 0
    repeat
        j := h(k, i)
        if T[j] == k
            return j
        else i := i + 1
    until i == m or T[j] == NULL
    return NULL
```
```
HashRemove(T, k):
    pos := HashSearch(T, k)
    if pos != NULL
        T[pos] := DEL
    return pos
```

Why the DEL mark? Without it, removing a key by simply resetting its slot to NULL would break Search: any probe sequence that previously passed through that slot now stops early at the NULL and misses keys inserted after it. With the DEL mark, Search skips over deleted slots, while Insert can still reuse them.
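A runnable Python rendering of the same logic (a sketch under assumed conventions: `None` plays NULL, a `"DEL"` string plays DEL, and linear probing from the next subsection serves as the probe function):

```python
NULL, DEL = None, "DEL"    # sentinels for empty and deleted slots

class OpenAddressingTable:
    def __init__(self, m: int):
        self.m = m
        self.T = [NULL] * m

    def _h(self, k: int, i: int) -> int:
        return (k % self.m + i) % self.m   # linear probing, h'(k) = k mod m

    def insert(self, k: int) -> int:
        for i in range(self.m):
            j = self._h(k, i)
            if self.T[j] is NULL or self.T[j] == DEL:   # empty or deleted
                self.T[j] = k
                return j
        raise OverflowError("hash table overflow")

    def search(self, k: int):
        for i in range(self.m):
            j = self._h(k, i)
            if self.T[j] == k:
                return j
            if self.T[j] is NULL:   # a never-used slot ends the sequence
                return None
        return None

    def remove(self, k: int):
        pos = self.search(k)
        if pos is not None:
            self.T[pos] = DEL       # mark, don't reset to NULL
        return pos
```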

Linear Probing

$$h(k, i) = (h'(k) + i) \bmod m$$

where $h'$ is an auxiliary hash function.

Since the initial probe position determines the entire probe sequence, only $m$ distinct probe sequences are used with linear probing.

Another problem with linear probing is Clustering:

  • An empty slot right after a "cluster" has a higher chance of being chosen.
  • Clusters therefore grow larger and larger.
  • Clusters lead to higher search times in theory.

The remove mechanism (i.e., the DEL mark) causes an "anti-clustering" effect, improving the performance of linear-probing hash tables.

Quadratic Probing

$$h(k, i) = (h'(k) + c_1 \cdot i + c_2 \cdot i^2) \bmod m$$

where $c_1, c_2$ are constants.

Problem: (Secondary) Clustering

  • Keys having the same $h'$ value result in the same probe sequence.
  • As in linear probing, the initial probe determines the entire sequence, so only $m$ distinct probe sequences are used.

Double Hashing

$$h(k, i) = (h_1(k) + i \cdot h_2(k)) \bmod m$$

where $h_1, h_2$ are auxiliary hash functions.

Why double hashing? Observations:

  1. If $h_1$ is good, $h(k, 0)$ looks random.
  2. If $h_2$ is good, the probe sequence looks random.

Linear and quadratic probing don't give observation 2.

The value $h_2(k)$ must be relatively prime to $m$ for the entire hash table to be searched. Conveniently, let $m$ be a prime number.

Otherwise, if $m$ and $h_2(k)$ have greatest common divisor $d > 1$ for some key $k$, then a search for key $k$ would examine only $\dfrac{1}{d}$ of the hash table.

Each possible $(h_1(k), h_2(k))$ pair yields a distinct probe sequence, so double hashing can use $\Theta(m^2)$ different probe sequences.

This is still not the best possible: uniform hashing would allow any of the $m!$ permutations of the $m$ buckets as a probe sequence.
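A common concrete choice (the one in CLRS) is $h_1(k) = k \bmod m$ and $h_2(k) = 1 + (k \bmod m')$ with $m' = m - 1$: since $m$ is prime, $h_2(k) \in [1, m-1]$ is always relatively prime to $m$. A sketch with the illustrative value $m = 701$:

```python
m = 701                            # a prime table size

def h1(k: int) -> int:
    return k % m

def h2(k: int) -> int:
    return 1 + (k % (m - 1))       # in [1, m-1], hence coprime to the prime m

def h(k: int, i: int) -> int:
    return (h1(k) + i * h2(k)) % m
```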

Performance of Open Addressing

Let random variable $X$ be the number of probes made in an unsuccessful search. Then

$$\mathbb{E}[X] \le \dfrac{1}{1 - \alpha}$$

where $\alpha = \dfrac{n}{m}$ is the load factor (assuming $\alpha < 1$).

Proof.

This was already proved in the Probability Theory notes, so we omit it here.

Insight: we always make the 1st probe, make a 2nd probe with probability $\approx \alpha$, a 3rd probe with probability $\approx \alpha^2$, and so on; summing the geometric series gives $\mathbb{E}[X] \le 1 + \alpha + \alpha^2 + \cdots = \dfrac{1}{1-\alpha}$.

Let random variable $X$ be the number of probes made in a successful search. Then

$$\mathbb{E}[X] \le \dfrac{1}{\alpha} \ln \dfrac{1}{1 - \alpha}$$

Proof.

Let $N_i$ be the expected number of probes made when searching for the $i$-th inserted key.

By the previous analysis (when the $i$-th key was inserted, the table held only $i-1$ keys), we have

$$N_i \le \dfrac{1}{1 - \frac{i-1}{m}}$$

Therefore

$$
\begin{aligned}
\mathbb{E}[X] &\le \dfrac{1}{n} \sum_{i=1}^{n} N_i \\
&\le \dfrac{1}{n} \sum_{i=1}^{n} \dfrac{m}{m - (i-1)} \\
&= \dfrac{m}{n} \sum_{i=0}^{n-1} \dfrac{1}{m-i}\\
&= \dfrac{1}{\alpha} \sum_{k=m-n+1}^{m}\dfrac{1}{k}\\
&\le \dfrac{1}{\alpha} \int_{m-n}^{m} \dfrac{1}{x} \mathrm{d}x\\
&= \dfrac{1}{\alpha} \ln \dfrac{m}{m-n}\\
&= \dfrac{1}{\alpha} \ln \dfrac{1}{1 - \alpha}
\end{aligned}
$$

Chaining vs. Open-addressing

  • Good parts of Open-addressing
    • No memory allocation
      • Chaining needs to allocate list nodes
    • Better cache performance
      • The hash table is stored in one contiguous region of memory
      • Fewer memory accesses are needed to bring data into the cache
  • Bad parts of Open-addressing
    • Sensitive to the choice of hash functions
      • Clustering is a common problem
    • Sensitive to the load factor
      • Poor performance when $\alpha \approx 1$