Sorting (排序)

The Sorting Problem

The original problem is sorting $n$ numbers into ascending order.

We can actually sort a collection of any type of data, as long as a total order(全序) is defined for that type of data.

That is, for any distinct data items $a$ and $b$, we can compare them and determine either $a < b$ or $b < a$; otherwise $a = b$. Here "$<$" is a binary relation.

We can also sort partially ordered items.

Characteristics

  • In-place(原地): a sorting algorithm is in-place if only $O(1)$ extra space is needed beyond the input.
  • Stability(稳定): a sorting algorithm is stable if numbers with the same value appear in the output array in the same order as they do in the input array.

We have learned several sorting algorithms:

  • Insertion Sort: gradually increase the size of the sorted part.
    • $O(n^2)$ time
    • $O(1)$ space
    • In-place
    • Stable
  • Merge Sort: an example of divide-and-conquer.
    • $O(n\log n)$ time
    • $O(n)$ space
    • Not in-place
    • Stable
  • Heap Sort: leverages the heap data structure.
    • $O(n\log n)$ time
    • $O(1)$ space
    • In-place
    • Not stable[1]

Elementary Sorting

Selection Sort

Basic idea

Pick out the minimum element from the input, recursively sort the remaining elements, and finally concatenate the minimum element with the sorted remainder.

```
SelectionSortRec(A):
    if |A| == 1
        return A
    else
        min := Min(A)
        A' := A - {min}
        return Concatenate(min, SelectionSortRec(A'))

SelectionSort(A):
    for i := 1 to A.length
        min_idx := i
        for j := i + 1 to A.length
            if A[j] < A[min_idx]
                min_idx := j
        Swap(A[i], A[min_idx])
```

Time complexity: $\sum_{i=1}^{n-1} (\Theta(1) + \Theta(n-i)) = \Theta(n^2)$.

Space complexity: $O(1)$ extra space, thus in-place.

It is not stable: the swap operation can break the relative order of equal items.

Consider the counterexample [2a, 2b, 1]. It is sorted after the first swap, becoming [1, 2b, 2a], where 2a and 2b are now out of their original order.

Selection sort is similar to HeapSort: if A is organized as a heap, extracting the minimum becomes faster, which leads to the faster algorithm. Hence, the choice of data structure affects the performance of an algorithm.
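
To make this concrete, here is a small runnable Python rendering of SelectionSort (my own sketch, not from the lecture); the (value, label) pairs replay the counterexample above:

```python
def selection_sort(a):
    """In-place selection sort; compares only the first component of each item."""
    n = len(a)
    for i in range(n - 1):
        min_idx = i
        for j in range(i + 1, n):
            if a[j][0] < a[min_idx][0]:
                min_idx = j
        a[i], a[min_idx] = a[min_idx], a[i]  # this swap can reorder equal keys

# Labels record the original order of equal keys.
print(selection_sort.__name__)
a = [(2, 'a'), (2, 'b'), (1, '')]
selection_sort(a)
print(a)  # [(1, ''), (2, 'b'), (2, 'a')] -- 2a now follows 2b: not stable
```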

Bubble Sort

Basic idea

Repeatedly step through the array, comparing adjacent pairs and swapping them if they are in the wrong order. Thus, larger elements "bubble" to the "top".

```
BubbleSort(A):
    for i := A.length down to 2
        for j := 1 to i - 1
            if A[j] > A[j + 1]
                Swap(A[j], A[j + 1])
```

Time complexity: $\Theta(n^2)$.

Space complexity: $O(1)$ extra space, thus in-place.

It is stable. Adjacent elements are swapped only if they are in the wrong order.

If no swap happens during one pass, the array is already sorted and we are done.

```
BubbleSortImproved(A):
    n := A.length
    repeat
        swapped := false
        for i := 1 to n - 1
            if A[i] > A[i + 1]
                Swap(A[i], A[i + 1])
                swapped := true
        n := n - 1
    until swapped == false
```

When the input is mostly sorted, this variant performs much better. In particular, when the input is already sorted, this variant runs in $O(n)$ time.

Nonetheless, the worst-case performance is still $\Theta(n^2)$, e.g., when the input is reverse sorted.

We can be more aggressive when reducing $n$ after each iteration: in A[1...n], items after the position of the last swap are already in their correct sorted positions.

```
BubbleSortImprovedFurther(A):
    n := A.length
    repeat
        lastSwapIdx := -1
        for i := 1 to n - 1
            if A[i] > A[i + 1]
                Swap(A[i], A[i + 1])
                lastSwapIdx := i + 1
        n := lastSwapIdx - 1
    until n <= 1
```
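
For reference, a runnable Python rendering of this further-improved variant (my own sketch; the invariant is noted in the comments):

```python
def bubble_sort_improved_further(a):
    """Bubble sort that shrinks the scan range to the position of the last swap."""
    n = len(a)
    while n > 1:
        last_swap = 0  # items at indices >= last_swap are in final position
        for i in range(n - 1):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
                last_swap = i + 1
        n = last_swap  # the next pass only needs to reach the last swap
    return a

print(bubble_sort_improved_further([5, 1, 4, 2, 8]))  # [1, 2, 4, 5, 8]
```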

Comparison of simple sorting algorithms

  • Insertion
    • $\dfrac{n(n-1)}{2}$ swaps and $\dfrac{n(n-1)}{2}$ comparisons in the worst case.
    • $\dfrac{n(n-1)}{4}$ swaps and $\dfrac{n(n-1)}{4}$ comparisons in the average case.
  • Selection
    • $n-1$ swaps and $\dfrac{n(n-1)}{2}$ comparisons in the worst case.
  • Bubble
    • $\dfrac{n(n-1)}{2}$ swaps and $\dfrac{n(n-1)}{2}$ comparisons in the worst case.

Shell Sort

  • Insertion sort is effective when:
    • The input size is small
    • The input array is nearly sorted (resulting in few comparisons and swaps)
  • Insertion sort is ineffective when:
    • Elements must move far across the array

So we will

  • Allow elements to move in large steps
  • Bring elements close to their final locations
    • Making the array almost sorted

Idea

Use some decreasing sequence of step sizes $h$ (e.g., $\dots, 8, 4, 2, 1$); the sequence must end with $1$ (to ensure the correctness of sorting).

  • For each step size $h$, sort the array so that elements exactly $h$ positions apart are in order.
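
A minimal Python sketch of this idea, assuming the simple halving gap sequence $n/2, n/4, \dots, 1$ (many other gap sequences exist):

```python
def shell_sort(a):
    """Shell sort with the simple n/2, n/4, ..., 1 gap sequence."""
    n = len(a)
    gap = n // 2
    while gap > 0:
        # Insertion sort on elements that are `gap` positions apart.
        for i in range(gap, n):
            x = a[i]
            j = i
            while j >= gap and a[j - gap] > x:
                a[j] = a[j - gap]  # shift the larger element one gap rightward
                j -= gap
            a[j] = x
        gap //= 2  # the sequence must end with gap = 1
    return a

print(shell_sort([23, 12, 1, 8, 34, 54, 2, 3]))  # [1, 2, 3, 8, 12, 23, 34, 54]
```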

Quick Sort

Here's a unified view of many sorting algorithms: divide the problem into subproblems, conquer the subproblems recursively, and combine the solutions of the subproblems.

  • Divide the input into parts of size $1$ and size $n-1$.
    • InsertionSort: easy to divide, combining needs effort.
    • SelectionSort: dividing needs effort, easy to combine.
  • Divide the input into two parts of equal size.
    • MergeSort: easy to divide, combining needs effort.
  • Divide the input into two parts of approximately equal size.
    • QuickSort: dividing needs effort, easy to combine.

Basic idea

Given an array $A$ of $n$ items:

  • Choose one item $x$ in $A$ as the pivot.
  • Use the pivot to partition the input into $B$ and $C$, so that items in $B$ are $\le x$ and items in $C$ are $\ge x$.
  • Recursively sort $B$ and $C$.
  • The output is the concatenation of sorted $B$, $x$, and sorted $C$.

```
QuickSortAbs(A):
    if |A| <= 1
        return A
    x := GetPivot(A)
    B, C := Partition(A, x)
    return Concatenate(QuickSortAbs(B), x, QuickSortAbs(C))
```

Choosing the pivot

Ideally the pivot should partition the input into two parts of roughly the same size.

Median-of-three technique

Choose the median of the first, middle, and last elements as the pivot.

For every simple deterministic method of choosing the pivot, we can construct a corresponding "bad input".

For now just use the last item as the pivot.

The Partition Procedure

  • Allocate an array $B$ of size $n$.
  • Sequentially go through A[1...(n-1)], putting small items at the left side of $B$ and large items at the right side of $B$.
  • Finally, put the pivot in the (only) remaining position.

The time complexity is $\Theta(n)$ and the space complexity is $\Theta(n)$. Also, it is not stable.

```
Partition(A):
    n := A.length
    x := A[n]
    B := new array of size n
    l := 1
    r := n
    for i := 1 to n - 1
        if A[i] <= x
            B[l] := A[i]
            l := l + 1
        else
            B[r] := A[i]
            r := r - 1
    B[l] := x
    return <B, l>
```

In-place Partition Procedure

Basic idea

Sequentially go through $A$, using swap operations to move small items to the left part of $A$; the right part of $A$ then naturally contains the large items.

```
InplacePartition(A, p, r):
    i := p - 1
    for j := p to r - 1
        if A[j] <= A[r]
            i := i + 1
            Swap(A[i], A[j])
    Swap(A[i + 1], A[r])
    return i + 1
```


Performance of InplacePartition:

  • $\Theta(|r - p|)$ time (i.e., linear time)
  • $O(1)$ space
  • Not stable

Performance

Worst-case

The cost at each level of the recursion tree is $c_0 (n-m)$, where $m$ is the number of pivots removed at lower levels of the recursion.

If the partition is balanced, then there will be few levels, and $m$ will increase rapidly.

Recurrence for the worst-case runtime of QuickSort:

$$T(n) = \max_{0 \le q \le n-1} (T(q) + T(n-q-1)) + c_0 n$$

Guess $T(n) \le c n^2$ and verify:

$$\begin{aligned} T(n) &\le \max_{0 \le q \le n-1} (c q^2 + c (n-q-1)^2) + c_0 n\\ &= c \max_{0 \le q \le n-1} (q^2 + (n-q-1)^2) + c_0 n\\ &\le c(n-1)^2 + c_0 n\\ &= cn^2 - c(2n-1) + c_0 n\\ &\le cn^2 \end{aligned}$$

where the last step holds for $c \ge c_0$. Thus $T(n) = O(n^2)$.

Best-case

A balanced partition gives the best-case performance: $T(n) \le 2T(\frac{n}{2}) + \Theta(n)$ implies $T(n) = O(n\log n)$.

The partition does not need to be perfectly balanced; it suffices that each split has constant proportionality: $T(n) \le T(dn) + T((1-d)n) + \Theta(n)$ for some constant $0 < d < 1$ still gives $T(n) = O(n\log n)$.

Average-case

Average-case analysis considers the expected time of the algorithm over all inputs of size $n$: $A(n) = \sum\limits_{x \in \mathscr{X}_n} T(x) \cdot \Pr(x)$.

For QuickSort, the particular values in the array are not important; instead, the relative ordering of the values is what matters (since QuickSort is comparison-based).

Therefore, we focus on the permutation of the input numbers. A reasonable assumption is that all permutations of the input numbers are equally likely.

To make the analysis simple, we also assume that the elements are distinct (duplicate values will be discussed later).

Randomized QuickSort

```
RandQuickSort(A, p, r):
    if p < r
        i := Random(p, r)
        Swap(A[r], A[i])
        q := InplacePartition(A, p, r)
        RandQuickSort(A, p, q - 1)
        RandQuickSort(A, q + 1, r)
```
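
For reference, a runnable Python rendering of RandQuickSort together with the Lomuto-style InplacePartition above (my own sketch, not from the lecture):

```python
import random

def inplace_partition(a, p, r):
    """Lomuto partition: a[r] is the pivot; returns the pivot's final index."""
    i = p - 1
    for j in range(p, r):
        if a[j] <= a[r]:
            i += 1
            a[i], a[j] = a[j], a[i]
    a[i + 1], a[r] = a[r], a[i + 1]
    return i + 1

def rand_quick_sort(a, p=0, r=None):
    if r is None:
        r = len(a) - 1
    if p < r:
        i = random.randint(p, r)   # choose a pivot uniformly at random
        a[r], a[i] = a[i], a[r]    # move it to the end
        q = inplace_partition(a, p, r)
        rand_quick_sort(a, p, q - 1)
        rand_quick_sort(a, q + 1, r)
    return a

print(rand_quick_sort([3, 7, 1, 4, 1, 5, 9, 2, 6]))  # [1, 1, 2, 3, 4, 5, 6, 7, 9]
```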

Cost of a call to RandQuickSort:

  • Choose a pivot in $\Theta(1)$ time;
  • Run InplacePartition, whose cost is $O(\text{number of comparisons})$;
  • Call RandQuickSort twice; the calling process (not the subroutines themselves) needs $\Theta(1)$ time.

Total cost of RandQuickSort:

  • Time for choosing pivots is $O(n)$, since each element can be the pivot at most once.
  • All calls to InplacePartition cost $O(\text{total number of comparisons})$.
  • Total time for calling RandQuickSort is $O(2n)$, since each time a pivot is chosen, two recursive calls are made.

The cost of RandQuickSort is $O(n + X)$, where $X$ is a random variable denoting the number of comparisons performed in InplacePartition throughout the entire execution.

Each pair of items is compared at most once, because items are only compared with pivots, and each item can be the pivot at most once.

For ease of analysis, we index the elements of the array $A$ by their position in the sorted output rather than their position in the input: we refer to them as $z_1 < z_2 < \dots < z_n$.

Let $X_{ij}$ be the indicator random variable that is $1$ if $z_i$ and $z_j$ are compared in InplacePartition, and $0$ otherwise. Then, by linearity of expectation,

$$\begin{aligned} \mathbb{E}[X] &= \mathbb{E}\left[ \sum_{i=1}^{n-1}\sum_{j=i+1}^{n}X_{ij} \right]\\ &= \sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\mathbb{E}[X_{ij}]\\ &= \sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\Pr(X_{ij} = 1) \end{aligned}$$

Let $Z_{ij} = \left\lbrace z \mid z \in A,\ z_i \le z \le z_j \right\rbrace$ for $i \le j$, and let $\hat{z}_{ij}$ be the first item in $Z_{ij}$ that is chosen as a pivot.

Then $z_i$ and $z_j$ are compared if and only if $\hat{z}_{ij} = z_i$ or $\hat{z}_{ij} = z_j$, because items from $Z_{ij}$ stay in the same split until a pivot is chosen from $Z_{ij}$.

Since each of the $j-i+1$ items in $Z_{ij}$ is equally likely to be the first one chosen as a pivot, we therefore have

$$\begin{aligned} \Pr(X_{ij} = 1) &= \Pr(\hat{z}_{ij} = z_i) + \Pr(\hat{z}_{ij} = z_j)\\ &= \dfrac{2}{j - i + 1} \end{aligned}$$

Substituting this back into the expectation formula, we get

$$\begin{aligned} \mathbb{E}[X] &= \sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\dfrac{2}{j - i + 1}\\ &= \sum_{i=1}^{n-1}\sum_{k=1}^{n-i}\dfrac{2}{k + 1}\\ &< \sum_{i=1}^{n-1}\sum_{k=1}^n \dfrac{2}{k}\\ &< 2n H_n\\ &< 2n(\ln n + 1)\\ &= O(n\log n) \end{aligned}$$

Combined with the fact that the best case of randomized quicksort is $\Theta(n\log n)$, the expected running time is $\Theta(n\log n)$.

In fact, the runtime of RandQuickSort is $O(n\log n)$ with high probability.

Stop recursion once the array is too small.

  • Recursion has overhead, so QuickSort is slow on small arrays.
  • We usually use InsertionSort for small arrays instead, resulting in fewer swaps, comparisons, and other operations on such small arrays; see the sketch below.
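
A sketch of this hybrid in Python (the cutoff value 16 is an assumption; real implementations tune it empirically):

```python
import random

CUTOFF = 16  # assumed threshold; tuned empirically in practice

def insertion_sort(a, p, r):
    """Insertion sort on the inclusive range a[p..r]."""
    for i in range(p + 1, r + 1):
        x = a[i]
        j = i - 1
        while j >= p and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x

def hybrid_quick_sort(a, p=0, r=None):
    if r is None:
        r = len(a) - 1
    if r - p + 1 <= CUTOFF:
        insertion_sort(a, p, r)  # small subarrays: insertion sort is faster
        return a
    i = random.randint(p, r)
    a[r], a[i] = a[i], a[r]
    # Lomuto partition around a[r]
    i = p - 1
    for j in range(p, r):
        if a[j] <= a[r]:
            i += 1
            a[i], a[j] = a[j], a[i]
    a[i + 1], a[r] = a[r], a[i + 1]
    q = i + 1
    hybrid_quick_sort(a, p, q - 1)
    hybrid_quick_sort(a, q + 1, r)
    return a

data = random.sample(range(100), 30)
print(hybrid_quick_sort(data) == sorted(set(data)))  # True
```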

Comparison

  • HeapSort is non-recursive and has minimal auxiliary storage requirements (good for embedded systems), but it has poor locality of reference: element accesses are not sequential, resulting in many cache misses. It is typically the slowest of the three algorithms.

  • In most (not all) tests, QuickSort turns out to be faster than MergeSort. This is because although QuickSort performs about 39% more comparisons than MergeSort, it performs much less movement (copying) of array elements.

  • MergeSort is a stable sort and can take advantage of partially pre-sorted input. Further, MergeSort is more efficient at handling slow-to-access sequential media.

External Sorting*

External sorting is required when the data being sorted does not fit into the main memory of a computing device, and instead must reside in slower external memory, usually a disk drive.

Since I/O is rather expensive, the overall execution cost may be dominated by I/O; the target of algorithm design is thus to reduce the number of I/Os.

External merge sort

External merge problem

  • Input: 2 sorted lists with $M$ and $N$ pages
  • Output: 1 merged sorted list with $M+N$ pages

We can efficiently (in terms of I/O) merge the two lists using a memory buffer of at least 3 pages, with only $2(M+N)$ I/Os.

To find an element that is no larger than all remaining elements in the two lists, one only needs to compare the minimum elements of each list. Each time, the current minimum element is written back to disk.

Deal with unsorted large files:

  1. Split into chunks small enough to sort in memory.
    • each sorted sub-file is called a "run".
  2. Merge pairs (or groups) of runs using the external merge algorithm.
  3. Keep merging the resulting runs (each round of merging is called a "pass") until we are left with one sorted file.
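
A toy Python sketch of this pipeline (RUN_SIZE stands in for the available memory, temporary files stand in for disk pages, and heapq.merge performs the multiway merge; a real external sort would also stream its input from disk):

```python
import heapq, itertools, tempfile

RUN_SIZE = 4  # assumed in-memory capacity (records per run)

def make_runs(records):
    """Pass 0: cut the input stream into sorted runs, one temp file per run."""
    runs = []
    it = iter(records)
    while True:
        chunk = sorted(itertools.islice(it, RUN_SIZE))
        if not chunk:
            break
        f = tempfile.TemporaryFile('w+')  # stands in for a file on disk
        f.writelines(f"{v}\n" for v in chunk)
        f.seek(0)
        runs.append(f)
    return runs

def merge_runs(runs):
    """Merge pass: heapq.merge streams the runs (a multiway merge)."""
    streams = [(int(line) for line in f) for f in runs]
    return list(heapq.merge(*streams))

print(merge_runs(make_runs([9, 3, 7, 1, 8, 2, 6, 5, 4, 0])))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```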

The 2-way merge algorithm only uses 3 buffer pages.

  • What if we have more available memory?
    • Use as much of the available memory as possible in every pass
    • Reducing the number of passes reduces I/O

Suppose we have $B \ge 3$ buffer pages available. First, we can increase the length of the initial runs.

The I/O cost becomes $2 N \left(\left\lceil \log_2 \dfrac{N}{B} \right\rceil + 1\right)$ (starting with runs of length $B$), down from $2 N \left(\left\lceil \log_2 N \right\rceil + 1\right)$ (starting with runs of length $1$).

Then, we can perform a $(B-1)$-way merge.

On each pass, we merge groups of $B-1$ runs at a time, instead of merging pairs of runs.

The I/O cost becomes smaller still: $2N\left(\left\lceil \log_{B-1}\dfrac{N}{B} \right\rceil + 1\right)$ (performing a $(B-1)$-way merge).

Complexity of a problem

Upper bound and Lower bound

Consider a problem $P$.

  • Upper bound: how fast can we solve the problem?
    • How much cost suffices to solve it?
    • The (worst-case) runtime of an algorithm $A$ on inputs of size $n$ is $T_A(n) = \max\limits_{|I| = n}\left\lbrace \operatorname{cost}_A(I) \right\rbrace$.
    • $T_A(n)$ upper bounds the complexity of solving problem $P$.
    • Every valid algorithm gives an upper bound on the complexity of $P$.
  • Lower bound: how slow does solving the problem have to be?
    • Below what cost does the problem become unsolvable?
    • The worst-case complexity of $P$ is the worst-case runtime of the fastest algorithm that solves $P$: $T_P(n) = \min\limits_{A \text{ solves }P}\left\lbrace T_A(n) \right\rbrace$.
    • $T_P(n)$, usually stated in the form $\Omega(f(n))$, lower bounds the complexity of solving problem $P$.
    • $T_P(n) = \Omega(f(n))$ means any algorithm has to spend $\Omega(f(n))$ time to solve problem $P$.

A larger lower bound is a stronger lower bound. (On the other hand, a smaller upper bound is better.)

How to prove a lower bound?

  • It is usually impractical to examine all possible algorithms.
  • Instead, we rely on structures/properties of the problem itself.

Trivial Lower Bounds

Lower Bound based on output size

  • Any algorithm that, for inputs of size $n$, has a worst-case output size of $f(n)$ needs $\Omega(f(n))$ time just to write its output.

However, a lower bound may not be based on the input size.

Consider outputting the first element of an $n$-length array: the lower bound is $\Omega(1)$ while the input size is $n$.

The problem with trivial techniques is that they often suggest a lower bound that is too low.

Adversary Argument

The lower bound is $T_P(n) = \min\limits_{A \text{ solves }P}\left\lbrace \max\limits_{|I| = n}\left\lbrace \operatorname{cost}_A(I) \right\rbrace \right\rbrace$.

The key is to design, for each algorithm, a worst-case input that the algorithm must nevertheless solve correctly.

We first have to specify precisely what kinds of algorithms we consider, by identifying their key operations. Then we devise an adversary strategy to construct a worst-case input.

Key operation

A key operation is a step that is representative of the computation overall.

Properties:

  • Can be constant-time operations
  • Represent or dominate other operations
  • The number of key operations should be expressible as a function of the input size

A model of computation might be designed around key operations (other computations are omitted).

Examples:

  • Access one data item (e.g., query the $i$-th value in an array)
  • Compare two categorical items (outcome $=$ or $\ne$)
  • Compare two ordinal items (outcome $=$, $<$, or $>$)
  • Determine if two vertices in a graph are adjacent

Adversary Strategy

Consider a game similar to $n$-card Monte:

  • Suppose we have an array of $n$ bits and we want to determine if any of them is a $1$.

Can we do better, without looking at every bit?

Basic idea:

  • An all-powerful malicious adversary pretends to choose an input for the algorithm.
  • When the algorithm checks a bit, the adversary sets that bit to whatever value will make the algorithm do the most work.
  • If the algorithm does not check enough bits before terminating, then there will be several different inputs, each consistent with the bits already checked, that should result in different outputs.
  • Whatever the algorithm outputs, the adversary can "reveal" an input that is consistent with all the examined bits but contradicts the algorithm's output, and then claim that this was the input it was using all along.
  • One absolutely crucial feature of this argument is that the adversary makes absolutely no assumptions about the algorithm.
  • The adversary strategy can’t depend on some predetermined order of examining bits, and it doesn't care about anything the algorithm might or might not do when it's not looking at bits.
  • However, as long as there are at least two possible answers to the problem that are consistent with all answers given by the adversary, the algorithm cannot be done.

Comparison-based sorting

Setup for comparison sorting

  • The input to the problem is nn elements in some initial order.
  • The algorithm knows nothing about the elements.
  • The algorithm may compare two elements at a cost of $1$. In particular, the algorithm cannot inspect the actual values of the input items.
  • Moving/copying/swapping items is free.
  • Assume that the input contains no duplicates.

There are $n!$ different permutations that the sorting algorithm must decide between.

The adversary maintains a list $L$ of all the permutations that are consistent with the comparisons the algorithm has made so far.

Initially $L$ contains all $n!$ permutations.

The adversary's strategy for responding to the comparison is as follows:

  • The adversary answers each comparison in whichever way keeps $L$ as large as possible. This constructs the worst-case input.

One question: this looks greedy — does maximizing $|L|$ at the current step necessarily lead to the maximum in the future? In any case, it does not affect the derivation below.

After each answer, at least half of the permutations in $L$ remain. Since the algorithm cannot be done until $|L| = 1$, the number of comparisons required is at least $\left\lceil \log n! \right\rceil$.

Therefore, the lower bound for comparison-based sorting is $\Omega(\log n!) = \Omega(n\log n)$.
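
For completeness (this step is standard but not spelled out in the notes), $\log n!$ is indeed $\Theta(n \log n)$:

$$\log n! = \sum_{k=1}^{n} \log k \ge \sum_{k=\lceil n/2 \rceil}^{n} \log k \ge \frac{n}{2}\log\frac{n}{2} = \Omega(n\log n),$$

and conversely $\log n! \le \log n^n = n \log n$.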

Information-Theoretic Argument

The amount of the information

Consider the minimum number $M$ of distinct outputs that a sorting algorithm must be able to produce in order to sort any possible input of length $n$. Then $M$ must be at least $n!$.

In other words, the algorithm must be capable of outputting at least $M$ different permutations, or there would exist some input that it was not capable of sorting.

The algorithm is deterministic and its behavior is determined entirely by the results of the comparisons.

If a deterministic algorithm makes $c$ comparisons, at most $2^c$ different outputs can be produced, because each comparison has only two possible outcomes.

Therefore $2^c \ge n!$, and $c$ is at least $\log n!$.

An alternative view: Decision Trees

A decision tree (决策树) can be used to describe an algorithm:

  • A decision tree is a tree.
  • Each internal node denotes a query the algorithm makes on input.
  • Outgoing edges denote the possible answers to the query.
  • Each leaf denotes an output.

One execution of the algorithm is a path from the root to a leaf.

The worst-case time complexity is at least the length of the longest root-to-leaf path, i.e., the height of the tree.

A comparison-based sorting algorithm can be described by a binary comparison tree:

  • Each internal node has two outgoing edges
  • Each internal node denotes a query of the form $a_i \le a_j$

The tree must have $\ge n!$ leaves, so the height of the tree must be $\ge \log n!$, which is $\Omega(n \log n)$.

Information

Information content, or self-information or Shannon information of an event xx:

$$I(x) \coloneqq - \log \Pr(x)$$

or equivalently $I(x) \coloneqq \log \dfrac{1}{\Pr(x)}$.

It measures the informational value of an event according to how surprising it is.

If a highly likely event occurs, it carries very little information. On the other hand, if a highly unlikely event occurs, it is much more informative.

Information entropy

Given a discrete random variable $X$, which takes values in the alphabet $\mathscr{X}$ and is distributed according to $\Pr\colon \mathscr{X} \to [0, 1]$, the entropy (熵) of the random variable is the average level of information, surprise, or uncertainty inherent to the variable's possible outcomes:

$$\begin{aligned} H(X) &\coloneqq \mathbb{E}[I(X)] \\ &= \sum_{x \in \mathscr{X}} \Pr(x) \cdot I(x) \\ &= -\sum_{x \in \mathscr{X}} \Pr(x) \log \Pr(x) \end{aligned}$$

Information-theoretic lower bound

When sorting an array of size $n$, the order of the input is a random permutation $S$, whose entropy is $\log n!$ bits.

Since a comparison can give only two results, the maximum amount of information it provides is $1$ bit.

Therefore, after $k$ comparisons, the remaining entropy of the permutation, given the results of those comparisons, is at least $\log n! - k$ bits on average.

To perform the sort, complete information is needed, which means the remaining entropy must be $0$. It follows that $k$ must be at least $\log n!$ on average.

Non-comparison-based sorting

Bucket Sort (桶排序)

Bucket Sort

  1. Create $d$ empty lists; each list is a bucket.
  2. Scan through the input. For each item, append it to the end of the corresponding list.
  3. Concatenate all lists.

This algorithm takes $\Theta(n)$ time.

It is not a comparison-based algorithm: no comparisons between items are made.

```
BucketSort(A, d):
    <L1, L2, ..., Ld> := CreateBuckets(d)
    for i := 1 to A.length
        AssignToBucket(A[i])
    CombineBuckets(L1, L2, ..., Ld)
```

Total time complexity is $\Theta(n + d)$:

  • $\Theta(d)$ time to create buckets.
  • $\Theta(n)$ time to assign items to buckets.
  • $\Theta(d)$ time to combine buckets.

If the range of items' values is too large, allow each bucket to hold multiple values.

Allocate $k$ buckets, each responsible for an interval of size $\dfrac{d}{k}$.

Now we need to sort each bucket before combining them.

```
BucketSort(A, k):
    <L1, L2, ..., Lk> := CreateBuckets(k)
    for i := 1 to A.length
        AssignToBucket(A[i])
    for j := 1 to k
        SortWithinBucket(Lj)
    CombineBuckets(L1, L2, ..., Lk)
```

Runtime is $\Theta(n + k)$, plus the cost of sorting within buckets.

If the items are uniformly distributed and we use insertion sort within buckets, the expected cost of sorting is $O\left(k \left(\dfrac{n}{k}\right)^2\right) = O\left(\dfrac{n^2}{k}\right)$.

The expected total runtime is $O\left(n + k + \dfrac{n^2}{k}\right)$, which is $O(n)$ when we have $k \approx n$ buckets.

BucketSort can be stable.
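
A runnable Python sketch of this variant, assuming keys uniformly distributed in $[0, 1)$; appending to buckets preserves the input order, which is why the algorithm can be stable:

```python
import random

def bucket_sort(a, k):
    """Bucket sort for keys uniformly distributed in [0, 1)."""
    buckets = [[] for _ in range(k)]
    for x in a:
        buckets[int(x * k)].append(x)  # each bucket covers an interval of size 1/k
    out = []
    for b in buckets:
        b.sort()  # cheap when items are spread evenly; insertion sort also works
        out.extend(b)
    return out

data = [random.random() for _ in range(10)]
print(bucket_sort(data, k=10) == sorted(data))  # True
```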

Radix Sort (基数排序)

Assume we want to sort $n$ decimal integers, each of $d$ digits.

Recursive bucket sort:

  • Based on the most significant digit, assign items to 10 buckets
  • Sort recursively in each bucket

RadixSort, in contrast, is iterative, starting from the least significant digit.

```
RadixSort(A, d):
    for i := 1 to d
        use a STABLE sort to sort A on digit i
```

If we use BucketSort as the stable sorting method, then since we only consider decimal numbers, we need $10 = \Theta(1)$ buckets.

RadixSort can thus sort $n$ decimal $d$-digit numbers in $O(dn)$ time.
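
A short Python sketch of this iterative, least-significant-digit radix sort (digit extraction by repeated division is an implementation choice):

```python
def radix_sort(a, d):
    """LSD radix sort for non-negative integers with at most d decimal digits."""
    for i in range(d):
        buckets = [[] for _ in range(10)]
        for x in a:
            digit = (x // 10**i) % 10
            buckets[digit].append(x)  # appending keeps each pass stable
        a = [x for b in buckets for x in b]
    return a

print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66], d=3))
# [2, 24, 45, 66, 75, 90, 170, 802]
```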

Lower bound for sorting by querying value

Solving the "sort $n$ integers" problem by querying the values of the input has a time complexity of $\Omega(n)$.

Selection

Given a set of $n$ items, its $i$-th order statistic (顺序统计量) is its $i$-th smallest element.

The Selection Problem: given a set $A$ of $n$ distinct numbers and an integer $i$, find the $i$-th order statistic of $A$.

Find MIN/MAX

Strategy

  1. Group the items into pairs.
    • The first item forms a pair by itself if $n$ is odd.
  2. For each of the $\left\lceil \frac{n}{2} \right\rceil$ pairs, find the local MIN and MAX.
    • Costing $\left\lfloor \frac{n}{2} \right\rfloor$ comparisons.
  3. Among the $\left\lceil \frac{n}{2} \right\rceil$ local MINs, find the global MIN. Similarly, find the global MAX.
    • Costing at most $2 \left\lfloor \frac{n}{2} \right\rfloor$ comparisons.

The total number of comparisons is at most $3 \left\lfloor \frac{n}{2} \right\rfloor$.
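
A Python rendering of the pairing strategy (my own sketch; the comments count the three comparisons spent per pair):

```python
def min_max(a):
    """Find (MIN, MAX) using about 3 * floor(n/2) comparisons."""
    n = len(a)
    if n % 2:                      # odd n: the first item starts as both
        lo = hi = a[0]
        start = 1
    else:                          # even n: order the first pair (1 comparison)
        lo, hi = (a[0], a[1]) if a[0] < a[1] else (a[1], a[0])
        start = 2
    for i in range(start, n, 2):
        # 1 comparison to order the pair locally:
        x, y = (a[i], a[i + 1]) if a[i] < a[i + 1] else (a[i + 1], a[i])
        if x < lo:                 # 1 comparison against the running MIN
            lo = x
        if y > hi:                 # 1 comparison against the running MAX
            hi = y
    return lo, hi

print(min_max([3, 1, 4, 1, 5, 9, 2, 6]))  # (1, 9)
```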

This is the best we can do, as the following adversary argument shows.

  • An item has a $+$ mark if it could still be the MAX, and a $-$ mark if it could still be the MIN.
  • Initially each item has both $+$ and $-$.
  • An adversary answers queries of the form "compare $x$ with $y$".
  • The adversary can construct an input such that at most $\left\lfloor \frac{n}{2} \right\rfloor$ comparisons each remove two marks.
    • If in pair $a, b$ we learn $a > b$, then $a$ loses its $-$ mark and $b$ loses its $+$ mark: two marks in total.
  • Every other comparison removes at most one mark.
    • For example, suppose pair $c, d$ has $d > c$, so $c$ has lost its $+$ mark and $d$ has lost its $-$ mark. If we then compare $a$ with $d$ and learn $a > d$, we can only remove $d$'s $+$ mark: one mark at most.
  • In total, $2n-2$ marks need to be removed (every mark except the final MIN's $-$ and the final MAX's $+$).

So at least $2n-2 - 2\left\lfloor \frac{n}{2} \right\rfloor + \left\lfloor \frac{n}{2} \right\rfloor = 2n-2-\left\lfloor \frac{n}{2} \right\rfloor$ comparisons are needed, which matches the $3\left\lfloor \frac{n}{2} \right\rfloor$ upper bound (exactly, when $n$ is odd).

General Selection Problem

Find the $i$-th smallest element

```
RndQuickSort(A):
    if A.size > 1
        q := RandomPartition(A)
        RndQuickSort(A[1...q-1])
        RndQuickSort(A[q+1...n])
```

Cases:

  • What if $i = q$?
    • A[q] is what we need.
  • What if $i < q$?
    • Find the $i$-th order statistic in A[1...q-1].
  • What if $i > q$?
    • Find the $(i-q)$-th order statistic in A[q+1...n].

Then we get Reduce-and-Conquer.

```
RndSelect(A, i):
    if A.size == 1
        return A[1]
    else
        q := RandomPartition(A)
        if i == q
            return A[q]
        else if i < q
            return RndSelect(A[1...q-1], i)
        else
            return RndSelect(A[q+1...n], i-q)
```

Complexity

  • Best case
    • The pivot chosen in the first call happens to be the answer.
    • $\Theta(n)$
  • Worst case
    • Partition reduces the array size by one each time.
    • $\Theta(n^2)$
  • Average case
    • Analyzed below.

Average performance of Randomized Selection

It is unlikely that we either pick exactly the right pivot or reduce the size by just one.

Instead, what is likely to happen is that the partition reduces the problem size by a constant factor.

Call a partition good if it reduces the problem size to at most $\alpha n$ for some constant $\alpha < 1$, say $0.8$.

Let the random variable $C_i$ be the cost incurred between the $(i-1)$-th good partition and the $i$-th good partition.

Solving $0.8^{k}n = 1$ shows that at most $\log_{1.25} n$ good partitions can occur, and

$$\mathbb{E} [C_i] \le \Theta(1) \cdot 0.8^{i-1}n$$

The number of partitions until the next good partition is a random variable following the geometric distribution, whose expectation depends only on the success probability $p$ and equals $\dfrac{1}{p}$. Since a uniformly random pivot yields a good partition with constant probability, the expected number of non-good partitions between consecutive good ones is constant; this is the $\Theta(1)$ factor above.

Then the total time cost

$$\begin{aligned} \mathbb{E} [T(n)] &\le \mathbb{E} \left[ \sum_{i=1}^{\log_{1.25}n}C_i \right] \\ &= \sum_{i=1}^{\log_{1.25}n}\mathbb{E}[C_i] \\ &= O(n) \end{aligned}$$

where the last step sums the geometric series $\sum_{i} 0.8^{i-1} n \le 5n$.

Median of medians

Can we guarantee a worst-case runtime of $O(n)$?

The reason that RndSelect could be slow is that RandomPartition might return an unbalanced partition.

We need a partition procedure that is guaranteed to be balanced without using too much time ($O(n)$ time, to be specific).

Procedure:

  1. Divide the elements into $n / 5$ groups, each containing $5$ elements; call these groups $G_1, \dots, G_{n / 5}$.
    • $O(n)$ time
  2. Find the median of each of the $n / 5$ groups; let $M$ be the set of these medians.
    • $\frac{n}{5} \cdot \Theta(1) = \Theta(n)$ time
  3. Find the median of $M$; call it $m^{*}$.
    • Use QuickSelect recursively.

Note that $m^{*}$ is generally not the median of the original array, but it is guaranteed to lie between the $30$th and $70$th percentiles (百分位数) of the original array.

We can therefore use $m^{*}$ as the pivot to partition the array.

```
MedianOfMedian(A):
    if A.size == 1
        return A[1]
    <G1, ..., Gn/5> := CreateGroups(A)
    for i := 1 to n/5
        Sort(Gi)
    M := GetMediansFromSortedGroups(G1, ..., Gn/5)
    return QuickSelect(M, (n/5)/2)
```

```
QuickSelect(A, i):
    if A.size == 1
        return A[1]
    m := MedianOfMedian(A)
    q := PartitionWithPivot(A, m)
    if i == q
        return A[q]
    else if i < q
        return QuickSelect(A[1...q-1], i)
    else
        return QuickSelect(A[q+1...n], i-q)
```
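
A compact Python sketch combining MedianOfMedian and QuickSelect (list comprehensions stand in for PartitionWithPivot; elements are assumed distinct, as in the problem statement):

```python
def select(a, i):
    """Return the i-th smallest (1-indexed) element of a, worst-case O(n)."""
    if len(a) <= 5:
        return sorted(a)[i - 1]
    # Median of each group of 5, then the median of those medians as pivot.
    groups = [a[j:j + 5] for j in range(0, len(a), 5)]
    medians = [sorted(g)[len(g) // 2] for g in groups]
    m = select(medians, (len(medians) + 1) // 2)
    # Partition around m (assumes distinct elements, as in the notes).
    less = [x for x in a if x < m]
    greater = [x for x in a if x > m]
    q = len(less) + 1  # rank of the pivot
    if i == q:
        return m
    elif i < q:
        return select(less, i)
    else:
        return select(greater, i - q)

print(select([9, 1, 0, 2, 3, 4, 6, 5, 8, 7], 4))  # 3
```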

  • QuickSelect uses $O(n)$ time/comparisons.
  • Solving general selection needs at least $n - 1$ comparisons, since even finding the MIN/MAX needs at least $n - 1$ comparisons.
  • So the lower and upper bounds match asymptotically.
  • If we care about constants, (much) more effort is needed, which is beyond the scope of this course.

  1. Consider [2a, 2b, 1], which is already a max-heap. First, 2a is extracted and placed at the end. Then 2b is extracted and placed at the second-to-last position. We end up with [1, 2b, 2a], so the relative order of 2a and 2b is reversed. ↩︎