Fast incremental sort

Update: Continued in this post.

I recently came across the need for an incremental sorting algorithm, and started to wonder how to do it optimally.

The incremental sorting problem is described here, and is an “online” version of partial sort. That is, if you have already sorted $k$ elements, you should be able to quickly sort $k+1$ elements, and so on.

Use Cases

Incremental sorts can be useful for a number of cases:

You want sorted items, but you don’t know how many elements you’ll need.
- This could often happen when you are filtering the resulting sequence, and you don’t know how many items will be filtered out.
You are streaming the sequence, so even though you want the whole sequence, you want the first elements as quickly as possible.

We’ll see how branch misprediction and other constant factors can make the naive asymptotically optimal version far slower than a practical implementation.

Measuring Performance

Measuring the performance of an incremental sort is a bit more involved than full sort, because for a given $N$ we care about how fast we reach each $k^{th}$ element. To keep it simple, I’ll keep $N$ fixed at 10 million, and measure the time taken to reach each $k^{th}$ element. The input range is a vector of random integers.

Pre-Sorting

Perhaps the simplest possible solution is pre-sorting the entire range, and then just using plain iterators to iterate over the sorted range.

template<typename I>
class pre_sorter {
public:
  pre_sorter(I i1, I i2) : first(i1), last(i2) {
    std::sort(first, last);
  }

  I begin() { return first; }
  I end() { return last; }

private:
  I first;
  I last;
};

Unsurprisingly, the performance of this algorithm is virtually independent of $k$ , as all the work is done at the start. This is obviously, not a good solution, but it’s a nice reference point, especially when $k$ approaches $N$ .

Pre-Sorting (ms)

k	Pre-Sorter
8	1083.18
16	1083.18
32	1083.18
64	1083.18
128	1083.18
256	1083.18
512	1083.18
1024	1083.18
2048	1083.18
4096	1083.18
8192	1083.19
16384	1083.19
32768	1083.22
65536	1083.24
131072	1083.28
262144	1083.36
524288	1083.48
1048576	1083.71
2097152	1084
4194304	1084.61
8388608	1085.75
10000000	1086.14

Partial Sorting

To avoid doing all the sorting work up front, the next logical step is trying to use something like std::partial_sort to do the work. The idea is to to start by partially sorting a constant small part of the range and remember how much of the range is sorted. When you run out of sorted elements, do another partial sort.

Since a partial sort must look at all the remaining elements of the range, we can’t keep partially sorting small ranges, that would lead to quadratic complexity. Instead, we multiply the number of elements to sort each time by some constant, giving us a logarithmic number of partial sorts.

template<typename I>
class partial_sorter {
public:
  partial_sorter(I i1, I i2) : first(i1), last(i2) {}

  class iterator {
  public:
    ...
    iterator& operator++() {
      ensure_sorted_at_current();
      ++current;
      return *this;
    }

    value_type& operator*() {
      ensure_sorted_at_current();
      return *current;
    }

  private:
    void ensure_sorted_at_current() {
      if (current == sort_end) {
        sort_end = 
            sort_size < end - sort_end ?
            sort_end + sort_size :
            end;
        std::partial_sort(current, sort_end, end);
        sort_size *= 2;
      }
    }
    I current;
    I sort_end;
    I end;
    int sort_size = 100;
  };
  ...
};

This method is a huge improvement for small $k$ , but it performs worse for $k\gt$ 1 million.

Partial Sorting (ms)

N	Partial Sorter	Pre-Sorter
8	7.8649	1083.18
16	7.86566	1083.18
32	7.86566	1083.18
64	7.86566	1083.18
128	15.5341	1083.18
256	15.5348	1083.18
512	23.1146	1083.18
1024	31.8237	1083.18
2048	41.0193	1083.18
4096	52.1542	1083.18
8192	68.4948	1083.19
16384	93.337	1083.19
32768	135.449	1083.22
65536	202.845	1083.24
131072	308.542	1083.28
262144	496.541	1083.36
524288	832.594	1083.48
1048576	1459.07	1083.71
2097152	2387.17	1084
4194304	3569.34	1084.61
8388608	4102.78	1085.75
10000000	4105.16	1086.14

Since the std::partial_sort algortithm costs $O(N \log k)$ , the $O(\log N)$ calls to std::partial_sort has a total worst case run time of $O(N \log^2 N )$ . This means that in total we are doing more work than we should, time to fix this.

Incremental Quicksort

In a paper from 2006, Paredes and Navarro describes an algorithm which uses an incremental version of quicksort to achieve the optimal bound of $O(N + k\log k)$ for incremental sorting.

This method keeps a stack of partition points for the range. Each partition point is an iterator to a position in the range where everything before the position is smaller than everything after. The stack is sorted such that the top of the stack is the position closest to the start of the range.

This stack has the wonderful property that when you are looking for the next element in the sorted range, and your current position is not equal to the top of the stack, the next element must be between the current position and the top of the stack.

If the distance to the next partition point is large, you cut it in half with std::partition, and push the new partition point on the stack. If the distance is small, we can sort the small range with std::sort, and pop the element from the stack. A simple version is shown here:

template<typename I>
class simple_quick_sorter {
public:
  simple_quick_sorter(I i1, I i2) : first(i1), last(i2) {}

  class iterator {
  public:
    ...
    iterator& operator++() {
      ensure_sorted_at_current();
      ++current;
      return *this;
    }

    value_type& operator*() {
      ensure_sorted_at_current();
      return *current;
    }

  private:
    void ensure_sorted_at_current() {
      if (current == sort_end) {
        while (stack.back() - sort_end > sort_limit) {
          auto range_size = stack.back() - current;
          value_type pivot = *(current + (mt() % range_size));
          auto it = std::partition(
            current,
            stack.back(),
            [=](const value_type& v) { return v < pivot; });
          while (it == current) {
            pivot = *(current + (mt() % range_size));
            it = std::partition(
              current,
              stack.back(),
              [=](const value_type& v) { return v < pivot; });
          }
          stack.push_back(it);
        }

        std::sort(sort_end, stack.back());
        sort_end = stack.back();
        stack.pop_back();
      }
    }
    I begin;
    I current;
    I sort_end;
    std::vector<I> stack;
    static const int sort_limit = 100;
    std::mt19937 mt{ std::random_device{}() };
  };
  ...
};

Looking at the results for this algorithm, it’s clear that the total amount of work done when $k=N$ is now optimal, because the time to sort the whole range is now as fast as a single call to std::sort. This has the benefit that you don’t need to worry about switching over to std::sort if you know that you’ll need a large part of the result, you can just use the incremental version anyway.

Simple Quick Sorting (ms)

k	Partial Sorter	Pre-Sorter	Simple Quick Sorter
8	7.8649	1083.18	53.6251
16	7.86566	1083.18	53.6259
32	7.86566	1083.18	53.6263
64	7.86566	1083.18	53.627
128	15.5341	1083.18	53.6305
256	15.5348	1083.18	53.6377
512	23.1146	1083.18	53.646
1024	31.8237	1083.18	53.6704
2048	41.0193	1083.18	53.7103
4096	52.1542	1083.18	54.2732
8192	68.4948	1083.19	54.7213
16384	93.337	1083.19	55.2367
32768	135.449	1083.22	56.2064
65536	202.845	1083.24	57.8833
131072	308.542	1083.28	65.7102
262144	496.541	1083.36	74.766
524288	832.594	1083.48	92.7102
1048576	1459.07	1083.71	142.352
2097152	2387.17	1084	219.68
4194304	3569.34	1084.61	428.108
8388608	4102.78	1085.75	796.165
10000000	4105.16	1086.14	919.948

The downside of this algorithm, though, is that it does spend considerably more time than std::partial_sort coming up with the first few elements. If I need between 1 and 5000 elements, partial sorting is still the way to go by the looks of it. Why is that?

Number of comparisons

Even though std::partial_sort is only guaranteed to have $O(N \log k)$ , a typical implementation will have closer to $N$ comparisons for small $k$ and random data. This is because the impementation typically makes a max heap of the first $k$ elements, and then walks through the rest of the range, inserting elements if they are smaller than the current max.

When $k$ is small, the elements in the heap will quickly become much smaller on average than most elements in the rest of the range, so insertions into the heap become less frequent as the range is processed. That means that for most of the elements in the range, the algorithm will just do a single comparison to verify that the element should not be in heap.

The incremental quick sort implementation, on the other hand, does a few more comparisons to find the first element. The first partition operation does $N$ comparisons to cut the range in two, the next partition takes $\frac{N}{2}$ comparisons to cut the first half down to a quarter, and so on. In total, it does about $2\cdot N$ comparisons to find the first element, twice that of a good partial sort.

The difference in performance is much more than the difference in comparisons can explain, however, which brings us to the next point.

Branch Mispredictions

Few comparisons is not the only benefit of partial sort, it also has the benefit that most of the comparisons gives the same result. This gives a very fast inner loop, because it avoids branch misprediction.

For a partition operation that cuts the range in half, the number of branch mispredictions is at a peak, about every other comparison will be a misprediction. To verify that this was a big contributing factor, I tested std::partition with various pivot elements, that cut the range at various percentiles, to see the effect. I also compared with two hand-written loops (one normal branching and one branchless) to see that effect added up.

We see that partition speed is heavily impacted by the branch mispredictions, it seems to account for up to $4x$ difference in speed.

(ms)

%	std::partition	branching	branchless
0	11.7968	16.0224	36.7653
1	15.7772	16.8826	33.9671
2	14.6577	17.0579	34.0169
3	17.5502	20.1389	35.739
4	14.3418	17.9083	31.4708
5	14.5596	18.1637	26.7742
6	15.3439	17.9451	23.1884
7	17.2111	20.018	27.8515
8	18.2778	21.125	24.2756
9	18.0675	23.2834	26.2831
10	20.2005	21.3348	23.0222
11	19.2368	22.0339	22.3802
12	18.808	21.2295	24.2208
13	19.1338	21.9753	22.3669
14	20.8068	26.4336	23.3153
15	24.336	27.0468	21.5044
16	27.4645	29.8541	23.4868
17	28.8201	31.1758	21.2261
18	28.8148	30.1289	21.0718
19	31.28	30.2661	20.9075
20	29.6518	31.3142	21.3401
21	29.8271	30.7174	22.6782
22	34.0443	34.7833	23.1051
23	35.8298	34.5951	22.7919
24	32.7097	35.3106	20.9938
25	31.8623	34.2561	21.1079
26	32.1733	32.9834	20.6646
27	35.5523	37.6043	21.0353
28	33.6041	35.7314	22.0145
29	35.6302	37.6708	22.6771
30	35.772	36.4316	20.9463
31	36.0035	36.7326	20.8304
32	36.194	36.2179	22.7599
33	39.0462	38.6444	31.429
34	39.7669	42.4499	23.0538
35	43.1285	46.0829	26.6244
36	37.2333	39.8859	20.7182
37	37.09	38.6409	20.9155
38	37.7157	39.1157	20.6312
39	38.1182	40.1368	20.7224
40	39.8334	40.4454	20.4137
41	40.6355	46.655	23.8106
42	41.9736	43.0969	24.2189
43	40.7712	40.3037	20.5175
44	42.2876	41.7527	20.1051
45	42.1777	41.9067	20.5757
46	40.6701	43.536	20.2902
47	41.7398	45.5051	24.6454
48	43.1851	42.8141	20.4784
49	41.3509	41.6771	22.1609
50	42.4328	40.539	20.1609
51	41.5824	43.2767	20.4134
52	40.3546	40.9339	26.5815
53	44.1628	46.8291	26.9776
54	40.9811	40.3725	20.4259
55	42.1538	40.4812	19.9891
56	39.5023	40.6363	20.3544
57	39.0777	40.1889	20.3947
58	37.921	39.2206	20.2054
59	41.3262	39.4255	20.5152
60	40.3523	38.2475	20.1613
61	39.4046	39.5966	19.9644
62	37.3576	39.0678	21.4956
63	36.8756	36.6924	20.3605
64	38.4406	36.2909	21.4831
65	36.8231	41.622	20.695
66	35.9872	35.2699	23.0834
67	36.0716	36.7053	20.3126
68	37.0869	35.4938	20.5791
69	34.4689	40.1645	21.3112
70	33.6399	34.2032	24.033
71	34.838	37.0558	23.7346
72	30.5919	29.7374	26.7609
73	33.2331	35.1296	23.0093
74	29.699	29.2873	24.3592
75	28.6084	28.5753	20.2883
76	29.4534	28.2107	20.1112
77	27.3216	26.8232	21.4511
78	28.1415	28.4012	25.6908
79	27.8272	30.2566	21.0444
80	25.8417	26.1306	22.6178
81	25.1115	24.6629	20.0351
82	25.311	26.4792	20.3449
83	25.6752	24.4459	20.0788
84	25.4106	25.7623	20.1503
85	23.3389	36.3533	27.6375
86	24.1425	29.1322	22.8158
87	20.8068	22.4148	20.2659
88	19.4649	19.4448	21.3067
89	21.5838	18.8339	19.9542
90	17.4586	17.92	22.1331
91	17.3696	17.5084	19.6497
92	14.3699	15.4465	24.1265
93	13.7956	15.6993	22.4881
94	14.2673	18.4283	21.9248
95	12.5422	13.2299	19.7614
96	11.3406	13.7781	20.9623
97	7.63467	9.61557	19.5337
98	6.10576	10.8198	19.3897
99	5.13869	8.40483	19.2806

Skewed Incremental Quick Sort

After looking at the branch misprediction measurements, I realized that a balanced partition operation could not be the basis for an optimal incremental sort, but a skewed one could. If we skew the first partitions towards the beginning of the range, we get both fewer comparisons and branch mispredictions. The skew comes with a cost, however, as there is more work to do in total if we consistently get very skewed partitions. To counteract this effect, only the first partitions should be skewed. As we iterate further into the range, the partitions should be made less skewed, so we get about the right amount of work.

In the implementation of skewed incremental quicksort, I skewed the partitions by sampling a large number of pivots, sorting them, and choosing the pivot corresponding to the desired amount of skew.

value_type get_pivot(int num_pivots, int pivot_idx) {
  assert(num_pivots <= 100);
  std::array<value_type, 100> pivots;
  for (int i = 0; i < num_pivots; ++i) {
	pivots[i] = *(current + (mt() % (stack.back() - current)));
  }

  std::sort(pivots.begin(), pivots.begin() + num_pivots);
  return pivots[pivot_idx];
}

value_type get_pivot() {
  auto range_size = stack.back() - current;

  if (range_size < 1000) {
	return get_pivot(3, 1);
  }

  double ratio = 
    (current - begin) / static_cast<double>(stack.back() - begin);

  if (range_size < 10'000) {
	int idx = ratio * 10;
	idx = std::min(idx, 5);
	idx = std::max(idx, 1);
	return get_pivot(10, idx);
  }
  
  int idx = ratio * 10;
  idx = std::min(idx, 50);
  idx = std::max(idx, 1);
  return get_pivot(100, idx);
}

Skewed Quick Sorting (ms)

k	Partial Sorter	Pre-Sorter	Simple Quick Sorter	Skewed Quick Sorter
8	7.8649	1083.18	53.6251	6.19743
16	7.86566	1083.18	53.6259	6.19819
32	7.86566	1083.18	53.6263	6.19819
64	7.86566	1083.18	53.627	6.19857
128	15.5341	1083.18	53.6305	6.20123
256	15.5348	1083.18	53.6377	6.20617
512	23.1146	1083.18	53.646	6.21415
1024	31.8237	1083.18	53.6704	6.23734
2048	41.0193	1083.18	53.7103	6.3947
4096	52.1542	1083.18	54.2732	6.48554
8192	68.4948	1083.19	54.7213	6.84245
16384	93.337	1083.19	55.2367	7.52472
32768	135.449	1083.22	56.2064	8.93866
65536	202.845	1083.24	57.8833	11.1668
131072	308.542	1083.28	65.7102	24.4606
262144	496.541	1083.36	74.766	34.4201
524288	832.594	1083.48	92.7102	53.3226
1048576	1459.07	1083.71	142.352	110.043
2097152	2387.17	1084	219.68	216.516
4194304	3569.34	1084.61	428.108	421.007
8388608	4102.78	1085.75	796.165	792.765
10000000	4105.16	1086.14	919.948	927.276

Summary

By combining the incremental quick sort algorithm with smart pivot selection, we can get an incremental sorting algorithm that is as fast as partial sort for small $k$ , and as fast as full sort when $k$ approaches $N$ .

This is my first real blog post, so comments and suggestions are welcome!

Update: Continued in this post.