Update: Continued in this post.
I recently came across the need for an incremental sorting algorithm, and started to wonder how to do it optimally.
The incremental sorting problem is described here, and is an “online” version of partial sort. That is, if you have already sorted elements, you should be able to quickly sort elements, and so on.
Use Cases
Incremental sorts can be useful for a number of cases:
- You want sorted items, but you don’t know how many elements you’ll need.
- This could often happen when you are filtering the resulting sequence, and you don’t know how many items will be filtered out.
- You are streaming the sequence, so even though you want the whole sequence, you want the first elements as quickly as possible.
We’ll see how branch misprediction and other constant factors can make the naive asymptotically optimal version far slower than a practical implementation.
Measuring Performance
Measuring the performance of an incremental sort is a bit more involved than full sort, because for a given we care about how fast we reach each element. To keep it simple, I’ll keep fixed at 10 million, and measure the time taken to reach each element. The input range is a vector of random integers.
Pre-Sorting
Perhaps the simplest possible solution is pre-sorting the entire range, and then just using plain iterators to iterate over the sorted range.
Unsurprisingly, the performance of this algorithm is virtually independent of , as all the work is done at the start. This is obviously, not a good solution, but it’s a nice reference point, especially when approaches .
Pre-Sorting (ms)
k | Pre-Sorter |
---|---|
8 | 1083.18 |
16 | 1083.18 |
32 | 1083.18 |
64 | 1083.18 |
128 | 1083.18 |
256 | 1083.18 |
512 | 1083.18 |
1024 | 1083.18 |
2048 | 1083.18 |
4096 | 1083.18 |
8192 | 1083.19 |
16384 | 1083.19 |
32768 | 1083.22 |
65536 | 1083.24 |
131072 | 1083.28 |
262144 | 1083.36 |
524288 | 1083.48 |
1048576 | 1083.71 |
2097152 | 1084 |
4194304 | 1084.61 |
8388608 | 1085.75 |
10000000 | 1086.14 |
Partial Sorting
To avoid doing all the sorting work up front, the next logical step is trying to use something like std::partial_sort
to do the work. The idea is to to start by partially sorting a constant small part of the range and remember how much of the range is sorted. When you run out of sorted elements, do another partial sort.
Since a partial sort must look at all the remaining elements of the range, we can’t keep partially sorting small ranges, that would lead to quadratic complexity. Instead, we multiply the number of elements to sort each time by some constant, giving us a logarithmic number of partial sorts.
This method is a huge improvement for small , but it performs worse for 1 million.
Partial Sorting (ms)
N | Partial Sorter | Pre-Sorter |
---|---|---|
8 | 7.8649 | 1083.18 |
16 | 7.86566 | 1083.18 |
32 | 7.86566 | 1083.18 |
64 | 7.86566 | 1083.18 |
128 | 15.5341 | 1083.18 |
256 | 15.5348 | 1083.18 |
512 | 23.1146 | 1083.18 |
1024 | 31.8237 | 1083.18 |
2048 | 41.0193 | 1083.18 |
4096 | 52.1542 | 1083.18 |
8192 | 68.4948 | 1083.19 |
16384 | 93.337 | 1083.19 |
32768 | 135.449 | 1083.22 |
65536 | 202.845 | 1083.24 |
131072 | 308.542 | 1083.28 |
262144 | 496.541 | 1083.36 |
524288 | 832.594 | 1083.48 |
1048576 | 1459.07 | 1083.71 |
2097152 | 2387.17 | 1084 |
4194304 | 3569.34 | 1084.61 |
8388608 | 4102.78 | 1085.75 |
10000000 | 4105.16 | 1086.14 |
Since the std::partial_sort
algortithm costs , the calls to std::partial_sort
has a total worst case run time of . This means that in total we are doing more work than we should, time to fix this.
Incremental Quicksort
In a paper from 2006, Paredes and Navarro describes an algorithm which uses an incremental version of quicksort to achieve the optimal bound of for incremental sorting.
This method keeps a stack of partition points for the range. Each partition point is an iterator to a position in the range where everything before the position is smaller than everything after. The stack is sorted such that the top of the stack is the position closest to the start of the range.
This stack has the wonderful property that when you are looking for the next element in the sorted range, and your current position is not equal to the top of the stack, the next element must be between the current position and the top of the stack.
If the distance to the next partition point is large, you cut it in half with std::partition
, and push the new partition point on the stack. If the distance is small, we can sort the small range with std::sort
, and pop the element from the stack. A simple version is shown here:
Looking at the results for this algorithm, it’s clear that the total amount of work done when is now optimal,
because the time to sort the whole range is now as fast as a single call to std::sort
. This has the benefit that you don’t need to worry about switching over to std::sort
if you know that you’ll need a large part of the result, you can just use the incremental version anyway.
Simple Quick Sorting (ms)
k | Partial Sorter | Pre-Sorter | Simple Quick Sorter |
---|---|---|---|
8 | 7.8649 | 1083.18 | 53.6251 |
16 | 7.86566 | 1083.18 | 53.6259 |
32 | 7.86566 | 1083.18 | 53.6263 |
64 | 7.86566 | 1083.18 | 53.627 |
128 | 15.5341 | 1083.18 | 53.6305 |
256 | 15.5348 | 1083.18 | 53.6377 |
512 | 23.1146 | 1083.18 | 53.646 |
1024 | 31.8237 | 1083.18 | 53.6704 |
2048 | 41.0193 | 1083.18 | 53.7103 |
4096 | 52.1542 | 1083.18 | 54.2732 |
8192 | 68.4948 | 1083.19 | 54.7213 |
16384 | 93.337 | 1083.19 | 55.2367 |
32768 | 135.449 | 1083.22 | 56.2064 |
65536 | 202.845 | 1083.24 | 57.8833 |
131072 | 308.542 | 1083.28 | 65.7102 |
262144 | 496.541 | 1083.36 | 74.766 |
524288 | 832.594 | 1083.48 | 92.7102 |
1048576 | 1459.07 | 1083.71 | 142.352 |
2097152 | 2387.17 | 1084 | 219.68 |
4194304 | 3569.34 | 1084.61 | 428.108 |
8388608 | 4102.78 | 1085.75 | 796.165 |
10000000 | 4105.16 | 1086.14 | 919.948 |
The downside of this algorithm, though, is that it does spend considerably more time than std::partial_sort
coming up with the first few elements. If I need between 1 and 5000 elements, partial sorting is still the way to go by the looks of it. Why is that?
Number of comparisons
Even though std::partial_sort
is only guaranteed to have , a typical implementation will have closer to comparisons for small and random data. This is because the impementation typically makes a max heap of the first elements, and then walks through the rest of the range, inserting elements if they are smaller than the current max.
When is small, the elements in the heap will quickly become much smaller on average than most elements in the rest of the range, so insertions into the heap become less frequent as the range is processed. That means that for most of the elements in the range, the algorithm will just do a single comparison to verify that the element should not be in heap.
The incremental quick sort implementation, on the other hand, does a few more comparisons to find the first element. The first partition operation does comparisons to cut the range in two, the next partition takes comparisons to cut the first half down to a quarter, and so on. In total, it does about comparisons to find the first element, twice that of a good partial sort.
The difference in performance is much more than the difference in comparisons can explain, however, which brings us to the next point.
Branch Mispredictions
Few comparisons is not the only benefit of partial sort, it also has the benefit that most of the comparisons gives the same result. This gives a very fast inner loop, because it avoids branch misprediction.
For a partition operation that cuts the range in half, the number of branch mispredictions is at a peak, about every other comparison will be a misprediction. To verify that this was a big contributing factor, I tested std::partition
with various pivot elements, that cut the range at various percentiles, to see the effect. I also compared with two hand-written loops (one normal branching and one branchless) to see that effect added up.
We see that partition speed is heavily impacted by the branch mispredictions, it seems to account for up to difference in speed.
(ms)
% | std::partition | branching | branchless |
---|---|---|---|
0 | 11.7968 | 16.0224 | 36.7653 |
1 | 15.7772 | 16.8826 | 33.9671 |
2 | 14.6577 | 17.0579 | 34.0169 |
3 | 17.5502 | 20.1389 | 35.739 |
4 | 14.3418 | 17.9083 | 31.4708 |
5 | 14.5596 | 18.1637 | 26.7742 |
6 | 15.3439 | 17.9451 | 23.1884 |
7 | 17.2111 | 20.018 | 27.8515 |
8 | 18.2778 | 21.125 | 24.2756 |
9 | 18.0675 | 23.2834 | 26.2831 |
10 | 20.2005 | 21.3348 | 23.0222 |
11 | 19.2368 | 22.0339 | 22.3802 |
12 | 18.808 | 21.2295 | 24.2208 |
13 | 19.1338 | 21.9753 | 22.3669 |
14 | 20.8068 | 26.4336 | 23.3153 |
15 | 24.336 | 27.0468 | 21.5044 |
16 | 27.4645 | 29.8541 | 23.4868 |
17 | 28.8201 | 31.1758 | 21.2261 |
18 | 28.8148 | 30.1289 | 21.0718 |
19 | 31.28 | 30.2661 | 20.9075 |
20 | 29.6518 | 31.3142 | 21.3401 |
21 | 29.8271 | 30.7174 | 22.6782 |
22 | 34.0443 | 34.7833 | 23.1051 |
23 | 35.8298 | 34.5951 | 22.7919 |
24 | 32.7097 | 35.3106 | 20.9938 |
25 | 31.8623 | 34.2561 | 21.1079 |
26 | 32.1733 | 32.9834 | 20.6646 |
27 | 35.5523 | 37.6043 | 21.0353 |
28 | 33.6041 | 35.7314 | 22.0145 |
29 | 35.6302 | 37.6708 | 22.6771 |
30 | 35.772 | 36.4316 | 20.9463 |
31 | 36.0035 | 36.7326 | 20.8304 |
32 | 36.194 | 36.2179 | 22.7599 |
33 | 39.0462 | 38.6444 | 31.429 |
34 | 39.7669 | 42.4499 | 23.0538 |
35 | 43.1285 | 46.0829 | 26.6244 |
36 | 37.2333 | 39.8859 | 20.7182 |
37 | 37.09 | 38.6409 | 20.9155 |
38 | 37.7157 | 39.1157 | 20.6312 |
39 | 38.1182 | 40.1368 | 20.7224 |
40 | 39.8334 | 40.4454 | 20.4137 |
41 | 40.6355 | 46.655 | 23.8106 |
42 | 41.9736 | 43.0969 | 24.2189 |
43 | 40.7712 | 40.3037 | 20.5175 |
44 | 42.2876 | 41.7527 | 20.1051 |
45 | 42.1777 | 41.9067 | 20.5757 |
46 | 40.6701 | 43.536 | 20.2902 |
47 | 41.7398 | 45.5051 | 24.6454 |
48 | 43.1851 | 42.8141 | 20.4784 |
49 | 41.3509 | 41.6771 | 22.1609 |
50 | 42.4328 | 40.539 | 20.1609 |
51 | 41.5824 | 43.2767 | 20.4134 |
52 | 40.3546 | 40.9339 | 26.5815 |
53 | 44.1628 | 46.8291 | 26.9776 |
54 | 40.9811 | 40.3725 | 20.4259 |
55 | 42.1538 | 40.4812 | 19.9891 |
56 | 39.5023 | 40.6363 | 20.3544 |
57 | 39.0777 | 40.1889 | 20.3947 |
58 | 37.921 | 39.2206 | 20.2054 |
59 | 41.3262 | 39.4255 | 20.5152 |
60 | 40.3523 | 38.2475 | 20.1613 |
61 | 39.4046 | 39.5966 | 19.9644 |
62 | 37.3576 | 39.0678 | 21.4956 |
63 | 36.8756 | 36.6924 | 20.3605 |
64 | 38.4406 | 36.2909 | 21.4831 |
65 | 36.8231 | 41.622 | 20.695 |
66 | 35.9872 | 35.2699 | 23.0834 |
67 | 36.0716 | 36.7053 | 20.3126 |
68 | 37.0869 | 35.4938 | 20.5791 |
69 | 34.4689 | 40.1645 | 21.3112 |
70 | 33.6399 | 34.2032 | 24.033 |
71 | 34.838 | 37.0558 | 23.7346 |
72 | 30.5919 | 29.7374 | 26.7609 |
73 | 33.2331 | 35.1296 | 23.0093 |
74 | 29.699 | 29.2873 | 24.3592 |
75 | 28.6084 | 28.5753 | 20.2883 |
76 | 29.4534 | 28.2107 | 20.1112 |
77 | 27.3216 | 26.8232 | 21.4511 |
78 | 28.1415 | 28.4012 | 25.6908 |
79 | 27.8272 | 30.2566 | 21.0444 |
80 | 25.8417 | 26.1306 | 22.6178 |
81 | 25.1115 | 24.6629 | 20.0351 |
82 | 25.311 | 26.4792 | 20.3449 |
83 | 25.6752 | 24.4459 | 20.0788 |
84 | 25.4106 | 25.7623 | 20.1503 |
85 | 23.3389 | 36.3533 | 27.6375 |
86 | 24.1425 | 29.1322 | 22.8158 |
87 | 20.8068 | 22.4148 | 20.2659 |
88 | 19.4649 | 19.4448 | 21.3067 |
89 | 21.5838 | 18.8339 | 19.9542 |
90 | 17.4586 | 17.92 | 22.1331 |
91 | 17.3696 | 17.5084 | 19.6497 |
92 | 14.3699 | 15.4465 | 24.1265 |
93 | 13.7956 | 15.6993 | 22.4881 |
94 | 14.2673 | 18.4283 | 21.9248 |
95 | 12.5422 | 13.2299 | 19.7614 |
96 | 11.3406 | 13.7781 | 20.9623 |
97 | 7.63467 | 9.61557 | 19.5337 |
98 | 6.10576 | 10.8198 | 19.3897 |
99 | 5.13869 | 8.40483 | 19.2806 |
Skewed Incremental Quick Sort
After looking at the branch misprediction measurements, I realized that a balanced partition operation could not be the basis for an optimal incremental sort, but a skewed one could. If we skew the first partitions towards the beginning of the range, we get both fewer comparisons and branch mispredictions. The skew comes with a cost, however, as there is more work to do in total if we consistently get very skewed partitions. To counteract this effect, only the first partitions should be skewed. As we iterate further into the range, the partitions should be made less skewed, so we get about the right amount of work.
In the implementation of skewed incremental quicksort, I skewed the partitions by sampling a large number of pivots, sorting them, and choosing the pivot corresponding to the desired amount of skew.
Skewed Quick Sorting (ms)
k | Partial Sorter | Pre-Sorter | Simple Quick Sorter | Skewed Quick Sorter |
---|---|---|---|---|
8 | 7.8649 | 1083.18 | 53.6251 | 6.19743 |
16 | 7.86566 | 1083.18 | 53.6259 | 6.19819 |
32 | 7.86566 | 1083.18 | 53.6263 | 6.19819 |
64 | 7.86566 | 1083.18 | 53.627 | 6.19857 |
128 | 15.5341 | 1083.18 | 53.6305 | 6.20123 |
256 | 15.5348 | 1083.18 | 53.6377 | 6.20617 |
512 | 23.1146 | 1083.18 | 53.646 | 6.21415 |
1024 | 31.8237 | 1083.18 | 53.6704 | 6.23734 |
2048 | 41.0193 | 1083.18 | 53.7103 | 6.3947 |
4096 | 52.1542 | 1083.18 | 54.2732 | 6.48554 |
8192 | 68.4948 | 1083.19 | 54.7213 | 6.84245 |
16384 | 93.337 | 1083.19 | 55.2367 | 7.52472 |
32768 | 135.449 | 1083.22 | 56.2064 | 8.93866 |
65536 | 202.845 | 1083.24 | 57.8833 | 11.1668 |
131072 | 308.542 | 1083.28 | 65.7102 | 24.4606 |
262144 | 496.541 | 1083.36 | 74.766 | 34.4201 |
524288 | 832.594 | 1083.48 | 92.7102 | 53.3226 |
1048576 | 1459.07 | 1083.71 | 142.352 | 110.043 |
2097152 | 2387.17 | 1084 | 219.68 | 216.516 |
4194304 | 3569.34 | 1084.61 | 428.108 | 421.007 |
8388608 | 4102.78 | 1085.75 | 796.165 | 792.765 |
10000000 | 4105.16 | 1086.14 | 919.948 | 927.276 |
Summary
By combining the incremental quick sort algorithm with smart pivot selection, we can get an incremental sorting algorithm that is as fast as partial sort for small , and as fast as full sort when approaches .
This is my first real blog post, so comments and suggestions are welcome!
Update: Continued in this post.