Pl chevron_right

Jussi Pakkanen: Simple sort implementations vs production quality ones

news.movim.eu / PlanetGnome • 19 March 2026 • 2 minutes

One of the most optimized algorithms in any standard library is sorting. It is used everywhere so it must be fast. Thousands upon thousands of developer hours have been sunk into inventing new algorithms and making sort implementations faster. Pystd has a different design philosophy where fast compilation times and readability of the implementation have higher priority than absolute performance. Perf still very much matters, it has to be fast , but not at the cost of 10x compilation time.

This leads to the natural question of how much slower such an implementation would be compared to a production quality one. Could it even be faster? (Spoilers: no) The only way to find out is to run performance benchmarks on actual code.

To keep things simple there is only one test set, sorting 10'000'000 consecutive 64 bit integers that have been shuffled to a random order which is the same for all algorithms. This is not an exhaustive test by any means but you have to start somewhere. All tests used GCC 15.2 using -O2 optimization. Pystd code was not thoroughly hand optimized, I only fixed (some of the) obvious hotspots.

Stable sort

Pystd uses mergesort for stable sorting. The way the C++ standard specifies stable sort means that most implementations probably use it as well. I did not dive in the code to find out. Pystd's merge sort implementation consists of ~220 lines of code. It can be read on this page .

Stdlibc++ can do the sort in 0.9 seconds whereas Pystd takes .94 seconds. Getting to within 5% with such a simple implementation is actually quite astonishing. Even when considering all the usual caveats where it might completely fall over with a different input data distribution and all that.

Regular sort

Both stdlibc++ and Pystd use introsort . Pystd's implementation has ~150 lines of code but it also uses heapsort, which has a further 100 lines of code). Code for introsort is here, and heapsort is here .

Stdlibc++ gets the sort done in 0.76 seconds whereas Pystd takes 0.82 seconds. This makes it approximately 8% slower. It's not great, but getting within 10% with a few evening's work is still a pretty good result. Especially since, and I'm speculating here, std::sort has seen a lot more optimization work than std::stable_sort because it is used more.

For heavy duty number crunching this would be way too slow. But for moderate data set sizes the performance difference might be insignificant for many use cases.

Note that all of these are faster (note: did not measure) than libc's qsort because it requires an indirect function call on every comparison i.e. the comparison method can not be inlined.

Where does the time go?

Valgrind will tell you that quite easily.

This picture shows quite clearly why big O notation can be misleading. Both quicksort (the inner loop of introsort) and heapsort have "the same" average time complexity but every call to heapsort takes approximately 4.5 times as long.