A cache miss is not a cache miss

When writing high performance code, we are careful to avoid cache misses when possible. When discussing cache misses, however, we are usually content with counting the number of cache misses. In this post I will explain why this is not sufficient, as data dependency also make a huge difference.

Consider for example a vector<unique_ptr<int>>. If the vector is large, iterating over the integers will typically incur a cache miss per element, because they are stored at arbitrary locations in memory. If we have a list<int>, iteration will also incur a cache miss for each element, because the nodes of the list are stored at arbitrary locations.

This could lead us to believe that cache misses have the same effect on performance for both of the above cases. That assumption is what I wanted to put to the test in this post.

Setup

To avoid the peculiarities of the memory system, or making my own allocator, I have simplified the linked list structure used for the experiment. The data type I use for iteration is a simple vector<pair<int, int>>. Each element consists of an index and a payload. The indices point to the next element in a linked list of all the elements in the vector. The data is generated by this code:

vector<pair<int, int>> create_random_data(int N) {
  mt19937 mt;

  vector<int> sequence(N);
  std::iota(sequence.begin(), sequence.end(), 0);
  shuffle(sequence.begin(), sequence.end(), mt);

  vector<pair<int, int>> ret;
  generate_n(
	back_inserter(ret), 
	N, 
	[&]{ return pair<int, int>{ 0, mt() }; });

  for (int i = 0; i < N; ++i) {
    auto next_i = (i + 1) % N;
    ret[sequence[i]].first = sequence[next_i];
  }

  return ret;
}

Iterating over the linked list is done by the following code (the write to volatile is done to guard aginst optimizing away the loop):

volatile int volatile_sum;

void iterate_linked_list(const vector<pair<int, int>>& data) {
  int sum = 0;
  int idx = 0;
  do {
    sum += data[idx].second;
    idx = data[idx].first;
  } while (idx != 0);

  volatile_sum = sum;
}

Iterating over a vector of pointers is approximated by simply reading the payload of the next element in the chain, but otherwise going through the vector in normal order:

void iterate_pointers(const vector<pair<int, int>>& data) {
  int sum = 0;
  for (const auto& d : data) {
    sum += data[d.first].second;
  }

  volatile_sum = sum;
}

And finally, we also compare with a simple iteration over the vector, where we expect almost zero cache misses:

void iterate_simple(const vector<pair<int, int>>& data) {
  int sum = 0;
  for (const auto& d : data) {
    sum += d.second;
  }

  volatile_sum = sum;
}

For each $N$ , the operations are run 100 times and the median time is reported. The code is compiled with MSVC 2015 Update 2, with optimization level /O2. Running on an Intel i7-4600U @2.1 GHz.

Results

We can see clear jumps in the run times when the data set grows out of the various cache levels. Furthermore, we see that the effect of cache misses is much worse for the linked list. I have also verified with gcc and perf that the cache miss count is roughly equal for the pointer iteration and linked list iteration.

Iteration (ns/elem)

N	simple	pointers	linked list
1000	0.76	0.76	3.041
2000	0.38	0.3805	2.0905
4000	0.66525	0.76025	2.0905
8000	0.427625	0.617625	4.70375
16000	0.570125	1.069	6.48531
32000	0.570125	0.867094	10.0843
64000	0.587969	1.78763	13.3745
128000	0.602805	2.1024	17.4635
256000	0.439484	1.68815	17.7352
512000	0.521887	3.58936	32.6628
1024000	0.579792	10.2269	74.1561
2048000	0.639181	15.2082	84.6611
4096000	0.655699	18.2579	89.6222
8192000	0.668737	21.606	94.0833

Why do we see this large difference? The answer is simply data dependency. When you are iterating over a range of pointers, the CPU can start reading in several values a the same time, and then add them to the sum as they come in from the memory system. With the linked list structure, we can’t continue to the next node until the current node is fully read from memory, because we don’t know where to go next. This means we can’t have many reads in flight at the same time. Put simply, the pointer iteration is more dependent on memory throughput, the linked list iteration is more dependent on memory latency.

Update: Thanks to @matt_dz for running some performance counter measurements on the above experiments, results can be seen in his gist. Among other things, the results show that cache miss counts are fairly similar for vector of pointers iteration and linked list, and that the benefit of vector of pointers is not due to hardware prefetching.

Conclusion

I hope I have made the argument that all cache misses are not equal, and that considering data dependency between the reads can be just as important as counting cache misses.

Edit: As some people have pointed out, the statement “all cache misses are not equal” is a bit misleading. The cache misses themselves are the same and have the same latency for both workloads. The difference is how many cache misses that can be in flight concurrently. So while the the cache misses are equal in that sense, they do not have an equal effect on latency of the whole iteration, which is what I tend to care about.

Related Material