Bugs caused by cachelines were covered in 《不可忽视的cacheline问题》. This article looks at an example that is not a runtime error but a performance loss: cache false sharing. This kind of problem is more common than cacheline-induced bugs, and it is just as necessary to fix.
In short, under the MESI and MOESI coherence protocols, when several cores repeatedly write to the same cacheline, each write invalidates the other cores' copies of that line, and the updates are snooped across the bus to every CPU. From the CPU's point of view this is inefficient: it pointlessly eats memory bandwidth and wastes system resources.
The test code comes from the reference, as follows:
#include <iostream>
#include <thread>
#include <new>
#include <atomic>
#include <chrono>
#include <latch>
#include <vector>

using namespace std;
using namespace chrono;

#if defined(__cpp_lib_hardware_interference_size)
// default cacheline size from runtime
constexpr size_t CL_SIZE = hardware_constructive_interference_size;
#else
// most common cacheline size otherwise
constexpr size_t CL_SIZE = 64;
#endif

int main()
{
    vector<jthread> threads;
    int hc = jthread::hardware_concurrency();
    hc = hc <= CL_SIZE ? hc : CL_SIZE;
    for (int nThreads = 1; nThreads <= hc; ++nThreads)
    {
        // synchronize beginning of threads coarse on kernel level
        latch coarseSync(nThreads);
        // fine synch via atomic in userspace
        atomic_uint fineSync(nThreads);
        // as much chars as would fit into a cacheline
        struct alignas(CL_SIZE) { char shareds[CL_SIZE]; } cacheLine;
        // sum of all threads execution times
        atomic_int64_t nsSum(0);
        for (int t = 0; t != nThreads; ++t)
            threads.emplace_back(
                [&](char volatile &c)
                {
                    // synch beginning of thread execution on kernel-level
                    coarseSync.arrive_and_wait();
                    // fine-synch on user-level
                    if (fineSync.fetch_sub(1, memory_order::relaxed) != 1)
                        while (fineSync.load(memory_order::relaxed));
                    auto start = high_resolution_clock::now();
                    for (size_t r = 10'000'000; r--;)
                        c = c + 1;
                    nsSum += duration_cast<nanoseconds>(high_resolution_clock::now() - start).count();
                },
                ref(cacheLine.shareds[t]));
        threads.resize(0); // join all threads
        cout << nThreads << ": " << (int)(nsSum / (1.0e7 * nThreads) + 0.5) << endl;
    }
}
The idea is to construct an array exactly one cacheline in size:
struct alignas(CL_SIZE) { char shareds[CL_SIZE]; } cacheLine;
and then have each thread repeatedly increment its own byte within that line:
threads.emplace_back(
    [&](char volatile &c)
    {
        for (size_t r = 10'000'000; r--;)
            c = c + 1;
    },
    ref(cacheLine.shareds[t]));
while steadily increasing the thread count to observe the negative effect of false sharing:
for (int nThreads = 1; nThreads <= hc; ++nThreads)
Compile:
g++ -std=c++20 -pthread falsesharing.cpp -o falsesharing
Run:
# ./falsesharing
1: 3
2: 7
3: 12
4: 13
5: 14
6: 19
7: 21
8: 29
9: 30
10: 34
11: 37
12: 39
13: 41
14: 47
15: 49
16: 54
As the thread count grows, the performance loss from false sharing reaches 54 / 3 = 18x: the average cost per increment rises from 3 ns with one thread to 54 ns with sixteen.
The simplest, most direct fix — and the one the official documentation also suggests — is for each thread to use its own separate data, which avoids multiple threads accessing the same cacheline. Following that idea, I simply let every thread access a different cacheline. The result:
# ./fixfalsesharing
1: 3
2: 3
3: 3
4: 3
5: 3
6: 2
7: 4
8: 3
9: 3
10: 3
11: 3
12: 3
13: 3
14: 3
15: 3
16: 3
When writing multithreaded or concurrent programs, false sharing is a problem you are far more likely to run into than the cacheline bug described in 《不可忽视的cacheline问题》, and one you must avoid. As long as you keep it in mind, the fix is simple: just make sure threads do not contend for the same cacheline.