Bugs caused by cachelines were covered in 《不可忽视的cacheline问题》. This article looks at an example that is not a runtime error but a performance loss: cache false sharing. This kind of problem is more common than cacheline-induced bugs, and it is just as necessary to fix.
In short, under the MESI and MOESI coherence protocols, when several cores repeatedly write to the same cacheline, each write invalidates the other cores' copies of that line, and the updates are snooped across the bus to every CPU. From the CPU's point of view this is inefficient: it pointlessly eats memory bandwidth and wastes system resources.
The test code comes from the reference, as follows:
#include <iostream>
#include <thread>
#include <new>
#include <atomic>
#include <chrono>
#include <latch>
#include <vector>

using namespace std;
using namespace chrono;

#if defined(__cpp_lib_hardware_interference_size)
// default cacheline size from runtime
constexpr size_t CL_SIZE = hardware_constructive_interference_size;
#else
// most common cacheline size otherwise
constexpr size_t CL_SIZE = 64;
#endif

int main()
{
    vector<jthread> threads;
    int hc = jthread::hardware_concurrency();
    hc = hc <= CL_SIZE ? hc : CL_SIZE;
    for (int nThreads = 1; nThreads <= hc; ++nThreads)
    {
        // synchronize beginning of threads coarse on kernel level
        latch coarseSync(nThreads);
        // fine synch via atomic in userspace
        atomic_uint fineSync(nThreads);
        // as much chars as would fit into a cacheline
        struct alignas(CL_SIZE) { char shareds[CL_SIZE]; } cacheLine;
        // sum of all threads execution times
        atomic_int64_t nsSum(0);
        for (int t = 0; t != nThreads; ++t)
            threads.emplace_back(
                [&](char volatile &c)
                {
                    // synch beginning of thread execution on kernel-level
                    coarseSync.arrive_and_wait();
                    // fine-synch on user-level
                    if (fineSync.fetch_sub(1, memory_order::relaxed) != 1)
                        while (fineSync.load(memory_order::relaxed));
                    auto start = high_resolution_clock::now();
                    for (size_t r = 10'000'000; r--;)
                        c = c + 1;
                    nsSum += duration_cast<nanoseconds>(high_resolution_clock::now() - start).count();
                },
                ref(cacheLine.shareds[t]));
        threads.resize(0); // join all threads
        cout << nThreads << ": " << (int)(nsSum / (1.0e7 * nThreads) + 0.5) << endl;
    }
}
The idea is to construct an array exactly one cacheline in size:
struct alignas(CL_SIZE) { char shareds[CL_SIZE]; } cacheLine;
and then have each thread repeatedly increment its own byte within that line:
threads.emplace_back(
    [&](char volatile &c)
    {
        for (size_t r = 10'000'000; r--;)
            c = c + 1;
    },
    ref(cacheLine.shareds[t]));
while steadily increasing the thread count to observe the negative effect of false sharing:
for (int nThreads = 1; nThreads <= hc; ++nThreads)
Compile:
g++ -std=c++20 -pthread falsesharing.cpp -o falsesharing
Run:
# ./falsesharing
1: 3
2: 7
3: 12
4: 13
5: 14
6: 19
7: 21
8: 29
9: 30
10: 34
11: 37
12: 39
13: 41
14: 47
15: 49
16: 54
As the thread count grows, the performance loss from false sharing reaches 54 / 3 = 18x: the average cost per increment rises from 3 ns with one thread to 54 ns with sixteen.
The simplest, most direct fix — and the one the official documentation also suggests — is for each thread to use its own separate data, which avoids multiple threads accessing the same cacheline. Following that idea, I simply let every thread access a different cacheline. The result:
# ./fixfalsesharing
1: 3
2: 3
3: 3
4: 3
5: 3
6: 2
7: 4
8: 3
9: 3
10: 3
11: 3
12: 3
13: 3
14: 3
15: 3
16: 3
When writing multithreaded or concurrent programs, false sharing is a problem you are far more likely to run into than the cacheline bug described in 《不可忽视的cacheline问题》, and one you must avoid. As long as you keep it in mind, the fix is simple: just make sure threads do not contend for the same cacheline.