Let's say you want to have a mutex for every item in a list with 10k elements. It feels a bit wasteful to use a std::mutex for each of those elements. On Linux std::mutex is 40 bytes, on Windows it's 80 bytes. But mutexes don't need to be that big. You can fit a mutex into two bits, and it's going to be fast. So we could fit a mutex into the low bits of a pointer. If your container already stores pointers, you might be able to store a mutex for each element with zero space overhead, not even any extra operations during initialization. You'd pay no cost until you use a mutex for the first time.
Let's start with a mutex that uses one byte. It's easy now that C++ has added futexes to the standard:
#include <atomic>
#include <cstdint>

struct one_byte_mutex
{
    void lock()
    {
        // Fast path: if the mutex was unlocked, a single exchange acquires it.
        if (state.exchange(locked, std::memory_order_acquire) == unlocked)
            return;
        // Slow path: keep setting the sleeper flag; if the mutex wasn't free, sleep and retry.
        while (state.exchange(sleeper, std::memory_order_acquire) != unlocked)
            state.wait(sleeper, std::memory_order_relaxed);
    }
    void unlock()
    {
        if (state.exchange(unlocked, std::memory_order_release) == sleeper)
            state.notify_one();
    }

private:
    std::atomic<uint8_t> state{ unlocked };

    static constexpr uint8_t unlocked = 0;
    static constexpr uint8_t locked = 0b01;
    static constexpr uint8_t sleeper = 0b10;
};
The futex operations here are the calls to “wait()” and “notify_one()”. These are like simpler versions of condition variables. The “state.wait(sleeper)” call will put the thread to sleep only if state==sleeper. And “notify_one()” will wake one thread that went to sleep on this variable. You can do this on any atomic variable now. There are no special requirements. How can you sleep and notify on anything? The kernel has a hash table of sleeping threads which is indexed by the address of the variable, so all you need is a unique address. Most operating systems support this, even Windows (who, when they copied this idea from Linux, had the chutzpah to patent futexes). If an operating system doesn't support this, the standard library implements it with a global hash table in userspace.
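To see the wait/notify API in isolation, here is a tiny example (my own sketch, not from the code above) where one thread sleeps on a plain atomic until another thread flips it:

#include <atomic>
#include <cstdio>
#include <thread>

int main()
{
    std::atomic<int> flag{ 0 };

    std::thread waiter([&]
    {
        // Sleeps only while flag is still 0; if it has already changed, this returns immediately.
        flag.wait(0);
        std::printf("flag was set\n");
    });

    flag.store(1);
    flag.notify_one(); // wake the thread sleeping on this atomic
    waiter.join();
}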
This mutex is optimized for the common case when it is unlocked and there is no contention. In that case we only have to do one exchange() when locking, and one exchange() when unlocking. It’s a bit tricky to convince yourself that this is correct. I verified the implementation in TLA+. The main trick to keeping it simple is that we only try to set the state to “locked” once. If that doesn’t work, we instead set it to “sleeper”. This is necessary because we don’t know how many threads are sleeping, and unlock() clears both bits. So if there was one sleeping thread, it has to set the “sleeper” flag just in case there are others.
One tricky interleaving is if a new thread comes in and does the initial “exchange()” call at a bad time, clearing the “sleeper” bit and causing an unlocking thread to not call notify_one(). In that case the newly entering thread also sets the sleeper flag, so it will call notify_one() when it unlocks.
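Since lock() and unlock() follow the usual BasicLockable interface, the mutex also composes with the standard RAII helpers. A minimal usage sketch (the thread count and loop count are arbitrary, and it assumes one_byte_mutex from above is in scope):

#include <mutex>
#include <thread>
#include <vector>

int main()
{
    one_byte_mutex mutex;
    long counter = 0;

    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i)
    {
        threads.emplace_back([&]
        {
            for (int j = 0; j < 100000; ++j)
            {
                std::lock_guard<one_byte_mutex> guard(mutex); // calls lock() here, unlock() at scope exit
                ++counter;
            }
        });
    }
    for (std::thread & t : threads)
        t.join();
    // counter is 400000 here, since every increment happened under the lock
}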
Two Bit Mutex
The one_byte_mutex is simple and storing a mutex in one byte is good. Storing it in two bits, say the low bits of a pointer, is trickier. Here is an implementation that does that:
template<typename T>
struct pointer_with_mutex
{
    T* get() const
    {
        uint64_t masked = pointer.load(std::memory_order_relaxed) & ~both_bits;
        return reinterpret_cast<T*>(masked);
    }
    void set(T* ptr)
    {
        static_assert(std::alignment_of<T>::value >= 4);
        uint64_t as_int = reinterpret_cast<uint64_t>(ptr);
        uint64_t old = pointer.load(std::memory_order_relaxed);
        // keep the two mutex bits, replace the pointer bits
        while (!pointer.compare_exchange_weak(old, (old & both_bits) | as_int, std::memory_order_relaxed))
        {
        }
    }
    void lock()
    {
        // fast path: neither bit is set, so try to set just the locked bit
        uint64_t old = pointer.load(std::memory_order_relaxed);
        if (!(old & both_bits) && pointer.compare_exchange_strong(old, old | locked, std::memory_order_acquire))
            return;
        for (;;)
        {
            if (old & sleeper)
            {
                pointer.wait(old, std::memory_order_relaxed);
                old = pointer.load(std::memory_order_relaxed);
            }
            else if (pointer.compare_exchange_weak(old, old | sleeper, std::memory_order_acquire))
            {
                // if neither bit was set, setting the sleeper bit also acquired the lock
                if (!(old & both_bits))
                    return;
                pointer.wait(old | sleeper, std::memory_order_relaxed);
                old = pointer.load(std::memory_order_relaxed);
            }
        }
    }
    void unlock()
    {
        uint64_t old = pointer.fetch_and(~both_bits, std::memory_order_release);
        if (old & sleeper)
            pointer.notify_one();
    }

private:
    std::atomic<uint64_t> pointer{ 0 };

    static constexpr uint64_t locked = 0b01;
    static constexpr uint64_t sleeper = 0b10;
    static constexpr uint64_t both_bits = locked | sleeper;
};
This one is significantly larger, but it’s a direct translation from the above, just replacing “exchange” with “compare_exchange” to leave the other bits unaffected. Plus some extra conditionals to skip the compare_exchange when we can. (my first attempt was to just replace exchange() with fetch_or(), which would lead to simpler code, but that just uses compare_exchange internally, in a way that was noticeably slower)
The thing to point out is that none of the code depends on which bits we use, or on what the remaining bits are used for. In this case I use them for a pointer, which has to be at least four-byte-aligned, but you could use any two bits for the mutex and store anything in the remaining bits.
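For example (a sketch of my own, with a made-up Node type), the same word can carry the element pointer and act as that element's lock:

#include <cstdio>
#include <string>

// Hypothetical element type. Anything with an alignment of at least 4 works,
// because the two low bits of the pointer are reserved for the mutex.
struct Node
{
    std::string name;
};

int main()
{
    Node node{ "example" };

    pointer_with_mutex<Node> slot; // assumes pointer_with_mutex from above is in scope
    slot.set(&node);

    slot.lock();
    node.name += " (updated)"; // mutate the element while holding its lock
    slot.unlock();

    std::printf("%s\n", slot.get()->name.c_str()); // get() masks off the mutex bits
}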
Performance
How does this perform? It’s a bit slow when the mutex is contended, but it’s actually faster than std::mutex for an unlocked mutex. Here are the timings on Windows (compiled with Visual Studio 2019):
                   | lock/unlock single-threaded | lock/unlock many threads
std::mutex         | 22ns                        | 50ns
one_byte_mutex     | 10ns                        | 67ns
pointer_with_mutex | 18ns                        | 94ns
Here are the timings on Linux (compiled with clang-12 against libc++-12):
                   | lock/unlock single-threaded | lock/unlock many threads
std::mutex         | 12ns                        | 71ns
one_byte_mutex     | 8ns                         | 228ns
pointer_with_mutex | 15ns                        | 255ns
This is running a benchmark where I call lock(); unlock(); in a loop. The first column is running the benchmark with a single thread, so we always hit the fast path where we don’t have to go to sleep. The second column is running with sixteen threads, so the mutex will often be locked. I then divide the length of the benchmark by the number of lock() unlock() calls to get the time for the average call. These are running on the same machine, booted into either Windows 10 or Ubuntu 20.04.
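A minimal sketch of that kind of benchmark (my own reconstruction, not the exact harness; the iteration counts and the clock are arbitrary choices):

#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// Average cost of one lock()/unlock() pair: total wall time divided by the
// total number of lock()/unlock() calls across all threads.
template<typename Mutex>
double nanoseconds_per_lock_unlock(int num_threads, long iterations_per_thread)
{
    Mutex mutex;
    auto start = std::chrono::steady_clock::now();

    std::vector<std::thread> threads;
    for (int i = 0; i < num_threads; ++i)
    {
        threads.emplace_back([&]
        {
            for (long j = 0; j < iterations_per_thread; ++j)
            {
                mutex.lock();
                mutex.unlock();
            }
        });
    }
    for (std::thread & t : threads)
        t.join();

    auto elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::steady_clock::now() - start);
    return double(elapsed.count()) / (double(num_threads) * double(iterations_per_thread));
}

int main()
{
    std::printf("single-threaded: %.1fns\n", nanoseconds_per_lock_unlock<one_byte_mutex>(1, 10000000));
    std::printf("sixteen threads: %.1fns\n", nanoseconds_per_lock_unlock<one_byte_mutex>(16, 1000000));
}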
So if you expect that you will usually find your mutex unlocked, these will actually be both faster and smaller for you than std::mutex. But if you have lots of contention, you may have to put a bit more work into these to make them fast. (my first guess would be that this particular benchmark would benefit from spinning for a bit before sleeping, because the critical section is small. My second guess is that these futexes aren’t implemented efficiently in the standard library yet. I didn’t look into either of these guesses)
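For what it's worth, here is one way the first guess could be tried on one_byte_mutex (an unverified sketch of my own; the spin only does relaxed loads before falling through to the existing slow path, so every write the verified version performs still happens, and the spin count of 128 is arbitrary):

void lock()
{
    // Fast path, unchanged.
    if (state.exchange(locked, std::memory_order_acquire) == unlocked)
        return;
    // Spin with plain loads for a little while, hoping a short critical section
    // ends before we pay for a futex sleep. Nothing is written here, so the
    // sleeper-bit reasoning above is unaffected.
    for (int spins = 0; spins < 128; ++spins)
    {
        if (state.load(std::memory_order_relaxed) == unlocked)
            break;
    }
    // Slow path, unchanged.
    while (state.exchange(sleeper, std::memory_order_acquire) != unlocked)
        state.wait(sleeper, std::memory_order_relaxed);
}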
Four Mutexes per Byte
So if we only need two bits for a mutex, can we store four mutexes in a single byte? Maybe, but you can't control which mutex you want to wake up. Futexes work by having a hash table keyed by the address of the futex, and you can't address bits, you can only address bytes. So all four mutexes would have the same address, which means they would all be the same futex.
You could probably still make it work by using notify_all() instead of notify_one(), and that might be fine if you expect the mutex to almost always sit idle. Or alternatively you could have a look at the hash table that your standard library uses to implement futexes in userspace, copy it, and change the key to not just be a pointer. But I’ll leave that as an exercise for the reader.
Code
The code for this is available here. It's released under the Boost license.
Also I’m trying out a donation button for the first time. If open source work like this is valuable for you, consider paying for it. The recommended donation is $20 (or your local cost for an item on a restaurant menu) for individuals, or $1000 for organizations. (or your local cost of hiring a contractor for a day) But any amount is appreciated:
Thanks! I have no idea how much this is worth to people. Feedback appreciated.