Here is a rough approximation of float multiplication (source):
float rough_float_multiply(float a, float b) {
    constexpr uint32_t bias = 0x3f76d000;
    return bit_cast<float>(bit_cast<uint32_t>(a) + bit_cast<uint32_t>(b) - bias);
}
We’re casting the floats to ints, adding them, adjusting the exponent, and returning the result as a float. If you think about it for a second you will realize that since the float contains the exponent, this won’t be too wrong: you can multiply two numbers by adding their exponents. So just with the exponent addition you will be within a factor of 2 of the right result. But this actually does much better and gets within 7.5% of the right answer. Why?
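To make the numbers concrete, here is the calculation for 3 * 5 (the bit patterns are exact, only the result is approximate):

bits(3.0f) = 0x40400000
bits(5.0f) = 0x40a00000
0x40400000 + 0x40a00000 - 0x3f76d000 = 0x41693000 ≈ 14.574

The exact answer is 15, so this is off by about 2.8%.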

It’s not the magic number. Even if you only adjust the exponent (subtract the bias of 127 to get it back into range after the addition) you get within 12.5% of the right result. There is also a mantissa offset in that constant which helps a little, but 12.5% is surprisingly good as a default.
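Just for reference, here is a minimal sketch of that exponent-only variant (the function name is mine; 0x3f800000 is simply 127 shifted into the exponent field):

float exponent_only_multiply(float a, float b) {
    // Undo the doubled exponent bias (127 << 23), with no mantissa correction at all.
    constexpr uint32_t exp_bias = 0x3f80'0000;
    return std::bit_cast<float>(std::bit_cast<uint32_t>(a) + std::bit_cast<uint32_t>(b) - exp_bias);
}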
I should also say that the above fails catastrophically when you overflow or underflow the exponent. I think the source paper doesn’t handle that, even though underflowing is really easy by doing e.g. 0.5 * 0. It’s probably fine to ignore overflow, so here is a version that just handles underflow:
#include <bit>
#include <cstdint>

float custom_multiply(float a, float b) {
    constexpr uint32_t sign_bit = 0x8000'0000;
    // the exponent bias of 127, shifted into the exponent field
    constexpr uint32_t exp_offset = 0b0'01111111'0000000'00000000'00000000;
    // small extra offset on the mantissa to reduce the average error
    constexpr uint32_t mantissa_bias = 0b0'00000000'0001001'00110000'00000000;
    constexpr uint32_t offset = exp_offset - mantissa_bias;
    uint32_t bits_a = std::bit_cast<uint32_t>(a);
    uint32_t bits_b = std::bit_cast<uint32_t>(b);
    uint32_t c = (bits_a & ~sign_bit) + (bits_b & ~sign_bit);
    if (c <= offset) // the subtraction would underflow, so flush to zero
        c = 0;
    else
        c -= offset;
    c |= ((bits_a ^ bits_b) & sign_bit); // the sign of the result is the xor of the signs
    return std::bit_cast<float>(c);
}
Clang compiles this to a branchless version that doesn’t perform much worse than regular float multiplication. Is this ever worth using? The paper talks about using this to save power, but that’s probably not worth it for a few reasons:
- Most of the power consumption comes from moving bits around; the actual float multiplication is a small power drain compared to loading the floats and storing the result
- You wouldn’t be able to use tensor cores
- I don’t think you can actually be faster than float multiplication because there are so many edge cases to handle
It feels close to being worth it though, so I wouldn’t be surprised if someone found a use case.
But there is still the question of why this works so well. The mantissa is not stored in log-space; it’s just stored in plain old linear space, where addition does not do multiplication. But let’s think about how to get the exponent from the mantissa.
In general how do you get the remaining exponent-fraction from the remaining bits? This is easier to think about for integers where you can get the log2 by determining the highest set bit:
log2(20) = log2(0b10100) ~= highest_set_bit(0b10100) = 4
The actual correct value is log2(20) = 4.322. The question we need to answer is: how do you get the remaining exponent-fraction, 0.322, from the remaining bits, 0b0100?
To make this work for any number of bits we should normalize the remaining bits into the range 0 to 1, which in this case means doing the division 0b0100/float(1 << 4) = 0.25. (In general you divide by the value of the highest set bit, which you already had to find for the previous step.)
Once the number is in the range from 0 to 1, you can get the remaining exponent-fraction with log2(1+x). In this case that’s log2(1+0.25) = 0.322.
If you plot y=log2(1+x) for the range from 0 to 1 you will find that it doesn’t deviate too far from y=x. So if you just want an approximate solution you might as well skip this step.
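As a quick sketch (assuming C++20’s std::countl_zero and a nonzero input; the function name is mine), the whole integer approximation with the log2(1+x) step skipped looks like this:

#include <bit>
#include <cstdint>

// Approximate log2 of an integer: integer part from the highest set bit,
// fractional part from the remaining bits, used directly instead of log2(1 + x).
float approximate_log2(uint32_t x) {
    int highest = 31 - std::countl_zero(x);                             // position of the highest set bit
    uint32_t remaining = x - (uint32_t(1) << highest);                  // the bits below it
    float fraction = float(remaining) / float(uint32_t(1) << highest);  // normalized to [0, 1)
    return float(highest) + fraction;                                   // exact would be highest + log2(1 + fraction)
}

For 20 this gives 4 + 0.25 = 4.25 instead of the exact 4.322.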

And in floats the mantissa is already interpreted as a fraction, so you also don’t have to divide. So the whole operation cancels out and you can just add.
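Put differently, the whole bit pattern of a positive float, read as an integer, is already a fixed-point approximation of log2(x) + 127, scaled by 2^23. For example:

bits(3.0f) / 2^23 = 0x40400000 / 2^23 = 128.5
log2(3) + 127 = 1.585 + 127 = 128.585

Adding two of these approximates log2(a) + log2(b) + 254, and subtracting the bias brings the result back into a range where the same interpretation turns it into the approximate product.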
You still need to handle:
1. The sign bit
2. Overflowing mantissa
3. Overflowing exponent
4. Underflowing exponent
Numbers 1 and 2 also work out naturally with addition because of how floats are represented:
- Since the sign bit is the highest bit, the carry out of it is discarded, so adding the sign bits acts like xor, which is what you want
- When the mantissa overflows you end up increasing the exponent, which is what you want (e.g. 1.5 * 1.5 = 2.25, which has a higher base-2 exponent; see the worked example below)
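Here is that mantissa carry at the bit level for 1.5 * 1.5, using only the exponent bias for simplicity:

bits(1.5f) = 0x3fc00000, i.e. stored exponent 127 (meaning 2^0) and mantissa bits 0x400000
0x400000 + 0x400000 = 0x800000, which is exactly one unit of the exponent field
0x3fc00000 + 0x3fc00000 - 0x3f800000 = 0x40000000 = 2.0

The carry out of the mantissa bumps the exponent from 0 to 1, so the approximation gives 2.0 instead of the exact 2.25.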
Number 3 can be ignored for most floats you care about.
Number 4 is the one that required me to write that more complicated version of the code. It’s really easy to underflow the exponent, and the subtraction will then wrap around and give you huge values or NaNs instead. In neural networks lots of activation functions like to return 0 or close to 0, and when you multiply with that, the simple version underflows and gives very wrong results. So you need to handle underflow. I have not found an elegant way of doing it because you only have a few cycles to work with, otherwise you might as well use float multiplication.
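To see how bad the failure is, take the 0.5 * 0 example from earlier. The one-liner computes

0x3f000000 + 0x00000000 - 0x3f76d000 = 0xff893000 (wrapped around)

which has an all-ones exponent and a non-zero mantissa, i.e. a NaN, while custom_multiply above flushes it to 0.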
The last open question is that mantissa adjustment: you can see in the graph above that the approximation y=x is never too big, so by default you always bias towards 0. But you can add a little bias to the mantissa to shift the whole line up. I tried a few analytic ways of arriving at a good constant, but they all gave terrible results when I actually tried them on a bunch of floats. So I just tried many different constants and stuck with the one that gave the least error on 10,000 randomly generated test floats.
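For what it’s worth, here is a sketch of what that kind of search can look like. The function names, the test distribution, the step size, and the error metric are my assumptions, not necessarily the setup that produced the constant above:

#include <bit>
#include <cmath>
#include <cstdint>
#include <limits>
#include <random>
#include <utility>
#include <vector>

// Same one-liner as before, but with the mantissa bias as a parameter.
float multiply_with_bias(float a, float b, uint32_t mantissa_bias) {
    uint32_t offset = 0x3f80'0000 - mantissa_bias;
    return std::bit_cast<float>(std::bit_cast<uint32_t>(a) + std::bit_cast<uint32_t>(b) - offset);
}

uint32_t find_best_mantissa_bias() {
    // 10,000 random test pairs, kept away from zero so underflow doesn't dominate.
    std::mt19937 rng(0);
    std::uniform_real_distribution<float> dist(0.5f, 2.0f);
    std::vector<std::pair<float, float>> tests;
    for (int i = 0; i < 10000; ++i)
        tests.emplace_back(dist(rng), dist(rng));

    uint32_t best_bias = 0;
    double best_error = std::numeric_limits<double>::infinity();
    // Try candidate biases and keep the one with the smallest total relative error.
    for (uint32_t bias = 0; bias < 0x200000; bias += 0x1000) {
        double total_error = 0.0;
        for (auto [a, b] : tests)
            total_error += std::fabs(multiply_with_bias(a, b, bias) / (a * b) - 1.0f);
        if (total_error < best_error) {
            best_error = total_error;
            best_bias = bias;
        }
    }
    return best_bias;
}

The best constant depends on which distribution of inputs you test on, so a search like this is only as good as its test set.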
So even though this is probably not useful and I haven’t found a really elegant way of doing it, it’s still neat how the whole thing almost works out as a one-liner because so many things cancel out once you approximate y=log2(1+x) as y=x.