Floating Point: Designing a Number System from 32 Bits
Derive IEEE 754 float32 from a blank 32-bit register, then connect representation to arithmetic: alignment, rounding, ULPs, epsilon, and when real-number identities break.
If you've written code for more than a week, you've probably hit this: 0.1 + 0.2 returns 0.30000000000000004. In JavaScript, Python, C, Java... every language. You Google it, find a StackOverflow answer that says "floating point is inherently imprecise," nod, and move on.
But that answer never sat right. Computers are deterministic machines executing exact instructions on exact bit patterns. Nothing about them is "inherently imprecise." Something specific is happening at the bit level, and "floating point is weird", while not incorrect, isn't exactly a meaty explanation.
Here's the thing, though. Not every number has this problem. 0.5 stores perfectly in a float. So do 0.25, 1024, and 6.75. But 0.1 doesn't. Neither does 6.1. Some decimals survive the trip into binary and back without losing a bit. Others pick up noise the moment you store them. What decides which is which?
I ran into this question again while looking into automatic mixed precision (AMP) for training. AMP downcasts your 32-bit weights to 16 bits or less, and the PyTorch docs and NVIDIA papers show loss curves that mostly look fine. But what exactly changes when you cut a number's bit budget in half? Are we losing precision, or range, or both? What decides what we lose?
I couldn't answer any of that. I knew floats were "approximate," the same way I knew compilers "optimize": true, but useless for actually picking between FP16 and BF16. So I went back to the beginning. Not "how does IEEE 754 work" but why does it work that way. What problem does each piece of the format solve? What breaks without it? That detour became this post (and don't worry, we'll probably write about the AMP investigations in the future).
We're going to forget IEEE 754 exists and derive it from scratch. Start with a blank 32-bit register. Try the simplest encoding that could work. Refine it each time we hit a concrete failure. By the end, terms like sign bit, exponent, and mantissa should feel less like arbitrary design choices and more like the only pieces that could survive the constraints.
The first half is representation: how 32 bits encode a real number. The second half is arithmetic, where identities you've relied on your whole life, like $(a + b) + c = a + (b + c)$, stop holding.
I wrote this assuming no prerequisites beyond basic arithmetic. Let's build everything, starting from a single bit!
The design problem
Storing integers in bits is straightforward. The mapping is exact: 13 is 1101 in binary, and you can recover it perfectly. With 32 bits you get $2^{32}$ = 4,294,967,296 distinct patterns, so any integer up to about four billion has a perfect representation, with no rounding and no approximation.
Fractions are a different story. How many values exist between 1.0 and 2.0? Between 1.0 and 1.1? Between 1.00 and 1.01? Infinitely many, every time. You can zoom in forever and never run out. There is no "next number" after 1.0 the way 6 is the next integer after 5.
So. You have 4.3 billion tick marks and the entire real line. Where do you put them?
Wherever you place them, everything in between snaps to the nearest tick. That snap is called rounding, and it's where floating-point weirdness lives. It's also why 0.1 + 0.2 gives you 0.30000000000000004.
This is not unique to computers. You deal with it in decimal all the time. $1/3 = 0.333\ldots$ repeats forever. Cut it to 6 digits: $0.333333$. Add three copies: $0.999999$, not $1$. You ran out of digits.
Binary has the same problem, just for different fractions.
Think about which fractions come out clean in decimal. $1/2 = 0.5$. $1/4 = 0.25$. $1/5 = 0.2$. All terminate. But $1/3$ and $1/7$ repeat forever. The pattern comes down to prime factors of the base. $10 = 2 \times 5$, so fractions whose denominators are built from 2s and 5s terminate. Anything with another prime factor (3, 7, 11, ...) repeats.
Binary is base 2. One prime factor. So the only fractions that terminate are those with power-of-two denominators: halves, quarters, eighths, sixteenths. That's it. Every other fraction repeats forever.
Look at 1/10. In decimal: 0.1, clean and simple. In binary: $0.000110011001100\ldots_2$, repeating forever. You type 0.1 without thinking every time you deal with money or percentages, and binary cannot represent it exactly. Why? Because 10 has a factor of 5, and 5 is not a factor of 2. That is all it takes.
Both 0.1 and 0.2 are stored as nearby binary approximations, each slightly off before any arithmetic happens. The error isn't introduced by the addition. It was baked in at storage time.
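You can see this pre-baked error directly. Python's floats are 64-bit rather than the 32-bit format this post builds, but the phenomenon is identical: `Decimal(float)` reveals the exact binary value that actually gets stored.

```python
from decimal import Decimal

# Decimal(x) shows the exact value the binary float really holds.
print(Decimal(0.1))  # 0.1000000000000000055511151231257827021181583404541015625
print(Decimal(0.5))  # 0.5 -- a power-of-two denominator stores exactly

# The error exists before any arithmetic happens:
print(0.1 + 0.2 == 0.3)  # False
```

Note that 0.5 survives untouched while 0.1 picks up noise at storage time, exactly as the prime-factor argument predicts.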
So rounding is a given. The real question is how we space the tick marks as numbers get larger. Consider the range of things people compute with:
- A bridge architect working at millimeter precision across 0.5 to 500 meters.
- Machined parts like hard disk heads, where a micrometer (0.001 mm) matters, but the whole part fits inside a few hundred millimeters.
- Orbital dynamics, where one equation mixes the gravitational constant $G \approx 6.67 \times 10^{-11}$ with Earth's mass of roughly $5.97 \times 10^{24}$ kg.
A format with fixed, evenly spaced tick marks can handle one of these well, but not all three. What you want is to pack marks densely near small values and let them spread out as magnitude grows. Adaptive spacing.
Same program, different answers
Adaptive spacing is the right idea, but it forces real design choices: how many bits for the exponent versus the digits, how negatives work, what happens on overflow, and how rounding breaks ties.
In the 1960s and 70s, every manufacturer answered these questions differently. DEC's VAX had one format. IBM's System/360 used base 16 instead of base 2, which quietly costs precision in ways we'll see later. Cray's supercomputers had yet another scheme.
The consequence: the same Fortran program, compiled on two different machines, could give different numerical answers. The hardware disagreed about how to round, how much precision to carry, and what to do on overflow. Scientific software developed on a VAX might produce subtly (or completely) different results on a Cray.
In the late 1970s, Intel was designing the 8087 floating-point coprocessor, and mathematician William Kahan saw an opportunity to fix this for good. Not "similar answers." Identical ones, bit for bit, across every manufacturer. Think about the ambition of that.
The committee that formed spent nearly eight years debating the tradeoffs we're about to encounter: exponent width, whether to hide the leading 1, what happens near zero, and how ties should round. In 1985, they published IEEE 754. By the 90s it was the default on mainstream CPUs, and most languages expose it today as float/double (or float32/float64).
That's the standard we're about to re-derive. Each piece arrives as the obvious fix to a problem we just encountered.
Part 1: Representation
Bits and Unsigned Integers
Let's start from the absolute basics, a single bit. Most of you know this but let's make it explicit because the rest of the post builds on it.
A bit is either 0 or 1. Eight bits make a byte, for example 00101010.

To read this as a number, each position gets a power of two. The rightmost bit is $2^0 = 1$, the next is $2^1 = 2$, then $2^2 = 4$, and so on up to $2^7 = 128$ on the left.

Add up the columns where there's a 1: $00101010_2 = 32 + 8 + 2 = 42$.
Eight bits cover 0 (00000000) through 255 (11111111), giving possible values.
Adding works like decimal: right to left, carry when a column hits 2 (decimal carries at 10). Every bit sits at a fixed power of two, so you add columns and propagate carries. Here's 42 + 23:

$101010_2 + 010111_2 = 1000001_2 = 65$
With 32 bits the ceiling is $2^{32} - 1 = 4{,}294{,}967{,}295$. Over four billion values, all positive (there is no minus sign anywhere in our 32 bits). These are unsigned integers.
The Limits of Integers
32 bits give you four billion values. That sounds like a lot. But the real limitation is not the range. It's the step size. Adjacent integers are always exactly 1 apart.
That fixed spacing is the problem. Integers burn bits on every step of 1, even when you only care about a few significant digits. Sometimes that is exactly what we want: counting the countries in the world, or the moons of a planet. But think about a number like the Sun's mass: roughly $2 \times 10^{30}$ kg. No one measures that to the kilogram! All that matters is the order of magnitude (the $10^{30}$) and a few significant digits (the $2$ in $2 \times 10^{30}$), yet integers would still insist on stepping by ones all the way up there. That is just a waste of bits! You want range (very big and very small values) and precision (meaningful digits), and you should get to choose the tradeoff between them. But integers give you zero flexibility. They spend all their bits counting by ones.
There's also a more immediate problem with our running example: our format can't represent 6.75 at all. Integers have no fractions. The closest we can get is 6 or 7.
So we need two things: fractions, and a way to trade precision for range.
Fixed Point
Let's start with the simple/brute-force idea: draw a line down the middle of your 32 bits. Left half for the integer part, right half for the fraction.
The integer side works as before: powers of two going up (from $2^0$ to $2^{15}$). The fractional side mirrors it, going down: $2^{-1} = 0.5$, $2^{-2} = 0.25$, $2^{-3} = 0.125$, and so on. Where decimal has tenths, hundredths, thousandths, binary has halves, quarters, eighths.
Think of the binary point as a mirror. To the left, each position doubles: $1, 2, 4, 8, \ldots$ To the right, each position halves: $\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{8}, \ldots$ Converting a fraction to binary means finding which combination of these powers adds up to it.
6.75: the integer part is $110_2$ ($4 + 2 = 6$). The fractional part is $0.75 = \tfrac{1}{2} + \tfrac{1}{4}$, which is 0.11 in binary. Together: $110.11_2$.
Now we can represent fractions.
Another way to see it: you're storing a scaled integer. The scale factor is $2^{16} = 65{,}536$. So 6.75 gets stored as the integer $6.75 \times 65{,}536 = 442{,}368$, and you divide by 65,536 when you read it back.
But look at the cost. Our maximum value dropped from over 4 billion to roughly 65,535. We traded range for fractional precision, and that tradeoff is locked in. Every number gets exactly 16 bits of integer and 16 bits of fraction, whether it needs them or not.
That rigidity is the problem. 3.14159 wastes most of its integer bits (it only needs 2). 50,000 fits but can't use any fractional bits. 100,000 doesn't fit at all. The split is one-size-fits-all, and no single split works well across different magnitudes.
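The 16.16 scheme is small enough to sketch directly. The helper names here (`to_fixed`, `from_fixed`) are my own, not from any library; this is just the "scaled integer" view made executable.

```python
SCALE = 1 << 16  # 16.16 fixed point: store value * 2^16 as an integer

def to_fixed(x: float) -> int:
    """Encode as the nearest multiple of 1/65536."""
    return round(x * SCALE)

def from_fixed(n: int) -> float:
    """Decode by dividing the scale factor back out."""
    return n / SCALE

print(to_fixed(6.75))      # 442368 -- exact, since 0.75 is a sum of halves/quarters
print(from_fixed(442368))  # 6.75

# 3.14159 is NOT a multiple of 1/65536, so it snaps to the nearest one:
print(from_fixed(to_fixed(3.14159)))  # close, but not equal to 3.14159
```

The round trip is lossless for 6.75 and lossy for 3.14159: the fixed grid of 1/65,536 steps either contains your number or it doesn't.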
The Floating Point Idea
Fixed point forces every number into the same rigid split. Small numbers waste their integer bits. Big numbers can't use their fractional bits. What if the split could adapt to each number individually? What if we let the point move?
For example, take 32 bits, but instead of dividing them into a fixed 'before' and 'after' of the binary point, do something slightly different. Set aside 8 bits to record where the binary point goes. The remaining 24 store digits. So, for example, we store the binary digits '11' in their entirety (without any point) in the 24-bit section, and use the 8 remaining bits to indicate where within those digits the point belongs (say, after the first digit). The stored number is then interpreted as '1.1'.
You see how this solves the problem of 16.16 format from the last section? Here, a small number like 0.000125 puts the point far left (more fractional precision). A big number like 6,000,000 puts it far right (more integer range). Each number gets the tradeoff it needs!
This is the entire conceptual idea. Everything else is details.
For 6.75, we store the digits 11011 and record the point position as 3 (after the third digit): 110.11. The 8-bit field encodes the scale of the number. The 24 digit bits encode the value within that scale. Those digits are called the mantissa (formally "significand," though barely anyone uses that word). Because the binary point can float to a different position per value, this is floating point.
More bits for point position means wider range but fewer mantissa bits and coarser precision. More bits for mantissa means finer precision but less range. Eight bits for position and 24 for mantissa turns out to be a good balance, and it's close to what IEEE settled on.
Put differently: floating point is scientific notation in base 2.
| Format | Layout | Max Value | Precision |
|---|---|---|---|
| 32-bit integer | 32 bits: digits | ~4.3 billion | Exact |
| 16.16 fixed point | 16 + 16 bits | ~65,535 | ~0.00002 |
| Floating point (8+24) | 8 + 24 bits | Huge (depends on exponent) | ~24 bits of mantissa |
Okay, so we covered some good distance in this section. Looks like range and fractions are covered. But we still cannot say "negative six" yet! We need negative numbers.
The Sign Bit
This one is the easy part. Take one of our 32 bits and use it to record the sign. 0 for positive, 1 for negative.
This gives us the layout we'll use for the rest of the post:
One bit for the sign. Eight for the point position (we'll start calling this the exponent from here on, for reasons that become clear in the next section). Twenty-three for the mantissa. 1 + 8 + 23 = 32 bits.
Under this layout, 6.75 and -6.75 differ in exactly one bit. Setting 0 for positive is a convention, and a nice one: positive zero is all-zero bits (00000000...).
Sign, range, and fractions. We're getting somewhere. But there is a problem with how we're spending our 32 bits.
The Redundancy Problem and Normalization
But wait. If the point position is stored explicitly, nothing stops multiple bit patterns from encoding the same value! Consider 6.75 ($110.11_2$):

- $1.1011 \times 2^2$ (point after first digit)
- $11.011 \times 2^1$ (point after second digit)
- $110.11 \times 2^0$ (point after third digit)
- $0.11011 \times 2^3$ (point before first digit)
Four bit patterns. One number. Every duplicate is a slot that could have represented a different value. We have 4.3 billion patterns and infinitely many real numbers to cover. We cannot afford to spend four of them on the same number.
Zero is even worse. Zero stays zero no matter where the point sits, and the sign bit doesn't help. Dozens of bit patterns all collapse to the same value.
The fix is scientific notation. Write any number as a value between 1 and 10, times a power of ten: $6{,}750 = 6.75 \times 10^3$.
The rule: always place the decimal point after the first non-zero digit. You would not write $67.5 \times 10^2$ or $0.675 \times 10^4$, even though they equal the same thing. One standard form per number. No ambiguity.
We use the same rule in binary: place the binary point after the first non-zero digit and adjust the exponent to compensate. This is normalization.
For 6.75: $110.11_2 = 1.1011_2 \times 2^2$
We moved the point two positions left (dividing by 4), so we multiply by to compensate. The mantissa is 1.1011, the exponent is 2, and every non-zero number now has exactly one representation.
That handles the redundancy. But normalization also gives us a free bit of precision.
The Implicit Leading 1
Think about what normalization guarantees. The binary point always sits right after the first non-zero digit. In decimal, that leading digit could be anything from 1 through 9. Nine possibilities. You have to store which one it is.
But binary has only two digits. If the leading digit cannot be zero (normalization forbids it), it must be 1.
Every normalized binary number looks like this: $1.\mathrm{xxxxx}\ldots_2 \times 2^e$
That leading 1 is predictable, so we do not store it. Drop it from the 23-bit mantissa field and you get 24 bits of precision from a 23-bit field. Something for nothing. Hardware reconstructs the full mantissa by prepending 1 during decode.
For 6.75, the normalized form is . We store 10110000000000000000000 in the mantissa field (23 bits, the part after the binary point). When computing, the hardware reconstructs 1.10110000000000000000000 (24 bits).
This is the implicit leading 1 (also called the hidden bit). One free bit of precision from a one-line logical argument.
But it creates a problem. If every number starts with 1.something... how do you store zero? We will come back to that. First, the exponent field has its own issue.
Negative Exponents and the Bias
The exponent field is 8 bits: values 0 to 255. All positive. But we need negative exponents too. $0.5 = 1.0_2 \times 2^{-1}$. $0.1 \approx 1.10011\ldots_2 \times 2^{-4}$. Without negative exponents, we cannot represent anything smaller than 1. That is not going to work.
One idea: dedicate a sign bit to the exponent, leaving 7 bits for magnitude. Range: -127 to +127. The problem: exponent zero gets two bit patterns (00000000 for +0 and 10000000 for -0). We just finished eliminating redundancy in the mantissa. Same waste, different field.
IEEE 754 uses a simpler trick: a fixed offset. The stored exponent is: $\text{stored} = \text{real exponent} + 127$
To decode, subtract 127. That is it.
- Stored 1 → real exponent -126 (smallest normal)
- Stored 127 → real exponent 0
- Stored 254 → real exponent +127 (largest normal)
The offset is called the bias, and the stored value is the biased exponent. A nice side effect: because the bias is a constant shift, ordering is preserved. A larger stored exponent always means a larger real exponent, which makes hardware comparisons simple.
Why 127 and not 128? Because stored exponents 0 and 255 are reserved for special cases (we will see what for soon), leaving 1 through 254 for normal numbers. Subtracting 127 maps that range to real exponents -126 through +127.
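You can watch the bias in action by splitting a real float32 bit pattern into its fields. Python's standard `struct` module round-trips a value through a genuine 4-byte IEEE float; the field-splitting helper is my own sketch.

```python
import struct

def float32_fields(x: float):
    """Split x's float32 bit pattern into (sign, stored exponent, mantissa)."""
    (bits,) = struct.unpack('>I', struct.pack('>f', x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

sign, stored_exp, mantissa = float32_fields(6.75)
print(sign)              # 0: positive
print(stored_exp)        # 129: the real exponent 2, plus the bias of 127
print(stored_exp - 127)  # 2: decode by subtracting the bias
```

For 6.75 the stored exponent really is 129, i.e. $2 + 127$, matching the bias rule above.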
Exponent as a Window, Mantissa as an Offset
We now have a sign bit, a biased exponent, and a mantissa field. You will often see float32 written as: $(-1)^{\text{sign}} \times 1.\text{mantissa}_2 \times 2^{\text{stored exponent} - 127}$
That formula is correct, but when I first saw it, it told me nothing about why the format behaves the way it does. There is a more intuitive way to think about it. The exponent selects a range, and the mantissa selects a position inside that range.
For normal numbers (stored exponent 1 through 254), the exponent picks a window between two consecutive powers of two. The mantissa picks a position inside that window.
Strip the bias and call the real exponent $e$. Then:
- The window spans $[2^e, 2^{e+1})$.
- The window is $2^e$ wide.
- The mantissa divides that window into $2^{23}$ evenly spaced slots.
Here are a few windows to make this pattern concrete: $[1, 2)$, $[2, 4)$, $[4, 8)$, $[8, 16)$, each holding $2^{23}$ slots.
Each window is twice as wide as the last, but every window gets the same $2^{23}$ slots. Fixed slot count, doubling window width. So each step up in exponent doubles the spacing between adjacent representable floats.
If the mantissa bits decode to an integer $m$ in $[0, 2^{23})$, the value is: $\text{value} = 2^e + m \cdot \frac{2^e}{2^{23}}$
Same formula as before, written as "left edge plus offset." The sign bit flips the result across zero.
This is where the window model pays off. Take 6.75. It lives in $[4, 8)$. How far across the window is it? $\frac{6.75 - 4}{4} = 0.6875$

68.75% of the way across. Multiply by $2^{23}$: $0.6875 \times 2^{23} = 5{,}767{,}168$, an integer. Lands cleanly on a slot boundary. No rounding needed.
Now try 6.1, also in $[4, 8)$: $\frac{6.1 - 4}{4} = 0.525$

Multiply by $2^{23}$: $0.525 \times 2^{23} = 4{,}404{,}019.2$. Not an integer, so 6.1 does not land on a slot boundary. Hardware rounds to the nearest one. This is rounding error made concrete. Not a mystery, not a bug. Just a number that fell between two marks on a finite grid.
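Both results are easy to check by rounding through an actual float32. The `to_f32` helper below is my own name for a pack/unpack round trip through Python's standard `struct` module.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest float32, then widen it back."""
    return struct.unpack('>f', struct.pack('>f', x))[0]

print(to_f32(6.75))  # 6.75 -- lands exactly on a slot boundary
print(to_f32(6.1))   # 6.099999904632568 -- snapped to the nearest slot
```

6.75 survives the trip untouched; 6.1 comes back as the nearest slot value, visibly off in the 8th significant digit.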
That is the range-versus-precision tradeoff in action: move from $[1, 2)$ to $[2, 4)$ and you still have $2^{23}$ slots, but the window doubled in width, so each slot is twice as wide. Range grows, and absolute spacing grows with it.
Some concrete slot sizes to give you a feel for the scale:
- In $[1, 2)$: slot size is $2^{-23} \approx 1.2 \times 10^{-7}$.
- In $[1024, 2048)$: slot size is $2^{-13} \approx 1.2 \times 10^{-4}$.
The absolute slot size changes, but relative precision stays roughly constant, because every window gets the same number of slots. You always get about the same number of significant bits regardless of magnitude. That is the whole point: floats give constant relative precision.
We have covered one category of bit patterns. Here is the map so far:
Normal numbers cover stored exponents 1 through 254. Stored 0 and 255 are still empty.
The Zero Problem
The implicit leading 1 gives us a free bit of precision. But it also means every normal number starts with 1.something. How do you encode zero? Zero has no leading 1. There is no exponent that makes $1.\mathrm{xxx}_2 \times 2^e$ equal zero.
Every program needs 0. We need a special case.
This is what stored exponent 0 is for. When the stored exponent is 0 and the mantissa is all zeros, the value is zero. The sign bit gives us +0 and -0. (The other reserved exponent, 255, gets its meaning later.)
With zero defined, 6.75 encodes cleanly. Normalized: $1.1011_2 \times 2^2$. Biased exponent: $2 + 127 = 129$. Stored mantissa: 10110000000000000000000. Sign bit: 0.
The map gains an entry:
One gap remains: between 0 and the smallest normal number.
Subnormal Numbers
How small can a normal float get? The smallest positive normal number uses the smallest normal exponent and a zero mantissa. Let's compute it.
Sign = 0, stored exponent = 1 (the lowest allowed for normals; 0 is reserved for zero and subnormals), mantissa = all zeros: 0 00000001 00000000000000000000000

Decode the exponent: $1 - 127 = -126$. The implicit leading 1 gives a significand of exactly 1.0: $1.0 \times 2^{-126}$

How small is that? $2^{10} \approx 10^3$, so $2^{-126}$ is roughly $10^{-38}$. More precisely, $2^{-126} \approx 1.18 \times 10^{-38}$.
Tiny. But if float32 stopped here, there would be a gap between this value and zero. Every real number in that gap rounds to 0. And that gap breaks things.
Take two distinct tiny numbers, $a = 1.5 \times 2^{-126}$ and $b = 1.25 \times 2^{-126}$. Both are representable normals. Their difference is $0.25 \times 2^{-126} = 2^{-128}$, which falls into the gap and rounds to zero. So a - b == 0 even though a != b. Two numbers that are not equal, and the hardware can't tell.
We need values in that gap. And we have unused bit patterns sitting right there: stored exponent 0 with a non-zero mantissa. Exponent 0 with all-zero mantissa already means zero. The remaining patterns have no assignment yet.
For these patterns, IEEE drops the implicit leading 1. Instead of $1.\text{mantissa}_2 \times 2^{-126}$, interpret them as $0.\text{mantissa}_2 \times 2^{-126}$. Drop the leading 1, and the gap fills in.
These are subnormal numbers (sometimes denormals). Below normal, because they lack that leading 1. But they exist, and that is what matters.
How small does this go? The smallest positive subnormal has a single 1 in the last mantissa bit: 00000000000000000000001. That's $2^{-23}$ of the mantissa range. No implicit leading 1, so the significand is just $2^{-23}$. The exponent is fixed at $-126$: $2^{-23} \times 2^{-126} = 2^{-149}$

In decimal: $2^{-149} \approx 1.4 \times 10^{-45}$, so roughly $10^{-45}$. More precisely, $1.401 \times 10^{-45}$.
All subnormals are evenly spaced $2^{-149}$ apart. Unlike normal floats, this spacing does not change. It is a uniform grid from zero up to the smallest normal number. The cost is that numbers lose precision as they shrink (fewer significant bits in the mantissa), but at least they exist. That is the tradeoff: precision degrades, but values don't vanish.
Without subnormals, everything below $2^{-126}$ drops off a cliff to zero. That is abrupt underflow. Subnormals replace the cliff with a slope: numbers lose precision as they approach zero, but they don't disappear all at once. This is gradual underflow.
The subnormal range is small, but it closes the gap. The guarantee holds: if a != b, then a - b != 0.
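Both boundary values can be read straight out of their bit patterns. `bits_to_f32` is my own helper name; it reinterprets a raw 32-bit integer as a float32 via the standard `struct` module.

```python
import struct

def bits_to_f32(bits: int) -> float:
    """Interpret a raw 32-bit pattern as an IEEE 754 float32."""
    return struct.unpack('>f', struct.pack('>I', bits))[0]

smallest_normal = bits_to_f32(0x00800000)     # stored exponent 1, mantissa 0
smallest_subnormal = bits_to_f32(0x00000001)  # stored exponent 0, mantissa 1

print(smallest_normal == 2.0**-126)     # True
print(smallest_subnormal == 2.0**-149)  # True
print(smallest_subnormal > 0.0)         # True: not zero -- the gap is filled
```

The comparisons are exact because powers of two this size are representable in Python's 64-bit floats.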
Exponent 0 is fully assigned now: mantissa 0 means zero; mantissa non-zero means subnormal.
We have positive and negative numbers, zero, and a smooth path down to zero via subnormals. Before moving on to overflow and the remaining special cases: a different way to think about what we've built.
Floats as Buckets
Each float is not a point. It is a bucket. Every real number closer to this float than to its neighbors rounds here.
Take 6.75. The next representable float is one grid step away, about $4.8 \times 10^{-7}$ larger. Every real number within half that step rounds to the same bit pattern. A neighborhood of real values all become "6.75."
About 4 billion bit patterns standing in for infinitely many real numbers. The design question was always the same: where do the buckets go, and what happens at the edges?
Interactive placeholder: zoomable number line around a selected value showing adjacent floats, half-ULP bucket boundaries, and which reals round to the same stored value.
Once you think in buckets, the unusual parts of the format stop being unusual.
Negative zero. Zero is the bucket for values too small to distinguish from nothing. Small positive and small negative values both land there. The sign bit remembers which side you came from. That's all it is.
Infinity is the bucket for everything beyond finite range. Zero absorbs the smallest values; infinity absorbs the largest. Symmetric.
$x / 0 = \pm\infty$ (for finite nonzero $x$) follows directly. The denominator sits in the "too small to distinguish from zero" bucket. The numerator is finite. The quotient lands in the "too large to express" bucket. Not a special rule. Just buckets.
Infinity
We know the smallest float. What about the largest?
The largest finite float uses the largest normal exponent and the largest mantissa. Sign = 0, stored exponent = 254 (the highest for normals; 255 is reserved), mantissa = all ones: 0 11111110 11111111111111111111111

Decode the exponent: $254 - 127 = +127$. Reconstruct the significand: $1.11111\ldots_2$ with all 23 fractional bits on. Each additional bit gets you halfway closer to 2. With all 23 set, you're $2^{-23}$ short: the significand is $2 - 2^{-23}$.

Equivalently: the mantissa encodes $m = 2^{23} - 1$, so the significand is $1 + \frac{2^{23} - 1}{2^{23}} = 2 - 2^{-23}$. Multiply by the window: $(2 - 2^{-23}) \times 2^{127}$

Since $2^{-23}$ is negligible compared to 2: the value is essentially $2 \times 2^{127} = 2^{128}$, so roughly $3.4 \times 10^{38}$. More precisely, $3.40282347 \times 10^{38}$.
A 39-digit number. For scale, the number of atoms in the observable universe is estimated around $10^{80}$, so float32 covers up to roughly the square root of that count. Not bad for 32 bits.
What happens when a result exceeds this? Rather than crashing or wrapping around, the format gives overflow a name. Stored exponent 255 with mantissa all zeros means infinity. The sign bit gives $+\infty$ and $-\infty$.
Arithmetic with infinity follows clear rules: $\infty + 1 = \infty$, $\infty \times 2 = \infty$, $x + \infty = \infty$ for any finite $x$. The result is too large to represent, so it stays in the overflow bucket.
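These rules are easy to poke at. Python floats are 64-bit, but the IEEE infinity semantics are identical to float32's.

```python
import math

print(math.inf + 1 == math.inf)   # True: stays in the overflow bucket
print(math.inf * 2 == math.inf)   # True
print(1.0 / math.inf)             # 0.0: finite / infinity lands at zero
print(math.isinf(1e300 * 1e300))  # True: overflow produces infinity, not a crash
```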
But stored exponent 255 has $2^{23}$ possible mantissa values, and we have only used one of them (all zeros = infinity). What about the rest?
NaN
What should $0 / 0$ return? Not zero, not infinity. There is no real number that works. What about $\infty - \infty$? Or $\sqrt{-1}$? Same problem.
Historically, many systems would raise an exception or trap. IEEE 754 takes a different approach: give these results a special bit pattern called NaN (Not a Number). Any bit pattern with stored exponent 255 and a non-zero mantissa is NaN.
Count the possibilities: $2^{23} - 1$ non-zero mantissa values, each with two sign settings. That is $2 \times (2^{23} - 1) = 16{,}777{,}214$ distinct NaN bit patterns. Over 16 million ways to say "not a number." Why so many? The 23 mantissa bits carry a payload: a tag that can identify which operation failed, so debugging tools can trace the origin of the error. IEEE recommends this, but exact encoding is vendor-defined.
NaN propagates. Feed NaN into any arithmetic and NaN comes out. Errors travel as data instead of crashing immediately. That is a design choice worth pausing on.
| Operation | Result |
|---|---|
| $0 / 0$ | NaN |
| $\infty - \infty$ | NaN |
| $\infty / \infty$ | NaN |
| $\sqrt{-1}$ | NaN |
| $0 \times \infty$ | NaN |
That last one ($0 \times \infty$) is worth thinking about. Zero was rounded into the "too small" bucket. Infinity was rounded into the "too large" bucket. We no longer know the exact magnitudes that produced them. Zero pushes toward zero, infinity pushes toward infinity, and the result is genuinely indeterminate.
But wait: $x / 0$ (for finite nonzero $x$) is not NaN. The denominator sits in the "too small to tell apart from zero" bucket. The numerator is finite. A finite number divided by a vanishingly small number is an astronomically large number. That is the infinity bucket, not the "undefined" bucket.
Now for the strangest property in the entire format. NaN == NaN is false. No other value does this. The classic NaN test is x != x: if that returns true, x is NaN.
What "not equal to anything" means in practice:
- If `x` is NaN, `x == y` is false for every `y` (including NaN).
- If `x` is NaN, `x != y` is true for every `y`.
- Every ordered comparison (`<`, `<=`, `>`, `>=`) with NaN returns false.
Practical consequence: x < y is not the same as !(x >= y) when NaN is involved. When writing min and max, choose explicitly whether NaNs should propagate or be ignored.
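All of the comparison rules above can be verified in a few lines. Again, Python's 64-bit floats share float32's NaN semantics.

```python
import math

nan = math.nan
print(nan == nan)  # False: the only value not equal to itself
print(nan != nan)  # True -- the classic NaN self-test
print(nan < 1.0, nan >= 1.0)            # False False: ordered comparisons all fail
print(math.isnan(math.inf - math.inf))  # True: inf - inf is indeterminate
```

Note the third line: `nan < 1.0` and `nan >= 1.0` are both false, which is exactly why `x < y` and `!(x >= y)` diverge in the presence of NaN.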
The error-as-data idea is more useful than it sounds. Imagine a zero-finding algorithm probing a function at different points. One probe lands outside the domain: say $\sqrt{x}$ for negative $x$. Instead of crashing, the function returns NaN. The algorithm sees "invalid probe," skips it, continues searching. No special error-handling code. The NaN flows through arithmetic until something checks for it.
The last blank is filled. Every bit pattern has a meaning now:
| Category | Exponent (stored) | Mantissa | Count |
|---|---|---|---|
| Zero | 0 | 0 | 2 (+0 and -0) |
| Subnormal | 0 | non-zero | $2 \times (2^{23} - 1) = 16{,}777{,}214$ |
| Normal | 1 to 254 | any | $2 \times 254 \times 2^{23} = 4{,}261{,}412{,}864$ |
| Infinity | 255 | 0 | 2 (+∞ and -∞) |
| NaN | 255 | non-zero | $2 \times (2^{23} - 1) = 16{,}777{,}214$ |
Together, these definitions account for all 4,294,967,296 bit patterns. Nothing is left undefined. Nothing is wasted.
Putting It All Together
Time to put it all together. Here is 6.75, encoded end-to-end, using every piece we have built.
Step 1: Convert to binary.
$6 = 110_2$ and $0.75 = 0.11_2$ ($\tfrac{1}{2} + \tfrac{1}{4}$), so: $6.75 = 110.11_2$
Step 2: Normalize.
Move the binary point to after the first 1: $110.11_2 = 1.1011_2 \times 2^2$
Mantissa is 1.1011, exponent is 2. One form, one number.
Step 3: Sign bit.
6.75 is positive. Sign bit = 0.
Step 4: Bias the exponent.
Real exponent is 2. Add the bias (127): $2 + 127 = 129 = 10000001_2$

Exponent field: 10000001. This puts 6.75 in the $[4, 8)$ window. Correct.
Step 5: Store the mantissa.
Drop the implicit leading 1. The stored mantissa is the 23 bits after the binary point: 10110000000000000000000
The hardware puts the 1 back when it reads this.
Step 6: Assemble.
Full 32-bit string: 0 10000001 10110000000000000000000.
Hexadecimal: 0x40D80000.
Decode it back: sign = 0 (positive), exponent = $129 - 127 = 2$, significand = $1.1011_2 = 1.6875$. Value = $1.6875 \times 2^2 = 6.75$. Round trip checks out. Every piece was derived from a specific problem: the sign convention, the bias, normalization, the implicit leading 1, the mantissa encoding. Nothing is arbitrary.
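The whole walkthrough can be checked against real hardware encoding in three lines, using Python's standard `struct` module to pull out the actual float32 bit pattern.

```python
import struct

# Pack 6.75 as a big-endian float32, then read the raw 32-bit pattern back.
(bits,) = struct.unpack('>I', struct.pack('>f', 6.75))
print(f'{bits:032b}')   # 01000000110110000000000000000000
print(f'0x{bits:08X}')  # 0x40D80000
```

The printed bits split exactly as derived: sign 0, exponent 10000001, mantissa 10110000000000000000000.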
That is the representation half. Next is what happens when you compute with these numbers.
Part 2: Arithmetic
Compute, Then Round
We built the format. Every bit pattern has a meaning. But nobody stores numbers just to look at them. You store them to compute with them. And this is where everything from Part 1 starts to bite.
One rule governs every float operation:
Compute the exact result, then round to the nearest representable float.
Hardware does this perfectly for each individual operation. It works out the true real-number answer internally (with enough extra precision to decide the rounding correctly), then snaps to the nearest representable value. One operation, closest possible answer.
The trouble is chaining. After that first add, you have a rounded result. The second add takes it as input. It does not know, or care, that the value is already slightly off. It computes the exact answer from the slightly-wrong number, then rounds again. And the third operation starts from the second's rounding. Every operation builds on the last one's error.
Change the grouping and you change where the rounding happens. $(a + b) + c$ does not always equal $a + (b + c)$. Same numbers. Same addition. Different parentheses, different answer. Addition is not associative in floating point. A lot of code assumes it is.
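The classic demonstration uses the very numbers from the opening of this post (in 64-bit Python floats; float32 misbehaves the same way, just at different digits):

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # rounds after a+b, then rounds again after +c
right = a + (b + c)  # different intermediate result, different rounding

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False: same numbers, different parentheses
```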
Think of representable floats as marks on a ruler. You add two numbers and the exact result lands between two marks. You snap to the nearest one. The next add starts from that snapped position, not from where the true answer was. Different snapping order, different final mark.
Same idea as significant figures from science class. Fixed precision budget. Each operation rounds back into that budget. Floats do the same thing, in base 2, with 24 significant bits.
But float addition has a step that integer addition does not: the binary points have to line up first. When you add 42 and 0.001, the hardware shifts the smaller number's mantissa until both exponents match, then adds, then renormalizes and rounds.
That alignment step is where small terms vanish. If the exponent gap is large enough, shifting the smaller operand right pushes all of its significant bits clean off the end of the mantissa. It becomes zero at working precision. The add returns the larger operand unchanged, as if the small term was never there.
1e20f + 1.0f == 1e20f in float32. The 1.0 is not wrong. It is too small to register at that scale. The grid spacing near $10^{20}$ is on the order of $10^{13}$ (about $8.8 \times 10^{12}$ in float32). Adding 1 does not reach the next tick mark.
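Here's the absorption happening live. The `to_f32` round-trip helper (my own name, built on the standard `struct` module) gives true float32 values; the float64 line shows the same effect at 64 bits, where the grid near $10^{20}$ is still coarser than 1.

```python
import struct

def to_f32(x: float) -> float:
    """Round to the nearest float32, then widen back to a Python float."""
    return struct.unpack('>f', struct.pack('>f', x))[0]

big = to_f32(1e20)
print(to_f32(big + 1.0) == big)  # True: the 1.0 shifts entirely off the mantissa
print(1e20 + 1.0 == 1e20)        # True even in float64: its grid step here is 16384
```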
Addition is the most complex of the four basic operations because of the alignment step. Multiplication and division skip it entirely.
One subtlety worth pulling out. What happens when the exact result lands exactly halfway between two representable floats? If ties always break upward, every tie pushes in the same direction. Over thousands of operations, that bias accumulates. IEEE 754 avoids this with round-to-nearest, ties-to-even: at the exact midpoint, pick the neighbor whose last mantissa bit is 0. Over many operations, ties break up roughly as often as down, and the bias cancels. We will come back to this in the Rounding Modes section.
For the five basic operations (+, -, *, /, √), IEEE 754 guarantees something strong: correct rounding. The result is the representable float closest to the exact real answer. Not "pretty close." The closest one. That puts a hard bound on error: at most 0.5 ULP per operation. How large is 0.5 ULP? That depends on where you are on the number line. We will quantify it next.
IEEE also specifies correct rounding for remainder and integer-float conversions. Decimal conversion (printing floats as decimal strings and parsing them back) is harder; efficient exact conversion is a research problem in its own right, and behavior varies across libraries.
For transcendental functions (sin, cos, exp, log, ...), the standard is less strict. Math libraries usually aim for errors within a few ULPs, and results can differ across platforms and library versions.
How Far Apart Are Adjacent Floats? (ULP)
We said each correctly rounded operation has error at most 0.5 ULP. But how big is a ULP? I kept using this term without quantifying it. Time to fix that.
Each exponent band has evenly spaced slots, as we saw with the window model. The slot size sets the scale for rounding error. Recall the float32 formula:

value = (-1)^S × (1 + M / 2^23) × 2^(E - 127)

where M is the 23-bit mantissa interpreted as an integer and E is the stored exponent.

Increment M by 1 (keeping E fixed) and you step to the next representable float. The size of that step:

step = 2^(E - 127) × 2^-23 = 2^(E - 127 - 23)

This step size is 1 ULP (unit in the last place) at that exponent. The ruler's tick spacing at this point on the number line.
The spacing grows with exponent. Same idea as the windows: each window has 2^23 slots, so doubling window width doubles slot size. Some concrete values:

- Around 1.0 (exponent 0): spacing is 2^-23 ≈ 1.19 × 10^-7.
- Around 6.75 (exponent 2): spacing is 2^-21 ≈ 4.77 × 10^-7. We have used 6.75 for the entire post. Its nearest neighbor is roughly 6.7500005.
- Around 1024 (exponent 10): spacing is 2^-13 ≈ 1.22 × 10^-4.

Subnormals are the exception: constant spacing of 2^-149, because there is no implicit leading 1 and the exponent is fixed.
Here is something that caught me off guard when I first worked through it. At every power-of-two boundary, grid spacing doubles abruptly. Just below 4.0 (exponent 1), ULP is 2^-22. Just above 4.0 (exponent 2), ULP is 2^-21. Twice as large. Cross a power-of-two boundary and the meaning of "4 ULPs of error" changes by a factor of two. Same number of ULPs, different absolute error.
Most languages let you query this directly. In C/C++, nextafterf(x, +∞) returns the next representable float above x. The difference nextafterf(x, +∞) - x is exactly 1 ULP at x.
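Python (3.9+) exposes the same primitives in the math module, for float64 rather than float32, so the exponents shift but the doubling at a power of two is identical:

```python
import math

# In float64, [2, 4) has spacing 2**-51 and [4, 8) has spacing 2**-50.
below = math.ulp(math.nextafter(4.0, 0.0))  # ULP of the float just under 4.0
above = math.ulp(4.0)                       # ULP at 4.0 itself
print(above == 2 * below)                   # True: spacing doubles at 4.0

# nextafter walks the grid one tick mark at a time,
# so the step from 1.0 to its neighbor is exactly 1 ULP.
print(math.nextafter(1.0, math.inf) - 1.0 == math.ulp(1.0))  # True
```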
The spacing at 1.0 comes up so often in error analysis that it has its own name: machine epsilon. The distance from 1.0 to the next representable float. For float32: 2^-23 ≈ 1.19 × 10^-7. In C: FLT_EPSILON. In C++: std::numeric_limits<float>::epsilon().
But epsilon does not scale. The spacing near 1.0 is about 1.2 × 10^-7. Near 1024 it is about 1.2 × 10^-4. Near 10^8 it is 8. Eight. The gap between adjacent floats near a hundred million is larger than most people's intuition for "rounding error." Using bare FLT_EPSILON as a tolerance only works near 1.0. Elsewhere it is either too tight (near large numbers, where grid steps are huge) or too loose (near small numbers, where the grid is fine).
If you can normalize a problem so values stay near 1 (or at least in a narrow magnitude range), you get more usable precision from the same bits. When values drift many orders of magnitude away from 1 in either direction, the grid spacing (or the subnormal floor) eats into the digits you actually care about.
In decimal terms, float32 gives about 7 significant digits of precision. Float64 gives about 16.
This is why printing float32 to 20 decimal places is misleading. Only the first ~7 digits carry real information. The rest are the exact decimal expansion of the stored bit pattern. They look precise. They are not.
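You can see both the ~7-digit limit and the misleading long expansion from Python by round-tripping through float32 with struct (a common trick, since Python has no native float32 type):

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 and back."""
    return struct.unpack('f', struct.pack('f', x))[0]

# The float32 nearest to 0.1, printed at float64 precision:
print(f32(0.1))    # 0.10000000149011612
# Only the first ~7 digits relate to 0.1; the rest are the exact
# decimal expansion of the stored bit pattern.
```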
The practical rule is about round-trips: 9 decimal digits are enough to recover any float32 exactly when parsing back to binary. For float64: 17 digits. These are about unique recovery, not about how many digits of the original quantity are meaningful.
Now that we know what a ULP is and what "within half a ULP" means, the next question is: when the exact result falls exactly between two representable floats, which one do you pick?
Rounding Modes
Most of the time, "round to the nearest representable float" is straightforward. One neighbor is clearly closer. Pick it. Done.
The interesting case is ties: the exact result lands exactly halfway between two representable floats.
This is not rare. It happens every time the true result has a 1 in the bit position just past the mantissa. The standard's answer: round to the neighbor whose last mantissa bit is 0. Ties to even.
Why even and not up? If you always round ties upward, every tie pushes the same direction. Over thousands of operations, that one-directional bias accumulates into drift you can actually measure. Ties-to-even breaks the pattern. Sometimes up, sometimes down, depending on the last stored bit. The errors cancel more often. One bit of rounding logic, and the statistical bias vanishes.
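Ties-to-even is visible at the edge of integer precision. In float64, integers above 2^53 can no longer all be represented, so odd integers land exactly halfway between two representable neighbors:

```python
# 2**53 + 1 sits exactly halfway between 2**53 and 2**53 + 2.
# Ties-to-even picks the neighbor whose last mantissa bit is 0: 2**53.
print(float(2**53 + 1) == float(2**53))      # True: tie broke downward

# 2**53 + 3 sits halfway between 2**53 + 2 and 2**53 + 4.
# This time the even neighbor is the upper one.
print(float(2**53 + 3) == float(2**53 + 4))  # True: tie broke upward
```

Same rule, opposite directions, which is exactly how the statistical bias cancels.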
IEEE 754 defines five rounding modes total. Ties-to-even is the default, and it is what you get unless you explicitly change it. The other four: ties-away-from-zero (less commonly exposed in hardware), round toward zero (truncation), round toward +∞ (ceiling), and round toward -∞ (floor). The directed modes exist mainly for interval arithmetic, where you round one endpoint up and the other down to trap the true value in a provable bracket.
Switching modes changes more than the last bit. It can affect overflow behavior and whether an exact zero comes out as +0 or -0.
One thing I did not expect: rounding modes affect more than arithmetic. When converting float to int via rint/lrint in C, round-toward--∞ acts like floor and round-toward-+∞ acts like ceil. (Plain casts in most languages truncate toward zero, which is neither.) On overflow, the default mode produces ±∞, but directed modes can produce the largest finite value in the appropriate direction instead.
Measuring Error
Every time I said "rounding error" up to this point, I waved my hands about how big it actually is. Two ways to make it precise.
ULP error counts grid steps: how many adjacent-float hops separate your result from the true value. "Off by 2^k ULPs" means about k trailing bits are unreliable. It is intuitive because it measures distance in the float's own units.
Relative error normalizes by magnitude: |computed - true| / |true|.
For a concrete example: the nearest float32 to π is 3.1415927410125732. The gap from the real π is about 8.7 × 10^-8, around 0.37 ULP. Relative error is about 2.8 × 10^-8. Not bad for 23 bits of mantissa. But watch what happens when you pass that slightly-off π to a function that cares about exactness at π, like sin.
Interactive placeholder: interactive ULP explorer around a selected float32 value (especially near powers of two) showing nextafter, spacing jumps, and relative vs absolute error.
Cancellation
Subtract two numbers that are almost equal. The leading digits match, cancel, and disappear. Only the trailing digits survive.
Were those trailing digits any good?
If both operands were exact (small integers, powers of two, anything that fits the mantissa without rounding), yes. The leading digits were accurate, so canceling them just uncovers accurate lower digits. Nothing was destroyed. This is benign cancellation.
But if both operands carry rounding error from earlier computation, the noise lives in those trailing bits. Cancel the leading digits and what remains is rounding error, promoted from the back of the number to the front. That is catastrophic cancellation. Not because many digits vanished, but because the surviving ones are noise.
The question is never "how many digits did I lose?" It is "were the digits I lost any good?" Subtracting b - a where both are exact machine numbers is fine, no matter how close they are. Computing b² - 4ac where both terms carry rounding error from the multiplications can be catastrophic. Same shape of subtraction, same number of bits lost, completely different quality of result.
(I once stared at a quadratic discriminant that came back negative for what was obviously a real root. b² and 4ac were both huge, nearly equal, and each slightly rounded from the multiplications. The subtraction wiped out every meaningful bit. The leftover noise happened to be negative. The formula told me the root didn't exist.)
Diagram placeholder: side-by-side comparison of benign vs. catastrophic cancellation. Two float mantissa bars being subtracted in each case. Left (benign): all bits in both operands are accurate; after subtraction, surviving trailing bits are clean. Right (catastrophic): leading bits are accurate but trailing bits carry rounding noise; after subtraction, only the noisy bits survive, now occupying the most significant positions. Same number of canceled bits, opposite outcome.
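The discriminant failure generalizes to the textbook quadratic formula. A sketch in float64, with my own example coefficients, contrasting the naive small root with the standard stable rearrangement (compute the big root first, then get the small one from the product of roots):

```python
import math

# x^2 - 1e8*x + 1 = 0: roots are approximately 1e8 and 1e-8.
a, b, c = 1.0, -1e8, 1.0
disc = math.sqrt(b * b - 4 * a * c)

# Naive: subtract two nearly equal numbers (1e8 and sqrt(1e16 - 4)).
naive_small = (-b - disc) / (2 * a)

# Stable: the big root comes from an addition (no cancellation);
# then root1 * root2 == c/a gives the small root by division.
big = (-b + disc) / (2 * a)
stable_small = c / (a * big)

print(naive_small)   # ~7.45e-09: mostly noise promoted by the subtraction
print(stable_small)  # ~1e-08: correct to full precision
```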
Two things follow from this. Adding a small term to a large one can erase the small term entirely, because its bits fall below the grid spacing at the larger number's magnitude. And reordering additions can change the answer, because different orderings round at different points.
Why 0.1 + 0.2 ≠ 0.3
We started this post with 0.1 + 0.2 returning 0.30000000000000004. We now have enough machinery to trace what actually happens. "Binary can't represent 0.1 exactly" is true (we covered why in the fraction-termination section). But which bits round in which direction, and why does the error land above 0.3 instead of below?
We will use float64 (double precision), since it is the default in many languages. Float32 behaves the same way.
1/10 repeats forever in base 2, just like 1/3 repeats in decimal: 0.1 in binary is 0.000110011001100110011..., with the block 0011 repeating forever.
A float64 mantissa stores 52 explicit bits (53 total with the hidden bit). The rest are rounded away. So the float64 labeled "0.1" is actually:

0.1000000000000000055511151231257827021181583404541015625

Already slightly above 0.1. Not by much, but above. Similarly, the stored "0.2" is:

0.200000000000000011102230246251565404236316680908203125

Also slightly above. Both inputs rounded up. Add these two stored values and you get an exact sum of:

0.3000000000000000166533453693773481063544750213623046875

Now look at the two float64 values that bracket the real number 0.3:

- Lower: 0.299999999999999988897769753748434595763683319091796875
- Upper: 0.3000000000000000444089209850062616169452667236328125

The exact sum lands exactly halfway between them. Ties-to-even picks the upper value. So 0.1 + 0.2 rounds to 0.30000000000000004..., while the literal 0.3 rounds to 0.29999999999999998.... One ULP apart. That is the whole story.
Three small pushes, all in the same direction: 0.1 rounds up, 0.2 rounds up, and the final tie breaks upward. If any one of those had gone the other way, the result might have landed on 0.3 exactly. It didn't.
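You can inspect every step of this story from Python: Decimal(x) prints the exact value a float stores, with no decimal rounding in the way.

```python
from decimal import Decimal

# Exact values of the stored doubles (nothing is approximated in the printout):
print(Decimal(0.1))        # 0.1000000000000000055511151231257827021181583404541015625
print(Decimal(0.2))        # 0.200000000000000011102230246251565404236316680908203125
print(Decimal(0.3))        # 0.299999999999999988897769753748434595763683319091796875

# The computed sum is one ULP above the stored 0.3:
print(Decimal(0.1 + 0.2))  # 0.3000000000000000444089209850062616169452667236328125
print(0.1 + 0.2 == 0.3)    # False
```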
More mantissa bits push the discrepancy farther to the right, but 1/10 still repeats forever in binary. Float128 would have the same problem. No finite binary float format will represent 0.1 exactly.
The density of representable floats varies across the number line. Near zero, consecutive floats are packed extremely close together (thanks to subnormals). Near large numbers, they can be far apart. Floating point gives relative precision (roughly the same number of significant digits everywhere), not absolute precision (a fixed gap everywhere).
So how do you compare floats safely?
Comparing Floats
The first thing you try is ==. You already know that doesn't work. The second thing you try is an epsilon threshold. Also wrong, just less obviously.
Attempt 1: absolute epsilon. Accept when |a - b| < FLT_EPSILON.
This works near 1.0. But float spacing grows with magnitude. At magnitude 2.0, the gap between adjacent floats already exceeds FLT_EPSILON. At magnitude 1000, consecutive floats are roughly 512 × FLT_EPSILON apart. Using bare FLT_EPSILON as tolerance effectively marks almost every distinct pair above ~2.0 as "not close." The tolerance does not scale.
Attempt 2: relative epsilon. Accept when |a - b| < ε × max(|a|, |b|).
This fixes scaling: "close" means "within a fraction of magnitude." But it fails near zero. Compare 1 × 10^-10 and 2 × 10^-10. Relative difference is 100%, yet the absolute gap is tiny. Near zero, any nonzero relative tolerance is either too loose or too tight.
Attempt 3: hybrid. Combine both, letting each cover the other's weakness: accept when |a - b| ≤ max(ε_abs, ε_rel × max(|a|, |b|)).

The absolute term handles the near-zero region. The relative term scales with magnitude everywhere else. Many testing frameworks use this shape, give or take combining the two terms with + instead of max (NumPy's allclose, PyTest's approx).
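Python's math.isclose implements exactly this hybrid shape (combining the two tolerances with max); a hand-rolled version makes the structure explicit:

```python
import math

def is_close(a: float, b: float,
             rel_tol: float = 1e-9, abs_tol: float = 1e-12) -> bool:
    """Hybrid tolerance: the relative term scales with magnitude,
    the absolute term covers the near-zero region."""
    return abs(a - b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol)

print(is_close(1e9, 1e9 + 0.5))   # True: within relative tolerance
print(is_close(1e-13, -1e-13))    # True: the absolute term saves us near zero
print(is_close(1.0, 1.001))       # False: far outside both tolerances
print(math.isclose(1e9, 1e9 + 0.5, rel_tol=1e-9))  # True: same verdict
```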
Choosing tolerances is domain-specific. Set absolute tolerance from the smallest meaningful value in your problem (for atomic-scale physics, that might be 10^-12; for pixel coordinates, 10^-3). Set relative tolerance from operation depth: each float operation introduces up to 0.5 ULP of error, growth is often around √n over n roughly unbiased operations, and can grow linearly in adversarial cases. For 100 operations, a relative tolerance somewhere between about 10 × and 100 × FLT_EPSILON is often reasonable.
You can also measure distance in the native units of floats: ULPs. Two floats are n ULPs apart if you can walk from one to the other in n steps of nextafter. For positive floats, ULP distance is just the integer difference between the two bit patterns reinterpreted as unsigned integers. Many test frameworks offer helpers for both "relative error" and "ULP distance."
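For float32, the bit-pattern trick is a few lines of struct (a sketch for positive finite floats only; negatives need the usual sign-magnitude remapping first):

```python
import struct

def f32_bits(x: float) -> int:
    """Bit pattern of the float32 nearest to x, as an unsigned int."""
    return struct.unpack('<I', struct.pack('<f', x))[0]

def ulp_distance_f32(a: float, b: float) -> int:
    """ULP distance between two positive float32 values. The IEEE
    encoding is lexicographically ordered, so adjacent floats have
    adjacent bit patterns."""
    return abs(f32_bits(a) - f32_bits(b))

one = 1.0
# Build the next float32 above 1.0 by incrementing the bit pattern.
next_up = struct.unpack('<f', struct.pack('<I', f32_bits(one) + 1))[0]
print(ulp_distance_f32(one, next_up))   # 1
print(next_up - one)                    # 1.1920928955078125e-07, i.e. 2**-23
```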
If a test only passes after you greatly increase epsilon, that is a diagnostic, not a fix. Cancellation or accumulation is dominating the result. Well-conditioned computations do not need large tolerances.
One last trap: NaN is not equal to itself. NaN == NaN is false. By design. Infinities compare as expected, and +0 == -0 under == despite different bit patterns.
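The comparison rules take seconds to confirm:

```python
import math

nan = float('nan')
print(nan == nan)                 # False: NaN compares unequal to everything
print(math.isnan(nan))            # True: the only reliable NaN check
print(float('inf') == math.inf)   # True: infinities compare normally
print(0.0 == -0.0)                # True, despite different bit patterns
print(math.copysign(1.0, -0.0))   # -1.0: the sign bit is still there
```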
Accumulation Error: Time
A 100 Hz simulation running for 10 minutes. That is 60,000 frames of t += 0.01. After all 60,000 frames in float32, the clock reads 600.27. A quarter-second of drift. From rounding.
Here is why. Float32 stores 0.01 as 0.009999999776482582, already slightly off. Each frame adds this slightly-wrong value and rounds the result. Two ways to compute the current time:
- Accumulate: start at t = 0, do t += Δt each frame. Rounds 60,000 times.
- Compute from a counter: keep an integer frame index i, compute t = i * Δt. Rounds once.
After 60,000 frames, accumulation gives t = 600.2744140625. Computing t = i * Δt gives 600.0. Same physics, one multiplication instead of 60,000 additions. The fix is almost embarrassingly simple: keep an integer counter, multiply once.
If your timestep is constant, compute from an integer counter. If it is variable, accumulate in float64, or store time as integer ticks and convert at the edges.
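Emulating float32 with struct round-trips, the whole experiment is a few lines. The exact drift depends on rounding at every one of the 60,000 steps, so treat the accumulated value as "visibly wrong" rather than a specific constant:

```python
import struct

def f32(x: float) -> float:
    """Round to the nearest float32 and back; applying this after every
    operation emulates float32 arithmetic."""
    return struct.unpack('f', struct.pack('f', x))[0]

dt = f32(0.01)
print(dt)                # 0.009999999776482582: already slightly low

# Accumulate: 60,000 rounded additions.
t = f32(0.0)
for _ in range(60_000):
    t = f32(t + dt)

# Counter: one rounded multiplication.
t_counter = f32(60_000 * dt)

print(t)                 # drifts visibly away from 600.0
print(t_counter)         # 600.0
```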
Here is a subtler version of the same trap. Someone tells you their system has "sub-microsecond precision" because the timestep is about 1.2 microseconds (~810 kHz). Sounds impressive. But the timestep is not the problem. Accumulating t += Δt in float32 at 810,000 updates per second compounds error quickly. After a few seconds, accumulated clock time drifts from elapsed time. Drift direction can even change at exponent boundaries.
The distinction matters: "sub-microsecond precision" describes the timestep, not the accumulated result. A small Δt accumulated carelessly can produce worse timing than a larger Δt computed from an integer counter. The precision of your increment says nothing about the precision of your running total.
Stable Algorithms: Triangle Area
So far, errors came from repetition: thousands of operations, each adding a small rounding error that compounds over time. Here the failure is different. One formula. One evaluation. And the answer is completely wrong.
The classic example is Heron's formula for the area of a triangle: compute the semi-perimeter s = (a + b + c) / 2, then Area = √(s (s - a) (s - b) (s - c)).
Mathematically correct. Numerically fragile when the triangle is thin: two sides nearly equal, the third much smaller.
Take a triangle with sides a = 100,000,000, b = 100,000,000, c = 1. The true area is about 50,000,000.
In float32, naive Heron's formula returns zero.
Not "close to zero." Exactly zero. The semi-perimeter s and the side a round to the same float, so s - a becomes 0 and the entire product collapses. A triangle with 50 million square units of area, and the formula says it is flat.
Kahan gave a stable rearrangement. Sort the sides so a ≥ b ≥ c, then compute:

Area = (1/4) √( (a + (b + c)) × (c - (a - b)) × (c + (a - b)) × (a + (b - c)) )

Same bits. Same hardware. The only difference is where the parentheses go. For the same triangle, this produces 50,000,000 in float32.
| Method | Float32 | Float64 |
|---|---|---|
| Naive Heron's | 0 | 50,000,000 (for this case) |
| Kahan's rearrangement | 50,000,000 | 50,000,000 |
For this specific triangle, float64 rounds both formulas to 50,000,000. That does not make naive Heron stable; it just means this case still sits inside double's margin. With extreme enough inputs, the naive formula fails in float64 too. More bits delay breakdown, not prevent it.
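Here is the experiment in emulated float32, rounding through struct after every intermediate operation, using a thin triangle a = b = 10^8, c = 1 whose true area is almost exactly 5 × 10^7:

```python
import math
import struct

def f32(x: float) -> float:
    """Round to the nearest float32 and back; applied after every
    operation this emulates float32 arithmetic."""
    return struct.unpack('f', struct.pack('f', x))[0]

def heron_naive(a, b, c):
    # Assumes a, b, c are already exactly representable in float32.
    s = f32(f32(f32(a + b) + c) / 2)
    area_sq = f32(f32(f32(s * f32(s - a)) * f32(s - b)) * f32(s - c))
    return f32(math.sqrt(area_sq))

def heron_kahan(a, b, c):
    a, b, c = sorted((a, b, c), reverse=True)   # a >= b >= c
    t = f32(f32(a + f32(b + c)) * f32(c - f32(a - b)))
    t = f32(t * f32(c + f32(a - b)))
    t = f32(t * f32(a + f32(b - c)))
    return f32(f32(math.sqrt(t)) / 4)

a, b, c = 1e8, 1e8, 1.0
print(heron_naive(a, b, c))   # 0.0: s rounds to a, so s - a == 0
print(heron_kahan(a, b, c))   # 50000000.0
```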
This generalizes beyond triangles: keep big terms with big terms and small terms with small terms as long as possible. Parentheses are not cosmetic. They are part of the algorithm.
When Algebra Breaks
Suppose you refactor an expression. Same inputs, same algebra, just rearranged. In real arithmetic, nothing changes. In floating point, you can get a different answer.
We already saw this with addition: a + (b + c) can differ from (a + b) + c because rearranging changes where rounding happens. Distributivity breaks too. (a + b) * c and a*c + b*c round at different steps, and each intermediate is snapped to the grid independently.
But the damage can be worse than a few bits of disagreement. Take x = 1e20f and y = 1e20f in float32. These two expressions are algebraically identical: (x / y) * y and (x * y) / y.
The first simplifies to x. The second computes x * y = 1e40 as an intermediate, which overflows to infinity. Same algebra. Infinity on one side, the correct answer on the other.
Even without overflow, the differences are real. I would have assumed "divide by 7" and "multiply by 1/7" are the same operation. They're not. In float32 with x = 12345.678f, x / 7.0f gives 1763.668212890625 while x * (1.0f / 7.0f) gives 1763.6683349609375. Two different floats from what looks like the same math.
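The float32 round-trip trick reproduces this, using x = 12345.678 as a value whose quotient by 7 lands near a rounding boundary:

```python
import struct

def f32(x: float) -> float:
    """Round to the nearest float32 and back."""
    return struct.unpack('f', struct.pack('f', x))[0]

x = f32(12345.678)
by_div = f32(x / 7.0)        # one correctly rounded division
recip  = f32(1.0 / 7.0)      # 1/7 is itself rounded...
by_mul = f32(x * recip)      # ...then the product rounds again

print(by_div)                # 1763.668212890625
print(by_mul)                # 1763.6683349609375
print(by_div == by_mul)      # False: one float32 ULP apart
```

Two rounding steps instead of one, and the error from the stored reciprocal pushes the product across the rounding boundary.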
Your compiler might be making this exact substitution right now. Compilers often prefer the faster reciprocal multiply. Whether that matters depends on what you care about: speed, precision, or reproducibility. Which raises a question: what exactly is your compiler allowed to do with your float code?
Compilers and Hardware Knobs
A common scenario: you write careful float code, tests pass, then someone builds with different optimization flags and gets slightly different results. Nobody made a mistake.
Between source code and final answer sits a stack: language spec, compiler, hardware, and math library. Each layer can change the numerical result.
Even the language standard might not promise IEEE 754 semantics. C++ (as of C++23) does not require IEEE 754 conformance. That surprised me when I first learned it. Most implementations follow IEEE, but the standard doesn't force it. Other languages vary too. If you need bit-for-bit reproducibility across machines, treat it as a configuration problem, not something you get for free.
Compiler flags are the biggest lever. In strict modes, compilers avoid rearranging float expressions unless results are provably identical. In fast modes like -ffast-math, you permit reordering, fused operations, and sometimes subnormal flushing. I've seen -ffast-math passed around Makefiles like a free speedup. It's not free. That flag changes which tests pass.
Four things that commonly trip people up:
Fused multiply-add (FMA). Tests pass on one machine, fail on another, differing only in the last bit. FMA computes a*b + c with a single rounding instead of two. It's usually more accurate. But "more accurate" and "same answer" are different things, and not all hardware supports it. Cross-machine reproducibility gets harder.
Flush-to-zero. A physics simulation runs 2x faster on a GPU but produces different near-zero results. Some hardware handles subnormals on a slow path, so environments snap them to zero for speed. You gain throughput and lose near-zero fidelity. Fine if you know about it. Dangerous if you don't.
Extended precision intermediates. The sneakiest one. Some toolchains evaluate in wider precision than the declared type and round only at store-to-memory. "Obviously equal" expressions can compare unequal: compute 3.0/7.0, keep one copy in an 80-bit register and spill the other to memory as a 64-bit double, and the two copies no longer compare equal.
Algebraic rewrites. Every one of the rewrites from the previous section is algebraically valid and floating-point invalid. Compilers may replace division by constants with reciprocal multiply, reassociate sums, or rewrite x*y - x*z as x*(y - z), which can be catastrophic when y ≈ z.
So the stack between your source code and the final answer is deep. The precision-vs-speed tradeoff extends to the numeric type itself. GPUs often make float32 fast and float64 much slower. Vectorized math (SSE, AVX, NEON) adds another layer: large speedups with their own rules and corner cases.
Choosing Your Number Type
After all the ways floats can surprise you, the natural question: what type should I actually use? Match the representation to the precision you need.
- Integers. If the value is exact and discrete, use integers. "Approximately 7 dollars" is not acceptable. Convert to/from float at system boundaries (I/O, display), not inside the computation.
- Float64. The workhorse. ~16 significant digits handles the vast majority of scientific and engineering problems. When in doubt, this is your default.
- Float32. When memory or throughput matters more than precision: GPU shaders, ML inference, bulk sensor data. ~7 significant digits instead of ~16. You feel the grid sooner.
- Software multiprecision (Boost, mpfr). Reach for this reluctantly. When float64 isn't enough and you can afford the speed cost.
- Fixed-point. Exact uniform spacing across a known range. Formats like 60.40 (60 integer bits, 40 fractional bits) give sub-nanosecond resolution over long durations without drift.
Wider types aren't free. They eat memory, reduce cache efficiency, and on GPUs the speed penalty can be severe.
Testing and Debugging
Floating-point bugs are quiet. They don't crash. They give you a plausible-looking wrong answer, and you might not notice until much later. The first time I chased a NaN backward through a computation, I wished I had set up my debugging differently.
Testing strategy. Use relative-error or ULP-based checks. Avoid == except for values you know are exactly representable (small integers, powers of two). Work through at least one case by hand and keep it as a test. Wrong results look plausible when eyeballed.
Implicit promotion. This one is sneaky. In C/C++, mixing double and float silently computes the result in double precision. You may not notice because the intermediate is more accurate. The narrowing back to float at assignment is where bits get lost. Be explicit about your types.
Floating-point exceptions. Enable them in debug builds. Seriously. Overflow, underflow, divide-by-zero, invalid. Catch the first bad operation instead of chasing NaN backward through a long computation. In C/C++: fenv.h plus platform-specific controls.
Traps vs. NaN propagation. Opposite workflows. Traps stop at the first bad operation. NaN propagation lets the computation finish and you check at the end. I use traps for debugging and NaN propagation for production.
When a result looks wrong, the useful question is not "did we get the exact answer?" but "is this the exact answer to a slightly different input?" If yes, the algorithm is stable: it solved a nearby problem exactly, and the error is bounded by how far that nearby problem is from yours. If no, something in the computation is amplifying error, and you need to find where. This is backward error analysis, and it is how numerical analysts think about algorithms.
Document your assumptions: tolerances, expected value ranges, compiler flags. When a failure surfaces six months later, the investigation should start with data, not guesswork.
Summary
All the pieces we built, at a glance:
| Field | Bits | Position | Purpose |
|---|---|---|---|
| Sign | 1 | Bit 31 | 0 = positive, 1 = negative |
| Exponent | 8 | Bits 30-23 | Biased by 127. Stored range 0-255. |
| Mantissa | 23 | Bits 22-0 | Fractional part after implicit leading 1 |
And the special values that account for every remaining bit pattern:
| Value | Sign | Exponent (stored) | Mantissa | Formula |
|---|---|---|---|---|
| +0 | 0 | 0 | 0 | zero |
| -0 | 1 | 0 | 0 | zero |
| Subnormal | 0/1 | 0 | non-zero | (0 + M/2^23) × 2^-126 |
| Normal | 0/1 | 1-254 | any | (1 + M/2^23) × 2^(E-127) |
| +∞ | 0 | 255 | 0 | positive infinity |
| -∞ | 1 | 255 | 0 | negative infinity |
| NaN | 0/1 | 255 | non-zero | Not a Number |
We arrived here by hitting walls. No fractions, so we added a binary point. Wasted bit patterns, so we normalized. A predictable leading 1, so we stopped storing it. No negative exponents, so we added a bias. No way to represent zero, so we reserved a pattern. A gap near zero that breaks a != b guarantees, so we filled it with subnormals. Overflow, so we defined infinity. Undefined results, so we defined NaN. Each wall had one reasonable fix. The fixes, taken together, are IEEE 754.
Then we computed with these numbers, and the grid made arithmetic messy. Rounding on every operation. Tiny terms vanishing into large ones. Subtraction of nearby values promoting noise into the leading digits. 0.1 + 0.2 ≠ 0.3. None of that is a bug. It is what happens when you snap to a finite set of marks after every step.
Precision Cheat Sheet
Float32 is one point in a family of formats. They all use the same structure we derived (sign bit, biased exponent, mantissa with an implicit leading 1) but allocate bits differently. Five formats cover most practical work.
Interactive placeholder: side-by-side format explorer for FP16/BF16/TF32/FP32/FP64 showing field splits, epsilon, min normal/subnormal, and representable range.
Float64 does not simply double each field from float32. Exponent grows from 8 to 11 bits, while mantissa grows from 23 to 52 bits. Most extra bits go to precision because many applications need more significant digits, not more range.
bfloat16 is a clever hack: take float32's exponent range (8 bits, bias 127) but keep only 7 mantissa bits. Same range, much lower precision. Designed for deep learning workloads where dynamic range matters more than per-value accuracy.
float16 takes the opposite approach: 5 exponent bits give smaller range (max 65,504), but 10 mantissa bits give better precision than bfloat16. It works well when values stay in a narrower range.
TF32 is a hardware trick, not a storage format. Data stays in memory as float32. NVIDIA Tensor Cores on Ampere and later GPUs read float32 inputs, round mantissa from 23 bits to 10, multiply rounded operands, and accumulate in float32. You get float32 range with roughly float16 mantissa precision at tensor-core speed. In PyTorch: torch.backends.cuda.matmul.allow_tf32.
A few things the numbers above make concrete:
- Three formats share the same exponent field. bfloat16, TF32, and float32 all use 8 exponent bits with bias 127, so their normal-range magnitudes are similar (roughly 1.2 × 10^-38 to 3.4 × 10^38). They differ in mantissa width (7, 10, 23 bits), which changes epsilon and the subnormal floor.
- bfloat16 is “float32 with fewer mantissa bits.” It keeps the sign + exponent plus the top 7 mantissa bits of float32. Converting float32 → bfloat16 is rounding off the low 16 mantissa bits; converting back is zero-padding.
- Float16's ceiling is low. 65,504. Summing pixel values in [0, 255] gets there fast: 256 values of 255 already reach 65,280, and one more pushes a naive float16 sum past the limit (257 × 255 = 65,535 > 65,504). This is why mixed-precision training typically stores weights in float16 but accumulates gradients in float32.
- Not all hardware implements subnormals for narrow formats. Many GPUs flush bfloat16 subnormals to zero. The subnormal values in the table are theoretical minimums.
The FP16 vs BF16 choice comes down to which limit you hit first. FP16 has more mantissa bits (10 vs 7), so per-value rounding is smaller. BF16 has more exponent bits (8 vs 5), so it handles larger and smaller magnitudes before overflow or underflow. Same 16-bit budget, opposite tradeoff. This was the question I couldn't answer when I first looked at AMP. Now it is just bit allocation.
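You can poke at both 16-bit formats from pure Python: struct's 'e' format is IEEE float16, and bfloat16 is float32 with the low 16 mantissa bits dropped (the sketch below truncates; real hardware conversion usually rounds to nearest instead):

```python
import struct

def f16(x: float) -> float:
    """Round to the nearest IEEE float16 ('e' format) and back."""
    return struct.unpack('e', struct.pack('e', x))[0]

def bf16(x: float) -> float:
    """Truncate a float32 to bfloat16 by zeroing the low 16 bits.
    (Sketch only: hardware typically rounds rather than truncates.)"""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return struct.unpack('<f', struct.pack('<I', bits & 0xFFFF0000))[0]

print(f16(0.1))       # 0.0999755859375: only 10 mantissa bits survive
print(bf16(0.1))      # 0.099609375: only 7 mantissa bits survive
print(f16(65504.0))   # 65504.0: the float16 ceiling

try:
    struct.pack('e', 70000.0)   # beyond float16 range
except OverflowError as e:
    print('overflow:', e)       # CPython raises rather than returning inf
```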
IEEE also defines extended formats. Some FPUs use 80-bit extended internally, not as a user type, but to give implementations extra headroom for functions like exp, log, and decimal conversion before rounding back to float32/float64.
Try it yourself: enter a decimal number and watch it snap to the nearest representable float. Switch formats to see how the same value is stored with different precision:
Interactive placeholder: decimal-to-float snap visualizer with live bit fields, neighboring representable values, and ULP distance to the typed decimal.
This post started with two questions. Why does 0.1 + 0.2 return 0.30000000000000004? And what exactly changes when you cut a number's bit budget in half for a training run?
They have the same answer: the grid.
Binary can only terminate fractions whose denominators are powers of two. 0.1 and 0.2 are not among them, so both snap to nearby representable values before any arithmetic happens. The addition is exact. The inputs were not. That is the whole 0.1 + 0.2 story.
Cut the mantissa from 23 bits to 10 (float16), or 7 (bfloat16), or 3 (FP8 E4M3), and each exponent band gets fewer slots. The windows stay the same width, but the tick marks are farther apart. You lose precision per value, not necessarily range. Whether that matters depends on whether your computation is sensitive to those low-order bits. Gradient descent, it turns out, usually is not.
If you remember one thing from this post: floats are a finite grid, not a continuous line. Every operation snaps to the nearest mark. The format decides where those marks go. Everything else follows from that.
References and Further Reading
If you read one thing beyond this post, make it Goldberg's survey. It's dense but foundational. The error model and guard digit examples we used come from there.
Primary sources and further reading, roughly ordered from most accessible to most technical.
Visual guides and explainers
- Fabien Sanglard, Floating Point Visually Explained. Uses number-line diagrams to show how float spacing changes with magnitude and where precision gets lost. This is the one that first made float spacing click for me.
- The Floating-Point Guide: What Every Programmer Should Know. Concise answers to the "why does 0.1 + 0.2 != 0.3?" question, with language-specific guidance.
- Svar.dev, Demystifying Floating Point. Walkthrough from binary fractions to subnormals and special values.
- John D. Cook, Anatomy of a Floating Point Number.
- Bruce Dawson, Comparing Floating Point Numbers, 2012 Edition. Thorough treatment of absolute epsilon, relative epsilon, and ULP-based comparisons. Covers exactly the tradeoffs discussed in our "Comparing Floats" section.
- Bruce Dawson, That's Not Normal: the Performance of Odd Floats. Measured performance costs of subnormals on x86 hardware.
Videos and talks
- Computerphile, Floating Point Numbers.
- John Farrier, Demystifying Floating Point (CppCon 2015).
Interactive tools
- Float Exposed. Type a float, see the exact bits, sign/exponent/mantissa breakdown, and the stored value. I used this constantly while writing this post.
Technical references and papers
- David Goldberg (1991). What Every Computer Scientist Should Know About Floating-Point Arithmetic. ACM Computing Surveys, 23(1), 5-48. Long survey that introduces the standard error model (fl(a op b) = (a op b)(1 + δ), with |δ| ≤ ε), guard digits, and classic examples (including the cancellation example we used earlier).
- Hans-J. Boehm (2017). Small-data computing: correct calculator arithmetic. Communications of the ACM, 60(8), 44-49. Case study of Android's "Exact Calculator" and demand-driven ("constructive real") arithmetic for fully accurate displayed results.
- T.J. Dekker (1971). A Floating-Point Technique for Extending the Available Precision. Numerische Mathematik, 18(3), 224-242. The original double-double splitting technique referenced in our "Double-double arithmetic" section.
- IEEE Computer Society (2019). IEEE Standard for Floating-Point Arithmetic (IEEE 754-2019). The standard itself. Defines all the formats, rounding modes, and special values we derived. Surprisingly readable for a standards document.