Floating point numbers

This is an introduction to some of the basic concepts of floating point numbers in the IEEE 754 floating point standard. We will build up to prove some interesting facts about their representation.

Let's start simple. How about a basic refresher on binary numbers (the subscript at the end of a number indicates its base):

Binary review

Converting a binary number to decimal:

1011₂ = 1·2³ + 0·2² + 1·2¹ + 1·2⁰ = 8 + 2 + 1 = 11₁₀

With a binary fraction, we simply continue the pattern:

0.101₂ = 1·2⁻¹ + 0·2⁻² + 1·2⁻³ = 0.5 + 0 + 0.125 = 0.625₁₀
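As a sketch, the pattern above can be coded up directly in Python (`binary_to_decimal` is a hypothetical helper written for this refresher, not part of any library):

```python
def binary_to_decimal(s):
    """Convert a binary string, optionally with a fractional part, to a float."""
    if "." in s:
        int_part, frac_part = s.split(".")
    else:
        int_part, frac_part = s, ""
    value = 0.0
    # Integer bits contribute positive powers of two, least significant first...
    for i, bit in enumerate(reversed(int_part)):
        value += int(bit) * 2**i
    # ...and fractional bits continue the pattern with negative powers of two.
    for i, bit in enumerate(frac_part, start=1):
        value += int(bit) * 2**-i
    return value

print(binary_to_decimal("1011"))   # 11.0
print(binary_to_decimal("0.101"))  # 0.625
```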

The floating point format

Single precision floating point numbers are laid out as follows:

sign (bit 31) | exponent (bits 30–23) | fraction (bits 22–0)

  • Bit 31 is the sign bit. A value of 0 indicates a positive number; a value of 1 indicates a negative number.
  • The next 8 bits (30 through 23) give us the exponent exp.
  • The last 23 bits (22 through 0) give us the fraction.

There are 32 bits in all.
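We can inspect this layout directly by packing a float into its 4 bytes and masking out the fields. A minimal Python sketch using the standard `struct` module (`float_fields` is a hypothetical helper name):

```python
import struct

def float_fields(x):
    """Split a single precision float into its sign, exponent, and fraction fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31            # bit 31
    exp = (bits >> 23) & 0xFF    # bits 30 through 23
    fraction = bits & 0x7FFFFF   # bits 22 through 0
    return sign, exp, fraction

print(float_fields(1.0))   # (0, 127, 0)
print(float_fields(-1.5))  # (1, 127, 4194304)
```

Note that `struct.pack(">f", x)` rounds the Python double to the nearest single precision value before we read its bits.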

More succinctly, here is the formula:

(-1)^sign × 1.fraction₂ × 2^(exp - 127)

The first term in the formula gives the sign of the float via the sign bit.

The second term in the formula is a binary fraction.

The exponent exp is biased by subtracting 127. This allows our biased exponent to take both positive and negative values.
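The formula can be sketched in Python as follows (`decode` is a hypothetical helper; it ignores special cases such as zero and subnormal numbers):

```python
def decode(sign, exp, fraction):
    """Evaluate (-1)**sign * 1.fraction * 2**(exp - 127) for a normalized float."""
    mantissa = 1 + fraction / 2**23           # implicit leading 1, then 23 fraction bits
    return (-1)**sign * mantissa * 2**(exp - 127)

print(decode(0, 127, 0))        # 1.0
print(decode(1, 128, 1 << 22))  # -3.0
```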

Here are some examples of converting from the single precision floating point format to a decimal number:

Ex1: 0 01111111 10000000000000000000000

(-1)⁰ × 1.1₂ × 2^(01111111₂ - 127) = 1.1₂ × 2⁰ = 1.5

Ex2: 0 10000011 01000000000000000000000

(-1)⁰ × 1.01₂ × 2^(10000011₂ - 127) = 1.25 × 2⁴ = 20

Ex3 (a different way of looking at Ex2):

1.01₂ × 2⁴ = 10100₂ = 20

In the last step, we simply move the binary point over four places. This is binary scientific notation!
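One way to check conversions like these is to build the 32-bit pattern and reinterpret the bytes as a float. A small Python sketch, using a pattern whose exponent field is 131 (so the value is the fraction times 2⁴):

```python
import struct

# Fields written left to right: sign (1 bit), exponent (8 bits), fraction (23 bits).
bits = 0b0_10000011_01000000000000000000000

# Pack the integer into 4 bytes, then reinterpret those bytes as a float.
(value,) = struct.unpack(">f", struct.pack(">I", bits))
print(value)  # 20.0
```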

Recall the single precision floating point number formula:

(-1)^sign × 1.fraction₂ × 2^(exp - 127)

We can now see that floating point numbers are essentially represented in binary scientific notation. In particular, floating point numbers are expressed in normalized scientific notation.

In binary scientific notation, all numbers are written in the form m × 2ⁿ. In normalized binary scientific notation, the exponent n is chosen so that the absolute value of m remains at least one but less than two (1 ≤ |m| < 2).
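Python's `math.frexp` returns a mantissa in [0.5, 1); shifting it by one place gives the normalized form described above. A small sketch (`normalize` is a hypothetical helper; it does not handle zero):

```python
import math

def normalize(x):
    """Write nonzero x as (m, n) with x = m * 2**n and 1 <= |m| < 2."""
    m, n = math.frexp(x)   # frexp gives x = m * 2**n with 0.5 <= |m| < 1
    return m * 2, n - 1    # shift the binary point one place to normalize

print(normalize(20.0))  # (1.25, 4)
print(normalize(1.5))   # (1.5, 0)
```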

With this knowledge, we can prove some interesting facts.

Floating point representation uniqueness

Claim: Every number representable in the floating point format has exactly one representation (setting aside special cases such as zero).

Proof: This follows from the normalized scientific notation representation of floating point numbers. For a nonzero number x, there is exactly one exponent n with 1 ≤ |x| / 2ⁿ < 2, and that choice of n determines m = x / 2ⁿ. The sign, exponent, and fraction bits then follow directly.

Lexicographical ordering of the floating point format

From Bruce Dawson:

> The IEEE float and double formats were designed so that the numbers are "lexicographically ordered", which – in the words of IEEE architect William Kahan – means "if two floating-point numbers in the same format are ordered (say x < y), then they are ordered the same way when their bits are reinterpreted as Sign-Magnitude integers."

Let's try to prove this fact!

Claim: Let f1 < f2 where f1 and f2 are both positive floats. When their bits are interpreted as integers i1 and i2 respectively, i1 < i2.

Proof: Recall the floating point representation:

(-1)^sign × 1.fraction₂ × 2^(exp - 127)

Since f1 and f2 are positive, we know the sign bit will be zero for both.

We have two cases:

  1. The exponent of f1 is less than the exponent of f2
  2. The exponent of f1 is equal to the exponent of f2

Note that we cannot have the exponent of f1 be greater than the exponent of f2 -- that would violate our assumption that f1 < f2.

Let b30 b29 … b23 and c30 c29 … c23 represent the eight exponent bits of f1 and f2 respectively (b30 and c30 being the most significant). In case (1), we know that there exists some i between 23 and 30 (inclusive) such that bi < ci, and for all j > i, bj = cj. In other words, if the exponent of f2 is larger, there is a first (most significant) bit where the exponent bits differ, and f2's bit is the larger one.

It follows that i1 < i2: the eight exponent bits precede the 23 fraction bits, so the difference at bit i outweighs any difference in the lower-order bits below it.

In case (2), since the exponents are equal, we know that f2's fraction must be larger than f1's fraction. The fraction fields occupy the same (low-order) bit positions, so it follows that i1 < i2.
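The claim can also be spot-checked empirically by reinterpreting float bits as integers. A Python sketch (positive single precision floats only, matching the claim; `as_int` and `to_f32` are hypothetical helper names):

```python
import random
import struct

def as_int(f):
    """Reinterpret a single precision float's bits as an unsigned 32-bit integer."""
    (i,) = struct.unpack(">I", struct.pack(">f", f))
    return i

def to_f32(x):
    """Round a Python float to the nearest single precision value."""
    (f,) = struct.unpack(">f", struct.pack(">f", x))
    return f

# Whenever two positive float32 values are ordered, their bit patterns
# (read as integers) are ordered the same way.
random.seed(0)
for _ in range(1000):
    a = to_f32(random.uniform(1e-30, 1e30))
    b = to_f32(random.uniform(1e-30, 1e30))
    if a < b:
        assert as_int(a) < as_int(b)
print("all ordered pairs agree")
```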

Addendum

I have not covered special cases in the floating point format, in particular the representation of zero and the representation of subnormal numbers (that link does not give a great explanation of the concept, but I am unaware of a better source). Observant readers may have noticed there is no way to represent zero in the floating point format given the formulas I have provided!

Extending these proofs to cover the zero and subnormal cases is left as an exercise for the reader.

Another interesting fact, which I will not cover in detail here, is that for floats of the same sign:

  1. Adjacent floats have adjacent integer representations
  2. Incrementing the integer representation of a float moves to the next representable float, moving away from zero

(credit: Bruce Dawson for the wording of these facts)
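Fact (2) can be seen directly by incrementing a float's bit pattern. A Python sketch for single precision (`next_float_up` is a hypothetical helper; it ignores the edge cases at zero and infinity):

```python
import struct

def next_float_up(f):
    """Step to the next representable single precision float by adding 1 to the bits."""
    (i,) = struct.unpack(">I", struct.pack(">f", f))
    (g,) = struct.unpack(">f", struct.pack(">I", i + 1))
    return g

# The float just above 1.0 differs by 2**-23, the value of the last fraction bit.
print(next_float_up(1.0) == 1 + 2**-23)  # True
```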

Hopefully this introduction to floating point numbers is enough to give you an intuition for why these facts are true.
