What is Floating-Point Arithmetic?

Floating-point arithmetic is considered an esoteric subject by many people. This is rather surprising, because floating-point is ubiquitous in computer systems. The name comes from what happens to the decimal point when a number is written in exponential notation: the point can "float" to a different position each time the exponent changes.

Almost every computer language has a floating-point datatype; computers from PCs to supercomputers have floating-point accelerators; most compilers will be called upon to compile floating-point algorithms from time to time; and virtually every operating system must respond to floating-point exceptions such as overflow.

Consider the number 158 - it can be written using exponential notation as:
1.58 * 10^2
15.8 * 10^1
158 * 10^0
0.158 * 10^3
1580 * 10^-1, etc.

All of these representations of the number 158 are numerically equivalent. They differ only in their normalization, that is, in where the decimal point appears in the first factor. In each case, the number before the multiplication operator "*" holds the significant figures of the number; we call this number the significand.
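The equivalence of these forms can be checked with a short sketch. It is written in Python and uses the standard decimal module so that the arithmetic is exact; the helper name value and the list name forms are made up for this illustration.

```python
from decimal import Decimal

# (significand, exponent) pairs taken from the list of forms of 158.
# Each pair stands for: significand * 10**exponent.
# The names `forms` and `value` are illustrative, not from any library.
forms = [("1.58", 2), ("15.8", 1), ("158", 0), ("0.158", 3), ("1580", -1)]

def value(significand: str, exponent: int) -> Decimal:
    """Evaluate significand * 10**exponent exactly in decimal arithmetic."""
    return Decimal(significand) * Decimal(10) ** exponent

values = [value(s, e) for s, e in forms]

# Every form denotes the same number; only the position of the point differs.
all_equal = all(v == Decimal(158) for v in values)
print(all_equal)
```

Decimal is used here instead of the built-in float so that the comparison itself is not disturbed by binary rounding.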

Squeezing infinitely many real numbers into a finite number of bits requires an approximate representation. Although there are infinitely many integers, in most programs the result of integer computations can be stored in 32 bits.

In contrast, given any fixed number of bits, most calculations with real numbers will produce quantities that cannot be exactly represented using that many bits. Therefore the result of a floating-point calculation must often be rounded in order to fit back into its finite representation. This rounding error is the characteristic feature of floating-point computation.
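A small sketch makes the rounding error visible. It assumes Python floats are IEEE 754 double-precision values, which is the case on essentially all common platforms, and it uses the standard fractions module to read back the exact value that was actually stored.

```python
from fractions import Fraction

# Assumption: float is an IEEE 754 double (true for CPython on common platforms).
x = 0.1  # the literal is rounded to the nearest representable double

# Fraction(x) recovers the exact rational value held in the float.
stored = Fraction(x)
rounding_error = stored - Fraction(1, 10)

# One tenth has no finite binary expansion, so the stored value is not 1/10.
print(rounding_error != 0)

# Rounding at each step also means familiar identities can fail.
sum_matches = (0.1 + 0.2 == 0.3)
print(sum_matches)
```

The error for a single value is tiny, but it is present in almost every operation, which is why floating-point results are usually compared with a tolerance rather than with exact equality.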
