★★ Numerical Computing with IEEE Floating Point Arithmetic — Michael L. Overton

2025/02/12

I started reading this book to satisfy my curiosity about how arithmetic operations in floating point (FP) are carried out. Though it didn’t exactly cover this, I read it all anyway. It’s one of those rare books you wish existed but just hadn’t found yet.

It should be noted that this book was published in 2001, and the standard has been revised twice since then, in 2008 and 2019. I’m not sure to what extent these revisions are implemented in hardware. In any case, I doubt they make what this book covers irrelevant.

The book covers the IEEE 754-1985 standard in good detail, and touches on a little numerical analysis as it relates to FP. It’s perhaps more useful to a programmer than reading the standard itself. Peppered throughout are bits of FP history from before the standard was introduced in 1985; prior to this it was just a free-for-all. How awful it must have been! This history helps explain the purpose of particular features and decisions in the standard, and the problems you would run into without them. It is something to behold how nonsensical these problems would seem today: mostly poor handling of exceptions and bizarre results in edge cases.

I’ve wondered for some time why, in the age of cheap memory, fixed point numbers are not preferred, since they might be faster to compute. But this book really built up my understanding of floating point numbers, to the point where it seems to me that fixed point arithmetic mightn’t be more useful after all, save for some hypothetical, extremely narrowly defined embedded application.
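To make that trade-off concrete, here is a rough sketch of my own (not something from the book) comparing a 32-bit fixed point format with 32-bit IEEE single precision. The Q16.16 layout and the helper names `to_fixed`, `from_fixed`, `to_float32` and `relative_error` are assumptions picked purely for illustration.

```python
import struct

FRAC_BITS = 16          # hypothetical Q16.16 layout: 16 integer bits, 16 fractional bits
SCALE = 1 << FRAC_BITS


def to_fixed(x: float) -> int:
    """Quantize x to a signed 32-bit Q16.16 integer, raising if it does not fit."""
    raw = round(x * SCALE)
    if not -(1 << 31) <= raw < (1 << 31):
        raise OverflowError(f"{x} is outside the Q16.16 range")
    return raw


def from_fixed(raw: int) -> float:
    """Convert a Q16.16 integer back to a Python float."""
    return raw / SCALE


def to_float32(x: float) -> float:
    """Round x to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def relative_error(approx: float, exact: float) -> float:
    return abs(approx - exact) / abs(exact)


# Same 32 bits of storage, very different behaviour across magnitudes.
for x in (1e-6, 3.14159, 30000.5, 1e9):
    f32 = relative_error(to_float32(x), x)
    try:
        fx = relative_error(from_fixed(to_fixed(x)), x)
        print(f"{x:>10g}  float32 rel. err {f32:.1e}  Q16.16 rel. err {fx:.1e}")
    except OverflowError:
        print(f"{x:>10g}  float32 rel. err {f32:.1e}  Q16.16 overflow")
```

With this particular layout, an input as small as 1e-6 quantizes to zero and 1e9 does not fit at all, while single precision keeps its relative error below roughly 6e-8 across the whole range. A different split of integer and fractional bits moves the window but does not widen it.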

Chapters 1-7 should be recommended reading for every experienced programmer, as they cover the essentials of floating point arithmetic. Altogether they come to 47 pages, which isn’t much, but I’m sure everyone would gain significantly from them. Chapters 8-10 could be skipped: they mostly discuss implementation, and the book is old enough that this material is no longer relevant. Chapters 11-13 are about problems that come up when performing FP arithmetic, touching a little on numerical analysis. This part was good, but it was more difficult and more mathematical.

I believe it is the business of programmers at every level to be familiar with how a computer works at a ‘low level’, as opposed to the abstract machine their language and its environment provide. You can get away with not knowing a lot of it, sure, but you can’t escape the issues of FP arithmetic, no matter how high level your language and environment are.
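As a small illustration of that last point, here is a sketch in Python, about as high level as everyday languages get. The example is my own and not taken from the book; it uses only the standard library.

```python
import math
from decimal import Decimal

total = 0.1 + 0.2

# Neither 0.1, 0.2 nor 0.3 is exactly representable in binary FP,
# so the rounded sum differs from the rounded literal 0.3.
print(total == 0.3)                    # False

# Decimal(float) shows the exact binary value actually stored for 0.1.
print(Decimal(0.1))

# Comparing with a tolerance is the usual workaround.
print(math.isclose(total, 0.3))        # True

# Rounding errors also accumulate over repeated additions.
tenths = [0.1] * 10
print(sum(tenths) == 1.0)              # False
print(math.fsum(tenths) == 1.0)        # True: fsum tracks the lost low-order bits
```

None of this is a Python quirk. On typical hardware the language simply hands the arithmetic to IEEE double precision, so the same results turn up in any language that does likewise.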