
Floating point quirks


1. A closed form equation for S_n

The problem: compute S_n = \sum_{k=1}^n \frac{1}{k(k+1)} using only single precision storage. It’s a really simple problem, and if you do a little bit of algebra, you can make the following simplification

    \[S_n = \sum_{k=1}^n \frac{1}{k(k+1)} = \sum_{k=1}^n \frac{(k+1) - k}{k(k+1)} = \sum_{k=1}^n \left(\frac{1}{k} - \frac{1}{k+1}\right)\]

From this, the sum telescopes: S_n = 1 - \frac{1}{2} + \frac{1}{2} - \frac{1}{3} + \frac{1}{3} - \dots - \frac{1}{n} + \frac{1}{n} - \frac{1}{n+1}, so S_n = 1 - \frac{1}{n+1}. In Matlab:

function S = S(n)
    S = 1 - 1/(1+n);
end

Great, now that we have a closed form expression for S_n, what else do we need to do? Of course, just to make sure that my derivation was sound, I wrote out the full summation program anyway. (Note: the behavior below only happens when you use single precision floating point storage.)

n = 1e6;
S = 0;
for k=1:n                                 % forward: largest terms first
    S = single(double(S) + 1/(k*(k+1)));  % add in double, round the running sum to single
end
(1 - 1/(n+1)) - S                         % error against the closed form

After computing S'_n for a few small values of n and comparing them against 1 - \frac{1}{n+1}, I found that S' seemed to compute the same value as my expression; but for large values of n, S'_n rapidly drifted away from the value computed by S_n. I attributed this to roundoff error (after all, that is what it is), concluded that the more expensive summation isn’t a good solution at all, and thought that was the end of it.

2. Uh oh

Later that day, I showed my friend this question and he told me that while he agrees that the closed form formula is indeed better, he saw no significant loss of accuracy in his implementation. For n equal to a million, his solution had an error of only about 10^{-8} (on par with \epsilon_{mach} for single precision), using only single precision storage types.

“What? No, that’s not possible, I already tried it on mine and I got an error of around 10^{-4}.”
“Well, try it yourself.”

So I ran his code in Matlab. At first I saw no difference between our two implementations, but then it hit me: “he’s computing the series backwards.”

n = 1e6;
S = 0;
for k=n:-1:1                              % backward: smallest terms first
    S = single(double(S) + 1/(k*(k+1)));  % add in double, round the running sum to single
end
(1 - 1/(n+1)) - S                         % error against the closed form

and sure enough, I got the same error he did. What the hell is going on?

3. An explanation

It turns out that the problem stems from how roundoff works for addition under the default round-to-nearest mode. IEEE arithmetic guarantees that each individual addition is correctly rounded: the result is the exact sum, rounded to the nearest representable number. So no single addition is the culprit; the trouble starts when the two addends differ wildly in magnitude.

Now, a single precision floating point number carries only about seven significant decimal digits (\epsilon_{mach} \approx 1.2 \times 10^{-7}), so adding 10^{-10} to 1 simply returns 1: representing 1 + 10^{-10} exactly would take more digits than the format can store, so the low-order digits are rounded away. In general, you can assume that if two positive numbers differ by eight or more orders of magnitude, adding the smaller to the larger has no effect in single precision.
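You can check this absorption directly in Matlab; a minimal sketch, assuming the default round-to-nearest-even mode (the thresholds follow from \epsilon_{mach}):

eps('single')                            % 2^-23, about 1.19e-7
single(1) + single(1e-10) == single(1)   % true: 1e-10 is absorbed entirely
single(1) + single(2^-24) == single(1)   % true: exactly half an ulp ties to even
single(1) + single(1e-7)  == single(1)   % false: 1e-7 is over half an ulp, so it registers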

Now, when we compute the series in the forward direction, what we’re really doing is computing an approximation \hat S_n \approx S_n via the recurrence

    \begin{align*} \hat S_0 &= 0 \\ \hat S_1 &\approx \hat S_0 + \frac{1}{2} \\ \hat S_2 &\approx \hat S_1 + \frac{1}{6} \\ \hat S_3 &\approx \hat S_2 + \frac{1}{12} \\ \hat S_4 &\approx \hat S_3 + \frac{1}{20} \\ &\vdots \\ \hat S_n &\approx \hat S_{n-1} + \frac{1}{n(n+1)} \end{align*}

As we’ve already shown, \hat S_{k-1} \approx S_{k-1} = 1 - \frac{1}{k} = \frac{(k-1)(k+1)}{k(k+1)}, so at step k we’re really making the floating point computation \textsc{Flt}\left(\frac{k^2 - 1}{k(k+1)} + \frac{1}{k(k+1)}\right). Relatively speaking, the running sum is about k^2 - 1 \approx k^2 times larger than the incoming term; furthermore, that ratio passes 8 orders of magnitude when k is merely ten thousand (though roundoff is noticeable much earlier). Therefore, for the last 990,000 iterations, our method is effectively doing nothing: each new term is absorbed without a trace. You can verify this in Matlab by checking (with format long) that the forward single precision sum gives the same value for n = 10^4 as for n = 10^6, even though the true answers differ by about 10^{-4}.
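If you want to watch the stall happen, here is a minimal sketch (the bookkeeping variables Snew and last_change are my own) that records the last iteration at which the single precision running sum still changed:

n = 1e6;
S = single(0);
last_change = 0;
for k = 1:n
    Snew = single(double(S) + 1/(k*(k+1)));  % same update as the forward loop
    if Snew ~= S
        last_change = k;                     % this term still registered
    end
    S = Snew;
end
last_change                                  % far below n: all later terms are absorbed
(1 - 1/(n+1)) - S                            % the familiar ~1e-4 error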

Okay, so how does doing this summation in reverse help in any way, shape, or form? Well, suppose that each iteration of the reverse method produces a number R_k at step k; then R_k = S_n - S_{n-k}, the sum of the k smallest terms. Furthermore, for large enough n, S_n \approx 1, so R_k \approx 1 - S_{n-k} = 1 - \left(1 - \frac{1}{n-k+1}\right) = \frac{1}{n-k+1}. The inductive definition also states that R_k = R_{k-1} + a_{n-k+1} = R_{k-1} + \frac{1}{(n-k+1)^2 + (n-k+1)}.

Now, there’s something intriguing about this expression: (n-k+1)^2 + (n-k+1) \approx (n-k+1)^2, so the ratio between R_{k-1} and a_{n-k+1} is around (n-k+1), which is O(n-k) and can never exceed 10^6 for k < n \le 10^6, comfortably below the 10^8 absorption threshold. (In relative terms the ratio is largest at the beginning, but that is also where the approximation R_{k-1} \approx \frac{1}{n-k+2} breaks down: the true running sum is even smaller there, so the two addends are closer in magnitude than the estimate suggests. Either way, the ratio shrinks as k grows, so the roundoff gets progressively better.) Therefore, rather than having the vast majority of the computations do nothing, every step now counts. This is largely the reason why summing the series backwards gives less roundoff overall than doing the “same” set of operations in the other direction. It is one case where the non-associativity of floating point arithmetic will give you trouble.
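Finally, a minimal side-by-side sketch of the two summation orders; the expected error magnitudes in the last comment are the ones reported earlier in this post:

n = 1e6;
Sf = single(0);                                % forward accumulator
Sb = single(0);                                % backward accumulator
for k = 1:n
    Sf = single(double(Sf) + 1/(k*(k+1)));     % largest terms first
    j = n - k + 1;
    Sb = single(double(Sb) + 1/(j*(j+1)));     % smallest terms first
end
exact = 1 - 1/(n+1);
[exact - double(Sf), exact - double(Sb)]       % roughly [1e-4, 1e-8]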


neat :)

Next: How to fix floating point error

