You might have heard that the division operation on a modern FPU is slower than a multiplication operation. If you have not heard this before; it’s true – divisions are indeed slower. But why?
Without going into the technical details, the divide operation is an iterative algorithm and the multiplication operation is computed directly. One of the more common ways of doing divide operations on a FPU is to use a divide-free Newton-Raphson approximation to get the reciprocal of the denominator and then multiply with the numerator. More on this will come in a later post.
Let’s put it to the test.
Here we run 10.000.000 iterations where we divide 500 by 11.5
Here we multiply 500 by the multiplicative inverse (reciprocal) of 11.5
From what I described earlier our multiplication example (code example 2) should perform better than the division example. Here are the results:
Something is clearly wrong. First off all, they have virtually no difference in the time it took to compute the results. Second of all, the time it took to do the operations is really small. Don’t worry, there is a perfectly good reason for this:
We used constants to represent the numbers we needed to divide and multiply. The compiler knows this and use the approximation instructions on the FPU – we also did the same operation 10.000.000 times. This leads us to the next section of this post:
Micro-benchmarking is where you test a single or multiple operations to see what takes the most time. In this case, all we tested was that the division and multiplication examples took the same amount of time and we simply timed the cache and not the operations themselves.
There are a bunch of examples on the internet that shows us exactly how micro-benchmarking can deceive us. In this case, the compiler simply optimizes the code and CPU cache helped us to get the results fast. To truly test the performance of the divide and multiply operators, we need to make the compiler second guess; simply remove the const keyword and rerun the code. You will see that the multiply operator is up to 10 times faster (depending on platform and compiler mode).
Divisions are really slower than multiplication, but the CPU uses precise approximations by either using lookup tables or iterative algorithms. It is a clever piece of hardware that always tries to speed up your the execution of your code where ever possible. When you do micro-benchmarking, make sure the compiler generates the code you actually want to test. Often the compiler even strips away your for-loop and simply run the code once. If there is no way to get around this, simply run the code once yourself and use StopWatch.ElapsedTicks as measuring tool and it will tell you of any speed differences.