I had put off posting this, but since I see development has begun again...
This is again from Agner Fog. Agner lives in Denmark, so perhaps he could be contacted (or visited?) for comment and/or assistance with the current design philosophy.
Again, I'm only trying to encourage a re-evaluation of the design/testing philosophy. I think the product is good, but it could be made better.
From http://www.agner.org/optimize/optimizing_cpp.pdf, pages 161-162:
16.1 The pitfalls of unit-testing
It is common practice in software development to test each function or class separately. This unit-testing is necessary for verifying the functionality of an optimized function, but unfortunately the unit-test doesn't give the full picture of the function's performance in terms of speed.
Assume that you have two different versions of a critical function and you want to find out which one is fastest. The typical way to test this is to make a small test program that calls the critical function many times with a suitable set of test data and measure how long it takes. The version that performs best under this unit-test may have a larger memory footprint than the alternative version.
The penalty of cache misses is not seen in the unit-test because the total amount of code and data memory used by the test program is likely to be less than the cache size.
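To make the pitfall concrete, here is a minimal sketch (my own illustration, not from the PDF) of the kind of unit-test benchmark being described; critical_function is just a placeholder. The point is that the test data stays resident in cache, so the timing can't show cache-miss penalties:

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

// Placeholder for the critical function under test.
uint64_t critical_function(const std::vector<uint32_t>& data) {
    uint64_t sum = 0;
    for (uint32_t x : data) sum += x;
    return sum;
}

int main() {
    // 1000 x 4 bytes = 4 KB: fits easily in the L1 data cache, so after
    // the first call every access is a cache hit and the measurement
    // cannot reveal the cache-miss penalty seen in the full program.
    std::vector<uint32_t> data(1000, 1);

    const int iterations = 100000;
    uint64_t sink = 0;  // consume results so the calls aren't optimized away

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        sink += critical_function(data);
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::nano> elapsed = stop - start;
    std::printf("avg %.1f ns/call (sink=%llu)\n",
                elapsed.count() / iterations, (unsigned long long)sink);
}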
When the critical function is inserted in the final program, it is very likely that code cache and data cache are critical resources. Modern CPUs are so fast that the clock cycles spent on executing instructions are less likely to be a bottleneck than memory access and cache size.
If this is the case, then the optimal version of the critical function may be the one that takes more time in the unit-test but has a smaller memory footprint.
If, for example, you want to find out whether it is advantageous to unroll a big loop, then you cannot rely on a unit-test that doesn't take cache effects into account.
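Here's a quick sketch of that trade-off (again mine, not Agner's): two versions of the same hypothetical summing function. The unrolled one will usually win an isolated timing test, but its larger body takes up more of the code cache in the final program:

#include <cstddef>
#include <cstdint>

// Compact version: small code footprint, one loop branch per element.
uint64_t sum_simple(const uint32_t* p, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

// Unrolled by 4: fewer loop branches, often faster in a micro-benchmark,
// but the bigger function body occupies more code cache in the full program.
uint64_t sum_unrolled(const uint32_t* p, size_t n) {
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += p[i];
        s1 += p[i + 1];
        s2 += p[i + 2];
        s3 += p[i + 3];
    }
    for (; i < n; ++i) s0 += p[i];  // leftover elements
    return s0 + s1 + s2 + s3;
}

int main() {
    uint32_t a[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    return sum_simple(a, 10) == sum_unrolled(a, 10) ? 0 : 1;
}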
You can calculate how much memory a function uses by looking at a link map or an assembly listing. Use the "generate map file" option for the linker. Both code cache use and data cache use can be critical. The branch target buffer is also a cache that can be critical. Therefore, the number of jumps, calls and branches in a function should also be considered.
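(For what it's worth: with GCC or Clang and the GNU linker, that map file can be requested with -Wl,-Map=prog.map, and the MSVC linker has a /MAP switch; the exact spelling of the option depends on your toolchain.)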
A realistic performance test should include not only a single function or hot spot but also the innermost loop that contains the critical functions and hot spots. The test should be performed with a realistic set of data in order to get reliable results for branch mispredictions. The performance measurement should not include any part of the program that waits for user input. The time used for file input and output should be measured separately.
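Putting those rules together, here is a sketch of a more realistic measurement (Record, process and the sizes are made up for illustration): the data set is far larger than the cache, the keys are random so branch prediction is exercised realistically, and setup happens outside the timed region, the same way file I/O would be timed separately.

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

// Hypothetical record type and critical function, for illustration only.
struct Record { uint32_t key; uint32_t value; };

uint64_t process(const Record& r) { return r.key % 3 ? r.value : 0; }

int main() {
    // Build a large, realistic data set before timing starts. In a real
    // program this might be file input, which would be measured separately.
    std::mt19937 rng(42);
    std::vector<Record> records(10'000'000);  // ~80 MB, far bigger than cache
    for (auto& r : records) r = { rng(), rng() };

    // Time the innermost loop together with the critical function, so the
    // measurement includes cache pressure and branch mispredictions.
    auto t0 = std::chrono::steady_clock::now();
    uint64_t total = 0;
    for (const Record& r : records) total += process(r);
    auto t1 = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::milli> ms = t1 - t0;
    std::printf("total=%llu in %.2f ms\n", (unsigned long long)total, ms.count());
}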
The fallacy of measuring performance by unit-testing is unfortunately very common. Even some of the best optimized function libraries available use excessive loop unrolling so that the memory footprint is unreasonably large.