Quote:
I think the problem is that Hans is using a very powerful computer to test "loads", and this gives values at the very high end of the range for "old" CPUs. I mean, 1% on his side may equal 10% on some other users' CPUs. That is why other users report "Confirmed. Slightly more CPU usage here also".
Maybe the optimizations target a small range of CPUs around his own, and for other CPUs these optimizations are not that effective.
I know that he throttles his computer to the lowest possible frequency (he has mentioned this a couple of times), but it looks like that's not enough to explain the differences.
I think the problem lies elsewhere: in the CPU architecture itself. You can't simply optimize the same code for all kinds of CPUs.

You just can't do it. And it's not really a problem; this is what I call research and development.

Core 2 is a totally different beast from Sandy Bridge and its successors: the newer cores execute more instructions per cycle, partly thanks to improvements in the microarchitecture itself and partly thanks to new instruction sets that the code can take advantage of.
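To illustrate the kind of per-architecture optimization this implies (a generic sketch of my own, not Hans's actual code), here is how a program can choose a code path at runtime based on what the CPU supports. The function names are made up for the example, and __builtin_cpu_supports()/__builtin_cpu_init() are GCC/Clang builtins on x86:

```c
/* cpu_dispatch.c -- pick a code path at runtime from CPU features.
   Build with GCC or Clang on x86: gcc -O2 cpu_dispatch.c */
#include <stdio.h>

/* Baseline path: plain scalar code, runs on any x86 CPU (Core 2, Athlon II, ...). */
static void add_scalar(const float *a, const float *b, float *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* AVX path: the same loop, but this one function is compiled with AVX
   enabled, so the compiler can vectorize it 8 floats at a time.
   Only CPUs from Sandy Bridge / Bulldozer onwards can execute it. */
__attribute__((target("avx")))
static void add_avx(const float *a, const float *b, float *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

typedef void (*add_fn)(const float *, const float *, float *, int);

int main(void) {
    __builtin_cpu_init();  /* initialize the CPU feature probe */
    int has_avx = __builtin_cpu_supports("avx");
    add_fn add = has_avx ? add_avx : add_scalar;

    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float out[8];
    add(a, b, out, 8);
    printf("out[0] = %.0f (%s path)\n", out[0], has_avx ? "AVX" : "scalar");
    return 0;
}
```

A binary built this way runs the vector loop on newer CPUs and quietly falls back to the scalar loop on older ones, which is roughly why the same build of a program can behave so differently across CPU generations: the fast path simply never executes on the old hardware.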
Having tested various versions of ST on multiple Intel and AMD platforms, I'd say Hans is optimizing for Intel. You can see it for yourself by comparing ST performance on Intel/AMD against synthetic benchmark results such as PassMark: two CPUs can score equally well in PassMark while ST runs significantly better on the Intel one. That's for sure. As an example, an Athlon II 250 dual core (3.0 GHz, 2M cache, 4 GB DDR3-1333, Win7 Pro x64 SP1, PassMark score 1.7k) performed only marginally better than a C2D E6400 (2.13 GHz, 2M cache, 3 GB DDR2-800, Win7 Pro x64 SP1, PassMark score 1.3k). The E6400, on the other hand, performed much better than an E2220 (2.40 GHz, 1M cache, PassMark score 1.3k) in the same system config. That's why I asked how much ST depends on CPU cache; it looks like it uses the cache a lot.
The same happens when older and newer generations of Intel CPUs are compared.
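On the cache question, here is a crude, self-contained probe (again my own sketch, nothing to do with ST's internals) that shows the effect: it sweeps buffers of growing size while keeping the total work constant, so any slowdown comes from the working set no longer fitting in cache. On a 2M-cache part like the E6400, the jump should appear at a larger working-set size than on the 1M E2220:

```c
/* cache_probe.c -- rough probe of throughput vs. working-set size.
   Real memory benchmarks use pointer chasing and careful timing;
   this is only meant to make the cache cliff visible. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    /* Working sets from 64 KB (fits in L1/L2) up to 32 MB (main memory). */
    for (size_t kb = 64; kb <= 32 * 1024; kb *= 2) {
        size_t n = kb * 1024 / sizeof(int);
        int *buf = malloc(n * sizeof(int));
        if (!buf) return 1;
        for (size_t i = 0; i < n; i++)
            buf[i] = (int)i;

        /* Touch the same total number of elements at every size, so a
           slower time means the data stopped fitting in cache. */
        const size_t total = 64u * 1024 * 1024;
        size_t reps = total / n;
        volatile long sum = 0;  /* volatile keeps the loop from being optimized away */
        clock_t t0 = clock();
        for (size_t r = 0; r < reps; r++)
            for (size_t i = 0; i < n; i++)
                sum += buf[i];
        double ms = 1000.0 * (clock() - t0) / CLOCKS_PER_SEC;

        printf("%6zu KB: %8.1f ms\n", kb, ms);
        free(buf);
    }
    return 0;
}
```

If ST's hot data structures behave like the larger working sets here, a CPU with twice the cache wins even at a lower clock, which would fit the E6400 vs. E2220 result above.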