@Brian: I have checked the vectorization reports for all performance-sensitive functions, and for nearly all of them I either got them to vectorize or I understand why they don't.
I'm trying to squeeze the last bit of optimization out of a program using Intel C++ 10.1 (because with later versions I'm getting slower code - I'll look into that later).
When looking at the vectorization reports, I noticed two things I hadn't expected, and I wonder if they can be fixed (without rewriting lots of code - total code base is over 2 MB and I'm working on it alone). I've tried to google them but didn't find any useful answers.
This one seems to be the most important:
fft_abs_sse2[2*cc] = max(fft_abs_sse2[2*cc], strength * m);
.\Clip1Ch.cpp(1999): (col. 13) remark: vector dependence: proven ANTI dependence between fft_abs_sse2 line 1999, and fft_abs_sse2 line 1999.
.\Clip1Ch.cpp(1999): (col. 13) remark: vector dependence: proven ANTI dependence between fft_abs_sse2 line 1999, and fft_abs_sse2 line 1999.
.\Clip1Ch.cpp(1999): (col. 13) remark: vector dependence: proven FLOW dependence between fft_abs_sse2 line 1999, and fft_abs_sse2 line 1999.
.\Clip1Ch.cpp(1999): (col. 13) remark: vector dependence: proven FLOW dependence between fft_abs_sse2 line 1999, and fft_abs_sse2 line 1999.
.\Clip1Ch.cpp(1999): (col. 13) remark: vector dependence: proven ANTI dependence between fft_abs_sse2 line 1999, and fft_abs_sse2 line 1999.
...
I do know that there's an _mm_max_ SIMD instruction, though. The problem might be the definition of max. I'm using:
#define max(a,b) (((a)>(b)) ? (a) : (b))
The compiler might treat this as an if statement if it's unable to optimize everything out. Is there a better definition for max that doesn't cause the compiler to see dependencies where there are none?
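One thing that might be worth trying here is an inline function instead of the macro. The macro expands fft_abs_sse2[2*cc] twice (once in the comparison, once in the result), while a function evaluates each argument once, so the loop body becomes one load, one compare and one store. This is only a sketch: max_f and update_peaks are hypothetical names, m is assumed to be an array (in the real code it may be a scalar computed in the loop), and the two arrays are assumed not to overlap. The stride of 2 remains, so whether the report actually changes has to be checked with the compiler.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical replacement for the max macro: each argument is evaluated
// exactly once, so the array element is read a single time.
static inline float max_f(float a, float b)
{
    return (a > b) ? a : b;
}

// Hypothetical helper mirroring the loop body from the post.
// Assumption: m is an array with one value per cc; fft_abs holds 2*n floats.
void update_peaks(float* fft_abs, const float* m, float strength, std::size_t n)
{
    assert(fft_abs != m); // the sketch assumes the arrays do not overlap
    for (std::size_t cc = 0; cc < n; ++cc)
    {
        const float old_value = fft_abs[2 * cc];       // single explicit load
        const float candidate = strength * m[cc];
        fft_abs[2 * cc] = max_f(old_value, candidate); // single store
    }
}
```

Only the even-indexed elements are touched, exactly as in the original statement.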
Another situation that occurs very frequently in my code is this:
for (int c=0; c<f1; c++)
{
temp[2*c] *= one_DIV_bass_static_clip_level_dynamic;
temp[2*c+1] *= one_DIV_bass_static_clip_level_dynamic;
}
Clearly, there are no dependencies between temp[2*c] and temp[2*c+1], but the compiler thinks otherwise:
.\Clip1Ch.cpp(797): (col. 9) remark: loop was not vectorized: existence of vector dependence.
.\Clip1Ch.cpp(800): (col. 13) remark: vector dependence: proven FLOW dependence between temp line 800, and temp line 799.
.\Clip1Ch.cpp(800): (col. 13) remark: vector dependence: proven ANTI dependence between temp line 800, and temp line 799.
.\Clip1Ch.cpp(800): (col. 13) remark: vector dependence: proven OUTPUT dependence between temp line 800, and temp line 799.
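Since both statements in that loop multiply by the same factor, one rewrite that might be worth testing is folding the pair into a single unit-stride loop over 2*f1 elements. The sketch below assumes temp is a float array holding at least 2*f1 contiguous values; scale_interleaved is a hypothetical name. It touches the same elements as the original loop, but whether the compiler then vectorizes it still has to be confirmed in the report.

```cpp
#include <cassert>

// Hypothetical helper: scales the first 2*f1 floats of temp in one
// unit-stride pass. This covers temp[2*c] and temp[2*c+1] for c in [0, f1).
void scale_interleaved(float* temp, int f1, float factor)
{
    assert(f1 >= 0);
    const int n = 2 * f1; // total number of floats to scale
    for (int i = 0; i < n; ++i)
    {
        temp[i] *= factor;
    }
}
```

If the dependence remarks persist even in this form, Intel's #pragma ivdep on the loop is another option to try, provided the accesses are known not to alias.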
I think if these two situations are solved, at least 50% of the loops that currently don't get vectorized will be. Your help is greatly appreciated.

The changes I made today should give a reduction of about 4% in the total CPU load (on the most active CPU core; reduction should be bigger on a single core system).
BETA052 resulted in a 3-5% decrease. So, no, not any bigger than what you said for that one. You should also note that your 4% came from your write-to-a-file method of testing, while my 3-5% (average of 4) came from simply looking at Task Manager. IOW, it's not always inaccurate... Looking at things through ProcExp, I'm not seeing any DPC activity either. Never have. There's some interrupt stuff every now and then, but it's minimal.
I still believe that on K8 your code is cache (aka core) clock dependent for most things, with the Opteron and X2 lines benefiting from whatever supports multicore processing.
The reason I want to profile on my system is that I suspect there are alternative SIMD / assembly instructions that would help K8 but be performance-neutral on newer processors.
That's why I asked about Scalar vs. Packed. It's also why I've mentioned the MOVNTPS instruction. MOVNTPS helps minimize cache pollution, which would be beneficial to all systems, if you can use it, which you may or may not be able to.
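To make the packed + MOVNTPS idea concrete, here is a minimal sketch using the SSE intrinsics from xmmintrin.h, where _mm_stream_ps is the intrinsic for MOVNTPS. The function name scale_stream is hypothetical, and the sketch assumes both buffers are 16-byte aligned and the count is a multiple of 4. Non-temporal stores tend to pay off only when the destination is not read again soon, so this would need to be measured on both K8 and newer processors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h> // SSE: _mm_load_ps, _mm_mul_ps, _mm_set1_ps, _mm_stream_ps, _mm_sfence

// Hypothetical helper: dst[i] = src[i] * factor, four floats at a time,
// written with non-temporal stores so dst does not displace cached data.
// Assumptions: src and dst are 16-byte aligned, count is a multiple of 4.
void scale_stream(float* dst, const float* src, std::size_t count, float factor)
{
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);
    assert(count % 4 == 0);

    const __m128 k = _mm_set1_ps(factor); // broadcast factor to all 4 lanes
    for (std::size_t i = 0; i < count; i += 4)
    {
        const __m128 v = _mm_mul_ps(_mm_load_ps(src + i), k); // packed multiply
        _mm_stream_ps(dst + i, v);                            // MOVNTPS store
    }
    _mm_sfence(); // make the streamed stores visible before dst is read
}
```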