Quote:
That really must have some effect. In fact, this effect is probably *LESS* than the gain from removing ALL those 'unnecessary' actions - which would probably take me weeks.
Hans, I just don't think you understand that I believe there's a cache capacity limit that my K8 slams into which your Core i-whatever doesn't. Please read all of this, even though I know it's lengthy and you may think I'm just going "blah, blah, blah, blah, blah..."
The more computations that have to be done, the more it taxes the cache subsystem. My speculation is that there are non-trivial cache misses happening for me, and that means that main memory has to be accessed, and possibly disk, where a Core i-whatever doesn't have that happen.
Further, once the "saturation point" / "tipping point" / "Jenga point" has been reached, code improvements beyond that point bring relatively less benefit, because the cache still has a capacity limit and that limit keeps causing cache misses. Note: I don't know the official term, but when the capacity is hit, things go south... and Jenga is a reference to the game where the wood piece that makes the thing collapse has been pulled...
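To show what I mean by the "Jenga point", here is a rough toy sketch. It is NOT a model of a K8 or of any real processor: it assumes a fully associative cache with LRU replacement and a loop that walks the same data over and over, and the names (miss_rate, working_set_lines, cache_lines) and the sizes are made up for illustration.

```python
from collections import OrderedDict


def miss_rate(working_set_lines, cache_lines, passes=10):
    """Fraction of accesses that miss in a toy fully associative LRU cache.

    Assumption: the program repeatedly walks `working_set_lines` distinct
    cache lines in the same order, `passes` times. Real caches are
    set-associative and have prefetchers, so this is only an illustration.
    """
    cache = OrderedDict()  # keys are line numbers, ordered oldest -> newest
    misses = 0
    accesses = 0
    for _ in range(passes):
        for line in range(working_set_lines):
            accesses += 1
            if line in cache:
                cache.move_to_end(line)        # hit: mark as most recently used
            else:
                misses += 1
                cache[line] = True             # miss: load the line
                if len(cache) > cache_lines:
                    cache.popitem(last=False)  # evict the least recently used line
    return misses / accesses


if __name__ == "__main__":
    # Hypothetical cache of 1024 lines; working sets just below and above it.
    for ws in (512, 1024, 1025, 2048):
        print(ws, miss_rate(ws, cache_lines=1024))
```

In this toy model, a working set that fits only misses on the first pass, but going over capacity by a single line makes every access miss. Real hardware degrades less abruptly than that, but it's the same kind of cliff I believe I'm falling off.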
What the above means is that I only see a fraction of the processing savings made once my processor has hit the cache wall. Do you not remember that when you talk about single-digit load differences, I always see double-digit differences, usually on the order of 3-4X?
I also don't think you understand that your Core-based system has what's called a Loop Stream Detector. What the Loop Stream Detector does is detect when the system has entered a loop and then feeds data out of the stream cache without using the branch predictor algorithms. Google it (Loop Stream Detector) and read about how Core handles branches and loops.
What that means is that loops run faster on Core-based systems than on, say, a K8 that mispredicts a branch, incurring both a branch misprediction penalty and a cache miss penalty.
In other words, your system's microarchitecture compensates for what you're doing with all the multiple loops, especially with the Loop Stream Detector.
My system does not have a Loop Stream Detector.
Your system handles your extra loopy code better by default, thus you're not going to notice all that much of a difference by removing the extra loops.
For my system, I need all the extra loopy-loop gone so that the branch predictor has less and less of a chance to mispredict and cause cache misses and execution unit stalls.
This is all so much like when I tried, and tried, and tried, and tried to get my direct management to take a stronger stand for getting us a support team inside the department so that developers weren't having to do as much support, so we could make code changes for both capital projects and bug fixes to reduce support load. They pissed around with it for 2 years before finally making the request to those above them.
The result for the department? My understanding is that instead of the 80% support / 20% development that used to happen for developers, it is now 90% development / 10% support. Things that had needed attention for years finally have started getting attention.
The result for me? I was laid off long before the effects of having that new support group were noticeable because they got sick of me pushing for the change, so they said that I wasn't getting enough done. Go figure. They had tried that same thing about 2 years prior in regards to support, but I got around it by demonstrating that my computer was a full minute slower on both the opening and closing of support tickets than other people's computers, so for every ticket I worked on, I had a built-in 2 minute penalty. I was handling about 150 tickets a day, so that translated to 300 additional minutes (5 hours) that I spent working every single day I was on support that others didn't have to spend.
Now, just like then, I kept/keep mentioning it because I believed/believe in what I was/am requesting. I am very tired of having to fight so hard to defend my position / myself though...