More fun with CodeAnalyst
I've started getting a better understanding of what the tool is telling me.
I ran a profiling session with the latest iteration of Cobalt on beta 23. That didn't go so well: I was trying to examine too many things given the already high CPU load, and the audio started stuttering. So I backed down to the "generic" version and ran the same profiling.
The Branch Misprediction stuff and the L2 Cache stuff are still interesting, but I found something I think is more interesting - Dispatch Stalls. A Dispatch Stall is when the decoder has an instruction ready to send down the pipe but can't do so due to resource limitations in the instruction pathways. I would guess that includes the L1 instruction and data caches, the L1 Translation Lookaside Buffers, L2 (and its TLBs), the instruction fetchers, etc.
For the sampling session, dsp_stereo_tool.dll had 362100 cycles where the dispatcher was stalled.
Can you drill down on those? Why yes, you can...
First category is "Stall due to Branch Abort". This happens when a branch mispredict occurred and the pipeline is being flushed. That wasn't so bad, at only 3371, or around 0.9% of the previously listed cycles. This low figure probably would've surprised me early this morning, but as I was reading more about the predictors for K8 vs. Core, I ran across multiple performance analyses which, even while saying the Core's branch predictor was much improved, also noted that the K8's predictor was pretty good, with accuracy rates in the low 90s percent. Given the high number of cycles / ops, even single-digit accuracy improvements are significant.
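Just to picture what the predictor is up against, here's a little sketch I threw together - it's not from the actual filter code, just an illustration. A branch on unpredictable data mispredicts constantly, while the same logic written branchlessly gives the predictor nothing to get wrong:

    #include <cstdint>
    #include <vector>

    // Hypothetical example: counting samples above a threshold.
    // With unpredictable input, the branch below mispredicts roughly half
    // the time; with sorted or mostly-uniform data the predictor does fine.
    int count_loud_branching(const std::vector<int16_t>& samples, int16_t threshold)
    {
        int count = 0;
        for (int16_t s : samples) {
            if (s > threshold)   // data-dependent branch: predictability
                ++count;         // depends entirely on the input pattern
        }
        return count;
    }

    // Same result written branchlessly - there's no branch to mispredict.
    int count_loud_branchless(const std::vector<int16_t>& samples, int16_t threshold)
    {
        int count = 0;
        for (int16_t s : samples)
            count += (s > threshold);
        return count;
    }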
There are several other categories, but there must be some not listed, because the totals don't match. In fact, they're not even close: of the 362100 total, the listed counters only account for 93823. Where the rest are, I have no idea. Will need to look into that.
At any rate, the biggest listed stall was for "FPU Full". Per the documentation, this is caused by a lack of parallelism in FP-intensive code, or by cache misses on FP operand loads. The cycles stalled for this were 58283.
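I don't know what the actual FP loops look like, but the textbook case of "lack of parallelism in FP-intensive code" is a single accumulator chain where every add has to wait on the one before it. A rough sketch of the difference (my own illustration, not the real code):

    #include <cstddef>

    // Single dependency chain: each addition waits on the previous result,
    // so the FP scheduler has nothing independent to issue and can back up.
    float sum_serial(const float* x, std::size_t n)
    {
        float acc = 0.0f;
        for (std::size_t i = 0; i < n; ++i)
            acc += x[i];
        return acc;
    }

    // Same work split across four independent accumulators, giving the FPU
    // several in-flight additions to overlap (rounding differs slightly).
    float sum_parallel(const float* x, std::size_t n)
    {
        float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
        std::size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            a0 += x[i + 0];
            a1 += x[i + 1];
            a2 += x[i + 2];
            a3 += x[i + 3];
        }
        for (; i < n; ++i)
            a0 += x[i];
        return (a0 + a1) + (a2 + a3);
    }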
After that was "Stall reorder full". This is the reorder buffer. This ties back to the out-of-order execution and Integer loading stuff I mentioned earlier.
After that was "Stall LS full". These are the Load/Store units. Per the documentation, it says that stalls for this reason are generally because of heavy cache miss activity. This makes the second reference to cache misses. The other was with FPU Full, which includes misses on FP operand loads. My speculation here - respinning the loops additional times.
A related metric that isn't in the dispatch section is "LS2 Buffer Full". This means the secondary Load/Store buffer is full. LS2 holds stores waiting to retire, as well as requests that missed the data cache (L1) and are awaiting a refill. The documentation says a backup here will stall further data cache accesses, although overlapping execution may still be happening.
So, I've still got more things to look into, but as I mentioned earlier, my feeling was that there was a cache issue, and the preliminary profiling results support that theory: at least three of these performance metrics accumulate their counters when there are cache issues / misses.
Also, the reason testing filters in isolation, like what has been done, doesn't really give much insight is that there is so little going on that most or all of it ends up handled by the L1 cache, with only minor use of L2. That means little to no pressure on the cache, and since there's little to no pressure on the cache, there's little to no pressure on main memory, which is where the CPU ends up burning cycles waiting on significantly slower accesses.
Edit: The only insight it can give is whether an individual filter is behaving extraordinarily badly way down at the L1 caches, data or instruction. The problem is that unless you've coded something extremely sloppy, and I really mean EXTREMELY sloppy, at best you're only confirming that the compiler hasn't gone wonky, or that your CPU hasn't been overclocked too far or isn't having over/under-voltage issues that make it spit out bad computational results, because nothing else is putting load on the machine. There has to be enough pressure to have meaningful tests. So, yeah, I can see "a point" for doing this type of testing, but I'm not sure the time spent (cost) gives enough benefit.
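If someone did want an isolated-filter test to say anything about cache behavior, the working set would have to be pushed well past the cache sizes; otherwise everything lives in L1 and the numbers tell you nothing about the real load. A rough sketch of the idea - the sizes and process_block are assumptions on my part, not the actual test setup:

    #include <cstddef>
    #include <vector>

    // Stand-in for whatever filter is under test (a trivial gain here).
    void process_block(float* block, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
            block[i] *= 0.9f;
    }

    // Run the filter over a working set chosen relative to the cache sizes.
    // A few KB stays resident in L1 and measures almost nothing; tens of MB
    // forces L2 misses and trips to main memory, which is closer to how the
    // full processing chain actually behaves.
    void run_filter_test(std::size_t working_set_bytes, std::size_t block_samples, int passes)
    {
        std::vector<float> data(working_set_bytes / sizeof(float), 0.25f);
        for (int p = 0; p < passes; ++p)
            for (std::size_t i = 0; i + block_samples <= data.size(); i += block_samples)
                process_block(&data[i], block_samples);
    }

    // e.g. run_filter_test(16 * 1024, 512, 1000);       // fits in L1: little pressure
    //      run_filter_test(64 * 1024 * 1024, 512, 10);  // far beyond L2: real pressure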
Even if you can effectively argue that it is time well spent, doing effective testing of this nature means having a test platform of your own for best/worst case scenarios. This is the reason behind the suggestion for the CPU data collection: so you can make a more informed decision about what architectural level is reasonable to target for performance.