Stereo Tool https://forums.stereotool.com/ |
Stereo Tool 7.03 BETA https://forums.stereotool.com/viewtopic.php?t=4448 |
Page 21 of 102 |
Author: | Brian [ Wed Jan 30, 2013 7:26 am ] |
Post subject: | Re: Stereo Tool 7.03 BETA |
More fun with CodeAnalyst. I've started getting a better understanding of what the tool is telling me. I ran a profiling session with the latest iteration of Cobalt on beta 23. That didn't go so well: I was trying to examine too many things given the already high CPU load, so the audio started stuttering. So I backed down to the "generic" version and ran the same profiling.

The Branch Misprediction and L2 Cache data are still interesting, but I found something I think is more interesting - Dispatch Stalls. A Dispatch Stall is when the decoder has an instruction ready to send down the pipe, but can't do so due to resource limitations in the instruction pathways. I would guess that includes the L1 Instruction and Data caches, L1 Translation Lookaside Buffers, L2 (and its TLBs), Instruction Fetchers, etc.

For the sampling session, dsp_stereo_tool.dll had 362100 cycles where the dispatcher was stalled. Can you drill down on those? Why yes, you can...

The first category is "Stall due to Branch Abort". This happens when a branch mispredict has occurred and the pipeline is being flushed. That wasn't so bad, at only 3371, or around 0.9% of the previously listed cycles. This low figure probably would've surprised me early this morning, but as I was reading more about the predictors for K8 vs. Core, I ran across multiple performance analysis reviews which said that, although the Core's branch predictor was much improved, the K8's predictor was also pretty good, with accuracy rates in the low 90s percent. Given the high number of cycles / ops, even single-digit accuracy rate improvements are significant.

There are several other categories, but there must be some not listed because the totals don't match. In fact, they're not even close. Of the 362100 stalled cycles, the listed counters only total 93823. Where the rest are, I have no idea. Will need to look into that. 
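As a quick sanity check on the figures quoted above, here is a minimal C sketch of the arithmetic. Only the three cycle counts come from the post; the function names are made up for illustration and have nothing to do with CodeAnalyst itself.

```c
#include <assert.h>
#include <stdio.h>

/* Share of `part` in `total`, as a percentage. */
double share_percent(long part, long total) {
    assert(total > 0 && part >= 0);
    return 100.0 * (double)part / (double)total;
}

/* Stalled cycles that the listed categories do not account for. */
long unaccounted_cycles(long total, long listed) {
    assert(listed >= 0 && listed <= total);
    return total - listed;
}

/* Print the breakdown for the figures quoted in the post. */
void print_stall_breakdown(void) {
    const long total = 362100;       /* all dispatch-stall cycles */
    const long branch_abort = 3371;  /* "Stall due to Branch Abort" */
    const long listed = 93823;       /* sum of the listed categories */
    const long missing = unaccounted_cycles(total, listed);

    printf("branch abort: %.2f%%\n", share_percent(branch_abort, total));
    printf("listed:       %.2f%%\n", share_percent(listed, total));
    printf("unaccounted:  %ld cycles (%.2f%%)\n",
           missing, share_percent(missing, total));
}
```

Calling print_stall_breakdown() from a main() gives roughly 0.93% for branch aborts and about 26% for all listed categories together, which leaves 268277 cycles (about 74%) unattributed.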
At any rate, the biggest listed stall was for "FPU Full". Per the documentation, this comes from a lack of parallelism in FP-intensive code, or from cache misses on FP operand loads. The cycles stalled for this were 58283.

After that was "Stall reorder full". This is the reorder buffer, which ties back to the out-of-order execution and Integer loading stuff I mentioned earlier.

After that was "Stall LS full". These are the Load/Store units. Per the documentation, stalls for this reason are generally caused by heavy cache miss activity. This makes the second reference to cache misses; the other was with FPU Full, which includes misses on FP operand loads. My speculation here - respinning the loops additional times.

A related metric that isn't in the dispatch section is "LS2 Buffer Full", meaning that the secondary Load/Store Buffer is full. LS2 holds stores waiting to be retired, as well as requests that missed the data cache (L1) and are awaiting a refill. The documentation says that a backup here will stall further data cache accesses, although there may be overlapping executions still happening.

So, I've still got more things to look into, but as I mentioned earlier, my feeling was that there was a cache issue, and the preliminary results from profiling support my theory: there are (at least) three performance metrics whose counters accumulate when there are cache issues / misses.

Also, the reason testing filters in isolation, as has been done, doesn't give much insight is that there is so little going on that most or all of it ends up handled by the L1 cache, with at most minor use of the L2 cache. This means there is little to no pressure on the cache, and therefore little to no pressure on main memory, which is where the CPU has to spin waiting for the significantly slower memory. 
Edit: The only insight it can give is whether an individual filter is behaving extraordinarily badly way down at the L1 caches, either data or instruction. The problem is that, because no other processes are in use, unless you've coded something extremely sloppy (and I really mean EXTREMELY), at best you are only making sure the compiler hasn't gone wonky, or that your CPU hasn't been overclocked too far or isn't otherwise having over/under-voltage issues to the point where it spits out bad computational results. There has to be enough pressure to have meaningful tests.

So, yeah, I can see "a point" for doing this type of testing, but I'm not sure the time spent (cost) gives enough benefit. Even if you can effectively argue that it is time well spent, to do effective testing of this nature you'd need a test platform of your own to do best/worst case scenarios. This is the reason behind the suggestion for the CPU data collection, so that you can make a more informed decision about what is a reasonable architectural level to target for performance. |
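The cache-pressure argument in the post above can be tried out with a rough working-set sweep. The C sketch below is only an illustration, not anything taken from Stereo Tool or CodeAnalyst: the function names are hypothetical, the 64-byte cache line is an assumption, and because the access pattern is sequential, hardware prefetching may hide part of the effect.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

#define LINE_FLOATS 16 /* assumption: 64-byte cache lines, 4-byte floats */

/* Read one float per (assumed) cache line across the buffer, `passes` times. */
double touch_lines(const float *buf, size_t count, int passes) {
    double sum = 0.0;
    for (int p = 0; p < passes; ++p)
        for (size_t i = 0; i < count; i += LINE_FLOATS)
            sum += buf[i];
    return sum;
}

/* Average nanoseconds per line touched for a working set of `bytes`.
   Returns a negative value if the allocation fails. */
double ns_per_line(size_t bytes, int passes) {
    size_t count = bytes / sizeof(float);
    assert(count >= LINE_FLOATS && passes > 0);

    float *buf = malloc(count * sizeof *buf);
    if (!buf) return -1.0;
    for (size_t i = 0; i < count; ++i) buf[i] = 1.0f;

    volatile double sink; /* keeps the compiler from dropping the reads */
    clock_t start = clock();
    sink = touch_lines(buf, count, passes);
    clock_t end = clock();
    (void)sink;
    free(buf);

    double lines = (double)passes *
                   (double)((count + LINE_FLOATS - 1) / LINE_FLOATS);
    return 1e9 * (double)(end - start) / CLOCKS_PER_SEC / lines;
}
```

Calling ns_per_line() for sizes from a few KiB up to tens of MiB, with enough passes to get a measurable time, should show the cost per line rising once the working set no longer fits in L1 and then L2. A single filter tested in isolation sits at the small end of that sweep, which is the point being made above.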
Author: | hvz [ Wed Jan 30, 2013 4:35 pm ] |
Post subject: | Re: Stereo Tool 7.03 BETA |
Did some profiling here as well. For the compressor, if anyone knows how to calculate 10 ^ ((10log x) * y) without using pow() I would be really grateful. Current (very expensive) code is: pow(10.0f, log(x) * one_DIV_log10_MUL_y) Since 10 ^ (10log x) = x, I'm pretty sure that there must be more efficient ways of doing this but I cannot figure out how. This single calculation seems to take about a quarter of the total CPU load when only the compressor is enabled. |
Author: | Brian [ Wed Jan 30, 2013 4:50 pm ] |
Post subject: | Re: Stereo Tool 7.03 BETA |
I have to go out (I put off going out yesterday to save trips), but when you venture into math territory, I'm out of my element, since Calculus was also over 20 years ago for me... Might try Google. Google can be your friend... |
Author: | gpagliaroli [ Wed Jan 30, 2013 5:18 pm ] |
Post subject: | Re: Stereo Tool 7.03 BETA |
Quote: Did some profiling here as well. For the compressor, if anyone knows how to calculate 10 ^ ((10log x) * y) without using pow() I would be really grateful. Current (very expensive) code is: pow(10.0f, log(x) * one_DIV_log10_MUL_y) Since 10 ^ (10log x) = x, I'm pretty sure that there must be more efficient ways of doing this but I cannot figure out how. This single calculation seems to take about 1/4th of the total CPU load when only the compressor is enabled.
Try x ^ (10 * y) |
Author: | Brian [ Wed Jan 30, 2013 5:24 pm ] |
Post subject: | Re: Stereo Tool 7.03 BETA |
Before heading out the door: Google-fu action http://martin.ankerl.com/2012/01/25/opt ... c-and-cpp/ |
Author: | hvz [ Wed Jan 30, 2013 5:29 pm ] |
Post subject: | Re: Stereo Tool 7.03 BETA |
Quote: Try x ^ (10 * y)
10 ^ ((10log x) * y)
Say, x = 100, y = 3.
10log x = 2, so 10^(2*3) = 10^6.
100 ^ (10 * 3) = 100 ^ 30... Not good.
Hm... But. If I remove the 10* from 10*y it seems fine: 100 ^ 3 = 10^6.
Ow. You thought that 10log = 10*log; I meant (small 10 at top) log, i.e. the base-10 logarithm. Ok, this helps! Unfortunately there's still a ^ in it but the log is gone. Thanks! |
Author: | Brian [ Wed Jan 30, 2013 5:50 pm ] |
Post subject: | Re: Stereo Tool 7.03 BETA |
Quote: but the log is gone.
|
Author: | gpagliaroli [ Wed Jan 30, 2013 6:39 pm ] |
Post subject: | Re: Stereo Tool 7.03 BETA |
Quote: Quote: Try x ^ (10 * y)
Hm... But. If I remove the 10* from 10*y it seems fine. 100 ^ 3 = 10^6 Ow. You thought that 10log = 10*log, I meant (small 10 at top) log. Ok, this helps! Unfortunately there's still a ^ in it but the log is gone. Thanks! |
Author: | hvz [ Wed Jan 30, 2013 9:05 pm ] |
Post subject: | Re: Stereo Tool 7.03 BETA |
True - but the pow is more expensive than the log. @Brian: That method is far too imprecise, and worse, not continuous (there are large jumps in it). So I'll keep it at Gap's solution - thanks! |
Author: | Brian [ Wed Jan 30, 2013 9:29 pm ] |
Post subject: | Re: Stereo Tool 7.03 BETA |
OK. That was a rushed look before having to leave, so I'll look a little further now that I'm back.

Edit: Fast pow() With Adjustable Accuracy

Also, a question, although one I'm pretty sure I already know the answer to... I've switched to time-based profiling, because event-based isn't going to narrow it down. I could likely figure out where my system is bogging down if I had debug files. I have no interest in making a competing product, or even in spending the time exporting the code to build a version myself (I'm far too lazy to do that!). Your position is that it would take you too long to do the branch and would distract you from what you're currently doing. Well, I'm offering you an alternative - let me profile it, with debug info, on a system that has the performance problem, and if it just boils down to the code being good and the K8 processor not being beefy enough to handle it, the upside for you is that it'll shut me up... |
All times are UTC+02:00 |