All times are UTC+02:00

Posted: Tue Jan 29, 2013 1:17 am

Joined: Sun Dec 12, 2010 2:26 pm
Posts: 885
Quote:
Beta 022:
Stand alone: http://www.stereotool.com/download/ster ... 04-022.exe
Winamp DSP: http://www.stereotool.com/download/dsp_ ... 04-022.exe
VST: http://www.stereotool.com/download/vst_ ... 04-022.dll

- Added Smart Peak mode to compressor, use this instead of Peak for good release behavior!
- Optimized compressor code (not finished)
Smart Peak causes audio stuttering, just like all other modes except plain Peak.

Absolutely no decrease in CPU load. No increase either, but absolutely no decrease. This is the reason why I keep asking for a code branch and overhead removal - so I can at least have a shot at getting a load reduction, as I just don't have faith that you're going to accomplish that with this new stuff. :cry:

TBH, I fully anticipate soon not being able to participate in any aspect that involves using loudness, even after a multiband redesign. So, yes, I'm fully aware and fully willing to accept being on a non-supported one-off branch.

Yes, this also means that I'm 90%+ sure of stopping working on a preset.

So, yes, I'm aware that a branch will get no future updates, and I'm ok with that.


Posted: Tue Jan 29, 2013 1:37 am
Site Admin

Joined: Mon Mar 17, 2008 1:40 am
Posts: 11425
Quote:
Absolutely no decrease in CPU load. No increase either, but absolutely no decrease.
There was a loop that runs through all samples (88200 per second assuming 44.1 kHz input, and each sample is processed twice), and this whole loop added 45 samples around the current sample. I replaced it with an algorithm that adds a new one and drops an old one - meaning I reduced the number of calculations in each loop iteration from 45 to 2... That really must have some effect. In fact, this effect is probably *LESS* than the gain from removing ALL those 'unnecessary' actions - which would probably take me weeks.

In the compressor code there's probably more stuff like this that can be improved, will continue tomorrow.
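
For anyone following along, the add-one, drop-one replacement described above can be sketched roughly as below. This is only an illustration in Python, not Stereo Tool's actual code: the function names are made up, and the window is assumed to trail the current sample rather than sit around it. Only the window length of 45 comes from the post.

```python
from collections import deque

WINDOW = 45  # window length from the post; everything else here is an assumption


def windowed_sums_naive(samples, window=WINDOW):
    """Re-add every sample in the window for each output (up to `window` additions per step)."""
    return [sum(samples[max(0, i - window + 1): i + 1]) for i in range(len(samples))]


def windowed_sums_running(samples, window=WINDOW):
    """Keep a running total: add the newest sample, drop the oldest (2 operations per step)."""
    recent = deque()  # the samples currently inside the window
    total = 0
    out = []
    for s in samples:
        recent.append(s)
        total += s
        if len(recent) > window:
            total -= recent.popleft()
        out.append(total)
    return out
```

Both functions return the same sums for integer input. With floating-point samples a running total can slowly accumulate rounding error, so a real implementation might need to recompute it from scratch now and then.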


Posted: Tue Jan 29, 2013 3:27 am

Joined: Sun Dec 12, 2010 2:26 pm
Posts: 885
Quote:
That really must have some effect. In fact, this effect is probably *LESS* than the gain from removing ALL those 'unnecessary' actions - which would probably take me weeks.
Hans, I just don't think you understand that I believe there's a cache capacity limit that my K8 slams into which your Core i-whatever doesn't. Please read all of this, even though I know it's lengthy and you may think I'm just going "blah, blah, blah, blah, blah..."

The more computations that have to be done, the more it taxes the cache subsystem. My speculation is that there are non-trivial cache misses happening for me, which means that main memory has to be accessed, and possibly disk, where a Core i-whatever doesn't have that happen.

Further, once the "saturation point" / "tipping point" / "Jenga point" has been reached, optimizations beyond that point yield relatively less improvement, because the cache still has a capacity limit that creates cache misses. Note: I don't know the official term, but when the capacity is hit, things go south...and Jenga is a reference to the game where the wood piece that makes the thing collapse has been pulled...

What the above means is that I only see a fraction of the processing savings made once my processor has hit the cache wall. Do you not remember that when you talk about single-digit load differences that I always see double-digit differences, usually on the order of 3-4X?

I also don't think you understand that your Core-based system has what's called a Loop Stream Detector. What the Loop Stream Detector does is detect when the system has entered a loop and then feeds data out of the stream cache without using the branch predictor algorithms. Google it (Loop Stream Detector) and read about how Core handles branches and loops.

What that means is that Core-based systems are faster in loops vs., say, a K8 that mispredicts a branch, incurring both a branch misprediction penalty and a cache miss penalty.

In other words, your system's microarchitecture compensates for what you're doing with all the multiple loops, especially with the Loop Stream Detector.

My system does not have a Loop Stream Detector.

:arrow: Your system handles your extra loopy code better by default, thus you're not going to notice all that much of a difference by removing the extra loops.

:arrow: :arrow: For my system, I need all the extra loopy-loop gone so that the branch predictor has less and less of a chance to mispredict and cause cache misses and execution unit stalls.

This is all so much like when I tried, and tried, and tried, and tried to get my direct management to take a stronger stand for getting us a support team inside the department so that developers weren't having to do as much support, so we could make code changes for both capital projects and bug fixes to reduce support load. They pissed around with it for 2 years before finally making the request to those above them.

The result for the department? My understanding is that instead of the 80% support / 20% development that used to happen for developers, it is now 90% development / 10% support. Things that had needed attention for years finally have started getting attention.

The result for me? I was laid off long before the effects of having that new support group were noticeable because they got sick of me pushing for the change, so they said that I wasn't getting enough done. Go figure. They had tried that same thing about 2 years prior in regards to support, but I got around it by demonstrating that my computer was a full minute slower on both the opening and closing of support tickets than other people's computers, so for every ticket I worked on, I had a built-in 2 minute penalty. I was handling about 150 tickets a day, so that translated to 300 additional minutes (5 hours) that I spent working every single day I was on support that others didn't have to spend.

Now, just like then, I kept/keep mentioning it because I believed/believe in what I was/am requesting. I am very tired of having to fight so hard to defend my position / myself though...


Posted: Tue Jan 29, 2013 10:21 am

Joined: Sun Dec 12, 2010 2:26 pm
Posts: 885
If by chance you've read my post already, as it is already noon over there, make sure to read the edits I made.


Posted: Tue Jan 29, 2013 12:42 pm

Joined: Sun Dec 12, 2010 2:26 pm
Posts: 885
Super-tired. Haven't slept yet.

Downloaded an older version of AMD CodeAnalyst, where "older" means it works on XP. The latest version requires newer operating systems.

The software does indeed indicate L2 cache misses and retired mispredicted branches for dsp_stereo_tool.dll.

I am not sure I've set up the profiling session properly, so I'm not going to give details right now. I'll try again when I've had some sleep.

Edit: Decided to try something before taking a nap.

Latest iteration of Cobalt that I'm working on. With everything that I normally have enabled, a 45 second profiling session showed about 6100 branch mispredicts. Simply disabling multiband reduced that to 2200 mispredicts.
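
To put those two counts side by side, here is a quick back-of-the-envelope calculation in Python. It uses only the approximate figures quoted above (about 6100 and 2200 mispredicts over a 45 second session). The variable and function names are just for illustration, and it assumes both runs lasted the same 45 seconds.

```python
SESSION_SECONDS = 45               # profiling session length from the post
mispredicts_all_enabled = 6100     # approximate count with everything enabled
mispredicts_multiband_off = 2200   # approximate count with multiband disabled


def per_second(count, seconds=SESSION_SECONDS):
    """Average number of mispredicted branches per second of profiling."""
    return count / seconds


# Fraction of the sampled mispredicts that went away with multiband off
reduction = 1 - mispredicts_multiband_off / mispredicts_all_enabled

print(f"all enabled:   {per_second(mispredicts_all_enabled):.1f} per second")
print(f"multiband off: {per_second(mispredicts_multiband_off):.1f} per second")
print(f"reduction:     {reduction:.0%}")
```

So, on these rough numbers, multiband accounts for a bit under two thirds of the sampled mispredicts.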

CPU load was still up, but I might be able to figure out how to profile that. Really wish I had BMC AppSight and a PDB of the application from Hans.

In case Hans wants to check into AppSight:

http://www.bmc.com/products/product-lis ... ution.html


Posted: Tue Jan 29, 2013 4:22 pm
Site Admin

Joined: Mon Mar 17, 2008 1:40 am
Posts: 11425
BETA023: ONLY performance optimization, NO change in audio!
http://www.stereotool.com/download/ster ... 04-023.exe
http://www.stereotool.com/download/dsp_ ... 04-023.exe

I've made some more optimizations in the compressor code, and now there should really be a clear difference.

On my system, I'm measuring the CPU load by checking the throughput (decoding FLAC, processing, writing to disk). So there are a few things that also take some processing power and I'm also including the disk in the measurement. Measuring it this way, the "system load" (I cannot call it CPU load) is:
All processing disabled: 5.1% ("offset" for the other measurements)
BETA020 with compressor on, RMS mode: 14.5% - I get exactly the same number if I use the multiband compressor instead!
New version: 8.3%

Subtracting the "offset" from both values gives 3.2% for the new version (beta023) vs 9.4% for the beta020 version - almost a factor of 3.

I would like to get it down further (the effect of the old compressor is so small that I cannot even measure it, it's lost in the measurement noise) but this should already help a whole lot.
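
The offset subtraction above can be reproduced in a few lines of Python. The three percentages are the ones measured in this post; the variable names are only for illustration.

```python
baseline = 5.1   # all processing disabled (the "offset")
beta020 = 14.5   # BETA020, compressor on, RMS mode
beta023 = 8.3    # new version (BETA023)

# Net "system load" attributable to the compressor itself
net_020 = round(beta020 - baseline, 1)
net_023 = round(beta023 - baseline, 1)

# How many times heavier the old compressor is than the new one
ratio = net_020 / net_023

print(f"BETA020 net: {net_020}%, BETA023 net: {net_023}%, ratio: {ratio:.2f}x")
```

The ratio comes out just under 3, which matches the "almost a factor of 3" estimate. Since the measurement includes decoding and disk writes, it says more about relative throughput on this one machine than about absolute CPU load elsewhere.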


Posted: Tue Jan 29, 2013 7:56 pm

Joined: Sun Dec 12, 2010 2:26 pm
Posts: 885
Quote:
I've made some more optimizations in the compressor code, and now there should really be a clear difference.

I would like to get it down further (the effect of the old compressor is so small that I cannot even measure it, it's lost in the measurement noise) but this should already help a whole lot.
All modes are now "usable", meaning they no longer cause audio stuttering.

However, CPU load is pegged between 87 and 98 percent, and doing anything else, including modifying settings in the GUI, is not feasible.

CPU load with "old" compressor: between 75 and 80 percent.

Profiling session with CodeAnalyst again showed approx 6200 branch mispredictions, which is essentially the same as with the old compressor. The variance of 100 mispredictions is likely explained by slight sampling runtime differences (how fast I am at hitting the play button).


Posted: Tue Jan 29, 2013 8:58 pm

Joined: Sun Dec 12, 2010 2:26 pm
Posts: 885
Additional information about K8 vs. "P8" (Core):

http://www.anandtech.com/show/1998

My takeaway from that is that Core can reorder integer loads much more efficiently than K8, as K8 didn't improve on this flaw that was present in K7. K8 benches slower than Pentium-M / Core in integer workloads as a result of this deficiency.

Given that it has already been stated that there are 32-bit integers in ST's code.........

Response to my question about 128-bit SSE...
Quote:
Hm interesting. But nearly everything uses 32-bit floats - are they handled normally? For those unnecessary steps, I'm using 32-bit ints (SSE2 optimized). Most interesting would be the behavior of this call: _mm_shuffle_epi32(). Except for that, all I do is reads, writes, ANDs and ORs.

Actually, this could mean that a non-SSE version might run faster on AMDs! (Although I doubt it; the difference in performance if I switch SSE off in the compiler is pretty big). But the Intel compiler often does everything with SSE2 registers because they are faster - even if it needs to do only a single calculation. Now if the AMD would still perform multiple calculations in that case it might in fact be slower!
Also given that additional iterations of a loop increase the chances of a branch misprediction and cache misses, which likely translates to 32-bit integers needing to be re-loaded.................

Also important is the discussion of out-of-order execution capabilities, L1 cache associativity, L1 instruction and data Translation Lookaside Buffer capabilities, and L2 cache pathway bit-width.

Conclusion: K8 has some integer issues compared to Core. That translates to SSE as well. K8 integer performance can be additionally hindered by larger numbers of integer load operations.

Speculation that I am going to try to see if I can confirm: Branch mispredictions indicated in CodeAnalyst at least roughly map to integer operations and possible reloading of integers, which K8 is slower at because of not being able to reorder the loading process as efficiently as Core.


Posted: Tue Jan 29, 2013 11:19 pm
Site Admin

Joined: Mon Mar 17, 2008 1:40 am
Posts: 11425
@Brian: Can you check, in BETA020, if the CPU load for Multiband is about the same as that for the compressor (default settings in both cases; all other filters disabled)?

On my system they are about equal...


Posted: Tue Jan 29, 2013 11:57 pm

Joined: Sun Dec 12, 2010 2:26 pm
Posts: 885
I'll check after a while. I have to go out for a couple of hours.

Edit: To get a feel for what to expect, I tried using the latest beta. The CPU "load" as per Task Manager was somewhere greater than 0% up to around 3%. Having multiband on produced the non-zero values, while having new singleband on (Smart RMS) had the value staying at 0.

I'll check with beta 20 after supper.

Checked 20.

Results: The 0-3% I saw before for multiband was some sort of load anomaly, as I've retested several times and I no longer see non-zero values for either 20 or 23.

020 and 023 are actually the same. Load is greater than zero, but less than one, for both multiband and new singleband.

This is why I have such a hard time understanding your affinity for referencing loads in isolation. I mean, I "get it"; what you're doing and all, but I'm just not sure I see the actual relevance to real-world performance.

It's also why I keep attempting to "lead you to the water" of pulling out all inefficiencies, but you're not willing to "drink".

I also "get" that. It's a combination of you feeling that you'll address the situation eventually with what you're doing *and* the realization that there's probably little or no money to be made by doing the branch as I'm requesting.

I do not believe that there will be a net load reduction in the future, given the additional functionality being added. Even if multiband gets a reduction in the number of bands, my opinion is that the net load will either be a wash or a net increase.

As such, I'm trying to get a load reduction for "me". That's in quotes because it may or may not be only me that benefits.

Proposal: In future installers, add a CPU detection routine that, with the user's permission, gathers the CPUID and some feature flags and reports that data back to you. This would allow you to gather information about the platforms where your code is being executed. This may end up finding that there are a non-trivial number of Turion / A64 (includes X2 and Opteron) / A64-based Sempron systems in use. Alternatively it might find that I'm a unique situation, but I think the odds are greater that I'm not as unique a circumstance as one might think.
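
As a very rough idea of what such an opt-in report could contain, here is a sketch in Python using only the standard library. It is not how an installer would really do it: a native installer would more likely query the CPUID instruction directly, and this sketch can only read feature flags on Linux via /proc/cpuinfo. The function name, the consent parameter and the report fields are all assumptions.

```python
import json
import platform


def collect_cpu_report(user_consented):
    """Return a small CPU report dict, or None when the user has not opted in."""
    if not user_consented:
        return None
    report = {
        "system": platform.system(),
        "machine": platform.machine(),
        "processor": platform.processor(),  # may be an empty string on some systems
        "flags": [],
    }
    # Feature flags are only read where /proc/cpuinfo exists (Linux).
    try:
        with open("/proc/cpuinfo", encoding="utf-8") as f:
            for line in f:
                if line.startswith("flags"):
                    report["flags"] = sorted(line.split(":", 1)[1].split())
                    break
    except OSError:
        pass  # not Linux, or the file is not readable: leave flags empty
    return report


print(json.dumps(collect_cpu_report(user_consented=True), indent=2))
```

Nothing is collected unless the consent flag is set, which matches the "with the user's permission" part of the proposal. Actually sending the data anywhere is left out on purpose.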

