It's been a bit quiet, but I've been working on things. Nothing should have changed in behavior, though: this version should be identical to the released version 7.40.
What I did: I simplified the code, removed a lot of the 'unnecessary steps' that were mentioned earlier and were still being performed many times, and reduced the amount of memory used by some lookup tables. I also placed some values in a new lookup table. That is actually the only change that seems to have any effect on performance (on most presets the difference is about 1% on my PC). The fact that reducing the memory used by the lookup tables has no effect on my i7 laptop seems to indicate that, at least on that hardware, memory speed (caching) is not an issue. It might be on other hardware, though.
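To illustrate the lookup-table idea for anyone following along, here is a minimal sketch in Python. It is not the actual Stereo Tool code: the table size, the input range [0, pi], and the names `TABLE_SIZE`, `ONE_MINUS_COS` and `one_minus_cos_lookup` are all assumptions made for the example. It precalculates 1-cos() once and replaces the per-sample calculation with a table read.

```python
import math

# Assumed table resolution (hypothetical; a real table size is a tuning choice).
TABLE_SIZE = 1024

# Precalculate 1 - cos(x) once, for x evenly spaced over [0, pi].
ONE_MINUS_COS = [
    1.0 - math.cos(math.pi * i / (TABLE_SIZE - 1)) for i in range(TABLE_SIZE)
]


def one_minus_cos_lookup(x):
    """Approximate 1 - cos(x) for x in [0, pi] using the nearest table entry."""
    if not 0.0 <= x <= math.pi:
        raise ValueError("x must be in [0, pi]")
    index = round(x / math.pi * (TABLE_SIZE - 1))
    return ONE_MINUS_COS[index]
```

The trade-off is the one described above: the table costs memory and may compete for cache, so whether it wins depends on the hardware. A larger table gives better accuracy but uses more cache.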
Why am I posting this version? To get feedback on any new bugs that I may have introduced, and because I'm curious what happens to the performance on other hardware, especially AMDs and older CPUs with less cache.
Windows standalone:
http://www.stereotool.com/download/ster ... 41-009.exe
Winamp DSP:
http://www.stereotool.com/download/dsp_ ... 41-009.exe
The version number is 009 because I had a lot of in-between builds to keep track of stability and performance issues.
Full change log:
52. Check CPU load. Start with checking if there's anything left that uses the 'unnecessary steps'. Sevdah Web preset: Data still gets converted 58 times... I think I need to do this one first, it should have some effect on the CPU load.
28 removed - next, convert the 2 IIR filters so they can be optimized and the merge/split around them can be removed. I'm not measuring any effect from this, though (but it makes the code simpler, which is also good).
53. Noise Gate/Stereo Boost: Pre-calculate 1-cos() and sqrt() values.
55. Check MemoryPool behavior for cache improvements -> No effect measured, and might make behavior less constant.
56. Check if we can go in opposite direction for each next step to improve cache.
57. Check if lazy reverse FFT is an option. -> No, difficult and gain does not even seem to be measurable.
58. Created a separate class that performs the processing chain. Currently the same code is repeated twice (once for normal processing, once for low latency processing), which means that a lot of code is duplicated and it's difficult to add extra chains. Most, but not all, of that code has now been moved elsewhere. The low latency chain has not been modified yet, but that should now be relatively easy.
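For item 58, the general idea of a single chain class that is reused for both modes can be sketched as follows. This is only an illustration in Python, not the real implementation: `ProcessingChain`, `gain`, `hard_clip` and the step lists are hypothetical names and steps chosen for the example.

```python
class ProcessingChain:
    """Runs a list of processing steps in order (hypothetical sketch)."""

    def __init__(self, name, steps):
        self.name = name
        self.steps = list(steps)

    def process(self, samples):
        # Each step takes a list of samples and returns a new list.
        for step in self.steps:
            samples = step(samples)
        return samples


def gain(factor):
    """Example step: multiply every sample by a fixed factor."""
    def apply(samples):
        return [s * factor for s in samples]
    return apply


def hard_clip(limit):
    """Example step: clamp every sample to [-limit, limit]."""
    def apply(samples):
        return [max(-limit, min(limit, s)) for s in samples]
    return apply


# The same class serves both chains; only the step lists differ.
normal_chain = ProcessingChain("normal", [gain(2.0), hard_clip(1.0)])
low_latency_chain = ProcessingChain("low latency", [hard_clip(1.0)])
```

With this shape, adding an extra chain means building another step list rather than copying the processing code a third time.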