All times are UTC+01:00




PostPosted: Sat Aug 21, 2021 4:42 am 
Site Admin

Joined: Mon Mar 17, 2008 1:40 am
Posts: 11185
Thanks for reporting this one.

If anyone wants to know what's going on:
I can now confirm that 1/(1/sqrt(x)) works fine, while x*(1/sqrt(x)) does not.

1/sqrt(x) can return infinity if x is 0. 0 * infinity apparently does not give 0 (under IEEE 754 it is NaN), but 1 / infinity is 0. I guess that makes some sense... Anyway, it does mean that 1/(1/sqrt(x)) works.
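
To make the arithmetic concrete, here is a minimal sketch using plain C++ floats rather than the approximate SSE instructions (the function names are made up for illustration). Under IEEE 754 rules 0 * infinity evaluates to NaN, not 0, while 1 / infinity is 0, so only the second form survives x = 0. It assumes the code is built without -ffast-math, which is allowed to break these rules.

```cpp
#include <cmath>

// Hypothetical helper: sqrt(x) computed as x * (1/sqrt(x)).
// For x == 0 this becomes 0 * infinity, which is NaN under IEEE 754.
float sqrt_via_multiply(float x) {
    float inv = 1.0f / std::sqrt(x);  // +infinity when x == 0
    return x * inv;
}

// Hypothetical helper: sqrt(x) computed as 1 / (1/sqrt(x)).
// For x == 0 this becomes 1 / infinity, which is 0.
float sqrt_via_reciprocal(float x) {
    float inv = 1.0f / std::sqrt(x);  // +infinity when x == 0
    return 1.0f / inv;
}
```

For any x > 0 both helpers return the same kind of result; they only differ at x = 0.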


PostPosted: Sat Aug 21, 2021 7:13 am 

Joined: Sun Feb 03, 2013 2:39 pm
Posts: 333
If 1/inv_sqrt(x) is faster, then only because the result has lower precision. That's probably no problem for Stereo Tool but wouldn't using low precision sqrt(x) achieve the same without any issues?

I did a quick websearch and yeah, inv_sqrt(x) usually uses half the precision. Note there should be a compiler option for low precision sqrt, too. That would have made all sqrts faster at once without replacing it.

By default sqrt needs to follow the IEEE 754 standard, which requires the result to be correctly rounded. With the compiler option above this is no longer necessary.


PostPosted: Sat Aug 21, 2021 8:13 am 

Joined: Wed Dec 04, 2019 6:12 am
Posts: 28
Another old, minor issue that I forgot to mention (probably nobody is using this function): "Fake Stereo" is not active when "Audio quality" is more than 100%.


PostPosted: Sat Aug 21, 2021 10:26 am 
Site Admin

Joined: Mon Mar 17, 2008 1:40 am
Posts: 11185
Quote:
If 1/inv_sqrt(x) is faster, then only because the result has lower precision. That's probably no problem for Stereo Tool but wouldn't using low precision sqrt(x) achieve the same without any issues?

I did a quick websearch and yeah, inv_sqrt(x) usually uses half the precision. Note there should be a compiler option for low precision sqrt, too. That would have made all sqrts faster at once without replacing it.

By default sqrt needs to follow the IEEE 754 standard, which requires the result to be correctly rounded. With the compiler option above this is no longer necessary.
The compiler option - as far as I have observed - does replace 1 / x by the faster function, and probably 1 / sqrt as well, but it doesn't automatically replace sqrt by 1/(1/sqrt). Besides that, the heavy parts of the code are in most places hand-written intrinsics (very close to assembly), and the compiler definitely doesn't change those based on that option.

I have just re-thought the precision thing though, and while in MOST (actually, all other) places I don't care about precision at all, the Dequantizer is actually an exception. I still don't care about precision, but I do need the error offset to be smooth. Based on the documentation and graphs that I found it should be - but based on my own tests it's not. Information about how this 1/sqrt instruction is implemented is contradictory: many sources say that it's a lookup table, but that doesn't fit what I saw (jumps of the output value are sometimes bigger than the difference of the input values). It looks like they are actually using something like the Quake method. So I have just reverted the 1/(1/sqrt) change for the Dequantizer.
Quote:
Another old, minor issue that I forgot to mention (probably nobody is using this function): "Fake Stereo" is not active when "Audio quality" is more than 100%.
I don't think that anyone uses that anymore (well, obviously you do?). It must have been broken ever since we introduced > 100% Quality, which is years ago. I'll put it on the list for the next version, we're not going to change it for the upcoming release anymore.


PostPosted: Sat Aug 21, 2021 11:49 am 

Joined: Wed Dec 04, 2019 6:12 am
Posts: 28
Quote:
I don't think that anyone uses that anymore (well, obviously you do?)
Honestly, I haven't touched it in a long time, but I remember there were situations where using this function (carefully!) gave a somewhat more comfortable sound than the original, 100% mono source. Only when some stereo image is really needed.


PostPosted: Sat Aug 21, 2021 11:50 pm 
Site Admin

Joined: Mon Mar 17, 2008 1:40 am
Posts: 11185
Quote:
I have just re-thought the precision thing though, and while in MOST (actually, all other) places I don't care about precision at all, the Dequantizer is actually an exception. I still don't care about precision, but I do need the error offset to be smooth.
Scratch that. It's not a problem.


PostPosted: Sun Aug 22, 2021 5:38 am 

Joined: Wed Apr 06, 2016 4:06 am
Posts: 38
Reduced-accuracy rsqrtss and rsqrtps have been available with SSE since the Pentium 3. VEX encodings are available for AVX. Any possible use for your code?

https://www.felixcloutier.com/x86/rsqrtss
https://www.felixcloutier.com/x86/rsqrtps

As tested here:

https://stackoverflow.com/questions/152 ... n-rsqrtx-x

Quake's Newton-Raphson approximation: 4.3ns/float at a relative error of ~ 1 x 2^-10
SSE rsqrtss: 1.24ns/float at a relative error of <= 1.5 x 2^-12


PostPosted: Tue Aug 24, 2021 4:44 pm 
Site Admin

Joined: Mon Mar 17, 2008 1:40 am
Posts: 11185
Quote:
Reduced-accuracy rsqrtss and rsqrtps have been available with SSE since the Pentium 3. VEX encodings are available for AVX. Any possible use for your code?

https://www.felixcloutier.com/x86/rsqrtss
https://www.felixcloutier.com/x86/rsqrtps

As tested here:

https://stackoverflow.com/questions/152 ... n-rsqrtx-x

Quake's Newton-Raphson approximation: 4.3ns/float at a relative error of ~ 1 x 2^-10
SSE rsqrtss: 1.24ns/float at a relative error of <= 1.5 x 2^-12
That's what I'm using - at least everywhere where I need a square root in hand-vectorized SSE/AVX code. Divides as well ( a / b --> a * rcp(b) ).
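
As a rough sketch of what such hand-vectorized code can look like with SSE intrinsics: `_mm_rsqrt_ps` and `_mm_rcp_ps` are the packed approximate instructions, while the helper names and loop structure are illustrative and not Stereo Tool's actual code. The length is assumed to be a multiple of 4, and an exact scalar fallback is included only so the example also builds on non-x86 targets.

```cpp
#include <cmath>
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define SKETCH_USE_SSE 1
#endif

// out[i] ~= sqrt(in[i]), computed as rcp(rsqrt(x)) on 4 floats at a time.
// n is assumed to be a multiple of 4.
void approx_sqrt_array(const float* in, float* out, std::size_t n) {
#ifdef SKETCH_USE_SSE
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 x = _mm_loadu_ps(in + i);
        __m128 r = _mm_rcp_ps(_mm_rsqrt_ps(x));  // 1 / (1 / sqrt(x)), both approximate
        _mm_storeu_ps(out + i, r);
    }
#else
    for (std::size_t i = 0; i < n; ++i) out[i] = std::sqrt(in[i]);  // exact fallback
#endif
}

// out[i] ~= a[i] / b[i], computed as a * rcp(b) on 4 floats at a time.
// n is assumed to be a multiple of 4.
void approx_div_array(const float* a, const float* b, float* out, std::size_t n) {
#ifdef SKETCH_USE_SSE
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 r = _mm_mul_ps(_mm_loadu_ps(a + i), _mm_rcp_ps(_mm_loadu_ps(b + i)));
        _mm_storeu_ps(out + i, r);
    }
#else
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] / b[i];  // exact fallback
#endif
}
```

Because the square root goes through rcp(rsqrt(x)) rather than x * rsqrt(x), an input of 0 comes out as 0.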

However.. reading this post I saw that even IF compilers use this instruction (with -ffast-math), they will add a Newton iteration. Which makes the instruction almost (not quite) as heavy as a normal sqrt. For what I'm using it for, that's completely unnecessary. So, I'll have a look at how to force the compiler to use rcp(rsqrt(x)) and a*rcp(b) in my non-hand-optimized code as well. That should reduce the CPU usage a bit further. Most importantly, I need to make sure that the ARM compiler does it as well, since that's where performance gains are the most crucial.
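
To illustrate the difference, here is a small sketch of the raw rsqrtss estimate next to the same estimate followed by one Newton-Raphson step, which is roughly the kind of refinement a compiler adds. The function names are illustrative, the exact code a compiler emits may differ, and the non-x86 fallback is only there so the example builds everywhere.

```cpp
#include <cmath>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define SKETCH_USE_SSE 1
#endif

// Raw hardware estimate of 1/sqrt(x); on x86 the relative error is at most
// about 1.5 * 2^-12.
float rsqrt_estimate(float x) {
#ifdef SKETCH_USE_SSE
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
#else
    return 1.0f / std::sqrt(x);  // exact fallback for non-x86 builds
#endif
}

// The same estimate plus one Newton-Raphson step: close to full float
// precision, at the cost of several extra multiplies and a subtract.
float rsqrt_refined(float x) {
    float y = rsqrt_estimate(x);
    return y * (1.5f - 0.5f * x * y * y);
}
```

When the extra precision isn't needed, calling only the estimate skips those additional operations.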

Edit: ARM: The instructions for a single rsqrt or rcp don't exist in 32 bit (they do exist for vectors, which *may* still make it possible to generate faster code.. maybe)

Edit: 2% off, and some more opportunities remain (but I have to be careful where I can use this). I will finish this tomorrow, and also verify that the ARM-versions still work (since they have different code - in fact, 32 and 64 bit is already very different).


PostPosted: Tue Aug 24, 2021 8:53 pm 

Joined: Wed Apr 06, 2016 4:06 am
Posts: 38
K, and thanks for the explanation. :)

I never really got into this level or type of code-level programming optimization, and I have no truly practical experience on ARM coding. There are so many versions of ARM, plus from what I gathered, even seemingly similar instructions between x86 SIMD and NEON can produce dramatically different results. Yay?

Most of what I encountered in a quick skim of more modern discussions has been about improving the accuracy of inverse square root approximation and less so about the performance. I did note a suggestion of just using -ffast-math with standard square root instructions, which should generate code similar to optimized square root intrinsics with x86 optimizing compilers these days. YMMV, I suppose, and probably more so with ARM.


PostPosted: Tue Aug 24, 2021 9:43 pm 
Site Admin

Joined: Mon Mar 17, 2008 1:40 am
Posts: 11185
I've tried it on Godbolt and -ffast-math didn't cause sqrt to be generated using rsqrt's. Indeed, ARM is very different. For one, where the Intel instruction has about 2^-12 precision, ARM only has 2^-9 precision. More problematic is (I think) that the rsqrt and rcp instructions *only* exist for at least 2 values simultaneously on 32-bit ARM. On 64-bit ARM they do exist for 1 value as well. Which means that I have 3 different versions of the code now. So, I'm going to test whether I can just use the version that calculates 2 results and then ignore one of the two. Which seems very odd. If that works, great - otherwise I'll have to use the normal sqrt on 32-bit ARM.

By the way, in 32-bit ARM vectorized intrinsics I couldn't find a function for sqrt, only rsqrt, so on ARM I've always used rcp(rsqrt()) in my NEON (vectorized) code. But the normal C++ sqrt-function definitely doesn't use that.
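
A minimal sketch of that "calculate 2 results and ignore one" idea, assuming the 2-lane NEON estimate intrinsics `vrsqrte_f32` and `vrecpe_f32` from arm_neon.h. The helper name is made up, this has not been tested against Stereo Tool's code, and on non-NEON builds it simply falls back to the exact `std::sqrt`.

```cpp
#include <cmath>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define SKETCH_USE_NEON 1
#endif

// Approximate sqrt(x) for a single float by running the 2-lane estimate
// instructions and keeping only lane 0.
float approx_sqrt_one(float x) {
#ifdef SKETCH_USE_NEON
    float32x2_t v = vdup_n_f32(x);               // same value in both lanes
    float32x2_t r = vrecpe_f32(vrsqrte_f32(v));  // rcp(rsqrt(x)) per lane
    return vget_lane_f32(r, 0);                  // the second lane is ignored
#else
    return std::sqrt(x);                         // exact fallback without NEON
#endif
}
```

Whether this is actually faster than the normal sqrt on 32-bit ARM would still have to be measured.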

(I wonder what non-programmers are thinking of this discussion ;) For some reason it always feels really great if I find some insane optimization like this and manage to get a few % off of the CPU usage.)

Ugh. Ok, I've just discovered that my code is blocking vectorization on ARM.
Ugh #2: The Intel compiler *does* replace sqrt() by rsqrt with a lot of code (probably those Newton steps) after it. Which is still clearly a lot slower than the rcp(rsqrt) code, since I got a performance improvement with the Intel compiler.
Ugh #3: It blocks vectorization on Intel as well. Which is kinda odd since I *do* see a performance improvement. I think it's time to manually vectorize the code where possible again :( which isn't that bad but it adds a mess to the code. The good news is that this might mean that there's more to gain here.

