Great! That's much better than before.
I'll repeat my calculation with your settings:
- 512 / 64 = 8 -> 8 grains for audio to arrive
- 3 ms @ 48 kHz = 144 samples = 2.25 grains -> rounded up to 3 grains for audio to be processed
- 1 extra grain for output
Total: 12 grains x 64 samples = 768 samples = 16.0 ms latency @ 48 kHz.
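The breakdown above can be checked with a few lines of Python. This is a minimal sketch that assumes the processing time is rounded up to whole grains and that the buffer size is a whole multiple of the grain size. The function names are hypothetical and not part of any driver or SDK.

```python
import math

def latency_grains(buffer_samples, grain_samples, processing_ms, sample_rate_hz, output_grains=1):
    """Total latency in grains: arrival + processing + output (hypothetical helper)."""
    # Assumes the buffer is a whole multiple of the grain size (512 / 64 = 8).
    arrive = buffer_samples // grain_samples
    # Processing time in samples, then rounded UP to whole grains (2.25 -> 3).
    processing_samples = processing_ms * sample_rate_hz / 1000.0
    process = math.ceil(processing_samples / grain_samples)
    return arrive + process + output_grains

def grains_to_ms(grains, grain_samples, sample_rate_hz):
    """Convert a number of grains to milliseconds at the given sample rate."""
    return grains * grain_samples * 1000.0 / sample_rate_hz

grains = latency_grains(512, 64, 3.0, 48000)
print(grains, grains * 64, round(grains_to_ms(grains, 64, 48000), 2))
# -> 12 768 16.0
```

Rounding the processing time up rather than to the nearest grain is the conservative choice: a partially processed grain can't be output yet.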
According to the ASIO spec, one extra grain is apparently needed. That adds another 1.33 ms (64 samples / 48 kHz), giving 17.33 ms in total, which matches your findings exactly.
In my processing I'm throwing away one grain that I could have used, so it should be possible to reduce the ASIO latency by 1 grain (1.33 ms at your settings), down to 16 ms.
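The same arithmetic for the ASIO figures, as a minimal sketch. It assumes the 12-grain total from the breakdown, one extra grain required by the ASIO spec, and one grain that could be reclaimed from my processing. The variable names are illustrative only.

```python
def grains_to_ms(grains, grain_samples=64, sample_rate_hz=48000):
    # One 64-sample grain at 48 kHz is 64 / 48000 s, about 1.33 ms.
    return grains * grain_samples * 1000.0 / sample_rate_hz

base = 12         # arrival + processing + output grains
asio_extra = 1    # the extra grain the ASIO spec apparently requires
reclaimable = 1   # the grain currently thrown away in processing

current = grains_to_ms(base + asio_extra)
improved = grains_to_ms(base + asio_extra - reclaimable)
print(f"{current:.2f} ms -> {improved:.2f} ms")
# -> 17.33 ms -> 16.00 ms
```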
Fixing this is possible but a little tricky (it will need a lot of testing), so I'll postpone it to a later version.
[Note to self: it's possible to send data to an audio buffer AFTER I've returned control to the driver, if I call ASIOOutputReady() when I'm ready.]