I'm surprised they would quote latency rather than throughput - I don't think anyone else does, and you generally code to try to hide the latency.
Fafalada said: It's not strange, it's latency. The throughput is 16 cycles, just like you would expect, and you can always execute individual DOT4/MADDs in a single cycle if you prefer.
SimonF said: That strikes me as a very strange figure.
Anyway, in summary it can basically execute a MADD per clock (since that would then correspond to a matrix-vector mul throughput of 16 cycles).
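To make the arithmetic behind that summary concrete, here is a minimal sketch that writes a 4x4 matrix-vector multiply as scalar multiply-adds and counts them. The function name `matvec_madd` is made up for illustration, and "one scalar MADD issued per clock" is an assumption read off the 16-cycle figure, not something confirmed about the hardware.

```python
def matvec_madd(m, v):
    """Multiply a 4x4 matrix by a 4-vector using scalar multiply-adds.

    Returns the result vector and the number of MADDs performed.
    """
    out = [0.0] * 4
    madds = 0
    for row in range(4):
        acc = 0.0
        for col in range(4):
            acc = acc + m[row][col] * v[col]  # one multiply-add (MADD)
            madds += 1
        out[row] = acc
    return out, madds


identity = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
result, madds = matvec_madd(identity, [1.0, 2.0, 3.0, 4.0])

# Assumption: one scalar MADD is issued per clock, so cycles == MADD count.
cycles = madds
print(result, madds, cycles)
```

With 4 rows of 4 multiply-adds each, the count comes to 16, which is where the 16-cycle throughput would come from under that assumption.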
I'll take your word for it. I was just trying to get an idea of how it compares to a 'standard' DX/OGL vertex shader architecture.
The thing has a very impressive feature set though, it's not just about how fast it is (and it's quite fast, especially once you go beyond simple stuff like madds).
Again, that just sounds odd. A matrix-vector mul can be done as 4 DPs, so if you can do a DP in under 4 clocks (i.e. to beat the 16 clocks quoted by Fafalada), why would you use a suboptimal approach? It'd be madness.
Panajev2001a said: Simon, that would mean that 4x4 Matrix * 4x1 Vector does not push the VFPU to the max (there might be dedicated dot product instructions that yield that figure): for 333 MHz operation the CPU was and still is specced at 2.6 GFLOPS.
Secondly, mat-vec muls are "bread and butter" operations in a graphics pipeline so you'd want to be damned sure they ran as fast as possible.
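For anyone wanting to check the numbers being argued over, here is a rough sketch of the mat-vec as 4 dot products, plus the arithmetic relating the quoted 2.6 GFLOPS at 333 MHz to the 16-cycle figure. The names `dot4` and `matvec_dp` are illustrative only, and counting each multiply and each add as one FLOP is an assumption; how the 2.6 GFLOPS spec actually counts operations is not stated in the thread.

```python
CLOCK_HZ = 333e6    # clock quoted in the thread
PEAK_FLOPS = 2.6e9  # peak figure quoted in the thread


def dot4(a, b):
    """4-component dot product: 4 multiplies and 3 adds."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3]


def matvec_dp(m, v):
    """4x4 matrix times 4-vector, written as one dot product per row."""
    return [dot4(row, v) for row in m]


# Assumption: each multiply and each add counts as one FLOP.
FLOPS_PER_MATVEC = 4 * (4 + 3)  # 4 dot products, 7 FLOPs each

peak_flops_per_cycle = PEAK_FLOPS / CLOCK_HZ               # roughly 7.8
cycles_at_peak = FLOPS_PER_MATVEC / peak_flops_per_cycle   # roughly 3.6
rate_at_16_cycles = FLOPS_PER_MATVEC / 16 * CLOCK_HZ       # roughly 0.58e9

m = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 2.0, 0.0, 0.0],
     [0.0, 0.0, 3.0, 0.0],
     [0.0, 0.0, 0.0, 4.0]]
print(matvec_dp(m, [1.0, 1.0, 1.0, 1.0]))
print(peak_flops_per_cycle, cycles_at_peak, rate_at_16_cycles)
```

Under those assumptions, a mat-vec every 16 cycles works out to well under a quarter of the quoted peak, which is the gap both posts are pointing at: at peak rate the same 28 FLOPs would fit in roughly 4 cycles.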