Questioning the 8.6 TFLOPS figure for the ASUS ROG Ally Z1 Extreme version

1684061937794.png


There’s only one way to calculate theoretical performance, which is:

total number of cores (in this case 768 SPs) X 2 (each core performs 2 ops per clock) X clock speed (in MHz) / 1000000 (converting MFLOPS to TFLOPS) = 8.6 TFLOPS

Now let’s replace variables with actual numbers:

768 X 2 X Clk / 1000000 = 8.6

1536 X Clk = 8600000

Clk = 8600000 / 1536

Clk ≈ 5599 MHz


In order for the ROG Ally Z1 Extreme to hit 8.6 TFLOPS, the GPU would need to run at a clock speed of ~5.6 GHz. To my knowledge, that clock speed can't be reached even by liquid-cooled desktop GPUs, let alone a handheld GPU. Even allowing for boost clocks, this seems questionable. I'd appreciate your thoughts.
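
For anyone who wants to check the arithmetic, here is a minimal Python sketch of the same calculation. The function name `required_clock_mhz` is made up for illustration, and the default of 2 ops per clock is the same assumption used in the formula above.

```python
def required_clock_mhz(target_tflops: float, shaders: int, ops_per_clock: int = 2) -> float:
    """Clock speed in MHz needed to reach target_tflops.

    shaders * ops_per_clock * MHz gives MFLOPS,
    and 1 TFLOPS = 1,000,000 MFLOPS.
    """
    return target_tflops * 1_000_000 / (shaders * ops_per_clock)


clock = required_clock_mhz(8.6, 768)
print(f"{clock:.0f} MHz (~{clock / 1000:.2f} GHz)")  # 5599 MHz (~5.60 GHz)
```

Changing `ops_per_clock` shows how sensitive the implied clock is to what the vendor counts as one "op".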
 
They are very likely using FP16 calculations. It's an old trick.
That’s low. Still decent performance if sustainable. That’s like a Series S console in your hands. If the new-gen Switch is anywhere near that I’d be happy, but that’s a very big if.
 
Nintendo definitely has no plans to make something like that. Switch 2 will be around 2 TFLOPS at most. This thing is not only power hungry but also way more expensive than Nintendo would ever accept.
 

The Ally is throwing the max of everything you can currently fit into a handheld form factor. Nintendo saves on the screen, storage, and probably cooling, thanks to a better process node/Nvidia architecture. They also don't need to make money on the hardware.
 
There's an AMD slide going around confirming that the figure is FP16:
AMD-Ryzen-7040U-Slide-Deck-5-768x418.jpg


However, AFAIK RDNA3 supports double-rate FP16 for only one instruction: a dot product, primarily meant to accelerate ML.
If that's true, there is no longer any ALU win to expect from using FP16 over FP32 on RDNA3. Maybe somebody can confirm...

If so, the 'old trick' in this case borders on a marketing lie.
 
Surely it's because the RDNA 3 architecture uses dual-issue SIMD units (https://chipsandcheese.com/2023/01/07/microbenchmarking-amds-rdna-3-graphics-architecture/)? However, in practice the compiler seems to make very limited use of them, and the boost over RDNA 2 is negligible.
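
To see what the dual-issue explanation would imply, here is a rough Python sketch that redoes the clock calculation with an extra multiplier. The `issue_factor` parameter and the function names are hypothetical, and a factor of 2 is only an assumption that the advertised number counts every dual-issue slot as filled, which nothing in this thread confirms.

```python
def implied_clock_mhz(target_tflops: float, shaders: int,
                      ops_per_clock: int = 2, issue_factor: int = 1) -> float:
    """Clock (MHz) implied by a TFLOPS figure under a given issue factor."""
    # shaders * ops_per_clock * issue_factor * MHz gives MFLOPS
    return target_tflops * 1_000_000 / (shaders * ops_per_clock * issue_factor)


def peak_tflops(shaders: int, clock_mhz: float,
                ops_per_clock: int = 2, issue_factor: int = 1) -> float:
    """Theoretical peak in TFLOPS for a given clock and issue factor."""
    return shaders * ops_per_clock * issue_factor * clock_mhz / 1_000_000


# issue_factor=1: single issue; issue_factor=2: assumed perfect dual issue
for factor in (1, 2):
    mhz = implied_clock_mhz(8.6, 768, issue_factor=factor)
    print(f"issue_factor={factor}: {mhz:.0f} MHz")
```

Under that assumption the implied clock halves to roughly 2.8 GHz, but it is a best-case peak: it only holds if the compiler fills every dual-issue slot.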
RDNA 3 dual-issue (VOPD) makes the trade-off between Wave32 and Wave64 more nuanced. RDNA 2 preferred Wave32 for virtually every type of shader other than pixel shaders. On RDNA 3, Wave64 can be preferable for some shaders, since it's trivial to apply VOPD in that case, as opposed to doing so for Wave32 ...

With VOPD, the choice between Wave32 and Wave64 for shaders isn't as clear-cut as before ...
 