Questioning the 8.6 TFLOPS figure for the ASUS ROG Ally Z1 Extreme version



There’s only one way to calculate theoretical performance, which is:

total number of cores (in this case 768 SPs) × 2 (each core performs 2 ops per clock) × clock speed (in MHz) ÷ 1,000,000 (converting MFLOPS to TFLOPS) = 8.6 TFLOPS

Now let’s replace variables with actual numbers:

768 × 2 × Clk / 1,000,000 = 8.6

1536 × Clk = 8,600,000

Clk = 8,600,000 / 1536

Clk ≈ 5599 MHz
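The algebra above can be sanity-checked with a few lines of Python (the 768-shader count and 2 ops/clock are the figures from the post):

```python
# Solve for the GPU clock needed to reach 8.6 TFLOPS with 768 shaders
# each retiring one FP32 FMA (= 2 FLOPs) per clock.
shaders = 768          # stream processors in the Z1 Extreme's 12-CU GPU
ops_per_clock = 2      # one fused multiply-add counts as 2 FLOPs
target_tflops = 8.6

# shaders * ops_per_clock * clock_mhz / 1_000_000 = TFLOPS
clock_mhz = target_tflops * 1_000_000 / (shaders * ops_per_clock)
print(f"required clock: {clock_mhz:.0f} MHz")  # ~5599 MHz
```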


In order for the ROG Ally Z1 Extreme to hit 8.6 TFLOPS, the GPU would need to run at a clock speed of ~5.6 GHz. To my knowledge, that clock speed can't be reached even by liquid-cooled desktop GPUs, let alone a handheld GPU. Even accounting for boost clocks, this seems questionable. I'd appreciate your thoughts.
 
They are very likely using FP16 calculations. It's an old trick.
That’s low. Still decent performance if it’s sustainable. That’s like a Series S console in your hands. If the new-gen Switch is anywhere near that I’d be happy, but that’s a very big if.
 
That’s low. Still decent performance if it’s sustainable. That’s like a Series S console in your hands. If the new-gen Switch is anywhere near that I’d be happy, but that’s a very big if.
Nintendo definitely has no plans to make something like that. The Switch 2 will be around 2 TFLOPS at most. This thing is not only power-hungry but also way more expensive than Nintendo would ever accept.
 
Nintendo definitely has no plans to make something like that. The Switch 2 will be around 2 TFLOPS at most. This thing is not only power-hungry but also way more expensive than Nintendo would ever accept.

The Ally is throwing in the max of everything you can currently fit into a handheld form factor. Nintendo saves on screen, storage, and probably cooling thanks to a better process node/NVIDIA architecture. They also don't need to make money on the hardware.
 
They are very likely using FP16 calculations. It's an old trick.
There is some AMD slide around confirming it's fp16:
AMD-Ryzen-7040U-Slide-Deck-5-768x418.jpg


However, AFAIK RDNA 3 supports double-rate FP16 for only one instruction: a dot product, primarily meant to accelerate ML.
If that's true, there is no ALU win to expect from using FP16 over FP32 on RDNA 3 anymore. Maybe somebody can confirm...

So the 'old trick' in this case borders on a marketing lie, if so.
 
Surely it's because the RDNA 3 architecture uses dual-issue SIMD units (https://chipsandcheese.com/2023/01/07/microbenchmarking-amds-rdna-3-graphics-architecture/)? However, in practice the compiler seems to make very limited use of them, and the boost over RDNA 2 is negligible.
RDNA 3 dual-issue (VOPD) makes the trade-off between Wave32 and Wave64 more nuanced. RDNA 2 preferred Wave32 for virtually every type of shader outside of pixel shaders. On RDNA 3, Wave64 can be preferable for some shaders, since it's trivial to apply VOPD in that case, as opposed to doing so for Wave32 ...

With VOPD, choosing Wave32 vs. Wave64 for shaders isn't as clear-cut as it used to be ...
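The "limited compiler use" point can be illustrated with a toy model: dual issue doubles the peak only for the fraction of instructions the compiler actually pairs into VOPD slots. The base figure and pairing rates below are made-up illustrations, not measurements of any real RDNA 3 part:

```python
# Toy model: effective FP32 throughput under partial VOPD pairing.
# base_tflops and the pairing rates are illustrative assumptions.
def effective_tflops(base_tflops: float, vopd_pairing_rate: float) -> float:
    """Dual issue doubles peak throughput, but only for the paired fraction."""
    return base_tflops * (1.0 + vopd_pairing_rate)

base = 4.3  # hypothetical single-issue FP32 peak in TFLOPS
for rate in (0.0, 0.1, 1.0):
    print(f"pairing {rate:.0%}: {effective_tflops(base, rate):.2f} TFLOPS")
```

With 0% pairing you get the RDNA 2-like baseline; only at an unrealistic 100% pairing does the throughput actually double.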
 
This lists various packed 2×16-bit instructions, including V_PK_FMA_F16 (fused multiply-add), which would suggest the usual support for double-rate (packed) FP16.
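Under the packed-FP16 reading, the 8.6 TFLOPS marketing figure is easy to reproduce. A sketch, where the ~2.8 GHz boost clock is an inference from the figure itself rather than an official spec:

```python
# Reproduce 8.6 TFLOPS assuming double-rate (packed) FP16:
# each of the 768 shaders retires one packed FP16 FMA per clock,
# i.e. 2 FMAs = 4 half-precision FLOPs.
shaders = 768
flops_per_clock_fp16 = 4   # packed 2x FMA = 4 FLOPs per shader per clock
clock_ghz = 2.8            # assumed boost clock, implied by the 8.6 figure

tflops = shaders * flops_per_clock_fp16 * clock_ghz / 1000
print(f"{tflops:.1f} TFLOPS")  # ~8.6
```

That lands almost exactly on the advertised number, at a clock a handheld APU can plausibly boost to, which supports the FP16 explanation.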
How could the huge difference in teraflops between the AMD Ryzen Z1 and the AMD Ryzen Z1 Extreme be explained? Are they using FP32 for the Z1's figure and FP16 for the Z1 Extreme's?

I mentioned this in a different thread, but the apparent difference in the numbers (300% faster) isn't reflected in games.

DsFfg5m.png


bbWZsFG.png


QCILULk.png
 
How could the huge difference in teraflops between the AMD Ryzen Z1 and the AMD Ryzen Z1 Extreme be explained? Are they using FP32 for the Z1's figure and FP16 for the Z1 Extreme's?

I mentioned this in a different thread, but the apparent difference in the numbers (300% faster) isn't reflected in games.
The first slide you posted says the Z1 Extreme has 3x the CUs of the Z1. You don't see this difference in practice, probably because the chips are limited by power and memory bandwidth first.
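That CU ratio alone accounts for the paper numbers. A sketch, assuming the publicly listed shader counts (4 CUs/256 shaders for the Z1, 12 CUs/768 for the Z1 Extreme) and, purely for illustration, the same boost clock on both parts:

```python
# Theoretical FP16 throughput ratio between Z1 Extreme and Z1,
# assuming identical clocks (an illustrative assumption; real
# sustained clocks under a handheld power budget will differ).
def tflops(shaders: int, clock_ghz: float, flops_per_clock: int = 4) -> float:
    return shaders * flops_per_clock * clock_ghz / 1000

z1_extreme = tflops(shaders=768, clock_ghz=2.8)   # 12 CUs * 64 shaders
z1 = tflops(shaders=256, clock_ghz=2.8)           # 4 CUs * 64 shaders
print(f"Z1 Extreme: {z1_extreme:.1f} TFLOPS, Z1: {z1:.1f} TFLOPS, "
      f"ratio: {z1_extreme / z1:.1f}x")
```

So a 3x gap on paper needs no FP32-vs-FP16 mismatch between the two SKUs; it falls straight out of the shader counts.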
 
With that much of a bottleneck, I wonder whether just 8 CUs would still provide roughly the same performance as 12 CUs?
 
It should be noted that the above graphs are for 720p upscaled to 1080p via RSR.

Native 1080p numbers show a larger gap in some games, but a smaller one in others:

AMD Ryzen Z1 Series_Deck_Página_06_575px.jpg
 