Questioning the 8.6 TFLOPS figure for the ASUS ROG Ally Z1 Extreme version



There’s only one way to calculate theoretical performance, which is:

total number of cores (in this case 768 SPs) × 2 (each core performs 2 ops per clock) × clock speed (in MHz) ÷ 1,000,000 (converting MFLOPS to TFLOPS) = 8.6 TFLOPS

Now let’s replace variables with actual numbers:

768 × 2 × Clk / 1,000,000 = 8.6

1536 × Clk = 8,600,000

Clk = 8,600,000 / 1536

Clk ≈ 5599 MHz
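The algebra above can be sanity-checked with a few lines of Python (the 768-shader count and 2 ops/clock are the figures from the post):

```python
# Solve for the GPU clock needed to reach 8.6 TFLOPS with 768 shaders
# each retiring one FP32 FMA (= 2 FLOPs) per clock.
shaders = 768          # stream processors in the Z1 Extreme's 12-CU GPU
ops_per_clock = 2      # one fused multiply-add counts as 2 FLOPs
target_tflops = 8.6

# shaders * ops_per_clock * clock_mhz / 1_000_000 = TFLOPS
clock_mhz = target_tflops * 1_000_000 / (shaders * ops_per_clock)
print(f"required clock: {clock_mhz:.0f} MHz")  # ~5599 MHz
```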


In order for the ROG Ally Z1 Extreme to hit 8.6 TFLOPS, the GPU would need to run at a clock speed of ~5.6 GHz. To my knowledge, that clock speed can't be reached even by liquid-cooled desktop GPUs, let alone a handheld GPU. Even accounting for boost clocks, this seems questionable. I'd appreciate your thoughts.
 
They are very likely using FP16 calculations. It's an old trick.
That’s low. Still decent performance if it’s sustainable. That’s like a Series S console in your hands. If the new-gen Switch is anywhere near that I’d be happy, but that’s a very big if.
 
That’s low. Still decent performance if it’s sustainable. That’s like a Series S console in your hands. If the new-gen Switch is anywhere near that I’d be happy, but that’s a very big if.
Nintendo definitely has no plans to make something like that. The Switch 2 will be around 2 TFLOPS at most. This thing is not only power-hungry but also way more expensive than Nintendo would ever accept.
 
Nintendo definitely has no plans to make something like that. The Switch 2 will be around 2 TFLOPS at most. This thing is not only power-hungry but also way more expensive than Nintendo would ever accept.

The Ally is throwing in the max of everything you can currently fit into a handheld form factor. Nintendo saves on screen, storage, and probably cooling thanks to a better process node/NVIDIA architecture. They also don't need to make money on the hardware.
 
They are very likely using FP16 calculations. It's an old trick.
There is some AMD slide around confirming it's fp16:
AMD-Ryzen-7040U-Slide-Deck-5-768x418.jpg


However, AFAIK RDNA 3 supports double-rate FP16 for only one instruction: a dot product, primarily meant to accelerate ML.
If that's true, there is no ALU win to expect from using FP16 over FP32 on RDNA 3 anymore. Maybe somebody can confirm...

So the 'old trick' in this case borders on a marketing lie, if so.
 
Surely it's because the RDNA 3 architecture uses dual-issue SIMD units (https://chipsandcheese.com/2023/01/07/microbenchmarking-amds-rdna-3-graphics-architecture/)? However, in practice the compiler seems to make very limited use of them, and the boost over RDNA 2 is negligible.
RDNA 3 dual-issue (VOPD) makes the trade-off between Wave32 and Wave64 more nuanced. RDNA 2 preferred Wave32 for virtually every type of shader outside of pixel shaders. On RDNA 3, Wave64 can be preferable for some shaders, since it's trivial to apply VOPD in that case, as opposed to doing so for Wave32 ...

With VOPD, choosing Wave32 vs. Wave64 for shaders isn't as clear-cut as it used to be ...
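The "limited compiler use" point can be illustrated with a toy model: dual issue doubles the peak only for the fraction of instructions the compiler actually pairs into VOPD slots. The base figure and pairing rates below are made-up illustrations, not measurements of any real RDNA 3 part:

```python
# Toy model: effective FP32 throughput under partial VOPD pairing.
# base_tflops and the pairing rates are illustrative assumptions.
def effective_tflops(base_tflops: float, vopd_pairing_rate: float) -> float:
    """Dual issue doubles peak throughput, but only for the paired fraction."""
    return base_tflops * (1.0 + vopd_pairing_rate)

base = 4.3  # hypothetical single-issue FP32 peak in TFLOPS
for rate in (0.0, 0.1, 1.0):
    print(f"pairing {rate:.0%}: {effective_tflops(base, rate):.2f} TFLOPS")
```

With 0% pairing you get the RDNA 2-like baseline; only at an unrealistic 100% pairing does the throughput actually double.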
 
This lists various packed 2×16-bit instructions, including V_PK_FMA_F16 (fused multiply-add), which would suggest the usual support for double-rate (packed) FP16.
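Under the packed-FP16 reading, the 8.6 TFLOPS marketing figure is easy to reproduce. A sketch, where the ~2.8 GHz boost clock is an inference from the figure itself rather than an official spec:

```python
# Reproduce 8.6 TFLOPS assuming double-rate (packed) FP16:
# each of the 768 shaders retires one packed FP16 FMA per clock,
# i.e. 2 FMAs = 4 half-precision FLOPs.
shaders = 768
flops_per_clock_fp16 = 4   # packed 2x FMA = 4 FLOPs per shader per clock
clock_ghz = 2.8            # assumed boost clock, implied by the 8.6 figure

tflops = shaders * flops_per_clock_fp16 * clock_ghz / 1000
print(f"{tflops:.1f} TFLOPS")  # ~8.6
```

That lands almost exactly on the advertised number, at a clock a handheld APU can plausibly boost to, which supports the FP16 explanation.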
How could the huge difference in teraflops between the AMD Ryzen Z1 and the AMD Ryzen Z1 Extreme be explained? Are they using FP32 for the Z1's figure and FP16 for the Z1 Extreme's?

I mentioned this in a different thread, but the apparent difference in the numbers (300% faster) isn't reflected in games.

DsFfg5m.png


bbWZsFG.png


QCILULk.png
 
How could the huge difference in teraflops between the AMD Ryzen Z1 and the AMD Ryzen Z1 Extreme be explained? Are they using FP32 for the Z1's figure and FP16 for the Z1 Extreme's?

I mentioned this in a different thread, but the apparent difference in the numbers (300% faster) isn't reflected in games.
The first slide you posted says the Z1 Extreme has 3x the CUs of the Z1. You don't see this difference in practice, probably because the chips are limited by power and memory bandwidth first.
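That CU ratio alone accounts for the paper numbers. A sketch, assuming the publicly listed shader counts (4 CUs/256 shaders for the Z1, 12 CUs/768 for the Z1 Extreme) and, purely for illustration, the same boost clock on both parts:

```python
# Theoretical FP16 throughput ratio between Z1 Extreme and Z1,
# assuming identical clocks (an illustrative assumption; real
# sustained clocks under a handheld power budget will differ).
def tflops(shaders: int, clock_ghz: float, flops_per_clock: int = 4) -> float:
    return shaders * flops_per_clock * clock_ghz / 1000

z1_extreme = tflops(shaders=768, clock_ghz=2.8)   # 12 CUs * 64 shaders
z1 = tflops(shaders=256, clock_ghz=2.8)           # 4 CUs * 64 shaders
print(f"Z1 Extreme: {z1_extreme:.1f} TFLOPS, Z1: {z1:.1f} TFLOPS, "
      f"ratio: {z1_extreme / z1:.1f}x")
```

So a 3x gap on paper needs no FP32-vs-FP16 mismatch between the two SKUs; it falls straight out of the shader counts.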
 
With that much of a bottleneck, I wonder whether just 8 CUs would still provide roughly the same performance as 12 CUs?
 
It should be noted that the above graphs are for 720p upscaled to 1080p via RSR.

Native 1080p numbers show a larger gap in some games, but a smaller one in others:

AMD Ryzen Z1 Series_Deck_Página_06_575px.jpg
 