Apple (PowerVR) TBDR GPU-architecture speculation thread

I am not a big fan of GFXBench as I don't know how representative it is, but the early results are impressive.

https://gfxbench.com/compare.jsp?be...=NVIDIA+GeForce+GTX+1650+Ti+with+Max-Q+Design

I'd say even if the M1 is "only" comparable to an MX450 in graphics, it's a great success for Apple.

So, it's some kind of Apple take on IMG's Series A? 128 ECUs and 8* TMUs per core (cluster) and up to a 1.3 GHz clock?

* or 4 TMUs with a doubled clock?

Based on the diagram in this tech talk (05:10), I would assume that there are four 32-wide ALUs for every GPU core.

Even if Apple's claimed 2.6 teraflops uses half floats, 1.3 teraflops of single floats at ~10 watts is pretty amazing!

I'm fairly certain they are talking about the FP32 performance.
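
For what it's worth, the claimed 2.6 TFLOPS is consistent with FP32 if you take the 8-core, 4x 32-wide-ALU layout from the tech talk above at face value. A back-of-the-envelope sketch (the ~1.28 GHz sustained clock is my assumption):

Code:
// Rough FP32 throughput estimate from the figures quoted above.
// Assumptions: 8 GPU cores, 4 SIMDs per core, 32 lanes per SIMD,
// one FMA counted as 2 FLOPs, and a sustained clock of ~1.28 GHz (my guess).
let lanes = 8 * 4 * 32                    // 1024 FP32 lanes in total
let clockHz = 1.28e9                      // assumed GPU clock
let tflops = Double(lanes) * 2.0 * clockHz / 1e12
print(tflops)                             // ~2.6 TFLOPS, i.e. the headline number fits FP32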
 
GFXBench results can be useful indicators if one knows how to avoid the pitfalls of some of the synthetic tests. Thanks for that one.
 
What does Rosetta translate? Is it only x86 MacOS applications?
Could the MacOS benchmarks/games be using a higher rate of FP16 shaders than the Windows ones, like the iOS and Android versions?


Also, the GPU is using the SoC's 16MB SLC just like Navi 21 uses Infinity Cache, right? AMD's slide on LLC hit rates in typical gaming scenarios points to 16MB getting around 25-30% hit rates at 1080p:


[Attached slide: AMD LLC hit rates in typical gaming scenarios]




What's the theoretical SLC bandwidth on the M1? Have any measurements been made?



Isn't this a super weird wording to use?
"FP32 relative to FP16 increased"... sounds like 2*FP16 throughput is just gone from the M1 GPU cores, but the guy just bent over backwards to make this a positive statement (why though? performance seems excellent anyways).
Maybe a good part of the reason for having 2*FP16 on the GPU was for ML tasks which are now being diverted towards the new neural engine.
 
What does Rosetta translate? Is it only x86 MacOS applications?
Could the MacOS benchmarks/games be using a higher rate of FP16 shaders than the Windows ones, like the iOS and Android versions?

It translates x86-64 code to ARM64 code so that apps can run on the ARM CPU.

It doesn't seem like there is much point in degrading shader precision on M1 chips given the fact that FP32 and FP16 have identical performance.

"FP32 relative to FP16 increased"... sounds like 2*FP16 throughput is just gone from the M1 GPU cores, but the guy just bent over backwards to make this a positive statement (why though? performance seems excellent anyways).
Maybe a good part of the reason for having 2*FP16 on the GPU was for ML tasks which are now being diverted towards the new neural engine.

I tend to interpret it as Apple's mobile GPUs having FP16 ALUs that can also compute FP32 operations at 1/2 the throughput rate, and the M1 now "upgrades" those to handle FP32 at the same rate. Kind of similar to what Nvidia did with Ampere (they upgraded the dedicated INT32 unit to an FP32+INT32 one).
 
It doesn't seem like there is much point in degrading shader precision on M1 chips given the fact that FP32 and FP16 have identical performance.
FP16 shaders, if handled natively by the ALUs (i.e. not promoted to FP32), occupy a lower footprint on the caches. This means there's lower power consumed on memory transactions and a higher number of operations occurring within the caches which increases effective bandwidth and performance.

For example, Polaris / GCN4 introduced native FP16 instructions which already claimed some improvements in performance and power efficiency, even though the architecture doesn't do RPM.

[Attached slide: AMD Polaris native FP16 support]




For a GPU with low bandwidth to system RAM (~68GB/s) but with access to a large LLC cache like the M1's, using a larger proportion of FP16 (as it is often the case with Android/iOS games and benchmarks) could make a considerable difference.
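
To make the footprint point concrete, here is a minimal sketch of my own (hypothetical kernel, MSL source held in a Swift string) where the working data is half4 instead of float4, so each element moves 8 bytes instead of 16 through the caches:

Code:
// Hypothetical example: a simple blend written against half4 buffers.
// Each element is 8 bytes instead of 16, so more of the working set
// stays resident in the SLC / GPU caches and less bandwidth is spent per texel.
let fp16KernelSource = """
#include <metal_stdlib>
using namespace metal;

kernel void blend_fp16(device const half4 *a   [[buffer(0)]],
                       device const half4 *b   [[buffer(1)]],
                       device half4       *out [[buffer(2)]],
                       uint id [[thread_position_in_grid]])
{
    out[id] = a[id] * half(0.5) + b[id] * half(0.5);
}
"""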
 
FP16 shaders, if handled natively by the ALUs (i.e. not promoted to FP32), occupy a lower footprint on the caches. This means there's lower power consumed on memory transactions and a higher number of operations occurring within the caches which increases effective bandwidth and performance.

For example, Polaris / GCN4 introduced native FP16 instructions which already claimed some improvements in performance and power efficiency, even though the architecture doesn't do RPM.

[Attached slide: AMD Polaris native FP16 support]

I completely agree with you that there are sizable benefits of having fast FP16 operations even on desktop GPUs. This is definitely not something I am debating. I am just pointing out that M1 appears to have identical throughput for both FP32 and FP16 operations, which I hypothesize is due to Apple "upgrading" the FP32 execution, not downgrading the FP16. Again, I think we both agree that one would have preferred to have double-rate FP16 with the current FP32 throughput, but I guess that this is not what Apple was ready to do in this iteration.

The key to testing this hypothesis is to check FP16 and FP32 throughput of the A14 GPU in comparison to M1.
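
In case anyone wants to reproduce that kind of test, here's a rough sketch of the usual approach (my own, not any actual benchmark from this thread): an ALU-bound kernel built around a long dependent FMA chain, compiled once with float and once with half, then timed.

Code:
// Hypothetical throughput kernel: a dependent FMA chain keeps the SIMDs busy,
// so the measured time reflects arithmetic rate rather than memory bandwidth.
// Build once with T = float and once with T = half and compare the timings.
let fmaKernelSource = """
#include <metal_stdlib>
using namespace metal;

#ifndef T
#define T float
#endif

kernel void fma_chain(device T *out [[buffer(0)]],
                      uint id [[thread_position_in_grid]])
{
    T a = T(id % 7) * T(0.125) + T(1.0);
    for (uint i = 0; i < 4096; ++i) {
        a = fma(a, T(0.999), T(0.001));   // one dependent FMA per iteration
    }
    out[id] = a;                          // keep the result so the loop isn't optimized away
}
"""
// Host side (not shown): dispatch a large grid, time via MTLCommandBuffer's
// gpuStartTime/gpuEndTime, and compute FLOPs as threads * 4096 * 2 / elapsed.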

For a GPU with low bandwidth to system RAM (~68GB/s) but with access to a large LLC cache like the M1's, using a larger proportion of FP16 (as it is often the case with Android/iOS games and benchmarks) could make a considerable difference.

To continue your line of thought, there is a difference between data storage and data processing. Using lower-precision data for storage where appropriate is an old and very common optimization technique (a trivial example is using byte-sized integers to encode color channels, or packed FP encodings with a shared exponent for HDR content). As GPUs can convert between data representations for "free", all GPUs can benefit from using reduced precision for data storage, regardless of whether they have special hardware to process reduced-precision data faster.
 
Yes, storing data as FP16 in memory is commonly done even on GPUs that have no FP16 ALUs. Having same-rate FP16 operations doesn't make individual operations faster, but it reduces the register footprint, which can be a very significant - yet hard to quantify - benefit.
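
A trivial sketch of that storage-vs-compute split on the CPU side (the same idea applies in a shader): Swift's Float16 as the storage type, Float for the arithmetic. Names here are just for illustration.

Code:
// Reduced precision for storage, full precision for the math.
// (Float16 is available in Swift on Apple Silicon / arm64 targets.)
var stored: [Float16] = [0.125, 0.5, 0.75, 1.0]   // half the memory of [Float]

let sum = stored.reduce(Float(0)) { acc, h in
    acc + Float(h)              // widen to FP32 for the computation
}
stored.append(Float16(sum))     // round back down to FP16 only when storing
print(sum, stored)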
 
Isn't this a super weird wording to use?
"FP32 relative to FP16 increased"... sounds like 2*FP16 throughput is just gone from the M1 GPU cores, but the guy just bent over backwards to make this a positive statement (why though? performance seems excellent anyways).
Maybe a good part of the reason for having 2*FP16 on the GPU was for ML tasks which are now being diverted towards the new neural engine.

Update on this: new benchmark results are up, including A14 results (iPhone 12):

https://www.realworldtech.com/forum/?threadid=197759&curpostid=197985

To summarise: normalised per GPU core, A14 has exactly half the FP32 throughput of the M1, while their FP16 throughput is identical. My conclusion is that Apple indeed managed to improve the FP32 rate rather than dropping the FP16 rate. I also don't think that Apple GPU ever had 2*FP16, just 0.5*FP32...

P.S. (in case you have not guessed already, yes, I am the guy who wrote these benchmarks)
 
Update on this: new benchmark results are up, including A14 results (iPhone 12):

https://www.realworldtech.com/forum/?threadid=197759&curpostid=197985

To summarise: normalised per GPU core, A14 has exactly half the FP32 throughput of the M1, while their FP16 throughput is identical. My conclusion is that Apple indeed managed to improve the FP32 rate rather than dropping the FP16 rate. I also don't think that Apple GPU ever had 2*FP16, just 0.5*FP32...

P.S. (in case you have not guessed already, yes, I am the guy who wrote these benchmarks)
Can you run, e.g., 3DMark Wild Life for iOS on your M1 system? Or GFXBench 5 for iOS vs. for macOS (not M1 native)?

Edit:

https://www.notebookcheck.net/Apple...s-Intel-and-AMD.508057.0.html#toc-performance

Here are some Wild Life results; all M1 Macs outperform the iPad Air 2020 with the Apple A14.

FP16 vs. FP32 and so on: https://forum.beyond3d.com/threads/native-fp16-support-in-gpu-architectures.56180/

I also don't think that Apple GPU ever had 2*FP16, just 0.5*FP32...

IMG added "faster fp16 than fp32" capabilities with PowerVR Series6XT/Rogue GX (Apple A8 etc), https://www.tomshardware.com/reviews/apple-iphone-6s-6s-plus,4437-7.html
https://www.anandtech.com/show/7793/imaginations-powervr-rogue-architecture-exposed/3

Next edit:

I'm trying to understand your numbers, and it looks to me like an M1 GPU core is not the same as an Apple A14 GPU core. I read your numbers as saying that an A14 GPU core has 64 FP32 EUs (and 128 FP16 EUs).
 
Could you clarify what you mean by that? How are those two descriptions different, what is the baseline?

For full disclosure, I have no idea how these things can be implemented in hardware. It was pointed out (https://www.realworldtech.com/forum/?threadid=197759&curpostid=197993) that fusing two FP16 ALUs to perform a FP32 operation or splitting a single FP32 ALU to perform two FP16 operations per clock are functionally equivalent, something I was not aware of.

What I mean essentially is that Apple GPUs are exposed as having a SIMD width of 32 on both the A14 and M1 (the core is understood to have four SIMD units for a total width of 128). The SIMD width is a constant regardless of whether you are running FP16 or FP32 operations, and it supports all the regular bells and whistles like SIMD lane masking and various kinds of SIMD broadcast and reduction operations. Regardless of how it is implemented (many people here will probably have a much better idea than me), the end result is that the A14 (mobile chips) can run FP16 at full speed and FP32 at half the speed (whether they achieve it by using two FP16 ALUs or by using the same FP16 ALU over two cycles or something else is just an implementation detail IMO), while the M1 seems to have wider ALUs capable of doing FP32 FMA at full rate, without increasing its relative FP16 rate. I would further speculate that the M1 maybe does not have any special support for FP16 at all and simply does an FP32 computation while widening FP16 inputs and rounding the output back to FP16.

I think what my results primarily suggest is that Apple does not have separate FP16 and FP32 units like some other implementations, they reuse (parts of) the same ALUs to perform computations with various precision.
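
The exposed width is easy to confirm from the API side; here is a minimal Swift sketch, assuming you already have some compute kernel to build a pipeline from ("myKernel" is just a placeholder name):

Code:
import Metal

// The execution width reported here is 32 on both A14 and M1; the FP16 vs FP32
// rate difference discussed above is invisible at this level and only shows up
// in actual throughput measurements.
let device = MTLCreateSystemDefaultDevice()!
let library = device.makeDefaultLibrary()!
let function = library.makeFunction(name: "myKernel")!   // placeholder kernel name
let pipeline = try! device.makeComputePipelineState(function: function)
print(pipeline.threadExecutionWidth)                     // SIMD-group width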

Can you run, e.g., 3DMark Wild Life for iOS on your M1 system? Or GFXBench 5 for iOS vs. for macOS (not M1 native)?

https://www.notebookcheck.net/Apple...s-Intel-and-AMD.508057.0.html#toc-performance

There are a lot of results available for these benchmarks already. Is there something specific that you are looking for? I don't think it's surprising that the M1 Macs outperform the iPad; they have twice as many GPU cores...


If I interpret the Rogue diagrams correctly, they seem to suggest that IMG has separate FP32 and FP16 ALUs that can execute operations in parallel. My results for Apple GPUs suggest that they are much "simpler". I don't believe they have separate FP16 and FP32 ALUs at all, just some ALUs that can run either FP32, FP16 or integer operations.

I'm trying to understand your numbers, and it looks to me like an M1 GPU core is not the same as an Apple A14 GPU core. I read your numbers as saying that an A14 GPU core has 64 FP32 EUs (and 128 FP16 EUs).

My point is that an M1 GPU core is physically wider, as it can do 128 FP32 FMA operations OR 128 FP16 operations per clock. An A14 core can do 128 FP16 FMA operations per clock OR 64 FP32 FMA operations per clock. As I wrote above, I have no idea how this is actually implemented in hardware. Also, looking at die shots, the M1 GPU core is physically much larger than an A14 core.
 
...



There are a lot of results available for these benchmarks already. Is there something specific that you are looking for? I don't think it's surprising that the M1 Macs outperform the iPad; they have twice as many GPU cores...



...
Nothing specific; I have edited my posting several times and forgot to highlight my edit with the NBC link.
 
The microarchitecture of the GPUs in the M1 and A14 should be identical; they both belong to
Code:
MTLGPUFamilyApple7

It's interesting to note that Apple just added a DXR-like raytracing API to Metal (i.e. not a shader library like MPS); maybe Apple's own raytracing hardware is on the way. You can find the relevant WWDC talk here: https://developer.apple.com/videos/play/wwdc2020/10012/ , and it even comes with example code.
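
For reference, the runtime checks for all of this are just standard MTLDevice queries; a quick Swift sketch:

Code:
import Metal

// Both A14 and M1 report the same feature-set family; the other two queries
// cover the new raytracing API and the function-pointer features it builds on.
if let device = MTLCreateSystemDefaultDevice() {
    print(device.supportsFamily(.apple7))    // true on both A14 and M1
    print(device.supportsRaytracing)         // Metal ray tracing support
    print(device.supportsFunctionPointers)   // shader function pointers
}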
 
The microarchitecture of the GPUs in the M1 and A14 should be identical; they both belong to
Code:
MTLGPUFamilyApple7

Depends on what you mean by "microarchitecture". They have an identical feature set, yes, but there should be little doubt that their ALUs are physically different (different FP32 compute throughput, different size on die).

It's interesting to note that Apple just added a DXR-like raytracing API to Metal (i.e. not a shader library like MPS); maybe Apple's own raytracing hardware is on the way. You can find the relevant WWDC talk here: https://developer.apple.com/videos/play/wwdc2020/10012/ , and it even comes with example code.

What's even more interesting is that Metal RT works across all the GPUs, and related features such as shader function pointers, late function binding and recursive function calls can be used independently. I haven't tried out the API yet, but it looks quite nice to me, and it is certainly designed with hardware-accelerated RT in mind.
 
Depends on what you mean by "microarchitecture". They have an identical feature set, yes, but there should be little doubt that their ALUs are physically different (different FP32 compute throughput, different size on die).
Based on TechInsights' analysis (link), the M1 GPU core is double the size of an A14 GPU core.
 
Depends on what you mean by "microarchitecture". They have an identical feature set, yes, but there should be little doubt that their ALUs are physically different (different FP32 compute throughput, different size on die).
By microarchitecture I was referring to the design of each "core". And do we know for sure that the M1's single-precision throughput per core is better than the A14's?

What's even more interesting is that Metal RT works across all the GPUs, and related features such as shader function pointers, late function binding and recursive function calls can be used independently. I haven't tried out the API yet, but it looks quite nice to me, and it is certainly designed with hardware-accelerated RT in mind.
Indeed! IIRC for D3D the GPU hardware needs to be D3D12_FEATURE_LEVEL_12_2 compliant to support callable shaders. And I'm personally a fan of Metal Shading Language as well.
 
One thing I'm confused about: if Apple's (and Imagination's) TBDR GPUs are so awesomely power efficient, why isn't everyone doing it? What are the inherent limitations of a TBDR GPU?
 