Apple (PowerVR) TBDR GPU-architecture speculation thread

I am not a big fan of GFXBench as I don't know how representative it is, but the early results are impressive.

https://gfxbench.com/compare.jsp?be...=NVIDIA+GeForce+GTX+1650+Ti+with+Max-Q+Design

I'd say even if the M1 is "only" comparable to an MX450 in graphics, it's a great success for Apple.

So, it's some kind of Apple take on IMG's Series A? 128 ECUs and 8* TMUs per core (cluster) and up to a 1.3 GHz clock?

* or 4 TMUs with a doubled clock?

Based on the diagram in this tech talk (05:10), I would assume that there are four 32-wide ALUs for every GPU core.

Even if Apple's claimed 2.6 teraflops uses half floats, 1.3 teraflops of single floats at ~10 watts is pretty amazing!

I'm fairly certain they are talking about the FP32 performance.
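
For what it's worth, the claimed 2.6 TFLOPS is consistent with FP32 if you take the 8-core, 4x 32-wide-ALU layout from the tech talk above at face value. A back-of-the-envelope sketch (the ~1.28 GHz sustained clock is my assumption):

Code:
// Rough FP32 throughput estimate from the figures quoted above.
// Assumptions: 8 GPU cores, 4 SIMDs per core, 32 lanes per SIMD,
// one FMA counted as 2 FLOPs, and a sustained clock of ~1.28 GHz (my guess).
let lanes = 8 * 4 * 32                    // 1024 FP32 lanes in total
let clockHz = 1.28e9                      // assumed GPU clock
let tflops = Double(lanes) * 2.0 * clockHz / 1e12
print(tflops)                             // ~2.6 TFLOPS, i.e. the headline number fits FP32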
 
GFXBench results can be useful indicators if one knows how to avoid the pitfalls of some of the synthetic tests. Thanks for that one.
 
What does Rosetta translate? Is it only x86 MacOS applications?
Could the MacOS benchmarks/games be using a higher rate of FP16 shaders than the Windows ones, like the iOS and Android versions?


Also, the GPU is using the SoC's 16MB SLC just like Navi 21 uses Infinity Cache, right? AMD's slide on LLC hit rates in typical gaming scenarios points to 16MB getting around 25-30% hit rates at 1080p:


[Attached slide: AMD LLC hit rates in typical gaming scenarios]




What's the theoretical SLC bandwidth on the M1? Have any measurements been made?



Isn't this a super weird wording to use?
"FP32 relative to FP16 increased"... sounds like 2*FP16 throughput is just gone from the M1 GPU cores, but the guy just bent over backwards to make this a positive statement (why though? performance seems excellent anyways).
Maybe a good part of the reason for having 2*FP16 on the GPU was for ML tasks which are now being diverted towards the new neural engine.
 
What does Rosetta translate? Is it only x86 MacOS applications?
Could the MacOS benchmarks/games be using a higher rate of FP16 shaders than the Windows ones, like the iOS and Android versions?

It translates x86-64 code to ARM64 code so that apps can run on the ARM CPU.

It doesn't seem like there is much point in degrading shader precision on M1 chips given the fact that FP32 and FP16 have identical performance.

"FP32 relative to FP16 increased"... sounds like 2*FP16 throughput is just gone from the M1 GPU cores, but the guy just bent over backwards to make this a positive statement (why though? performance seems excellent anyways).
Maybe a good part of the reason for having 2*FP16 on the GPU was for ML tasks which are now being diverted towards the new neural engine.

I tend to interpret it as Apple's mobile GPUs having FP16 ALUs that can also compute FP32 operations at 1/2 the throughput rate, and the M1 now "upgrades" those to handle FP32 at the same rate. Kind of similar to what Nvidia did with Ampere (they upgraded the dedicated INT32 unit to an FP32+INT32 one).
 
It doesn't seem like there is much point in degrading shader precision on M1 chips given the fact that FP32 and FP16 have identical performance.
FP16 shaders, if handled natively by the ALUs (i.e. not promoted to FP32), occupy a lower footprint on the caches. This means there's lower power consumed on memory transactions and a higher number of operations occurring within the caches which increases effective bandwidth and performance.

For example, Polaris / GCN4 introduced native FP16 instructions which already claimed some improvements in performance and power efficiency, even though the architecture doesn't do RPM.

[Attached slide: AMD Polaris native FP16 support]




For a GPU with low bandwidth to system RAM (~68GB/s) but with access to a large LLC cache like the M1's, using a larger proportion of FP16 (as it is often the case with Android/iOS games and benchmarks) could make a considerable difference.
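
To make the footprint point concrete, here is a minimal sketch of my own (hypothetical kernel, MSL source held in a Swift string) where the working data is half4 instead of float4, so each element moves 8 bytes instead of 16 through the caches:

Code:
// Hypothetical example: a simple blend written against half4 buffers.
// Each element is 8 bytes instead of 16, so more of the working set
// stays resident in the SLC / GPU caches and less bandwidth is spent per texel.
let fp16KernelSource = """
#include <metal_stdlib>
using namespace metal;

kernel void blend_fp16(device const half4 *a   [[buffer(0)]],
                       device const half4 *b   [[buffer(1)]],
                       device half4       *out [[buffer(2)]],
                       uint id [[thread_position_in_grid]])
{
    out[id] = a[id] * half(0.5) + b[id] * half(0.5);
}
"""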
 
FP16 shaders, if handled natively by the ALUs (i.e. not promoted to FP32), occupy a lower footprint on the caches. This means there's lower power consumed on memory transactions and a higher number of operations occurring within the caches which increases effective bandwidth and performance.

For example, Polaris / GCN4 introduced native FP16 instructions which already claimed some improvements in performance and power efficiency, even though the architecture doesn't do RPM.

[Attached slide: AMD Polaris native FP16 support]

I completely agree with you that there are sizable benefits of having fast FP16 operations even on desktop GPUs. This is definitely not something I am debating. I am just pointing out that M1 appears to have identical throughput for both FP32 and FP16 operations, which I hypothesize is due to Apple "upgrading" the FP32 execution, not downgrading the FP16. Again, I think we both agree that one would have preferred to have double-rate FP16 with the current FP32 throughput, but I guess that this is not what Apple was ready to do in this iteration.

The key to testing this hypothesis is to check FP16 and FP32 throughput of the A14 GPU in comparison to M1.
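
In case anyone wants to reproduce that kind of test, here's a rough sketch of the usual approach (my own, not any actual benchmark from this thread): an ALU-bound kernel built around a long dependent FMA chain, compiled once with float and once with half, then timed.

Code:
// Hypothetical throughput kernel: a dependent FMA chain keeps the SIMDs busy,
// so the measured time reflects arithmetic rate rather than memory bandwidth.
// Build once with T = float and once with T = half and compare the timings.
let fmaKernelSource = """
#include <metal_stdlib>
using namespace metal;

#ifndef T
#define T float
#endif

kernel void fma_chain(device T *out [[buffer(0)]],
                      uint id [[thread_position_in_grid]])
{
    T a = T(id % 7) * T(0.125) + T(1.0);
    for (uint i = 0; i < 4096; ++i) {
        a = fma(a, T(0.999), T(0.001));   // one dependent FMA per iteration
    }
    out[id] = a;                          // keep the result so the loop isn't optimized away
}
"""
// Host side (not shown): dispatch a large grid, time via MTLCommandBuffer's
// gpuStartTime/gpuEndTime, and compute FLOPs as threads * 4096 * 2 / elapsed.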

For a GPU with low bandwidth to system RAM (~68GB/s) but with access to a large LLC cache like the M1's, using a larger proportion of FP16 (as it is often the case with Android/iOS games and benchmarks) could make a considerable difference.

To continue your line of thought, there is a difference between data storage and data processing. Using lower-precision data for storage where appropriate is an old and very common optimization technique (a trivial example is using byte-sized integers to encode color channels, or packed FP encodings with a shared exponent for HDR content). As GPUs can convert between data representations for "free", all GPUs can benefit from using reduced precision for data storage, regardless of whether they have special hardware to process reduced-precision data faster.
 
Yes, storing data as FP16 in memory is commonly done even on GPUs that have no FP16 ALUs. Having same-rate FP16 operations doesn't make individual operations faster, but it reduces the register footprint, which can be a very significant - yet hard to quantify - benefit.
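
A trivial sketch of that storage-vs-compute split on the CPU side (the same idea applies in a shader): Swift's Float16 as the storage type, Float for the arithmetic. Names here are just for illustration.

Code:
// Reduced precision for storage, full precision for the math.
// (Float16 is available in Swift on Apple Silicon / arm64 targets.)
var stored: [Float16] = [0.125, 0.5, 0.75, 1.0]   // half the memory of [Float]

let sum = stored.reduce(Float(0)) { acc, h in
    acc + Float(h)              // widen to FP32 for the computation
}
stored.append(Float16(sum))     // round back down to FP16 only when storing
print(sum, stored)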
 
Isn't this a super weird wording to use?
"FP32 relative to FP16 increased"... sounds like 2*FP16 throughput is just gone from the M1 GPU cores, but the guy just bent over backwards to make this a positive statement (why though? performance seems excellent anyways).
Maybe a good part of the reason for having 2*FP16 on the GPU was for ML tasks which are now being diverted towards the new neural engine.

Update on this: new benchmark results are up, including A14 results (iPhone 12):

https://www.realworldtech.com/forum/?threadid=197759&curpostid=197985

To summarise: normalised per GPU core, A14 has exactly half the FP32 throughput of the M1, while their FP16 throughput is identical. My conclusion is that Apple indeed managed to improve the FP32 rate rather than dropping the FP16 rate. I also don't think that Apple GPU ever had 2*FP16, just 0.5*FP32...

P.S. (in case you have not guessed already, yes, I am the guy who wrote these benchmarks)
 
Update on this: new benchmark results are up, including A14 results (iPhone 12):

https://www.realworldtech.com/forum/?threadid=197759&curpostid=197985

To summarise: normalised per GPU core, A14 has exactly half the FP32 throughput of the M1, while their FP16 throughput is identical. My conclusion is that Apple indeed managed to improve the FP32 rate rather than dropping the FP16 rate. I also don't think that Apple GPU ever had 2*FP16, just 0.5*FP32...

P.S. (in case you have not guessed already, yes, I am the guy who wrote these benchmarks)
Can you run, e.g., 3DMark Wild Life for iOS on your M1 system? Or GFXBench 5 for iOS vs. for macOS (not M1 native)?

Edit:

https://www.notebookcheck.net/Apple...s-Intel-and-AMD.508057.0.html#toc-performance

Here are some Wild Life results; all M1 Macs outperform the iPad Air 2020 with the Apple A14.

FP16 vs. FP32 and so on: https://forum.beyond3d.com/threads/native-fp16-support-in-gpu-architectures.56180/

I also don't think that Apple GPU ever had 2*FP16, just 0.5*FP32...

IMG added "faster fp16 than fp32" capabilities with PowerVR Series6XT/Rogue GX (Apple A8 etc), https://www.tomshardware.com/reviews/apple-iphone-6s-6s-plus,4437-7.html
https://www.anandtech.com/show/7793/imaginations-powervr-rogue-architecture-exposed/3

Next edit:

I'm trying to understand your numbers, and it looks to me like an M1 GPU core is not the same as an Apple A14 GPU core. I read your numbers as saying that an A14 GPU core has 64 FP32 EUs (and 128 FP16 EUs).
 
Could you clarify what you mean by that? How are those two descriptions different, what is the baseline?

For full disclosure, I have no idea how these things can be implemented in hardware. It was pointed out (https://www.realworldtech.com/forum/?threadid=197759&curpostid=197993) that fusing two FP16 ALUs to perform a FP32 operation or splitting a single FP32 ALU to perform two FP16 operations per clock are functionally equivalent, something I was not aware of.

What I mean essentially is that Apple GPUs are exposed as having a SIMD width of 32 on both the A14 and M1 (the core is understood to have four SIMD units for a total width of 128). The SIMD width is a constant regardless of whether you are running FP16 or FP32 operations, and it supports all the regular bells and whistles like SIMD lane masking and various kinds of SIMD broadcast and reduction operations. Regardless of how it is implemented (many people here will probably have a much better idea than me), the end result is that the A14 (mobile chips) can run FP16 at full speed and FP32 at half the speed (whether they achieve it by using two FP16 ALUs or by using the same FP16 ALU over two cycles or something else is just an implementation detail IMO), while the M1 seems to have wider ALUs capable of doing FP32 FMA at full rate, without increasing its relative FP16 rate. I would further speculate that the M1 maybe does not have any special support for FP16 at all and simply does an FP32 computation while widening FP16 inputs and rounding the output back to FP16.

I think what my results primarily suggest is that Apple does not have separate FP16 and FP32 units like some other implementations, they reuse (parts of) the same ALUs to perform computations with various precision.
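
The exposed width is easy to confirm from the API side; here is a minimal Swift sketch, assuming you already have some compute kernel to build a pipeline from ("myKernel" is just a placeholder name):

Code:
import Metal

// The execution width reported here is 32 on both A14 and M1; the FP16 vs FP32
// rate difference discussed above is invisible at this level and only shows up
// in actual throughput measurements.
let device = MTLCreateSystemDefaultDevice()!
let library = device.makeDefaultLibrary()!
let function = library.makeFunction(name: "myKernel")!   // placeholder kernel name
let pipeline = try! device.makeComputePipelineState(function: function)
print(pipeline.threadExecutionWidth)                     // SIMD-group width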

Can you run, e.g., 3DMark Wild Life for iOS on your M1 system? Or GFXBench 5 for iOS vs. for macOS (not M1 native)?

https://www.notebookcheck.net/Apple...s-Intel-and-AMD.508057.0.html#toc-performance

There are a lot of results available for these benchmarks already. Is there something specific that you are looking for? I don't think it's surprising that the M1 Macs outperform the iPad; they have twice as many GPU cores...


If I interpret the Rogue diagrams correctly, they seem to suggest that IMG has separate FP32 and FP16 ALUs that can execute operations in parallel. My results for Apple GPUs suggest that they are much "simpler". I don't believe they have separate FP16 and FP32 ALUs at all, just some ALUs that can run either FP32, FP16 or integer operations.

I'm trying to understand your numbers, and it looks to me like an M1 GPU core is not the same as an Apple A14 GPU core. I read your numbers as saying that an A14 GPU core has 64 FP32 EUs (and 128 FP16 EUs).

My point is that an M1 GPU core is physically wider, as it can do 128 FP32 FMA operations OR 128 FP16 operations per clock. An A14 core can do 128 FP16 FMA operations per clock OR 64 FP32 FMA operations per clock. As I wrote above, I have no idea how this is actually implemented in hardware. Also, looking at die shots, the M1 GPU core is physically much larger than an A14 core.
 
...



There are a lot of results available for these benchmarks already. Is there something specific that you are looking for? I don't think it's surprising that the M1 Macs outperform the iPad; they have twice as many GPU cores...



...
Nothing specific; I have edited my posting several times and forgot to highlight my edit with the NBC link.
 
The microarchitecture of the GPUs in the M1 and A14 should be identical; they both belong to
Code:
MTLGPUFamilyApple7

It's interesting to note that Apple just added a DXR-like raytracing API to Metal (i.e. not a shader library like MPS); maybe Apple's own raytracing hardware is on the way. You can find the relevant WWDC talk here: https://developer.apple.com/videos/play/wwdc2020/10012/ , and it even comes with example code.
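
For reference, the runtime checks for all of this are just standard MTLDevice queries; a quick Swift sketch:

Code:
import Metal

// Both A14 and M1 report the same feature-set family; the other two queries
// cover the new raytracing API and the function-pointer features it builds on.
if let device = MTLCreateSystemDefaultDevice() {
    print(device.supportsFamily(.apple7))    // true on both A14 and M1
    print(device.supportsRaytracing)         // Metal ray tracing support
    print(device.supportsFunctionPointers)   // shader function pointers
}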
 
The microarchitecture of the GPUs in the M1 and A14 should be identical; they both belong to
Code:
MTLGPUFamilyApple7

Depends on what you mean by "microarchitecture". They have an identical feature set, yes, but there should be little doubt that their ALUs are physically different (different FP32 compute throughput, different size on die).

It's interesting to note that Apple just added a DXR-like raytracing API to Metal (i.e. not a shader library like MPS); maybe Apple's own raytracing hardware is on the way. You can find the relevant WWDC talk here: https://developer.apple.com/videos/play/wwdc2020/10012/ , and it even comes with example code.

What's even more interesting is that Metal RT works across all the GPUs, and related features such as shader function pointers, late function binding and recursive function calls can be used independently. I haven't tried out the API yet, but it looks quite nice to me, and it is certainly designed with hardware-accelerated RT in mind.
 
Depends on what you mean by "microarchitecture". They have an identical feature set, yes, but there should be little doubt that their ALUs are physically different (different FP32 compute throughput, different size on die).
Based on TechInsights' analysis (link), the M1 GPU core is double the size of an A14 GPU core.
 
Depends on what you mean by "microarchitecture". They have an identical feature set, yes, but there should be little doubt that their ALUs are physically different (different FP32 compute throughput, different size on die).
By microarchitecture I was referring to the design of each "core". And do we know for sure that the M1's single-precision throughput per core is better than the A14's?

What's even more interesting is that Metal RT works across all the GPUs, and related features such as shader function pointers, late function binding and recursive function calls can be used independently. I haven't tried out the API yet, but it looks quite nice to me, and it is certainly designed with hardware-accelerated RT in mind.
Indeed! IIRC for D3D the GPU hardware needs to be D3D12_FEATURE_LEVEL_12_2 compliant to support callable shaders. And I'm personally a fan of Metal Shading Language as well.
 
One thing I'm confused about: if Apple's (and Imagination's) TBDR GPUs are so awesomely power efficient, why isn't everyone doing it? What are the inherent limitations of a TBDR GPU?
 