Apple (PowerVR) TBDR GPU-architecture speculation thread

Discussion in 'Architecture and Products' started by Kaotik, Jul 7, 2020.

  1. rikrak

    Newcomer

    I am not a big fan of GFXBench as I don't know how representative it is, but the early results are impressive:

    https://gfxbench.com/compare.jsp?be...=NVIDIA+GeForce+GTX+1650+Ti+with+Max-Q+Design

    I'd say, even if M1 is "only" comparable to a MX450 in graphics, it's a great success for Apple.

    Based on the diagram in this tech talk (05:10), I would assume there are four 32-wide ALUs per GPU core.

    I'm fairly certain they are talking about FP32 performance.
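
    Back-of-the-envelope, assuming those four 32-wide ALUs per core, one FMA (2 FLOPs) per lane per clock, and a ~1.28 GHz clock (the clock is my assumption, not an Apple figure):

    Code:
    // Peak-FP32 estimate for an 8-core M1 GPU under the assumptions above.
    let cores = 8.0
    let simdsPerCore = 4.0            // four 32-wide ALUs per core (assumed)
    let lanesPerSimd = 32.0
    let flopsPerLanePerClock = 2.0    // one FMA counts as two FLOPs
    let clockGHz = 1.278              // assumed, not confirmed by Apple

    let gflops = cores * simdsPerCore * lanesPerSimd * flopsPerLanePerClock * clockGHz
    print("\(gflops) GFLOPS")         // ≈ 2617 GFLOPS, i.e. ~2.6 TFLOPS FP32

    That lands right at the ~2.6 TFLOPS figure that has been floating around for the M1.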
     
  2. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    GFXBench results can be useful indications if one knows how to avoid the pitfalls of some of the synthetic tests. Thanks for that one.
     
  3. Leovinus

    Newcomer

    Lightman, chris1515 and BRiT like this.
  4. Arnold Beckenbauer

    Veteran

    Leovinus, pharma and BRiT like this.
  5. ToTTenTranz

    Legend Veteran Subscriber

    What does Rosetta translate? Is it only x86 macOS applications?
    Could the macOS benchmarks/games be using a higher rate of FP16 shaders than the Windows ones, like the iOS and Android versions do?


    Also, the GPU is using the SoC's 16MB SLC just like Navi 21 uses Infinity Cache, right? AMD's slide on LLC hit rates in typical gaming scenarios points to 16MB getting around 25-30% hit rates at 1080p:





    What's the theoretical SLC bandwidth on the M1? Have any measurements been made?



    Isn't this a super weird wording to use?
    "FP32 relative to FP16 increased"... sounds like 2*FP16 throughput is just gone from the M1 GPU cores, but the guy just bent over backwards to make this a positive statement (why though? performance seems excellent anyways).
    Maybe a good part of the reason for having 2*FP16 on the GPU was for ML tasks which are now being diverted towards the new neural engine.
     
    milk and BRiT like this.
  6. rikrak

    Newcomer

    It translates x86-64 code to ARM64 code so that apps can run on the ARM CPU.

    It doesn't seem like there is much point in degrading shader precision on the M1, given that FP32 and FP16 have identical performance.

    I tend to interpret it as Apple's mobile GPUs having FP16 ALUs that are also able to compute FP32 operations at half the throughput, and the M1 now "upgrades" those to handle FP32 at the full rate. Kind of similar to what Nvidia did with Ampere (they upgraded the dedicated INT32 unit to FP32+INT32).
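
    A minimal sketch of how one could test this in Metal (my own illustration, not any published benchmark; kernel names, loop count and grid size are arbitrary choices): run the same FMA chain as float and as half and compare GPU times. Similar times would mean same-rate FP32/FP16.

    Code:
    import Metal

    // Two kernels, identical except for the scalar type.
    let source = """
    #include <metal_stdlib>
    using namespace metal;

    kernel void fma_float(device float *out [[buffer(0)]],
                          uint gid [[thread_position_in_grid]]) {
        float a = out[gid];
        for (int i = 0; i < 1024; ++i) { a = fma(a, 1.0001f, 0.0001f); }
        out[gid] = a;
    }

    kernel void fma_half(device half *out [[buffer(0)]],
                         uint gid [[thread_position_in_grid]]) {
        half a = out[gid];
        for (int i = 0; i < 1024; ++i) { a = fma(a, 1.01h, 0.01h); }
        out[gid] = a;
    }
    """

    let device = MTLCreateSystemDefaultDevice()!
    let queue = device.makeCommandQueue()!
    let library = try! device.makeLibrary(source: source, options: nil)

    func gpuSeconds(_ kernel: String, elementSize: Int) -> Double {
        let pso = try! device.makeComputePipelineState(
            function: library.makeFunction(name: kernel)!)
        let count = 1 << 20
        let buf = device.makeBuffer(length: count * elementSize,
                                    options: .storageModeShared)!
        let cmd = queue.makeCommandBuffer()!
        let enc = cmd.makeComputeCommandEncoder()!
        enc.setComputePipelineState(pso)
        enc.setBuffer(buf, offset: 0, index: 0)
        enc.dispatchThreads(MTLSize(width: count, height: 1, depth: 1),
                            threadsPerThreadgroup: MTLSize(width: 256, height: 1, depth: 1))
        enc.endEncoding()
        cmd.commit()
        cmd.waitUntilCompleted()
        return cmd.gpuEndTime - cmd.gpuStartTime   // GPU-side duration in seconds
    }

    print("float:", gpuSeconds("fma_float", elementSize: 4))
    print("half: ", gpuSeconds("fma_half", elementSize: 2))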
     
  7. ToTTenTranz

    Legend Veteran Subscriber

    FP16 shaders, if handled natively by the ALUs (i.e. not promoted to FP32), occupy a smaller footprint in the caches. This means lower power consumed on memory transactions and a higher number of operations served from within the caches, which increases effective bandwidth and performance.
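
    As an illustration of the footprint argument (Swift's Float16 mirrors MSL's half, so the sizes match what the GPU caches would see; treat this as an Apple Silicon sketch, since Float16 availability on Intel Macs is limited):

    Code:
    // The same color payload at FP32 vs FP16 storage.
    struct ColorFP32 { var r, g, b, a: Float }    // 16 bytes per value
    struct ColorFP16 { var r, g, b, a: Float16 }  // 8 bytes: half the cache footprint
    print(MemoryLayout<ColorFP32>.size,           // 16
          MemoryLayout<ColorFP16>.size)           // 8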

    For example, Polaris / GCN4 introduced native FP16 instructions, for which AMD already claimed some improvements in performance and power efficiency, even though the architecture doesn't do RPM.




    For a GPU with low bandwidth to system RAM (~68GB/s) but with access to a large LLC like the M1's, using a larger proportion of FP16 (as is often the case with Android/iOS games and benchmarks) could make a considerable difference.
     
    T2098 likes this.
  8. rikrak

    Newcomer

    I completely agree with you that there are sizable benefits to having fast FP16 operations even on desktop GPUs. This is definitely not something I am debating. I am just pointing out that the M1 appears to have identical throughput for FP32 and FP16 operations, which I hypothesize is due to Apple "upgrading" the FP32 execution, not downgrading the FP16. Again, I think we both agree that one would have preferred double-rate FP16 on top of the current FP32 throughput, but I guess that is not what Apple was ready to do in this iteration.

    The key to testing this hypothesis is to check the FP16 and FP32 throughput of the A14 GPU in comparison to the M1.

    To continue your line of thought, there is a difference between data storage and data processing. Using lower-precision data for storage where appropriate is an old and very common optimization technique (a trivial example is using byte-sized integers to encode color channels, or packed FP encodings with a shared exponent for HDR content). As GPUs can convert between data representations for "free", all GPUs can benefit from using reduced precision for data storage, regardless of whether they have special hardware to process reduced-precision data faster.
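
    To make the storage-side point concrete, here's what those encodings look like as Metal texture formats (standard MTLPixelFormat cases; the per-texel sizes are the point, the dimensions are arbitrary):

    Code:
    import Metal

    // Byte-sized integer color channels: 4 bytes per texel.
    let ldr = MTLTextureDescriptor.texture2DDescriptor(
        pixelFormat: .rgba8Unorm, width: 1024, height: 1024, mipmapped: false)

    // Packed FP with a shared 5-bit exponent (RGB9E5) for HDR: still 4 bytes per texel.
    let hdrPacked = MTLTextureDescriptor.texture2DDescriptor(
        pixelFormat: .rgb9e5Float, width: 1024, height: 1024, mipmapped: false)

    // Full FP32 storage for comparison: 16 bytes per texel, 4x the bandwidth per fetch.
    let fp32 = MTLTextureDescriptor.texture2DDescriptor(
        pixelFormat: .rgba32Float, width: 1024, height: 1024, mipmapped: false)

    The sampler converts all of these to float on read, which is the "free" conversion mentioned above.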
     
  9. Xmas

    Xmas Porous
    Veteran Subscriber

    Yes, storing data as FP16 in memory is commonly done even on GPUs that have no FP16 ALUs. Having same-rate FP16 operations doesn't make individual operations faster, but it reduces register footprint, which can be a very significant (yet hard to quantify) benefit.
     
  10. rikrak

    Newcomer

    Update on this: new benchmark results are up, including A14 results (iPhone 12):

    https://www.realworldtech.com/forum/?threadid=197759&curpostid=197985

    To summarise: normalised per GPU core, the A14 has exactly half the FP32 throughput of the M1, while their FP16 throughput is identical. My conclusion is that Apple indeed managed to improve the FP32 rate rather than dropping the FP16 rate. I also don't think that the Apple GPU ever had 2*FP16, just 0.5*FP32...
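
    In numbers, the per-core normalisation behind that summary (taking the per-clock FMA figures from the linked results; the clock is an assumed placeholder, only the ratio matters):

    Code:
    let clockGHz = 1.278                 // assumed for both chips
    let m1FP32  = 128.0 * 2 * clockGHz   // 128 FP32 FMAs/clock/core ≈ 327 GFLOPS
    let a14FP32 =  64.0 * 2 * clockGHz   //  64 FP32 FMAs/clock/core ≈ 164 GFLOPS
    let fp16    = 128.0 * 2 * clockGHz   // identical FP16 rate on both chips
    print(m1FP32 / a14FP32)              // 2.0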

    P.S. (in case you have not guessed already, yes, I am the guy who wrote these benchmarks)
     
  11. Xmas

    Xmas Porous
    Veteran Subscriber

    Could you clarify what you mean by that? How are those two descriptions different, and what is the baseline?
     
    milk likes this.
  12. Arnold Beckenbauer

    Veteran

    Can you run, e.g., 3DMark Wild Life for iOS on your M1 system? Or GFXBench 5 for iOS vs. for macOS (not M1 native)?

    -----Edit

    https://www.notebookcheck.net/Apple...s-Intel-and-AMD.508057.0.html#toc-performance

    Here are some Wild Life results; all the M1 Macs outperform the iPad Air 2020 with the Apple A14.

    FP16 vs. FP32 and so on: https://forum.beyond3d.com/threads/native-fp16-support-in-gpu-architectures.56180/

    IMG added "faster FP16 than FP32" capabilities with PowerVR Series6XT/Rogue GX (Apple A8 etc.): https://www.tomshardware.com/reviews/apple-iphone-6s-6s-plus,4437-7.html
    https://www.anandtech.com/show/7793/imaginations-powervr-rogue-architecture-exposed/3

    Next edit:

    I'm trying to understand your numbers, and it looks to me like an M1 GPU core is not the same as an Apple A14 GPU core. I read your numbers as saying that an A14 GPU core has 64 FP32 EUs (and 128 FP16 EUs).
     
  13. rikrak

    Newcomer

    For full disclosure, I have no idea how these things are implemented in hardware. It was pointed out (https://www.realworldtech.com/forum/?threadid=197759&curpostid=197993) that fusing two FP16 ALUs to perform an FP32 operation and splitting a single FP32 ALU to perform two FP16 operations per clock are functionally equivalent, something I was not aware of.

    What I mean, essentially, is that Apple GPUs are exposed as having a SIMD width of 32 on both the A14 and M1 (the core is understood to have four SIMD units for a total width of 128). The SIMD width is a constant regardless of whether you are running FP16 or FP32 operations, and it supports all the regular bells and whistles like SIMD lane masking and various kinds of SIMD broadcast and reduction operations. Regardless of how it is implemented (many people here will probably have a much better idea than me), the end result is that the A14 (mobile chips) can run FP16 at full speed and FP32 at half speed (whether they achieve it by using two FP16 ALUs, or by using the same FP16 ALU over two cycles, or something else is just an implementation detail IMO), while the M1 seems to have wider ALUs capable of doing FP32 FMA at full rate, without increasing its relative FP16 rate. I would further speculate that the M1 maybe does not have any special support for FP16 at all and simply does an FP32 computation while widening FP16 inputs and rounding the output back to FP16.
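
    For reference, those "bells and whistles" look like this in the Metal Shading Language (an illustrative kernel of my own, wrapped as a library source string; the names are arbitrary):

    Code:
    let simdDemo = """
    #include <metal_stdlib>
    using namespace metal;

    kernel void simd_demo(device float *data [[buffer(0)]],
                          uint gid    [[thread_position_in_grid]],
                          ushort lane [[thread_index_in_simdgroup]]) {
        float x = data[gid];
        float total = simd_sum(x);            // reduction across the 32-wide SIMD-group
        float lane0 = simd_broadcast(x, 0);   // broadcast from lane 0 to all lanes
        float left  = simd_shuffle(x, lane > 0 ? lane - 1 : 0);  // lane shuffle
        if (lane == 0) {                      // divergence handled via lane masking
            data[gid] = total + lane0 + left;
        }
    }
    """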

    I think what my results primarily suggest is that Apple does not have separate FP16 and FP32 units like some other implementations; they reuse (parts of) the same ALUs to perform computations at various precisions.

    There are a lot of results available for these benchmarks already. Is there something specific you are looking for? I don't think it's surprising that the M1 Macs outperform the iPad; they have twice as many GPU cores...

    If I interpret the Rogue diagrams correctly, they seem to suggest that IMG has separate FP32 and FP16 pipelines that can execute operations in parallel. My results for Apple GPUs suggest that they are much "simpler": I don't believe they have separate FP16 and FP32 ALUs at all, just ALUs that can run either FP32, FP16 or integer operations.

    My point is that the M1 GPU core is physically wider, as it can do 128 FP32 FMA operations OR 128 FP16 operations per clock. An A14 core can do 128 FP16 FMA operations per clock OR 64 FP32 FMA operations per clock. As I wrote above, I have no idea how this is actually implemented in hardware. Also, looking at die shots, the M1 GPU core is much larger physically than an A14 core.
     
    Arnold Beckenbauer likes this.
  14. Arnold Beckenbauer

    Veteran

    Not necessary; I have edited my post several times and forgot to highlight the edit with the NBC link.
     
  15. Kugai Calo

    Newcomer

    The microarchitecture of the GPUs in the M1 and A14 should be identical; they both belong to
    Code:
    MTLGPUFamilyApple7
    It's interesting to note that Apple just added a DXR-like raytracing API to Metal (i.e. not a shader library like MPS); maybe Apple's own raytracing hardware is on the way. You can find the relevant WWDC talk here: https://developer.apple.com/videos/play/wwdc2020/10012/ ; it even comes with example code.
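
    On the API side, checking for the family and the new raytracing support is a one-liner each (standard MTLDevice queries; what they return obviously depends on the machine):

    Code:
    import Metal

    let device = MTLCreateSystemDefaultDevice()!
    print(device.supportsFamily(.apple7))  // true on both A14 and M1
    print(device.supportsRaytracing)       // the Metal RT API mentioned above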
     
  16. rikrak

    Newcomer

    Depends on what you understand by "microarchitecture". They have an identical feature set, yes, but there should be little doubt that their ALUs are physically different (different FP32 compute throughput, different size on die).

    What's even more interesting is that Metal RT works across all the GPUs, and related features such as shader function pointers, late function binding and recursive function calls can be used independently. I haven't tried out the API yet, but it looks quite nice to me, and it is certainly designed with hardware-accelerated RT in mind.
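
    The "can be used independently" part is visible in the API: each feature has its own device query (standard MTLDevice properties; support varies by GPU and OS version):

    Code:
    import Metal

    let device = MTLCreateSystemDefaultDevice()!
    print(device.supportsRaytracing)        // Metal RT
    print(device.supportsFunctionPointers)  // shader function pointers
    print(device.supportsDynamicLibraries)  // late function binding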
     
  17. mfaisalkemal

    Newcomer

    Based on TechInsights' analysis (link), the M1 GPU core is double the size of the A14's GPU core.
     
  18. rikrak

    Newcomer

    Now I am very confused. If the GPU cores are basically identical, how has Apple managed to double the FP32 rate on the M1 relative to the A14? Is this an artificial limitation on the A14?
     
  19. Kugai Calo

    Newcomer

    By microarchitecture I was referring to the design of each "core". And do we know for sure that the M1's single-precision throughput per core is better than the A14's?

    Indeed! IIRC for D3D the GPU hardware needs to be D3D12_FEATURE_LEVEL_12_2 compliant to support callable shaders. And I'm personally a fan of the Metal Shading Language as well.
     
  20. Kugai Calo

    Newcomer

    One thing I'm confused about: if Apple's (and Imagination's) TBDR GPUs are so awesomely power efficient, why isn't everyone doing it? What are the inherent limitations of a TBDR GPU?
     