Apple (PowerVR) TBDR GPU-architecture speculation thread

Discussion in 'Architecture and Products' started by Kaotik, Jul 7, 2020.

  1. Entropy

    Entropy Veteran

    We are fortunate on these forums that some people with deep knowledge and insight sometimes post here. It makes sense to listen to what they say.
     
  2. Ailuros

    Ailuros Epsilon plus three Legend Subscriber

    We shouldn't forget that all mobile ULP SoC GPUs cut corners one way or another in order to be as power efficient as possible. NVIDIA itself was cutting several corners with its initial Tegra GPUs until they addressed the high end automotive market with a Kepler derivative.
     
    Last edited: Nov 5, 2020
  3. rikrak

    rikrak Newcomer

    Very true, this also makes educated comparisons tricky.

    For Apple specifically, there is the lack of hardware double precision (something I personally find reasonable) and the fact that they seem to aggressively lower precision to FP16 in the shaders.
     
  4. Ailuros

    Ailuros Epsilon plus three Legend Subscriber

    Apple's GPU ALUs don't lack FP32, irrespective of what they optimize for; no idea if FP64 is possible, but if it is, it should be very slow (there was optional support for FP64 for Series7XT but I doubt Apple ever opted for it). Apple (like anyone else in their shoes) has its own way of doing things and they were never really interested in DX11, tessellation or anything of the kind. For smartphones/tablets they would rather invest the hw overhead needed for DX11 in higher performance.

    The reasons why they skipped double precision and/or improved rounding support, amongst others, up to some stage are in this blog post from Kristof: https://www.imgtec.com/blog/powervr-gpu-the-mobile-architecture-for-compute/

    Now that they seem to be moving to their own GPUs in their Macs, they shouldn't IMHO cut corners there as well. Higher end designs are obviously completely different animals.
     
    Last edited: Nov 6, 2020
  5. rikrak

    rikrak Newcomer

    I don't expect them to support FP64 any time soon, if ever, and personally, I think that the decision to skip it makes sense. Software emulation of double precision for whoever needs it is much faster than the gimped FP64 hardware units we have on modern consumer hardware, and it makes sense to invest silicon area where it matters more.

    As to all other things, I completely agree. A14 in particular now implements some features of desktop GPUs that Apple was lacking (e.g. SIMD reduction operations and hardware barycentric coordinates) while improving the efficiency of async compute and GPU-driven rendering.
     
  6. Pressure

    Pressure Veteran

    I’m definitely watching their “One more thing” event tomorrow. That should give us some answers.
     
  7. Leovinus

    Leovinus Newcomer

    Short recap of the important bullet points verbatim as they were presented:
    • Unified memory architecture
      • High bandwidth, low latency
      • Apple-designed package
      • Accessible to entire SoC
    • CPU
      • 4 high-performance cores
        • Ultra-wide execution architecture
        • 192KB instruction cache
        • 128KB data cache
        • Shared 12MB L2 cache
      • 4 high-efficiency cores
        • Wide execution architecture
        • 128KB instruction cache
        • 64KB data cache
        • Shared 4MB L2 cache
    • GPU
      • Up to 8 cores
        • 128 execution units
        • Up to 24,576 concurrent threads
        • 2.6 teraflops
        • 82 gigatexels/second
        • 41 gigapixels/second
      • Claimed 2x performance at 10W vs. "latest PC laptop chip"
      • Claimed 1/3 power draw at the indicated max performance of "latest PC laptop chip"
    • M1 claims
      • High efficiency CPU cores
      • High-Performance CPU cores
      • Secure Enclave
      • Low power video playback
      • Neural Engine
      • Advanced display engine
      • High-performance GPU
      • HDR video processor
      • HDR imaging
      • Gen 4 PCI Express
      • High-performance video editing
      • Always-on processor
      • Performance controller
      • Thunderbolt/USB 4 controller
      • High quality image signal processor
      • Low-power design
      • High-performance NVMe storage
      • High-efficiency audio processor
      • Advanced silicon packaging
     
    Pete, Lightman and BRiT like this.
  8. Arnold Beckenbauer

    Arnold Beckenbauer Veteran Subscriber

    So, it's some kind of Apple's IMG Series A? 128 ECUs and 8* TMUs per core (cluster) and up to 1.3 GHz clock?

    * or 4 TMUs with doubled clock?
     
  9. Kaotik

    Kaotik Drunk Member Legend

    Apparently AnandTech has been able to confirm it's really LPDDR4X memory behind a 128-bit bus, not HBM as the Finnish Apple pages said.
     
    BRiT likes this.
  10. Ailuros

    Ailuros Epsilon plus three Legend Subscriber

    I'd like to stand corrected, but TMUs with double the frequency should be (by today's standards at least) suicidal for power consumption. It sounds like Albiorix as you say but we don't have enough information yet.

    If my layman's backwards math doesn't contain any serious brainfart, the frequency should be somewhere in the >=1.275GHz region.
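    For anyone who wants to redo that backwards math, here is a minimal sketch. It assumes 128 FP32 ALUs and 8 texels per clock per core (the guesses floated earlier in the thread, not confirmed figures) and counts one FMA as 2 FLOPs. The function and constant names are made up for illustration.

```python
# Back-of-the-envelope clock estimate from the keynote figures.
# Everything below the core count is an assumption taken from the
# thread's guesses, not a confirmed hardware spec.
CORES = 8                # "up to 8 cores" from the keynote recap
ALUS_PER_CORE = 128      # assumption: FP32 lanes per core
FLOPS_PER_ALU_CLK = 2    # assumption: one FMA counted as 2 FLOPs
TEXELS_PER_CORE_CLK = 8  # assumption: 8 TMUs per core at base clock


def clock_from_flops(tflops: float) -> float:
    """Clock in GHz implied by a peak TFLOPS figure."""
    flops_per_clk = CORES * ALUS_PER_CORE * FLOPS_PER_ALU_CLK
    return tflops * 1e12 / flops_per_clk / 1e9


def clock_from_texels(gtexels: float) -> float:
    """Clock in GHz implied by a peak gigatexels/second figure."""
    texels_per_clk = CORES * TEXELS_PER_CORE_CLK
    return gtexels * 1e9 / texels_per_clk / 1e9


if __name__ == "__main__":
    print(f"{clock_from_flops(2.6):.3f} GHz implied by 2.6 TFLOPS")
    print(f"{clock_from_texels(82):.3f} GHz implied by 82 Gtexels/s")
```

    Under those assumptions both figures land between roughly 1.27 and 1.28 GHz, which is where the estimate above sits. Different ALU or TMU counts would shift the result proportionally.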
     
    Last edited: Nov 11, 2020
  11. P_EQUALS_NP

    P_EQUALS_NP Newcomer

    Even if Apple's claimed 2.6 teraflops uses half floats, 1.3 teraflops of single floats @ 10 watts is pretty amazing!
     
  12. Leovinus

    Leovinus Newcomer

    Possibly a bit off topic, but I feel tangential to the technical and performance discussion:

    Over at eGPU.io there is plenty of discussion regarding whether or not Apple will discontinue the use of AMD GPUs altogether to focus on their own silicon. That is one possible explanation for why AMD drivers haven't yet been seen in the universal binaries of the Big Sur RC for ARM. Of course, a potential Pro 16" and iMac ARM lineup might change this for all we know.

    Either way it seems Apple could go the way of Intel and expand and refine these iGPU solutions into very capable products that entirely replace AMD, if developer and customer interest in their offerings takes hold (an entirely parallel, non-static hardware and software ecosystem seems a burden for all involved, frankly). Still, I think it raises an enormous number of questions. Will we see Apple-branded discrete GPUs for modular systems like the Mac Pro? How will different SKUs be handled in other products? Fixed-performance SKUs with novel SoCs (and presumably binned versions for lower tiers)? Multi-chip modules for bigger SoCs to increase yields? Discrete chips for high-end systems? And I'm not even mentioning the developer aspects. My brain feels about as blocked as when you think about the endlessness of space when trying to fathom how Apple will move forward here.
     
  13. ^M^

    ^M^ Newcomer

    Apple speaks of a unified memory architecture between CPU and iGPU.
    Is it new for them? I remember AMD trying something like this a while back; did it go anywhere?
     
  14. P_EQUALS_NP

    P_EQUALS_NP Newcomer

    It's new for them, and interestingly enough this may be the first implementation of unified memory in the consumer space, since Intel and AMD APUs segment their GPU/CPU memory, with all the access restrictions imposed by Windows APIs that *mostly* forbid sharing data between CPU and GPU.
     
  15. pTmdfx

    pTmdfx Regular

    It isn’t new to Apple, only new to the Mac.

    Metal has long documented that iOS and tvOS operate in the unified memory model (in Metal’s definition), while Mac (aka Intel) IGPs are made to present a discrete memory model.

    https://developer.apple.com/library.../doc/uid/TP40016642-CH17-DontLinkElementID_19
     
    Last edited: Nov 11, 2020
    BRiT likes this.
  16. mfaisalkemal

    mfaisalkemal Newcomer

    No, that's for peak performance of the M1 @ 14.3W. The MacBook Air @ 10W is around 1.13TF FP32. A MacBook Air running a shader code mix of 70% FP16 and 30% FP32 will achieve nearly original PS4 shader performance (1.742TF vs 1.843TF).
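    One way to arrive at a blended figure like that is sketched below. It assumes FP16 runs at twice the FP32 rate and that the two shares are blended by execution time (a harmonic mean). Both are guesses at the reasoning, not something stated in the post, and `effective_tflops` is a hypothetical helper.

```python
def effective_tflops(fp32_tflops: float, fp16_share: float,
                     fp16_speedup: float = 2.0) -> float:
    """Blend FP16 and FP32 throughput by time spent in each precision.

    fp16_share is the fraction of operations issued as FP16 (0..1).
    fp16_speedup is an ASSUMED ratio of FP16 to FP32 rate; 2.0 means
    double-rate half precision, which the thread does not confirm.
    """
    if not 0.0 <= fp16_share <= 1.0:
        raise ValueError("fp16_share must be between 0 and 1")
    fp32_share = 1.0 - fp16_share
    # Time to execute one unit of work, split across both precisions.
    time_per_op = (fp16_share / (fp32_tflops * fp16_speedup)
                   + fp32_share / fp32_tflops)
    return 1.0 / time_per_op


if __name__ == "__main__":
    # 1.13 TF FP32 at 10W with a 70/30 FP16/FP32 mix, as in the post.
    print(f"{effective_tflops(1.13, 0.7):.3f} TF effective")
```

    With 1.13TF FP32 and a 70/30 mix this comes out at roughly 1.74TF, close to the 1.742TF quoted. If FP16 is not actually double rate on this hardware, the blended number falls back towards the FP32 figure.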
     
    Lightman and BRiT like this.
  17. Arnold Beckenbauer

    Arnold Beckenbauer Veteran Subscriber

    How can you then explain 82 Gtexels/s?

    The M1 is not a mobile SoC like A14, so FP16 performance is less relevant compared to FP32.
     
  18. Pete, Lightman, Entropy and 2 others like this.
  19. Leovinus

    Leovinus Newcomer

    Apple has now posted more in-depth information on what they compared performance with; scroll to the bottom of the page for footnotes. Though I'm not entirely sure it's detailed enough to give more than a slightly less murky view of things.

    The ones I interpret as making mention of the GPU are the following:
    • Testing conducted by Apple in October 2020 using preproduction MacBook Air systems with Apple M1 chip and 8-core GPU, as well as production 1.2GHz quad-core Intel Core i7-based MacBook Air systems, all configured with 16GB RAM and 2TB SSD. Tested with prerelease Final Cut Pro 10.5 using a 55-second clip with 4K Apple ProRes RAW media, at 4096x2160 resolution and 59.94 frames per second, transcoded to Apple ProRes 422. Performance tests are conducted using specific computer systems and reflect the approximate performance of MacBook Air.
    • Testing conducted by Apple in October 2020 using preproduction 13‑inch MacBook Pro systems with Apple M1 chip and 16GB of RAM using select industry-standard benchmarks. Comparison made against the highest-performing integrated GPUs for notebooks and desktops commercially available at the time of testing. Integrated GPU is defined as a GPU located on a monolithic silicon die along with a CPU and memory controller, behind a unified memory subsystem. Performance tests are conducted using specific computer systems and reflect the approximate performance of MacBook Pro.
    • Testing conducted by Apple in October 2020 using preproduction Mac mini systems with Apple M1 chip, and production 3.6GHz quad-core Intel Core i3-based Mac mini systems with Intel Iris UHD Graphics 630, all configured with 16GB of RAM and 2TB SSD. Tested with prerelease Final Cut Pro 10.5 using a complex 2-minute project with a variety of media up to 4K resolution. Performance tests are conducted using specific computer systems and reflect the approximate performance of Mac mini.
     
    Lightman likes this.
  20. Arnold Beckenbauer

    Arnold Beckenbauer Veteran Subscriber

    Those who can read have a clear advantage:
    https://www.anandtech.com/show/15156/imagination-announces-a-series-gpu-architecture/3

    One M1 "GPU core" should be 8-256-4: 8 texels per clock, 256 FP FLOPs per clock and 4 pixels per clock. For the whole 8-core GPU that's something like 64-2048-32. Does it make sense?

    plus:
    https://www.anandtech.com/show/13661/the-2018-apple-ipad-pro-11-inch-review/6
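
    A quick sanity check of that per-core breakdown, as a sketch only: it scales the 8-256-4 tuple to 8 cores and turns it into peak rates at a clock of 1.28 GHz. The clock is an assumption based on the estimates earlier in the thread, and the type and function names are illustrative.

```python
from typing import NamedTuple


class PerClock(NamedTuple):
    """Per-clock throughput: texels, FP32 FLOPs and pixels."""
    texels: int
    flops: int
    pixels: int


def scale(core: PerClock, cores: int) -> PerClock:
    """Multiply a per-core configuration up to the whole GPU."""
    return PerClock(*(value * cores for value in core))


def peak_rates(chip: PerClock, clock_ghz: float) -> dict:
    """Peak rates at a given clock (Gtexels/s, TFLOPS, Gpixels/s)."""
    return {
        "gtexels_s": chip.texels * clock_ghz,
        "tflops": chip.flops * clock_ghz / 1000.0,
        "gpixels_s": chip.pixels * clock_ghz,
    }


if __name__ == "__main__":
    chip = scale(PerClock(texels=8, flops=256, pixels=4), cores=8)
    print(chip)
    # 1.28 GHz is an assumed clock, not an official figure.
    print(peak_rates(chip, clock_ghz=1.28))
```

    At that assumed clock the 64-2048-32 configuration gives about 82 Gtexels/s and about 2.6 TFLOPS, matching the keynote numbers, with a pixel rate of half the texel rate.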

     
    Pete likes this.