Apple (PowerVR) TBDR GPU-architecture speculation thread

Entropy · Nov 5, 2020

Ailuros said:
A few years ago the suggestion circulated that TBDRs suck with DX11 tessellation. One of the reasons was that anything relevant behaved more than badly on the SGX GPU IP in the Sony Vita handheld. With the GPU being DX9L3 I wouldn't expect any tessellation to present anything better than seconds per frame on a GPU like that.

Now the discussion here is moving back and forth for Apple's SoC GPUs, which are based on a PowerVR Series7 Plus GPU which, despite the tessellation unit present or anything else, is not more than DX10.x compliant (if memory serves well it might only be up to DX10.0) due to high precision values that they skipped in the ALUs, and here we are discussing whether DX12 feature A or B works well or not on a TBDR. No idea where Albiorix A or B lies in capabilities, but I wouldn't be surprised if their baseline starts with DX11.0 compliance, and we don't have a single integrated unit from the A-Series yet.

https://forum.beyond3d.com/posts/2168556/

Considering he's been leading the GPU designs at IMG for at least the past two decades, I suggest he knows what he's talking about.

We are fortunate on these forums that some people with deep knowledge and insight sometimes post here. It makes sense to listen to what they say.

Ailuros · Nov 5, 2020

Entropy said:
We are fortunate on these forums that some people with deep knowledge and insight sometimes post here. It makes sense to listen to what they say.

We shouldn't forget that all mobile ULP SoC GPUs cut corners one way or another in order to be as power efficient as possible. NVIDIA itself was cutting several corners with its initial Tegra GPUs until they addressed the high end automotive market with a Kepler derivative.

rikrak · Nov 5, 2020

Ailuros said:
We shouldn't forget that all mobile ULP SoC GPUs cut corners one way or another in order to be as power efficient as possible. NVIDIA itself was cutting several corners with its initial Tegra GPUs until they addressed the high end automotive market with a Kepler derivative.

Very true, this also makes educated comparisons tricky.

For Apple specifically, lack of hardware double precision (something I find personally reasonable) and the fact that they seem to aggressively lower precision to FP16 in the shaders.

Ailuros · Nov 5, 2020

rikrak said:
For Apple specifically, lack of hardware double precision (something I find personally reasonable) and the fact that they seem to aggressively lower precision to FP16 in the shaders.

Apple's GPU ALUs don't lack FP32, irrespective of what they optimize for; no idea if FP64 is possible, but if it is, it should be very slow (there was optional support for FP64 in Series7XT but I doubt Apple ever opted for it). Apple (like anyone else in their shoes) has its own way of doing things and they were never really interested in DX11, tessellation or anything of the kind. For smartphones/tablets they rather invested the hw overhead needed for DX11 in higher performance.

The reasons why they skipped double precision and/or improved rounding support, amongst others, up to some stage are in this blog post from Kristof: https://www.imgtec.com/blog/powervr-gpu-the-mobile-architecture-for-compute/

Now that they seem to move with their own GPUs with their Macs they shouldn't IMHO cut corners also there. Higher end designs are obviously completely different animals.

rikrak · Nov 9, 2020

Ailuros said:
Now that they seem to move with their own GPUs with their Macs they shouldn't IMHO cut corners also there. Higher end designs are obviously completely different animals.

I don't expect them to support FP64 any time soon — if ever — and personally, I think that the decision to skip it makes sense. Software emulation of double precision for whoever needs it is much faster than the gimped FP64 hardware units we have on modern consumer hardware, and it makes sense to invest silicon space to where it matters more.

As to all other things, I completely agree. A14 in particular now implements some features of desktop GPUs that Apple was lacking (e.g. SIMD reduction operations and hardware barycentric coordinates) while improving the efficiency of async compute and GPU-driven rendering.

Pressure · Nov 9, 2020

I’m definitely watching their “One more thing” event tomorrow. That should give us some answers.

Leovinus · Nov 11, 2020

Short recap of the important bullet points, verbatim as they were presented:

Unified memory architecture
- High bandwidth, low latency
- Apple-designed package
- Accessible to entire SoC
CPU
- 4 high-performance cores
  - Ultra-wide execution architecture
  - 192KB instruction cache
  - 128KB data cache
  - Shared 12MB L2 cache
- 4 high-efficiency cores
  - Wide execution architecture
  - 128KB instruction cache
  - 64KB data cache
  - Shared 4MB L2 cache
GPU
- Up to 8 cores
  - 128 execution units
  - Up to 24,576 concurrent threads
  - 2.6 teraflops
  - 82 gigatexels/second
  - 41 gigapixels/second
- Claimed 2x performance at 10W vs. "latest PC laptop chip"
- Claimed 1/3 power draw at the indicated max performance of "latest PC laptop chip"
M1 claims
- High efficiency CPU cores
- High-Performance CPU cores
- Secure Enclave
- Low power video playback
- Neural Engine
- Advanced display engine
- High-performance GPU
- HDR video processor
- HDR imaging
- Gen 4 PCI Express
- High-performance video editing
- Always-on processor
- Performance controller
- Thunderbolt/USB 4 controller
- High quality image signal processor
- Low-power design
- High-performance NVMe storage
- High-efficiency audio processor
- Advanced silicon packaging

Arnold Beckenbauer · Nov 11, 2020

So, it's some kind of Apple's IMG Series A? 128 ECUs and 8* TMUs per core (cluster) and up to 1.3 GHz clock?

* or 4 TMUs with doubled clock?

Kaotik · Nov 11, 2020

Apparently AnandTech has been able to confirm it's really LPDDR4X memory behind a 128-bit bus, not HBM as the Finnish Apple pages said.

Ailuros · Nov 11, 2020

Arnold Beckenbauer said:
So, it's some kind of Apple's IMG Series A? 128 ECUs and 8* TMUs per core (cluster) and up to 1.3 GHz clock?

* or 4 TMUs with doubled clock?

I'd like to stand corrected, but TMUs with double the frequency should be (by today's standards at least) suicidal for power consumption. It sounds like Albiorix as you say, but we don't have enough information yet.

If my layman's backwards math doesn't contain any serious brainfart, the frequency should be somewhere in the ≥1.275 GHz region.
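For anyone who wants to redo that backwards math, here is a rough sketch of one way to get there from the keynote numbers. It is not necessarily the calculation used above. It assumes each execution unit retires one FMA (counted as 2 FLOPs) per clock and that each core has 8 TMUs as suggested earlier in the thread; neither assumption is confirmed by Apple.

```python
# Figures quoted in the keynote recap earlier in the thread.
CORES = 8
EUS_PER_CORE = 128
PEAK_FLOPS = 2.6e12      # "2.6 teraflops"
PEAK_TEXELS = 82e9       # "82 gigatexels/second"

# Assumptions (not stated by Apple):
FLOPS_PER_EU_PER_CLOCK = 2   # one fused multiply-add counted as 2 FLOPs
TMUS_PER_CORE = 8            # 8 bilinear texels per clock per core


def implied_clock_ghz(peak_rate, units_per_core, ops_per_unit_per_clock=1, cores=CORES):
    """Clock (GHz) needed to reach peak_rate with the given unit counts."""
    ops_per_clock = cores * units_per_core * ops_per_unit_per_clock
    return peak_rate / ops_per_clock / 1e9


clock_from_flops = implied_clock_ghz(PEAK_FLOPS, EUS_PER_CORE, FLOPS_PER_EU_PER_CLOCK)
clock_from_texels = implied_clock_ghz(PEAK_TEXELS, TMUS_PER_CORE)

print(f"clock implied by FLOPS:  {clock_from_flops:.4f} GHz")
print(f"clock implied by texels: {clock_from_texels:.4f} GHz")
```

Under these assumptions both results land between roughly 1.27 and 1.28 GHz. The small gap between them is what you would expect if the marketing figures were rounded.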

P_EQUALS_NP · Nov 11, 2020

Even if Apple's claimed 2.6 teraflops uses half floats, 1.3 teraflops of single floats @ 10 watts is pretty amazing!

Leovinus · Nov 11, 2020

Possibly a bit off topic, but I feel it's tangential to the technical and performance discussion:

Over at eGPU.io there is plenty of discussion regarding whether or not Apple will discontinue the use of AMD GPUs altogether to focus on their own silicon. That would be one possible reason why AMD drivers haven't yet been seen in the universal binaries of the Big Sur RC for ARM. Of course, a potential Pro 16" and iMac ARM lineup might change this for all we know.

Either way it seems Apple could go the way of Intel and expand and refine its iGPU solutions into very capable products that entirely replace AMD, if developer and customer interest in their offerings takes hold (an entirely parallel, non-static hardware and software ecosystem seems a burden for all involved, frankly). Still, I think it raises an enormous number of questions. Will we see Apple-branded discrete GPUs for modular systems like the Mac Pro? How will different SKUs be handled in other products? Fixed-performance SKUs with novel SoCs (and presumably binned versions for lower tiers)? Multi-chip modules for bigger SoCs to increase yields? Discrete chips for high-end systems? And I'm not even mentioning the developer aspects. My brain feels about as blocked as when you think about the endlessness of space when trying to fathom how Apple will move forward here.

^M^ · Nov 11, 2020

Apple speaks of a unified memory architecture between CPU and iGPU.
Is it new for them? I remember AMD trying something like this a while back, did it go anywhere?

P_EQUALS_NP · Nov 11, 2020

^M^ said:
Apple speaks of a unified memory architecture between CPU and iGPU.
Is it new for them? I remember AMD trying something like this a while back, did it go anywhere?

It's new for them, and interestingly enough this may be the first implementation of unified memory in the consumer space, since Intel and AMD APUs segment their GPU/CPU memory, with all the access restrictions imposed by the Windows API that *mostly* forbid sharing data between CPU and GPU.

pTmdfx · Nov 11, 2020

P_EQUALS_NP said:
It's new for them, and interestingly enough this may be the first implementation of unified memory in the consumer space, since Intel and AMD APUs segment their GPU/CPU memory, with all the access restrictions imposed by the Windows API that *mostly* forbid sharing data between CPU and GPU.

It isn’t new to Apple, only new to the Mac.

Metal has long documented that iOS and tvOS operate in the unified memory model (in Metal’s definition), while Mac (aka Intel) IGPs are made to present a discrete memory model.

https://developer.apple.com/library.../doc/uid/TP40016642-CH17-DontLinkElementID_19

mfaisalkemal · Nov 11, 2020

P_EQUALS_NP said:
Even if Apple's claimed 2.6 teraflops uses half floats, 1.3 teraflops of single floats @ 10 watts is pretty amazing!

No, that's the peak performance of the M1 @ 14.3W. The MacBook Air @ 10W is around 1.13 TF FP32. A MacBook Air running a 70% FP16 / 30% FP32 shader code mix will achieve nearly original PS4 shader performance (1.742 TF vs 1.843 TF).
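The post doesn't show how the 1.742 TF figure was derived. One reading that comes close is sketched below: assume FP16 runs at twice the FP32 rate (an assumption here, not something Apple has stated) and weight the 70/30 mix by execution time. The function and constant names are made up for illustration.

```python
# Figures taken from the post above.
FP32_TFLOPS_AT_10W = 1.13   # MacBook Air @ 10W, FP32
PS4_TFLOPS = 1.843          # original PS4, FP32
FP16_SHARE = 0.70           # fraction of shader work done in FP16

# Assumption: FP16 executes at twice the FP32 rate.
FP16_SPEEDUP = 2.0


def blended_tflops(fp32_tflops, fp16_share, fp16_speedup):
    """Effective throughput when a share of the work runs at a faster rate.

    Time for the FP16 share shrinks by fp16_speedup, the FP32 share is
    unchanged, and throughput is the inverse of the total relative time.
    """
    fp32_share = 1.0 - fp16_share
    relative_time = fp16_share / fp16_speedup + fp32_share
    return fp32_tflops / relative_time


mixed = blended_tflops(FP32_TFLOPS_AT_10W, FP16_SHARE, FP16_SPEEDUP)
print(f"blended: {mixed:.3f} TF, {mixed / PS4_TFLOPS:.0%} of the PS4 figure")
```

With the rounded 1.13 TF input this gives about 1.74 TF, so the quoted 1.742 TF probably started from a slightly less rounded FP32 number. A plain weighted average of the two rates would give a higher result, which is why the time-weighted reading is used here.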

Arnold Beckenbauer · Nov 11, 2020

P_EQUALS_NP said:
Even if Apple's claimed 2.6 teraflops uses half floats, 1.3 teraflops of single floats @ 10 watts is pretty amazing!

How can you then explain 82 Gtexels/s?

The M1 is not a mobile SoC like A14, so FP16 performance is less relevant compared to FP32.

Nebuchadnezzar · Nov 11, 2020

P_EQUALS_NP said:
Even if Apple's claimed 2.6 teraflops uses half floats, 1.3 teraflops of single floats @ 10 watts is pretty amazing!

We've had 1.2TFLOP mobile SoCs within 4-5W for well over a year now; https://www.anandtech.com/show/1517...-and-765-5g-for-all-in-2020-all-the-details/2

The M1's 2.6 figure is undoubtedly FP32.

Leovinus · Nov 12, 2020

Apple has now posted more in-depth information on what they compared performance with, scroll to the bottom of the page for footnotes. Though I'm not entirely sure it's detailed enough to give more than a slightly less murky view of things.

The ones I interpret as making mention of the GPU are the following:

Testing conducted by Apple in October 2020 using preproduction MacBook Air systems with Apple M1 chip and 8-core GPU, as well as production 1.2GHz quad-core Intel Core i7-based MacBook Air systems, all configured with 16GB RAM and 2TB SSD. Tested with prerelease Final Cut Pro 10.5 using a 55-second clip with 4K Apple ProRes RAW media, at 4096x2160 resolution and 59.94 frames per second, transcoded to Apple ProRes 422. Performance tests are conducted using specific computer systems and reflect the approximate performance of MacBook Air.
Testing conducted by Apple in October 2020 using preproduction 13‑inch MacBook Pro systems with Apple M1 chip and 16GB of RAM using select industry-standard benchmarks. Comparison made against the highest-performing integrated GPUs for notebooks and desktops commercially available at the time of testing. Integrated GPU is defined as a GPU located on a monolithic silicon die along with a CPU and memory controller, behind a unified memory subsystem. Performance tests are conducted using specific computer systems and reflect the approximate performance of MacBook Pro.
Testing conducted by Apple in October 2020 using preproduction Mac mini systems with Apple M1 chip, and production 3.6GHz quad-core Intel Core i3-based Mac mini systems with Intel Iris UHD Graphics 630, all configured with 16GB of RAM and 2TB SSD. Tested with prerelease Final Cut Pro 10.5 using a complex 2-minute project with a variety of media up to 4K resolution. Performance tests are conducted using specific computer systems and reflect the approximate performance of Mac mini.

Arnold Beckenbauer · Nov 12, 2020

Arnold Beckenbauer said:
So, it's some kind of Apple's IMG Series A? 128 ECUs and 8* TMUs per core (cluster) and up to 1.3 GHz clock?

* or 4 TMUs with doubled clock?

Those who can read have a clear advantage:
https://www.anandtech.com/show/15156/imagination-announces-a-series-gpu-architecture/3

Each SPU houses two USCs in the current IP configuration, meaning we have two clusters of 128-wide ALUs. This is valid for all AXT parts, but we imagine the AXM-8-256 unit just has a single USC. The AXT-16-512 is the smallest configuration with a fully populated SPU.
Each SPU has its own geometry pipeline, and up to two texture processing units. The A-Series carries over the per-TPU throughput design from the Furian architecture, meaning the block is able to sample 8 bilinear filtered texels per clock. The A-Series doubles this up now per SPU and the AXT models feature two TPUs, bringing up the total texture fillrate to 16 samples per clock per SPU.

One M1 "GPU core" should be 8-256-4: 8 texels per clock, 256 FP FLOPs per clock and 4 pixels per clock. The full 8-core GPU is then something like 64-2048-32. Does that make sense?
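A quick sanity check of that scaling, as a sketch only. The per-core 8-256-4 configuration is the guess from this post, and the 1.28 GHz clock is an assumed value picked from the "up to 1.3 GHz" range discussed earlier in the thread. The type and function names are invented for the example.

```python
from typing import NamedTuple


class GpuConfig(NamedTuple):
    """Per-clock throughput, written texels-FLOPs-pixels as in the post."""
    texels_per_clock: int
    flops_per_clock: int
    pixels_per_clock: int


def scaled(config, cores):
    """Multiply a per-core configuration by the number of cores."""
    return GpuConfig(*(value * cores for value in config))


def peak_rates(config, clock_ghz):
    """Return (Gtexels/s, TFLOPS, Gpixels/s) at the given clock."""
    return (
        config.texels_per_clock * clock_ghz,
        config.flops_per_clock * clock_ghz / 1000.0,
        config.pixels_per_clock * clock_ghz,
    )


PER_CORE = GpuConfig(8, 256, 4)   # guessed per-core configuration
CORES = 8
CLOCK_GHZ = 1.28                  # assumed clock, not an official figure

full_gpu = scaled(PER_CORE, CORES)
gtexels, tflops, gpixels = peak_rates(full_gpu, CLOCK_GHZ)

print(full_gpu)
print(f"{gtexels:.2f} Gtexels/s, {tflops:.2f} TFLOPS, {gpixels:.2f} Gpixels/s")
```

At the assumed clock the 64-2048-32 configuration comes out at roughly 82 Gtexels/s and 2.6 TFLOPS, which is in line with the keynote figures, so the 8-256-4 guess is at least self-consistent.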

Nebuchadnezzar said:
We've had 1.2TFLOP mobile SoCs within 4-5W for well over a year now; https://www.anandtech.com/show/1517...-and-765-5g-for-all-in-2020-all-the-details/2

The M1's 2.6 figure is undoubtedly FP32.

plus:
https://www.anandtech.com/show/13661/the-2018-apple-ipad-pro-11-inch-review/6

And with the larger surface area of the iPad compared to the phone, likely a higher frequency as well. There’s now seven of the A12 GPU cores, compared to just four on the iPhone, and Apple claims the GPU in the iPad Pro is equivalent to an Xbox One S, although how they came to this conclusion is difficult to say since we know so little about the underpinnings of the GPU.

In rough terms, the Xbox One S is roughly 1.4 TFLOPS at its peak. But for better or worse, when the PC moved to unified shaders, the industry moved to FP32 for all GPU functions. This is as opposed to the mobile world, where power is an absolute factor for everything, Vertex shaders are typically 32bpc while Pixel and Compute shaders can often be 16bpc. We’ve seen some movement on the PC side to use half-precision GPUs for compute, but for gaming, that’s not currently the case.
