Apple Dynamic Caching on M3 GPU

Scott_Arm


Sounds like the intent is to make large shaders more efficient. Very interesting feature. Just watching now.

My understanding, before fully watching this, is that if you write an "uber shader" the GPU will allocate registers and other resources for the worst case across all of the branches in the shader, so you can easily run into register pressure, which tanks GPU utilization. This feature should allow for more efficient execution of uber shaders, to minimize the total number of shaders and improve utilization. See the sketch below for what that worst-case allocation looks like.
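To make that concrete, here is a minimal sketch written as plain C++ standing in for shader code; the material functions, names and sizes are all made up for illustration, not taken from any real shader:

// Hypothetical uber shader written as plain C++ for illustration.
// A static register allocator has to reserve enough registers for the
// worst-case path, even if most threads take the cheap one.
#include <array>
#include <cstdint>

float shade_simple(float n_dot_l) {
    // Cheap path: only a handful of live values.
    return n_dot_l * 0.5f + 0.5f;
}

float shade_complex(const std::array<float, 32>& coeffs, float n_dot_l) {
    // Expensive path: many live values (think lots of registers).
    float acc = 0.0f;
    for (float c : coeffs) {
        acc += c * n_dot_l;   // every coefficient stays live across the loop
    }
    return acc;
}

float uber_shade(uint32_t material_id,
                 const std::array<float, 32>& coeffs,
                 float n_dot_l) {
    // With static allocation, every thread running uber_shade pays the
    // register footprint of shade_complex, which lowers occupancy even
    // when material_id almost always selects the cheap branch.
    if (material_id == 0) {
        return shade_simple(n_dot_l);
    }
    return shade_complex(coeffs, n_dot_l);
}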

I feel like it's not a coincidence that this feature came at the same time as ray tracing and even mesh shading.

Edit: Okay, so the register file and the threadgroup/tile memory are now caches. Lol, a few more seconds in and it looks like one large cache (per GPU core) serves the register file, the threadgroup/tile memory and the buffer/stack cache.


So if you spill over from the cache, data will go to the last-level cache, but the SIMD scheduler will adjust occupancy to make sure that your running threads fit back into the on-chip cache. Really cool. Basically there is no fixed segmentation of the cache between the register cache, the buffer cache and the tile cache, so any one of those can take up as much of the cache as needed, and the scheduler will make sure the cache is not spilling over to a higher cache level or main memory. A rough model of that trade-off is sketched below.
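As a rough mental model only, here is a toy C++ sketch of that occupancy adjustment; every size, limit and name below is invented for illustration and does not reflect the real M3 cache capacity or allocation policy:

// Toy model: one unified on-chip cache per core, shared between
// register, tile and stack storage, with occupancy reduced until
// the resident simdgroups fit instead of spilling.
#include <algorithm>
#include <cstdio>

struct SimdgroupFootprint {
    int register_bytes;   // per-simdgroup register demand
    int tile_bytes;       // threadgroup/tile memory demand
    int stack_bytes;      // buffer/stack demand
};

// No fixed split between the three uses: only the sum has to fit,
// and the scheduler lowers the number of resident simdgroups until it does.
int max_resident_simdgroups(const SimdgroupFootprint& f,
                            int cache_bytes_per_core,
                            int hw_max_simdgroups) {
    int per_group = f.register_bytes + f.tile_bytes + f.stack_bytes;
    int fit = cache_bytes_per_core / per_group;
    return std::clamp(fit, 1, hw_max_simdgroups);
}

int main() {
    SimdgroupFootprint light {2048, 1024, 256};
    SimdgroupFootprint heavy {8192, 4096, 2048};
    // A heavier footprint doesn't spill to the next cache level;
    // it just runs with fewer simdgroups resident at once.
    std::printf("light: %d resident\n",
                max_resident_simdgroups(light, 65536, 16));
    std::printf("heavy: %d resident\n",
                max_resident_simdgroups(heavy, 65536, 16));
}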

Wondering how long it'll be until we see Nvidia, AMD and Intel copy this design. I think for wide GPUs and ray tracing performance it could be a big win.
 
I see that Apple's latest GPU architecture is now capable of doing dynamic register allocation in hardware, rather than static register allocation by the compiler as is the case on other GPU architectures, yet in one of their most recent technical disclosures they still recommend that programmers compile specialized shader variants!
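For reference, "specialized shader variants" means compiling a separate binary per material path so each variant only pays for its own resources. A loose C++ analogue of that idea (not actual Metal Shading Language, and the names are made up) is compile-time specialization:

// Loose C++ analogue of compiling specialized shader variants:
// the branch is resolved at compile time, so each instantiation
// only carries the register footprint of the path it actually uses.
#include <array>

template <bool kConductor>
float shade_variant(const std::array<float, 32>& coeffs, float n_dot_l) {
    if constexpr (kConductor) {
        float acc = 0.0f;
        for (float c : coeffs) acc += c * n_dot_l;  // heavy path only here
        return acc;
    } else {
        return n_dot_l * 0.5f + 0.5f;               // cheap path only here
    }
}

// shade_variant<true> and shade_variant<false> behave like two separate
// "pipelines"; neither reserves resources for the other's branch.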

 
It's soooo cool. It's a shame that Apple makes products for, uh, Apple; otherwise, CPU- and GPU-wise, Nvidia, AMD, Intel, etc. would all be doing pretty poorly in the consumer space against such competition.

I do wonder/hope someone will micro-benchmark the new GPU arch; it'd be interesting to see if you really could skip specialized shader variants, and what impact this really has on ray tracing, etc.
 
Awesome. This is a direction we need to head in for a multitude of reasons, but even if they all released products supporting this capability tomorrow, it would still be a while before the market is there.

It's got to happen at some point though, and I'm glad Apple seems to have kicked it off!
 
I've stumbled upon some very interesting exchanges involving what a Microsoft representative had to say about the feature ...

Is it true that HLSL has no calling convention, and everything always gets inlined?
D3D assumes GPUs can't do native function calls for anything but raytracing
Even for raytracing, that's only done by the specific functions for that purpose, rather than CPU-style function calls
FWIW, we're designing and building a proper calling convention for HLSL so that we can support true separate compilation. We'll likely require that everything is still inlined at runtime because the performance would be too awful otherwise, but maybe someday GPUs will have a good way to handle calls...

The fundamental problem is that SIMT/SPMD programming inherently makes calling overhead pretty excessive unless you design limited-purpose calls (i.e. calls with capped numbers of usable registers)
Apple has dynamic register allocation, so let's hope it becomes possible someday on other GPUs too
Still has other issues if the branches aren't uniform though
Apple's dynamic register allocation doesn't really help with this problem. There are limitations to what it can solve, and generally it doesn't allow you to increase the number of available registers mid-shader.
It’s probably better to think of Apple’s dynamic caching feature as dynamic deallocation rather than dynamic allocation.
Hmm, I thought it helped if you need to, for example, evaluate two completely unrelated shaders in one, say a dielectric and a conductor material where one might use way more registers than the other. Of course it'd still have the other drawbacks, like not being able to sort the threads to ensure there are no gaps or duplicate work
So at the end of the day I guess it wouldn't matter much. You still need to do some smart sorting if you want this to work correctly
Well, not correctly, but optimally
It does, in that you allocate the worst-case registers up front and can deallocate the unnecessary ones once execution takes a path that uses fewer resources.
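To spell out the sorting point from that exchange, here is a toy sketch (plain C++, with made-up material IDs and result values) of why divergent material branches still hurt even if registers can be freed along the way:

// Toy SIMT model: a wave with mixed materials has to execute both
// paths, with some lanes masked off (idle) during each pass.
#include <array>
#include <cstdint>

constexpr int kWaveSize = 32;

struct Lane {
    uint32_t material_id;  // 0 = dielectric, 1 = conductor (made-up IDs)
    float    result;
};

void shade_wave(std::array<Lane, kWaveSize>& wave) {
    // Pass 1: dielectric lanes run, conductor lanes sit idle.
    for (Lane& lane : wave)
        if (lane.material_id == 0) lane.result = 0.1f;   // cheap path

    // Pass 2: conductor lanes run, dielectric lanes sit idle.
    for (Lane& lane : wave)
        if (lane.material_id == 1) lane.result = 0.9f;   // heavy path

    // Without sorting threads by material first, every mixed wave pays
    // for both passes, however the registers are allocated or freed.
}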
 