Anybody understand this:
[PATCH] drm/amdgpu: correct the cu and rb info for sienna cichlid
https://lists.freedesktop.org/archives/amd-gfx/2020-October/055069.html
The code already has functions for recording which blocks, like the RBEs, are active or have been disabled by fuses or BIOS settings.
What seems new here is the disabling of shader arrays. Before this, there were facilities for disabling at the shader-engine or individual-RBE level, so Sienna Cichlid would appear to add an intermediate level of salvage.
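As a rough illustration of that intermediate level, here is a minimal standalone C sketch of how a disabled-SA bitmask might be assembled from fuse and VBIOS inputs. Everything in it is a placeholder for illustration; the real patch reads dedicated harvesting registers (CC_GC_SA_UNIT_DISABLE for fuses, GC_USER_SA_UNIT_DISABLE for the VBIOS, if I'm reading it right) and clips the result to the SAs physically present.

#include <stdint.h>
#include <stdio.h>

/* Placeholder inputs; on real hardware these come from fuse and VBIOS
 * harvesting registers. The 0x2 here pretends SA1 of SE0 was fused off. */
static uint32_t read_fuse_disabled_sa(void)  { return 0x2; }
static uint32_t read_vbios_disabled_sa(void) { return 0x0; }

/* OR the fuse and VBIOS masks together, then clip to the number of
 * shader arrays the chip actually has (num_se * sa_per_se bits). */
static uint32_t get_disabled_sa(uint32_t num_se, uint32_t sa_per_se)
{
        uint32_t max_mask = (1u << (num_se * sa_per_se)) - 1;

        return (read_fuse_disabled_sa() | read_vbios_disabled_sa()) & max_mask;
}

int main(void)
{
        uint32_t num_se = 4, sa_per_se = 2; /* Navi 21-like layout */
        uint32_t disabled = get_disabled_sa(num_se, sa_per_se);
        uint32_t se, sa_mask = (1u << sa_per_se) - 1;

        for (se = 0; se < num_se; se++)
                printf("SE%u disabled-SA mask: 0x%x\n",
                       se, (disabled >> (se * sa_per_se)) & sa_mask);
        return 0;
}

Note the helper takes num_se and sa_per_se as parameters rather than hardcoding the layout, which matters for the scalability point further down.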
I searched for the new function that shows up (gfx_v10_3_get_disabled_sa) and found a mention here:
https://www.mail-archive.com/amd-gfx@lists.freedesktop.org/msg53790.html
[PATCH] drm/amdgpu: add function to program pbb mode for sienna cichlid
Add function for sienna_cichlid to force the PBB workload mode to zero by
checking whether any SEs have been harvested.
What PBB mode does exactly, I'm not sure of, though PBB presumably stands for Primitive Batch Binning, the binning rasterizer mode. It does seem that there is at least some distinction between a fully enabled GPU and one with one or more deactivated shader arrays. Perhaps load-balancing is handled differently due to a shift in ROP versus rasterizer capacity, or the algorithm for allocating screen space is altered if the RBEs and rasterizers remain linked at the shader-array level.
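Reading the commit message literally, the check seems to be: if any SA in any SE has been harvested, force the PBB workload mode to 0. A hedged standalone sketch of that logic, in the same placeholder style as above (set_pbb_workload_mode() stands in for whatever register write the real gfx_v10_3_program_pbb_mode() performs, which I haven't confirmed):

#include <stdint.h>
#include <stdio.h>

/* Placeholder for the register write the real function does; which
 * register and field it touches isn't something I've verified. */
static void set_pbb_workload_mode(uint32_t mode)
{
        printf("PBB workload mode forced to %u\n", mode);
}

/* If any shader array in any shader engine is harvested, force the PBB
 * workload mode to 0; otherwise leave the default alone. This is my
 * reading of the commit message, not the verbatim kernel code. */
static void program_pbb_mode(uint32_t num_se, uint32_t sa_per_se,
                             uint32_t disabled_sa_mask)
{
        uint32_t se, sa_mask = (1u << sa_per_se) - 1;

        for (se = 0; se < num_se; se++) {
                if ((disabled_sa_mask >> (se * sa_per_se)) & sa_mask) {
                        set_pbb_workload_mode(0);
                        return;
                }
        }
}

int main(void)
{
        program_pbb_mode(4, 2, 0x2); /* SA1 of SE0 harvested -> mode 0 */
        return 0;
}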
That seems to imply Navi 21 is the only GPU so far where ROPs and shader arrays are disabled.
I wonder if ROPs are bound to shader arrays.
In RDNA, the RBEs are clients of the L1 cache, which is per shader array in Navi 10.
Having the option to disable at the shader-array level may be a change driven by how redundant many of the resources are. There are many CUs in an array, and the code already seems to have a separate bitmask for handling disabled RBEs.
This may indicate that eight shader arrays are enough to make the trouble worthwhile versus the similar amount of rasterizer and RBE hardware per array in Navi 10, or that additional, less-redundant hardware sits at that level.
Another possibility is that gfx_v10_3_program_pbb_mode from my earlier link goes through quite a bit of setup just to check whether a shader engine's shader arrays are active. Perhaps that's for future scalability or consistency in the code, but building the bitmask from system parameters, when the traditional configuration is 1 or 2 SAs per SE, may mean a larger number is possible for this family.
It is a bit odd that all the leaks are for 3DMark. Maybe AMD did like Nvidia and whitelisted only certain apps for AIB drivers.
3DMark leaks seem to happen with regularity for AMD, including for multiple console chips. It happens often enough that I'd have to suspect it's on purpose, or that policies are such that AMD doesn't stop it from happening. I'd imagine the benchmark is a readily available, non-trivial 3D application for early testing and validation, and one the vendor has put more effort into optimizing for or dedicating driver functions to. That might make it more likely that programming and debugging resources are available, and possibly that special frameworks or driver paths exist explicitly for getting early testing functional on it.
That it happens to upload results to the world would be well understood at this point, and it might be part of a controlled leak for marketing, or maybe a way of giving certain interested parties an idea of how to plan their market segmentation.
Anyone taking bets on Infinity Cache being gigabytes of 3D stacked memory? It goes without saying that would be pretty amazeballs.
https://www.freepatentsonline.com/y2020/0183848.html - Cache for Storing Regions of Data
3D stacking on the SoC seems unlikely given the TDP numbers being bandied about.
On-package would be possible if AMD committed to an interposer solution. That might not scale down as well for the smaller SKUs that allegedly have something like this.
As a counterargument, I'm not sure as much die area would be needed for this as the rumored extra area suggests. Concerns about cache coherence and latency aren't traditionally GPU territory either.
The rumors also didn't clearly cite this sort of MCM package or its links, and while HBM2 has bandwidth, the amount it provides versus GDDR6 is similar enough that I'd wonder whether it would be worth the complexity of balancing between the two memory types.