AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

The general layout is similar to Fiji, but I see the instruction and constant caches are 50% up in size for Vega. Was that mentioned in some document or paper?
 
The general layout is similar to Fiji, but I see the instruction and constant caches are 50% up in size for Vega. Was that mentioned in some document or paper?
I thought it was in a Hot Chips presentation, but I didn't see any mention in that PDF. We knew the instruction/constant caches/buffers increased, just without specifics. Vega had 45MB total SRAM (mentioned by Mantor), but limited reference from Polaris/Fiji to extrapolate.
 
The general layout is similar to Fiji, but I see the instruction and constant caches are 50% up in size for Vega. Was that mentioned in some document or paper?
Not sure if that was the source, but I did mention this in our RX Vega Q&A - though under a different premise: transistor count.

Since at most three NCUs share an I/C-cache now instead of four, you need six of each I/C-cache to feed the 16 CUs in each shader engine - in Fiji, four were sufficient. Six vs. four is a 50% increase, of course not in individual capacity but in cache count and (most probably) in terms of transistors.
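A quick sanity check of that arithmetic (a minimal sketch; the 16-CU shader engine and the four-vs-three CU grouping are taken from the posts above):

```python
import math

cus_per_shader_engine = 16    # per the discussion above
cus_per_cache_fiji = 4        # Fiji: four CUs share one I/C-cache group
cus_per_cache_vega = 3        # Vega: at most three NCUs share one

caches_fiji = math.ceil(cus_per_shader_engine / cus_per_cache_fiji)   # 4
caches_vega = math.ceil(cus_per_shader_engine / cus_per_cache_vega)   # 6

print(f"Fiji: {caches_fiji}, Vega: {caches_vega} "
      f"(+{(caches_vega / caches_fiji - 1) * 100:.0f}% cache instances)")
```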
 
Also, there's indeed a lot more logic and SRAM around the HBM interfaces. I guess that since the HBM memory can act as a dynamic cache pool, there must be robust buffering and tagging in there.
 
Vega had 45MB total SRAM (mentioned by Mantor), but limited reference from Polaris/Fiji to extrapolate.
I can count up to 30MB of SRAM in Vega across all known pools. I think the rest is in the front-end macro block in between the multiprocessor arrays. In that region, I'd assume at most 1 or 2 MB of parameter cache for sure, but there's a lot more visible.

hm.... top-right = display, bottom right = PCI-E ?
The bottom-right looks like the SerDes PHY used in Zen, so it must be used for the PCIe interfacing. AMD implemented it in Polaris too.
 
hm.... top-right = display, bottom right = PCI-E ?

AMD's presentation from July 2017 labelled the bottom right corner as part of the security/virtualization system.
Given that this seems to include PHY, perhaps the PCIe block was included because it is linked to SRIOV and hardware management of IO/config. The slide had an L-shaped block at the bottom left next to the HBCC and Infinity Fabric, which I suppose would be the PSP and perhaps its cryptography hardware.

Top right was listed as being part of the display engine (other part is between the HBM PHY blocks).

I'm wondering if the extra sliver of memory in the CUs, in the LDS/scalar unit area, means there's a scalar cache per CU?
 
I can count up to 30MB of SRAM in Vega across all known pools. I think the rest is in the front-end macro block in between the multiprocessor arrays. In that region, I'd assume at most 1 or 2 MB of parameter cache for sure, but there's a lot more visible.
That's roughly the number reached last time we attempted to add everything up. Still a good chunk of SRAM unaccounted for. I can't recall where, but there was some anecdotal evidence the parameter cache may be larger than 1-2MB. That could make sense depending on how NGG was implemented in hardware: in addition to simply having a larger cache for vertices, there would be extra metadata for culling that might sit outside the normal paths.

Speculating here, but some sort of built-in PIM block might make sense; I doubt it would be huge, however. There was a patent discussed a while back dealing with bin intercepts that could likely benefit the DSBR stage. I don't believe we've seen any documentation on binning metadata either, but keeping it separate from the traditional shared caches would make sense. Some sort of delta-compression cache near the memory controller to accelerate DCC may make sense as well.

Might be an MB or two for the video block as well. The 4K60 encode/decode and up to 16 simultaneous users with SRIOV most likely necessitate some extra storage there. Guessing that's 4x4K or 16x1080p streams that should be well buffered.

The ACE/HWS and the GCP may share a larger cache, but again it shouldn't be that large. ACEs in the past were 8 pointers with minimal metadata, as I recall. Stepping up those capabilities without spilling to memory might make sense. Even with 30MB accounted for, that's a third of the SRAM that's still unknown.

EDIT: Vega also added large page support, so it's possible there is a substantial block of SRAM near the HBM controller to actually handle pages. The Linux drivers added 2MB pages, and there are likely multiple, larger, possibly less efficient options. I could see 8MB or more just for page handling making sense. That could make up most of the difference, and the diagram does show a substantial chunk of SRAM in that location.
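Adding up the guesses in this post against Mantor's 45 MB figure, just to see how much of the gap the speculation would cover (every per-block number below is a guess from the thread, not a confirmed size):

```python
# Rough tally of the speculation in this post against the 45 MB total Mantor
# mentioned. The per-block numbers are thread guesses, not confirmed sizes.
total_sram_mb = 45
counted_pools_mb = 30                  # known pools counted off the die shot

guesses_mb = {
    "parameter cache": 2,              # "at most 1 or 2 MB ... a lot more visible"
    "video block":     2,              # "an MB or two for the video block"
    "page handling":   8,              # "8MB or more just for page handling"
}

unaccounted_mb = total_sram_mb - counted_pools_mb
print(f"Unaccounted: {unaccounted_mb} MB (~{unaccounted_mb / total_sram_mb:.0%} of total)")
print(f"Thread guesses cover {sum(guesses_mb.values())} MB of that")
# -> 15 MB (~33%) unaccounted, with the guesses above covering 12 MB of it
```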
 
That mixed Intel/Radeon driver gave me shivers, to be honest. One can only wonder how long it will take for Intel to update their driver releases.
 
I think I located the GDS block next to the command processor, still containing 64KB judging by the count of SRAM banks. With all the overhaul of the memory subsystem, it's weird AMD keeps dragging this proprietary structure forward.
 
I believe AMD still supplies the drivers; it's just an Intel skin
There needs to be some magic for the shared TDP between the CPU and GPU. That doesn't rule out a standard AMD driver, though - presumably there's just some Intel driver which monitors and adjusts the power bits of the GPU via some more or less official AMD API...
But you still need to rely on Intel for actually supplying the driver, even if the GPU driver is unchanged from AMD (unless you can hack-install an AMD driver...) - that's at least what I could gather from the announcement (this is really an Intel product in the end).
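Purely as an illustration of the kind of loop being described, here's a minimal sketch; every function and constant below is invented for the example, and none of it is a real Intel or AMD interface:

```python
# Hypothetical sketch of a shared-TDP balancer: an Intel-side service reads
# CPU power and nudges the GPU power limit through some AMD-provided hook.
# All names and values here are made up for illustration only.

PACKAGE_TDP_W = 65.0                  # example shared CPU+GPU budget
GPU_MIN_W, GPU_MAX_W = 15.0, 50.0

def read_cpu_power_w() -> float:
    """Placeholder for a RAPL-style CPU power readout."""
    return 20.0                       # stand-in value so the sketch runs

def set_gpu_power_limit_w(watts: float) -> None:
    """Placeholder for the 'more or less official AMD API' mentioned above."""
    print(f"GPU power limit set to {watts:.1f} W")

def balance_once() -> None:
    cpu_w = read_cpu_power_w()
    gpu_budget = max(GPU_MIN_W, min(GPU_MAX_W, PACKAGE_TDP_W - cpu_w))
    set_gpu_power_limit_w(gpu_budget)

balance_once()                        # in reality this would run periodically
```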
 
So Vega M here:
http://playwares.com/pcreview/56285625#

If AMD keeps Vega M's ratio of memory/CUs/ROPs going forward, and they can push past what appear to be previous unit-count limits, things might not look so "bad" for a Navi that's still heavily on the traditional GCN path.

Goddamn I want one of those so bad. So lewd with all those ports too :p

Performance is right where I expected for a 3.5 to 4.0 TFLOP (with OC) AMD GPU.
 
Goddamn I want one of those so bad. So lewd with all those ports too :p

Performance is right where I expected for a 3.5 to 4.0 TFLOP (with OC) AMD GPU.

Seems to me to be performing significantly above what "traditionally configured" 3.5 to 4 TFLOP AMD cards would manage. Take the RX 580 as an example: Vega M is only somewhere like 15% behind it while having 66% of the FLOPS. More ROPs per ALU and more bandwidth per FLOP seem to result in significantly better performance per FLOP and per watt.
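Rough arithmetic on those figures (the ~15% and 66% numbers are the estimates above, so treat the result as ballpark only):

```python
# Back-of-the-envelope perf-per-FLOP comparison using the rough figures above.
rx580_perf, rx580_flops = 1.00, 1.00        # RX 580 as the baseline
vega_m_perf, vega_m_flops = 0.85, 0.66      # ~15% behind, ~66% of the FLOPS

ratio = (vega_m_perf / vega_m_flops) / (rx580_perf / rx580_flops)
print(f"Vega M perf per FLOP vs RX 580: {ratio:.2f}x")
# -> roughly 1.29x on these numbers, i.e. close to 30% more performance per FLOP
```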
 
Seems to me to be performing significantly above what "traditionally configured" 3.5 to 4 TFLOP AMD cards would manage. Take the RX 580 as an example: Vega M is only somewhere like 15% behind it while having 66% of the FLOPS. More ROPs per ALU and more bandwidth per FLOP seem to result in significantly better performance per FLOP and per watt.

Makes you wonder if the PS4 Pro, now that we know it has 64 ROPs, isn't actually above Polaris 10 cards in real-world scenarios and how close it is to the Xbone X.
 
Seems to me to be performing significantly above what "traditionally configured" 3.5 to 4 TFLOP AMD cards would manage. Take the RX 580 as an example: Vega M is only somewhere like 15% behind it while having 66% of the FLOPS. More ROPs per ALU and more bandwidth per FLOP seem to result in significantly better performance per FLOP and per watt.
The most important factor isn't performance per FLOP, but performance per square millimeter. The ROPs are quite big, and as a result performance per square millimeter is very close to Polaris 10. Those benchmark results are in the range of Tonga / R9 380X and Polaris 10 / RX 480 - quite close to the RX 470 on average. I can't see any change in efficiency, with the exception of performance per watt.
 
The most important factor isn't performance per FLOP, but performance per square millimeter. The ROPs are quite big, and as a result performance per square millimeter is very close to Polaris 10. Those benchmark results are in the range of Tonga / R9 380X and Polaris 10 / RX 480 - quite close to the RX 470 on average. I can't see any change in efficiency, with the exception of performance per watt.
I completely disagree; people don't buy based on performance per mm.
Performance per mm is not really a metric that has much meaning when the GPU is manufactured at GF, either. We have no idea how much they pay per wafer compared to TSMC, or how far 14LPP is behind TSMC's 16nm.

All perf per mm affects is AMD's margin. Given where AMD's R&D budgets have been, trading perf per mm for performance and especially performance per watt seems like a good set of trade-offs, and considering the WSA it likely helps as well.
 