AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

What bottlenecks does it address, though? Memory bandwidth, memory capacity, and FP64 throughput seem to be the only changes relative to Vega 64. Routing changes to critical paths would be the biggest potential gain relative to previous Vegas. Voltage could be artificially high to overcome a single bad path, increasing power consumption in areas that don't really need it. The CUs, for example, probably don't need high clock speeds to maintain throughput, but get pulled higher to keep geometry throughput high.
Clocks, though, are currently what make the difference between Fiji and Vega when it comes to compute on the professional models - putting aside the mixed-precision and FP64 differences.
Without the roughly 50% peak clock increase they would have very similar theoretical FP32 TFLOPS specs from AMD. I appreciate it is more complex than this in real-world HPC (memory, for example), but clocks are still pretty fundamental in this case.
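As a rough back-of-the-envelope of what that means (peak FP32 is just 2 ops per FMA per ALU per clock; the clocks below are illustrative round numbers, not exact specs):

    # Peak FP32 throughput: 2 ops (one FMA) * ALUs * clock.
    def peak_fp32_tflops(alus, clock_ghz):
        return 2 * alus * clock_ghz / 1000.0  # GFLOPS / 1000 -> TFLOPS

    fiji = peak_fp32_tflops(4096, 1.0)    # Fiji-class part at an assumed ~1.0 GHz -> ~8.2 TFLOPS
    vega10 = peak_fp32_tflops(4096, 1.5)  # Vega 10-class part at an assumed ~1.5 GHz -> ~12.3 TFLOPS

Same 4096 ALUs in both cases, so the ~50% gap in the theoretical spec comes entirely from the clock increase.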
 
What bottlenecks does it address, though? Memory bandwidth, memory capacity, and FP64 throughput seem to be the only changes relative to Vega 64. Routing changes to critical paths would be the biggest potential gain relative to previous Vegas. Voltage could be artificially high to overcome a single bad path, increasing power consumption in areas that don't really need it. The CUs, for example, probably don't need high clock speeds to maintain throughput, but get pulled higher to keep geometry throughput high.
Are they, though? Do we actually know anything about Vega 20 other than the 4096-bit HBM2 memory and the 7nm manufacturing process?
 
I hope AMD is making a tech demo to show the performance of Vega with all features enabled.

I don't know if this will tell the public (or us) much, as it will certainly only compare against Vega (meaning the performance comparison will be really hard to judge against other hardware).
 
Clocks, though, are currently what make the difference between Fiji and Vega when it comes to compute on the professional models - putting aside the mixed-precision and FP64 differences.
Without the roughly 50% peak clock increase they would have very similar theoretical FP32 TFLOPS specs from AMD. I appreciate it is more complex than this in real-world HPC (memory, for example), but clocks are still pretty fundamental in this case.
Simply ramping clocks from an improved process isn't something I'd consider an architectural improvement. Routing adjustments perhaps, but they could just as easily apply to 14nm. For compute and professional use, HBCC and RPM would be the big gains. For graphics, DSBR, DX12.1 features, primitive shaders, etc. would make a difference. Initial benchmarks did have Vega looking like an overclocked Fiji, but once the new features are used there is an added benefit.

Are they, though? Do we actually know anything about Vega 20 other than the 4096-bit HBM2 memory and the 7nm manufacturing process?
I'm not sure we can tell yet, but a simple move to 7nm alone shouldn't change the architecture. Unless an existing feature is somehow fixed, or quad-packed math is a thing, there doesn't seem to be much added. Even the bandwidth from four stacks may only be in proportion to the clock gains from 7nm. Yes, that's ignoring any efficiency hit from the FP64 that was added. If there are more features coming with Vega 20, nobody has heard anything; even the old roadmap leaks didn't show anything, and I'd think there would have been some rumors or hints at what is coming. Changing cache sizes is all that really comes to mind.
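For the bandwidth side, it is just bus width times per-pin rate, so a rough sketch (the pin rate is Vega 64's ~1.89 Gbps, and keeping it for a 4-stack part is an assumption):

    # HBM2 bandwidth = stacks * 1024-bit per stack * per-pin rate (Gbps) / 8 -> GB/s
    def hbm2_bandwidth_gbs(stacks, pin_rate_gbps, bits_per_stack=1024):
        return stacks * bits_per_stack * pin_rate_gbps / 8

    print(hbm2_bandwidth_gbs(2, 1.89))  # Vega 64, 2 stacks: ~484 GB/s
    print(hbm2_bandwidth_gbs(4, 1.89))  # 4 stacks at the same assumed pin rate: ~968 GB/s

So going to four stacks alone roughly doubles bandwidth before any pin-speed gains are considered.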
 
I don't know if this will tell the public (or us) much, as it will certainly only compare against Vega (meaning the performance comparison will be really hard to judge against other hardware).
Or they should implement it in one game. They invested so much time in WattMan and Link; they should invest more in the NGG fast path.

People buy GPUs because they look good in benchmarks, not because the driver GUI is nice.
 
There is only one reason. The hardware is broken.

The patents are out, so AMD should have no fear of talking about it, and if it's a software issue AMD could use the open source community to help them get the driver working.
 
Simply ramping clocks from an improved process isn't something I'd consider an architectural improvement. Routing adjustments perhaps, but they could just as easily apply to 14nm. For compute and professional use, HBCC and RPM would be the big gains. For graphics, DSBR, DX12.1 features, primitive shaders, etc. would make a difference. Initial benchmarks did have Vega looking like an overclocked Fiji, but once the new features are used there is an added benefit.
I agree, but that is different from the original points discussed, which come back to compute and Vega 20: whether 64 CUs matters and how efficient GCN is, where both are fine for compute/HPC-type applications, or even compute cryptomining if looking at consumers; clocks are part of that discussion.
The points you raise are definitely valid and I agree with them, but this is now mixing the HPC segment with others. Vega 20 is primarily a compute/HPC card, and the functions and features you mention fit more readily with other segments (see below).

The context is compute, though, and theoretical TFLOPS or even the TFLOPS spec; you need core clocks and cores/CUs, although like I said the real world also requires other aspects, such as the memory already touched upon.
RPM does not help with scientific FP32 or FP64, though it is nice for mixed precision. HBCC (beyond the Unified Memory between HBM and system memory) still needs to be seen operating with a real-world HPC application on scaled-up/out nodes; until then it is what one expects in a modern accelerator with Unified Memory.
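To put rough numbers on that (the FP32 figure is the earlier estimate, and the 1/16 FP64 rate is Vega 10's published ratio):

    fp32 = 12.3              # TFLOPS, approximate Vega 10 peak
    fp16_rpm = 2 * fp32      # RPM packs two FP16 values per 32-bit lane -> ~24.6 TFLOPS
    fp64 = fp32 / 16         # Vega 10 runs FP64 at 1/16 rate -> ~0.77 TFLOPS

RPM only helps where the work can actually be expressed as packed FP16; scientific FP32/FP64 codes see none of it.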

For future generations in the HPC/compute segment, AMD will probably have to implement one or more of those patents you and others have found in the past that focus either on scaling CUs or on changes at the vector unit/ALU level.
And it is fair to say Nvidia can only scale their own current architecture so far (although it is scaling surprisingly well in some ways) before potential contention/bottlenecks (Arun's tool looks like it may have shown some of this with V100 when looking purely at the geometry engine ratio, but that is more of an issue for the GeForce rather than the Quadro/Tesla segment).
 
There are two Linux drivers out there. One is from AMD and one is from the community.

The reward is of course for the community ;)
 
Interesting, but more or less a fat PCIe connection. IMHO it won't really change the architecture, but it will accelerate certain tasks with a lot of traffic going off chip. The exception would be if it enables some sort of MCM interconnect like Epyc.

There is only one reason. The hardware is broken.

The patents are out, so AMD should have no fear of talking about it, and if it's a software issue AMD could use the open source community to help them get the driver working.
Not necessarily broken, just not worth the effort given a more versatile solution. With compute-based culling the implicit paths may perform worse, and the explicit paths probably better, but that requires Vega-specific programming which, in the current mining environment, I'm not sure makes sense. No devs, or even AMD for that matter, would see a point in developing it right now for the existing Vega marketshare. That's in addition to, or replacing, the compute-based culling implementation that should run on most if not all current hardware.

RPM does not help with scientific FP32 or FP64, though it is nice for mixed precision. HBCC (beyond the Unified Memory between HBM and system memory) still needs to be seen operating with a real-world HPC application on scaled-up/out nodes; until then it is what one expects in a modern accelerator with Unified Memory.
RPM and Tensor Cores are somewhat analogous, so it's more than just mixed precision. HBCC is probably working on some HPC workloads - specifically the oil and gas or astronomy guys with extremely parallel workloads on large datasets. The oil and gas guys were the ones requesting the feature in the first place, as I recall. They were also likely using a SAN for storage, and we haven't seen any relevant cards publicly.

For future generations in the HPC/compute segment, AMD will probably have to implement one or more of those patents you and others have found in the past that focus either on scaling CUs or on changes at the vector unit/ALU level.
I'd imagine there will be some changes to the CUs in the future. Personally I still favor cascaded DSPs to avoid heavy VGPR traffic and essentially create deeper pipelines where practical, creating a systolic network for macro blocks. That design should be a step beyond what Nvidia is doing with register file caching.

That being said, I'm not sure AMD needs to worry about scaling past 64 CUs as process tech isn't getting much smaller and their goal was multiple chips working together. The 64 CU limit is a nice square number from a parallel hardware perspective. The solution would be to make extremely efficient, mobile-like cores with an infinitely fast interconnect between them. Some of the upcoming differential or modulated signaling implementations may allow that. It's always possible Vega 12 has xGMI and is doing something similar. The current roadmap is mobile or 7nm compute parts so it would seem plausible. Leave Navi for scaling beyond two(?) chips efficiently.

There are two Linux drivers out there. One is from AMD and one is from the community.
AMD actively maintains both, although the community one does have additional devs working on it. The problem lies in code that can't be freely distributed or readily meet kernel standards.
 
You use the NGG fast path to reduce the load on the shaders, and then you do the culling in shaders anyway. That makes no sense.

I think the NGG fast path works fine; see the example in the whitepaper, which was a real-world result. It's just hard to use in games, because every game has its own LOD system.
 
You use the NGG fast path to reduce the load on the shaders, and then you do the culling in shaders anyway. That makes no sense.

I think the NGG fast path works fine; see the example in the whitepaper, which was a real-world result. It's just hard to use in games, because every game has its own LOD system.

An example from the labs in a whitepaper isn't really a "real-world result" IMO.
 
By real world I mean there was a program and a driver where you could test it on the real hardware. It was not a calculation on paper.
 
RPM and Tensor Cores are somewhat analogous, so it's more than just mixed precision. HBCC is probably working on some HPC workloads - specifically the oil and gas or astronomy guys with extremely parallel workloads on large datasets. The oil and gas guys were the ones requesting the feature in the first place, as I recall. They were also likely using a SAN for storage, and we haven't seen any relevant cards publicly.
AMD has shown HBCC with all its functions as a working demo/concept (not quite the right word, as it is more than that, but I'm not sure what to call it) on a GPU rather than on a scaled-up/out accelerator node. So far in this segment it is a traditional modern Unified Memory accelerator solution, and that is good, but it is best to temper expectations, as one did when AMD showed all the features and functions Vega had pre-launch. An HBCC storage cache would be incredibly complex (with overheads) in a scaled-up/out HPC solution; look at the new HPC cache/storage solutions with products based around CCIX, or IBM's next generation of CAPI (not just the interconnect, but where it is positioned, the control flow, etc.).
RPM is mixed precision; where are you getting the idea it can do accelerated tensor/GEMM mathematics beyond 2xFP16?
If it could, AMD would have shown this in a demo with current Vega. I'm not sure it would align with RPM, as it would ideally need specific architecture functions/instructions to co-exist with the current architecture design.

Edit:
I went back to the HotChips presentation, and they say "Optimized GEMM for Deep learning".
 
You use the NGG fast path to reduce the load on the shaders, and then you do the culling in shaders anyway. That makes no sense.
NGG wouldn't necessarily reduce the load on the shaders, but rather the 4 tri/clock bottleneck in the front end, by removing culled triangles. That can be done with primitive shaders or with a compute shader culling and compacting the stream. Technically they could be chained together, but that would be a bit redundant. Ignoring, for the sake of argument, that both shader types run on the shader array, neither PS nor CS would reduce the shader load unless presented with some interesting culling capability by the developer - some culling operation, hinted by the developer, that the default path couldn't pick up on or implement due to its guarantees. Typically GCN would have difficulty filling the shader array because of an inability to feed geometry through the front end. Async works around this bottleneck, and in this case the culling shaders are scheduled during the previous frame.
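For illustration, this is a minimal sketch of what a cull-and-compact pass does conceptually - drop back-facing triangles and compact the survivors into a dense stream so the fixed-rate front end only sets up work that will actually be rasterized. It's CPU-side Python of the idea, not AMD's implementation, and the winding convention is an assumption:

    import numpy as np

    def cull_and_compact(tris):
        # tris: (N, 3, 3) array of screen-space triangle vertices (x, y, z).
        # Signed area of the projected triangle; <= 0 means back-facing or degenerate.
        e1 = tris[:, 1, :2] - tris[:, 0, :2]
        e2 = tris[:, 2, :2] - tris[:, 0, :2]
        signed_area = e1[:, 0] * e2[:, 1] - e1[:, 1] * e2[:, 0]
        keep = signed_area > 0
        return tris[keep]  # stream compaction: survivors only, no gaps

On the GPU the same thing is done by a compute or primitive shader writing surviving indices into a compacted index buffer, which is why the compute variant runs on most hardware rather than just Vega.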

An HBCC storage cache would be incredibly complex (with overheads) in a scaled-up/out HPC solution; look at the new HPC cache/storage solutions with products based around CCIX, or IBM's next generation of CAPI (not just the interconnect, but where it is positioned, the control flow, etc.).
Complexity depends on how they are using it. For certain HPC tasks with large read-only datasets, most of that overhead would be nonexistent. That should be the case for the oil and gas guys, with the implementation somewhat proprietary. Large-scale rendering or raytracing could be similar: multiple GPUs each working on a subset of the screen space, with HBCC caching the pages that get hit. As each GPU would be completely independent, the control-flow issues go away. CCIX and CAPI are interesting for a certain segment of problems, but shouldn't be necessary for an SSG-type problem with a SAN. Once the GPUs have to start synchronizing it can get more difficult, but automated paging makes that far easier - no different from CPU programming, where data pages automatically. That would be familiar to many researchers with limited programming ability. It becomes a question of efficiency, and any gaps are filled with other work thanks to async compute where practical. So long as all the jobs don't generate stalls simultaneously, the chips should stay near peak performance. If that is occurring, the implementation will be problematic on any hardware.
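To make the automated-paging point concrete, the programming model being described is basically a demand-paged cache sitting in front of a dataset that doesn't fit in local memory. A toy sketch of that idea (the class, page size, and single-page reads are all simplifications for illustration):

    from collections import OrderedDict

    PAGE_SIZE = 2 * 1024 * 1024  # 2 MiB pages, an arbitrary choice for this sketch

    class PageCache:
        # Toy LRU cache standing in for local HBM in front of a large read-only dataset.
        def __init__(self, backing, capacity_pages):
            self.backing = backing        # e.g. a memory-mapped file or remote store
            self.capacity = capacity_pages
            self.pages = OrderedDict()    # page index -> bytes, kept in LRU order

        def read(self, offset, length):
            # Assumes a read never crosses a page boundary, to keep the sketch short.
            page = offset // PAGE_SIZE
            if page not in self.pages:    # "page fault": fetch from the backing store
                if len(self.pages) >= self.capacity:
                    self.pages.popitem(last=False)  # evict the least recently used page
                start = page * PAGE_SIZE
                self.pages[page] = self.backing[start:start + PAGE_SIZE]
            else:
                self.pages.move_to_end(page)        # mark page as recently used
            local = offset - page * PAGE_SIZE
            return self.pages[page][local:local + length]

With read-only data there is no write-back or coherence to worry about, which is why the independent-GPU case scales so cleanly.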
 