AMD: Navi Speculation, Rumours and Discussion [2017-2018]

Fully TBDR would be fully deferring the whole frame's shading until all positions and culling is done. Primitive shaders still appear to be invoked at the frequency of the original primitive submissions, and the on-die storage appears to be insufficient to hold everything since the rasterization scheme has multiple outs for exceeding storage.
That's what I'm suggesting, but I imagine there is some work to do there. For existing pipelines they would be invoked per primitive; however, they are designed to transform, cull, and even create more geometry to feed into the rasterizer. How exactly they do that is undefined, as they could use an RNG to endlessly spawn primitives.

It doesn't seem unreasonable for a primitive shader to pass geometry bins as opposed to primitives to the rasterizer. Just need some clarity on where these stages begin and end. Rasterizer then packing waves as it sees fit.

As for on-die storage, I'm assuming a majority of the missing 23MB of SRAM as a victim cache could do it, using undisclosed hardware. Drop to 8/16-bit positions and indices along with some delta compression to make the most of the space with TBDR. Then repeat with full precision and close to perfect ordering.

The instruction is a shortcut for calculating what shader engines may be affected by a primitive. It seems to assume some rather constrained parameters at an ISA level, like 32-pixel tiles, and by its very name applies only to GPUs with 4 shader engines. It's insufficiently precise to do more than determine which shader engines will likely be tasked with doing actual binning and evaluation at sub-pixel accuracy. Relying on a bounding box is conservative, but tasking this instruction or a primitive shader with the evaluation of narrow triangles and other corner cases may lose more than it gains.
In the comments section it mentioned 1SE and 2SE possibilities. For the purposes of binning it could work recursively with variable spatial dimensions of tiles for a TBDR approach. Performance would obviously need to be studied, but a dev could implement IHV style optimizations with the shader. Once AMD documents and releases an API for it at least. That instruction's existence suggests that API exists internally.
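To make that concrete, here's a rough sketch of what such a coverage check could look like in software, assuming 32-pixel tiles and a 2x2 checkerboard mapping of tiles to the four shader engines; the mapping, mask layout, and function name are my own guesses rather than anything AMD has documented.

Code:
# Illustrative sketch only: tile size, the 2x2 checkerboard assignment of
# tiles to shader engines, and the mask layout are assumptions.
TILE = 32          # assumed screen-tile size in pixels
NUM_SE = 4         # the instruction's name implies exactly four shader engines

def se_coverage_mask(xmin, ymin, xmax, ymax):
    """Return a 4-bit mask of shader engines a bounding box might touch."""
    mask = 0
    for ty in range(ymin // TILE, ymax // TILE + 1):
        for tx in range(xmin // TILE, xmax // TILE + 1):
            se = (tx & 1) | ((ty & 1) << 1)   # assumed 2x2 tile checkerboard
            mask |= 1 << se
            if mask == (1 << NUM_SE) - 1:     # early out: all engines hit
                return mask
    return mask

# A small triangle lands in one tile -> one shader engine; a large one
# straddles tiles -> several engines receive the primitive for finer binning.
print(bin(se_coverage_mask(5, 5, 15, 15)))      # 0b1
print(bin(se_coverage_mask(10, 10, 110, 110)))  # 0b1111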
 
I'm not tech savvy enough, so sorry if this question is dumb, but: does Imagination Tech own patents concerning TBDR that could be an obstacle to AMD (or nVidia...) implementing a good TBDR solution?
 
The instruction is a shortcut for calculating what shader engines may be affected by a primitive. It seems to assume some rather constrained parameters at an ISA level, like 32-pixel tiles, and by its very name applies only to GPUs with 4 shader engines.
Which makes me wonder: could all future AMD GPUs always have exactly four shader engines?

The difference amongst GPUs would then be the count of memory channels, ROPs and L2 cache size, and then per shader engine, the count of CUs.

But weren't we once "promised" that AMD would break past the "limit" of four shader engines? Dunno.

I think it would be useful to think about the kinds of chiplet that would make up a GPU and then think about the kinds of combinations that could be made up from them.

---

One thing that gives me yet more pause for thought is that 16 CUs (on 14nm) appear to be about the same area as an HBM2 module. HBM2 modules are about double the size of HBM modules it seems. The power usage of 16 CUs (let's say 50W for the sake of argument though perhaps as much as 70W?) is way too high to put under memory.

In general the area of an HBM2 module is so large that filling the controller die with compute would exceed the power constraints we've talked about (10W or so). 7nm looks unlikely to make a difference there, meaning CUs won't appear there. Which appears to mean that AMD will either:
  • have mostly empty logic dies under the memory - HBM2's base die must be mostly empty, I'm guessing it's just wiring from the DRAM controller out to a very large grid of micro bumps to interface to the interposer (power, data, control) - so adding ROPs and L2 won't consume all the remaining space
  • logic dies under memory will be low-clocked (say 500MHz) and stuffed with ROP and cache logic as well as DRAM controller logic, perhaps with a much higher count of ROPs (ROPs are pretty small - 8 look to be about the size of 1 or 2 CUs)
  • memory chiplets will be much smaller than HBM2 (perhaps smaller than HBM?) - who's going to make such a thing?
Perhaps my starting assumption is wrong: that the base logic die of HBM2 is mostly empty. Perhaps there isn't much room for non-DRAM-controller logic?
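As a rough sanity check on the power side of that argument, assuming an HBM2 footprint of roughly 92 mm² (my assumption) and the 50-70W guess above:

Code:
# Back-of-the-envelope numbers only; the HBM2 footprint is an assumed ~92 mm^2
# and the CU power figures are the rough guesses from the post above.
hbm2_area_mm2 = 92.0          # assumed HBM2 stack footprint
cu16_power_w = (50.0, 70.0)   # guessed power for 16 CUs on 14nm
under_stack_budget_w = 10.0   # the ~10 W ceiling discussed for logic under DRAM

for p in cu16_power_w:
    print(f"16 CUs: {p:.0f} W over ~{hbm2_area_mm2:.0f} mm^2 = "
          f"{p / hbm2_area_mm2:.2f} W/mm^2, "
          f"{p / under_stack_budget_w:.0f}x the assumed under-stack budget")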
 
I'm not tech savvy enough, so sorry if this question is dumb, but: does Imagination Tech own patents concerning TBDR that could be an obstacle to AMD (or nVidia...) implementing a good TBDR solution?
They could license them or consider a buyout or merger. That whole situation is a mess, with Apple ending deals, possibly eyeing a takeover, and using AMD GPUs. TBDR is more a software solution, so it could always end up a platform-specific capability.

Which makes me wonder: could all future AMD GPUs always have exactly four shader engines?
If going to chiplets or MCMs with Navi it wouldn't be unreasonable as each would have four engines operating independently. Does leave open the possibility of a control chip with higher throughput to distribute primitives. AMD was hiring front end engineers, but as they aren't hired yet, I'd assume the results are a ways off.

One thing that gives me yet more pause for thought is that 16 CUs (on 14nm) appear to be about the same area as an HBM2 module. HBM2 modules are about double the size of HBM modules it seems. The power usage of 16 CUs (let's say 50W for the sake of argument though perhaps as much as 70W?) is way too high to put under memory.
The capacity of the HBM will likely be smaller if stacked on a compute die. Vega10 with just the base die would be ~5 stacks of RAM. Two more if you put logic under the existing memory, and no reason that couldn't go larger. Even with 2-Hi stacks at 2GB each you just created a 10GB+ GPU with local memory bandwidth in excess of 1.5TB/s. More with higher clocks and process improvements.

The scaling could be such that lower core clocks make far more sense and drop to near threshold voltages where very little power would be used.
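Rough arithmetic behind the capacity/bandwidth claim, assuming standard HBM2 stacks (1024-bit interface at 2.0 Gbps/pin, i.e. 256 GB/s per stack) and the 5+2 stack arrangement described above:

Code:
# Worked numbers for the 10GB+/1.5TB/s claim; per-stack figures assume
# standard HBM2 (1024-bit interface at 2.0 Gbps per pin).
stacks = 5 + 2            # ~5 over the compute area, 2 more under existing memory
gb_per_stack = 2          # 2-Hi stacks at 2 GB each
gbps_per_pin = 2.0
pins_per_stack = 1024

capacity_gb = stacks * gb_per_stack
bw_per_stack_gbs = gbps_per_pin * pins_per_stack / 8
total_bw_tbs = stacks * bw_per_stack_gbs / 1000

print(f"{capacity_gb} GB, ~{total_bw_tbs:.2f} TB/s aggregate")  # 14 GB, ~1.79 TB/s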

have mostly empty logic dies under the memory - HBM2's base die must be mostly empty, I'm guessing it's just wiring from the DRAM controller out to a very large grid of micro bumps to interface to the interposer (power, data, control) - so adding ROPs and L2 won't consume all the remaining space
I wouldn't be surprised if there was a lot of power circuitry down there, or giant capacitors in the metal layer to attempt to offset swings from the wide IO. As for logic, there would be the controller, thermal limiting, and error checking/hashing, which could take up some space. Could always add cache for tracking metadata on the memory chip as well. SRAM could be interesting for PIM or even faster caching as there wouldn't be a strobe. Buffer frequent reads and writes in memory.
 
It doesn't seem unreasonable for a primitive shader to pass geometry bins as opposed to primitives to the rasterizer. Just need some clarity on where these stages begin and end. Rasterizer then packing waves as it sees fit.
AMD's patents don't mention that form of involvement, and seem to task hardware after the setup pipeline for various bin checks, making them after the early culling posited for primitive shaders. However, AMD's claims are likely general enough to allow some interpretation to span this.

As for on-die storage, I'm assuming a majority of the missing 23MB of SRAM as a victim cache could do it, using undisclosed hardware.
We don't have a ready data point for how many MB of SRAM aren't part of the externally noted storage pools for prior GPUs. While storage related to the expanded geometry and rasterization could be a notable part of it, there's a lot of miscellaneous storage on every GPU that AMD didn't make a distinction for, and potentially a swath of extra buffers throughout for latency compensation.

In the comments section it mentioned 1SE and 2SE possibilities.
1SE seems trivial. 2SE may be derived from whether any of the [5:6] bits for the bounding box flip, once the required prior extent checks are passed.

Which makes me wonder: could all future AMD GPUs always have exactly four shader engines?
The instruction documentation notes that less complex cases use normal instructions.
Nothing would prevent AMD from introducing a different SE count instruction at some later date. Either way, it doesn't strike me as a particularly clean ISA decision, although a minor issue given that it's one instruction so far.
If AMD decided to change any of the semantics in the future, it would need some additional instructions or mode settings. Tile size, assignment patterns, and max extents seem like ready candidates for HWID context values, which would potentially save on setup code or reduce the chance of an architectural mismatch if the instruction were capable of a somewhat more complex evaluation than an 8-bit LUT.
One possible alternative AMD has used before would be to add new instructions and rename the current one as legacy.

It might be a mark against Navi's "scalability" if that is maintained. AMD's HPC chiplet configurations sound like they could readily hit the limits of this instruction with one chiplet, which would leave just as much software work undone if Navi or the next generation actually shipped any MCM graphics products.
If AMD decides to move beyond the constraints of this form of fixed-function pipeline, that calls the value of this instruction into question anyway.

On the other hand, adding more instructions targeted at primitive shaders might mean future implementations will have a more impressive showing of the technique.
Otherwise, if AMD added the instruction because it decided it wouldn't matter much to tack on a kludge to an architecture it was going to revamp or cull, that might be reasonable as well.

But weren't we once "promised" that AMD would break past the "limit" of four shader engines? Dunno.
Anandtech's Vega review indicates they revisited the topic. AMD stated they could have if they decided to, but they decided not to.

"Talking to AMD’s engineers about the matter, they haven’t taken any steps with Vega to change this. They have made it clear that 4 compute engines is not a fundamental limitation – they know how to build a design with more engines – however to do so would require additional work. In other words, the usual engineering trade-offs apply, with AMD’s engineers focusing on addressing things like HBCC and rasterization as opposed to doing the replumbing necessary for additional compute engines in Vega 10."
http://www.anandtech.com/show/11717/the-amd-radeon-rx-vega-64-and-56-review/2

One thing that gives me yet more pause for thought is that 16 CUs (on 14nm) appear to be about the same area as an HBM2 module. HBM2 modules are about double the size of HBM modules it seems. The power usage of 16 CUs (let's say 50W for the sake of argument though perhaps as much as 70W?) is way too high to put under memory.
It would probably need more margin unless that calculation includes the effectively dead area belonging to the power and I/O for the stack. Currently, HBM's DRAM layers lose something like 1/5 or more of their area to the stack's TSVs. The base die has that area lost, and then more area to the PHY and ballout to the interposer below.

Perhaps if a cluster of Vega-like CUs were customized to the practical clock ceiling for silicon under DRAM, they could drop the majority of the transistor bloat Vega had over Fiji.
AMD's estimated power ceiling is about 5-7x lower than those estimates, and given what they say about DRAM's efficiency at high temps, some of the measurements of Vega's voltage and temps for HBM2 are potential signs of trouble.

Perhaps my starting assumption is wrong: that the base logic die of HBM2 is mostly empty. Perhaps there isn't much room for non-DRAM-controller logic?
For HBM1 at least, Hynix indicates the base logic layer has a role in distributing connections from the interposer to the rest of the stack, decoupling capacitors, as well as built-in testing and fault recovery. I'm unclear if there are other functions such as some of the thermal monitoring or DRAM stack functions that might be handled in the logic layer.
http://www.semicontaiwan.org/zh/sit..._taiwan_2015_ppt_template_sk_hynix_hbm_r5.pdf

For any practical solution, that needs to go somewhere in the system and likely somewhere in the stack. Samsung had that proposed cheap version of HBM that would remove the base die, but that leaves the question of what other layers will take the area hit.
I'm not sure if HBM2 populates the base die more, or needed more capacitors. For thermal reasons, the stack may be broader so that they could fit more thermal bumps.
 
AMD's patents don't mention that form of involvement, and seem to task hardware after the setup pipeline for various bin checks, making them after the early culling posited for primitive shaders. However, AMD's claims are likely general enough to allow some interpretation to span this.
Not in current patents, but it may just be the first step. It's not that far off from some of the SIGGRAPH papers. If you think about it, that would be no different than a developer streaming geometry or writing out their own vertex data. Just passing a reference to an array they packed. With a bindless model it would open the door to more flexible approaches. A dev could implement TBDR as a culling strategy.

We don't have a ready data point for how many MB of SRAM aren't part of the externally noted storage pools for prior GPUs. While storage related to the expanded geometry and rasterization could be a notable part of it, there's a lot of miscellaneous storage on every GPU that AMD didn't make a distinction for, and potentially a swath of extra buffers throughout for latency compensation.
True, but AMD did feel a need to mention the figure, so I'd assume there is some significance to it. Roughly half of that figure is accounted for. Going off the ISA and LLVM memory model, having a victim cache in there would make sense. May even be an Infinity thing with Zen APUs and HSA. Even for HBCC it could facilitate write combining and be transparent to the programmer. Problem being, actively using it would probably look like a register spill to a programmer. That's not something any of them would attempt blindly, as it's actively avoided.

If AMD decided to change any of the semantics in the future, it would need some additional instructions or mode settings. Tile size, assignment patterns, and max extents seem like ready candidates for HWID context values, which would potentially save on setup code or reduce the chance of an architectural mismatch if the instruction were capable of a somewhat more complex evaluation than an 8-bit LUT.
There were some bin configuration registers in the Linux drivers, GFX9.c if I recall. Fairly simple settings for DSBR. I'd even wonder about an FPGA they could use for setup, or simply for sorting more efficiently when dealing with a large number of bit masks. Not too unlike their ACE usage.
 
Not in current patents, but it may just be the first step. It's not that far off from some of the SIGGRAPH papers. If you think about it, that would be no different than a developer streaming geometry or writing out their own vertex data. Just passing a reference to an array they packed. With a bindless model it would open the door to more flexible approaches. A dev could implement TBDR as a culling strategy.
This is an area where primitive shaders as described have variable interactions with a DSBR.
A primitive shader is described as a two-phase process, with position calculation and culling in one stage and attribute calculation in the other. While I am unclear as to whether this is all one single workgroup or a producer/consumer pair, it would be the case that the process wouldn't be complete until the inputs to the shader stage are all processed into outbound position and attribute data.
The patent's setup and binning process can do something a shader generally cannot, in that it must dynamically split bins at a different granularity than a dispatch or patch submission, based on what it can determine for bin intercepts and the full evaluation of contributing fragments and the total context data needed to handle them--which the primitive shader does not readily know about unless it does a full rasterization process.

In this regard, the primitive shader's culling of primitives so they do not take up attribute space can help the DSBR. However, the culling capabilities of a primitive shader may cause the DSBR to contribute less to the effectiveness of the geometry front end, since a lot of what the DSBR could cull would be culled already.

True, but AMD did feel a need to mention the figure, so I'd assume there is some significance to it. Roughly half of that figure is accounted for. Going off the ISA and LLVM memory model, having a victim cache in there would make sense.
For a marketing slide, a number being big may be the only significance needed.
I'm not sure where this undisclosed cache sits. The primitive shaders are running on the CUs, and their data paths are defined. There are various caches in the geometry and setup stages that could be upsized, but they aren't new.
 
A primitive shader is described as a two-phase process, with position calculation and culling in one stage and attribute calculation in the other. While I am unclear as to whether this is all one single workgroup or a producer/consumer pair, it would be the case that the process wouldn't be complete until the inputs to the shader stage are all processed into outbound position and attribute data.
It can be either a single invocation (that's how the diagrams show it) that takes a batch of vertices through to fully culled and shaded vertices, or attribute shading can be deferred (the white paper explicitly says this is an option).

The white paper also says that a surface shader can combine VS and HS, followed by a primitive shader that combines DS and GS. In a pipeline configured for tessellation, VS is normally pretty lightweight and attribute shading is left until DS. So the surface shader is another kind of shader that combines stages - though I'm doubtful there's any scope for culling twixt VS and HS. The advantage with the surface shader would appear to be not having to rely upon a producer-consumer buffer (which will be too small, in general). In my opinion HS still dumps patch data into LDS for DS to pick up later (unless driver sets VRAM as the target for the patch data), so using LDS to collate data from VS into HS processing, within the scope of a single invocation seems to make sense.

Culling can be done in the primitive shader after DS, but I believe this depends on the kind of output GS produces (as I wrote about last week).
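As a toy illustration of that two-phase split (positions and culling first, attribute shading deferred to the survivors) - the cull test and data layout below are placeholders, not AMD's actual primitive shader interface:

Code:
# Toy model of the two-phase idea: positions and culling first, attribute
# shading deferred to surviving triangles only. The cull test and the data
# layout are placeholders, not AMD's actual primitive-shader interface.

def phase1_positions_and_cull(positions, tris):
    """positions: list of (x, y) screen-space points; tris: list of index triples."""
    survivors = []
    for i0, i1, i2 in tris:
        (x0, y0), (x1, y1), (x2, y2) = positions[i0], positions[i1], positions[i2]
        area2 = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)
        if area2 <= 0.0:       # back-facing or degenerate: cull here, so it
            continue           # never takes up attribute/parameter space
        survivors.append((i0, i1, i2))
    return survivors

def phase2_attributes(attrs, survivors):
    """Shade attributes only for vertices still referenced after culling."""
    live = sorted({i for tri in survivors for i in tri})
    return {i: attrs[i] * 2.0 for i in live}   # stand-in for real attribute work

positions = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
attrs = [0.0, 1.0, 2.0, 3.0]
tris = [(0, 1, 2), (0, 2, 1)]                # second triangle winds backwards
kept = phase1_positions_and_cull(positions, tris)
print(kept, phase2_attributes(attrs, kept))  # only vertices 0-2 get attribute work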

In this regard, the primitive shader's culling of primitives so they do not take up attribute space can help the DSBR. However, the culling capabilities of a primitive shader may cause the DSBR to contribute less to the effectiveness of the geometry front end, since a lot of what the DSBR could cull would be culled already.
Well, it's worth remembering that in a GPU without a primitive shader to do culling, the triangles would still be culled before DSBR saw them. Primitive shader culling means desired triangles appear at the DSBR in a shorter span of time, because the hardware is less likely to have choked on triangle data causing back pressure.

If binning works, then it works because there's enough triangles in each bin to evaluate against each other (otherwise it's just an expensive buffer in L2, miles away from the rasteriser). The best way to get enough triangles to evaluate is with high locality in time and space. The best way to get locality in time is to kill off bad triangles as early as possible, before they get a chance to choke the geometry section of the pipeline.

Additionally, since DSBR is fighting other clients for its share of L2 space, culling would help DSBR actually work instead of thrashing to do nothing because it'll be getting "full" bins from shorter bursts of better-localised triangles.

Well that's my theory. There's no evidence to refer to and who knows if we'll ever get any.
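A minimal sketch of the binning side of that theory: drop the surviving triangles into fixed-size screen bins by bounding box. The 64-pixel bin size and the triangle format here are made up; the point is only that a pre-culled, well-localised stream fills bins with triangles that can actually be evaluated against each other.

Code:
from collections import defaultdict

# Minimal binning sketch; the 64-pixel bin size and triangle format are made up.
BIN = 64

def bin_triangles(tris):
    """tris: list of (xmin, ymin, xmax, ymax) screen-space bounds."""
    bins = defaultdict(list)
    for tid, (x0, y0, x1, y1) in enumerate(tris):
        for by in range(y0 // BIN, y1 // BIN + 1):
            for bx in range(x0 // BIN, x1 // BIN + 1):
                bins[(bx, by)].append(tid)
    return bins

# Two overlapping triangles in one bin can be evaluated against each other;
# a lone triangle per bin makes the bin little more than an expensive buffer.
tris = [(10, 10, 40, 40), (20, 20, 50, 50), (300, 300, 330, 330)]
for key, ids in sorted(bin_triangles(tris).items()):
    print(key, ids)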
 
It can be either a single invocation (that's how the diagrams show it) that takes a batch of vertices through to fully culled and shaded vertices, or attribute shading can be deferred (the white paper explicitly says this is an option).
Since it has multiple options, the point I was debating is whether that overall process was the binning process as described by AMD for its hybrid rasterizer.
The way that works, however, doesn't necessarily align with what the batching process does. Possibly, the last primitive that leaves the primitive shader would be a finalizing primitive for a batch, but the DSBR could have created and closed multiple batches during a single primitive shader's run, and can potentially exclude a primitive due to more precise evaluation based on the intercepts it generates.

Well, it's worth remembering that in a GPU without a primitive shader to do culling, the triangles would still be culled before DSBR saw them. Primitive shader culling means desired triangles appear at the DSBR in a shorter span of time, because the hardware is less likely to have choked on triangle data causing back pressure.
AMD's binning rasterizer patents still mention the cases of culling covered by the primitive shader, such as back-face culling.
The stages from primitive setup until the point where the rasterizer discards non-contributing fragments were mentioned as part of the process, and the video interview on primitive culling indicated a cycle cost if anything got past the primitive shader.

The whitepaper indicates the primitive shader stage exists in tandem with the fixed-function pipeline, so there is some level of duplication in cases where the primitive shader's coarse check for shader engine tile coverage means sending primitives to a shader engine that still culls them.

Additionally, since DSBR is fighting other clients for its share of L2 space, culling would help DSBR actually work instead of thrashing to do nothing because it'll be getting "full" bins from shorter bursts of better-localised triangles.
I'm curious if AMD doesn't do something to reduce the chances of the DSBR's data being evicted. AMD's discussion of the front-end bottleneck and how it describes the efficiencies of having only a few bins in progress doesn't point to a lot of latency tolerance, and if AMD is worried about losing a cycle due to an unculled primitive, it probably doesn't serve the pipeline if it sporadically loses hundreds based on dynamic conditions.
 
So here's a random thought. What if each lane of a SIMD was executing a different instruction temporally? Similar to that scalar idea I had a while back, but gone overboard. Using almost no VGPRs, as all the temporary results are being forwarded one at a time to the next lane instead of stored. Conceptually a 16-deep pipeline taking 64-1024 cycles to complete 16 instructions and forming a complex graph. Any divergence or masking could easily be skipped without idling lanes, though a full crossbar on the input would be required for redirecting lanes. Covers variable SIMD and flexible scalar, and should be rather efficient. Could still work the old way as well. The normal issue with SIMD is deciding an instruction; if it's repeating, it could do something different. Clocks on that I'd think would be rather high, with all logic next to neighbors and even the RF minimally involved.
 
So here's a random thought. What if each lane of a SIMD was executing a different instruction temporally? Similar to that scalar idea I had a while back, but gone overboard. Using almost no VGPRs, as all the temporary results are being forwarded one at a time to the next lane instead of stored. Conceptually a 16-deep pipeline taking 64-1024 cycles to complete 16 instructions and forming a complex graph. Any divergence or masking could easily be skipped without idling lanes, though a full crossbar on the input would be required for redirecting lanes. Covers variable SIMD and flexible scalar, and should be rather efficient. Could still work the old way as well. The normal issue with SIMD is deciding an instruction; if it's repeating, it could do something different. Clocks on that I'd think would be rather high, with all logic next to neighbors and even the RF minimally involved.

Have you even seen what typical assembly code of any real-world routine looks like?
Have you even tried to analyze how the data flows between those instructions?

Those data dependencies form a complex graph, not a linear list.
 
I know this refers to Epyc, but these numbers sure make an MCM GPU very appealing.

[Image: http://i.imgur.com/AcmTUqW.jpg]


60% of the cost compared to a monolithic die with 10% overhead from using inter-chip IF.

With the cost difference they could even quadruple the IF overhead and it would still be worth it.
Vega 20 might be AMD's last big GPU.
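The ~60% figure is easy to reproduce with a simple Poisson yield model; the defect density below is an assumed round number, not anything from AMD.

Code:
import math

# Simple Poisson yield model to sanity-check the slide's ~60% cost claim.
# The defect density is an assumed round number.
D0 = 0.10  # assumed defects per cm^2

def cost_per_good_die(area_mm2):
    area_cm2 = area_mm2 / 100.0
    die_yield = math.exp(-D0 * area_cm2)   # Poisson yield model
    return area_cm2 / die_yield            # silicon cost ~ area / yield

mcm = 4 * cost_per_good_die(213.0)         # four Zeppelin-sized dies
monolithic = cost_per_good_die(777.0)      # hypothetical single die
print(f"MCM / monolithic cost ratio: {mcm / monolithic:.2f}")   # ~0.62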
 
Have you even seen what typical assembly code of any real-world routine looks like?
Have you even tried to analyze how the data flows between those instructions?

Those data dependencies form a complex graph, not a linear list.
Yes, and it would definitely be interesting to compile for and would require some hardware modifications, but the performance and power should also be interesting if it worked. It would basically be chaining DSPs together and be similar to an FPGA. It removes a good deal of VGPR demand from dependent instructions, as well as the energy to move the data. With a single instruction active for a period of time, even power gating inside an ALU could be practical.

Nvidia listed 40-70% of outputs used within a few cycles, and this would work towards that. The assumption being that the inputs and outputs of lanes are closer to each other than the cache. Diverged code would be a problem, but for some workloads this could work rather well. Even if just breaking a SIMD into quads and processing small chains or large parallel thread groups. There should be sufficient operand bandwidth thanks to forwarding and heavily ported temporary registers, and yes, it would require sufficient operand bandwidth for all initial instructions in each branch. Keep in mind this would be multiple threads executing temporally, so cache hits might make it more viable than it appears on paper. Sixteen threads on paper are 32/48 operands, but many of those may be the same value.

The linear list is just many parallel threads that don't diverge, hopefully enough that the trouble of configuring this array is worthwhile.
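For anyone trying to picture it, here's a toy model of the scheme: each lane is pinned to one instruction of a dependent chain, threads enter one per cycle, and results are forwarded lane to lane instead of being written back to registers, so once the pipeline fills, one thread's whole chain retires per cycle. Purely a thought experiment, not any real GCN behaviour.

Code:
# Toy model: four lanes, each pinned to one instruction of a dependent chain,
# forwarding results lane to lane while threads are fed in one per cycle.

chain = [lambda x: x + 3,      # lane 0: add
         lambda x: x * 2,      # lane 1: mul
         lambda x: x - 1,      # lane 2: sub
         lambda x: x ** 2]     # lane 3: square

def run(inputs):
    depth = len(chain)
    pipe = [None] * depth               # value currently held by each lane
    results = []
    for cycle in range(len(inputs) + depth):
        if pipe[-1] is not None:        # the last lane's value is a finished thread
            results.append(pipe[-1])
        for lane in range(depth - 1, 0, -1):   # forward and apply, back to front
            pipe[lane] = chain[lane](pipe[lane - 1]) if pipe[lane - 1] is not None else None
        # a new thread enters lane 0 each cycle while inputs remain
        pipe[0] = chain[0](inputs[cycle]) if cycle < len(inputs) else None
    return results

print(run([0, 1, 2, 3]))   # [25, 49, 81, 121], i.e. ((x + 3) * 2 - 1) ** 2 per thread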
 
I know this refers to Epyc, but these numbers sure make an MCM GPU very appealing.

Code:
http://i.imgur.com/AcmTUqW.jpg

60% of the cost compared to a monolithic die with 10% overhead from using inter-chip IF.

With the cost difference they could even quadruple the IF overhead and it would still be worth it.
Vega 20 might be AMD's last big GPU.
I wonder how they arrive at 850+ mm², when supposedly they are more area efficient than Intel in Skylake SP. Are 14% moar coars and 33% more memory channels really worth ~30-ish percent more die space, or is DDR4-PHY padding the reason Intel went with only 6 memory channels on Skylake SP?

And on the topic of MCM-GPUs: the Threadripper guides talk extensively about NUMA/UMA and all that stuff that MCMs bring with them. AMD even created pre-defined profiles and switches to alleviate its effects on certain types of games. I am not sure that MCM, in terms of smooth gaming performance, is a solved problem yet.
 
I wonder how they arrive at 850+ mm², when supposedly they are more area efficient than Intel in Skylake SP. Are 14% moar coars and 33% more memory channels really worth ~30-ish percent more die space, or is DDR4-PHY padding the reason Intel went with only 6 memory channels on Skylake SP?

And on the topic of MCM-GPUs: the Threadripper guides talk extensively about NUMA/UMA and all that stuff that MCMs bring with them. AMD even created pre-defined profiles and switches to alleviate its effects on certain types of games. I am not sure that MCM, in terms of smooth gaming performance, is a solved problem yet.
852mm² is the current dies put together; 777mm² is what the supposed monolithic die would have been.
(Though I think the numbers are still wrong; Zeppelin die size is supposed to be ~195mm², not 213mm² as suggested by the slide.)
 
Supposed by...? Wikipedia cites sweclockers. AMD has it on a Hotchips slide (I cannot remember them talking officially on this matter before).

Yes, 777 mm² - my percentages were calculated on that base. Forgot to change the 852. So the question remains.
 
I wonder how they arrive at 850+ mm², when supposedly they are more area efficient than Intel in Skylake SP. Are 14% moar coars and 33% more memory channels really worth ~30-ish percent more die space, or is DDR4-PHY padding the reason Intel went with only 6 memory channels on Skylake SP?

And on the topic of MCM-GPUs: the Threadripper guides talk extensively about NUMA/UMA and all that stuff that MCMs bring with them. AMD even created pre-defined profiles and switches to alleviate its effects on certain types of games. I am not sure that MCM, in terms of smooth gaming performance, is a solved problem yet.

The area comparison was CCX vs 4-core Skylake + L3.

Also, you're forgetting all the extra PCI-E EPYC has, and there are things that are duplicated on every die that won't be used (audio, USB, etc.). The GMI interfaces are very small on the die, way under 10%.

But really, what you're ignoring is the economy of scale of producing one chip that services such a huge TAM. Intel is servicing that same market with 4-5 chips, and in each of those markets none of their chips is significantly better...
 
AMD discounted for redundant logic in their 777mm² figure; the PCIe ports are a good point though!

I am not ignoring anything, I am just wondering.
 
I am not ignoring anything, I am just wondering.

I would just take it at face value then: 777 = 32 cores, 64MB L3, 128 PCI-E lanes, 8-channel DDR4 SoC.

With XCC being 677, a 777mm² 32-core Zen is almost exact scaling per core. Now obviously Intel cores are a lot bigger, but they also have a lot less I/O.

But if AMD can turn around good yields on a ~200mm² 7nm chip (12 cores) in mid-to-late '18, the MCM option (48 cores) all of a sudden looks pretty good against a 650mm² 10nm CPU.
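Quick check on the per-core scaling, assuming the 677 figure refers to Intel's 28-core XCC die:

Code:
# Per-core area check; assumes the 677 mm^2 figure is Intel's 28-core XCC die.
xcc_mm2, xcc_cores = 677.0, 28
amd_mm2, amd_cores = 777.0, 32
print(xcc_mm2 / xcc_cores, amd_mm2 / amd_cores)   # ~24.2 vs ~24.3 mm^2 per core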
 