+reg gfx10 mmSUPER_SECRET mmio 0x12345670
It simplifies what the software has to do in order to support the multi-chip GPU. The cache invalidation process is heavyweight enough that I'm not sure about the extent of the delay that could be added by another GPU performing the same invalidation. The microcode processors and separate hierarchies wouldn't normally be trying to check externally, so there would be additional synchronization on top of that.

Oh, I see. I think that's weird to want, because you lose all the flexibility it could give.
On the other hand, EPYC doesn't look like one CPU core that's transparently just wider. Maybe that "complete" degree of transparency wasn't exactly what was asked of chiplets (or threadlets, corelets, modulets), but rather something more realistic. Aren't we halfway into Lego silicon anyway?
The Infinity Fabric itself is agnostic to the topology that might be chosen for an implementation. If the topology in question that moves ROPs off-chip differs from the more recently discussed systems with constrained connectivity to the stacks, it can keep the general-purpose memory path from being blockaded by the mostly unidirectional traffic of the export bus. The challenge of taking a broad bus off-chip is that the orders-of-magnitude-worse power and physical constraints reveal what costs the design used to treat as effectively "free", or which elements of the design are not prepared to deal with arbitration involving a sea of formerly invisible competing clients if it's all going over a single general-purpose connection.

In Infinity Fabric, compute nodes are multi-ported. The question is whether the base logic die under a stack of memory is a compute node or a memory node in an IF system.
GPUs generally tolerate latency in the data path to DRAM. GCN's wait counts give something of an idea of the relative weights the architecture has between the vector path to heavily contended external memory, and the bespoke internal paths the architecture has hardwired limits on. I would expect a change in the costs and behavior of exports may require an evaluation of the bits devoted to export counts. Vega has shifted things substantially in favor of tolerating vector memory stalls, given the extra bits it gave to VMCNT.

Obviously GPUs don't mind latency in general. So bandwidth is the real question. In graphics, ROPs are the primary bandwidth hog, but too much imbalance in favour of intra-PIM bandwidth is going to hurt compute.
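To put rough numbers on that, here's a minimal C sketch of the s_waitcnt counter widths as I understand them (the exact field widths are my assumption from memory of the ISA docs, so treat them as illustrative): the vector memory counter dwarfs the export and LDS/scalar counters, and Vega widened it further.

```c
/* Illustrative only: assumed GCN s_waitcnt counter widths; check the ISA
 * manual for any given gfx level before relying on these numbers. */
#include <stdio.h>

struct waitcnt_widths {
    const char *arch;
    int vmcnt_bits;   /* outstanding vector memory (VMEM) operations */
    int expcnt_bits;  /* outstanding exports/GDS operations */
    int lgkmcnt_bits; /* outstanding LDS/GDS/scalar-memory operations */
};

int main(void) {
    const struct waitcnt_widths gens[] = {
        { "pre-Vega GCN", 4, 3, 4 },  /* assumed widths */
        { "Vega (gfx9)",  6, 3, 4 },  /* assumed: VMCNT widened */
    };
    for (int i = 0; i < 2; i++) {
        printf("%-13s vmcnt <= %2d, expcnt <= %d, lgkmcnt <= %d\n",
               gens[i].arch,
               (1 << gens[i].vmcnt_bits)  - 1,
               (1 << gens[i].expcnt_bits) - 1,
               (1 << gens[i].lgkmcnt_bits) - 1);
    }
    return 0;
}
```

If exports suddenly carried off-chip costs, a counter that tops out at around 7 outstanding exports is the limit that would feel cramped first (again, assuming I have the widths right).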
The most recent discussion centered on solutions without interposers--which for this use case are even worse-off, but that aside I am fine with 2.5D and/or 3D integration and manufacturing.

This is the fundamental question that we're really struggling for data on, so I'm not trying to suggest it's easy. Obviously an interposer is seen as a solution to bandwidth amongst chips. I don't share your utter disdain for interposers, for what it's worth. They, or something very similar, are the target of lots of research simply because the pay-off is so great.
There's still latency. The shader engine front-end pipeline and workgroup/wavefront arbitration path are significantly less parallel. Their interactions with more of the traditional graphics abstractions and command processor elements, relative to CU-dominated compute, mean there are interactions with elements with low concurrency and wide scope.

I think it would be useful to think in terms of bandwidth amplification and coherence separately. ROPs and BR rely entirely upon both of these things.
The L1s are too small and thrash-prone to isolate their clients much beyond mitigating the impact of on-die L2 latency and the current scatter/gather behavior. Possibly corner cases with lookup tables/textures, distant entries in a hierarchical detail structure, compression metadata, etc. The coalescing behavior of the L1 is also rather limited, last I checked. Adjacent accesses within a given wave instruction could be coalesced, but even mildly out-of-order yet still equivalently coalesce-able addresses might translate into separate L2 hits.

Vega now provides L2 as backing for "all" clients (obviously TEX/compute have L1, so they're partially isolated). For example, texels are fairly likely to only ever see a single CU in any reasonably short (intra-frame) period of time. So that's not amplification. It's barely coherence, too. And so on.
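To make the coalescing point concrete, here's a toy C model (my reading of the behavior being described, not the actual hardware): a coalescer that only merges lanes whose addresses are adjacent in lane order issues one request per cache line for an in-order pattern, but one per lane for a shuffled pattern that covers exactly the same dwords.

```c
/* Toy model, not real hardware: merge a lane into the previous request only
 * if it falls in the same cache line as the immediately preceding lane. */
#include <stdio.h>

#define LANES 64
#define LINE_BYTES 64

static int count_requests(const unsigned addr[LANES]) {
    int requests = 1;
    for (int i = 1; i < LANES; i++) {
        if (addr[i] / LINE_BYTES != addr[i - 1] / LINE_BYTES)
            requests++;  /* new cache-line request */
    }
    return requests;
}

int main(void) {
    unsigned in_order[LANES], permuted[LANES];
    for (int i = 0; i < LANES; i++) {
        in_order[i] = 4u * (unsigned)i;                  /* lane i reads dword i */
        permuted[i] = 4u * (unsigned)((i * 17) % LANES); /* same dwords, shuffled across lanes */
    }
    printf("in-order: %d line requests\n", count_requests(in_order));
    printf("permuted: %d line requests\n", count_requests(permuted));
    return 0;
}
```

Both patterns touch the same 256 bytes, but the shuffled one degenerates into one request per lane under this simple adjacency rule.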
Atomics, although they could be presumed to be part of the L2 if it's being placed in the PIM.

I'm now wondering what fixed-function units should be in PIM, beyond ROPs and BR.
The natural tendency is to stripe addresses across channels to provide more uniform access behavior and higher bandwidth utilization. An EPYC-like MCM solution that keeps the usual physical association between L2 slices and their nearest channels would stripe a roughly equal number of accesses to local and remote L2s.

Remember these are fairly coarse tilings of texture and render targets. Large bursts to/from memory imply substantial tiles.
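To make the striping point concrete, a minimal C sketch, assuming a made-up channel count and interleave granule: a linear walk over a buffer lands almost exactly half of its accesses on channels attached to the other die of a two-die, EPYC-like package.

```c
/* Minimal sketch of address striping across memory channels; channel count
 * and interleave granule are assumptions, not any real memory controller. */
#include <stdint.h>
#include <stdio.h>

#define NUM_CHANNELS 8          /* assumed */
#define INTERLEAVE_BYTES 256    /* assumed granule */

static unsigned channel_of(uint64_t addr) {
    return (unsigned)((addr / INTERLEAVE_BYTES) % NUM_CHANNELS);
}

int main(void) {
    /* Walk 1 MiB at cache-line granularity; with channels 0..3 assumed to sit
     * on the local die, roughly half the accesses end up on remote L2 slices. */
    unsigned local_hits = 0, remote_hits = 0;
    for (uint64_t addr = 0; addr < (1u << 20); addr += 64) {
        if (channel_of(addr) < NUM_CHANNELS / 2)
            local_hits++;
        else
            remote_hits++;
    }
    printf("local %u, remote %u\n", local_hits, remote_hits);
    return 0;
}
```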
It simplifies what the software has to do in order to support the multi-chip GPU. The cache invalidation process is heavyweight enough that I'm not sure about the extent of the delay that could be added by another GPU performing the same invalidation. The microcode processors and separate hierarchies wouldn't normally be trying to check externally, so there would be additional synchronization on top of that.
Zen's coherent caches and generalized fabric make it generally the case that execution will proceed equivalently regardless of which CCX or chip its processes may span. It's true that it doesn't pretend to be one core, but that's where the graphics context differs. A GPU has a heavier context and abstraction, since it is actually a much less cohesive mass of processors.
So finally, in Vega, AMD made all graphics blocks clients of L2. AMD took its sweet time making this change. So, yes, it is clearly something you do cautiously, and PIM with remote L2s adds another layer of complexity.

The Infinity Fabric itself is agnostic to the topology that might be chosen for an implementation. If the topology in question that moves ROPs off-chip differs from the more recently discussed systems with constrained connectivity to the stacks, it can keep the general-purpose memory path from being blockaded by the mostly unidirectional traffic of the export bus. The challenge of taking a broad bus off-chip is that the orders-of-magnitude-worse power and physical constraints reveal what costs the design used to treat as effectively "free", or which elements of the design are not prepared to deal with arbitration involving a sea of formerly invisible competing clients if it's all going over a single general-purpose connection.
I'm thinking about compute algorithms here, rather than graphics. The issue I see is that while it might be possible to have, say, 4:1 intra-PIM:extra-PIM, compute algorithms running outside of PIM (on ALUs which are too hot for a stack of memory) may not tolerate that. So taking Vega as an example, let's say the ROPs have 400 GB/s to themselves and the rest of the chip has to share 100 GB/s. That will surely hurt compute, but might be completely fine for graphics.

How extreme would the ratio have to be in order to deeply cut into the external path's effectiveness?
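Spelling out the arithmetic of that hypothetical (the 400/100 GB/s figures are the made-up ones from the post, not measured Vega numbers):

```c
/* Back-of-envelope sketch of the intra-PIM vs. extra-PIM split described
 * above; all figures are the post's hypothetical numbers. */
#include <stdio.h>

int main(void) {
    double intra_pim_gbps = 400.0;  /* bandwidth the in-stack ROPs keep to themselves */
    double extra_pim_gbps = 100.0;  /* what the external CUs have to share */
    double total = intra_pim_gbps + extra_pim_gbps;

    printf("ratio %.0f:1 (%.0f%% of raw DRAM bandwidth reserved in-stack)\n",
           intra_pim_gbps / extra_pim_gbps,
           100.0 * intra_pim_gbps / total);
    /* A bandwidth-bound compute kernel on the external ALUs sees at most
     * extra_pim_gbps, i.e. a 5x cut versus the full 500 GB/s. */
    printf("compute ceiling: %.0f GB/s vs %.0f GB/s total\n",
           extra_pim_gbps, total);
    return 0;
}
```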
Those functions aren't latency sensitive though.

There's still latency. The shader engine front-end pipeline and workgroup/wavefront arbitration path are significantly less parallel. Their interactions with more of the traditional graphics abstractions and command processor elements, relative to CU-dominated compute, mean there are interactions with elements with low concurrency and wide scope.
ROPs do very little work on the fragment data they receive and spend very little time doing that work (deciding which fragments to write, blending/updating MSAA and depth/stencil).

The ROPs traditionally were meant to brute-force their high utilization of the DRAM bus with their assembly-line handling of import/export of tiles, which is less about having the capacity to handle unpredictable latency than it is about having a generally consistent stream coming over the export bus.
Yes, precisely. L1s are more "cache by name" than "cache by nature". At least for TEX. But for writes there's more value. It's really hard to find anything really concrete.

The L1s are too small and thrash-prone to isolate their clients much beyond mitigating the impact of on-die L2 latency and the current scatter/gather behavior. Possibly corner cases with lookup tables/textures, distant entries in a hierarchical detail structure, compression metadata, etc. The coalescing behavior of the L1 is also rather limited, last I checked. Adjacent accesses within a given wave instruction could be coalesced, but even mildly out-of-order yet still equivalently coalesce-able addresses might translate into separate L2 hits.
They're definitely allowed to be bigger...

Still, putting the L2 at a remove over another connection makes the limited buffering capacity of the L1 less effective, and the costs of the connection would encourage some additional capacity.
AMD has recently been increasing instruction cache sizes. Constant cache is another major performance enhancer...

The shared instruction and scalar caches may need some extra capacity or backing store, given that vector memory handling is the main focus of GCN's latency hiding.
Naively, it's very tempting to characterise a chiplet approach as consisting of three types:

One relative exception could be a video display engine or media processor, which might benefit from proximity in otherwise idle scenarios. That might point to a specific media/IO chiplet with a more modest buffer. Virtualization/security co-processors might do better separate.
Sigh, seems I've gone full circle right back to the start of the thread. So, erm, literally no progress.

A key point of the paper is that the logic in the base die has access to substantially higher bandwidth from within the stack (4x). I'm not sure if this is actually possible with HBM. One could argue that a new variant of the HBM stack could be designed such that intra-stack bandwidth would be monstrous but ex-stack bandwidth would be a fraction of that value.
It's a matter of degree.

Well, thread migration sucks, always. There is no processor where this is free or even cheap.
For the graphics domain elements in question, there is no coherence or correctness ever. Without going to a higher driver/OS function, they're either wrong or physically incapable of contending.

Threaded contention of the cache-lines mapping the same memory region sucks even more. There is only slow coherence over time, which is automatic, true, but very expensive.
At issue here is that CPU semantics have well-defined behaviors and rules for multiple processors. At the hardware level, there is an implementation of both single-processor and multi-processor function as parts of the hardware-enforced model presented to software and developers.

In the end, a multi-threaded core needs to be spoken to in a very different manner than a single core - even ignoring the change of the algorithm.
These products are intended to be sold to a customer base that would have almost purely legacy software relative to an architecture that mandates explicit multi-adapter as the basis for any future improvement, and the reality of the graphics portion is that this is entrusting the architecture to the work ethic of the likes of EA and Ubisoft, or the average level of competence of software on Steam.

Few software engineers do that. When the architecture offers you another, more optimal minimum, you rewrite your algorithms. The kernels of high-performing algorithms have permutations for even minor architecture changes.
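As a generic illustration of that last point (everything here is hypothetical, not any particular vendor's library): the usual pattern is to ship several permutations of a kernel and pick one at runtime based on the detected architecture, rather than counting on the hardware to make the difference invisible.

```c
/* Hypothetical example of per-architecture kernel permutations. */
#include <stdio.h>
#include <string.h>

typedef void (*saxpy_kernel)(int n, float a, const float *x, float *y);

static void saxpy_generic(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; i++) y[i] += a * x[i];
}

/* Stand-in for a tuned variant (different blocking/unrolling for a wider
 * memory path); here it simply unrolls by 4. */
static void saxpy_unroll4(int n, float a, const float *x, float *y) {
    int i = 0;
    for (; i + 3 < n; i += 4) {
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
    for (; i < n; i++) y[i] += a * x[i];
}

static saxpy_kernel pick_kernel(const char *arch) {
    /* A real library would key off queried device properties. */
    if (strcmp(arch, "wide-memory") == 0) return saxpy_unroll4;
    return saxpy_generic;
}

int main(void) {
    float x[8] = {1, 2, 3, 4, 5, 6, 7, 8}, y[8] = {0};
    pick_kernel("wide-memory")(8, 2.0f, x, y);
    printf("y[7] = %g\n", y[7]);
    return 0;
}
```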
Those who've seen what the industry does if there's even a remote amount of fragility, or who have software written prior to the introduction of said product.

Who would want this transparent behaviour? And who would see a chance to make the problem being solved faster with the very same amount of traditional GPU resources distributed over more disparate "cores", even compensating for some degree of coherency overhead? I would believe the latter fraction is larger.
Fermi introduced the read/write cache hierarchy and L2. It didn't massively change Nvidia's client fortunes, but might have had long-term economic effects with compute. Maxwell's significant change seems to stem from Nvidia's pulling techniques from its mobile lines into the mainline, with bandwidth and power efficiency gains. Some of those techniques turned out to be early versions of DCC and tile-based rendering, which did heavily leverage the L2 in ways AMD did not prior to Vega.

We could ask why AMD has stayed away from this for so long when it's been a key part of NVidia's GPUs since, erm ... Fuzzy memory alert. A long time... Was NVidia's adoption of an L2-based architecture the fundamental reason it killed AMD with Maxwell onwards?
For Nvidia, it does seem like many of those do rely heavily on the L2. Does geometry flow through the L2?

Distributed, load-balancing rasterisation (relies upon geometry flowing easily), delta colour compression, tiled rasterisation. All these things rely upon L2 being at the heart of the architecture, as far as I can tell.
It could be a half-attempt, given some of its anomalies seem to align with not disturbing the old paths.

Vega just looks like a half-broken first attempt with many "anomalies" in bandwidth-sensitive scenarios and awesomely bad performance per watt.
One thing I noted when comparing the earlier TOP-PIM concepts versus AMD's exascale chiplets is that there may have been a change in AMD's expectations of processing in memory.

I'm thinking about compute algorithms here, rather than graphics. The issue I see is that while it might be possible to have, say, 4:1 intra-PIM:extra-PIM, compute algorithms running outside of PIM (on ALUs which are too hot for a stack of memory) may not tolerate that.
I was thinking of DICE's method for triangle culling in compute, citing the risks of not quashing empty indirect draw calls, since the command would be submitted to the command processor, which would then sit on it while reading the indirect draw count from memory. The whole GPU could not hide more than a handful of those.

Those functions aren't latency sensitive though.
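For reference, the kind of call involved looks roughly like this (a hedged sketch, not DICE's actual code; Vulkan's vkCmdDrawIndexedIndirectCount is one API that exposes it). The draw count the command processor has to consume lives in a GPU buffer written by the culling pass, which is exactly the memory read it ends up sitting on:

```c
/* Sketch only: assumes a Vulkan 1.2 device (or VK_KHR_draw_indirect_count)
 * and buffers already filled by a GPU culling pass. */
#include <vulkan/vulkan.h>

void record_culled_draws(VkCommandBuffer cmd,
                         VkBuffer argsBuffer,   /* VkDrawIndexedIndirectCommand array, GPU-written */
                         VkBuffer countBuffer,  /* uint32_t draw count, also GPU-written */
                         uint32_t maxDraws)
{
    /* The command processor fetches the actual draw count from countBuffer at
     * execution time; if the culling pass produced zero draws and nothing
     * quashed the call, the front end still pays for that memory round trip. */
    vkCmdDrawIndexedIndirectCount(cmd,
                                  argsBuffer, 0,
                                  countBuffer, 0,
                                  maxDraws,
                                  sizeof(VkDrawIndexedIndirectCommand));
}
```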
I recall Vega reduced the maximum number of CUs that could share an instruction front end to allow for higher clocks. That's a modest increase per-CU in a maximal shader array, though I think the instruction cache is still the same size.

AMD has recently been increasing instruction cache sizes. Constant cache is another major performance enhancer...
Going by AMD's exascale concept, items like PCIe might be built into the interposer as well.

Naively, it's very tempting to characterise a chiplet approach as consisting of three types:
- an interface chiplet - PCI Express, pixel output, video and audio engines, top-level hardware scheduling
- shader engine chiplets - CU with TEX
- PIMs - ROPs, BR
I remember L2 being key in the tessellation data flow.

For Nvidia, it does seem like many of those do rely heavily on the L2. Does geometry flow through the L2?
https://techreport.com/review/30328/amd-radeon-rx-480-graphics-card-reviewed/2

I recall Vega reduced the maximum number of CUs that could share an instruction front end to allow for higher clocks. That's a modest increase per-CU in a maximal shader array, though I think the instruction cache is still the same size.
So, not an increase in cache size, but a tweak to how the cache is used.

If many wavefronts (AMD's name for groups of threads) of the same workload are set to be processed, a new feature called instruction prefetch lets executing wavefronts fetch instructions for subsequent ones. The company says this approach makes its instruction caching more efficient. Polaris CUs also get a larger per-wave instruction buffer, a feature that's claimed to increase single-threaded performance within the CU.
Actually, just like he said, they reduced the number of CUs that can share the same instruction & constant cache from 4 to 3, while the cache itself is the same size as before.

So, not an increase in cache size, but a tweak to how the cache is used.
YES, post-transform geometry stays on-chip in the L2.

For Nvidia, it does seem like many of those do rely heavily on the L2. Does geometry flow through the L2?
And there is a bit of emphasis on cache coherency in that slide, which is why I wonder if AMD's solution is just not as flexible or efficient in this context.

YES, post-transform geometry stays on-chip in the L2.
I'm not allowed to post links yet, but I'll try.
https://www.techpowerup.com/img/17-03-01/f34e39b49c7c.jpg
And there is a bit of emphasis on cache coherency in that slide, which is why I wonder if AMD's solution is just not as flexible or efficient in this context.
In keeping with “Vega’s” new cache hierarchy, the geometry engine can now use the on-chip L2 cache to store vertex parameter data.
This arrangement complements the dedicated parameter cache, which has doubled in size relative to the prior-generation “Polaris” architecture. This caching setup makes the system highly tunable and allows the graphics driver to choose the optimal path for any use case.
Maybe I missed it, but where do you see the L2 mentioned?

About geometry on NV, it always routes through L2
http://on-demand.gputechconf.com/gt...bisch-pierre-boudier-gpu-driven-rendering.pdf
Once the warp has completed all instructions of the vertex-shader, its results are being processed by Viewport Transform. The triangle gets clipped by the clipspace volume and is ready for rasterization. We use L1 and L2 Caches for all this cross-task communication data.