AMD: Speculation, Rumors, and Discussion (Archive)

I'm thinking it's more about "NVIDIA hasn't figured out a way to use both at once, and think SLI bridges are better for gaming"
If my solution A is the way it works, then they're absolutely right that it's better.

And it's not even hard to implement: treat the SLI bridge as a virtual video interface that sits well past the intricacies of resource sharing and whatnot, and you're done.

It seems much smarter to me to bet on that solution than to think that you're smarter than the combined intelligence of a bunch of engineers who think about it all day.
 
Are you guys talking about implicit or explicit multi-GPU? Because AFAIK explicit multi-GPU uses PCIe. But the way it is talked about, it seems like it's geared toward bulk transfers.
 
Ok. I assume that this would be a relatively minor performance optimization?
You asked why I think 32 ROPs are enough and I think geometry discard could make ROPs more efficient. Is that a minor optimisation if that's how it works?

Since not doing any work at all is by far the best way to improve performance, you'd think that in decades of GPU design this would already have had its share of attention.
Why has it taken so long for delta colour compression to get to where it is? Or do you think that bandwidth isn't the single most important constraint in GPU design since forever?
 
If you are talking about linked adapter mode then why are you talking about sharing resources (non framebuffer), aren't resources replicated across adapters in linked mode?
 
If you are talking about linked adapter mode then why are you talking about sharing resources (non framebuffer), aren't resources replicated across adapters in linked mode?
Cross-node sharing/linked adapter mode (though some adapters may come with multiple linked nodes, mostly 2 nodes), known commercially as SLI and CFX, allows explicit sharing of resources under Direct3D 12 (and I guess Vulkan too, though I have not studied that one in depth). This is different from Direct3D 11 and older high-level APIs, where practically every resource needs a copy in every VRAM pool (though later versions of the NVLINK and AGS APIs allowed a little control over them). What can and cannot be shared is expressed by the cross-node sharing tier. A higher cross-node sharing tier can be nullified if the cross-node configuration is limited by bandwidth: SLI bridges - even the new "double" bridge of Pascal - have lower bandwidth than the PCI-E lanes a typical multi-GPU configuration can provide.
Cross-node sharing is historically tied to AFR, but with the new APIs many different and more efficient techniques or implementations are allowed.

Instead, cross-adapter mode is when you have different adapters (even of different vendors or architectures) seen as different devices. In cross-adapter mode you can share a resource heap too. In this scenario "share" literally means "copy", but that's not a major issue, since in this mode you really want the GPUs to do different jobs and to minimize the dependencies. Of course, different hardware and different configurations may differ in the efficiency of their cross-adapter resource sharing implementation (i.e. two cards of the same vendor with similar architectures may optimize the sharing better than two cards from different vendors).
Cross-adapter mode can be very efficient if the workload is well balanced between the GPUs involved. AOS is at least one application using this kind of multi-GPU technique, and it demonstrated both that the approach can be very efficient and that the results may vary if the GPUs involved swap jobs.
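For reference, a minimal (untested) sketch of how an application could query these capabilities under Direct3D 12; the structures, flags and calls are the real D3D12 ones, while the helper function around them is just illustrative:

```cpp
// Minimal sketch: querying the multi-GPU capabilities described above,
// assuming an already-created ID3D12Device*.
#include <windows.h>
#include <d3d12.h>
#include <cstdio>

void ReportMultiGpuCaps(ID3D12Device* device)
{
    // Linked-adapter mode: one logical device exposing several nodes (SLI/CFX style).
    UINT nodeCount = device->GetNodeCount();

    D3D12_FEATURE_DATA_D3D12_OPTIONS options = {};
    if (SUCCEEDED(device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS,
                                              &options, sizeof(options))))
    {
        // The cross-node sharing tier says which resources may be shared between nodes.
        printf("Nodes: %u, cross-node sharing tier: %d\n",
               nodeCount, (int)options.CrossNodeSharingTier);

        // Cross-adapter mode instead shares heaps created with
        // D3D12_HEAP_FLAG_SHARED_CROSS_ADAPTER between separate devices.
        printf("Cross-adapter row-major textures supported: %s\n",
               options.CrossAdapterRowMajorTextureSupported ? "yes" : "no");
    }
}
```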
 
You asked why I think 32 ROPs are enough and I think geometry discard could make ROPs more efficient. Is that a minor optimisation if that's how it works?


Why has it taken so long for delta colour compression to get to where it is? Or do you think that bandwidth isn't the single most important constraint in GPU design since forever?

I think a while ago someone was blown away by Polaris's discard abilities. I think he said "in some cases it's faster than anything in the market", referring to the discard capabilities that Polaris had.

Having BW is always important, but BW alone is useless unless you can use it (Fury).

In my opinion, if Polaris can do the things AMD says it can do, then Polaris's resources are "good enough" for the work AMD wants Polaris to do. My biggest concern is what will happen to the gap between the 230 dollar Polaris and the 400 dollar Pascal. You can't just throw a bunch of "high-end custom designs" of a mid-range GPU at the 170 dollar gap... especially if Polaris OC is as weak as the latest rumors suggest.
 
I'm thinking it's more about "NVIDIA hasn't figured out a way to use both at once, and think SLI bridges are better for gaming"
More surprising is that we haven't seen NVLink being used over their existing SLI connector. Might be a pin issue, but it would be a rough substitute for the lack of PCIe bandwidth. You could have a second card that was all memory.

If you are talking about linked adapter mode then why are you talking about sharing resources (non framebuffer), aren't resources replicated across adapters in linked mode?
In the case of compute shaders, especially asynchronous, I could see the sharing issues getting rather interesting. Some of the uses would only be practical on a single adapter.
 
I think a while ago someone was blown away by Polaris's discard abilities. I think he said "in some cases it's faster than anything in the market", referring to the discard capabilities that Polaris had.

Having BW is always important, but BW alone is useless unless you can use it (Fury).

In my opinion, if Polaris can do the things AMD says it can do, then Polaris's resources are "good enough" for the work AMD wants Polaris to do. My biggest concern is what will happen to the gap between the 230 dollar Polaris and the 400 dollar Pascal. You can't just throw a bunch of "high-end custom designs" of a mid-range GPU at the 170 dollar gap... especially if Polaris OC is as weak as the latest rumors suggest.

Are you referring to the quote in this post?
 
SLI was introduced for NVidia cards in 2004, back when AGP slots were the standard. PCIe 1.0 came out enabling installing multiple cards simultaneously.. so cool! But PCIe 1.0 bandwidth is only about 1/8 of today's PCIe 3.0 bandwidth, and implementations were not as polished. The SLI bridge gave NVidia the ability to directly share rendered images (alternate frames, partial frames, interleaved frames..) over a private bus they could control, and therefore not sensitive to exact PCIe bus contention, motherboard quality, or northbridge chipset behaviors.. (ah, 2004, we had northbridges!)

It's now over 10 years later and PCIE bandwidth is no longer a major issue for graphics cards, even for sharing 4K@60Hz frames. So why keep SLI?
The answer is simple. It is guaranteed bandwidth, and more importantly it's guaranteed low latency communication. You want to minimize frame stuttering when using multiple GPUs? Then a low, fixed-latency pipe makes it much easier. Using PCIe only is still possible of course, but if you want a guaranteed, low latency transfer of video frames, the SLI backchannel gives it to you.

Starting now with Pascal (and perhaps Polaris??), new unified memory models will mean that GPUs share RAM with each other (and the CPU) more transparently and likely more frequently. So our PCIe buses are going to be busier than they have been in the past.. meaning the backchannel bus is more useful now than it was last year. NVidia's new SLI bridges are higher bandwidth, likely to handle 4K or higher frames at higher refresh rates.

In summary, SLI bridges are useful not because of extra bandwidth, but for the guaranteed low latency communication. The freed PCIE bandwidth is a pleasant side effect.
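To put numbers on the "PCIe is no longer a major issue for sharing 4K@60Hz frames" point, here is a quick back-of-the-envelope check; the RGBA8 frame format and the ~15.75 GB/s usable figure for a PCIe 3.0 x16 link are assumptions on my part:

```cpp
// Quick sanity check of the frame-sharing bandwidth claim above.
// The RGBA8 format and the usable PCIe 3.0 x16 figure are assumptions.
#include <cstdio>

int main()
{
    const double frame_bytes  = 3840.0 * 2160.0 * 4.0;   // one 4K RGBA8 frame
    const double frames_per_s = 60.0;
    const double pcie3_x16_bw = 15.75e9;                  // ~15.75 GB/s usable

    const double share_bw = frame_bytes * frames_per_s;   // ~2 GB/s
    printf("4K@60 frame traffic: %.2f GB/s, about %.0f%% of a PCIe 3.0 x16 link\n",
           share_bw / 1e9, 100.0 * share_bw / pcie3_x16_bw);
    return 0;
}
```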
 
You asked why I think 32 ROPs are enough and I think geometry discard could make ROPs more efficient. Is that a minor optimisation if that's how it works?
When talking specifically about making ROPs more efficient, I mean it from a narrow point of view: increasing efficiency from, say, 90% of their theoretical peak to 95%, not avoiding them altogether.

Of course, improved discard would help to avoid the alleged 32 ROP bottleneck...

Why has it taken so long for delta colour compression to get to where it is?
Because bandwidth is not improving as fast as core shader performance, and the gates required to implement compression are becoming relatively cheaper.

Or do you think that bandwidth isn't the single most important constraint in GPU design since forever?
That's exactly what I think. I've stated many times in the past, in the context of the introduction of HBM, that there's no need yet to go all out on bandwidth. That doesn't mean that it's not important, but it's definitely not the single most important constraint. If it were, Fury X would blow all other GPUs out of the water.
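As a rough illustration of that bandwidth-versus-gates argument, here is a small worked example; every figure in it (ROP count aside, which is the one under discussion) is an assumed round number chosen only to show the shape of the trade-off:

```cpp
// Rough illustration: peak ROP colour-write traffic vs. DRAM bandwidth.
// All figures are assumed round numbers, purely to show the trade-off.
#include <cstdio>

int main()
{
    const double rops        = 32.0;      // ROP count under discussion
    const double clock_hz    = 1.2e9;     // assumed core clock
    const double bytes_pixel = 4.0;       // RGBA8 colour write, no blending reads

    // Upper bound if every ROP retired one pixel per clock.
    const double rop_write_bw = rops * clock_hz * bytes_pixel;   // ~153.6 GB/s

    const double dram_bw = 256e9;         // assumed 256 GB/s board
    printf("Peak ROP writes: %.1f GB/s of %.0f GB/s DRAM (%.0f%%), before "
           "texture, geometry and Z traffic - hence the gates spent on DCC.\n",
           rop_write_bw / 1e9, dram_bw / 1e9, 100.0 * rop_write_bw / dram_bw);
    return 0;
}
```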
 
In summary, SLI bridges are useful not because of extra bandwidth, but for the guaranteed low latency communication. The freed PCIE bandwidth is a pleasant side effect.
I agree with the general content of your post, but PCIe has enough bandwidth to guarantee the QoS levels needed to transfer real-time video streams, given enough buffering: either off-chip RAM (which would eat up some extra MC bandwidth) or on-chip FIFOs feeding straight into the display output port.
The latter turns an SLI/Crossfire-over-PCIe implementation into a matter of cost: how much FIFO die area do you need to guarantee no underflows when scanning out pixels?
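A back-of-the-envelope sketch of that FIFO-sizing question, where the 100 microsecond worst-case PCIe stall is a placeholder assumption rather than a measured number:

```cpp
// Back-of-the-envelope sizing of the scanout FIFO mentioned above.
// The 100 microsecond worst-case PCIe stall is a placeholder assumption.
#include <cstdio>

int main()
{
    const double bytes_per_pixel = 4.0;                    // RGBA8
    const double pixels_per_s    = 3840.0 * 2160.0 * 60.0; // 4K @ 60 Hz scanout
    const double scanout_bw      = bytes_per_pixel * pixels_per_s;  // bytes/s

    // Longest gap in PCIe delivery the FIFO must ride out without underflow.
    const double worst_case_stall_s = 100e-6;

    const double fifo_bytes = scanout_bw * worst_case_stall_s;
    printf("Scanout rate: %.2f GB/s -> on-chip FIFO of roughly %.0f KB\n",
           scanout_bw / 1e9, fifo_bytes / 1024.0);
    return 0;
}
```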
 
Geometry discard is of 3 types:
1. Backface culling : Solved
2. View volume culling : Solved
3. Hidden surface culling : Solved in TBDR

My point: there's not much left in culling more triangles to ultimately reduce pixel shading and ROP work. So not much to be gained here, I think.

You could scale the geometry engines, i.e. make more shader clusters with 8 CUs each instead of 16 as on Fiji; this would also increase the rate of discard. It won't actually save pixel shading work, but it reduces the probability of geometry being a bottleneck. Then again, I am pretty sure the bottleneck in most modern games is either pixel shading or memory bandwidth rather than geometry.
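For what it's worth, type 1 in the list above boils down to a sign test on the projected triangle's area; a minimal sketch, with the counter-clockwise-is-front-facing convention being an assumption:

```cpp
// Minimal sketch of culling type 1 (backface) from the list above: a sign test
// on the projected triangle's area. The winding convention is an assumption.
struct Vec2 { float x, y; };

// Negative or zero signed area means the triangle faces away from the camera
// (or is degenerate), so it can be dropped before rasterisation and never
// generates pixel-shader or ROP work.
bool IsBackfacing(Vec2 a, Vec2 b, Vec2 c)
{
    const float signedArea = (b.x - a.x) * (c.y - a.y) - (c.x - a.x) * (b.y - a.y);
    return signedArea <= 0.0f;   // counter-clockwise = front-facing (assumed)
}
```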
 
With NVLink-like signaling, signal integrity could be a real issue.
True, but Nvidia still managed to get 8 working in a box. Granted that was using that mezzanine connector. With HBM likely removing some pin constraints, I'm still wondering if we'll see a dual die card where a link could be hardwired. What's to say they couldn't modify a motherboard for a socketed GPU either? The interconnects are likely all GMI now. A dual or quad socket board, while typically reserved for servers, could have GPUs stuck in some sockets.
 
I think they're talking about rejecting triangles before they get rasterised.

My theory is that pixel shading/ROP is bottlenecked by useless triangles living for too long and clogging up buffers, when they shouldn't even be in those buffers. The ROPs can't run at full speed (e.g. in shadow buffer fill) if triangles are churning uselessly too far through the pipeline before being discarded.

The triangle discard rates for backface culling and tessellated triangle culling are, in absolute terms, still very poor:

http://www.hardware.fr/articles/948-10/performances-theoriques-geometrie.html

The relative rate for culling when tessellation is active appears to be the only time AMD is working well.
I don't know if this was linked here before, for whatever it's worth:

I've played around with this little beast and I have to say this is the biggest improvement in years. Polaris is extremely powerful with micropolygons. I didn't even imagine that this kind of performance was possible on a quad-raster design. In an extreme test case with 8xMSAA and 64 polys/pixel, the Polaris 10 is the fastest GPU on the market by far.
The second interesting thing is the pipeline stall handling. I wrote a program to test it, and it's remarkable how it works. I hate dealing with pipeline stalls because it is hard, but on Polaris the stalls are just reduced by far. Even if I run a badly optimized program, the hardware just "tries to solve the problem", and it works great. This behavior reminds me of Larrabee... and now we have it, not from Intel, but hardware is here to solve a lot of problems!
http://semiaccurate.com/forums/showpost.php?p=266518&postcount=2022
 
The second interesting thing is the pipeline stall handling. I wrote a program to test it, and it's remarkable how it works. I hate dealing with pipeline stalls because it is hard, but on Polaris the stalls are just reduced by far. Even if I run a badly optimized program, the hardware just "tries to solve the problem", and it works great. This behavior reminds me of Larrabee... and now we have it, not from Intel, but hardware is here to solve a lot of problems!
Ignorant question of the day: on today's non-Polaris GPUs, how are pipeline stalls visible to the programmer, and how does one typically have to deal with them? And what did Larrabee do differently? Come to think of it: was Larrabee ever available as a GPU?
 