This patent seems interesting. Any conclusions?
A quick layman's skim:
Two 'GPUs' with good old 16-lane SIMD and 4 'geometry engines', etc. Work could easily be split and done in parallel after the GS stage (which seems intuitive). Messing with vertices is harder, and with tessellation included it gets super-hard (and slow?)...
From my initial read, the patent appears to outline two scenarios for multiple GPUs working together to render to a screen space that is subdivided into zones, with each GPU responsible for one.
Both scenarios involve a generally identical stream of commands being fed to all the GPUs.
The first and most straightforward scenario uses language similar to other patents and disclosures that divide AMD's graphics pipeline into world-space work (input, transform, tessellation, etc.) and screen-space (pixel) work. Every GPU duplicates the world-space work, then passes along to the screen-space portion of its pipeline only the fragments that overlap the area of the screen it is responsible for. This sounds most like the culling work done by primitive shaders, which perform the same sort of coverage determination for the responsible shader engine and screen-space-tiled ROPs within a single GPU.
It is more straightforward and requires little cross-communication and synchronization, but it also does not leverage the extra hardware and resources very well before reaching the screen-space part of the process.
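The first scenario can be sketched in a few lines. This is only an illustration of the idea, not the patent's actual logic: the zone split, the `Zone` type, and the bounding-box overlap test are all my assumptions.

```python
# Hedged sketch of scenario one: every GPU duplicates all world-space
# work, then keeps only the primitives overlapping its screen zone.
from dataclasses import dataclass

@dataclass
class Zone:
    # Screen-space rectangle a GPU is responsible for (pixel coords,
    # half-open on the right/bottom edges).
    x0: int; y0: int; x1: int; y1: int

    def overlaps(self, bbox):
        bx0, by0, bx1, by1 = bbox
        return bx0 < self.x1 and bx1 > self.x0 and by0 < self.y1 and by1 > self.y0

def screen_space_pass(gpu_zone, primitives):
    """Called by each GPU after the (duplicated) world-space work.

    `primitives` is the identical post-transform list every GPU computed;
    only those touching this GPU's zone continue to rasterization."""
    return [p for p in primitives if gpu_zone.overlaps(p["bbox"])]

# Two GPUs splitting a 1920x1080 screen left/right:
zones = [Zone(0, 0, 960, 1080), Zone(960, 0, 1920, 1080)]
prims = [{"id": 0, "bbox": (100, 100, 200, 200)},
         {"id": 1, "bbox": (900, 500, 1000, 600)},   # straddles the split
         {"id": 2, "bbox": (1500, 300, 1600, 400)}]
kept_per_gpu = [[p["id"] for p in screen_space_pass(z, prims)] for z in zones]
# GPU 0 keeps prims 0 and 1; GPU 1 keeps prims 1 and 2.
```

Note that primitive 1 is processed by both GPUs in screen space, which is the price of the simple split, on top of all the duplicated world-space work.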
The second scenario seeks to distribute world-space work among the different front ends with a dedicated work distribution facility. Each GPU still receives mostly the same commands, but the input assembly, setup, and geometry portions are farmed out in chunks for each GPU to take on individually. The patent puts forward round-robin distribution of sections of the index buffers and groups of setup primitives as an example.
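The round-robin idea from the patent's example can be sketched deterministically: because every GPU sees the same index buffer and applies the same rule, each can pick out its own chunks without any negotiation. The function name, chunk size, and parameters here are illustrative assumptions, not the patent's.

```python
def round_robin_chunks(index_buffer, num_gpus, chunk_size, my_gpu):
    """Hypothetical sketch: deterministic round-robin over sections of
    an index buffer. Every GPU runs this same code on the same input,
    so GPU k independently selects every k-th chunk with no messages."""
    my_chunks = []
    for chunk_id, start in enumerate(range(0, len(index_buffer), chunk_size)):
        if chunk_id % num_gpus == my_gpu:
            my_chunks.append(index_buffer[start:start + chunk_size])
    return my_chunks

# Two GPUs splitting a 24-index buffer into 6-index chunks:
indices = list(range(24))
gpu0_chunks = round_robin_chunks(indices, 2, 6, 0)  # chunks 0 and 2
gpu1_chunks = round_robin_chunks(indices, 2, 6, 1)  # chunks 1 and 3
```

The same pattern would apply to groups of setup primitives; only the unit of distribution changes.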
In the absence of tessellation, a work distributor sits between the primitive setup and geometry/vertex shader phases; with tessellation, work distributors sit on the input and output ends of the tessellation unit.
Each GPU's work distributor internally tracks the global API submission order with incrementing counters, maintains a series of FIFOs for each geometry engine and tessellation block, and exchanges messages with the other work distributors about the status and ordering of the work items each GPU is responsible for.
A work distributor has a counter and FIFOs for each of its local geometry engines and shader launch pipes, as well as a series of FIFOs corresponding to the equivalent hardware in the other GPUs. Each distributor runs through the same evaluation process and then compares the calculated selection tags against what is available locally.
The distributors accelerate the distributed-work process by semi-independently incrementing the ordering count (each GPU derives its count from effectively the same command stream) and applying the same load-balancing rules, letting each rapidly pass data to its local engines or discard elements that another GPU (independently making the same calculations) will cover. A smaller number of updates about completion status and the output of setup stages is broadcast from each GPU to all the others, so that they keep a consistent view of the ordering and of what is in progress. Some output data from the geometry engines is broadcast to the FIFOs in the other GPUs; in other cases, a stage that expands the amount of data, like tessellation, might have the work distributors pass only the relevant ordering number to a GPU, which then fires up the selected tessellation unit to read in the control points and feed the next surface/geometry shader locally.
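The counter-plus-FIFOs mechanism described above can be sketched as follows. This is a minimal toy model under my own assumptions: the round-robin "load-balancing rule" is a stand-in for whatever the patent actually specifies, and the class and field names are invented.

```python
# Hedged sketch of the per-GPU work distributor: every GPU runs this
# same logic over (effectively) the same command stream, so all GPUs
# independently compute identical (gpu, engine) targets per work item.
from collections import deque

class WorkDistributor:
    def __init__(self, gpu_id, num_gpus, engines_per_gpu):
        self.gpu_id = gpu_id
        self.num_gpus = num_gpus
        self.engines_per_gpu = engines_per_gpu
        self.order = 0  # global API-order counter, advanced in lockstep
        # One FIFO per geometry engine, local and remote, so ordering
        # across GPUs can be reconstructed when status is broadcast.
        self.fifos = {(g, e): deque()
                      for g in range(num_gpus)
                      for e in range(engines_per_gpu)}

    def submit(self, work_item):
        tag = self.order          # selection tag in API order
        self.order += 1
        # Shared deterministic rule (assumed round-robin here): the
        # same arithmetic on every GPU, so no negotiation is needed.
        slot = tag % (self.num_gpus * self.engines_per_gpu)
        gpu, engine = divmod(slot, self.engines_per_gpu)
        if gpu == self.gpu_id:
            # Local work: enqueue the payload for a local engine.
            self.fifos[(gpu, engine)].append((tag, work_item))
        else:
            # Remote work: discard the payload, track only the ordering
            # slot until the owning GPU broadcasts its status/output.
            self.fifos[(gpu, engine)].append((tag, None))
        return gpu, engine

# Two GPUs with two geometry engines each, fed the same stream:
stream = ["draw0", "draw1", "draw2", "draw3"]
d0 = WorkDistributor(0, 2, 2)
d1 = WorkDistributor(1, 2, 2)
targets0 = [d0.submit(w) for w in stream]
targets1 = [d1.submit(w) for w in stream]
# Both distributors independently agree on who runs what.
```

The point of the sketch is the lack of a handshake: each distributor decides by itself, and only lightweight status/output broadcasts (not modeled here) keep the shared view consistent.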
This allows the GPUs to bring more resources to bear on the world-space portion of the process, with dedicated logic for maintaining ordering guarantees, broadcasting status and outputs, and making accelerated culling decisions about whether the local GPU will be handling a given set of inputs. While a work distributor of sorts is mentioned in recent AMD GPUs, the last part concerning culling seems to take some of the culling duties of primitive shaders (which might be part of the first scenario in the patent, and perhaps of primitive shaders as we know them) and place that decision-making in this dedicated logic stage.