AMD: Navi Speculation, Rumours and Discussion [2017-2018]

Anarchist4000 · Nov 15, 2018

yuri said:
There was another AMD representative confirming a "new architecture Next-gen" for 2020 at Next Horizon.

A ray tracing only architecture could make sense in that timeframe. With all shaders more or less running the same path, a cascaded SIMD setup would make a lot more sense. Back to the variable SIMD designs from a year or so ago.

Ext3h said:
At the same time we are seeing the "next gen" hints. But contrary to Navi it's not marketed as a successor to the GCN legacy.

With a chiplet design, multiple concurrent architectures could make sense for different tasks.

DmitryKo said:
I see little point in a software implementation when even expensive hardware-accelerated cards are struggling to come with acceptable performance levels. Nvidia did it as a stop-gap solution for developers so they could play with the API before GeForce RTX cards were available.

True, but fully raytracing a scene gets around the serial nature of rasterization. Packing multiple adapters into a system could be doable. Even 6x low to mid range cards could provide acceptable performance while being cheaper than Nvidia's offerings.

BoMbY · Nov 15, 2018

I wonder how well you could implement DXR in a small FPGA? And I also wonder why nobody so far has build a GPU with a small FPGA, which could be used to add/fix some functionality later? At least for professional GPUs the cost for a FPGA could be justified.

Frenetic Pony · Nov 15, 2018

Welp, there's the "half or less" performance of RTX in BFV, a 2080 can't even maintain a playable framerate at 4k max settings, this is for a feature that isn't even apparent unless there's a very smooth, shiny surface onscreen at the moment. I do wonder what exactly the bottleneck is, Turing has it's more divergent shading built in, but trying to grab random memory locations could be a major slowdown. Hooary latency! How would you even design a memory system for random access? I remember some experimental US military(?) project stating that was the goal of a custom type of supercomputer they were building, but it wasn't that long ago that it was announced so I doubt whatever system it's proposed to work on is even built yet.

Regardless, the bigger news is that Sony is skipping E3 next year. Apparently due to a lack of games to show off, which kind of screams "PS5 2020". I'd more likely bet that Sony would stick with AMD than MS would, considering the success they've had with AMD. But in what fashion, will the success of the Switch make Sony go that route? Some back of the napkin math for 7nm and Vega's rather nice tdp at low clock speeds shows they could squeeze a PS4 Pro into a 12 watt tablet, maybe 10. Along with more ram and a much better CPU that'd make a justifiable "next gen" machine a bit larger and louder than a Switch, yet with triple the raw compute of an Xbox One. Is that good enough, or would they go with something like the rumored Navi mainstream card, and get 9/10+ teraflops for a stationary console?

We already know Microsoft is looking at a dual regular/"cloud" console business model, which seems more Nvidia's thing even though AMD has stated their working on it too. It's good they have a regular console as a backup in case the cloud thing implodes though. Which it could, I'm in the Google game streaming beta (it streams AC Odyssey). And it require a 15mbps connection along with a very solid lack of packet drop. On a shared 50mbps connection that can be hard to get, and while that's not the max I could get where I live the 100mbps connection is rather pricey. I live in the SF bay area, so it's not like broadband options are lacking. While LTE ping and packetloss is great, bandwidth is expensive, I can't imagine many people would be able to "cloud" to their mobile device without racking up a huge bill. There seems to be a lot of video game execs straight up getting off on the idea of game streaming as the future, but MS had to quickly backpedal on so much as requiring and internet connection to activate games with the Xbox One. Thus requiring a stable, high bandwidth, low latency connection to game at all seems a stretch for high adoption in the next two or three years.

Anarchist4000 · Nov 16, 2018

BoMbY said:
I wonder how well you could implement DXR in a small FPGA? And I also wonder why nobody so far has build a GPU with a small FPGA, which could be used to add/fix some functionality later? At least for professional GPUs the cost for a FPGA could be justified.

Not difficult, but perhaps a bit wasteful considering the simplicity of the network. AMD uses FPGAs(ACE/HWS) for the work scheduling already. For compute to make a few configurable instructions it could be useful, however there would be register bandwidth constraints. A better solution would be a slightly more complex forwarding network for all compute units. Pass along the result to another ALU instead of writing the result back to SRAM.

Frenetic Pony said:
How would you even design a memory system for random access?

More channels and really tight timings. Which likely means lots of raw bandwidth. Essentially crypto optimizations.

Rootax · Nov 16, 2018

Can a lot of cache help too ?

tunafish · Nov 16, 2018

Rootax said:
Can a lot of cache help too ?

That helps a lot when there is time/space locality in your memory access. There is a lot of that in the typical, non-streaming program usage what we normally call "random access". However, the crypto algorithms that have been tuned for random access exhibit "true" random access, where any address is equally likely. For that, caches do basically nothing at all.

Rootax · Nov 16, 2018

Ok thx, and, sorry if If this sound stupid, but, can a kind of "result cache" (like oracle does for software...) be implemented in hardware and help ? In some case I'm pretty sure same exact calculation have been done few ms / ns before and doesn't need to be done again... Eh... As I am typing that, I realize It may be dumb....

Alexko · Nov 16, 2018

Rootax said:
Ok thx, and, sorry if If this sound stupid, but, can a kind of "result cache" (like oracle does for software...) be implemented in hardware and help ? In some case I'm pretty sure same exact calculation have been done few ms / ns before and doesn't need to be done again... Eh... As I am typing that, I realize It may be dumb....

Are you alluding to this sort of thing?

https://hal.inria.fr/hal-01193175/document

Rootax · Nov 16, 2018

Alexko said:
Are you alluding to this sort of thing?

https://hal.inria.fr/hal-01193175/document

It's way too technical for me, but from what I get, sounds like the same kind of goal yes.

pTmdfx · Nov 17, 2018

Anarchist4000 said:
AMD uses FPGAs(ACE/HWS) for the work scheduling already.

I do not recall any substantial evidence of it being FPGA. If anything it is more likely an embedded microprocessor core with upgradable microcode, especially when the underlying model is essentially command packet queue processing. FPGA sounds like an overkill.

3dilettante · Nov 17, 2018

The command processor and other front end components had their details discussed when the PS4 was hacked.
https://fail0verflow.com/media/33c3-slides/#/74
There's a repository with a disassembler of the ISA, which as noted deals with processing command packets from queues and referencing them with a stored microcode payload. The payload has been referenced at various times with regards to the additional features such as rapid response queues and HSA.
https://github.com/fail0verflow/radeon-tools/tree/master/f32

I believe at one point someone found a linkedin entry from an engineer discussing working on an F32 processor for the ACEs or some other front end block a number of years ago.

An FPGA doesn't seem like it would serve AMD well in this part of the GPU, which is running a loop that is a well-settled algorithm that has been hardware-implemented since forever. Other parts of the GPU's functionality would also not be well-served due to the overhead in hardware and indirection needed to emulate in an FPGA what is again a well-stabilized execution loop. FPGAs work best when there's an algorithm or set of them subject to change or an implementation too niche to justify a physical implementation, or a vendor too limited in resources or needing a quick proof of concept.

Anarchist4000 · Nov 18, 2018

pTmdfx said:
I do not recall any substantial evidence of it being FPGA. If anything it is more likely an embedded microprocessor core with upgradable microcode, especially when the underlying model is essentially command packet queue processing. FPGA sounds like an overkill.

For the command processor sure, but we're talking relatively simple tasks involving complex comparisons that can be updated. A simple PLD might be more reasonable. AMD was able to turn ACEs into HWS with some programmability that appears to occur in parallel. Simple binary comparisons are what were needed involving only a few bits per comparison. I'd have to dig a bit for a reference from a few years back.

3dilettante said:
FPGAs work best when there's an algorithm or set of them subject to change or an implementation too niche to justify a physical implementation, or a vendor too limited in resources or needing a quick proof of concept.

The reference provided above was also from 2009 I believe, prior to the existence of the ACE and async compute scheduling. An ASIC would be ideal, but overkill of a PLD or FPGA makes sense if they foresaw a need to update the logic to feed the command processor. Documentation did show the same hardware being reconfigured based on what the card was doing: ACE, HWS, or some compute thing whose name I can't recall ATM. Same hardware with different dispatch scenarios and microcode. The fact they were dealing with 3-4 bit or less comparisons for queue priority lends itself to the programmable logic. A straight microcontroller could be too slow on those tasks.

3dilettante · Nov 18, 2018

Anarchist4000 said:
For the command processor sure, but we're talking relatively simple tasks involving complex comparisons that can be updated.

The tasks they run are updated by changing the contents of the microcode store, and there have been instances where the front ends could not be updated to change their tasks due to a lack of sufficient microcode store for the various packet formats or functionality changes.
The custom processor can move data or update registers deeper in the GPU based on the microcode software.

The reference provided above was also from 2009 I believe, prior to the existence of the ACE and async compute scheduling.

I provided a reference that is more recent and addresses a GCN variant.

An ASIC would be ideal, but overkill of a PLD or FPGA makes sense if they foresaw a need to update the logic to feed the command processor. Documentation did show the same hardware being reconfigured based on what the card was doing: ACE, HWS, or some compute thing whose name I can't recall ATM.

The marketing diagrams were updated to either match the hardware more closely or to match choices made in the microcode and which pipelines ran which payload.
https://www.phoronix.com/forums/for...ource-amd-linux/856534-amdgpu-questions/page3
(#25 discusses the ME and MEC blocks, #28 explains how slides changed from ACEs to HWS in marketing for the same front end)
My interpretation is that the processors themselves were not changed, their execution loop is to load a packet, cross reference with microcode, and execute the microcode.

A straight microcontroller could be too slow on those tasks.

Too slow to compare a few small values? Even if individually too slow to work through a single-digit number of alternatives, there are 4-8 of them working concurrently.

Azhrei · Nov 22, 2018

Coreteks has a "source" that cites 96 CUs for Navi. 64 was the limit for GCN I believe, does anyone believe this? He also mentions a high end Navi which he calls Navi+ which his source says should be out by the end of next year. Hmm.

Alexko · Nov 22, 2018

Azhrei said:
Coreteks has a "source" that cites 96 CUs for Navi. 64 was the limit for GCN I believe, does anyone believe this? He also mentions a high end Navi which he calls Navi+ which his source says should be out by the end of next year. Hmm.

There's nothing implausible in there, but it could still be entirely made up.

del42sa · Nov 22, 2018

Azhrei said:
Coreteks has a "source" that cites 96 CUs for Navi. 64 was the limit for GCN I believe, does anyone believe this? He also mentions a high end Navi which he calls Navi+ which his source says should be out by the end of next year. Hmm.

completely made up ....

here is another "plausible" contribution from the same youtuber:

Deleted member 13524 · Nov 22, 2018

IIRC, 4x 16 CU "engines" was never a limitation in GCN.

It's just that Fiji was already incredibly wide for its time and AMD thought it would be better to increase clocks + optimize geometry and ALUs + increase L2 cache instead.
Wether that was the best choice with what they had or not is up to debate, though.

del42sa · Nov 22, 2018

ToTTenTranz said:
IIRC, 4x 16 CU "engines" was never a limitation in GCN.

It's just that Fiji was already incredibly wide for its time and AMD thought it would be better to increase clocks + optimize geometry and ALUs + increase L2 cache instead.
Wether that was the best choice with what they had or not is up to debate, though.

It was "never a limitation", blah blah blah..... The truth is that since Hawaii GPU AMD never get past 4 SE design, regardless CU/SE count ratio and three GCN generations of GPU´s

Kaotik · Nov 22, 2018

Current GCN implementation has a limit of 4 SE's, 64 CUs, 64 ROPs, but I think one AMD engineer at some point confirmed they could put in more SE's (and thus CU's and ROPs) if they revamp the frontend

Deleted member 13524 · Nov 22, 2018

del42sa said:
It was "never a limitation", blah blah blah..... The truth is that since Hawaii GPU AMD never get past 4 SE design, regardless CU/SE count ratio and three GCN generations of GPU´s

https://www.anandtech.com/show/11717/the-amd-radeon-rx-vega-64-and-56-review/2

They have made it clear that 4 compute engines is not a fundamental limitation – they know how to build a design with more engines – however to do so would require additional work. In other words, the usual engineering trade-offs apply, with AMD’s engineers focusing on addressing things like HBCC and rasterization as opposed to doing the replumbing necessary for additional compute engines in Vega 10.

These are AMD's statements. You may believe them or you may not.

AMD: Navi Speculation, Rumours and Discussion [2017-2018]

Anarchist4000

BoMbY

Frenetic Pony

Anarchist4000

Rootax

tunafish

Rootax

Alexko

Rootax

pTmdfx

3dilettante

Anarchist4000

3dilettante

Azhrei

Alexko

del42sa

Deleted member 13524

Guest

del42sa

Kaotik

Drunk Member

Deleted member 13524

Guest