AMD: Navi Speculation, Rumours and Discussion [2019-2020]

Yes, from a hardware perspective, but was there any software really driving the reason to have RTX in 2018? Is there any now?
That's not how adoption works. With no hardware base, no games will be coded for it.
It has to be released; early adopters have always led the charge for new features that eventually trickle down.

There are certainly more games with RT than I have seen using their other features.
 
We could be looking at 3 different RTRT implementations: Sony + Microsoft Xbox + PC AMD.
Though preferably it should be only one, for developers' sanity.
This is what DXR is for, though. I can't see different implementations necessarily causing optimization issues. In layman's terms, it really is just a call to cast rays and return the intersections; after that, you run the shader against the intersected triangles, where your usual shader optimizations apply.
 
AMD likely again offers more general compute performance, which could compensate for this.

That was the case for Fury and Vega, but with Navi it seems AMD is dialing back the raw flops in favor of efficiency. With the launch of Turing, is AMD still considered to have a better compute architecture?
 
We could be looking at 3 different RTRT implementations: Sony + Microsoft Xbox + PC AMD.
Though preferably it should be only one, for developers' sanity.
I can't see MS wanting a custom solution over what AMD will put out in the PC space, since they're going to run the same APIs etc. anyway, and with Sony going so much with "what devs want" last gen, I'm hoping they continue on the same line and will use the same thing, too.
 
That was the case for Fury and Vega, but with Navi it seems AMD is dialing back the raw flops in favor of efficiency. With the launch of Turing, is AMD still considered to have a better compute architecture?
I don't know.
The latest HW I have compared was Fury X vs. GTX 1070. Fury was 1.6 times faster, so one AMD TF was better than one NV TF, for me. But NV is catching up. (The R9 280X was a whole 5 times faster than the GTX 670, although with similar fps in games.)
So I guess, looking at TF, there is actually little difference, because Turing has some interesting improvements in compute?
But the 5700 XT has 9 TF and its competitor, the RTX 2070, has 7.5 TF. So even though Navi has doubled the ROP count, personally I hope it's still as strong in compute as GCN.
 
I can't see MS wanting a custom solution over what AMD will put out in the PC space, since they're going to run the same APIs etc. anyway, and with Sony going so much with "what devs want" last gen, I'm hoping they continue on the same line and will use the same thing, too.
I don't necessarily know if they would know how to design it either.
Even with forward knowledge of what was coming in DX12, they could not get the XBO to FL 12_1, or the XBX for that matter.
They can probably request things to be put in; I'm not necessarily sure that means they have the engineers who would be able to do something like a custom RT solution. They would likely have to request more engineers and pay a greater expense to try to fit it in (if the architecture would even support that level of customization). I mean, whole companies (ImgTec) have put their entire workforce behind doing it.

I very much doubt building a custom RT solution (one that works in harmony with the rest of the GPU) is easy.
 
I don't know.
The latest HW I have compared was Fury X vs. GTX 1070. Fury was 1.6 times faster, so one AMD TF was better than one NV TF, for me. But NV is catching up. (The R9 280X was a whole 5 times faster than the GTX 670, although with similar fps in games.)
So I guess, looking at TF, there is actually little difference, because Turing has some interesting improvements in compute?
But the 5700 XT has 9 TF and its competitor, the RTX 2070, has 7.5 TF. So even though Navi has doubled the ROP count, personally I hope it's still as strong in compute as GCN.

The RTX 2070 is end of life. The 5700 XT's competition is the 9 TF 2070 SUPER DUPER edition.
 
I don't know.
The latest HW I have compared was Fury X vs. GTX 1070. Fury was 1.6 times faster, so one AMD TF was better than one NV TF, for me. But NV is catching up. (The R9 280X was a whole 5 times faster than the GTX 670, although with similar fps in games.)
So I guess, looking at TF, there is actually little difference, because Turing has some interesting improvements in compute?
But the 5700 XT has 9 TF and its competitor, the RTX 2070, has 7.5 TF. So even though Navi has doubled the ROP count, personally I hope it's still as strong in compute as GCN.

Turing caught up to GCN somewhat in terms of async compute, at least in graphics workloads. That said, Navi has its own improvements for graphics workloads with its single-cycle SIMD32.
 
That's not how adoption works. With no hardware base, no games will be coded for it.
It has to be released; early adopters have always led the charge for new features that eventually trickle down.

There are certainly more games with RT than I have seen using their other features.

Where did I say anything about the way it works? I was talking about it from a consumer's point of view. Someone brought up that AMD might be competitive from a ray tracing point of view next year.

Then DavidGraham made a post about how he would rather have a CPU and GPU from 2025, which I then pointed out as being a bad comparison; it's not such a big stretch to wait a year for tech.

Nowhere did I have a dig at Nvidia or say anything in regard to their tech or them releasing in 2018.
 
That assumes the competition would rest on their laurels ... NVIDIA will push their RT angle to the extremes
AMD just needs to set the right price/performance ratio, all while providing decent (i.e. comparable) raytracing performance for current-gen titles.

Though they'd probably have a bit better performance owing to updates/optimizations in the raytracing API specs... like (pure speculation here) the ability to consume meshlets (introduced with the mesh shader geometry pipeline) for BVH generation, which may also be useful for implementing geometry shaders / geometry LOD...

we are talking about the here and now
Here and now, they don't have DXR, take it or leave it. Please come back mid-2020 (or 2025, if you will).

it's pretty safe to assume that consoles use next gen RDNA instead of first gen, and I think AMD already confirmed somewhere 2nd gen RDNA includes RT-hardware
Maybe so, but it's still an APU part - so there will be compromises in die area, memory bandwidth, and performance/watt in comparison to a high-end desktop GPU part.

We could be looking at 3 different RTRT implementations: Sony + Microsoft Xbox + PC AMD
I'd think AMD will use some form of heterogeneous integration to put multiple graphics dies, CPU dies, HBM3 dies, and flash memory dies on the same chip package, and scale the number of these dies according to price/performance point… oh right, yes it's only wishful thinking. :cool:
 
There are a number of different efficiency metrics.
Let's never forget that in the desktop market, by far the most significant is performance/$. AMD is pretty much competitive here, and differences will be small between the manufacturers unless one player in the duopoly makes a major push for market share.
The desktop market is in the doldrums a bit with respect to performance for the price paid these days. The tiers below enthusiast and high-end tend to be the most cost-sensitive, while there is more opportunity to extract revenue beyond the performance improvement at the early-adopter and bleeding-edge end of the market.
Being competitive in a range below the leading edge means not distancing itself significantly from existing inventory or already satisfied demand by virtue of arriving after many buyers have already bought marginally lower-performing chips or have the opportunity to buy them discounted.
AMD's not unique in this, but I think the timing and cost structure give it a less-forgiving position.
Nvidia's numbers seem to be dropping as well and may be in part due to this--if that can be teased out from the pricing effect. Perhaps adding something else new like RT was part of a ploy to get the newer generation to differentiate itself more within the constraints of manufacturing and power at hand.


It looks like it does, indirectly, through occupancy considerations. My interpretation is that it is constrained to considering how to split the 256-VGPR budget (4x64):
256/1 = 256 (4x64, 1 wave)
256/2 = 128 (4x32, 2 wave)
256/3 = 84 (4x21, 3 wave)
256/4 = 64 (4x16, 4 wave)
256/5 = 48 (4x12, 5 wave)
etc.
As you see, there is no divisor between 1 and 2 that would yield 4x48 VGPRs. The code then decides to maximize register use within the occupancy bin.
That would seem to be the choice made by the compiler, but that doesn't point to the hardware needing this.
A SIMD can host up to 10 wavefronts, which requires an average allocation of 24 or fewer registers per wavefront. So, at a minimum, the hardware must be able to allocate at a granularity of 24 registers, and AMD's documentation gives the actual granularity as 4 or 8.
I'm not following what you mean by having 4x64 when discussing the register budget for a wavefront. A single wavefront can address up to 256 registers, and to match, each SIMD has that many of its own.
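For reference, here's a tiny sketch of the division being discussed, with the numbers hedged as assumptions taken from the two posts above (256 addressable VGPRs per lane on a SIMD, allocation granularity of 4 registers):

```python
# Illustrative only; figures taken from the discussion above.
REGS_PER_LANE = 256   # addressable VGPRs per lane on one SIMD
GRANULARITY = 4       # assumed allocation granularity

def budget_per_wave(waves):
    """Largest per-wave VGPR allocation that still fits `waves` wavefronts."""
    return (REGS_PER_LANE // waves) // GRANULARITY * GRANULARITY

for w in range(1, 6):
    print(w, budget_per_wave(w))   # 256, 128, 84, 64, 48 -- matching the table above
```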



We can make some assumptions from the AMD RT patent, and from what we can guess about NV:
I ran across a link, perhaps on another board or reddit, to something from Nvidia, which might be a better starting point than using AMD's decisions to speculate about Nvidia.
http://www.freepatentsonline.com/y2016/0070820.html or perhaps http://www.freepatentsonline.com/9582607.html

There's a stack-based traversal block containing a unit that evaluates nodes and decides on traversal direction, like AMD's, but there's also additional logic that performs the looping which AMD's method passes back to the SIMD hardware.
There may also be some memory compression of the BVH handled by this path.

Which means the shader core likely becomes available to other pending tasks after this command (like hit point shading, or async compute, ...).
Also we have no indication NV RT cores would use the TMU or share the cache to access the BVH.
From the above, it seems like the SM's local cache hierarchy would be separate from the L0 in the traversal block.

The conclusion is that NV RT is likely faster but takes more chip area. AMD likely again offers more general compute performance, which could compensate for this.
Possibly also more power efficiency for Nvidia. The AMD method has to re-expand its work at every node back to the width of a wavefront and involves, at a minimum, several accesses to the full-width register file.
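To make that contrast concrete, here's a rough sketch of where the traversal loop lives under each reading of the patents; `node_test` and `traverse_unit` are purely hypothetical stand-ins for the fixed-function hardware, not real APIs:

```python
# Toy illustration only: 'node_test' and 'traverse_unit' are hypothetical
# stand-ins for fixed-function hardware.

def trace_shader_looped(ray, bvh, root, node_test):
    """AMD-patent reading: the shader owns the stack and the loop, issuing one
    fixed-function node test (over the TMU path) per iteration."""
    stack, closest = [root], None
    while stack:                          # this loop occupies the SIMD
        children, hit = node_test(ray, bvh[stack.pop()])
        stack.extend(children)            # traversal decision made back on the SIMD
        if hit is not None and (closest is None or hit < closest):
            closest = hit
    return closest

def trace_unit_looped(ray, bvh, root, traverse_unit):
    """NV-patent reading: the shader issues a single request; the stack and the
    looping live in the dedicated traversal block, freeing the SM until the
    result comes back."""
    return traverse_unit(ray, bvh, root)
```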

After that it would make sense to decrease ROPs and increase RT cores, up to the point where rasterization is implemented only with compute. (Texture filtering remains, ofc.)
At least for now, no clear replacements for the order guarantees or optimizations like the Z-buffer and other hardware present themselves. Nvidia is counting on the areas where rasterization is very efficient to remain very efficient, lest they lose the power/compute budget spare room that RT is being inserted into.

Turing caught up to GCN somewhat in terms of async compute, at least in graphics workloads. That said, Navi has its own improvements for graphics workloads with its single-cycle SIMD32.
At least from a feature perspective and dynamic allocation, I think Pascal might have had similar checkboxes to early GCN. There are some features that AMD touts for later generations, though how many are broadly applicable or noticeable hasn't been clearly tested. The latency-oriented ones seem to be focused on VR or audio, although I'm not sure recent Nvidia chips have garnered many complaints for the former and I'm not sure many care for the latter.
 
Where did I say anything about the way it works? I was talking about it from a consumer's point of view. Someone brought up that AMD might be competitive from a ray tracing point of view next year.

Then DavidGraham made a post about how he would rather have a CPU and GPU from 2025, which I then pointed out as being a bad comparison; it's not such a big stretch to wait a year for tech.

Nowhere did I have a dig at Nvidia or say anything in regard to their tech or them releasing in 2018.
Right, sorry, I didn't follow how that was related. I thought you said he was being hypocritical for buying RTX now, before there was any reason to, when it's clearly better to buy later when more software will finally be available.

The topic you guys are on can probably be solved using some formulas; Monte Carlo ones come to mind, IIRC. It tends to come up in decision-making algorithms, like discounting the value of future moves versus present moves, because so much can change between now and then.

In your argument you guys sort of stand in different positions. If you don't own a ray tracing card today, then the present value is very high and it continues to grow as more games with RT are released, especially if a game you want to play has RT and is released soonish.

If you wait until there is vastly better RT later, you are left without it for a significant amount of time, so the value degrades. Games tend to have the most value while they are fresh and new.

There is some discounting of value in having to wait a year, but nowhere close to the discounting of 2025 hardware. I would agree that waiting for Navi is reasonable, especially if the price point is low.
 
That would seem to be the choice made by the compiler, but that doesn't point to the hardware needing this.
A SIMD can host up to 10 wavefronts, which requires an average allocation of 24 or fewer registers per wavefront. So, at a minimum, the hardware must be able to allocate at a granularity of 24 registers, and AMD's documentation gives the actual granularity as 4 or 8.
I'm not following what you mean by having 4x64 when discussing the register budget for a wavefront. A single wavefront can address up to 256 registers, and to match, each SIMD has that many of its own.

My kernel is 8x8x1. We have 65536 VGPRs per CU, or 16384 per SIMD. This yields 256 VGPRs per lane (16k/64), with no other wavefront to run in addition. Or 128 for 2 wavefronts, etc.
The 4x notation was a thought mix-up; I thought I could integrate the 4-step cadence / 4-way banking into the register "count", but that doesn't make much sense this way.
I checked LDS utilization, and I don't think that is what triggers the compiler to bloat usage to 65. So odd.
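To put numbers on that last point, here's a rough sketch of the occupancy cliff around 64 vs. 65 VGPRs, under the same hedged assumptions as the earlier sketch (256 VGPRs per lane, a 10-wave cap, allocations rounded up to a granularity of 4):

```python
def waves_per_simd(vgprs_per_thread, regs_per_lane=256, wave_cap=10, gran=4):
    """Occupancy estimate: round the allocation up to the granularity, then see
    how many wavefronts fit in the per-lane register budget."""
    alloc = -(-vgprs_per_thread // gran) * gran   # round up to a multiple of `gran`
    return min(wave_cap, regs_per_lane // alloc)

print(waves_per_simd(64))   # 4 wavefronts per SIMD
print(waves_per_simd(65))   # 3 wavefronts (65 rounds up to 68; 256 // 68 == 3)
```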
 
There are a number of different efficiency metrics.
Let's never forget that in the desktop market, by far the most significant is performance/$. AMD is pretty much competitive here, and differences will be small between the manufacturers unless one player in the duopoly makes a major push for market share.
If you're a manufacturer you have reason to care about performance/mm2, since you pay for wafer starts. This is modified by different cost structures for different nodes, but even before Navi AMD did quite well. They have a distinct advantage, of course, using a denser node, pitching their 251 mm2 Navi against Nvidia's TU104 (RTX 2070S) at 545 mm2 and TU106 (RTX 2070) at 445 mm2.
To try to avoid the effects of process and attempt to evaluate architectural efficiency, you would look at performance/gate instead, where the same chips have 10.3, 13.6 and 10.6 billion transistors respectively. AMD is definitely competitive.
And of course you have performance/W, which is a tricky one because it changes so drastically depending on frequency and voltage in the relevant intervals. This is also where the lack of test data for the RX 5700 makes comparisons difficult at the moment, but that will be rectified within days. It matters mostly at the extreme limit of cooling ability. For mid-range products, the current manufacturer scheme unfortunately seems to be to push the chips as far as they will go on 200 W or so of power, which is possible to cool at reasonable cost.

Looking at the overall picture, and bearing in mind that independent test data is lacking, AMD and Nvidia seem to be within spitting distance of each other, apart from performance/mm2 which is mostly a manufacturing concern that isn’t critical in the midrange. It will be interesting to see what Nvidia will achieve once they move to finer lithography.
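For what it's worth, a quick back-of-the-envelope on the density side of those numbers (pure arithmetic on the die sizes and transistor counts quoted above; performance is left out since independent test data is still lacking):

```python
# Transistor density from the figures quoted above.
chips = {
    "Navi 10 (RX 5700 XT)": (10.3e9, 251.0),   # (transistors, die area in mm^2)
    "TU104 (RTX 2070S)":    (13.6e9, 545.0),
    "TU106 (RTX 2070)":     (10.6e9, 445.0),
}
for name, (transistors, area_mm2) in chips.items():
    print(f"{name}: {transistors / area_mm2 / 1e6:.1f} Mtransistors/mm^2")
# roughly 41, 25 and 24 Mtransistors/mm^2 -- the denser-node advantage noted above
```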

Performance/gate is not a good way to compare architectures. None of those metrics are, because it is foolish to try to explain something complicated with a single number. Nvidia have spent transistors in Turing to enable features that Navi is simply not capable of (or would do very slowly). It also ignores other properties of an architecture, such as how well it scales upward (hopefully big Navi will fare better in this way). The only way to properly compare architectures is to compare how they are implemented in the real world, and that is looking like a clear win for Turing. I do agree that they have made up a lot of ground with Navi, though. Vega was just horrific, so I am pleased with what they've managed to do.
 
Looks like AMD has been producing Navi 10 graphics cards like Southeast Asians grow rice.

I don't remember OCUK ever claiming to have this many cards on day one.

https://forums.overclockers.co.uk/posts/32841426/

Gibbo said:
We have to put a product in to ensure the sections go live, and sections can take over a day to appear. As it's a Sunday launch, we're making sure it all goes smoothly for 2pm. Of course, that is where cards shall appear, and we have around 1000 in stock. :)
 
I haven't seen it posted before (sorry if this is repeated info):
AMD is working on next-gen software/hardware hybrid ray tracing technology, shares first details
[...]

As AMD detailed:

“The hybrid approach (doing fixed function acceleration for a single node of the BVH tree and using a shader unit to schedule the processing) addresses the issues with solely hardware based and/or solely software based solutions. Flexibility is preserved since the shader unit can still control the overall calculation and can bypass the fixed function hardware where needed and still get the performance advantage of the fixed function hardware. In addition, by utilizing the texture processor infrastructure, large buffers for ray storage and BVH caching are eliminated that are typically required in a hardware raytracing solution as the existing VGPRs and texture cache can be used in its place, which substantially saves area and complexity of the hardware solution.”

“A texture processor based ray tracing acceleration method and system are described herein. A fixed function BVH intersection testing and traversal (a common and expensive operation in ray tracers) logic is implemented on texture processors. This enables the performance and power efficiency of the ray tracing to be substantially improved without expanding high area and effort costs. High bandwidth paths within the texture processor and shader units that are used for texture processing are reused for BVH intersection testing and traversal. In general, a texture processor receives an instruction from the shader unit that includes ray data and BVH node pointer information. The texture processor fetches the BVH node data from memory using, for example, 16 double word (DW) block loads. The texture processor performs four ray-box intersections and children sorting for box nodes and 1 ray-triangle intersection for triangle nodes. The intersection results are returned to the shader unit.

In particular, a fixed function ray intersection engine is added in parallel to a texture filter pipeline in a texture processor. This enables the shader unit to issue a texture instruction which contains the ray data (ray origin and ray direction) and a pointer to the BVH node in the BVH tree. The texture processor can fetch the BVH node data from memory and supply both the data from the BVH node and the ray data to the fixed function ray intersection engine. The ray intersection engine looks at the data for the BVH node and determines whether it needs to do ray-box intersection or ray-triangle intersection testing. The ray intersection engine configures its ALUs or compute units accordingly and passes the ray data and BVH node data through the configured internal ALUs or compute units to calculate the intersection results. Based on the results of the intersection testing, a state machine determines how the shader unit should advance its internal stack (traversal stack) and traverse the BVH tree. The state machine can be fixed function or programmable. The intersection testing results and/or a list of node pointers which need to be traversed next (in the order they need to be traversed) are returned to the shader unit using the texture data return path. The shader unit reviews the results of the intersection and the indications received to decide how to traverse to the next node in the BVH tree.”
https://www.dsogaming.com/news/amd-...-ray-tracing-technology-shares-first-details/
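Putting the quoted description into code form, here's a minimal, self-contained sketch of the flow: the "shader" keeps the traversal stack and loop, while node_test() stands in for the fixed-function box/triangle tester on the texture-processor path. The node layout, the helper names, and the toy BVH are my own illustrative assumptions, not AMD's actual data structures.

```python
def intersect_box(o, d, lo, hi):
    """Slab test: entry distance of the ray into the AABB, or None on a miss."""
    tmin, tmax = 0.0, float("inf")
    for oi, di, li, hi_ in zip(o, d, lo, hi):
        if abs(di) < 1e-12:
            if oi < li or oi > hi_:
                return None
        else:
            t0, t1 = (li - oi) / di, (hi_ - oi) / di
            if t0 > t1:
                t0, t1 = t1, t0
            tmin, tmax = max(tmin, t0), min(tmax, t1)
    return tmin if tmin <= tmax else None

def intersect_tri(o, d, v0, v1, v2):
    """Moeller-Trumbore ray/triangle test: hit distance t, or None on a miss."""
    cross = lambda a, b: [a[1]*b[2]-a[2]*b[1], a[2]*b[0]-a[0]*b[2], a[0]*b[1]-a[1]*b[0]]
    dot = lambda a, b: sum(x*y for x, y in zip(a, b))
    e1 = [v1[i] - v0[i] for i in range(3)]
    e2 = [v2[i] - v0[i] for i in range(3)]
    p = cross(d, e2)
    det = dot(e1, p)
    if abs(det) < 1e-12:
        return None
    inv = 1.0 / det
    s = [o[i] - v0[i] for i in range(3)]
    u = dot(s, p) * inv
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(d, q) * inv
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv
    return t if t > 0.0 else None

def node_test(o, d, node):
    """Stand-in for the fixed-function engine: box tests plus child sorting for
    a box node, or a single triangle test for a triangle node."""
    if node["kind"] == "box":
        hits = [(t, i) for i, (lo, hi) in enumerate(node["bounds"])
                if (t := intersect_box(o, d, lo, hi)) is not None]
        hits.sort()                          # children returned near-to-far
        return [i for _, i in hits]
    return intersect_tri(o, d, *node["tri"])

def trace(o, d, root):
    """'Shader' side: owns the traversal stack and decides what to visit next,
    based on the results handed back over the texture data return path."""
    stack, closest = [root], None
    while stack:
        node = stack.pop()
        result = node_test(o, d, node)
        if node["kind"] == "box":
            for i in reversed(result):       # push far-to-near so the nearest pops first
                stack.append(node["children"][i])
        elif result is not None and (closest is None or result < closest):
            closest = result
    return closest

# Toy scene: one box node wrapping a single triangle.
tri = {"kind": "tri", "tri": ([0, 0, 5], [1, 0, 5], [0, 1, 5])}
root = {"kind": "box", "bounds": [([0, 0, 4], [1, 1, 6])], "children": [tri]}
print(trace([0.2, 0.2, 0.0], [0.0, 0.0, 1.0], root))   # closest hit at t = 5.0
```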
 