Next Generation Hardware Speculation with a Technical Spin [post E3 2019, pre GDC 2020] [XBSX, PS5]

Mention in err file doesn't look like BVH ray intersect support ;)
The err files generally mention errors for instructions the GPUs recognize, though not always. There's usually a more explicit mention of an opcode not being supported at all, but since this is barely a skeleton of an error file, the error types are all a generic string.
The first set of errors wouldn't be an unrecognized opcode, but a conflict with a different ISA's version of a deep-learning instruction, which appears to be an opcode from GFX908 (Arcturus?).
s_getreg is a standard instruction, so if there's an error condition, perhaps related to the specific hardware register value being flagged, it would probably get a different error string than the unhelpful one present.
The image load instructions are also known opcodes, so if there is a problem it may have to do with the specific mode and resources flagged, which would be a different error than the instruction being missing.

It's not definitive proof, but there seems to be a possibility that the BVH encoding is recognized in some way by that version of GFX10.
 
12 core, 14/15TF 28gb ram sounds like a high-end-ish pc :p

True, although it's not that unthinkable for such specs to be accurate for a dev kit.

A 14/15TF Vega would get kicked around by a 10TF Navi. Even the 80CU Arcturus aspect isn't that strange when you consider Navi uses dual CUs.

Doubling of memory used to be quite common in dev kits, so 28GB could still be indicative of the 16-24GB of memory we're likely to see in the final console.

The only thing about that dev kit "leak" that strikes me as particularly weird is the number of CPU cores. I don't know how likely more than 8 cores would be, and I don't recall dev kit CPUs differing from retail in anything but clockspeed.
 

What we know now about the Arcturus name is that it belongs to a specific chip, and that software changes for drivers and LLVM indicate the chip is a derivative of the Vega line that is very focused on being an MI-100 deep-learning product--to the point that it lacks a 3D command processor. That seems like a warning flag.

That context aside, the 1 GB L4 cache seems odd as well. First, what we know of Navi, and of what cache hierarchies would be reasonable, leaves questions about it. Even in a CPU hierarchy L4 caches are rare, and AMD's GPUs stop at L2. The size also doesn't seem to make sense for the bandwidth it's claimed to provide: for a GPU that figure is rather unimpressive (Navi's L2 already delivers as much bandwidth or more).



As for the speculation about RDNA 2 versus RDNA 1, I think the process node is a potentially weaker indicator of architecture since AMD hasn't restricted families to one process node and semicustom allows for things to hop nodes if the customer pays. What AMD has for a roadmap for its consumer or professional products also doesn't bind its semicustom products.
Normally, I'd tend to focus on what GPUs are launched nearest to the consoles, as the current gen had a representative in Bonaire half a year prior to launch.
However, I'm not confident that a client GPU with a similar tech baseline to the consoles has to launch before them. More likely, whatever GPUs are in development at a similar stage to the consoles would be representative, and a client GPU could potentially wrap up its final stages faster than a whole console SoC and platform.
That said, random delays and AMD's product cycles might not line up the way they did with Bonaire, and details like the front end of Hawaii more closely resembling the PS4's show that things don't march in lockstep.
 

The original MS Arcturus rumour was from late January and included details that weren't widely circulated, so either somebody well informed with a pulse on AMD NDA docs/rumours created it, or there is more behind it. Then the rumour was "dismissed" and people didn't really touch it for months. The first time the Arcturus name popped up, with no real details attached, was September 2018.

But then Silenti posted this with a video link in May.

https://forum.beyond3d.com/posts/2068150/

which discussed AMD's Zen3 and other product plans, and suddenly details like a configurable, changed Hyperthreading scheduler and L4 IO cache patents popped up again; it felt like a weird coincidence. Sure, who knows what's real with these YouTube hardware rumour channels, so it's possible the guy was influenced by the original rumour. But discussing it *4* months later with more detail seemed unlikely to me. To me these looked more like two different information sources.

What does RT really need? A large, low-latency cache to keep track of the BVH. So from that angle such a large eDRAM cache made sense to me, at least as a design choice. Crystalwell was from 2013 and used 128MB; doing a 1GB version 7 years later should be possible, and who knows whether the size was just there to profile real scenarios.
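To put a rough number on how big a BVH can get, here's a quick back-of-the-envelope sketch; the node sizes and triangle count are assumptions for illustration, not figures from any vendor:

```cpp
// Rough BVH footprint estimate; node sizes and scene size are assumed.
#include <cstdio>

int main() {
    // Assumption: a binary BVH with one primitive per leaf has roughly 2N-1 nodes,
    // each packed into 32 bytes (compressed bounds) or 64 bytes (full float AABBs
    // plus child/primitive indices).
    const double primitives = 10e6;               // assume 10 million triangles
    const double nodes      = 2.0 * primitives - 1.0;
    for (double bytesPerNode : {32.0, 64.0}) {
        double totalMiB = nodes * bytesPerNode / (1024.0 * 1024.0);
        std::printf("%.0f-byte nodes: ~%.0f MiB of BVH data\n", bytesPerNode, totalMiB);
    }
    // Prints ~610 MiB and ~1221 MiB; i.e. heavily scene-dependent, and easily
    // in the ballpark of the 1GB figure being thrown around.
    return 0;
}
```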
 
We know so far Nvidia will set aside 1GB of VRAM for it.
Where do you get this number from?
I would assume the BVH can be any size depending on the scene, potentially larger than 1GB? Likely 1GB is some kind of average from current games?

Besides the BVH, it would be interesting to know whether they use a stack per ray (likely) and where they store it.
The only option I see is L1 (which they also use for LDS / texture cache), likely reserving a fixed, small size per ray; if that becomes too small, the traversal algorithm needs some restart fallback to still give correct results.
I guess this will be very similar for AMD; maybe the TMU patent even mentioned LDS. (They also added some short-stack traversal variants to Radeon Rays.)
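To illustrate why a per-ray stack kept on-chip has to be small, here's a quick sketch; the occupancy, stack depth, and LDS size are assumed figures, not anything from NV or AMD:

```cpp
// Why a per-ray traversal stack held on-chip has to be small; illustrative only.
#include <cstdio>

int main() {
    // Assumed figures: 64 lanes per wavefront, 20 wavefronts resident per CU to
    // hide latency, a 4-byte node index per stack entry, 64 KiB of LDS per CU.
    const int lanesPerWave  = 64;
    const int wavesPerCU    = 20;
    const int bytesPerEntry = 4;
    const int stackDepth    = 16;          // a fixed short stack per ray
    const int ldsPerCU      = 64 * 1024;

    int raysInFlight = lanesPerWave * wavesPerCU;                  // 1280 rays per CU
    int stackBytes   = raysInFlight * stackDepth * bytesPerEntry;  // 80 KiB
    std::printf("per-CU stack storage: %d KiB vs %d KiB of LDS\n",
                stackBytes / 1024, ldsPerCU / 1024);
    // Already over budget before LDS holds anything else, which is why a short
    // stack plus some restart fallback looks more attractive than a full-depth
    // stack per ray.
    return 0;
}
```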

Also interesting: does NV use a binary tree or an MBVH?
AMD's TMU patent mentioned a branching factor of 4, so the tree would have far fewer levels than a binary one, and memory access can be made more coherent.
(Still, I wonder why not 8, since that's the natural number for dividing 3D space like an octree, while 4 suits 2D like a quadtree - there must be technical reasons.)
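As a concrete picture of the branching-factor-4 idea, here's what a generic 4-wide node could look like; this is an assumed layout for illustration, not the one from the patent or any driver:

```cpp
// Illustrative 4-wide (BVH4 / MBVH) node; the exact layout is an assumption.
#include <cstdint>
#include <cstdio>

struct BVH4Node {
    // Child bounds stored SoA-style so one ray can test all four boxes together.
    float minX[4], minY[4], minZ[4];
    float maxX[4], maxY[4], maxZ[4];        // 96 bytes of bounds
    std::uint32_t child[4];                 // child node index, or primitive offset for leaves
    std::uint32_t meta;                     // leaf flags / child counts
    std::uint32_t pad[3];                   // pad to 128 bytes, i.e. two 64-byte cache lines
};
static_assert(sizeof(BVH4Node) == 128, "node should span exactly two cache lines");

int main() {
    // One node visit resolves what two levels of a binary tree would: roughly
    // log4(N) dependent memory round trips instead of log2(N), at the cost of
    // testing four boxes per step.
    std::printf("BVH4 node: %zu bytes\n", sizeof(BVH4Node));
    return 0;
}
```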

However, these are just details. I don't expect a big difference between NV / AMD or desktop / console from the programmer's perspective.
The only exception is AMD's option for programmable traversal, which enables things like stochastic LOD, falling back to reflection probes if a ray takes too long, and so on.
 
On the 1GB figure: I forget who did the testing, but someone kept checking VRAM values with RTX on and RTX off, and RTX on always set aside 1GB of VRAM.
I don't have more information; it was perhaps one of the first DF tests on BFV, and I believe it may have been @Dictator who ran it. I don't know whether this holds for one title or all titles, and it hasn't been brought up since the first reviews of RTX titles. Might be a good time to check again.
 
On the Zen3 rumour: there was a recent slide from AMD that indicates no change in the SMT setup for Zen3, and no mention of an L4 cache.

As for a large low-latency cache for RT: the on-die latencies for AMD's GPUs are already very significant, on the order of ~100 cycles for an L1 hit and ~200 for an L2 hit in GCN. AMD has touted something like a 10% improvement, in some unspecified way, for Navi.
I think it would be difficult for an L4 in this instance to change the overall picture, since its latency would be additive to the L2, and presumably to the L3 that would need to exist between the L2 and L4.
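As a rough sketch of that additive-latency argument, with placeholder latencies and hit rates rather than measured values:

```cpp
// Average memory access time with and without an L4 whose latency stacks on top
// of the existing levels. Every number here is an assumption for illustration.
#include <cstdio>

int main() {
    // Assumed total load-to-use latencies, in cycles, loosely shaped like the
    // figures above: L1 ~100, L2 ~200, DRAM ~400, a hypothetical L4 at ~300.
    const double l1 = 100, l2 = 200, dram = 400, l4 = 300;
    // Guessed hit rates for a pointer-chasing BVH workload.
    const double hitL1 = 0.5, hitL2 = 0.3, hitL4 = 0.6;   // hitL4 = share of L2 misses

    double amatNoL4   = hitL1 * l1 + (1 - hitL1) * (hitL2 * l2 + (1 - hitL2) * dram);
    double amatWithL4 = hitL1 * l1 + (1 - hitL1) * (hitL2 * l2 +
                        (1 - hitL2) * (hitL4 * l4 + (1 - hitL4) * dram));

    std::printf("AMAT without L4: %.0f cycles\n", amatNoL4);   // ~220
    std::printf("AMAT with L4:    %.0f cycles\n", amatWithL4); // ~199
    // With the L2 already at roughly half of DRAM latency, even a generous L4 hit
    // rate only trims about 10% off the average; any win would have to come from
    // bandwidth or hit rates, not raw latency.
    return 0;
}
```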
 
If cache utilisation is dire in random ray tracing and you're constantly hitting main memory, a fat cache would make sense. It depends on how well the existing caches can cope with the memory-access patterns of ray tracing, and whether the tracing can be performed in a way that suits the caches.
 
If we're focusing on the portion of ray tracing accelerated by BVH hardware, it's potentially better behaved if there's some level of coherence.
This isn't addressing the BVH building process, although Nvidia seems comfortable with shader hardware performing a lot of the build work for the low-level structure.

Traversal goes through a read-only structure (which helps avoid the read/write turnaround penalties DRAM hates), usually formatted to pack data into cache-line-friendly nodes. A traversal method that favors depth can provide opportunities for cache-line reuse, or some locality if concurrent rays are traversing a similar path nearby (in terms of location in the BVH and in time).
A likely reason for the RT cores is that they appear to accelerate a version of a stack-based algorithm, which allows structures that tend to be more compact in memory, and the set of intermediate nodes it might return to can cache better.

Whether this makes it good, rather than just better than some alternatives, hasn't been clearly tested in the analyses I've seen.
Early on, there were signs that RT was not bogging down in terms of bandwidth, but rather showing heavier synchronization or contention, which may have been improved since then.
Traversal would involve more round trips to memory as the hardware hops from node to node, though even then an L4 sitting outside three or four other cache levels in a GPU seems unlikely to offer much improvement. AMD's L2 is already 50-60% of DRAM latency, so adding two more layers of cache to a GPU hierarchy seems risky as a way to get better latency.
Even if it did, a cache of that size is a big investment for one feature, and keeping it in the hierarchy while RT thrashes the caches means every prior level gets thrashed before requests even reach an L4 whose benefits are arguable.
 
L1 and L2 cache latency can have an impact on ray tracing. The top-level nodes of your global space decomposition structure live in your caches, because they are hit all the time. The bottom levels almost always miss the caches and go to main memory (ray coherency notwithstanding).

On a CPU, a first-order approximation of cost is to treat traversing the cached top levels as free, but if the latency is as high as detailed above, that's certainly not the case for GCN.
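To put rough numbers on that split between cached top levels and missing bottom levels (node size and cache size assumed):

```cpp
// How much of a binary BVH's upper levels can stay on-chip; numbers are assumptions.
#include <cmath>
#include <cstdio>

int main() {
    const double bytesPerNode = 32.0;               // assumed compact node
    const double l2Bytes      = 4.0 * 1024 * 1024;  // assume a 4 MiB GPU L2

    // A full binary tree holds 2^k - 1 nodes in its top k levels.
    double levelsInL2 = std::log2(l2Bytes / bytesPerNode + 1.0);
    std::printf("top ~%.0f levels fit in L2\n", levelsInL2);    // ~17 levels

    // A BVH over 10M primitives is ~24 levels deep, so the bottom handful of
    // levels, which contain the vast majority of the nodes, mostly miss; and
    // whether the "free" upper levels are actually cheap depends on the L1/L2
    // hit latency, which is the point about GCN above.
    const double primitives = 10e6;
    std::printf("depth for %.0fM primitives: ~%.0f levels\n",
                primitives / 1e6, std::log2(primitives) + 1.0);  // ~24
    return 0;
}
```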

Cheers
 
If the Project Scarlett die is 350mm^2 on 7nm, someone gave an estimate of 40 CUs. But if you assume 7nm+ instead, what would the CU count be?
 

Probably the same, because you need to account for more die area going to RT hardware, whether that's larger CUs or fixed-function blocks.
 

I think this is absolutely correct. Going from RDNA to RDNA2 (or 1.5), if anything the CUs will be increasing in size, whether through added functionality or even larger caches to push bandwidth further, especially for RT.

36 to 40 active CUs is probably the right target.
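As a sanity check on that, here's a hedged back-of-the-envelope; the Navi 10 and Zen 2 chiplet figures are public, while every split and overhead below is a guess:

```cpp
// Rough die-area budget for a ~350 mm^2 7nm console SoC. Only the Navi 10 and
// Zen 2 chiplet sizes are public figures; all of the splits are guesses.
#include <cstdio>

int main() {
    const double dieBudget   = 350.0;  // rumoured Scarlett die size, mm^2
    const double navi10Die   = 251.0;  // Navi 10: 40 CUs on 7nm
    const double nonCuOnNavi = 100.0;  // guess: front end, L2, memory PHYs, display/media
    const double cpuCluster  = 70.0;   // 8 Zen 2 cores + L3, roughly the ~74 mm^2 chiplet
    const double socUncore   = 100.0;  // guess: the console's own PHYs, media, IO

    double areaPerCU  = (navi10Die - nonCuOnNavi) / 40.0;   // ~3.8 mm^2 per CU
    double leftForCUs = dieBudget - cpuCluster - socUncore; // ~180 mm^2
    std::printf("rough physical CU budget: ~%.0f CUs\n", leftForCUs / areaPerCU);
    // Lands around 48 physical CUs before setting aside area for RT hardware,
    // bigger caches, or disabled CUs for yield, which is why 36-40 active CUs
    // sounds like a plausible landing spot.
    return 0;
}
```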
 