Next Generation Hardware Speculation with a Technical Spin [post E3 2019, pre GDC 2020] [XBSX, PS5]

anexanhume · Jan 3, 2020

Why was a post that has patent disclosure and technical papers moved to a thread literally founded on baseless speculation with no technical merits? Is the goal to have no actual discussion in this thread?

If the thread is only going to be for official disclosures from MS and Sony, it should be titled as such.

BRiT · Jan 3, 2020

anexanhume said:
Why was a post that has patent disclosure and technical papers moved to a thread literally founded on baseless speculation with no technical merits? Is the goal to have no actual discussion in this thread?

If the thread is only going to be for official disclosures from MS and Sony, it should be titled as such.

Sorry, it all started with a baseless statement that AMD has worse RT than Nvidia. The thread and replies were cleaned up before looking at the real meat-and-potato posts.

The technical posts have been moved back into this thread. Please continue on with the technical discussion.

Silent_Buddha · Jan 3, 2020

JoeJ said:
Yeah, they have tons of RT experience, and more research / software experience than AMD in general.
This led me to the initial assumption RTX must be very advanced, including reordering to improve both ray and shading coherence. And AMD could never catch up.
But without all this advanced stuff all that remains is simple tree traversal and triangle intersection, which is all NVs RT cores are doing, and there is no other form of hardware acceleration. As far as we know.

There is also the possibility that all of this was initiated by MS. DX is a combination of which hardware features from IHVs are possible in a given time frame and requests from developers, as well as a dose of where MS sees PC graphics headed (hence the various hardware tiers for supported features).

It's within the realm of possibility that MS at some point decided that RT was a good place for graphics to head. There are many factors that could lead to this.

MS looking towards future console hardware and knowing that Moore's law WRT advances in silicon nodes was dead. Large advances have to come from something non-traditional.
Influential software developers expressing interest in RT.
MS pinging both NV and AMD to see what could be possible in hardware.

That could have gotten both companies rolling WRT how they could accelerate RT in hardware. Likely both had internal roadmaps for RT appearing at the time of DXR's release, but only NV was able to meet that timeline while AMD is quite obviously behind. However, if we look at AMD's early roadmaps (example - https://wccftech.com/amd-unveils-polaris-vega-navi-graphics-architectures-hbm2/ ), NAVI was originally slated to come out in a similar timeframe to NV's Turing. Basically the original roadmap had it coming out in 2018.

And then consider the early rumors of a 2019 launch for PS5. If Sony were also working off of AMD's roadmap for Navi coming out in 2018, a 2019 launch for PS5 using NAVI fits quite well.

Anyway, this could explain why NV's support for RT is relatively basic. It may not have been something they were seriously considering until MS came to them (and AMD as well) and basically said, "Can you do this? If so, we're going to start developing the groundwork for RT in DX." Or MS could have gone to them and basically said, "We're planning RT in DX, what can you implement and how long will it take?"

Regards,
SB

3dilettante · Jan 3, 2020

JoeJ said:
The benchmark is my own work on realtime GI. The workloads are breadth first traversals of BVH / point hierarchy, raytracing for visibility, building acceleration structures. But it's not comparable to classic raytracing. Complexity is much higher and random access is mostly avoided. The general structure of programs is load from memory, heavy processing using LDS, write to memory. Rarely i access memory during the processing phase, and there is a lot of integer math, scan algorithms, also a lot of bit packing to reduce LDS. Occupancy is good, overall 70-80%. It's compute only - no rasterization or texture sampling.

The large AMD lead was constant over many years and APIs (OpenGL, OpenCL 1.2, finally Vulkan) The factor 5 i remember from the latest test in Vulkan comparing GTX670 vs. 7950, two years ago.
Many years ago i bought a 280x to see how 'crappy' AMD performs with my stuff, and i could not believe it destroyed Kepler Titan by a factor of two out of the box.
At this time i also switched from OpenGL to OpenCL, which helped a lot with NV performance but only a little with AMD. I concluded neither AMDs hardware nor their drivers are 'crappy'
Adding this to the disappointment of GTX670 not being faster than GTX480, i missed the following NV generations. Also i rewrote my algorithm which i did on CPU.
Years later, after porting results back to GPU (using OpenCL and Vulkan) i saw after the heavy changes the performance difference was the same. Rarely a shader (i have 30-40) shows an exception.
I also compared newer hardware: FuryX vs. GTX1070. And thankfully it showed NV did well. Both cards have the same performance per TF, just AMD offers more TF per dollar. So until i get my hands on Turing and RDNA i don't know how things have changed further.

Recently i learned Kepler has no atomics to LDS, and emulates with main memory. That's certainly a factor but it can't be that large - i always tried things like comparing scan algorithm vs. atomic max and picking the faster per GPU model.
So it remains a mystery why Kepler is so bad.
If you have an idea let me know, but it's too late - seems 670 has died recently :/

One interesting thing is AMD benefits much more from optimization, and i tried really hard here because GI is quite heavy.
Also NV seems much more forgiving to random access, and maybe i'm an exception here, comparing to other compute benchmark workloads.

Nvidia has historically been hostile to OpenCL or non-Cuda compute, and Kepler in particular might be more sensitive than prior and later generations to neglect. Register allocation and access for Kepler was something of a hybrid of G80 and later generations. It had a banked register file, but lacked Fermi's operand collector hardware and scoreboarding that might have kept instruction issue consistent. Kepler exposed dependence analysis to the compiler like later generations, but lacked the operand reuse slots that significantly helped Maxwell and beyond. It kept the G80's more complex way of handling register allocation, whereas Maxwell and later had a straightforward modulo-4 rule for what bank a register ID would hit.
I'm not sure if any profiling showed register bank conflicts being present for Kepler, so perhaps that might explain part of a regression from Fermi.
GCN didn't have this issue, although Navi to some extent seems like it might have a Maxwell or later set of rules for conflicts/reuse (although outside of LLVM code generation it's not mentioned).
GCN in the guise of Arcturus might have changes here versus prior GCN, although that might be for a new class of wider vectors.
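
To illustrate the modulo-4 bank rule mentioned above, here is a minimal sketch. It assumes four banks and a simplified conflict model in which each bank can serve one distinct register per cycle; the function names and the conflict model are illustrative assumptions, not a description of Nvidia's actual hardware.

```python
from collections import Counter
from typing import Iterable

NUM_BANKS = 4  # assumption: the "modulo-4" rule implies four register banks


def register_bank(reg_id: int) -> int:
    """Bank a register ID maps to under a plain modulo rule."""
    return reg_id % NUM_BANKS


def bank_conflicts(source_regs: Iterable[int]) -> int:
    """Count extra reads when distinct source registers share a bank.

    Hypothetical model: a bank holding k distinct operands of one
    instruction costs k - 1 extra reads. Reading the same register
    twice is not counted as a conflict.
    """
    per_bank = Counter(register_bank(r) for r in set(source_regs))
    return sum(count - 1 for count in per_bank.values())
```

Under this model an instruction reading R0, R1 and R2 has no conflicts, while one reading R0, R4 and R8 hits bank 0 three times. A compiler that knows the rule can pick register IDs to avoid the second case, which is the kind of simplicity the older allocation scheme did not offer.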

Kepler (or GK104 at least) may also have been hobbled by a shift in the balance of occupancy versus ILP from Fermi, set against its smaller L1/data share and registers-per-warp limits versus later GPUs.
If spills to the L1 were occurring too much at higher occupancies, the response a compiler might have in creating larger-context warps with poorer occupancy but higher ILP and fewer stalls/spills might have been hindered by limits to warp context. Kepler's instruction issue rules were also more complex and might have had more conflicts that later GPUs dispensed with when they streamlined the SMs.
Possibly certain operations falling back to memory, or maybe some odd instruction generation that expands an operation into a longer sequence due to emulation or errata handling, might also be in there.

This is all just spitballing because it does seem like your workload experienced either worse disparity than average, or perhaps the emphasis on compute was not significant enough in the time before Kepler became irrelevant to most observers.

JoeJ said:
Reminds me about this paper, which also took this architecture as example: https://pdfs.semanticscholar.org/26ef/909381d93060f626231fe7560a5636a947cd.pdf
Don't know if the work is related directly to AMD.

Yeah, they have tons of RT experience, and more research / software experience than AMD in general.
This led me to the initial assumption RTX must be very advanced, including reordering to improve both ray and shading coherence. And AMD could never catch up.
But without all this advanced stuff all that remains is simple tree traversal and triangle intersection, which is all NVs RT cores are doing, and there is no other form of hardware acceleration. As far as we know.
And this means AMD can catch up easily very likely. It also means RT does not waste too much chip area just for that, so makes more sense to begin with anyways.

Somewhere in one of the Turing threads it was indicated that intersection tests could be replaced by custom code, although I think performance was given as a reason to stay with the built-in methods if possible. Assuming custom code means the shader reverts to using the SIMD hardware like AMD's method would be doing all the time, that might be a reason why Nvidia's first available implementation is more autonomous.
I would assume Nvidia did evaluate reordering and coalescing, and given their current absence I'd guess both have some unattractive cost or unresolved complexities. I think there's still an issue with GPUs trying to expand their processing into an uncomfortable middle ground in terms of divergence and complexity, where their wide hardware and limited straight-line performance are not ideal. That might encourage higher clocks, whereas other workloads could scale better with more SIMD hardware.
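
For reference, the "simple tree traversal and triangle intersection" being discussed can be sketched in software as below. This is a minimal sketch under assumptions of my own: a stack-based walk, a slab test for boxes and the Moller-Trumbore test for triangles. The `Node` layout and function names are hypothetical, and nothing here claims to match what an RT core or AMD's TMU-based unit does internally.

```python
from dataclasses import dataclass, field

EPS = 1e-9


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def hit_box(origin, direction, lo, hi):
    """Slab test: does the ray (t >= 0) overlap the axis-aligned box?"""
    t_near, t_far = 0.0, float("inf")
    for o, d, l, h in zip(origin, direction, lo, hi):
        if abs(d) < EPS:  # ray parallel to this slab
            if o < l or o > h:
                return False
            continue
        t0, t1 = (l - o) / d, (h - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False
    return True


def hit_triangle(origin, direction, v0, v1, v2):
    """Moller-Trumbore test: return hit distance t, or None on a miss."""
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < EPS:  # ray parallel to the triangle plane
        return None
    inv = 1.0 / det
    s = sub(origin, v0)
    u = dot(s, p) * inv
    if u < 0 or u > 1:
        return None
    q = cross(s, e1)
    v = dot(direction, q) * inv
    if v < 0 or u + v > 1:
        return None
    t = dot(e2, q) * inv
    return t if t > EPS else None


@dataclass
class Node:  # hypothetical BVH node: inner nodes have children, leaves have triangles
    lo: tuple
    hi: tuple
    children: list = field(default_factory=list)
    triangles: list = field(default_factory=list)


def closest_hit(root, origin, direction):
    """Walk the tree with an explicit stack and return the nearest hit distance."""
    best = None
    stack = [root]
    while stack:
        node = stack.pop()
        if not hit_box(origin, direction, node.lo, node.hi):
            continue  # skip the whole subtree
        for tri in node.triangles:
            t = hit_triangle(origin, direction, *tri)
            if t is not None and (best is None or t < best):
                best = t
        stack.extend(node.children)
    return best
```

The loop is short, but each ray takes its own path through the tree, so neighbouring rays quickly stop visiting the same nodes. That divergence is what the reordering and coalescing ideas above try to address.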

3dilettante · Jan 3, 2020

anexanhume said:
AMD has been looking at this a while. This is a paper from 2014 in which they propose modifying the ALUs for only a 4-8% area increase. They propose 4 traversal units per CU. This is a 1-for-1 match to the number of TMUs, which is exactly where the hardware resides in more recent AMD patents on ray tracing.

https://pdfs.semanticscholar.org/26ef/909381d93060f626231fe7560a5636a947cd.pdf

These are proposed changes to Hawaii (R9 290X). With Navi's enhanced caches, I would think it's already more suitable to the modifications described.

Here's their latest patent:

http://www.freepatentsonline.com/20190197761.pdf

There is an element from the proposed modified Hawaii that bore some similarity in my interpretation to the biased-index access unit and crossbar in the patent I discussed previously. A ray's data was stored across the register banks, and a transpose unit was used to load the data in a way that the SIMD hardware could better use it, which provided better bandwidth for ray evaluation and avoided losing storage and bandwidth due to divergence.
The change in CU layout and functionality was much more heavyweight than that paper.
There's a concession in the TMU patent to the less than ideal fit that ray traversal and evaluation has for the SIMD hardware with the addition of a corner-turn register in the TMU, which takes intersection engine results and turns them back into a lane-based format so that the SIMD hardware can do something with it. Divergence is not considered an issue while in the TMU, but it comes back as a possible concern after every node evaluation. RDNA does seem better off in this regard than GCN if only because its wavefronts can go to 32 and it doesn't take 4 cycles per instruction.
I think this is still a case where some intermediate execution domain might make sense, or some way to co-issue work to a limited extent or allow concurrent evaluation that doesn't put as many bubbles in the vector path (be it a scalar-like domain or maybe SFU-like concurrent execution).

JoeJ · Jan 3, 2020

3dilettante said:
Nvidia has historically been hostile to OpenCL or non-Cuda compute

The most surprising of my performance comparisons was Kepler and Fermi being 2x faster in OpenCL than with OpenGL. (The comparison with AMD was against that better performance.)
This although OpenGL has indirect dispatch, which should have helped a lot in my case.
Seems NV did more investment in OpenCL than expected, or their OpenGL compilers were just bad - at least for me.
This was motivation to post such perf results quite often in forums, in the hope to get feedback on other API comparison experience like Cuda vs. OpenCL, but never got any.

Now that you mention profiling tools, i realize my mistake was simply not yet using them back then. So i was somewhat blindfolded and left with guessing. (Started using CodeXL only after shifting focus to AMD then.)

JoeJ · Jan 3, 2020

Silent_Buddha said:
MS looking towards future console hardware and knowing that Moore's law WRT advances in silicon nodes was dead. Large advances have to come from something non-traditional.

With this in mind, thinking towards the most traditional and most expensive CGI algorithm would be somewhat irrational.

Looking back, it was mainly NV who has offered software GPU RT very early (i think it started close to the introduction of Cuda), ImgTec which did it first on HW, and also Intel on CPU, showing serious research on data structures and parallel traversal.
Not to forget that most motivation and groundwork here comes from outside, meaning the movie industry like Pixar / Disney, and others working on offline rendering. (Which is also the source of Physically Based Rendering, likely the main progress of the current gen era.)

The move towards realtime RT was obvious to happen - the question was only when and how. Because DXR was a surprise with zero public indication before announcement, chip design takes some time and AMD does not have it, it seems much more likely NV was the origin here. My guess: They worked on this secretly, and then they came to MS and said: "Hey, HW RT comes up! ImgTec can do it, so ofc we can do it much better on powerful desktop. Here's our Optix API. Wanna put a DirectX Logo on it and present it with us, looking innovative and awesome?"
Pretty sure that's what has happened.

Silent_Buddha said:
Anyway, this could explain why NV's support for RT is relatively basic. It may not have been something they were seriously considering until MS came to them (and AMD as well) and basically said, "Can you do this? If so, we're going to start developing the groundwork for RT in DX." Or MS could have gone to them and basically said, "We're planning RT in DX, what can you implement and how long will it take?"

The 'basic' support makes the most sense. The more fancy functionality the hardware has, the higher the risk of a limited, short-lived technology, requiring much higher investments to develop but with little benefit in practical performance to expect.
Surely NV would not want to take the risk in hope to destroy any competition just because of an extra feature. Having the lead is enough, and evolving with feedback leads to better progress.

Betanumerical · Jan 3, 2020

Rootax said:
The "RT is between 2060 and 2070" is from the github leak, so from an AMD employee. It's not official, but the leak with the ps5 specs is not "baseless" imo.

I wasn't aware that we got RT performance from the GitHub leaks. I know we got the cycles per instruction, but we didn't appear to get the clock speed of the specific block (TD) that was running them.

Proelite · Jan 3, 2020

Rootax said:
The "RT is between 2060 and 2070" is from the github leak, so from an AMD employee. It's not official, but the leak with the ps5 specs is not "baseless" imo.

? I don't think so. I don't see any RT results from the github leak besides theoretical values for Arden / Sparkman that no one so far has been able to place a power level on.

DavidGraham · Jan 3, 2020

Silent_Buddha said:
Anyway, this could explain why NV's support for RT is relatively basic. It may not have been something they were seriously considering until MS came to them (and AMD as well) and basically said, "Can you do this? If so, we're going to start developing the groundwork for RT in DX."

Hardly. NVIDIA has had its eyes on ray tracing since the days of Fermi and CUDA (10 years ago), and has been pushing ray tracing into GameWorks effects since the days of Kepler, chief among them HFTS shadows and VXAO. They also pioneered several denoising solutions and planned to accelerate them with Volta's Tensor cores. They co-announced DXR alongside Microsoft before any other manufacturer, and they were ready with DXR fallback drivers and with RTX acceleration even before the Turing announcement. They also had the DXR compatibility layer for non-RTX hardware long before anyone else.

All of this means that NVIDIA at least had the more significant hand in pushing for RT in games, if not the initiator for it.

troyan · Jan 3, 2020

Silent_Buddha said:
Anyway, this could explain why NV's support for RT is relatively basic. It may not have been something they were seriously considering until MS came to them (and AMD as well) and basically said, "Can you do this? If so, we're going to start developing the groundwork for RT in DX." Or MS could have gone to them and basically said, "We're planning RT in DX, what can you implement and how long will it take?"

Regards,
SB

Turing is specifically designed for raytracing. Look at the GTX1660TI, which can compete with a GTX1080TI in most raytracing benchmarks and games.
Besides the RT cores, the separation of FP and INT units, the new cache system and the redesigned scheduling overcome the pure compute performance advantage of Pascal.

Jay · Jan 3, 2020

Rootax said:
The "RT is between 2060 and 2070" is from the github leak, so from an AMD employee. It's not official, but the leak with the ps5 specs is not "baseless" imo.

How would this even be compared across vendors?
How many rays a cycle it can do, even then what other workloads is it doing?

anexanhume · Jan 3, 2020

3dilettante said:
There is an element from the proposed modified Hawaii that bore some similarity in my interpretation to the biased-index access unit and crossbar in the patent I discussed previously. A ray's data was stored across the register banks, and a transpose unit was used to load the data in a way that the SIMD hardware could better use it, which provided better bandwidth for ray evaluation and avoided losing storage and bandwidth due to divergence.
The change in CU layout and functionality was much more heavyweight than that paper.
There's a concession in the TMU patent to the less than ideal fit that ray traversal and evaluation has for the SIMD hardware with the addition of a corner-turn register in the TMU, which takes intersection engine results and turns them back into a lane-based format so that the SIMD hardware can do something with it. Divergence is not considered an issue while in the TMU, but it comes back as a possible concern after every node evaluation. RDNA does seem better off in this regard than GCN if only because its wavefronts can go to 32 and it doesn't take 4 cycles per instruction.
I think this is still a case where some intermediate execution domain might make sense, or some way to co-issue work to a limited extent or allow concurrent evaluation that doesn't put as many bubbles in the vector path (be it a scalar-like domain or maybe SFU-like concurrent execution).

Thanks. Do you think rasterization will suffer a similar performance hit to Turing cards when RTRT is used?

3dilettante · Jan 3, 2020

JoeJ said:
The most surprising of my performance comparisons was Kepler and Fermi being 2x faster in OpenCL than with OpenGL. (The comparison with AMD was against that better performance.)
This although OpenGL has indirect dispatch, which should have helped a lot in my case.
Seems NV did more investment in OpenCL than expected, or their OpenGL compilers were just bad - at least for me.
This was motivation to post such perf results quite often in forums, in the hope to get feedback on other API comparison experience like Cuda vs. OpenCL, but never got any.

Now that you mention profiling tools, i realize my mistake was simply not yet using them back then. So i was somewhat blindfolded and left with guessing. (Started using CodeXL only after shifting focus to AMD then.)

My intent was to state that OpenCL and other forms of compute that weren't CUDA like OpenGL were things Nvidia has been very reluctant to keep up to date or provide the full range of tools for. In the set of platforms Nvidia doesn't like, it's possible to do better for one over the other while holding to the theme that a CUDA solution should be expected to be better supported and optimized than both. Last I saw, Nvidia didn't have finalized support for 2.0 or higher, although it's been a while since I kept tabs on it.

I'm not sure about the timing of your work with GK104 and whether sufficient profiling tools would have been available before your card failed. It's possible that the overlap in time for the card and availability of the tools to evaluate it might have been limited.

anexanhume said:
Thanks. Do you think rasterization will suffer a similar performance hit to Turing cards when RTRT is used?

To preface this, I'm not really sure. I think there may be potential for it to have a higher impact, all else being equal. It might have the ability to provide solutions for unspecified problems with the fixed-function method, but that comes down to whether AMD is successful in leveraging it into doing something better versus using bulkier hardware to do the same thing.

The traversal and intersection testing is one part of the overall performance impact. Building and updating the structures and the cost in terms of memory and on-die occupancy dedicated to RT is a significant impact that is outside the scope of the BVH hardware. In some ways, the occupancy with AMD's method might be worse. The TMU patent's point of comparison in its claims is between embedding itself into the CU hardware, or a fully independent BVH traversal processor with large buffers and dedicated resources.
I think Nvidia has descriptions of its method that indicate it's not the worst-case scenario AMD's patent is comparing itself to in terms of hardware investment. Attempts in the Turing thread at comparing SM hardware in Turing chips with and without RT speculated that the RT hardware is possibly increasing SM size by a single-digit percentage (not sure if it was ~6-7% at the moment). The tensor core hardware was a more substantial contributor to SM growth, but this is all from outsiders trying to infer the layout of the units from die pictures, not an official breakdown. SM area growth is also buffered by it being a subset of a larger whole, since maybe half or more of a GPU is non-SM silicon.

Nvidia's approach isn't very transparent in terms of what its hardware resources or corner cases might be, and AMD's patent isn't meant to be specific enough to make a good comparison.
On a per-unit basis, the CU in AMD's method is going to see more of its issue cycles and memory accesses taken up by traversal versus Nvidia's method. What conversion factor there is between an intersection pipe in an AMD TMU versus an RT core with some unspecified internal arrangement isn't clear to me.
I think AMD's method involves the vector register file significantly more, and there's constant conversion to and from wavefront format and hardware to the specialized unit, which may impact its power efficiency as well.

If AMD can utilize the flexibility of leveraging more of the programmable hardware, maybe it can do better per-unit--depending on an unknown rate of underutilization of Nvidia's fixed-function hardware.
The minimum per-ray computation and power cost seems higher with AMD's method, but if there's a significant pain point with the RT core's method or a significant algorithmic improvement the aggregate cost might be lower (so long as the algorithm's costs and overheads don't swamp the improvement or force out other wavefront types).

I think RDNA's smaller wavefronts and emphasis on reducing execution latency might position it better than GCN would have, so maybe it makes RT hardware more feasible.
Even if the patent is close to what AMD implements, it still might be possible that its execution loop doesn't elaborate on what Nvidia does, and AMD has the option of exposing portions of the loop to programmers or reserving most of it as some kind of internal reserved operations like a form of microcode.

One thing I've mused about before is that sometimes early forms of hardware features can be in released products but never used. It'd be hard to eyeball a few extra ALUs and microcode store in a die shot and divine their purpose. RDNA has a fair number of rough spots and/or outright bugs that could be fixed in a later version. Some of the affected features seem uncomfortably close to being too obvious for a fully-baked architecture, and so RDNA version 1 might be a good way to track the worst offenders down. Perhaps there's some evaluation version of the hardware, or maybe since Navi is that buggy the hardware is also buggy or is compromised by them.

anexanhume · Jan 3, 2020

3dilettante said:
One thing I've mused about before is that sometimes early forms of hardware features can be in released products but never used. It'd be hard to eyeball a few extra ALUs and microcode store in a die shot and divine their purpose. RDNA has a fair number of rough spots and/or outright bugs that could be fixed in a later version. Some of the affected features seem uncomfortably close to being too obvious for a fully-baked architecture, and so RDNA version 1 might be a good way to track the worst offenders down. Perhaps there's some evaluation version of the hardware, or maybe since Navi is that buggy the hardware is also buggy or is compromised by them.

I believe some of these bugs have already been acknowledged as addressed in the forthcoming Navi 12 (same 40 CUs, but HBM memory interface. Anticipated for laptops [read: Apple]).

Proelite · Jan 4, 2020

hmqgg, a verified MS insider on Era, said Arden matches the Dante devkit. Considering that devkits for MS usually have fully enabled CUs, I expect XSX to have all enabled CUs.
Chips with defective GPU parts will be used in Xcloud as Lockhart, and/or in Azure as compute resources. Even chips with defective CPU cores can be repurposed in Azure.

If the yield on Arden is greater than 70%, I think that's what we will see.

RDGoodla · Jan 4, 2020

https://www.techpowerup.com/259890/tsmc-starts-shipping-its-7nm-node-based-on-euv-technology

TSMC 7nm+ entered into volume production in Q2 2019. The yield was similar to original 7nm in October 2019.

I expect 7nm+ will reach the yield of 7nm in 2020.
7nm+ should be more economical for next-gen consoles. If each APU can save $20~30, that’s $600~900 million difference for 30 million consoles using 7nm+ process.
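
As a quick sanity check on that figure, here is a minimal sketch of the arithmetic. The function name is made up for illustration, and the inputs are just the $20~30 per-APU saving and the 30 million consoles quoted above, not independent data.

```python
def total_savings_usd(per_apu_usd, consoles):
    """Total saving if every console's APU is cheaper by the same amount."""
    return per_apu_usd * consoles


CONSOLES = 30_000_000  # volume assumed in the post above

low = total_savings_usd(20, CONSOLES)   # lower end of the quoted per-APU saving
high = total_savings_usd(30, CONSOLES)  # upper end of the quoted per-APU saving
print(f"${low / 1e6:.0f} million to ${high / 1e6:.0f} million")
```

The multiplication checks out at $600 to $900 million. The open question is whether a $20~30 per-APU saving is realistic in the first place.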

Globalisateur · Jan 4, 2020

Proelite said:
hmqgg, a verified MS insider on Era, said Arden matches the Dante devkit. Considering that devkits for MS usually have fully enabled CUs, I expect XSX to have all enabled CUs.
Chips with defective GPU parts will be used in Xcloud as Lockhart, and/or in Azure as compute resources. Even chips with defective CPU cores can be repurposed in Azure.

If the yield on Arden is greater than 70%, I think that's what we will see.

That doesn't mean he was talking about the number of CUs or total tflops. I think he was talking about the clocks. In another tweet he says he doesn't know if both will have the same tflops number.

AlphaWolf · Jan 4, 2020

RDGoodla said:
https://www.techpowerup.com/259890/tsmc-starts-shipping-its-7nm-node-based-on-euv-technology

TSMC 7nm+ entered into volume production in Q2 2019. The yield was similar to original 7nm in October 2019.

I expect 7nm+ will reach the yield of 7nm in 2020.
7nm+ should be more economical for next-gen consoles. If each APU can save $20~30, that’s $600~900 million difference for 30 million consoles using 7nm+ process.

Where are these savings coming from? The greater density? $20 to $30 savings per APU seems highly optimistic. The new node will cost more per wafer. Initially there may be zero or negative savings.
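
To make that objection concrete, here is a rough sketch of cost per good die under the usual simplification cost = wafer cost / (dies per wafer × yield). Every number below is a made-up placeholder chosen only to show the shape of the trade-off. None of it is real TSMC pricing, die count or yield data.

```python
def cost_per_good_die(wafer_cost_usd, dies_per_wafer, yield_fraction):
    """Wafer cost spread over the dies that actually work."""
    return wafer_cost_usd / (dies_per_wafer * yield_fraction)


# Placeholder inputs (hypothetical): the newer node fits more dies on a
# wafer, but the wafer itself is more expensive.
old_node = cost_per_good_die(wafer_cost_usd=10_000, dies_per_wafer=150, yield_fraction=0.70)
new_node = cost_per_good_die(wafer_cost_usd=12_000, dies_per_wafer=170, yield_fraction=0.70)

# A negative value means the denser node costs more per working die.
savings_per_die = old_node - new_node
```

With these placeholder values the density gain does not cover the higher wafer price, so the saving comes out negative. Whether that holds for real consoles depends on the actual wafer price, die size and yield, none of which are public.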

Frenetic Pony · Jan 4, 2020

3dilettante said:
Nvidia's approach isn't very transparent in terms of what its hardware resources or corner cases might be, and AMD's patent isn't meant to be specific enough to make a good comparison.
On a per-unit basis, the CU in AMD's method is going to see more of its issue cycles and memory accesses taken up by traversal versus Nvidia's method. What conversion factor there is between an intersection pipe in an AMD TMU versus an RT core with some unspecified internal arrangement isn't clear to me.
I think AMD's method involves the vector register file significantly more, and there's constant conversion to and from wavefront format and hardware to the specialized unit, which may impact its power efficiency as well.

If AMD can utilize the flexibility of leveraging more of the programmable hardware, maybe it can do better per-unit--depending on an unknown rate of underutilization of Nvidia's fixed-function hardware.
The minimum per-ray computation and power cost seems higher with AMD's method, but if there's a significant pain point with the RT core's method or a significant algorithmic improvement the aggregate cost might be lower (so long as the algorithm's costs and overheads don't swamp the improvement or force out other wavefront types).

I think RDNA's smaller wavefronts and emphasis on reducing execution latency might position it better than GCN would have, so maybe it makes RT hardware more feasible.
Even if the patent is close to what AMD implements, it still might be possible that its execution loop doesn't elaborate on what Nvidia does, and AMD has the option of exposing portions of the loop to programmers or reserving most of it as some kind of internal reserved operations like a form of microcode.

One thing I've mused about before is that sometimes early forms of hardware features can be in released products but never used. It'd be hard to eyeball a few extra ALUs and microcode store in a die shot and divine their purpose. RDNA has a fair number of rough spots and/or outright bugs that could be fixed in a later version. Some of the affected features seem uncomfortably close to being too obvious for a fully-baked architecture, and so RDNA version 1 might be a good way to track the worst offenders down. Perhaps there's some evaluation version of the hardware, or maybe since Navi is that buggy the hardware is also buggy or is compromised by them.

From AMD's standpoint the ability to have a more programmable pipeline seems a large potential strategic advantage, as they've got the contract for both upcoming "new" generation consoles that will have raytracing, and are thus able to de facto set whatever standard they can get through both Microsoft and Sony as some least common denominator.

Programmable intersection tests could deliver advantages to developers that Nvidia's hardware can't, with non-box testing already showing differing advantages. Tetrahedrons instead of boxes show far better intersection testing performance, on the order of several times faster, given the same non-specialized hardware. Spheres are slower to test, but an acceleration structure built from them is much faster to rebuild, as transforms only rely on the center of a sphere, and rebuilding is already a major pain point for many developers. On top of that, a programmable traversal stage, already proposed by Intel, has many potential advantages as well, with things like stochastic tracing and easily selectable LODs coming into play. As far as is known, neither is available on current Nvidia raytracing hardware. Outdating their larger competitor's lineup would be a major victory for AMD, though certainly bad for any Nvidia customers expecting their gaming hardware to last longer.

Still, regardless of corporations trying to win against each other, programmable pipelines are definitely more desirable for developers as well. Overall, taking up a larger die area, or more on-die resources when in use, might be a better strategy in the long run, depending on your point of view.
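
To show why sphere bounds are cheap to update, here is a minimal sketch of a ray-sphere test plus a rigid move. It assumes a unit-length ray direction, and the function names are hypothetical. It is the kind of test a programmable intersection stage could run, not a claim about what any shipping hardware does.

```python
import math


def hit_sphere(origin, direction, center, radius):
    """Nearest positive hit distance along a unit-length ray, or None on a miss."""
    oc = tuple(o - c for o, c in zip(origin, center))
    b = sum(d * x for d, x in zip(direction, oc))
    c = sum(x * x for x in oc) - radius * radius
    disc = b * b - c  # quadratic discriminant, valid because |direction| == 1
    if disc < 0:
        return None
    root = math.sqrt(disc)
    for t in (-b - root, -b + root):
        if t > 0:
            return t
    return None


def move_sphere(center, radius, offset):
    """A rigid move only shifts the center; the radius stays the same."""
    return tuple(c + o for c, o in zip(center, offset)), radius
```

A rotation about the sphere's own center leaves the bound untouched, and a translation is three additions. An axis-aligned box around a rotated object generally has to be refitted.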
