Next Generation Hardware Speculation with a Technical Spin [2018]

Fair point, but I'm not. I'm saying that, of the two features that seem necessary for an RTX approach to ray tracing - ML capability and BVH accelerators - AMD are shipping a product, by the end of this year, that contains one of those.

Honestly, I don't quite get why my posts were moved here, since I was talking about the way AMD's MI60 announcement has a bearing on their future GPU designs, and therefore the likelihood of RTRT in the next generation.
For me, I think it's important to be able to separate the two features - ML isn't necessary for ray tracing; it's effective at approximating highly complex tasks. In this case, BVH acceleration is the actual requirement for ray tracing - the 2080 Ti can run ray traced games at 1080p60 without ML. Getting that to ~4K, however, would likely require some sort of ML algorithm.
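As a rough sense of the jump from 1080p to 4K, a back-of-the-envelope sketch - the rays-per-pixel figure below is invented, purely to illustrate the scaling:

```python
# Rough sketch: how the per-frame ray budget grows going from 1080p60
# to native 4K60, assuming rays scale roughly with pixel count.
# The rays-per-pixel number is an illustrative assumption only.
pixels_1080p = 1920 * 1080          # ~2.07 million pixels
pixels_4k = 3840 * 2160             # ~8.29 million pixels
scale = pixels_4k / pixels_1080p    # 4.0x the pixels

rays_per_pixel = 2                  # assumed hybrid-RT budget
fps = 60
rays_1080p = pixels_1080p * rays_per_pixel * fps
rays_4k = pixels_4k * rays_per_pixel * fps

print(f"Pixel scale factor: {scale:.1f}x")
print(f"1080p60:     {rays_1080p / 1e9:.2f} Grays/s")
print(f"Native 4K60: {rays_4k / 1e9:.2f} Grays/s")
# Hence the appeal of tracing at a lower internal resolution and
# reconstructing to 4K with something ML-based (DLSS-style) instead.
```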
ML can be used for a variety of tasks, and it's entirely possible that, using the MI60 as a template, next gen could have something similar to it: hardware where ML helps improve games and move them forward in a variety of ways, while being flexible enough not to force developers to use it.

The 'multi' function cores are an interesting topic, because earlier we were discussing the waste of silicon on fixed functions. Here we get significantly more flexibility - less performance in those specific areas, but all the silicon can be put to full use whether you decide to use the features or not. Tensor cores are very specific: they're matrix multiply-accumulate units aimed at deep learning workloads, with the multiply inputs limited to 16 bits, and I don't know how useful they are outside of that. Some ML algorithms may require up to 64-bit precision. So the answer is: I don't know what games would require or how it would be used.
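For what it's worth, the operation a tensor core accelerates is just a fused matrix multiply-accumulate with 16-bit inputs. A minimal numpy sketch of that precision split (illustrative only - this runs on the CPU and is not how you'd actually drive tensor cores):

```python
import numpy as np

# Illustrative sketch of the mixed-precision pattern tensor cores
# accelerate: D = A @ B + C, with A and B held in FP16 and the
# accumulation done in FP32. The point is only the precision split.
rng = np.random.default_rng(0)
A = rng.standard_normal((16, 16)).astype(np.float16)
B = rng.standard_normal((16, 16)).astype(np.float16)
C = np.zeros((16, 16), dtype=np.float32)

D = A.astype(np.float32) @ B.astype(np.float32) + C  # FP32 accumulate

# An algorithm that genuinely needs FP64 throughout couldn't be
# expressed this way, which is the concern about 16-bit-only units.
print(D.dtype)  # float32
```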
 
For me, I think it's important to be able to separate the two features - ML isn't necessary for ray tracing; it's effective at approximating highly complex tasks. In this case, BVH acceleration is the actual requirement for ray tracing - the 2080 Ti can run ray traced games at 1080p60 without ML. Getting that to ~4K, however, would likely require some sort of ML algorithm.

True, and we still don't have any solid metrics to determine whether denoising, AA, and upscaling are better done via ML or conventional algorithms. Nvidia's DLSS is a perfect example: we know it provides solid results, but how comparable is it to the same mm² of general compute?

I agree that it's important to distinguish the two, but given its inclusion in Nvidia's idealised RTX, I'm willing to believe it's likely that ML will play some role in RTRT. It might not, but you and others have posted enough compelling cases for me to believe it.

And that's why I'm excited. Assuming BVH acceleration can be added to CU's, that's all that's needed to evolve the MI60 into a capable RTRT GPU. I know that's a massive oversimplification, and that it would probably perform worse than Nvidia's RTX cards, but it still puts some form of RTRT hardware tantalisingly within the grasp of the next generation consoles.

ML can be used for a variety of tasks, and it's entirely possible that, using the MI60 as a template, next gen could have something similar to it: hardware where ML helps improve games and move them forward in a variety of ways, while being flexible enough not to force developers to use it.

Exactly. Even just an MI60 GPU would be interesting for a console. Hybrid algorithm and ML approaches to all kinds of things, most obviously the things I mentioned above: AA, upscaling, and denoising. I think you've posted a couple of videos in reference to other uses too.

And the fact you've highlighted, of not forcing developers to use anything, is key for console success, which is another reason I'm quite excited by this development.

The 'multi' function cores are an interesting topic, because earlier we were discussing the waste of silicon on fixed functions. Here we get significantly more flexibility - less performance in those specific areas, but all the silicon can be put to full use whether you decide to use the features or not. Tensor cores are very specific: they're matrix multiply-accumulate units aimed at deep learning workloads, with the multiply inputs limited to 16 bits, and I don't know how useful they are outside of that. Some ML algorithms may require up to 64-bit precision. So the answer is: I don't know what games would require or how it would be used.

I don't know either, but the important thing is: no-one knows. Who knows what amount of ML will be useful for every game, every genre, even? No-one can answer that, but put the control over that ratio in the hands of developers, and they can answer it on a per title basis.

The same can be said for RTRT. Assuming the CU's can be modified to accelerate BVH's, developers can do the same there too: some ratio of rasterisation, some ratio of ML, and some ratio of RT.

In the PC space, Nvidia's fixed function approach is fine, and probably yields better results. People can just keep upgrading their rig, or changing and disabling per game settings.

In the console space, they have to provide a platform that straddles the broadest markets for a minimum of 6 years, and developer flexibility is key to that. "Time to triangle" popped up time and again in PS4 launch interviews with Mark Cerny, and it's a philosophy that's served them well: it's undone the negative perception of PlayStation caused by the PS3, and secured them shed loads of content. At the same time, the flexibility of compute has granted one of their studios, MM, the chance to render in an entirely new way with SDF's.

We have an indication with the MI60 that the same approach is likely to be taken next generation. If BVH acceleration can be bolted on to the CU's, ML+BVH's may be that generation's compute.
 
I believe the RT cores are bolted directly into the compute units. I could be wrong, but there's no suggestion that these are separate pieces of silicon.

The main thing to consider is that ML takes a lot more resources, bandwidth in particular. You can look at, say, the MI60 and see that it's already 7nm... 300W... 1 TB/s of bandwidth and roughly 14.7 TF of FP32 compute on a dedicated GPU.

That's not a great sign imo. It's not going to get any smaller for the consoles, and the power and bandwidth levels are too high. It's interesting as some insight into what could be coming from AMD, but this particular design doesn't suit consoles well at all.
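To put rough numbers on the bandwidth point (using the announced MI60 figures and the PS4 as a familiar reference; treat both as approximate):

```python
# Ratio of memory bandwidth to compute for the announced MI60,
# versus a current console for comparison. Approximate figures.
mi60_bw_gbs = 1000.0        # ~1 TB/s HBM2
mi60_fp32_tflops = 14.7     # ~14.7 TF FP32

ps4_bw_gbs = 176.0          # GDDR5
ps4_fp32_tflops = 1.84

print(f"MI60: {mi60_bw_gbs / (mi60_fp32_tflops * 1000):.3f} bytes/FLOP")
print(f"PS4:  {ps4_bw_gbs / (ps4_fp32_tflops * 1000):.3f} bytes/FLOP")
# ~0.068 vs ~0.096 bytes per FLOP. The bigger issue is less the ratio
# than the absolute 1 TB/s + 300 W, which is hard to fit into a
# console power and cost budget.
```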
 
If someone took GPUs and made a graph with time on one axis and raw FP32 FLOPS on the other, I bet the curve would look sad. Even sadder if FLOPS were normalised per watt... Of course, things could look different if specific accelerators like ray tracing or tensor cores are accounted for, or if we could somehow take the efficiency/programmability of those FLOPS into account.
 
If we look at "modern GPUs" - GCN for AMD, Fermi and newer for NVIDIA - FP32 throughput looks like this (using reported boost clocks where applicable, single-GPU top models of each generation only).
Not normalised for perf/W.

AMD:
[graph: AMD top single-GPU FP32 TFLOPS over time]

And NVIDIA
[graph: NVIDIA top single-GPU FP32 TFLOPS over time]

edit: updated AMD graph with MI60
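For anyone who wants to redraw the AMD curve, a quick matplotlib sketch along these lines would do it - the launch dates and boost-clock FP32 figures below are my own approximations, so double-check them:

```python
import matplotlib.pyplot as plt
from datetime import date

# Approximate launch dates and peak FP32 TFLOPS (boost clocks) for
# AMD's top single-GPU parts. Figures are approximate, from memory.
amd = [
    ("HD 7970", date(2012, 1, 9),   3.79),
    ("R9 290X", date(2013, 10, 24), 5.63),
    ("Fury X",  date(2015, 6, 24),  8.60),
    ("Vega 64", date(2017, 8, 14), 12.66),
    ("MI60",    date(2018, 11, 6), 14.75),
]

dates = [d for _, d, _ in amd]
tflops = [t for _, _, t in amd]

plt.plot(dates, tflops, marker="o")
for name, d, t in amd:
    plt.annotate(name, (d, t))
plt.xlabel("Launch date")
plt.ylabel("Peak FP32 TFLOPS")
plt.title("AMD top single-GPU FP32 throughput over time")
plt.show()
```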
 

The MI60 may not yet have the full suite of power saving features that are planned for Navi, and it may be pushing higher clocks on an architecture not fully tuned to take advantage of TSMC's 7nm process (Vega and its predecessor Polaris were designed for GF 14nm).

Given architectures designed for, and further evolved on, TSMC's tools and processes, I think we'll see better perf/watt and possibly even higher densities.

I'm excited about the possibility of tiered consoles with the same off-the-shelf CPU chiplet connected to the main GPU/NB/SB/memory controller via Infinity Fabric. That might just lead to reduced costs for the customised main chip. Hell, with an organic substrate supporting low cost HBM, you might even make sharing the same main board and main memory arrangement possible ...
 

For AMD it looks like FLOPS roughly double every 3.5 years. Assuming that holds and also applies in the console space, it would give a good estimate of what to expect for a next gen console, assuming a similar box size and cooling solution.
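Taking that 3.5-year doubling at face value, the extrapolation is simple to sketch - the console baselines below are just illustrative anchors, not predictions:

```python
# Toy extrapolation of the "FLOPS double every ~3.5 years" eyeball fit.
# Baselines and the 2020 target date are assumptions for illustration.
def projected_tflops(base_tflops, base_year, target_year, doubling_years=3.5):
    return base_tflops * 2 ** ((target_year - base_year) / doubling_years)

# Anchored on the PS4 (1.84 TF, late 2013), projected to late 2020:
print(f"{projected_tflops(1.84, 2013.9, 2020.9):.1f} TF")  # ~7.4 TF

# Anchored on the PS4 Pro (4.2 TF, late 2016):
print(f"{projected_tflops(4.2, 2016.9, 2020.9):.1f} TF")   # ~9.3 TF
```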
 
We'll likely be on the same node of 7nm for next gen. Clocks will be slower. The chip will have redundancy built in. What can we realistically expect here?
 

Well, I certainly can't say with any certainty, but I did suggest power saving / clock boosting architectural features, and building for a given process line ...
 
It depends on how big a step up Navi is going to be. Is it going to be more or less GCN architecture, or something next gen? I also feel 10 TFLOPS is most likely, but if Navi has some "magic" we might get 12 TFLOPS.
 
In the PC space, Nvidia's fixed function approach is fine, and probably yields better results.

Nvidia's fixed function is fine today, yes, but by 2021 Nvidia will probably come up with something more flexible, just as the fixed function vertex/pixel pipeline eventually gave way to the GeForce 3 era's programmable shaders.

People can just keep upgrading their rig, or changing and disabling per game settings.

On console, developers will do that for you to gain performance.
 
In the PC space, Nvidia's fixed function approach is fine, and probably yields better results. People can just keep upgrading their rig, or changing and disabling per game settings.

In the console space, they have to provide a platform that straddles the broadest markets for a minimum of 6 years, and developer flexibility is key to that. "Time to triangle" popped up time and again in PS4 launch interviews with Mark Cerny, and it's a philosophy that's served them well: it's undone the negative perception of PlayStation caused by the PS3, and secured them shed loads of content. At the same time, the flexibility of compute has granted one of their studios, MM, the chance to render in an entirely new way with SDF's.

We have an indication with the MI60 that the same approach is likely to be taken next generation. If BVH acceleration can be bolted on to the CU's, ML+BVH's may be that generation's compute.
Crippling the speed of the hardware to accommodate a handful of experimental games with dubious commercial potential doesn't seem like a sound decision to me, especially in console space.
 
Isn't "crippling" a bit excessive? Something like 10-12TF of performance seems a reasonable expectation, based on assumed die sizes for a console and the known die size of the MI60. We'll have to see what performance it reaches at lower clock speeds before we have a clear approximation of performance at console wattage, but the current 14.8TF for a GPU that was meant as a 7nm pipe cleaner bodes well IMO. We'll have to wait and see, but hopefully there's truth to the reports of some Zen engineers moving over to work on Navi, and hopefully that pays off in the form of improved perf/watt.

But let's be conservative and assume a 10TF GPU, which can have any ratio of performance dedicated to rasterisation, ML, and RT. Would that flexibility really hamper it so much?
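Just to make the 'any ratio' idea concrete, a toy sketch - the per-title splits below are completely made up, purely to illustrate the flexibility argument:

```python
# Hypothetical per-title splits of a 10 TF general-purpose GPU budget
# across rasterisation, ML and ray tracing. The percentages are
# invented examples, not predictions.
TOTAL_TF = 10.0

titles = {
    "pure raster title":  {"raster": 1.00, "ml": 0.00, "rt": 0.00},
    "ML reconstruction":  {"raster": 0.80, "ml": 0.20, "rt": 0.00},
    "hybrid RT showcase": {"raster": 0.55, "ml": 0.15, "rt": 0.30},
}

for name, split in titles.items():
    parts = ", ".join(f"{k}: {TOTAL_TF * v:.1f} TF" for k, v in split.items())
    print(f"{name:<20} -> {parts}")
```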

I can imagine that all 3 might be somewhat compromised, relative to fixed function hardware, but does the penalty to performance still result in a functional RTRT GPU? To my knowledge, that's an unknown.

Does the penalty to the size of the CU's outweigh just having fixed function hardware? The MI60 suggests that AMD might think so.

And if the answer to both of those is yes, does it make it any cheaper to manufacture? Potentially so, if AMD are able to just deploy different amounts of differently clocked multi function CU's to satisfy all of their markets.

RTRT is still nascent in gaming, but it can't hurt to have the entire core gaming industry beavering away on solutions to it. It'll just be the way it always is: lower quality on the consoles.

And if any developers don't want to, that's fine, they can act as if they have a 10TF PS4/XB1 with plenty more memory and a way better CPU.

Edit: I also think it's worth mentioning that Sony's new CEO mentioned ease of manufacturing as key for the PS5 - I'll try to find the source. Considering "ease of manufacturing," AMD's recent chiplet Zen 2 design, their continued push for Infinity Fabric, and their multi purpose CU's, I'm convinced that, if the CU's can be modified to be capable of RT, that's the approach AMD would want to take regardless of developer convenience.
 
I can imagine that all 3 might be somewhat compromised, relative to fixed function hardware, but does the penalty to performance still result in a functional RTRT GPU? To my knowledge, that's an unknown.
For the sake of this discussion going forward, we really should be calling it HRT, for hybrid ray tracing. Real time ray tracing, imo, means an entirely ray traced image, and probably isn't an accurate description of the technology itself.

That being said, AMD has yet to reveal how it intends to accelerate ray tracing. The ML angle is interesting, its strength lying in 32- and 64-bit precision, but on the FP16 front it's going to be beaten by tensor cores. The INT4 and INT8 rates are interesting, but perhaps unnecessary. These are all items that are useful in the enterprise space, but I'm unsure how useful they are in the game space.

That being said, if there's no need for INT4/8 and FP64 in the gaming space, perhaps the CUs can shrink.
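For reference, the rate scaling AMD announced for the MI60 works out roughly like this - these are my recollection of the announced figures, so treat them as approximate:

```python
# Approximate peak throughput scaling for the announced MI60 (Vega 20):
# each halving of precision roughly doubles the rate.
fp32_tflops = 14.7
rates = {
    "FP64": fp32_tflops / 2,   # ~7.4 TFLOPS
    "FP32": fp32_tflops,       # ~14.7 TFLOPS
    "FP16": fp32_tflops * 2,   # ~29.5 TFLOPS
    "INT8": fp32_tflops * 4,   # ~59 TOPS
    "INT4": fp32_tflops * 8,   # ~118 TOPS
}
for precision, rate in rates.items():
    print(f"{precision}: {rate:.1f}")
# Nvidia quotes the 2080 Ti's tensor cores at over 100 TFLOPS of FP16,
# which is why FP16 is the front where the MI60 loses out.
```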
 
Yes, they wouldn't call it DLSS, but there's a bigger tell in the presentation clearly showing that it's a fake: 3,640 shader cores can't be divided by 64 to give a whole CU count. Someone put a lot of effort into this fake, but didn't have the basic architectural knowledge to avoid such a mistake.
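The arithmetic behind that tell, for anyone skimming:

```python
# Sanity check on the leaked spec: GCN shader counts must be a
# multiple of 64 (64 stream processors per CU).
shader_cores = 3640
cus, remainder = divmod(shader_cores, 64)
print(cus, remainder)  # 56 remainder 56 -> not a whole number of CUs
# 3584 (56 CUs) or 3712 (58 CUs) would be plausible; 3640 is not.
```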
 