Next Generation Hardware Speculation with a Technical Spin [post E3 2019, pre GDC 2020] [XBSX, PS5]

I dunno about RT, talking strictly raster performance. I think 12TF Navi next year will be comfortably ahead of a 2080, let alone a 2060.
I'd hope that we see better than 2060 performance, but it's important to remember that nVidia's 2060 Super is a 7TF part that's essentially on par (usually within 10%) with AMD's 5700XT, a 9.5TF part. Assuming we aren't getting a radical uplift in performance per flop with RDNA2, I think a 12TF part would land closer to a 2070 Super than anything else (by that ratio, 12TF of RDNA is worth roughly 12 × 7/9.5 ≈ 8.8TF of Turing throughput), maybe matching the vanilla 2080, but I wouldn't expect it to be faster. But who knows, maybe AMD releases a massive GPU with 128 ROPs and embedded memory so it isn't bandwidth constrained.
 
Do you think this also applies to RT performance alone?
Thing is, even if they do a 12TF console, which makes sense for keeping the base model cheaper, AMD beating NV significantly with their first RT implementation just sounds optimistic to me.
But who knows... never underestimate AMD! :)

Is it that optimistic? GCN was certainly far better than Kepler and perhaps somewhat better than Paxwell. Nvidia GPUs seem to have more "gotchas" that pop up and hamper performance in newer games: compute in Kepler, concurrent graphics and compute as well as HDR in Paxwell, integer scaling only on Turing, etc.
 
Is it that optimistic? GCN was certainly far better than Kepler and perhaps somewhat better than Paxwell. Nvidia GPUs seem to have more "gotchas" that pop up and hamper performance in newer games: compute in Kepler, concurrent graphics and compute as well as HDR in Paxwell, integer scaling only on Turing, etc.
Yeah. For me GCN was even five times faster than Kepler in compute. Nobody ever talked about it, not even AMD themselves, it seemed. When did you ever see a 5x lead over the competition? Never. And today all we hear is how far 'behind' AMD is.
To me GCN is the best GPU architecture ever made, and the power it draws translates into performance. I think AMD makes big changes less often, but when they do, there is a good chance they take the lead for some time.

However, here are my assumptions for why I do not expect next-gen consoles to have faster RT than RTX:
* CUs handling the outer traversal loop hurts twofold: 1. CUs are busy with RT and less available for other work. 2. A fixed-function block handling this would be faster. (The patent mentions this option as optional; a minimal traversal sketch follows this post.)
* NV's implementation seems as simple (and restricted) as possible. It's unlikely a better architecture would beat its performance (if it could, they did it badly). Only a better process or added functionality like reordering could.
* AMD now has the better process, but somehow I think they will lag behind, as they tend to with all fixed-function functionality (e.g. rasterization, tessellation).
* Consoles have to be small and cheap, Moore's law is dying, and 4K alone increases demands to stupid levels... Spending too much chip area to get a leap in RT might not be worth it, if it is possible at all. If MS/Sony were aiming for this, they would not market around the 4K argument.

Note that I do not feel pessimistic here myself. In contrast to most game developers I depend on compute performance the most, which is why I see things differently and have different hopes and expectations.
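
To make the first point concrete, here is a minimal CPU-side sketch of that split, assuming a plain binary BVH. This is my own illustration of the idea, not AMD's actual design: the outer while-loop is the work that would keep the CUs busy, and the box test (done in plain software here) is the part a fixed-function block could take over.

// Illustrative sketch only (my reading of the idea, not AMD's design).
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

struct Ray  { float o[3], d[3], tMax; };
struct Node { float bmin[3], bmax[3]; bool leaf; uint32_t child[2]; };

// Stand-in for the fixed-function box-test block: a standard slab test.
static bool intersectNode(const Node& n, const Ray& r)
{
    float tNear = 0.0f, tFar = r.tMax;
    for (int i = 0; i < 3; ++i)
    {
        float inv = 1.0f / r.d[i];
        float t0 = (n.bmin[i] - r.o[i]) * inv;
        float t1 = (n.bmax[i] - r.o[i]) * inv;
        if (t0 > t1) std::swap(t0, t1);
        tNear = std::max(tNear, t0);
        tFar  = std::min(tFar,  t1);
    }
    return tNear <= tFar;
}

// Outer traversal loop: this is the work that occupies the programmable units.
bool traceRay(const std::vector<Node>& bvh, const Ray& ray)
{
    uint32_t stack[64];
    int top = 0;
    stack[top++] = 0;                        // push the root node
    while (top > 0)
    {
        const Node& node = bvh[stack[--top]];
        if (!intersectNode(node, ray))       // candidate for fixed-function HW
            continue;
        if (node.leaf)
            return true;                     // simplified: any leaf hit ends the ray
        stack[top++] = node.child[0];
        stack[top++] = node.child[1];
    }
    return false;
}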
 
Is it that optimistic? GCN was certainly far better than Kepler and perhaps somewhat better than Paxwell. Nvidia GPUs seem to have more "gotchas" that pop up and hamper performance in newer games: compute in Kepler, concurrent graphics and compute as well as HDR in Paxwell, integer scaling only on Turing, etc.
Does anyone here think that integer scaling is really a hardware limitation on older nVidia cards, though? There's a user-controlled sharpness/softness filter in the DSR settings on all cards that support that feature, so they clearly have enough control over the scaling and output to enable such a feature without whatever extra hardware Turing cards have that "enables" it. The implementation may be different, but I can't believe that it would be impossible.
 
Does anyone here think that integer scaling is really a hardware limitation on older nVidia cards, though? There's a user-controlled sharpness/softness filter in the DSR settings on all cards that support that feature, so they clearly have enough control over the scaling and output to enable such a feature without whatever extra hardware Turing cards have that "enables" it. The implementation may be different, but I can't believe that it would be impossible.
I think it is just NV being arbitrary.
 
Does anyone here think that integer scaling is really a hardware limitation on older nVidia cards, though? There's a user-controlled sharpness/softness filter in the DSR settings on all cards that support that feature, so they clearly have enough control over the scaling and output to enable such a feature without whatever extra hardware Turing cards have that "enables" it. The implementation may be different, but I can't believe that it would be impossible.

I don't think it's impossible. I think it may incur a performance hit if/when they do enable it. Turning DSR sharpness down to 0 is not integer scaling, if I'm remembering the DSR launch properly.
 
Huh? 'Integer scaling'? You mean that 'feature' enabling playing PacMan in glorious 4K without blurring pixels?
When I heard about that I laughed and thought you can really sell anything to people if you give it a fancy name.
What am I missing here? What's the difference from unfiltered texture access?
Is my life still worth living without integer scaling?
 
I don't think it's impossible. I think it may incur a performance hit if/when they do enable it. Turning DSR sharpness down to 0 is not integer scaling, if I'm remembering the DSR launch properly.
I don't think it will have a performance hit, because what's happening is the same as when you enable GPU scaling. Instead of the GPU rendering and outputting whatever resolution and having the display scale the output to fit itself, the GPU renders the resolution defined by the application, scales it to the native display resolution, and then outputs that. From personal testing I've found only margin-of-error performance differences with GPU scaling enabled or disabled. The only things that would change would be limiting render resolutions to those the display's fixed resolution is an integer multiple of, and not applying a filter to the scaled image.
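
For what it's worth, 'integer scaling' really is just nearest-neighbour replication by a whole-number factor. A minimal sketch of the operation itself (my own toy code, nothing vendor specific), which also shows why there is no obvious reason for a meaningful performance hit:

// Toy sketch: integer (nearest-neighbour) upscale of an RGBA8 image by a
// whole-number factor. Each source pixel becomes a factor x factor block of
// identical pixels, so no filtering or blurring is introduced.
#include <cstdint>
#include <vector>

std::vector<uint32_t> integerScale(const std::vector<uint32_t>& src,
                                   int srcW, int srcH, int factor)
{
    const int dstW = srcW * factor;
    const int dstH = srcH * factor;
    std::vector<uint32_t> dst(size_t(dstW) * dstH);
    for (int y = 0; y < dstH; ++y)
        for (int x = 0; x < dstW; ++x)
            dst[size_t(y) * dstW + x] = src[size_t(y / factor) * srcW + (x / factor)];
    return dst;
}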

I understand that DSR sharpness is not the same as integer scaling; I was just using it as evidence that the output filter isn't a fixed value, it's adjustable.

I'm personally only really interested in integer scaling for laptops. I currently have a laptop with an i7 and a mobile 1050ti, and it plays most games fine at medium settings at 1080p (the native screen resolution), but newer games are really starting to push it. It would be really nice to have a 4K display for browsing, watching movies and normal computer stuff, but also nice to have the option to play games at 1080p or even 720p without them looking blurry. A few months ago I looked at upgrading the screen in my laptop for just that reason, because 4K replacement screens are only about $150, but the lack of integer scaling would really hurt the experience, I think.
 
I think custom CBR is peanuts in comparison to custom RT, even with current RT likely being simple and without complex reordering like ImgTech had.
I don't think it's worth it for Sony to make this investment, considering nobody will deliver a 'good' RT solution anyway, because the RT algorithm has a bad memory access pattern which can't be fixed.
Even PowerVR ray tracing can't be a good RT solution?

However, here are my assumptions for why I do not expect next-gen consoles to have faster RT than RTX:
* CUs handling the outer traversal loop hurts twofold: 1. CUs are busy with RT and less available for other work. 2. A fixed-function block handling this would be faster. (The patent mentions this option as optional.)
So how many rays/sec do you expect? 5 Giga rays/sec (RTX 2060)? 7 Giga rays/sec (RTX 2070)?
 
Even PowerVR ray tracing can't be a good RT solution?

Which specific ImgTec PowerVR ray tracing are you talking about: only their publicly released version, or the version(s) still in R&D?
 
Which specific ImgTec PowerVR ray tracing are you talking about: only their publicly released version, or the version(s) still in R&D?
Both versions.

Since he said "considering nobody will deliver a 'good' RT solution anyway", does he imply that none of the PowerVR ray tracing solutions will be good?
 
Even PowerVR ray tracing can't be a good RT solution?
They use a HW unit for reordering, and reordering is the only way to address the bad memory access pattern.
But it also adds a lot of constant cost, so NV's approach seems preferable to get started, IMO.
We should approach this with some patience. For example, the progress in denoising was not there yet when ImgTech made their RT GPUs, and neither was the idea of traversal shaders for LOD, AFAIK.
So we can expect progress from other directions, and HW reordering would not help with coherent and short rays (sharper reflections / shadows from smaller man-made light sources, AO).
I'm afraid HW reordering would result in underutilized chip area, and I hope we can address this with a software solution instead in the future.
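
For anyone wondering what 'reordering' means in practice, here is a deliberately simple software illustration (my own sketch, not ImgTech's or anyone's actual hardware scheme): sort rays by a coarse key built from their direction signs, so rays heading the same way - and therefore touching similar BVH nodes - get traced back to back and reuse the cache better. Real schemes also bin by origin and hit point, and do it in hardware mid-traversal.

// Toy software "ray reordering": group rays by direction octant before tracing.
#include <algorithm>
#include <cstdint>
#include <vector>

struct Ray { float ox, oy, oz, dx, dy, dz; };

// Coarse 3-bit key from the sign of each direction component.
static uint32_t directionOctant(const Ray& r)
{
    return (r.dx < 0 ? 1u : 0u) | (r.dy < 0 ? 2u : 0u) | (r.dz < 0 ? 4u : 0u);
}

void reorderRays(std::vector<Ray>& rays)
{
    std::sort(rays.begin(), rays.end(),
              [](const Ray& a, const Ray& b) {
                  return directionOctant(a) < directionOctant(b);
              });
}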

ImgTech also has HW BVH build. That's more promising eventually, but I don't know how much of a bottleneck this really is with RTX, which uses compute and some CPU for it. Everybody says it's no big issue.

Mobile GPUs are less powerful, so having FF units for everything makes more sense there. But the less FF we use, the more flexibility we have, and the better we can distribute chip area to what we really need (or don't) in a certain game.

To get a 'good' RT solution we would need a totally unrealistic leap in memory technology, so my rating is not meant as a critique.


So how many rays/sec do you expect? 5 Giga rays/sec (RTX 2060)? 7 Giga rays/sec (RTX 2070)?
If I have to answer, 4-6. No idea.
The problem is that the rays/sec number is scene-dependent and thus pointless, because there is no standard scene and settings people use to get this number.
It will also be impossible to compare with RTX, assuming traversal shaders become a thing and the feature sets become too different.
 
To get a 'good' RT solution we would need a totally unrealistic leap in memory technology, so my rating is not meant as a critique.
Is it unrealistic? Or do we require a memory system that has very good random block performance? Or a cache size just large enough to fit the data structures required by RT?

The only nice thing about consoles is that the manufacturers get to develop the entire system end to end to support a feature. They can build and put things wherever they want to support better ray tracing performance. This is something that individual vendors cannot do.

In the Hellblade thread I believe you wanted to discuss whether it would be possible. VRS Tier 2 can support a variety of pixel groupings and sampling areas. If you do VRS before you do ray tracing, you could in theory shoot far fewer rays, because you're now covering 16x16 pixels in the background areas with one ray, and then getting more granular for the in-focus items.
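
As a toy illustration of that idea (my own sketch, not a shipped technique): take a VRS-style rate image with one entry per screen tile and scale the ray budget down by the square of the coarse-pixel size, so a 4x4-coarse background tile gets roughly 1/16th of the rays of a full-rate tile.

// Toy sketch: derive a per-tile ray budget from a VRS-style rate image.
#include <algorithm>
#include <cstdint>
#include <vector>

// rateImage holds the coarse pixel size per screen tile, e.g. 1, 2 or 4.
std::vector<uint32_t> rayBudgetPerTile(const std::vector<uint8_t>& rateImage,
                                       uint32_t raysPerFullRateTile)
{
    std::vector<uint32_t> budget(rateImage.size());
    for (size_t i = 0; i < rateImage.size(); ++i)
    {
        uint32_t coarse = rateImage[i];                                // 1 => 1x1, 4 => 4x4
        budget[i] = std::max(1u, raysPerFullRateTile / (coarse * coarse));
    }
    return budget;
}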

Next gen may not necessarily be about massive processing power, but about newer ways to dramatically reduce workloads.

You can use my previous post on VRS if you need direct access to the sources here; a minimal caps-query sketch also follows the quoted spec below.

Highlights are mine below.
***
Coarse pixel size support
The shading rates 1x1, 1x2, 2x1 and 2x2 can be requested on all tiers.

There is a cap, AdditionalShadingRatesSupported, to indicate whether 2x4, 4x2, and 4x4 are available on the device.

Screen Space Image (image-based):
On Tier 2 and higher, pixel shading rate can be specified by a screen-space image.

The screen-space image allows the app to create an “LOD mask” image indicating regions of varying quality, such as areas which will be covered by motion blur, depth-of-field blur, transparent objects, or HUD UI elements. The resolution of the image is in macroblocks, not the resolution of the render target. In other words, the subsampling data is specified at a granularity of 8x8 or 16x16 pixel tiles as indicated by the VRS tile size.

Tile size
The app can query an API to know the supported VRS tile size for its device.

Tiles are square, and the size refers to the tile’s width or height in texels.

If the hardware does not support Tier 2 variable rate shading, the capability query for the tile size will yield 0.

If the hardware does support Tier 2 variable rate shading, the tile size is one of

  • 8
  • 16
  • 32
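***

A minimal sketch of the caps query the spec describes, using the structure and enum names from the public D3D12 headers; it also shows how the rate image dimensions follow from the reported tile size (one texel per tile). Error handling and device creation are omitted.

// Sketch: query D3D12 variable rate shading support (tier, tile size, extra rates).
#include <d3d12.h>

void QueryVrs(ID3D12Device* device, UINT renderWidth, UINT renderHeight)
{
    D3D12_FEATURE_DATA_D3D12_OPTIONS6 opt6 = {};
    if (FAILED(device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS6,
                                           &opt6, sizeof(opt6))))
        return;                                              // VRS not reported at all

    // Tier is NOT_SUPPORTED, TIER_1 or TIER_2; only Tier 2 has the rate image.
    if (opt6.VariableShadingRateTier != D3D12_VARIABLE_SHADING_RATE_TIER_2)
        return;

    BOOL extraRates = opt6.AdditionalShadingRatesSupported;   // 2x4, 4x2, 4x4 available?
    UINT tileSize   = opt6.ShadingRateImageTileSize;          // 8, 16 or 32 texels, square

    // One texel of the screen-space rate image covers one tileSize x tileSize tile.
    UINT rateImageW = (renderWidth  + tileSize - 1) / tileSize;
    UINT rateImageH = (renderHeight + tileSize - 1) / tileSize;

    (void)extraRates; (void)rateImageW; (void)rateImageH;
}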
 
We already have importance sampling for ray counts. I don't think VRS will bring anything significant in that regard, save perhaps ease of implementation. Bear in mind VRS isn't an improvement but a compromise, reducing detail akin to lossy compression. It's 'good enough' and an improvement in overall terms, but you are losing quality to gain framerate/resolution/improvements elsewhere, which can be written into your ray tracing pipeline in other ways.
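
For context, one concrete flavour of that is adaptive sample allocation: spend more rays per pixel where a variance estimate says the signal is noisy. A toy sketch of my own, not taken from any particular renderer:

// Toy sketch: scale per-pixel ray counts between minRays and maxRays by
// normalized variance, so noisy pixels get more samples.
#include <algorithm>
#include <cstdint>
#include <vector>

std::vector<uint32_t> allocateRays(const std::vector<float>& variance,  // per pixel
                                   uint32_t minRays, uint32_t maxRays)
{
    float maxVar = 1e-6f;
    for (float v : variance) maxVar = std::max(maxVar, v);

    std::vector<uint32_t> rays(variance.size());
    for (size_t i = 0; i < variance.size(); ++i)
    {
        float t = variance[i] / maxVar;                     // 0..1, noisier => larger
        rays[i] = minRays + uint32_t(t * float(maxRays - minRays) + 0.5f);
    }
    return rays;
}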
 
@iroboto
If you are just sending a ray for that large, coarsely shaded region you could end up with a lot of variability or noise, requiring a much more guided denoiser.
Think of a dark corner in a game with RT GI: if that large screen-space portion gets a ray that happens to hit sky although the corner is primarily dark, it would be a non-representative white sample in a dark area. Not sure that is very workable if you then have to shift the cost to a denoiser, which is imperfect.

If you look at Battlefield V though, it works pretty OK there with specular reflections... And also a bit for shadows in Shadow of the Tomb Raider... There they send more rays at certain regions depending on how the shadows lie, on ultra.

GI though seems messy.
 
We already have importance sampling for ray counts. I don't think VRS will bring anything significant in that regard, save perhaps ease of implementation. Bear in mind VRS isn't an improvement but a compromise, reducing detail akin to lossy compression. It's 'good enough' and an improvement in overall terms, but you are losing quality to gain framerate/resolution/improvements elsewhere, which can be written into your ray tracing pipeline in other ways.
Hmm, I didn't realize we had importance sampling for ray counts.

VRS is just a quality degrader without a doubt. But in my mind that works well with the way things are headed. Due to the limit of power available, the best place to get started is to just do less work.

Any scenario where the API makes it easier to direct where developers should and should not put resources is a win in my book. Ease of implementation and flexibility equate to adoption, something we've seen a lack of for other features released in the past (tiled resources).
 
@iroboto
If you are just sending a ray for that large, coarsely shaded region you could end up with a lot of variability or noise, requiring a much more guided denoiser.
Think of a dark corner in a game with RT GI: if that large screen-space portion gets a ray that happens to hit sky although the corner is primarily dark, it would be a non-representative white sample in a dark area. Not sure that is very workable if you then have to shift the cost to a denoiser, which is imperfect.

If you look at Battlefield V though, it works pretty OK there with specular reflections... And also a bit for shadows in Shadow of the Tomb Raider... There they send more rays at certain regions depending on how the shadows lie, on ultra.

GI though seems messy.
This part is still unclear to me, so correct me if I've misunderstood the documentation. Originally, with DXR 1.0:
Then, just as rasterization is invoked by Draw() and compute is invoked via Dispatch(), raytracing is invoked via DispatchRays(). DispatchRays() can be called from graphics command lists, compute command lists or bundles.

So there is that separation between calls, which means you'd be forced to do denoising.

But with 1.1:
Tier 1.1 implementations also support a variant of raytracing that can be invoked from any shader stage (including compute and graphics shaders), but does not involve any other shaders - instead processing happens logically inline with the calling shader. See Inline raytracing.

Tier 1.1 implementations also support GPU initiated DispatchRays() via ExecuteIndirect().

So perhaps a more hybrid approach can be used here without relying on denoising.
Tier 1.1 seems to give developers the option of (a) inline raytracing inside another shader: so perhaps use the ray to calculate a light value (or values) for the coarsely shaded region within a compute or graphics shader, and let that shader handle the rest of the region.

Or (b) leverage ExecuteIndirect:
It seems harmonious with ExecuteIndirect to invoke multiple shader calls on the GPU side, gather the results, and send them into a different shader for shading, all without the intervention of the CPU.

I don't know if this is optimal by any means, but it appears to be an additional option opened up to developers over 1.0.
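
For option (b), here is a minimal host-side sketch of what that could look like, assuming a GPU pass writes a D3D12_DISPATCH_RAYS_DESC into an argument buffer; the structure and enum names are from the DXR 1.1 era D3D12 headers, while the argument buffer, bound raytracing state object and error handling are assumed or omitted.

// Sketch: command signature whose only argument is a DispatchRays descriptor,
// so the ray launch parameters can come straight from GPU-written memory.
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

ComPtr<ID3D12CommandSignature> MakeDispatchRaysSignature(ID3D12Device* device)
{
    D3D12_INDIRECT_ARGUMENT_DESC arg = {};
    arg.Type = D3D12_INDIRECT_ARGUMENT_TYPE_DISPATCH_RAYS;

    D3D12_COMMAND_SIGNATURE_DESC desc = {};
    desc.ByteStride       = sizeof(D3D12_DISPATCH_RAYS_DESC);
    desc.NumArgumentDescs = 1;
    desc.pArgumentDescs   = &arg;

    ComPtr<ID3D12CommandSignature> sig;
    device->CreateCommandSignature(&desc, nullptr, IID_PPV_ARGS(&sig));
    return sig;
}

// Later, with the raytracing state object already bound on the command list:
//   cmdList->ExecuteIndirect(sig.Get(), 1, argBuffer, 0, nullptr, 0);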
 
Is it unrealistic? Or do we require a memory system that has very good random block performance? Or a cache size just large enough to fit the data structures required by RT?
All of this would help, but hardware alone will not solve the problem. At some point we will need reordering to keep the work-efficient traversal algorithm, even improve it by sharing common work, and also get more coherent memory access, all in one. Complicated, and likely there will be no single best solution that fits all needs anytime soon.

VRS Tier 2 can support a variety of pixel groupings and sampling areas. If you do VRS before you do ray tracing, you could in theory shoot far fewer rays, because you're now covering 16x16 pixels in the background areas with one ray, and then getting more granular for the in-focus items.
VRS (and any other upscaling / dynamic resolution approach) is useful, but it is also counterproductive for anything that requires sampling to get a solution. All those techniques reduce the sample count and with it the effectiveness of denoising. In theory the net win for RT becomes zero.
So VRS and similar methods are more a solution for ever-increasing display resolutions than for lighting problems.

But it depends on what you do with RT. Shadows from smaller light sources / sharper reflections suffer less than noisy GI or AO.
So we arrive at the same conclusion here as when whining about memory issues due to incoherent rays: RT is powerful and great for shadows (also for area lights) and sharp reflections, but it becomes inefficient for GI / glossy reflections.
And this can't be fixed efficiently with hardware - it's just the RT algorithm itself that has this property.

Next gen may not necessarily be about massive processing power, but about newer ways to dramatically reduce workloads.
But this is not restricted to next gen; it applies to all software at all times.
And I think in games the focus is too much on low-level optimizations - there seems to be a general assumption that any problem has to be solved with hardware features, which is always about low-level optimization but never more than that.
This is really confusing to me sometimes, but surely that's because I do not actually work on games with release dates, just on potential future technology.

In the Hellblade thread I believe you wanted to discuss whether it would be possible.
No, I only wanted to hear if other people can spot some RT going on or not. I'm not the best at detecting revealing artifacts myself :)
 