Next Generation Hardware Speculation with a Technical Spin [post E3 2019, pre GDC 2020] [XBSX, PS5]

I'm curious how VRS will impact unstable frame rates. If the screen is filled with fine foliage, there's no area of the screen to rate down, so does VRS do nothing for some framings and a great deal for others?
 
You could probably force it, compression style (maximum bitrate), to apply enough to get the performance gains needed to hold a framerate. You'd run a pass to find low-detail areas where VRS is a good fit, and if there isn't enough low-detail area, change the threshold until there is, getting the equivalent of macroblocking with shader detail. Maybe a big explosion blurs out the detail a bit, and on dense foliage the foliage detail is just reduced, then increased on the same assets when there are fewer of them. You could also just up VRS in the periphery, keeping everything centre-screen sharp and reducing detail towards the edges, foveated-rendering style.
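For what it's worth, that "change the threshold until there is" step is essentially the same feedback loop engines already use for dynamic resolution, just driving a shading-rate knob instead of a render-target size. A minimal host-side sketch in CUDA-flavoured C++, where the vrsBias knob, the gain and the frame budget are all invented for illustration (a real engine would feed something like this into however it builds its shading-rate image):

```cuda
// Hypothetical per-frame controller. Assumes the engine exposes a single
// "vrsBias" value in [0, 1] that its shading-rate pass uses to decide how much
// of the screen is allowed to drop to coarser shading rates.
struct VrsController {
    float bias = 0.0f;                      // 0 = full rate everywhere, 1 = maximum coarsening

    void update(float frameMs, float budgetMs) {
        const float gain = 0.05f;           // invented tuning constant
        float error = (frameMs - budgetMs) / budgetMs;
        bias += gain * error;               // over budget -> coarsen, under budget -> sharpen
        if (bias < 0.0f) bias = 0.0f;
        if (bias > 1.0f) bias = 1.0f;
    }
};
```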
 
Okay, so we might get some really good dynamic VRS replacing the coarse method of dynamic resolution to get stable frame rate.
 

You could also go the foveated approach and put the most LOD in the centre of the screen, near the cursor, perhaps rated by the amount of motion in a particular frame. The latter would be similar to AMD's "Radeon Boost" feature: https://www.pcgamesn.com/amd/radeon-boost-performance-benchmarks
 
Summary? I didn’t get anything from the abstract
Umm... after trying to decipher some sense out of it, I want to forward the question to the experts here.

In the initial motivation they talk about small workloads, where task allocation and scheduling costs cause inefficiency. That's a main problem for me, so I got excited.

Now I think it maybe describes a low-latency way for the CPU to communicate with the GPU, perhaps bypassing something like a clumsy API and its command lists.
But I'm unsure whether the purpose here is to feed the large GPU with compacted workloads built from smaller tasks,
or whether this GPU coprocessor is just there to process unique tasks that are parallel-friendly but too small in number to make sense for a wide GPU.

I'm left in total confusion.
What I want is basically something like NV's task shaders, but dispatching compute with variable workgroup width and keeping some data flow on-chip if possible.
Seems this is something different, but it still sounds interesting...
 
Looks like quite a few patents invented by Ivan Nevraev for Microsoft just went active today.

https://patents.google.com/?inventor=Ivan+Nevraev&after=priority:20160101

Anything of interest? Asking those with more technical know-how than me.

Interesting, there's quite a variety of things in there. Various VRS patents, a game streaming related patent (latency), RT related patents, graphics development related patents, a GPU related patent.

Interesting stuff, but I'll leave it for someone more technical to try to determine what might or might not be applicable to consoles.

Regards,
SB
 
I would say that the patents co-authored with Mark S. Grossman are a safe bet as intended for consoles. He's the Xbox chief GPU architect.
 
Yeah. For me GCN was as much as five times faster than Kepler in compute. It's just that nobody talked about it, not even AMD themselves, it seemed. When did you ever see a 5x lead over the competition? Never. And today all we hear is how far 'behind' AMD is.
To me GCN is the best GPU architecture ever made, and the power it draws translates into performance. I think AMD makes big changes less often, but when they do, there is a good chance they take the lead for some time.
It's been some time since then, so I've probably forgotten many things, but which benchmarks or metrics had a 5x lead? There were some specific use cases like double-precision that I can remember, although that would understandably be of little concern outside of compute like HPC--where AMD's lack of a software foundation negated even leads like that.



This came up in the pre-E3 thread.
https://forum.beyond3d.com/posts/2067755/

I speculated on a few elements of the patent here:
https://forum.beyond3d.com/posts/2069676/

One embodiment is a CU with most of the SIMD resources stripped from the diagram, and other elements like the LDS and export bus removed.
From GCN, it's a loss of 3/4 of the SIMD schedulers, while from Navi it's a loss of 1/2. SIMD-width isn't touched on much, although one passage discusses a 32-thread scenario.

Beyond these changes, the CU is physically organized differently, and its workload is handled differently.
The SIMD is in one embodiment arranged like a dual-issue unit, and there is a tiered register file with a larger 1-read/1-write file and a smaller multi-ported fast register file. There is a register-access unit that can load different rows from each register bank, and a crossbar that can rearrange values from the register file or the outputs of the VALUs. Possibly the loss of the LDS did not remove the hardware involved in handling more arbitrary access to the banked structure, and it was instead repurposed and expanded upon. Efficient matrix transpose operations were noted as a use case for these two rather significant additions to the access hardware.

The workload handling is also notably changed. The scalar path is heavily leveraged to run a persistent thread, which unlike current kernels is expected to run continuously between uses. The persistent kernel monitors a message queue for commands, which it then matches in a lookup table with whatever sequence of instructions need to be run for a task.
The standard path on a current GPU would involve command packets going to a command processor, which then hands off to the dispatch pipeline, which then needs to arbitrate for resources on a CU, which then needs to be initialized, and only then can the kernel start. Completion and a return signal are handled indirectly, partly involving the export/message path and possibly a message/interrupt engine. Subsequent kernels or system requests would need to go through this process each time.

The new path has at least an initial startup, but only for the persistent thread. Once it is running, messages written to its queue can skip past all the hand-offs and into the initial instructions of the task. Its generation of messages might also be more direct than the current way CUs communicate to the rest of the system.
This overall kernel has full access to all VGPRs, so it's at least partly in charge of keeping the individual task contexts separate and needs to handle more of the startup and cleanup that might be handled automatically in current hardware. There's some concurrency between tasks, but from the looks of things it's not going to have as many tasks as a full SIMD would. The scalar path may also see more cycles taken up by the persistent kernel rather than direct computation.
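To make the contrast with the normal dispatch path concrete, here is a very loose sketch of that control flow in CUDA terms: one long-lived workgroup is launched once, its "scalar" thread polls a message queue, each message selects a task routine, and completion is signalled directly. The queue layout, opcodes and task bodies are all invented; the patent does this on the CU's scalar path with a lookup table of code addresses rather than global-memory polling and a switch.

```cuda
#include <cstdint>

struct Message {
    uint32_t opcode;                 // 0 = shut down, otherwise selects a task (hypothetical)
    uint32_t args[7];                // small, task-specific payload
};

__device__ void taskAudioMix(const uint32_t* args) { /* ... per-lane work ... */ }
__device__ void taskDecrypt (const uint32_t* args) { /* ... per-lane work ... */ }

// Launched once with a single workgroup; lives for the duration of the app.
__global__ void persistentWorker(const Message* queue, uint32_t queueSize,
                                 volatile uint32_t* writeIdx,   // advanced by the producer
                                 uint32_t* readIdx,             // advanced here after each task
                                 uint32_t* doneCount)           // completion counter the host can poll
{
    __shared__ Message msg;

    for (;;) {
        if (threadIdx.x == 0) {
            while (*readIdx == *writeIdx) { /* spin: no dispatch/arbitration round trip */ }
            msg = queue[*readIdx % queueSize];
        }
        __syncthreads();                          // hand the message to all lanes

        if (msg.opcode == 0) break;               // shutdown request

        switch (msg.opcode) {                     // stands in for the patent's lookup table
            case 1: taskAudioMix(msg.args); break;
            case 2: taskDecrypt (msg.args); break;
        }
        __syncthreads();

        if (threadIdx.x == 0) {
            atomicAdd(readIdx, 1u);               // consume the message
            atomicAdd(doneCount, 1u);             // direct-ish completion signal
        }
    }
}
```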

There was one possible area of overlap with GFX10 when the patent mentioned shared VGPRs, but this was before the announcements of sub-wave execution, which has a different sort of shared VGPR.
Other than a brief mention of its possibly being narrower than GCN, it's substantially different from GCN and RDNA.

Use cases include packet processing, image recognition, cryptography, and audio. These are cited as workloads that are more latency-sensitive and whose compute and kernel state doesn't change that much.
Sony engineering has commented in the past that audio processing on the PS4 GPU was very limited due to its long latency, and AMD has developed multiple projects for handling this better: TrueAudio, high-priority queues, TrueAudio Next, and the priority tunneling for Navi. This method might be more responsive.
Perhaps something like cryptography might make sense for the new consoles with their much faster storage subsystems, which I would presume to be compressed and encrypted to a significant degree. Not sure GPU hardware would beat dedicated silicon for just that one task.

Other elements, like image recognition and packet processing, might come up in specific client use cases, but I would wonder if this could be useful in HPC as well.
The fast transpose capability is something that might benefit one idea put forward for ray-tracing on an AMD-like GPU architecture (being able to pack/unpack ray contexts to better work around divergence). In this instance, though, it would be less integrated than AMD's TMU-based ray-tracing or even Nvidia's RT cores, since this new kind of CU would be much more separate and may lack portions of standard capability. It's not clear whether such a CU or its task programs would be exposed the same way, as there are various points where an API or microcode could be used rather than direct coding.
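As an aside, the "pack ray contexts to work around divergence" idea boils down to compacting the rays that are still alive into contiguous lanes before the next traversal step. A bare-bones sketch of just that compaction step, written as plain CUDA with global atomics (the kernel name and buffers are invented; the patent's transpose/crossbar hardware would be about shuffling the per-ray state between lanes on-chip, which this does not show):

```cuda
// Compacts the indices of still-active rays into a dense array so the next
// traversal wave runs with full lanes. Order of survivors is not preserved,
// which is fine for a traversal work queue.
__global__ void compactActiveRays(const int* activeFlags,   // 1 if the ray is still alive
                                  const int* rayIds,        // current ray ids
                                  int*       packedIds,     // densely packed output
                                  int*       packedCount,   // number of survivors (zeroed beforehand)
                                  int        numRays)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numRays) return;

    if (activeFlags[i]) {
        int slot = atomicAdd(packedCount, 1);   // grab a dense output slot
        packedIds[slot] = rayIds[i];
    }
}
```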
 
Thanks for clearing up the patent, I knew you would :) Your insights are quite priceless!

It's been some time since then, so I've probably forgotten many things, but which benchmarks or metrics had a 5x lead? There were some specific use cases like double-precision that I can remember, although that would understandably be of little concern outside of compute like HPC--where AMD's lack of a software foundation negated even leads like that.
The benchmark is my own work on realtime GI. The workloads are breadth-first traversals of a BVH / point hierarchy, raytracing for visibility, and building acceleration structures. But it's not comparable to classic raytracing: complexity is much higher and random access is mostly avoided. The general structure of the programs is load from memory, heavy processing using LDS, write to memory. I rarely access memory during the processing phase, and there is a lot of integer math, scan algorithms, and also a lot of bit packing to reduce LDS usage. Occupancy is good, overall 70-80%. It's compute only - no rasterization or texture sampling.
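Roughly, in CUDA terms (shared memory standing in for LDS), the load -> on-chip processing -> store shape described above looks like the sketch below; the trivial inclusive scan is just a stand-in for the real per-node work, and the buffer names are invented.

```cuda
// Generic load -> LDS processing -> store skeleton; launch with 256 threads
// per block. The scan is a placeholder for the heavier integer/scan work.
__global__ void processNodes(const unsigned int* nodesIn,
                             unsigned int* nodesOut,
                             int count)
{
    __shared__ unsigned int lds[256];            // bit-packed per-node working set

    int g = blockIdx.x * blockDim.x + threadIdx.x;
    int l = threadIdx.x;

    // 1. Load: one coalesced read per lane.
    lds[l] = (g < count) ? nodesIn[g] : 0u;
    __syncthreads();

    // 2. On-chip phase: here a simple Hillis-Steele inclusive scan over LDS,
    //    with no further memory traffic.
    for (int offset = 1; offset < blockDim.x; offset *= 2) {
        unsigned int v = (l >= offset) ? lds[l - offset] : 0u;
        __syncthreads();
        lds[l] += v;
        __syncthreads();
    }

    // 3. Store: one coalesced write per lane.
    if (g < count) nodesOut[g] = lds[l];
}
```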

The large AMD lead was constant over many years and APIs (OpenGL, OpenCL 1.2, finally Vulkan). The factor of 5 I remember from the latest test in Vulkan, comparing GTX 670 vs. 7950, two years ago.
Many years ago I bought a 280X to see how 'crappy' AMD performs with my stuff, and I could not believe it destroyed the Kepler Titan by a factor of two out of the box.
At that time I also switched from OpenGL to OpenCL, which helped a lot with NV performance but only a little with AMD. I concluded that neither AMD's hardware nor their drivers are 'crappy' :)
Adding this to the disappointment of the GTX 670 not being faster than the GTX 480, I skipped the following NV generations. I also rewrote my algorithm, which I did on the CPU.
Years later, after porting the results back to GPU (using OpenCL and Vulkan), I saw that even after the heavy changes the performance difference was the same. Rarely does a shader (I have 30-40) show an exception.
I also compared newer hardware: Fury X vs. GTX 1070, and thankfully it showed NV did well. Both cards have the same performance per TF; AMD just offers more TF per dollar. So until I get my hands on Turing and RDNA, I don't know how things have changed further.

Recently I learned Kepler has no atomics to LDS and emulates them through main memory. That's certainly a factor, but it can't be that large - I always tried things like comparing a scan algorithm vs. atomic max and picking the faster one per GPU model (see the sketch below).
So it remains a mystery why Kepler is so bad.
If you have an idea let me know, but it's too late - it seems the 670 died recently :/
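For concreteness, the kind of per-architecture A/B being described might look like the following in CUDA: a workgroup maximum done either with one shared-memory atomic per lane (cheap where LDS atomics are native, emulated through memory on Kepler) or with a log2(N) tree reduction that avoids shared-memory atomics entirely. Both variants are standard; the names are invented.

```cuda
// Variant A: one shared-memory atomic per lane.
__device__ unsigned int groupMaxAtomic(unsigned int value, unsigned int* ldsMax)
{
    if (threadIdx.x == 0) *ldsMax = 0u;
    __syncthreads();
    atomicMax(ldsMax, value);                 // LDS atomic, emulated via memory on Kepler
    __syncthreads();
    return *ldsMax;
}

// Variant B: tree reduction, no shared-memory atomics.
// Assumes blockDim.x is a power of two and ldsBuf holds blockDim.x elements.
__device__ unsigned int groupMaxReduce(unsigned int value, unsigned int* ldsBuf)
{
    ldsBuf[threadIdx.x] = value;
    __syncthreads();
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            ldsBuf[threadIdx.x] = max(ldsBuf[threadIdx.x], ldsBuf[threadIdx.x + s]);
        __syncthreads();
    }
    return ldsBuf[0];
}
```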

One interesting thing is that AMD benefits much more from optimization, and I tried really hard here because GI is quite heavy.
Also, NV seems much more forgiving of random access, and maybe I'm an exception here compared to other compute benchmark workloads.
 
It is AMD's first RT implementation, so I wouldn't be surprised.
AMD has been looking at this for a while. This is a paper from 2014 in which they propose modifying the ALUs for only a 4-8% area increase. They propose 4 traversal units per CU. This is a 1-for-1 match to the number of TMUs, which is exactly where the hardware resides in more recent AMD patents on ray tracing.

https://pdfs.semanticscholar.org/26ef/909381d93060f626231fe7560a5636a947cd.pdf

We propose a high performance, GPU integrated, hardware ray tracing system. We present and make use of a new analysis of ray traversal in axis aligned bounding volume hierarchies. This analysis enables compact traversal hardware through the use of reduced precision arithmetic. We also propose a new cache based technique for scheduling ray traversal. With the addition of our compact fixed function traversal unit and cache mechanism, we show that current GPU architectures are well suited for hardware accelerated ray tracing, requiring only small modifications to provide high performance. By making use of existing GPU resources we are able to keep all rays and scheduling traffic on chip and out of caches.

These are proposed changes to Hawaii (R9 290X). With Navi's enhanced caches, I would think it's already more suitable for the modifications described.
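For orientation, the per-node operation such a fixed-function traversal unit evaluates is the usual ray/AABB slab test; the paper's point is that a conservative, reduced-precision version of it is enough, which is what keeps the unit small. Below is the ordinary full-precision form in CUDA (the reduced-precision rounding the paper relies on is not shown):

```cuda
// Standard slab test: does the ray hit the node's axis-aligned bounding box
// within [0, tMax]? invDir is the precomputed reciprocal of the ray direction.
__device__ bool intersectAABB(float3 orig, float3 invDir,
                              float3 bmin, float3 bmax, float tMax)
{
    float tx0 = (bmin.x - orig.x) * invDir.x;
    float tx1 = (bmax.x - orig.x) * invDir.x;
    float tNear = fminf(tx0, tx1);
    float tFar  = fmaxf(tx0, tx1);

    float ty0 = (bmin.y - orig.y) * invDir.y;
    float ty1 = (bmax.y - orig.y) * invDir.y;
    tNear = fmaxf(tNear, fminf(ty0, ty1));
    tFar  = fminf(tFar,  fmaxf(ty0, ty1));

    float tz0 = (bmin.z - orig.z) * invDir.z;
    float tz1 = (bmax.z - orig.z) * invDir.z;
    tNear = fmaxf(tNear, fminf(tz0, tz1));
    tFar  = fminf(tFar,  fmaxf(tz0, tz1));

    return tNear <= tFar && tFar >= 0.0f && tNear <= tMax;
}
```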

Here's their latest patent:

http://www.freepatentsonline.com/20190197761.pdf

(Attached image: figure from the patent.)
 
These are proposed changes to Hawaii (R9 290X). With Navi's enhanced caches, I would think it's already more suitable for the modifications described.
Reminds me of this paper, which also took this architecture as an example: https://pdfs.semanticscholar.org/26ef/909381d93060f626231fe7560a5636a947cd.pdf
I don't know if the work is directly related to AMD.

I bet that NV was 'looking at it for a while' before going into production with their RT.
Yeah, they have tons of RT experience, and more research / software experience than AMD in general.
This led me to the initial assumption that RTX must be very advanced, including reordering to improve both ray and shading coherence, and that AMD could never catch up.
But without all this advanced stuff, all that remains is simple tree traversal and triangle intersection, which is all NV's RT cores are doing; there is no other form of hardware acceleration, as far as we know.
And this very likely means AMD can catch up easily. It also means RT does not waste too much chip area just for that, so it makes more sense to begin with anyway.

Still, we don't know what MS / Sony want their RT solutions to look like. The latter especially could aim for flexibility, because abstraction has not been their goal so far. And from MS there are already publicly proposed DXR extensions requiring more flexible hardware.
I expect the differences to be too large for a fair comparison with first-gen RTX - flexibility will have a cost even if it isn't used, e.g. in a raw MRays/sec test.
 