Current Generation Hardware Speculation with a Technical Spin [post GDC 2020] [XBSX, PS5]

It seems like XSX would have half the ML acceleration of a 2060. But again, we don't know how relatively performant an AMD/MS DLSS-like solution would be.
There has been research into using INT4 for scaling, so there may be enough performance on XSX compared to DLSS using INT8.

Also recall that MS mentioned resolution scaling in the context of ML at Hot Chips, so it's definitely something they've been looking at.
 
Can they efficiently do Machine Learning with FP16?
 
Right, but a lot of these discussions are framed in general terms: how does ML acceleration work with regards to resolution upscaling, and is it worth it? As opposed to the important question: how does AMD's ML acceleration specifically work with regards to resolution upscaling, and is it good? And the reality is we don't know yet, because the hardware/software hasn't been out there long enough to really know.

The speed of the solution will depend more on the developers of the model than on the hardware itself.
Being 2x or 4x slower is not a big deal unless you're running an unlocked framerate. The current DLSS solution takes 2.5 ms or so; 4x that is 10 ms. Once again, tight for 16.6 ms but fair game for 33.3 ms.
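Back-of-the-envelope, using the ~2.5 ms figure above and a purely hypothetical 4x slowdown for a compute-only path:

```python
# Rough frame-budget math for an ML upscaling pass.
# The 2.5 ms DLSS cost is the rough figure quoted above; the 4x slowdown for a
# compute-only implementation is just an assumption for illustration.
dlss_cost_ms = 2.5
slowdown = 4.0
upscale_cost_ms = dlss_cost_ms * slowdown  # 10 ms

for fps, budget_ms in [(60, 16.6), (30, 33.3)]:
    share = upscale_cost_ms / budget_ms * 100
    print(f"{fps} fps: {upscale_cost_ms:.1f} ms of a {budget_ms} ms frame ({share:.0f}%)")
```

So the hypothetical pass eats roughly 60% of a 60 fps frame but only about 30% of a 30 fps frame, which is the point being made about locked framerates.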
 
On Ampere the Tensor cores can run simultaneously with both the RT cores and the CUDA cores. On Turing you could only run any two of the three at once, I believe. Nvidia calls it "Second Gen Concurrency", and the following slide shows the impact it has in Wolfenstein:

https://www.kitguru.net/components/...ss/nvidia-rtx-3080-founders-edition-review/2/

Maybe this is why they're implementing GDDR6X?
Then again, GA102 seems to be power limited most of all, and I don't know if running Tensor+RT+CUDA in parallel forces a downclock.
 
What's the scaling order of throughput though? Assuming RPM, is the following accurate?

1 FP32 operation ~= 2 FP16 operations ~= 4 INT8 operations ~= 8 INT4 operations
If you're strictly only looking at RPM, yeah, but those are fixed-precision calculations.
You also want to look at mixed precision, as you're actually keeping the quality while increasing the performance. If you drop all the way down to fixed INT4, you are losing quality while gaining performance. But you're not going to hit the fixed theoretical number.
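For reference, a minimal sketch of that idealised packed-math scaling, assuming the widely published ~12.15 TFLOPS FP32 figure for XSX (as noted above, real mixed-precision workloads won't hit these theoretical peaks):

```python
# Idealised RPM/packed-math scaling: each halving of precision doubles the ops
# per register, so peak rates scale 1x / 2x / 4x / 8x from the FP32 baseline.
fp32_tflops = 12.15  # Xbox Series X published FP32 peak

for fmt, mult in {"FP32": 1, "FP16": 2, "INT8": 4, "INT4": 8}.items():
    print(f"{fmt}: ~{fp32_tflops * mult:.1f} peak T(FL)OPS")
```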
 

Right, mostly wanted to confirm that using FP16 would be half the speed of INT8.
 
Yea which is fine ;)
I think most models will still largely be FP16, imo. Perhaps I'm old fashioned. I don't know how much of a mixed network will drop as low as INT4. That's like... super low precision.
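To put a number on how coarse that is, here's a purely illustrative sketch (random weights, nothing to do with any real model) comparing the rounding error of symmetric quantization at INT8 versus INT4:

```python
import numpy as np

# Illustrative only: quantize a random weight tensor symmetrically and compare
# how much resolution is lost at INT8 (-127..127) vs INT4 (-7..7).
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.5, size=10_000).astype(np.float32)

def mean_quant_error(w, bits):
    levels = 2 ** (bits - 1) - 1                # 127 for INT8, 7 for INT4
    scale = np.abs(w).max() / levels
    q = np.clip(np.round(w / scale), -levels, levels)
    return float(np.abs(w - q * scale).mean())  # mean absolute rounding error

for bits in (8, 4):
    print(f"INT{bits}: mean abs error ~{mean_quant_error(weights, bits):.4f}")
```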
 
oh ok.. well then.
I mean, you need to have apples-to-apples conversations and settle on some common ground.
Tensor Cores take up silicon budget and are generally not used in a large number of games. And when they are used, they are only used for a small fraction of the frame.

All RTX owners have paid a massive premium for silicon that is largely underused. Compare that to a console, where the silicon is being used nearly 100% of the time, as that is the default behaviour for all developers. We are now only coming to a discussion point of how much the traditional rasterizer pipeline will be used over compute.

There's no comparison that needs to be made really; the only question that needs to be asked is whether it's fast enough to run an ML solution on compute with better quality and better performance vs checkerboarding/temporal injection etc. And that's more of a software development issue than it is a hardware problem. It's more than capable, I think.

People need to get out of the mindset that tensor cores are required to run neural networks. We've been running them on CPUs and GPUs since well before tensor cores arrived.
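As a trivial illustration of that point, a fully-connected layer is just a matrix multiply plus an activation, which any compute path can run; a minimal NumPy sketch (shapes and values are arbitrary):

```python
import numpy as np

# One fully-connected layer in FP16: matmul + bias + ReLU. Nothing here needs
# tensor cores; this is the same math a compute shader (or packed-FP16 CU path)
# would execute.
x = np.random.rand(1, 256).astype(np.float16)    # input features
w = np.random.rand(256, 128).astype(np.float16)  # layer weights
b = np.zeros(128, dtype=np.float16)              # bias

out = np.maximum(x @ w + b, 0)                   # ReLU activation
print(out.shape)                                 # (1, 128)
```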

Yeah, I guess that the Tensor cores in Turing are more like an investment for the future. We should see more games using them going forward. I agree with the highlighted part, but some people answer that with "Yeah, but having to run the ML upscaling step on the CUs is robbing the GPU of cores that could be used for actual traditional rendering tasks."
 
I don't know why some people think that TOPS and INT4/INT8 figures measured on a PC can be directly compared to the closed and more efficient architecture of the consoles. Anyway, MS stated that with minimal silicon modification, 10x more effective ML and 10x more effective raytracing can be achieved on the Series consoles. This is why the super resolution technique can be very efficient; e.g. it may use only 1 CPU core, or maybe it can work efficiently on the GPU alone.
 
According to RGT, they have designed (or co-designed) their own Geometry Engine, and it will be included in RDNA3. This is most probably what Cerny was talking about in his Road to PS5.

They also have their own version of VRS (using the new GE) which should be a better way to do it than MS's VRS.
He is a fanboy, looking at his post history, including this tweet. He literally did make a statement that racists gravitate to Xbox. Wtf.

Another gem:
 
That job requirement list is ... lol
Good luck

It's just asking for your standard expert specialised ML researcher, game developer, senior game tech lead, Unreal developer, mathematician, passionate about AAA games type person.

P. S. Must be a team player, and mentor everyone else on the team.

Don't see the problem. It's not like they're asking for a lot. :nope:
 
Probably doesn't have faster INT4 and INT8 computation capabilities like XSX that you can use in ML.

Why would you think that?

May I quote from the RDNA white paper?

https://www.amd.com/system/files/documents/rdna-whitepaper.pdf

"To accommodate the narrower wavefronts, the vector register file has been reorganized. Each vector general purpose register (vGPR) contains 32 lanes that are 32-bits wide, and a SIMD contains a total of 1,024 vGPRs – 4X the number of registers as in GCN. The registers typically hold single-precision (32-bit) floating-point (FP) data, but are also designed for efficiently RDNA Architecture | 13 handling mixed precision. For larger 64-bit (or double precision) FP data, adjacent registers are combined to hold a full wavefront of data. More importantly, the compute unit vector registers natively support packed data including two half-precision (16-bit) FP values, four 8-bit integers, or eight 4-bit integers."
 