Current Generation Hardware Speculation with a Technical Spin [post GDC 2020] [XBSX, PS5]

Discussion in 'Console Technology' started by Proelite, Mar 16, 2020.

Thread Status:
Not open for further replies.
  1. TheAlSpark

    TheAlSpark Moderator
    Moderator Legend

    Joined:
    Feb 29, 2004
    Messages:
    22,146
    Likes Received:
    8,533
    Location:
    ಠ_ಠ
    There has been research into using INT4 for upscaling, so there may be enough performance on XSX to compete with DLSS, which uses INT8.

    Also recall that MS mentioned resolution scaling in the context of ML at Hot Chips, so it's definitely something they've been looking at.
     
    Silent_Buddha, RagnarokFF and scently like this.
  2. mpg1

    Veteran

    Joined:
    Mar 5, 2015
    Messages:
    2,250
    Likes Received:
    1,996
    They already showed an example over a year ago with their DirectML Super Resolution:

     
  3. Globalisateur

    Globalisateur Globby
    Veteran Subscriber

    Joined:
    Nov 6, 2013
    Messages:
    4,592
    Likes Received:
    3,411
    Location:
    France
    Can they efficiently do Machine Learning with FP16?
     
  4. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,833
    Likes Received:
    18,633
    Location:
    The North
    The speed of the solution will be more dependent on the developers of the model than on the hardware itself.
    Being 2x or 4x slower is not a big deal unless you're running an unlocked framerate. The current DLSS solution takes around 2.5 ms or so; 4x that is 10 ms. Again, tight for 16.6 ms but fair game for 33.3 ms.
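As a back-of-envelope check (the ~2.5 ms DLSS cost and the 4x slowdown are the thread's ballpark figures, not measured numbers), the frame-budget arithmetic looks like this:

```python
# Rough frame-budget check: does an ML upscaling pass that is 4x
# slower than DLSS's ~2.5 ms still fit in a frame? Figures are
# ballpark estimates from the discussion, not benchmarks.
DLSS_COST_MS = 2.5   # approximate DLSS pass cost on tensor cores
SLOWDOWN = 4         # assumed penalty for running on shader ALUs

upscale_ms = DLSS_COST_MS * SLOWDOWN   # 10.0 ms

for fps, budget_ms in [(60, 16.6), (30, 33.3)]:
    remaining = budget_ms - upscale_ms
    print(f"{fps} fps: {upscale_ms:.1f} ms pass leaves "
          f"{remaining:.1f} ms for rendering")
```

At 60 fps the pass eats most of the budget; at 30 fps it leaves over 23 ms for rendering, which matches the "fair game for 33.3 ms" call above.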
     
  5. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    9,236
    Likes Received:
    4,259
    Location:
    Guess...
    That's just DLSS running through DirectML. Note the thanks to Nvidia for "supplying the model". It's also run on Nvidia hardware/tensor cores.
     
    TheAlSpark likes this.
  6. Maybe this is why they're implementing GDDR6X?
    Then again, GA102 seems to be power limited most of all, and I don't know if running Tensor+RT+CUDA in parallel forces a downclock.
     
  7. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,833
    Likes Received:
    18,633
    Location:
    The North
    Yes.
     
  8. BRiT

    BRiT (>• •)>⌐■-■ (⌐■-■)
    Moderator Legend Alpha

    Joined:
    Feb 7, 2002
    Messages:
    20,511
    Likes Received:
    24,410
    What's the scaling order of throughput though? Assuming RPM, is the following accurate?

    1 FP32 op ~= 2 FP16 ops ~= 4 INT8 ops ~= 8 INT4 ops
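Plugging in Microsoft's quoted 12.15 TFLOPS FP32 for Series X, that 1:2:4:8 scaling works out as follows (theoretical peak rates only; real kernels won't hit them):

```python
# Peak-rate scaling under RPM / packed math, using the Series X's
# disclosed 12.15 TFLOPS FP32 as the baseline.
FP32_TFLOPS = 12.15

rates = {
    "FP32": FP32_TFLOPS * 1,   # 12.15 TFLOPS
    "FP16": FP32_TFLOPS * 2,   # 24.3  TFLOPS (RPM)
    "INT8": FP32_TFLOPS * 4,   # 48.6  TOPS
    "INT4": FP32_TFLOPS * 8,   # 97.2  TOPS
}
for precision, rate in rates.items():
    print(f"{precision}: {rate:.2f} T(FL)OPS")
```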
     
  9. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,833
    Likes Received:
    18,633
    Location:
    The North
    If you're strictly looking at RPM, yeah, but those are fixed-precision calculations.
    You also want to look at mixed precision, since that keeps the quality while increasing the performance. If you drop down to fixed INT4, you're losing quality while gaining performance.
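The quality/performance trade-off is easy to see with a toy symmetric quantizer (a sketch for illustration, not any shipping upscaling pipeline): the same weights rounded to a 4-bit grid carry far more error than an 8-bit grid.

```python
# Toy symmetric quantization: snap values to an n-bit signed grid and
# measure the worst-case error. INT4 offers only 16 levels, so its
# error is roughly 16x larger than INT8's over the same value range.
def quantize(xs, bits):
    qmax = 2 ** (bits - 1) - 1             # 127 for INT8, 7 for INT4
    scale = max(abs(x) for x in xs) / qmax
    return [round(x / scale) * scale for x in xs]

weights = [0.91, -0.44, 0.10, 0.73, -0.05, 0.28]
for bits in (8, 4):
    q = quantize(weights, bits)
    err = max(abs(a - b) for a, b in zip(weights, q))
    print(f"INT{bits}: max abs error = {err:.4f}")
```

Mixed precision keeps the error-sensitive layers at FP16 and quantizes only the tolerant ones, which is why it can gain speed without the across-the-board quality loss of fixed INT4.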
     
    #4389 iroboto, Nov 2, 2020
    Last edited: Nov 2, 2020
  10. BRiT

    BRiT (>• •)>⌐■-■ (⌐■-■)
    Moderator Legend Alpha

    Joined:
    Feb 7, 2002
    Messages:
    20,511
    Likes Received:
    24,410
    Right, mostly wanted to confirm that using FP16 would be half the speed of INT8.
     
  11. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,833
    Likes Received:
    18,633
    Location:
    The North
    Yea, which is fine ;)
    I think most models will still largely be FP16, imo. Perhaps I'm old fashioned. I don't know how much of a mixed network will drop as low as INT4. That's like... super low precision.
     
    Alucardx23 and BRiT like this.
  12. Alucardx23

    Regular

    Joined:
    Oct 7, 2009
    Messages:
    549
    Likes Received:
    81
    Yeah, I guess the Tensor cores in Turing are more of an investment in the future. We should see more games using them going forward. I agree with the highlighted part, but some people answer that with "Yeah, but having to run the ML upscaling step on the CUs robs the GPU of cores that could be used for actual traditional rendering tasks."
     
    #4392 Alucardx23, Nov 2, 2020
    Last edited: Nov 2, 2020
  13. QPlayer

    Newcomer

    Joined:
    May 17, 2019
    Messages:
    52
    Likes Received:
    27
    I don't know why some people think that TOPS and INT4/INT8 figures measured on a PC can be directly compared to the closed and more efficient architecture of the consoles. Anyway, MS stated that with minimal silicon modification, 10x more effective ML and 10x more effective raytracing can be achieved on the Series consoles. This is why the super resolution technique can be very efficient, e.g. it may use only 1 CPU core, or maybe it can work efficiently on the GPU alone.
     
  14. RagnarokFF

    Newcomer

    Joined:
    Mar 22, 2020
    Messages:
    57
    Likes Received:
    146
    He's a fanboy, judging by his post history, including this tweet. He literally made a statement that racists gravitate to Xbox. Wtf.


    Another gem:
     
  15. function

    function None functional
    Legend

    Joined:
    Mar 27, 2003
    Messages:
    5,854
    Likes Received:
    4,406
    Location:
    Wrong thread
  16. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,833
    Likes Received:
    18,633
    Location:
    The North
    That job requirement list is ... lol
    Good luck
     
    RagnarokFF and function like this.
  17. cheapchips

    Veteran

    Joined:
    Feb 23, 2013
    Messages:
    2,493
    Likes Received:
    2,665
    Location:
    UK
    "Do all the ML and then tell everyone else how to also do all the ML" :-D
     
  18. function

    function None functional
    Legend

    Joined:
    Mar 27, 2003
    Messages:
    5,854
    Likes Received:
    4,406
    Location:
    Wrong thread
    It's just asking for your standard expert specialised ML researcher, game developer, senior game tech lead, Unreal developer, mathematician, passionate about AAA games type person.

    P. S. Must be a team player, and mentor everyone else on the team.

    Don't see the problem. It's not like they're asking for a lot. :nope:
     
  19. Metal_Spirit

    Regular

    Joined:
    Jan 3, 2007
    Messages:
    632
    Likes Received:
    397
    Why would you think that?

    May I quote from the RDNA white paper?

    https://www.amd.com/system/files/documents/rdna-whitepaper.pdf

    "To accommodate the narrower wavefronts, the vector register file has been reorganized. Each vector general purpose register (vGPR) contains 32 lanes that are 32-bits wide, and a SIMD contains a total of 1,024 vGPRs – 4X the number of registers as in GCN. The registers typically hold single-precision (32-bit) floating-point (FP) data, but are also designed for efficiently RDNA Architecture | 13 handling mixed precision. For larger 64-bit (or double precision) FP data, adjacent registers are combined to hold a full wavefront of data. More importantly, the compute unit vector registers natively support packed data including two half-precision (16-bit) FP values, four 8-bit integers, or eight 4-bit integers."
     