- INT32 cores enable Turing GPUs to execute floating point and non-floating point operations in parallel
- 36% additional throughput for floating point operations.
- “50% improvement in delivered performance per CUDA core”.
- “50% increase in effective bandwidth on Turing compared to Pascal”.
- NVDEC decoder with HEVC YUV444 10/12-bit HDR, H.264 8K and VP9 10/12-bit HDR support
And more...
A bit of a strange decision by Nvidia to keep the number of raster pipes on TU104 the same as on TU102: the fragment scan-out rate will now outpace the ROPs.
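If I have the configurations right (my own numbers, not from the post above): both chips keep 6 GPC raster engines at roughly 16 pixels/clock each, but TU104's 256-bit bus brings it down to 64 ROPs versus 96 on TU102. That would give 6 × 16 = 96 pixels/clock coming out of the rasterizers against only 64 pixels/clock of ROP throughput, so on TU104 the ROPs would be the limiter for pure fill.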
So the question I have: are GPU shader cores already superscalar, or were the Turing ones the first to be made superscalar?
This would imply that a significant chunk of the work done by the shader core consists of integer ops. The compiler could squeeze out extra performance by arranging the compiled shader code to schedule effectively for this.
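As a rough sketch of where that integer work comes from (hypothetical CUDA example, not taken from Nvidia's material): the index and clamping math below compiles to INT32 instructions (IMAD/IADD and friends), while the accumulation compiles to FP32 FFMAs, so on Turing the two streams can occupy the INT32 and FP32 pipes concurrently, whereas on Pascal they compete for the same cores' issue slots.

```cuda
// Hypothetical 3x3 box-filter kernel: addressing work is INT32, math is FP32.
__global__ void mixed_int_fp(const float* __restrict__ in,
                             float* __restrict__ out,
                             int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;    // INT32: index math
    int y = blockIdx.y * blockDim.y + threadIdx.y;    // INT32
    if (x >= width || y >= height) return;

    float acc = 0.0f;
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            int cx = min(max(x + dx, 0), width  - 1); // INT32: clamp
            int cy = min(max(y + dy, 0), height - 1); // INT32: clamp
            acc += in[cy * width + cx] * 0.111111f;   // INT32 address calc feeding an FP32 FFMA
        }
    }
    out[y * width + x] = acc;                         // INT32 addressing, FP32 data
}
```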
Kepler was superscalar, but I don’t think that definition applies to Turing. Based on what Nvidia has shared so far, Volta/Turing can’t issue multiple instructions per clock from the same warp.
So, the new architecture allows FP32 instructions to run at full rate and uses the remaining 50% of issue slots to execute all other instruction types - INT32 for index/loop calculations, load/store, branches, SFU, FP64 and so on. And unlike Maxwell/Pascal, full GPU utilization doesn't require packing pairs of co-issued instructions into the same thread - each successive cycle can execute an instruction from a different thread, so one thread performing a series of FP32 instructions and another thread performing a series of INT32 instructions will load both blocks to 100%.
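To make that concrete, here is a rough issue timeline under my reading of the whitepaper (assuming one instruction issued per clock per SM partition and 16-wide FP32/INT32 units, so each warp-wide instruction occupies its pipe for two clocks):

```
clk 0: issue FP32 from warp A  -> FP32 units busy clk 0-1
clk 1: issue INT32 from warp B -> INT32 units busy clk 1-2
clk 2: issue FP32 from warp A  -> FP32 units busy clk 2-3
clk 3: issue INT32 from warp B -> INT32 units busy clk 3-4
```

Neither warp has to mix instruction types itself, yet both pipes sit at 100%.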
It doesn’t say explicitly that the instructions have to come from different warps. I always assumed it wasn’t necessary, but that may simply be wrong.
Either way: since the FP32 and INT instructions are issued in an alternating way, I don’t think you can call it superscalar?