GPU Ray Tracing Performance Comparisons [2021-2022]

Each SIMD has 16 lanes, but warps are 32 threads which are executed over two cycles (plus pipelining)
Well yeah, but the h/w is 16 wide, which allows for higher granularity of execution on branches, among other things.

Warp/wave widths are a different topic altogether, and they too may become an issue for a pure raytraced future.
 
What do you mean by "the h/w"? Instructions need to be issued to an entire warp at once, which makes the warp the smallest unit of execution, not the SIMD width. Otherwise you might as well call GCN waves 16 wide as well.

If current GPUs were able to pick and choose parts of a warp to execute independently, then the whole subwarp interleaving paper being discussed would be irrelevant.
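
To make that concrete, here's a minimal CUDA sketch (the kernel name, arrays, and launch shape are invented for illustration). Divergence is tracked as a 32-bit lane mask covering the whole warp: each side of the branch runs with the other side's lanes masked off, regardless of the 16-lane SIMD width underneath. Two caveats: since Volta's independent thread scheduling, __activemask() is not guaranteed to report the full divergent half, and the compiler is free to if-convert a branch this simple, in which case no divergence occurs at all; the masks in the comments are the typical case.

```
#include <cstdio>

// Minimal divergence sketch: both sides of the branch are executed by the
// same 32-thread warp, with the other side's lanes masked off.
// __activemask() reports that mask, so it reflects the 32-wide warp, not
// the 16-lane SIMD underneath.
__global__ void divergent(unsigned *low_mask, unsigned *high_mask)
{
    int lane = threadIdx.x & 31;               // lane id within the warp
    if (lane < 16)
        low_mask[lane] = __activemask();       // typically 0x0000ffff
    else
        high_mask[lane - 16] = __activemask(); // typically 0xffff0000
}

int main()
{
    unsigned *d_lo, *d_hi, h_lo[16], h_hi[16];
    cudaMalloc(&d_lo, sizeof h_lo);
    cudaMalloc(&d_hi, sizeof h_hi);
    divergent<<<1, 32>>>(d_lo, d_hi);
    cudaMemcpy(h_lo, d_lo, sizeof h_lo, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_hi, d_hi, sizeof h_hi, cudaMemcpyDeviceToHost);
    printf("low half saw 0x%08x, high half saw 0x%08x\n", h_lo[0], h_hi[0]);
    cudaFree(d_lo);
    cudaFree(d_hi);
    return 0;
}
```

If hardware could schedule sub-warp groups independently, the two halves wouldn't have to take turns under these masks, which is exactly what the subwarp interleaving proposal is about.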
 
What do you mean by "the h/w"? Instructions need to be issued to an entire warp at once
They don't need to be done "at once", and in fact they are not, since the h/w needs two cycles to go through a warp. This opens up opportunities for more granular control over how these warps are executed, whether or not current h/w makes use of them.
 
The same instruction has to be issued for those two cycles. There are two SIMDs, and the scheduler can only issue one instruction per clock, alternating between the two (or to the tensor, SFU, or MIO pipes).
 
Most of us care a lot more about € vs €. Or do you think Intel's upcoming Arc flagship (expected to be around 6700 XT / 3070 level) should also be compared just to the 6900 XT and 3090 (Ti)?

Well, now I see. The 3090 isn't in the same class of performance; the 6800 XT is fighting it out with the 3080/Ti. The 3090 has no direct AMD competitor, yet.
 
The 6900 XT is cheaper than the 3080 Ti and about the same price as the 3080 12GB (European prices, just checked on Geizhals). Why should the 3080 (Ti) be compared to the 6800 XT instead of the 6900 XT?
As for performance class: other than in RT, the 6900 is in the same class as the 3090 despite the price difference.
 
It's slower unless you limit the comparison to low resolutions without RT. But why would you do that?
I don't; I just pointed out that saying AMD doesn't have a card in the 3090's performance class is false (unless you limit yourself to RT games only). But this is all beside the point, which was PSman1700 saying the 6900 should be compared to the 3090 and not the 3080/Ti, even though the 6900 is priced around the 3080 12GB.
 

I meant mainly in performance, as prices are in fantasy land now anyway. In performance, the 3090/Ti is in its own class, unless you want to omit RT games, which is nearly impossible these days.
 
Ampere is the same as Volta, which is 16 wide in h/w.

Each SIMD has 16 lanes, but warps are 32 threads which are executed over two cycles (plus pipelining)

From the GA102 white paper:
GA10X includes FP32 processing on both datapaths, doubling the peak processing rate for FP32 operations. One datapath in each partition consists of 16 FP32 CUDA Cores capable of executing 16 FP32 operations per clock. Another datapath consists of both 16 FP32 CUDA Cores and 16 INT32 Cores, and is capable of executing either 16 FP32 operations OR 16 INT32 operations per clock. As a result of this new design, each GA10x SM partition is capable of executing either 32 FP32 operations per clock, or 16 FP32 and 16 INT32 operations per clock.
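
As a hedged illustration of what the white paper is describing (the kernel and its parameter names are made up, not from the paper): a typical kernel mixes the two kinds of work, with index arithmetic going to the INT32 datapath and the FMA to the FP32 datapath. Whether the two actually overlap on any given clock is up to the compiler and scheduler, not the source code.

```
// Hypothetical kernel mixing the two datapaths the white paper describes:
// the index arithmetic is INT32 work, the fused multiply-add is FP32 work.
// Per the quote, a GA10x SM partition can sustain 16 INT32 + 16 FP32 ops
// per clock, or 32 FP32 per clock when no INT32 instructions are pending.
__global__ void mixed_fp_int(float *out, const float *a, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // INT32 datapath work
    if (i < n) {
        int j = (i * stride) % n;                   // more INT32 work
        out[i] = fmaf(a[j], 2.0f, 1.0f);            // FP32 datapath work
    }
}
```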
 
From the GA102 white paper:
That doesn't contradict what I've been saying. Each (32-wide) warp is sent to either the INT/FP pipe or the dedicated FP pipe, to be executed over two cycles.

On clock 0, the scheduler issues an FP instruction from warp 0 to the FP SIMD. 16 of the 32 threads start execution.

On clock 1, the scheduler issues an instruction from warp 1 to the FP/INT SIMD. The other 16 threads of warp 0 start execution. Depending on whether warp 1 is doing an INT or FP instruction, the subcore is now doing either 16+16 or 32 FP only.

But it still can't do anything more fine-grained than the 32-thread warp size.
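
A toy timeline of that issue pattern, purely as a sketch of the behavior described above (the instruction stream, warp IDs, and pipe assignments are invented for illustration):

```
#include <cstdio>

// Toy model of the issue timeline above: the partition's scheduler issues
// one warp instruction per clock, and each issued warp then occupies its
// 16-lane SIMD for two cycles (lanes 0-15 first, then lanes 16-31).
int main()
{
    struct Issue { int warp; const char *pipe; };
    Issue stream[] = {
        {0, "FP32"}, {1, "FP32/INT32"}, {2, "FP32"}, {3, "FP32/INT32"},
    };
    for (int clk = 0; clk < 4; ++clk)
        printf("clk %d: issue warp %d to the %s SIMD; "
               "lanes 0-15 run now, lanes 16-31 on clk %d\n",
               clk, stream[clk].warp, stream[clk].pipe, clk + 1);
    return 0;
}
```

The point of the model is that the scheduler stays busy every clock while each warp still occupies its 16-lane SIMD for two cycles, so nothing finer than a 32-thread warp is ever scheduled.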
 
Nvidia has been doing this since G80. Hardware was 8-wide but execution / branching granularity was 32 threads. Hasn’t changed since 2006.
 