GPU Ray Tracing Performance Comparisons [2021-2022]

Each SIMD has 16 lanes, but warps are 32 threads which are executed over two cycles (plus pipelining)
Well yeah, but the h/w is 16 wide, which allows for higher granularity of execution on branches, among other things.

Warp/wave widths are a different topic altogether, and they too may become an issue for a pure raytraced future.
 
What do you mean by "the h/w"? Instructions need to be issued to an entire warp at once, which makes the warp the smallest unit of execution, not the SIMD width. Otherwise you might as well call GCN waves 16 wide as well.

If current GPUs were able to pick and choose parts of a warp to execute independently, then the whole subwarp interleaving paper being discussed would be irrelevant.
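
To make that concrete, here's a minimal CUDA sketch (the kernel name, arrays, and launch shape are invented for illustration). Divergence is tracked as a 32-bit lane mask covering the whole warp: each side of the branch runs with the other side's lanes masked off, regardless of the 16-lane SIMD width underneath. Two caveats: since Volta's independent thread scheduling, __activemask() is not guaranteed to report the full divergent half, and the compiler is free to if-convert a branch this simple, in which case no divergence occurs at all; the masks in the comments are the typical case.

```
#include <cstdio>

// Minimal divergence sketch: both sides of the branch are executed by the
// same 32-thread warp, with the other side's lanes masked off.
// __activemask() reports that mask, so it reflects the 32-wide warp, not
// the 16-lane SIMD underneath.
__global__ void divergent(unsigned *low_mask, unsigned *high_mask)
{
    int lane = threadIdx.x & 31;               // lane id within the warp
    if (lane < 16)
        low_mask[lane] = __activemask();       // typically 0x0000ffff
    else
        high_mask[lane - 16] = __activemask(); // typically 0xffff0000
}

int main()
{
    unsigned *d_lo, *d_hi, h_lo[16], h_hi[16];
    cudaMalloc(&d_lo, sizeof h_lo);
    cudaMalloc(&d_hi, sizeof h_hi);
    divergent<<<1, 32>>>(d_lo, d_hi);
    cudaMemcpy(h_lo, d_lo, sizeof h_lo, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_hi, d_hi, sizeof h_hi, cudaMemcpyDeviceToHost);
    printf("low half saw 0x%08x, high half saw 0x%08x\n", h_lo[0], h_hi[0]);
    cudaFree(d_lo);
    cudaFree(d_hi);
    return 0;
}
```

If hardware could schedule sub-warp groups independently, the two halves wouldn't have to take turns under these masks, which is exactly what the subwarp interleaving proposal is about.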
 
What do you mean by "the h/w"? Instructions need to be issued to an entire warp at once
They don't need to be done "at once", and in fact they are not, since the h/w needs two cycles to go through a warp. This opens up opportunities for more granular control over how these warps are executed, whether or not current h/w makes use of them.
 
The same instruction has to be issued for those two cycles. There are two SIMDs, and the scheduler can only issue one instruction per clock, alternating between the two (or to the tensor, SFU, or MIO pipes).
 
Most of us care a lot more about € vs €. Or do you think Intel's upcoming Arc flagship (expected to be around 6700 XT / 3070 level) should also be compared just to the 6900 XT and 3090 (Ti)?

Well, now I see. The 3090 isn't in the same class of performance; the 6800 XT is fighting it out with the 3080/Ti. The 3090 has no direct AMD competitor, yet.
 
The 6900 XT is cheaper than the 3080 Ti and about the same price as the 3080 12GB (European prices, just checked on Geizhals). Why should the 3080 (Ti) be compared to the 6800 XT instead of the 6900 XT?
As for performance class: other than in RT, the 6900 is in the same class as the 3090 despite the price difference.
 
It's slower unless you limit the comparison to low resolutions without RT. But why would you do that?
I don't; I just pointed out that saying AMD doesn't have a card in the 3090's performance class is false (unless you limit yourself to RT games only). But this is all beside the point, which was PSman1700 saying the 6900 should be compared to the 3090 and not the 3080/Ti, even though the 6900 is priced around the 3080 12GB.
 

I meant mainly in performance, as prices are in fantasy land now anyway. In performance, the 3090/Ti is in its own class, unless you want to omit RT games, which is nearly impossible these days.
 
Ampere is the same as Volta, which is 16 wide in h/w.

Each SIMD has 16 lanes, but warps are 32 threads which are executed over two cycles (plus pipelining)

From the GA102 white paper:
GA10X includes FP32 processing on both datapaths, doubling the peak processing rate for FP32 operations. One datapath in each partition consists of 16 FP32 CUDA Cores capable of executing 16 FP32 operations per clock. Another datapath consists of both 16 FP32 CUDA Cores and 16 INT32 Cores, and is capable of executing either 16 FP32 operations OR 16 INT32 operations per clock. As a result of this new design, each GA10x SM partition is capable of executing either 32 FP32 operations per clock, or 16 FP32 and 16 INT32 operations per clock.
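
As a hedged illustration of what the white paper is describing (the kernel and its parameter names are made up, not from the paper): a typical kernel mixes the two kinds of work, with index arithmetic going to the INT32 datapath and the FMA to the FP32 datapath. Whether the two actually overlap on any given clock is up to the compiler and scheduler, not the source code.

```
// Hypothetical kernel mixing the two datapaths the white paper describes:
// the index arithmetic is INT32 work, the fused multiply-add is FP32 work.
// Per the quote, a GA10x SM partition can sustain 16 INT32 + 16 FP32 ops
// per clock, or 32 FP32 per clock when no INT32 instructions are pending.
__global__ void mixed_fp_int(float *out, const float *a, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // INT32 datapath work
    if (i < n) {
        int j = (i * stride) % n;                   // more INT32 work
        out[i] = fmaf(a[j], 2.0f, 1.0f);            // FP32 datapath work
    }
}
```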
 
From the GA102 white paper:
That doesn't contradict what I've been saying. Each (32-wide) warp is sent to either the INT/FP pipe or the dedicated FP pipe, to be executed over two cycles.

On clock 0, the scheduler issues an FP instruction from warp 0 to the FP SIMD. 16 of the 32 threads start execution.

On clock 1, the scheduler issues an instruction from warp 1 to the FP/INT SIMD. The other 16 threads of warp 0 start execution. Depending on whether warp 1 is doing an INT or FP instruction, the subcore is now doing either 16+16 or 32 FP only.

But it still can't do anything more fine-grained than the 32-thread warp size.
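
A toy timeline of that issue pattern, purely as a sketch of the behavior described above (the instruction stream, warp IDs, and pipe assignments are invented for illustration):

```
#include <cstdio>

// Toy model of the issue timeline above: the partition's scheduler issues
// one warp instruction per clock, and each issued warp then occupies its
// 16-lane SIMD for two cycles (lanes 0-15 first, then lanes 16-31).
int main()
{
    struct Issue { int warp; const char *pipe; };
    Issue stream[] = {
        {0, "FP32"}, {1, "FP32/INT32"}, {2, "FP32"}, {3, "FP32/INT32"},
    };
    for (int clk = 0; clk < 4; ++clk)
        printf("clk %d: issue warp %d to the %s SIMD; "
               "lanes 0-15 run now, lanes 16-31 on clk %d\n",
               clk, stream[clk].warp, stream[clk].pipe, clk + 1);
    return 0;
}
```

The point of the model is that the scheduler stays busy every clock while each warp still occupies its 16-lane SIMD for two cycles, so nothing finer than a 32-thread warp is ever scheduled.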
 
Nvidia has been doing this since G80. Hardware was 8-wide but execution / branching granularity was 32 threads. Hasn’t changed since 2006.
 