> You're referring to HPC? Ampere is 32, isn't it?
Ampere is the same as Volta, which is 16 in h/w.

> Ampere is the same as Volta, which is 16 in h/w.
Each SIMD has 16 lanes, but warps are 32 threads which are executed over two cycles (plus pipelining).

> Each SIMD has 16 lanes, but warps are 32 threads which are executed over two cycles (plus pipelining).
Well yeah, but the h/w is 16 wide, which allows for higher granularity of execution on branches amongst other things.
Warp/wave widths are a different topic altogether and they too may become an issue for a pure raytraced future.
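
Editor's aside, not part of the thread: whatever the physical SIMD width, the branching granularity CUDA actually exposes is the 32-thread warp. A minimal sketch of that warp-level view, assuming a reasonably recent GPU and toolkit; the kernel name and launch shape are made up for illustration:

```
// Divergence is tracked per 32-thread warp, even though the hardware discussed
// above executes a warp on 16-lane SIMDs over two clocks. Build with nvcc.
#include <cstdio>

__global__ void branch_granularity_demo()
{
    int lane = threadIdx.x % warpSize;      // warpSize is 32 on all current NVIDIA GPUs

    if (lane < 16) {
        // Lower half of the warp takes this branch.
        unsigned active = __activemask();   // typically 0x0000ffff here
        if (lane == 0)
            printf("branch A active mask: 0x%08x\n", active);
    } else {
        // Upper half of the warp takes this branch.
        unsigned active = __activemask();   // typically 0xffff0000 here
        if (lane == 16)
            printf("branch B active mask: 0x%08x\n", active);
    }
}

int main()
{
    branch_granularity_demo<<<1, 32>>>();   // one warp
    cudaDeviceSynchronize();
    return 0;
}
```

The masks are 32 bits wide because the warp is the unit the ISA and the programming model work in; the 16-lane datapaths underneath are an implementation detail.
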
> What do you mean by "the h/w"? Instructions need to be done to an entire warp at once.
They don't need to be done "at once", and in fact they are not, since the h/w needs two cycles to go through a warp. This opens up opportunities for more granular control over how these warps are executed, whether or not current h/w uses that in full.

> They don't need to be done "at once", and in fact they are not, since the h/w needs two cycles to go through a warp.
The same instruction has to be done for those two cycles. There are two SIMDs, and the scheduler can only issue one instruction per clock, alternating between the two (or to the tensor, SFU or MIO pipes).

> I think the 3090 is supposed to take fights with the 6900 XT, flagship vs flagship.
Most of us care a lot more about € vs €. Or do you think Intel's upcoming Arc flagship (expected to be around 6700 XT/3070 level) should also be compared only to the 6900 XT and 3090 (Ti)?

> Well, now I see the 3090 isn't in the same class of performance. The 6800 XT is fighting it out with the 3080/Ti. The 3090 has no direct AMD competitor, yet.
The 6900 XT is cheaper than the 3080 Ti and about the same price as the 3080 12GB (European prices, just checked on Geizhals). Why should the 3080 (Ti) be compared to the 6800 XT instead of the 6900 XT?

> As for performance class, other than RT the 6900 is in the same class as the 3090 despite the price difference.
It's slower unless you limit the comparison to low resolutions without RT. But why would you do that?

> It's slower unless you limit the comparison to low resolutions without RT. But why would you do that?
I don't, I just pointed out that saying AMD doesn't have a card in the 3090's performance class is false (unless you limit yourself to RT games only). But this is all beside the point, which was PSman1700 saying the 6900 should be compared to the 3090 and not the 3080/Ti, even though the 6900 is priced around the 3080 12GB.

> Ampere is the same as Volta, which is 16 in h/w.
> Each SIMD has 16 lanes, but warps are 32 threads which are executed over two cycles (plus pipelining).
From the GA102 white paper:
> GA10x includes FP32 processing on both datapaths, doubling the peak processing rate for FP32 operations. One datapath in each partition consists of 16 FP32 CUDA Cores capable of executing 16 FP32 operations per clock. Another datapath consists of both 16 FP32 CUDA Cores and 16 INT32 Cores, and is capable of executing either 16 FP32 operations OR 16 INT32 operations per clock. As a result of this new design, each GA10x SM partition is capable of executing either 32 FP32 operations per clock, or 16 FP32 and 16 INT32 operations per clock.
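
Editor's aside, not part of the thread: plugging the per-partition numbers quoted above into a back-of-the-envelope peak-rate check. The RTX 3090 figures used here (82 SMs, ~1695 MHz boost) are the commonly published specs and serve only as an example; this is plain host code, no GPU required:

```
#include <cstdio>

int main()
{
    const int    partitions_per_sm  = 4;     // SM sub-partitions ("subcores")
    const int    fp32_per_partition = 32;    // 16 (fp datapath) + 16 (fp/int datapath), per the white paper
    const int    sm_count           = 82;    // RTX 3090 (GA102)
    const double boost_clock_ghz    = 1.695; // advertised boost clock
    const int    flops_per_fma      = 2;     // an FMA counts as two floating-point operations

    double tflops = sm_count * partitions_per_sm * fp32_per_partition
                  * flops_per_fma * boost_clock_ghz / 1000.0;
    printf("peak FP32: %.1f TFLOPS\n", tflops);   // ~35.6 TFLOPS, in line with the advertised figure
    return 0;
}
```
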
> From the GA102 white paper: [...]
That doesn't contradict what I've been saying. Each (32-wide) warp is sent to either the int/fp pipe or the dedicated fp pipe, to be executed over two cycles.

On clock 0, the scheduler sends an fp instruction from warp 0 to the fp SIMD. 16 of the 32 threads start execution.
On clock 1, the scheduler sends an instruction from warp 1 to the fp/int SIMD. The other 16 threads of warp 0 start execution. Depending on whether warp 1 is doing an int or fp instruction, the subcore is now doing either 16+16 or 32 fp only.
But it still can't do anything more fine-grained than the 32-thread warp size.
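
Editor's aside, not part of the thread: the clock-by-clock picture above written out as a toy model. One issue slot per clock, two 16-lane pipes, each 32-thread warp occupying its pipe for two clocks; this illustrates the post's description, not documented hardware behaviour:

```
#include <cstdio>

int main()
{
    const char* pipe_name[2] = { "fp SIMD", "fp/int SIMD" };
    int busy_until[2] = { 0, 0 };   // first clock at which each pipe can accept a new warp
    int next_warp = 0;

    for (int clk = 0; clk < 6; ++clk) {
        // One issue slot per clock: hand the next warp to a free pipe, if any.
        for (int p = 0; p < 2; ++p) {
            if (busy_until[p] <= clk) {
                printf("clk %d: issue warp %d to %-11s (occupies it for clocks %d-%d)\n",
                       clk, next_warp, pipe_name[p], clk, clk + 1);
                busy_until[p] = clk + 2;   // 32 threads on 16 lanes -> two clocks
                ++next_warp;
                break;                     // only one instruction issued per clock
            }
        }
    }
    return 0;
}
```

In the steady state one new warp instruction is issued every clock while each pipe works through a 32-thread warp over two clocks, which is the 16+16 / 32-fp picture described above.
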
> Nvidia has been doing this since G80. Hardware was 8-wide but execution/branching granularity was 32 threads. Hasn't changed since 2006.
It had, actually. Maxwell and Pascal were 32 wide in h/w.