Nvidia Post-Volta (Ampere?) Rumor and Speculation Thread

Turing is already a better "gaming architecture" than RDNA2.

I submit that RDNA is superior to Turing at gaming.
Take the original 2070's Turing die (TU106) and compare it to the 5700 XT's (Navi 10) die.

Look at the transistor counts, SPs, ROPs, etc., then look at the performance.
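For reference, a quick side-by-side using approximate public spec-sheet figures (my numbers, not from the posts above, so treat them as ballpark):

```python
# Rough TU106 (RTX 2070) vs. Navi 10 (RX 5700 XT) comparison.
# Figures are approximate public spec-sheet numbers, not from this thread.
chips = {
    "TU106 / RTX 2070 (12 nm)":    {"die_mm2": 445, "transistors_B": 10.8, "shaders": 2304, "rops": 64},
    "Navi 10 / RX 5700 XT (7 nm)": {"die_mm2": 251, "transistors_B": 10.3, "shaders": 2560, "rops": 64},
}

for name, c in chips.items():
    density = c["transistors_B"] * 1000 / c["die_mm2"]   # million transistors per mm^2
    print(f"{name}: {c['die_mm2']} mm^2, {c['transistors_B']} B transistors "
          f"(~{density:.0f} MTr/mm^2), {c['shaders']} shaders, {c['rops']} ROPs")
```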
 
Take that Navi die, put it on 16 nm, and come back and ask the same question.

The 2070 also carries a whole lot more functional units (Tensor cores, RT cores, dedicated INT32 units, etc.), which means it does more at the same transistor budget as Navi, and on an older process.
 

Turing is a much better arch than RDNA1, and probably than RDNA2 too. Nvidia's next 7 nm arch should be here soon, and then they'll compete on the same node.
 
Register file size or bandwidth? Register file size doesn't matter here, but bandwidth does. If they don't increase bandwidth, the scheduler can't gather all the operands for 32 FP32 FMAs + 16 INT32 ops in one cycle.

So issuing an INT operation will cause single-cycle bubbles in the FP32 execution pipeline. That's no worse than Pascal. It would be interesting to know how much a 16-wide INT32 pipeline costs versus the dual-purpose FP32/INT32 pipes in Pascal and Navi.
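To put rough numbers on the operand-gathering pressure, assuming 3-read/1-write FMA/IMAD-class ops and ignoring operand collectors and reuse caches (my simplification, not anything from the leak):

```python
# Back-of-envelope register-file traffic per scheduler per cycle, assuming
# FMA/IMAD-class ops read 3 operands and write 1 (my assumption; operand
# collectors and reuse caches would cut the real RF read count).
def rf_traffic(fp_lanes, int_lanes, reads_per_op=3, writes_per_op=1):
    lanes = fp_lanes + int_lanes
    return {"reads": lanes * reads_per_op, "writes": lanes * writes_per_op}

print("Turing-style 16 FP + 16 INT per cycle:", rf_traffic(16, 16))
print("Rumored      32 FP + 16 INT per cycle:", rf_traffic(32, 16))
```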

As depicted in the SM diagram, it's only capable of issuing a single warp per clock anyway, which makes me believe it's fake. It's not just INT instructions that would cause FP bubbles; LD/ST and MUFU ops would too, and you'd have to go back as far as big Fermi to see that happen.

It could be believable if it's a dual-issue scheduler like small Fermi through Pascal had, but that ain't what's in the diagram.
 
The various units can only process half (or less) of a warp per cycle. Dispatching one warp per cycle means you can do 1 int and 1 fp warp every 2 cycles, or (in the rumored SM configuration, as opposed to Turing) 2 fp warps every 2 cycles. I don’t think register file bandwidth would need to change at all, since RF bandwidth requirements for concurrent int+fp warps and concurrent fp+fp warps are the same. (Assuming the int32 execution units support 3 input 1 output instructions like multiply-accumulate.)
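A minimal sketch of that last point, assuming 3-read FMA/IMAD-class ops (my assumption about the op format): the RF reads per two-cycle window come out the same whichever pair of warps gets dispatched.

```python
# RF reads over a 2-cycle window with one 32-thread warp dispatched per clock,
# assuming 3-read FMA/IMAD-class ops (my assumption about the op format).
WARP = 32
def rf_reads(warp_mix):                       # e.g. ["FP", "INT"] over 2 cycles
    return sum(WARP * 3 for _ in warp_mix)    # reads are the same either way

print("FP + INT warps per 2 cycles:", rf_reads(["FP", "INT"]), "reads")
print("FP + FP  warps per 2 cycles:", rf_reads(["FP", "FP"]), "reads")
```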
 
RUMORS: NVIDIA Ampere GPU Massive Die Size, Specifications, Architecture and More!

https://wccftech.com/rumors-nvidia-ampere-gpu-massive-die-size-specifications-architecture-and-more

Could be fake, could be real, or somewhere in between.

Fun nonetheless.

From what I can tell that's literally beyond TSMC's reticle limit for 12 nm. Not to mention a ton of fake-looking numbers: "Double the performance, 50% more "cores" at the same time!!!" etc.

Feels fake. I won't say totally fake, since the 2080 Ti already has a huge die, but how are they going to pack that much more into only another ~50 mm² of die area? That's less than a 10 percent increase.
 
The reticle limit is 858 mm² or so.
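Quick numbers on that; TU102's ~754 mm² is a public figure, and 826 mm² is the number quoted further down the thread, so take this as a rough sanity check:

```python
# Sanity check: TSMC's reticle limit is ~858 mm^2 and TU102 (2080 Ti) is
# ~754 mm^2 (public figure).  826 mm^2 is the figure quoted below for the
# rumored 7 nm HPC part.
reticle_mm2 = 858
tu102_mm2 = 754
rumored_mm2 = 826

print(f"Headroom left under the reticle: {reticle_mm2 - tu102_mm2} mm^2")
print(f"826 mm^2 vs TU102: +{(rumored_mm2 / tu102_mm2 - 1) * 100:.1f} %")
```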
 
The various units can only process half (or less) of a warp per cycle. Dispatching one warp per cycle means you can do 1 int and 1 fp warp every 2 cycles, or (in the rumored SM configuration, as opposed to Turing) 2 fp warps every 2 cycles. I don’t think register file bandwidth would need to change at all, since RF bandwidth requirements for concurrent int+fp warps and concurrent fp+fp warps are the same. (Assuming the int32 execution units support 3 input 1 output instructions like multiply-accumulate.)

32 FP + 16 INT ops certainly require more operands than the current Turing config of 16 FP + 16 INT.
 
From what I can tell that's literally beyond TSMC's reticle limit for 12 nm. Not to mention a ton of fake-looking numbers: "Double the performance, 50% more "cores" at the same time!!!" etc.

Feels fake. I won't say totally fake, since the 2080 Ti already has a huge die, but how are they going to pack that much more into only another ~50 mm² of die area? That's less than a 10 percent increase.

Ignore all the other stuff written there. It's just Wccftech mixing so many different rumours.
As Benetanegia wrote, it's of course not 12 nm; it would be 826 mm² on 7 nm (probably EUV). Nothing special about that for Nvidia after Volta: they go to the max for HPC chips. But this die size tells us nothing about consumer GPUs.

If they need to go to 826 mm² for 70-75% more performance, as written in The Next Platform article about the GA100 supercomputer, that's pretty underwhelming. Consumer chips will have smaller dies, so we can expect at most a 40-50% speed increase.
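A naive proportional sketch of that consumer estimate, assuming (my assumption, not the poster's math) that the gain scales linearly with die area, with hypothetical consumer die sizes:

```python
# Naive proportional sketch (my assumption, not the poster's math): scale the
# ~70-75% HPC gain linearly with die area for hypothetical consumer die sizes.
hpc_die_mm2 = 826
for consumer_die_mm2 in (550, 650):          # hypothetical consumer dies
    for gain in (0.70, 0.75):
        scaled = gain * consumer_die_mm2 / hpc_die_mm2
        print(f"{consumer_die_mm2} mm^2, HPC gain {gain:.0%}: ~{scaled:.0%}")
```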
 
Performance as in theoretical TFLOPS throughput or in real, non-cherry-picked applications within the same TDP budget?
 
32 FP + 16 INT ops certainly require more operands than the current Turing config of 16 FP + 16 INT.
But the rumored config seems to be 16 INT + 16 FP or 16 FP + 16 FP, which does not. Supporting concurrent execution of 3 warps would require increased RF bandwidth, as you say. It would also require the ability to dispatch 3 or more warps every 2 cycles and, if maintaining the current level of latency-hiding ability is important, an increased RF size to support the additional warps.
 
One thing that was in the leaks a long time ago: NVIDIA wants to improve the rasterizer. What can they do to make it run better?
 
Could they be referring to improvements for mesh shaders? They revamped the rasterization pipeline when they added programmable mesh shaders.
March can't come soon enough!
 
Do modern chips have a scheduling issue because the rasterizer doesn't get enough pixels done? Or why else would they want to improve it? I can understand that the rasterizer sits at the front, so when the front end is slow, the rest of the chip is slow too.

A big question for me: do rasterization and shading have to be processed in order, or can you do shading work for a pixel that hasn't even been rasterized yet?
 

There is no bottleneck as long as the rasterizer and ROPs can handle the same number of pixels. Any improvements to the rasterizer are likely to improve functionality rather than raw speed.

For any non-trivial application the shader core should not be bottlenecked by the rasterizer. Specific use cases like a depth pre-pass may lean more on the rasterizer, but your typical shader will be memory- or compute-bound.
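For illustration, the commonly cited Turing-era figures (approximate, not from this thread) line up that way:

```python
# Commonly cited (approximate) TU102 figures, not from the posts above:
# each raster engine emits up to 16 pixels per clock, one engine per GPC.
gpcs, pixels_per_raster, rops = 6, 16, 96
raster_rate = gpcs * pixels_per_raster       # peak pixels/clock out of the rasterizers
print(f"Rasterizers: {raster_rate} px/clk, ROPs: {rops} px/clk",
      "(balanced)" if raster_rate == rops else "(mismatched)")
```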
 
But the rumored config seems to be 16 INT + 16 FP or 16 FP + 16 FP, which does not. Supporting concurrent execution of 3 warps would require increased RF bandwidth, as you say. It would also require the ability to dispatch 3 or more warps every 2 cycles and, if maintaining the current level of latency-hiding ability is important, an increased RF size to support the additional warps.

I’m not following you. The diagram posted on twitter shows 32 FP units and 16 INT units.

Turing can schedule one full warp per clock. It takes 2 clocks to actually execute a warp because the execution units are only 16 wide. This allows the Turing scheduler to switch between issuing INT and FP ops each clock for full utilization of all execution units.

If Nvidia goes back to 32-wide execution for FP, then there will be no free clock in which to issue INT ops, and there will be bubbles in the FP pipeline.
 
I'm reading the diagram as 2 16 wide FP units, and 1 16 wide INT unit, so that the scheduler can switch between issuing to different FP units every cycle (unlike Turing), or switch between INT and FP units (like Turing). Yes, one wouldn't be able to use all 3 16-wide units concurrently, so there'd be a bubble in at least one INT or FP unit every cycle. But it seems like a pretty non-invasive way to increase peak FP throughput without having to scale other aspects of the SM. If power spent in instruction execution is relatively small compared to the cost of obtaining/moving operands, then this design seems like it doubles peak FP throughput without increasing peak SM power consumption very much. So it all seems pretty plausible to me...
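Here's a toy issue schedule of that reading (just my sketch, with hypothetical unit names, of a single scheduler issuing one warp per clock to 16-wide pipes):

```python
# Toy per-cycle issue schedule: one scheduler issues one 32-thread warp per
# clock to 16-wide pipes, so each issue keeps a pipe busy for 2 clocks.
# "Turing" = 1x16 FP + 1x16 INT; "rumored" = 2x16 FP + 1x16 INT.  Hypothetical
# unit names, just my reading of the diagram.
def schedule(units, pattern, cycles=8):
    busy = {u: 0 for u in units}          # remaining busy clocks per pipe
    issued = {u: 0 for u in units}        # warps issued per pipe
    for c in range(cycles):
        want = pattern[c % len(pattern)]  # op type the workload offers this clock
        for u in units:
            if u.startswith(want) and busy[u] == 0:
                busy[u] = 2
                issued[u] += 1
                break
        busy = {u: max(0, b - 1) for u, b in busy.items()}
    return issued

print("Turing,  alternating FP/INT:", schedule(["FP0", "INT0"], ["FP", "INT"]))
print("Rumored, all-FP workload:   ", schedule(["FP0", "FP1", "INT0"], ["FP", "FP"]))
```

With an all-FP stream the rumored layout sustains one FP warp per clock (double Turing's FP rate) while the INT pipe simply idles, which is the bubble being discussed.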
 

Ah, now I get it. Yeah, that would be interesting and a relatively cheap way to increase FP throughput. It raises the question, though, of why they'd go through all that trouble instead of just using Pascal-style 32-wide combined INT+FP units.
 

For the same reason they included the INT unit in the first place, I suppose (it's more efficient?). This move (if at all real and not fake) would simply fix the FP:INT ratio, because right now in Turing there's often nothing to switch to, nothing to schedule to the INT pipe in 64% of cases, since there are supposedly only about 36 INT instructions per 100 FP instructions.
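To put numbers on that 64%: a crude issue-slot sketch (my own arithmetic, built on the ~36 INT per 100 FP figure):

```python
# Crude issue-slot model (my own arithmetic): one warp issued per clock, each
# 16-wide pipe retires a warp every 2 clocks, instruction mix is ~36 INT per
# 100 FP (the Turing-era figure).
FP_WARPS, INT_WARPS = 100, 36

def cycles_needed(fp_pipes):
    fp_limited = 2 * FP_WARPS / fp_pipes      # FP pipes as the bottleneck
    issue_limited = FP_WARPS + INT_WARPS      # one issue slot per clock
    return max(fp_limited, issue_limited)

for name, fp_pipes in [("Turing-style 1x16 FP + 1x16 INT", 1),
                       ("Rumored      2x16 FP + 1x16 INT", 2)]:
    c = cycles_needed(fp_pipes)
    int_busy = 2 * INT_WARPS / c              # one 16-wide INT pipe
    print(f"{name}: {c:.0f} clocks for the mix, INT pipe ~{int_busy:.0%} busy")
```

In this toy model the Turing-style split leaves the INT pipe busy only ~36% of the time (idle 64%), while the rumored split finishes the same mix sooner and wastes less of the INT pipe.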

 