Nvidia Ampere Discussion [2020-05-14]

Seems about right. I'm super curious about overclocking. Overclocking will speed up the ROPs, the raster engines and the texture units, but I would assume those wouldn't be the bottleneck in an OpenCL test. That means the GPU most likely doesn't have enough bandwidth in all situations for the number of CUDA cores, or is there something else in the front end that could explain it? Is it possible this test is mixing INT32, but in an unequal ratio to FP32, like a 2:1 split of FP32 to INT32?
OpenCL on NV h/w has never been very good, so we still need more data.
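For reference, here is a minimal sketch of the kind of mixed-ratio kernel being speculated about above, written in CUDA rather than OpenCL just for brevity. The kernel name, loop count and constants are all made up; it only illustrates what a roughly 2:1 FP32-to-INT32 instruction mix looks like, not what the benchmark in question actually does.

```
// Hypothetical microbenchmark kernel (not from the test being discussed):
// roughly two FP32 FMAs per INT32 multiply-add in the inner loop, to show
// what an "unequal 2:1 FP32-to-INT32 mix" could look like in practice.
__global__ void mix_fp32_int32_2to1(float* fout, int* iout, int iters)
{
    float f = threadIdx.x * 0.5f;
    int   i = threadIdx.x;
    for (int k = 0; k < iters; ++k) {
        f = f * 1.0001f + 0.5f;   // FP32 FMA
        f = f * 0.9999f + 0.25f;  // FP32 FMA
        i = i * 3 + 7;            // INT32 IMAD
    }
    // Write results so the compiler cannot eliminate the loop.
    fout[threadIdx.x] = f;
    iout[threadIdx.x] = i;
}
```

On a Pascal-style SM all three of those operations compete for the same issue slots; on Turing/Ampere the INT32 IMAD can go down the second datapath, which is exactly the difference such a mix would expose.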
 
But there is no progress to demonstrate, so the slide is misleading at best. A Pascal SM retired 256 FP32 flops per clock (128 FMAs) and INT32 had to compete for instruction slots, same as Ampere.
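As a sanity check on that 256 number, it is just 128 FP32 lanes times 2 flops per FMA; a quick host-side helper (the lane counts are the commonly quoted SM configurations, nothing measured) gives the same per-SM, per-clock peak for a consumer Pascal SM and a GA10x SM once you count the Ampere SM's 64 FP32 + 64 FP32/INT32 lanes as 128:

```
#include <cstdio>

// Peak FP32 throughput per SM per clock = FP32 lanes x 2 flops per FMA.
static int peak_fp32_flops_per_clock(int fp32_lanes_per_sm)
{
    return fp32_lanes_per_sm * 2;  // an FMA counts as a multiply plus an add
}

int main()
{
    // Consumer Pascal SM: 128 FP32 lanes; INT32 competes for the same issue slots.
    printf("Pascal-style SM: %d FP32 flops/clock\n", peak_fp32_flops_per_clock(128));
    // GA10x Ampere SM: 64 FP32 + 64 FP32/INT32 lanes = 128 when no INT32 is issued.
    printf("Ampere-style SM: %d FP32 flops/clock\n", peak_fp32_flops_per_clock(128));
    return 0;
}
```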
Aren't AMD's ALUs now technically more complete than Nvidia's? All of them can do INT ops, whereas in Ampere only half of them can.
 
Aren't AMD's ALUs now technically more complete than Nvidia's? All of them can do INT ops, whereas in Ampere only half of them can.

I think the problem on Turing was that a lot of the INT32 ALUs sat idle. I believe the CUs in AMD would be the same where they have dedicated FP and INT units, no?
 
Not true. Supported modes are only 128+0 or 64+64.
What do you mean? Sorry, I don't understand... Do you mean that if a CUDA core is going to perform both INT32 and FP32 ops at the same time, the resources of that CUDA core are halved? Or that it isn't as efficient?

AFAIK, native integer cores appeared for the first time in Turing; that's why only Turing graphics cards and newer allow for the famous Integer Scaling.

[Image: Integer Scaling option in the NVIDIA Control Panel, from the Gamescom 2019 GeForce Game Ready Driver announcement]
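As an aside, "integer scaling" here just means nearest-neighbour upscaling by a whole-number factor, so every source pixel becomes an exact NxN block with no filtering. A minimal CUDA sketch of the idea (kernel and parameter names are made up; the real feature is implemented in the driver/display path, not as a user kernel):

```
// Hypothetical illustration of integer (nearest-neighbour) scaling:
// each source pixel is replicated into an exact 'factor' x 'factor' block.
__global__ void integer_scale(const uchar4* src, int src_w, int src_h,
                              uchar4* dst, int factor)
{
    int dst_w = src_w * factor;
    int dst_h = src_h * factor;
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= dst_w || y >= dst_h) return;

    // Integer division maps every destination pixel back to exactly one
    // source pixel; no blending or filtering is involved.
    int sx = x / factor;
    int sy = y / factor;
    dst[y * dst_w + x] = src[sy * src_w + sx];
}
```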


 
Aren't AMD's ALUs now technically more complete than Nvidia's? All of them can do INT ops, whereas in Ampere only half of them can.

Yes, same as Pascal. I think the prevailing assumption is that those combined FP and INT pipelines share transistors to some extent so one pipeline isn’t completely idle. No idea if that’s true though.

Nvidia clearly thought it made sense to split the FP and INT functionality into separate pipelines for whatever reason.
 
Can somebody explain to me why they built FP32 + (INT32 or FP32) instead of FP32 + INT32 + FP32?
They have all three units on the chip, so why not use them all at the same time?
Is the scheduling so much more complicated? Maybe an issue with cache size and latency?
FP32+FP32+INT32 is physically bigger than FP32+[FP32|INT32]: it needs another real datapath inside the SM, and the three units could not be scheduled at the same time with the current scheduler. Remember, the FP32/INT32 SIMD is not physically separate but combined in one SIMD.
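A toy issue model makes the scheduling point concrete. Under the commonly cited Turing/Ampere arrangement (simplified here; these are the model's numbers, not measurements): each SM partition has one warp scheduler issuing one 32-thread instruction per clock, and each datapath is 16 lanes wide, so an issued instruction keeps its path busy for two clocks. A single scheduler can therefore keep two paths saturated, but a third would idle half the time:

```
#include <cstdio>

// Toy model: one scheduler issues one 32-thread instruction per clock;
// each 16-lane datapath is then busy for 2 clocks per issued instruction.
// With N datapaths, a single scheduler can keep at most 2 of them saturated.
int main()
{
    const int clocks = 1000;              // arbitrary window
    const int issues = clocks;            // one issue per clock
    const int path_clocks_per_issue = 2;  // 32 threads over a 16-lane path

    for (int paths = 2; paths <= 3; ++paths) {
        double busy     = (double)issues * path_clocks_per_issue;  // path-clocks of work generated
        double capacity = (double)clocks * paths;                  // path-clocks available
        printf("%d datapaths per partition: ~%.0f%% utilisation\n",
               paths, 100.0 * busy / capacity);
    }
    return 0;
}
```

Which is presumably why a third datapath would only make sense together with a beefier scheduler, i.e. more area than just the extra ALUs.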
 
What do you mean? Sorry, I don't understand... Do you mean that if a CUDA core is going to perform both INT32 and FP32 ops at the same time, the resources of that CUDA core are halved? Or that it isn't as efficient?

AFAIK, native integer cores appeared for the first time in Turing; that's why only Turing graphics cards and newer allow for the famous Integer Scaling.

[Image: Integer Scaling option in the NVIDIA Control Panel, from the Gamescom 2019 GeForce Game Ready Driver announcement]


I was wrong. It can be any number of INT lanes in multiples of 16, if I understood the replies correctly, so 112+16, 80+48, etc. The ComputerBase article had me thinking it was either 128 FP or 64 FP + 64 INT with no other combinations.
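That matches a simple lane-level reading of the SM: four partitions, each with a 16-lane FP32-only path and a 16-lane shared FP32/INT32 path. Per clock, each shared path is doing either FP32 or INT32, so the split comes in steps of 16 (a sketch under that assumption, not a claim about how the hardware actually enumerates modes):

```
#include <cstdio>

// Possible per-clock FP32/INT32 lane splits for a GA10x SM under the model
// above: 4 partitions x (16-lane FP32 path + 16-lane shared FP32/INT32 path).
int main()
{
    const int partitions = 4;
    const int lanes_per_path = 16;
    const int total_lanes = partitions * 2 * lanes_per_path;  // 128

    for (int shared_on_int = 0; shared_on_int <= partitions; ++shared_on_int) {
        int int_lanes = shared_on_int * lanes_per_path;
        printf("FP32 %3d + INT32 %2d\n", total_lanes - int_lanes, int_lanes);
    }
    return 0;
}
```

This prints exactly the 128+0, 112+16, 96+32, 80+48 and 64+64 combinations discussed above.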
 
Nvidia clearly thought it made sense to split the FP and INT functionality into separate pipelines for whatever reason.
It helps a lot in ray tracing, even software ray tracing. This is the reason GTX Turing performs better than Pascal in ray tracing despite having fewer resources, and it's also the reason Turing surpasses both RDNA 1 and Pascal in software ray tracing (such as CryEngine and World of Tanks).
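The reason the split helps there is that software ray traversal is full of integer work sitting right next to the floating-point intersection math. A hypothetical BVH traversal fragment (made-up node layout and function names, leaf/hit handling omitted) shows where it comes from: stack pointers, node indices and address arithmetic are all INT32 and can co-issue with the FP32 slab tests on Turing/Ampere, whereas on Pascal they steal FP32 issue slots.

```
// Hypothetical software-BVH traversal fragment (not from any real engine).
struct BvhNode { float3 bmin, bmax; int left, right; };  // made-up layout; negative child = leaf

__device__ bool slab_test(const BvhNode& n, float3 o, float3 inv_d, float t_max)
{
    // FP32 work: ray vs axis-aligned box intersection.
    float tx0 = (n.bmin.x - o.x) * inv_d.x, tx1 = (n.bmax.x - o.x) * inv_d.x;
    float ty0 = (n.bmin.y - o.y) * inv_d.y, ty1 = (n.bmax.y - o.y) * inv_d.y;
    float tz0 = (n.bmin.z - o.z) * inv_d.z, tz1 = (n.bmax.z - o.z) * inv_d.z;
    float tmin = fmaxf(fmaxf(fminf(tx0, tx1), fminf(ty0, ty1)), fminf(tz0, tz1));
    float tmax = fminf(fminf(fmaxf(tx0, tx1), fmaxf(ty0, ty1)), fmaxf(tz0, tz1));
    return tmax >= fmaxf(tmin, 0.0f) && tmin <= t_max;
}

__device__ void traverse(const BvhNode* nodes, float3 o, float3 inv_d, float t_max)
{
    int stack[64];                      // INT32: traversal stack
    int sp = 0;
    stack[sp++] = 0;                    // INT32: push root index
    while (sp > 0) {
        int idx = stack[--sp];          // INT32: pop
        const BvhNode& n = nodes[idx];  // INT32: address = base + idx * sizeof(BvhNode)
        if (!slab_test(n, o, inv_d, t_max)) continue;
        if (n.left  >= 0) stack[sp++] = n.left;   // INT32: push children
        if (n.right >= 0) stack[sp++] = n.right;
        // Leaf/primitive intersection omitted for brevity.
    }
}
```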
 
Nvidia clearly thought it made sense to split the FP and INT functionality into separate pipelines for whatever reason.
This might be cost optimisation: the fixed overheads of an SM partition, such as the instruction decoder, can be scaled up for relatively little cost while getting a large increase in theoretical throughput. Presumably the fixed overheads grew due to the addition of tensor cores, so the step to make these two instruction types co-issue in Turing seems low cost (well, then there's the compiler). In other words, the addition of tensor cores forced them to add overhead, and so normal shader instruction processing saw a big change?
 
Corsair has announced they are developing a custom cable that will be fully compatible with all Type 3 and Type 4 CORSAIR modular power supplies to allow for a clean connection. The new cable connects two PCIe / CPU PSU ports directly to the new 12-pin connector and is currently undergoing development and testing. Corsair also now recommends a PSU rating of 850 watts or higher for the RTX 3090. The Corsair 12-pin cable should be available for sale by September 17th, the same day as the RTX 30 series cards. Pricing wasn't announced, but you can sign up to be notified here.
https://www.techpowerup.com/271907/corsair-working-on-direct-12-pin-nvidia-ampere-power-cable
 
This might be cost optimisation: the fixed overheads of an SM partition, such as the instruction decoder, can be scaled up for relatively little cost while getting a large increase in theoretical throughput. Presumably the fixed overheads grew due to the addition of tensor cores, so the step to make these two instruction types co-issue in Turing seems low cost (well, then there's the compiler). In other words, the addition of tensor cores forced them to add overhead, and so normal shader instruction processing saw a big change?

That's a reasonable take. End result seems quite unbalanced though.
 
Quick question: What would happen if there were more schedulers feeding more of [FP32|FP32+INT32|TC|L/S|SFU] per clock (i.e. 3 scheduler/dispatch units) with the occasional [RT|TMU] thrown in the mix?
Right, Powaaahhhh! ;)
 
Quick question: What would happen if there were more schedulers feeding more of [FP32|FP32+INT32|TC|L/S|SFU] per clock (i.e. 3 scheduler/dispatch units) with the occasional [RT|TMU] thrown in the mix?
Right, Powaaahhhh! ;)

There probably wouldn't be any real benefit. The TMU, SFU + L/S pipelines execute over many clocks.
 
Here we go again ...

Ethereum Miners Eye NVIDIA’s RTX 30 Series GPU as RTX 3080 Offers 3-4x Better Performance in Eth

Images have surfaced on China’s Baidu forums showing crypto-miners hoarding NVIDIA’s new GeForce RTX 3080 graphics cards in the dozens. Considering that the launch date is still several days away, it’s a surprise that these miners were able to get their hands on so many units:
...
In the above image, you can see a mining farm using up to 8 GeForce RTX 3080 cards, possibly the iChill variant from Inno3D. As per the miners, the RTX 3080 is nearly 3-4x faster than the RTX 2080 in terms of Ethereum mining capabilities. You're looking at 115 MH/s, while the RTX 2080 manages just about 30-40 MH/s.
https://www.hardwaretimes.com/ether...x-3080-offers-3-4x-better-performance-in-eth/
 
Soon you'll look at that $1500 price tag as being a bargain compared to the $2400 or higher... Le Sigh.
 