Nvidia Ampere Discussion [2020-05-14]

DegustatoR · Sep 7, 2020

Scott_Arm said:
Seems about right. I'm super curious about overclocking. Overclocking will speed up the ROPs, the raster engines, and the texture units, but I would assume those wouldn't be the bottleneck in an OpenCL test. That means the GPU most likely doesn't have enough bandwidth in all situations for the number of CUDA cores, or is there something else in the front end that could explain it? Is it possible this test is mixing in INT32, but in an unequal ratio to FP32, like a 2:1 split of FP32 to INT32?

OpenCL on NV h/w has never been very good, so we still need more data.

Love_In_Rio · Sep 7, 2020

trinibwoy said:
But there is no progress to demonstrate so the slide is misleading at best. A Pascal SM retired 256 FP32 flops per clock and INT32 had to compete for instruction slots, same as Ampere.

Aren't AMD's ALUs now technically more complete than Nvidia's? All of them can do int ops, whereas in Ampere only half of them can.

Scott_Arm · Sep 7, 2020

Love_In_Rio said:
Aren't AMD's ALUs now technically more complete than Nvidia's? All of them can do int ops, whereas in Ampere only half of them can.

I think the problem on Turing was that a lot of the INT32 ALUs sat idle. I believe AMD's CUs would be the same, in that they have dedicated FP and INT units, no?

Scott_Arm · Sep 7, 2020

DegustatoR said:
OpenCL on NV h/w has never been very good, so we still need more data.

I think we kind of expected this type of utilization though, because they doubled the FP32 ALUs without doubling anything else. I would not expect this to improve to 90% or anything like that.

Cyan · Sep 7, 2020

techuse said:
Not true. Supported modes are only 128+0 or 64+64.

What do you mean? Sorry, I don't understand. Do you mean that if a CUDA core is going to perform both INT32 and FP32 ops at the same time, the resources of that CUDA core are halved? Or that it isn't as efficient?

AFAIK, native integer cores appeared for the first time in Turing, and that's why only Turing graphics cards and newer ones allow the famous Integer Scaling.

[Image: GeForce Game Ready Driver control panel showing the Integer Scaling option]

trinibwoy · Sep 7, 2020

Love_In_Rio said:
Aren't AMD's ALUs now technically more complete than Nvidia's? All of them can do int ops, whereas in Ampere only half of them can.

Yes, same as Pascal. I think the prevailing assumption is that those combined FP and INT pipelines share transistors to some extent so one pipeline isn’t completely idle. No idea if that’s true though.

Nvidia clearly thought it made sense to split the FP and INT functionality into separate pipelines for whatever reason.

CarstenS · Sep 7, 2020

Digidi said:
Can somebody explain to me why they built FP32 + (INT32 or FP32) instead of FP32 + INT32 + FP32?
They have all three units on the chip, so why not use them all at the same time?
Is the scheduling that much more complicated? Maybe an issue with cache size and latency?

FP32+FP32+INT32 is physically bigger than FP32+[FP32|INT32], it needs another real datapath inside the SM and the three units could not be scheduled at the same time with the current scheduler. Remember, the FP32-INT32 SIMD is not physically separated but combined in one SIMD.
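To make the datapath argument concrete, here is a rough toy model of how many clocks one SM needs to issue a given mix of FP32 and INT32 lane-operations. It is only a sketch under simplifying assumptions: 128 shared FP/INT lanes per SM for Pascal, 64 FP32 + 64 INT32 lanes for Turing, and 64 FP32 + 64 FP32-or-INT32 lanes for Ampere, with bandwidth, scheduling limits, and latency all ignored. The function name `clocks_per_sm` and the model itself are illustrative, not anything Nvidia publishes.

```python
def clocks_per_sm(fp_ops: float, int_ops: float, arch: str) -> float:
    """Toy estimate of clocks needed by one SM to issue a mix of lane-ops.

    Assumed lane counts per SM (simplified, ignores all other limits):
      pascal: 128 lanes, each able to do FP32 or INT32
      turing: 64 FP32-only lanes + 64 INT32-only lanes
      ampere: 64 FP32-only lanes + 64 lanes that do FP32 or INT32
    """
    if arch == "pascal":
        # Every lane takes either kind of op, so only the total matters.
        return (fp_ops + int_ops) / 128
    if arch == "turing":
        # FP and INT run on separate datapaths; the longer queue dominates.
        return max(fp_ops / 64, int_ops / 64)
    if arch == "ampere":
        # INT is confined to the shared datapath; FP can use both.
        return max((fp_ops + int_ops) / 128, int_ops / 64)
    raise ValueError(f"unknown architecture: {arch}")


if __name__ == "__main__":
    # Illustrative mix only: 100 FP32 ops and 36 INT32 ops.
    for arch in ("pascal", "turing", "ampere"):
        print(arch, clocks_per_sm(100, 36, arch))
```

In this model Ampere and Pascal give the same clock count for any mix where INT32 is no more than half of the work, which is the "INT32 competes for issue slots" behaviour described earlier in the thread, while Turing only matches them when the mix is exactly 1:1.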

techuse · Sep 7, 2020

Cyan said:
What do you mean? Sorry, I don't understand. Do you mean that if a CUDA core is going to perform both INT32 and FP32 ops at the same time, the resources of that CUDA core are halved? Or that it isn't as efficient?

AFAIK, native integer cores appeared for the first time in Turing, and that's why only Turing graphics cards and newer ones allow the famous Integer Scaling.

I was wrong. If I understood the replies correctly, it can be any number of INT instructions in multiples of 16, so 112+16, 80+48, etc. The Computerbase article had me thinking it was either 128 FP or 64 FP + 64 INT, with no other combinations.
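As a quick sketch of the combinations described above, the snippet below enumerates the per-clock FP32+INT32 splits for one SM. It assumes 128 lanes in total, INT32 work in steps of 16 lanes, and at most 64 INT32-capable lanes; the constant and function names are made up for illustration.

```python
LANES_PER_SM = 128     # total FP32-capable lanes per SM (assumed)
INT_CAPABLE_LANES = 64 # lanes that can alternatively run INT32 (assumed)
STEP = 16              # granularity of the split, per the replies in the thread


def issue_mixes() -> list[tuple[int, int]]:
    """Return (fp32_lanes, int32_lanes) pairs for each allowed split."""
    return [
        (LANES_PER_SM - int_lanes, int_lanes)
        for int_lanes in range(0, INT_CAPABLE_LANES + 1, STEP)
    ]


if __name__ == "__main__":
    for fp_lanes, int_lanes in issue_mixes():
        print(f"{fp_lanes}+{int_lanes}")
```

Under those assumptions the list runs from 128+0 down to 64+64 and includes the 112+16 and 80+48 cases mentioned in the post.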

DavidGraham · Sep 7, 2020

trinibwoy said:
Nvidia clearly thought it made sense to split the FP and INT functionality into separate pipelines for whatever reason

It helps a lot in ray tracing, even in software ray tracing. This is why GTX Turing performs better than Pascal in ray tracing despite having fewer resources, and it is also why Turing surpasses both RDNA 1 and Pascal in software ray tracing (such as in CryEngine and World of Tanks).

Jawed · Sep 8, 2020

trinibwoy said:
Nvidia clearly thought it made sense to split the FP and INT functionality into separate pipelines for whatever reason.

This might be cost optimisation: the fixed overheads of an SM partition such as the instruction decoder can be scaled up for relatively little cost while getting a large increase in theoretical throughput. Presumably the fixed overheads grew due to the addition of tensor cores, so then the step to make these two instructions co-issue in Turing seems low cost (well, then there's the compiler). The addition of tensor cores forced them to add overhead and so normal shader instruction processing saw a big change?

Deleted member 2197 · Sep 8, 2020

Corsair has announced they are developing a custom cable that will be fully compatible with all Type 3 and Type 4 CORSAIR modular power supplies to allow for a clean connection. The new cable connects two PCIe / CPU PSU ports directly to the new 12-pin connector and is currently undergoing development and testing. Corsair also now recommends a PSU rating of 850 watts or higher for the RTX 3090. The Corsair 12-pin cable should be available for sale by September 17th, the same day as the RTX 30 series cards. Pricing wasn't announced, but you can sign up to be notified.

https://www.techpowerup.com/271907/corsair-working-on-direct-12-pin-nvidia-ampere-power-cable

trinibwoy · Sep 8, 2020

Jawed said:
This might be cost optimisation: the fixed overheads of an SM partition such as the instruction decoder can be scaled up for relatively little cost while getting a large increase in theoretical throughput. Presumably the fixed overheads grew due to the addition of tensor cores, so then the step to make these two instructions co-issue in Turing seems low cost (well, then there's the compiler). The addition of tensor cores forced them to add overhead and so normal shader instruction processing saw a big change?

That's a reasonable take. End result seems quite unbalanced though.

CarstenS · Sep 8, 2020

fellix · Sep 8, 2020

CarstenS said:
Quick question: What would happen, if there were more schedulers for feeding more of [FP32|FP32+INT32|TC|L/S|SFU] per clock (i.e. 3 schedulers/dispatch) with the occasional [RT|TMU] thrown in the mix?
Right, Powaaahhhh!

Fermi will happen.

trinibwoy · Sep 8, 2020

CarstenS said:
Quick question: What would happen, if there were more schedulers for feeding more of [FP32|FP32+INT32|TC|L/S|SFU] per clock (i.e. 3 schedulers/dispatch) with the occasional [RT|TMU] thrown in the mix?
Right, Powaaahhhh!

There probably wouldn't be any real benefit. The TMU, SFU + L/S pipelines execute over many clocks.

CarstenS · Sep 8, 2020

trinibwoy said:
There probably wouldn't be any real benefit. The TMU, SFU + L/S pipelines execute over many clocks.

I don't understand what's so unbalanced then?

trinibwoy · Sep 8, 2020

CarstenS said:
I don't understand what's so unbalanced then?

Flops vs everything else (bandwidth, geometry, RT)

fellix · Sep 8, 2020

CarstenS said:
I don't understand what's so unbalanced then?

Kepler: hold my underfed 192 FMA lanes...

Deleted member 2197 · Sep 8, 2020

Here we go again ...

Ethereum Miners Eye NVIDIA’s RTX 30 Series GPU as RTX 3080 Offers 3-4x Better Performance in Eth

Images have surfaced on China’s Baidu forums showing crypto-miners hoarding NVIDIA’s new GeForce RTX 3080 graphics cards in the dozens. Considering that the launch date is still several days away, it’s a surprise that these miners were able to get their hands on so many units:
...
In the above image, you can see a mining farm using up to 8 GeForce RTX 3080 cards, possibly the iChill variant from Inno3D. As per the miners, the RTX 3080 is nearly 3-4x faster than the RTX 2080 in terms of Ethereum mining capabilities. You’re looking at 115 Mh/s, while the RTX 2080 manages just about 30-40 Mh/s.

https://www.hardwaretimes.com/ether...x-3080-offers-3-4x-better-performance-in-eth/
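As a sanity check on the "3-4x" headline, the ratio can be worked out directly from the hash rates quoted in the article (115 Mh/s for the RTX 3080 against 30-40 Mh/s for the RTX 2080). The helper name `speedup_range` is just for illustration, and the figures themselves are the article's unverified claims.

```python
def speedup_range(new_rate: float, old_low: float, old_high: float) -> tuple[float, float]:
    """Return (min, max) speedup of new_rate over an old rate given as a range."""
    # The smallest speedup comes from comparing against the fastest old figure.
    return new_rate / old_high, new_rate / old_low


if __name__ == "__main__":
    low, high = speedup_range(115, 30, 40)
    print(f"{low:.2f}x to {high:.2f}x")
```

With those inputs the range comes out a little under 3x at the low end and a little under 4x at the high end, so "3-4x" is a slightly generous rounding of the quoted numbers.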

BRiT · Sep 8, 2020

Soon you'll look at that $1500 price tag as being a bargain compared to the $2400 or higher... Le Sigh.
