Nvidia Ampere Discussion [2020-05-14]

PSman1700 · Sep 2, 2020

Davros said:
If no one's posted it: Digital Foundry benchmarked a 3080 (on an Intel board, so no PCIe 4.0). It performed 70-80% faster than a 2080 in non-raytracing scenarios.

And that was across the board for a range of current modern games. Some games saw a 100% or more improvement.

DegustatoR · Sep 2, 2020

CarstenS said:
concurrent execution: 36x INT32 for every 100 FP32 over a variety of gaming workloads

Yeah, which means that they can save power when there's enough INT32 code in a warp to do just 16 FP32 and 16 INT32. When there isn't, going with 32 FP32 will net you anywhere from 3 to 100% more performance.

CarstenS said:
Going FP32+[FP32|INT32] would obviously reduce performance compared to FP32+FP32+INT32.

You can't go with FP32+FP32+INT32 since warps are 32 wide - there isn't enough data to fill 48 SIMD lanes. Which is why there were always two choices for Ampere: either a double-width FP32 SIMD instead of Turing's 16-wide FP32+INT32, or what we seemingly got in the form of 16-wide SIMDs capable of both FP32 and INT32 but only 32 lanes total per clock - as 32 FP32 or 16 FP32 + 16 INT32. H/w can handle this in three ways really: either 2 SIMDs capable of FP32/INT32, or one FP32 and one FP32/INT32, or 3 SIMDs with INT32 still being a dedicated one. The last option is IMO the closest to your typical NV h/w design choices.

Pinstripe · Sep 2, 2020

CarstenS said:
I was confused at the mentioning of data paths at first, too. But I think he just meant the blocks the Warps are assigned to.
FP32+[FP32|INT32] is still my go-to choice with the following reasoning from available material:
1st: https://www.nvidia.com/content/dam/...ure/NVIDIA-Turing-Architecture-Whitepaper.pdf
Page 13, concurrent execution: 36x INT32 for every 100 FP32 over a variety of gaming workloads

Going FP32+[FP32|INT32] would obviously reduce performance compared to FP32+FP32+INT32.

Evidence 2:

Chart at timestamp 19:05 shows 3080 vs. 2080 perf in 4k which for the games, i.e. workloads using aforementioned mix, is at roughly 1.6x to 1.7x

200 (2x FP32) - 36 (INT32 share) is 164 and almost exactly where the perf seems to be at.

You are comparing a 3080 having vastly more CUDA cores with a 2080.

A comparison of a RTX 3070 with a RTX 2080 would be more interesting (2944 vs 2944 FP32). There, the rasterization performance gain is a lot smaller.

CarstenS · Sep 2, 2020

DegustatoR said:
Yeah, which means that they can save power when there's enough INT32 code in a warp to do just 16 FP32 and 16 INT32. When there isn't, going with 32 FP32 will net you anywhere from 3 to 100% more performance.

You can't go with FP32+FP32+INT32 since warps are 32 wide - there isn't enough data to fill 48 SIMD lanes. Which is why there were always two choices for Ampere: either a double-width FP32 SIMD instead of Turing's 16-wide FP32+INT32, or what we seemingly got in the form of 16-wide SIMDs capable of both FP32 and INT32 but only 32 lanes total per clock - as 32 FP32 or 16 FP32 + 16 INT32. H/w can handle this in three ways really: either 2 SIMDs capable of FP32/INT32, or one FP32 and one FP32/INT32, or 3 SIMDs with INT32 still being a dedicated one. The last option is IMO the closest to your typical NV h/w design choices.

Not sure if I follow you here. Weren't you the one who brought up FP32+FP32+INT32? And what's warp width got to do with it? Each warp goes into one block in its entirety until the instruction is completed, which normally takes two clocks, since SIMDs are 16-wide and warps have 32 items.

DegustatoR said:
Btw, do we know if gaming Ampere will keep TF32 support of GA100 on its tensor cores?

Since it's also supporting the Sparsity thingie, it's highly likely IMHO.

CarstenS · Sep 2, 2020

Pinstripe said:
You are comparing a 3080 having vastly more CUDA cores with a 2080.

A comparison of a RTX 3070 with a RTX 2080 would be more interesting (2944 vs 2944 FP32). There, the rasterization performance gain is a lot smaller.

Nvidia is doing these comparisons, not me. It's the only data point from an official source so far. Additionally, I'm not comparing the products, but the theoretical perf increase (2.7x FP32) vs. what's seen in games (~1.6-1.7x).

Benetanegia · Sep 2, 2020

CarstenS said:
200 (2x FP32) - 36 (INT32 share) is 164 and almost exactly where the perf seems to be at.

That's not entirely correct because it's 36 INT per 100 FP, and it's not minus 72 either, because that would be the case if the total number was 200+72. For 200 total, the actual mix is 147 FP and 53 INT; you can use the rule of three to confirm that it's the same ratio as 100:36.
And I guessed it easily because I actually kind of arrived at that several pages ago, in my own investigation. Basically both Turing and Ampere can do 200 ops over the same period of cycles, but Turing only actually does 136 in your typical game. So 200 / 136 = 1.47.
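
The arithmetic above can be checked with a short sketch. It assumes the simplified model used in this thread: Turing has one FP32 and one INT32 path, so over a window in which each path could do 100 ops it only retires 100 FP32 + 36 INT32, while Ampere can keep all 200 slots busy with the same 100:36 mix. The function name and the model are illustrative assumptions, not anything published by Nvidia.

```python
def ampere_vs_turing(int_per_100_fp=36.0, slots=200.0):
    """Toy throughput model; every name here is an illustrative assumption."""
    fp_share = 100.0 / (100.0 + int_per_100_fp)
    # Ampere, as modelled in this thread: all issue slots are usable and are
    # split between FP32 and INT32 according to the workload mix.
    ampere_fp = slots * fp_share
    ampere_int = slots - ampere_fp
    # Turing: the FP32 path is saturated, the INT32 path only receives
    # int_per_100_fp ops of work for every 100 FP32 ops.
    turing_ops = slots / 2 + (slots / 2) * int_per_100_fp / 100.0
    return ampere_fp, ampere_int, slots / turing_ops


fp, integer, speedup = ampere_vs_turing()
print(f"mix: {fp:.0f} FP32 + {integer:.0f} INT32, speedup: {speedup:.2f}x")
```

With the whitepaper's 36-per-100 figure this gives the 147/53 split and the 1.47x ratio quoted above; changing `int_per_100_fp` shows how sensitive the estimate is to the assumed mix.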

DegustatoR · Sep 2, 2020

CarstenS said:
Not sure if I follow you here. Weren't you the one who brought up FP32+FP32+INT32? And what's warp width got to do with it? Each warp goes into one block in its entirety until the instruction is completed, which normally takes two clocks, since SIMDs are 16-wide and warps have 32 items.

As a h/w config, yeah, but not as what gets scheduled per clock, as this can't be more than 32 lanes due to warp width being 32. Two 16-wide units can both be kept busy because they are running 32-wide warps: the scheduler fills them on alternating clocks, each one while the other is still executing what was scheduled in the previous clock. With three 16-wide units you have a scheduling issue, as one of them will be idle on any given clock cycle with 32-wide warps. Hence why it's 32 FP32 or 16 FP32 + 16 INT32 and not 16+16+16.
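
A toy simulation of that scheduling argument, for anyone who wants to play with the numbers. It assumes exactly what is described above and nothing more: one scheduler that dispatches at most one warp instruction per clock, and a 32-wide warp occupying a 16-wide unit for two clocks. Real hardware is certainly more complicated; the function name and parameters are hypothetical.

```python
def simd_utilization(num_units, clocks=1000, warp_width=32, simd_width=16):
    """Fraction of unit-clocks spent busy with one warp issue per clock."""
    cycles_per_instr = warp_width // simd_width  # 2 clocks for a 32-wide warp
    busy_until = [0] * num_units  # first clock at which each unit is free again
    busy_unit_clocks = 0
    for clock in range(clocks):
        # Assumption: the scheduler dispatches at most one warp instruction
        # per clock, to the first unit that is free.
        for unit in range(num_units):
            if busy_until[unit] <= clock:
                busy_until[unit] = clock + cycles_per_instr
                break
        # Count how many units are executing during this clock.
        busy_unit_clocks += sum(1 for t in busy_until if t > clock)
    return busy_unit_clocks / (num_units * clocks)


for units in (2, 3):
    print(f"{units} x 16-wide units: {simd_utilization(units):.1%} busy")
```

Under these assumptions two units stay essentially fully occupied, while a third unit adds nothing because there is never a spare issue slot for it, so utilization drops to about two thirds.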

CarstenS said:
Since it's also supporting the Sparsity thingie, it's highly likely IMHO.

Would be interesting to see if they'll be able to fit something gaming (graphics) related onto it.

CarstenS · Sep 2, 2020

Benetanegia said:
That's not entirely correct because it's 36 INT per 100 FP, and it's not minus 72 either, because that would be the case if the total number was 200+72. For 200 total, the actual mix is 147 FP and 53 INT; you can use the rule of three to confirm that it's the same ratio as 100:36.
And I guessed it easily because I actually kind of arrived at that several pages ago, in my own investigation. Basically both Turing and Ampere can do 200 ops over the same period of cycles, but Turing only actually does 136 in your typical game. So 200 / 136 = 1.47.

You're right, thanks for the correction! So it's roughly 12% above that 1.47, which means other factors like additional ROPs and more L1 bandwidth play their roles too.

DegustatoR said:
Hence why it's 32 FP32 or 16 FP32 + 16 INT32 and not 16+16+16.

Never argued that point, in fact, I said it as well: FP32+[FP32 or INT32] with 16-wide each of course.

Wesker · Sep 3, 2020

Lenovo may have leaked the existence of an RTX 3070 Super/Ti SKU with 16GB of RAM:
https://www.tweaktown.com/news/7491...rce-rtx-3070-ti-rocks-16gb-of-vram/index.html

It wouldn't surprise me if Nvidia are also prepping a 20GB equipped RTX 3080 SKU just in case RDNA2 performs better than expected.

Scott_Arm · Sep 3, 2020

I expect the 3080 Ti will be a 3080 with 20 GB and very few other improvements. They can't totally cannibalize the 3090, so it'll probably be a 3080 with 20GB for like $1K.

Frenetic Pony · Sep 3, 2020

Wesker said:
Lenovo may have leaked the existence of an RTX 3070 Super/Ti SKU with 16GB of RAM:
https://www.tweaktown.com/news/7491...rce-rtx-3070-ti-rocks-16gb-of-vram/index.html

It wouldn't surprise me if Nvidia are also prepping a 20GB equipped RTX 3080 SKU just in case RDNA2 performs better than expected.

And there we are, could've done it from the get go but no, gotta eke out that little bit of extra profit from the earliest adopters. Better to wait, too: I'd expect the 6700xt or whatever to be around the same performance as a 3070 for $400-500, depending on just how competitive it is.

Do wonder if we'll see a 3080ti or whatever as well, another 8gb stick in. Since I'd expect "Big Navi", at least in the lower end scenario, to compete with a 3080 with at least 12gb of ram, if not 24, it'd look better for Nvidia to have the ram numbers.

Voxilla · Sep 3, 2020

trinibwoy said:
What does this actually mean? Instead of a crossbar for data transfers between the GPCs and ROPs, that traffic now has to route through GPCs?

"Each GPC includes a dedicated Raster Engine, and now also includes two ROP partitions (each partition containing eight ROP units), which is a new feature for NVIDIA Ampere Architecture GA10x GPUs."

Also, how is the combined FP32/INT32 pipeline any different to Pascal/Maxwell/Kepler etc? This doesn't sound like a new thing.

Doing a bit of ROP math assuming GA102 has 84SMs total.

96 ROPs = 6 GPCs = 14 SMs per GPC or
192 ROPs = 12 GPCs = 7 SMs per GPC.

If it's the latter that's going to be a chunky increase in raw rasterization throughput.

There is good evidence that there are 7 GPCs, see my previous post.
That would mean 7 * 16 ROPs = 112 ROPs for a full GA102

I'm still very curious about the number of TMUs.
As L1 cache bandwidth has doubled, there is a good chance TMUs have also doubled from 4 to 8 per SM.
That would mean a whopping 8*84 = 672 TMUs.
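
For comparing the GPC layouts floated in this thread, here is a small sketch of the same ROP/TMU math. It assumes 16 ROPs per GPC (two partitions of eight, per the quoted Nvidia text) and 84 SMs for a full GA102; the TMU-per-SM values are the rumored 8 and the previous 4, neither confirmed. The function name is made up for illustration.

```python
ROPS_PER_GPC = 16  # two partitions of eight ROPs per GPC, per the quoted text
TOTAL_SMS = 84     # assumed SM count of a full GA102, as used in this thread


def ga102_config(gpcs, tmus_per_sm):
    """Return (SMs per GPC, total ROPs, total TMUs) for a hypothetical layout."""
    if TOTAL_SMS % gpcs:
        raise ValueError(f"{TOTAL_SMS} SMs do not divide evenly into {gpcs} GPCs")
    return TOTAL_SMS // gpcs, gpcs * ROPS_PER_GPC, TOTAL_SMS * tmus_per_sm


for gpcs in (6, 7, 12):
    for tmus in (4, 8):
        sms, rops, total_tmus = ga102_config(gpcs, tmus)
        print(f"{gpcs} GPCs x {sms} SMs: {rops} ROPs, {total_tmus} TMUs at {tmus}/SM")
```

The 7-GPC case with 8 TMUs per SM reproduces the 112 ROPs and 672 TMUs above; the 6- and 12-GPC cases give the 96 and 192 ROPs from the quoted post.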

Voxilla said:
Very impressive, aside from the power consumption.
From the die photo, it looks like 7 GPCs x 12 SMs = 84 SMs
No word on TMUs, but rumors are 8 per SM.

eastmen · Sep 3, 2020

this for real ? lol wow thats big

Scott_Arm · Sep 3, 2020

eastmen said:
this for real ? lol wow thats big

3080 is a 28cm card, so looks right. RTX2080 and GTX1080 were 26cm long.

eastmen · Sep 3, 2020

Scott_Arm said:
3080 is a 28cm card, so looks right. RTX2080 and GTX1080 were 26cm long.

sigh i have a vega 56. I hope a 3080 fits my case lol

fellix · Sep 3, 2020

Voxilla said:
There is good evidence that there are 7 GPCs, see my previous post.
That would mean 7 * 16 ROPs = 112 ROPs for a full GA102

Nvidia has tended to increase the number of multi-processors per GPC since Kepler (or keep the GPC unit count hard capped at six), so GA102's config is more likely 6*14 with 2 MPs disabled for the RTX3090 SKU.

Scott_Arm · Sep 3, 2020

eastmen said:
sigh i have a vega 56. I hope a 3080 fits my case lol

Vega 56 is almost a 27cm card, so 3080 is at worst 2 cm longer.

Edit: Yah Vega 56 is 26.8cm and 3080 is 28.5 cm.

eastmen · Sep 3, 2020

Scott_Arm said:
Vega 56 is almost a 27cm card, so 3080 is at worst 2 cm longer.

well thank you. I was going to go get my measuring tape and take a gander but i don't have to now.

Only thing worrying me with the 3080 is the 10gigs.

my guess is if AMD is competitive with navi 2 we might see a 3080ti with 20 gigs ?

Voxilla · Sep 3, 2020

fellix said:
Nvidia has tended to increase the number of multi-processors per GPC since Kepler (or keep the GPC unit count hard capped at six), so GA102's config is more likely 6*14 with 2 MPs disabled for the RTX3090 SKU.

Look at the die, there are 7 columns of 12 SMs, being 7 GPCs.

Scott_Arm · Sep 3, 2020

Seems weird to add so many ROPs. Bandwidth went up, but not that much. Unless they've improved their compression even further.
