Nvidia Ampere Discussion [2020-05-14]

concurrent execution: 36x INT32 for every 100 FP32 over a variety of gaming workloads
Yeah, which means that they can save power when there's enough INT32 code in a warp to do just 16 FP32 and 16 INT32. When there isn't, going with 32 FP32 will net you anywhere from 3% to 100% more performance.

Going FP32+[FP32|INT32] would obviously reduce performance compared to FP32+FP32+INT32.
You can't go with FP32+FP32+INT32 since warps are 32 wide - there isn't enough data to fill 48 SIMD lanes. Which is why there were always two choices for Ampere: either a double-width FP32 SIMD instead of Turing's 16-wide FP32+INT32, or what we seemingly got - 16-wide SIMDs capable of both FP32 and INT32, but only 32 lanes total per clock, as 32 FP32 or 16 FP32 + 16 INT32. H/w can handle this in three ways, really: either 2 SIMDs capable of FP32/INT32, or one FP32 and one FP32/INT32, or 3 SIMDs with INT32 still being a dedicated one. The latter option is IMO the closest to your typical NV h/w design choices.
 
I was confused by the mention of data paths at first, too. But I think he just meant the blocks the warps are assigned to.
FP32+[FP32|INT32] is still my go-to choice, based on the following reasoning from the available material:
1st: https://www.nvidia.com/content/dam/...ure/NVIDIA-Turing-Architecture-Whitepaper.pdf
Page 13, concurrent execution: 36x INT32 for every 100 FP32 over a variety of gaming workloads

Going FP32+[FP32|INT32] would obviously reduce performance compared to FP32+FP32+INT32.

2nd:
Chart at timestamp 19:05 shows 3080 vs. 2080 performance in 4K, which for the games, i.e. workloads using the aforementioned mix, is at roughly 1.6x to 1.7x.

200 (2x FP32) - 36 (INT32 share) is 164, which is almost exactly where the perf seems to be.

You are comparing a 3080, which has vastly more CUDA cores, with a 2080.

A comparison of an RTX 3070 with an RTX 2080 would be more interesting (2944 vs 2944 FP32). There, the rasterization performance gain is a lot smaller.
 
Yeah, which means that they can save power when there's enough INT32 code in a warp to do just 16 FP32 and 16 INT32. When there isn't, going with 32 FP32 will net you anywhere from 3% to 100% more performance.


You can't go with FP32+FP32+INT32 since warps are 32 wide - there isn't enough data to fill 48 SIMD lanes. Which is why there were always two choices for Ampere: either a double-width FP32 SIMD instead of Turing's 16-wide FP32+INT32, or what we seemingly got - 16-wide SIMDs capable of both FP32 and INT32, but only 32 lanes total per clock, as 32 FP32 or 16 FP32 + 16 INT32. H/w can handle this in three ways, really: either 2 SIMDs capable of FP32/INT32, or one FP32 and one FP32/INT32, or 3 SIMDs with INT32 still being a dedicated one. The latter option is IMO the closest to your typical NV h/w design choices.
Not sure if I follow you here. Weren't you the one who brought up FP32+FP32+INT32? And what's warp width got to do with it? Each warp goes into one block in its entirety until the instruction is completed, which is normally two clocks, since SIMDs are 16 wide and warps have 32 items.

Btw, do we know if gaming Ampere will keep GA100's TF32 support on its tensor cores?
Since it's also supporting the Sparsity thingie, it's highly likely IMHO.
 
You are comparing a 3080, which has vastly more CUDA cores, with a 2080.

A comparison of an RTX 3070 with an RTX 2080 would be more interesting (2944 vs 2944 FP32). There, the rasterization performance gain is a lot smaller.
Nvidia is doing these comparisons, not me. It's the only data point from an official source so far. Additionally, I'm comparing not the products but the theoretical perf increase (2.7x FP32) vs. what's seen in games (~1.6-1.7x).
 
200 (2x FP32) - 36 (INT32 share) is 164, which is almost exactly where the perf seems to be.

That's not entirely correct, because it's 36 INT per 100 FP - and it's not -72 either, because that would only be the case if the total were 200+72. For 200 total, the actual mix is 147 and 53; you can use the rule of three to confirm that it's the same ratio as 100:36.
And I guessed it easily because I actually kind of arrived at that several pages ago in my own investigation. Basically, both Turing and Ampere can do 200 ops over the same period of cycles, but Turing only actually does 136 in your typical game. So 200 / 136 = 1.47.
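A quick sanity check of that arithmetic in Python (just a sketch of the rule-of-three math above; the 36:100 ratio is from the Turing whitepaper, the 200-op total is the doubled issue rate being discussed):

```python
# Rule-of-three check: split 200 total ops in the same 100:36 ratio
# as the Turing whitepaper's FP32:INT32 gaming mix.
total = 200
ratio_int_per_fp = 36 / 100

fp32 = total / (1 + ratio_int_per_fp)   # ~147
int32 = total - fp32                    # ~53
print(f"FP32: {fp32:.0f}, INT32: {int32:.0f}")  # FP32: 147, INT32: 53

# Turing does 100 FP32 + 36 INT32 = 136 useful ops in the same window,
# so the expected uplift on this mix is:
print(f"uplift: {total / 136:.2f}x")    # 1.47x
```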
 
Not sure if I follow you here. Weren't you the one who brought up FP32+FP32+INT32? And what's warp width got to do with it? Each warp goes into one block in its entirety until the instruction is completed, which is normally two clocks, since SIMDs are 16 wide and warps have 32 items.
As a h/w config, yeah, but not as what gets scheduled per clock, since that can't be more than 32 lanes due to the warp width being 32. Two 16-wide units can both be kept busy because they run 32-wide warps: the scheduler fills each unit on alternating cycles while the other unit is executing what was scheduled the previous clock. With three 16-wide units you have a scheduling issue, as one of them will be idle on any given clock cycle with 32-wide warps. Hence why it's 32 FP32 or 16 FP32 + 16 INT32 and not 16+16+16.
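Here's a toy Python model of that scheduling argument (my own illustration, not anything from NVIDIA documentation): one scheduler issues at most one 32-wide warp instruction per clock, and each 16-wide SIMD is then occupied for two clocks.

```python
# Toy model: a scheduler issues one 32-wide warp instruction per clock
# to the first free 16-wide unit; each issue occupies a unit for two
# clocks (32 items / 16 lanes).
def utilization(num_units, clocks=1000, warp_clocks=2):
    busy_until = [0] * num_units        # clock at which each unit frees up
    busy_total = 0
    for t in range(clocks):
        for u in range(num_units):      # issue to the first free unit
            if busy_until[u] <= t:
                busy_until[u] = t + warp_clocks
                break
        busy_total += sum(b > t for b in busy_until)
    return busy_total / (clocks * num_units)

print(utilization(2))  # ~1.00 -> two 16-wide units stay fully fed
print(utilization(3))  # ~0.67 -> a third unit idles every clock
```

With one issue per clock and two clocks of occupancy per warp, at most two 16-wide units can ever be busy, which is the point being made.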

Since it's also supporting the Sparsity thingie, it's highly likely IMHO.
Would be interesting to see if they'll be able to fit something gaming (graphics) related onto it.
 
That's not entirely correct, because it's 36 INT per 100 FP - and it's not -72 either, because that would only be the case if the total were 200+72. For 200 total, the actual mix is 147 and 53; you can use the rule of three to confirm that it's the same ratio as 100:36.
And I guessed it easily because I actually kind of arrived at that several pages ago in my own investigation. Basically, both Turing and Ampere can do 200 ops over the same period of cycles, but Turing only actually does 136 in your typical game. So 200 / 136 = 1.47.
You're right, thanks for the correction! So it's roughly 12% above that 1.47, which means other factors like additional ROPs and more L1 bandwidth play their roles too.
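In numbers (taking ~1.65x as the midpoint of the quoted 1.6-1.7x range, my assumption):

```python
# Residual gain beyond the 1.47x issue-rate uplift, assuming a ~1.65x
# measured midpoint; the remainder is ROPs, L1 bandwidth, clocks, etc.
print(f"{1.65 / 1.47:.2f}x")  # ~1.12x, i.e. roughly 12% on top
```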

Hence why it's 32 FP32 or 16 FP32 + 16 INT32 and not 16+16+16.
Never argued that point; in fact, I said the same: FP32+[FP32 or INT32], 16 wide each, of course.
 
Lenovo may have leaked the existence of an RTX 3070 Super/Ti SKU with 16GB of RAM:
https://www.tweaktown.com/news/7491...rce-rtx-3070-ti-rocks-16gb-of-vram/index.html

It wouldn't surprise me if Nvidia are also prepping a 20GB equipped RTX 3080 SKU just in case RDNA2 performs better than expected.

And there we are. Could've done it from the get-go, but no, gotta eke out that little bit of extra profit from the earliest adopters. Better to wait, too; I'd expect the 6700 XT or whatever to be around the same performance as a 3070 for $400-500, depending on just how competitive it is.

Do wonder if we'll see a 3080 Ti or whatever as well, with another 8GB stuck in. Since I'd expect "Big Navi", at least in the lower-end scenario, to compete with a 3080 while carrying at least 12GB of RAM, if not 24, it'd look better for Nvidia to have the RAM numbers.
 
What does this actually mean? Instead of a crossbar for data transfers between the GPCs and ROPs, that traffic now has to route through GPCs?

"Each GPC includes a dedicated Raster Engine, and now also includes two ROP partitions (each partition containing eight ROP units), which is a new feature for NVIDIA Ampere Architecture GA10x GPUs."

Also, how is the combined FP32/INT32 pipeline any different to Pascal/Maxwell/Kepler etc? This doesn't sound like a new thing.

Doing a bit of ROP math, assuming GA102 has 84 SMs total.

96 ROPs = 6 GPCs = 14 SMs per GPC or
192 ROPs = 12 GPCs = 7 SMs per GPC.

If it's the latter, that's going to be a chunky increase in raw rasterization throughput.

There is good evidence that there are 7 GPCs; see my previous post.
That would mean 7 * 16 ROPs = 112 ROPs for a full GA102

I'm still very curious about the number of TMUs.
As L1 cache bandwidth has doubled, there is a good chance the TMUs have also doubled from 4 to 8 per SM.
That would mean a whopping 8*84 = 672 TMUs.

Very impressive, aside from the power consumption.
From the die photo, it looks like 7 GPCs x 12 SMs = 84 SMs
No word on TMUs, but rumors are 8 per SM.
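The unit math in one place (a sketch; the 7-GPC layout, the two 8-unit ROP partitions per GPC, and the 8 TMUs per SM are the speculated figures from this thread, not confirmed specs):

```python
# Speculated full-GA102 config from the die-shot reading in this thread.
gpcs = 7
sms_per_gpc = 12
rops_per_gpc = 2 * 8            # two ROP partitions of 8 per GPC (GA10x)
tmus_per_sm = 8                 # rumored, doubled from Turing's 4

sms = gpcs * sms_per_gpc        # 84
rops = gpcs * rops_per_gpc      # 112
tmus = sms * tmus_per_sm        # 672
print(sms, rops, tmus)          # 84 112 672
```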
 
[image: w9KLtU9.png]
Is this for real? lol, wow, that's big.
 
There is good evidence that there are 7 GPCs; see my previous post.
That would mean 7 * 16 ROPs = 112 ROPs for a full GA102
Nvidia has had a trend of increasing the number of multiprocessors per GPC since Kepler (or of keeping the GPC count hard-capped at six), so GA102's config is more likely 6*14, with 2 MPs disabled for the RTX 3090 SKU.
 
Vega 56 is almost a 27 cm card, so the 3080 is at worst 2 cm longer.
Well, thank you. I was going to go get my measuring tape and take a gander, but I don't have to now.

Only thing worrying me with the 3080 is the 10 gigs.

My guess is that if AMD is competitive with Navi 2, we might see a 3080 Ti with 20 gigs?
 
Nvidia has had a trend of increasing the number of multiprocessors per GPC since Kepler (or of keeping the GPC count hard-capped at six), so GA102's config is more likely 6*14, with 2 MPs disabled for the RTX 3090 SKU.

Look at the die: there are 7 columns of 12 SMs, i.e. 7 GPCs.

[image: geforce-rtx-ampere-410-dl.jpg]
 