Nvidia Ampere Discussion [2020-05-14]

With 60% higher bandwidth? I think that’s really unlikely.

That's a pretty big bump in bandwidth, no? Bandwidth also increased more than shader cores. I'm curious to see how the cache changed, if it changed much at all.

2080 Super 496 GB/s -> 3080 760 GB/s (53% increase)
2080 Super 3072 cuda cores -> 3080 4352 cuda cores (41.6% increase)
 
That's a pretty big bump in bandwidth, no? Bandwidth also increased more than shader cores. I'm curious to see how the cache changed, if it changed much at all.

2080 Super 496 GB/s -> 3080 760 GB/s (53% increase)
2080 Super 3072 cuda cores -> 3080 4352 cuda cores (41.6% increase)
19 FLOPS/byte is nothing out of this world. TU106 had similar characteristics, and TU104/Navi 10 were not far away at 22. Compared to TU102 at 24 the gap is a bit larger, I'll give you that.
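Spelling out the arithmetic behind those figures (all rumoured specs from this thread, nothing official; the TFLOPS value is an assumption picked to land near the ~19 FLOPS/byte quoted above), a quick Python sketch:

```python
# Rough arithmetic behind the rumoured 2080 Super -> 3080 comparison.

def pct_increase(old, new):
    """Percentage increase from old to new."""
    return (new / old - 1.0) * 100.0

bw_2080s, bw_3080 = 496, 760          # GB/s
cores_2080s, cores_3080 = 3072, 4352  # CUDA cores (rumoured for the 3080)

print(f"bandwidth: +{pct_increase(bw_2080s, bw_3080):.1f}%")       # ~53.2%
print(f"cores:     +{pct_increase(cores_2080s, cores_3080):.1f}%")  # ~41.7%

# Arithmetic intensity in FLOPS per byte of DRAM bandwidth.
# The TFLOPS figure is an assumption (4352 cores * 2 FLOP/clk * ~1.66 GHz ≈ 14.4 TFLOPS),
# chosen only to illustrate where a ~19 FLOPS/byte ratio would come from.
assumed_tflops_3080 = 14.4
flops_per_byte = assumed_tflops_3080 * 1e12 / (bw_3080 * 1e9)
print(f"FLOPS/byte: {flops_per_byte:.1f}")  # ~18.9, i.e. roughly the 19 quoted above
```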
 
Looking good. And only 2 slots too.
What is the use case of a 2-slot card versus say 3-slots? Are you thinking specifically of a small form factor case that literally constrains a graphics card to 2-slots?

If not, what else?

That's a pretty big bump in bandwidth, no? Bandwidth also increased more than shader cores. I'm curious to see how the cache changed, if it changed much at all.

2080 Super 496 GB/s -> 3080 760 GB/s (53% increase)
2080 Super 3072 cuda cores -> 3080 4352 cuda cores (41.6% increase)
If the tensor ALUs can be used for graphics shading, then these numbers are deceptive. At the same time, I expect tensor ALUs have very heavy constraints on which instructions can be issued (ADD, MUL and MAD) and how they're sequenced (dependencies).

Beyond that, I think we should expect "second generation" real time ray tracing to get a massive boost in performance. For example, there might be large benefits in moving ray queries around the GPU, so that they follow the data, rather than trying to get all the data to all the rays. This is purely my speculation, but I'd like to compare this with how NVidia fully parallelised geometry processing, which was a revolution for tessellation. And again with tile-based rasterisation. And again with render target compression.

Hardware algorithms to speed up well-defined bandwidth-eating monsters are the entire reason we have such nice graphics.

Bandwidth is always the enemy, if you're building graphics hardware you know this decades in advance. Plain accelerated BVH traversal only gets us to 1987. There's more than 30 years of good ideas since then to put into hardware :)
 
What is the use case of a 2-slot card versus say 3-slots? Are you thinking specifically of a small form factor case that literally constrains a graphics card to 2-slots?

If not, what else?

No practical use, just engineering curiosity. My last 2 cards have been 3-slot AIB behemoths.

Official looking Gainward slides claim 7nm for GA102. Still no word whether that's TSMC or Samsung but I would be extremely surprised if Nvidia gambled on an unproven EUV process.

My bet is TSMC.

[Attached image: Gainward GeForce RTX 3090 / GeForce RTX 3080 Phoenix Golden Sample custom graphics cards]
 
What is the use case of a 2-slot card versus say 3-slots? Are you thinking specifically of a small form factor case that literally constrains a graphics card to 2-slots?

If not, what else?
Many Mini-ITX cases allow for a 2-wide card, but not much more. These SFF thingies have been enjoying rising popularity for a couple of years now.
 
Beyond that, I think we should expect "second generation" real time ray tracing to get a massive boost in performance. For example, there might be large benefits in moving ray queries around the GPU, so that they follow the data, rather than trying to get all the data to all the rays. This is purely my speculation, but I'd like to compare this with how NVidia fully parallelised geometry processing, which was a revolution for tessellation. And again with tile-based rasterisation. And again with render target compression.

Hardware algorithms to speed up well-defined bandwidth-eating monsters are the entire reason we have such nice graphics.

Bandwidth is always the enemy, if you're building graphics hardware you know this decades in advance. Plain accelerated BVH traversal only gets us to 1987. There's more than 30 years of good ideas since then to put into hardware :)

GA102 has 15.8B more transistors than TU102. Even after accounting for the transistors needed for 17% more compute units and the 2nd-gen RT / 3rd-gen Tensor cores, there would be a whole TU106 (10.8B transistors) worth of budget left unused. So maybe they are really going all in on ray tracing?!
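To make that back-of-the-envelope budget explicit (the GA102 figure here is only the rumoured "+15.8B over TU102", not a confirmed spec), a rough sketch:

```python
# Back-of-the-envelope transistor budget. GA102's count is only the rumoured
# "+15.8B over TU102"; the TU102/TU106 counts are Nvidia's published figures.
tu102 = 18.6e9                    # transistors, TU102
tu106 = 10.8e9                    # transistors, TU106
ga102_rumoured = tu102 + 15.8e9   # ~34.4B, rumour only

# Naively scale TU102 up by the rumoured ~17% increase in compute units.
scaled_turing = tu102 * 1.17      # ~21.8B

# Even before budgeting anything for beefier RT/Tensor cores, the leftover
# is bigger than an entire TU106.
leftover = ga102_rumoured - scaled_turing
print(f"leftover: {leftover / 1e9:.1f}B (TU106 is {tu106 / 1e9:.1f}B)")  # ~12.6B vs 10.8B
```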
 
No practical use, just engineering curiosity. My last 2 cards have been 3-slot AIB behemoths.

Official looking Gainward slides claim 7nm for GA102. Still no word whether that's TSMC or Samsung but I would be extremely surprised if Nvidia gambled on an unproven EUV process.

My bet is TSMC.

[Attached image: Gainward GeForce RTX 3090 / GeForce RTX 3080 Phoenix Golden Sample custom graphics cards]
It also says HDMI 2.1 - finally.

Additionally, I wonder what became of the VirtualLink thingie with a USB-C outlet. That one added 30 watts of TBP last gen. Maybe that's why the messaging emphasises TGP.
 
GA102 has 15.8B more transistors than TU102. Even after accounting for the transistors needed for 17% more compute units and the 2nd-gen RT / 3rd-gen Tensor cores, there would be a whole TU106 (10.8B transistors) worth of budget left unused. So maybe they are really going all in on ray tracing?!

Yeah clearly a GA102 SM is much more powerful than a TU102 SM. I would be surprised if tensors are to blame though. 4K DLSS 2.0 only takes ~1.5ms on a 2080 Ti. There's no need for gaming Ampere to go nuts with tensor performance.

Of course it's going to be a combination of things. RT most definitely got an upgrade. Maybe they've doubled ROPs again to help use all that bandwidth.
 
I'm wondering whether there are "macros" that could run on the tensor ALUs and greatly contribute to BVH or other acceleration-structure traversal algorithms, e.g. sorting based on a prefix sum running on the tensor cores. It might then be worthwhile to add some functionality/memory/data-networks to the tensor ALUs for these macros, whilst not making them fully general INT/FLOAT ALUs.
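Pure speculation made slightly more concrete: a prefix sum over a tile can be written as a matmul with a triangular matrix of ones, which is exactly the shape of work an MMA unit does, and that prefix sum is the heart of a counting/bucket sort. A minimal numpy sketch of the idea (sizes and names purely illustrative, nothing Nvidia has described):

```python
import numpy as np

TILE = 16  # pretend this is the MMA tile size

def prefix_sum_via_matmul(counts):
    """Exclusive prefix sum of a length-TILE vector, expressed as one matmul."""
    L = np.tril(np.ones((TILE, TILE)), k=-1)   # strictly lower-triangular ones
    return L @ counts                          # offsets[i] = sum(counts[:i])

def bucket_sort(keys, num_buckets=TILE):
    """Counting sort driven by the matmul-based prefix sum."""
    counts = np.bincount(keys, minlength=num_buckets).astype(float)
    offsets = prefix_sum_via_matmul(counts).astype(int)
    out = np.empty_like(keys)
    cursor = offsets.copy()
    for k in keys:                             # scatter pass
        out[cursor[k]] = k
        cursor[k] += 1
    return out

keys = np.random.randint(0, TILE, size=64)
assert np.array_equal(bucket_sort(keys), np.sort(keys))
```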
 
Alright so, 20 teraflops for the 3090, about 16 for the 3080. That 3080 has a rather worryingly low amount of RAM for such an expensive card. Then again, they're Nvidia; we could easily see it ramp up to 20 GB for an FE model as rumored, or after AMD launches their cards in a few months.
 
GA102 has 15.8B more transistors than TU102. Even after accounting for the transistors needed for 17% more compute units and the 2nd-gen RT / 3rd-gen Tensor cores, there would be a whole TU106 (10.8B transistors) worth of budget left unused. So maybe they are really going all in on ray tracing?!
Where does that transistor number come from? Double confirmed? ;)
Alright so, 20 teraflops for the 3090, about 16 for the 3080. That 3080 has a rather worryingly low amount of RAM for such an expensive card. Then again, they're Nvidia; we could easily see it ramp up to 20 GB for an FE model as rumored, or after AMD launches their cards in a few months.
3090 with 5248 ALUs @1725 MHz (for the Gainward thingie) is more like 18.1 TFLOPS.
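For reference, that's just the usual CUDA cores × 2 FLOP/clock (an FMA counts as two) × clock; with the rumoured Gainward figures:

```python
# FP32 throughput = CUDA cores * 2 FLOP/clock (FMA) * boost clock.
# Core count and clock are the rumoured Gainward figures, not confirmed specs.
alus = 5248
boost_ghz = 1.725
print(f"{alus * 2 * boost_ghz / 1000:.1f} TFLOPS")   # ~18.1
```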
 
Where does that transistor number come from? Double confirmed? ;)

3090 with 5248 ALUs @1725 MHz (for the Gainward thingie) is more like 18.1 TFLOPS.

Just assuming they're using all the bandwidth for their highest-end card, especially since the memory is clocked just that much higher than on the lower one. End result is just under twenty-two teraflops max (edit: off due to dropping a small number from the 2080, derp), though with the huge 3-slot cooler and giant TDP maybe max clock speed can't be maintained for that long.
 
I wonder how high the 3090/80 will ultimately boost to?

The 2080 Ti has an advertised boost clock of 1545 MHz, but we all know they typically run at ~1800-1900 MHz. Pretty easy to hit 2000 MHz on my FE anyway.

I wonder if the 30 series will boost higher? Not much longer to wait I guess anyway lol.

Based on TFLOPS there's around 40% between the 2080 Ti and the 3090, but who's to say that architectural improvements won't push actual performance even further still?
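How big that gap looks depends a lot on which figures you plug in; a quick sketch using the advertised 2080 Ti boost clock against the two 3090 estimates floated earlier in the thread:

```python
# 2080 Ti at its advertised 1545 MHz boost vs. the rumoured 3090 figures.
# Actual sustained clocks run higher on both sides, as noted above.
tflops_2080ti = 4352 * 2 * 1.545 / 1000                # ~13.4 TFLOPS
for tflops_3090 in (18.1, 20.0):                       # Gainward-derived vs. the ~20 TFLOPS rumour
    gap = (tflops_3090 / tflops_2080ti - 1) * 100
    print(f"{tflops_3090:.1f} TFLOPS -> +{gap:.0f}%")  # ~+35% and ~+49%
```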
 