NVidia Ada Speculation, Rumours and Discussion

I don't know. Going by the specs, the FP32 and FP16 throughputs are the same on Ampere, so you should presumably be able to run them concurrently (you'd need two async workloads, of course). But how that actually works, and at what speeds, would need to be tested, and I haven't seen any data on this.
Are you talking about Gaming-Ampere here? Because:
[attached screenshot of Ampere throughput specs]
That seems to indicate that they can utilize both the FP32 and FP64 cores at full bore at the same time with FP16 precision.
 
TBH, aiming CDNA (Vega) at pure HPC is not that weird given the SW side of the business. Targeting AI/ML requires top-notch SW, and AMD's SW has traditionally been far from that.

Yes, but the problem is that the science community is moving forward, exploring the potential of AI/DL for HPC workloads: https://developer.nvidia.com/blog/ai-detects-gravitational-waves-faster-than-real-time/
And some information about mixed precision from Oak Ridge: https://www.olcf.ornl.gov/2021/06/28/benchmarking-mixed-precision-performance/
 
It doesn’t though since nobody counts tensors when talking about theoretical flops for gaming.
"Tensors", no. FP16 most definitely yes. Or should do, because FP16 is relevant in gaming.

Maybe. But the topic was how application performance scaled with Ampere’s doubled FP32. We have lots of evidence that game performance did not scale anywhere close to the flops increase. However there are other workloads where it came reasonably close.
And what we appear to have discovered is that two distinct hardware threads need to be issued across the two FP32-capable datapaths in order to get at those FLOPS. And what's interesting about that is that it means you need twice as many hardware threads in flight on the SM partition as would be required were there just a single FP32 (combined with integer) SIMD, e.g. as seen in RDNA.

That makes occupancy more brittle; or, if you prefer, it makes performance more sensitive to the number of hardware threads resident in the partition.
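For reference, on the CUDA side you can at least query how many hardware threads a given kernel can keep resident per SM. A minimal sketch, with a placeholder kernel and block size (both invented for illustration):

Code:
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for whatever workload is under discussion.
__global__ void dummy_kernel(float *out)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = threadIdx.x * 2.0f;
}

int main()
{
    const int blockSize = 256;   // threads per block (assumed)
    int maxBlocksPerSM = 0;

    // Ask the runtime how many blocks of this kernel fit on one SM,
    // given its register and shared-memory usage.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &maxBlocksPerSM, dummy_kernel, blockSize, /*dynamicSMemSize=*/0);

    int warpsPerSM = maxBlocksPerSM * blockSize / 32;
    printf("Resident blocks/SM: %d, resident warps/SM: %d\n",
           maxBlocksPerSM, warpsPerSM);
    return 0;
}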

RDNA appears to spend relatively more transistors on scheduling: there are fewer SIMDs per scheduler. AMD's focus in RDNA appears to have been on reducing brittleness: reducing the ratio between best and worst cases, which also requires more (and more-local) cache and bigger register files. So, expect more of the same in RDNA 3.

The change from VLIW to "scalar" was very much driven by the desire to reduce brittleness. A lot of problems were seen with instructions that were scalar or vec2, "wasting" VLIW throughput.

It'll be interesting to see if 2022 brings us conditional routing to help with the problems caused by divergent control flow. As ray tracing becomes dominant for AAA graphics, it seems that branching is getting harder to avoid in shaders. Brittleness there truly is disastrous.
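To make the divergence point concrete, here's a minimal CUDA sketch of the kind of data-dependent branch that serializes a warp; the kernel and its "cheap"/"expensive" paths are invented purely for illustration:

Code:
#include <cuda_runtime.h>

// Hypothetical shading-style kernel: each thread takes a data-dependent branch.
// Threads in the same warp that take different paths execute both paths
// serially (the inactive side is masked off), so a 50/50 split roughly
// halves throughput for the divergent region.
__global__ void divergent_shade(const int *material, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float v = (float)i;
    if (material[i] == 0) {
        // "cheap" path
        v = v * 0.5f + 1.0f;
    } else {
        // "expensive" path, standing in for, say, a reflection ray plus shading
        for (int k = 0; k < 64; ++k)
            v = fmaf(v, 1.0001f, 0.25f);
    }
    out[i] = v;
}

int main()
{
    const int n = 1 << 20;
    int *material; float *out;
    cudaMallocManaged(&material, n * sizeof(int));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) material[i] = i & 1;  // worst case: alternate per lane

    divergent_shade<<<(n + 255) / 256, 256>>>(material, out, n);
    cudaDeviceSynchronize();

    cudaFree(material);
    cudaFree(out);
    return 0;
}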
 
Are you talking about Gaming-Ampere here? Because:
[attached screenshot of Ampere throughput specs]
That seems to indicate that they can utilize both the FP32 and FP64 cores at full bore at the same time with FP16 precision.

The individual peak rates for each data type don’t mean that the chip can sustain all of those rates concurrently. Clearly gaming Ampere can’t run peak FP32 at the same time as peak INT32 because they share the same execution units. What we don’t know is which pipes can run concurrently with tensor FP16.
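If someone wanted to poke at that last question, the crude approach would be two kernels on separate streams — one hammering the FP32 pipes, one spinning on tensor-core MMAs — and seeing whether the combined runtime is closer to the max or the sum of the individual runs. A rough sketch (grid sizes, iteration counts and the required -arch=sm_80 compile flag are all assumptions, not a validated benchmark):

Code:
#include <cstdio>
#include <cuda_fp16.h>
#include <mma.h>
#include <cuda_runtime.h>

using namespace nvcuda;

// Plain FP32 FMA chain, meant to keep the FP32 datapaths busy.
__global__ void fp32_work(float *out, int iters)
{
    float x = threadIdx.x * 0.001f;
    for (int i = 0; i < iters; ++i)
        x = fmaf(x, 1.000001f, 0.5f);
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

// One warp per block repeatedly issuing 16x16x16 MMAs, meant to keep the tensor cores busy.
// All blocks read the same A/B and race on C, which is harmless for a timing probe.
__global__ void tensor_work(const half *A, const half *B, float *C, int iters)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;
    wmma::load_matrix_sync(a, A, 16);
    wmma::load_matrix_sync(b, B, 16);
    wmma::fill_fragment(c, 0.0f);
    for (int i = 0; i < iters; ++i)
        wmma::mma_sync(c, a, b, c);
    wmma::store_matrix_sync(C, c, 16, wmma::mem_row_major);
}

int main()
{
    half *A, *B; float *C, *out;
    cudaMalloc(&A, 16 * 16 * sizeof(half));
    cudaMalloc(&B, 16 * 16 * sizeof(half));
    cudaMalloc(&C, 16 * 16 * sizeof(float));
    cudaMalloc(&out, 328 * 256 * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Modest grids so both kernels can be resident at the same time (sizes are guesses).
    // Compare this combined run against each kernel launched alone.
    fp32_work<<<328, 256, 0, s1>>>(out, 1 << 20);
    tensor_work<<<328, 32, 0, s2>>>(A, B, C, 1 << 20);
    cudaDeviceSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(A); cudaFree(B); cudaFree(C); cudaFree(out);
    printf("done\n");
    return 0;
}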
 
"Tensors", no. FP16 most definitely yes. Or should do, because FP16 is relevant in gaming.

How relevant though? Are FP16 ops a significant percentage of total gaming flops?

And what we appear to have discovered is that two distinct hardware threads need to be issued across the two FP32-capable datapaths in order to get at those FLOPS. And what's interesting about that is that it means you need twice as many hardware threads in flight on the SM partition as would be required were there just a single FP32 (combined with integer) SIMD, e.g. as seen in RDNA.

We don't actually know how flexible the scheduling is. For all we know the scheduler can pick an op from another warp or an independent op from a warp that's already running on the other pipe. If it can take advantage of either ILP or TLP it's a lot less worrisome. Maybe someone will bother to run micro benchmarks on Ampere and figure it out.
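Something like the following would be a starting point for that micro-benchmark: run it with a single dependency chain per thread and roughly one warp per SM partition, then again with two independent chains (as below) and/or more warps per block, and compare against the theoretical peak. Everything here (kernel, launch shape, iteration count, the assumed GA102-ish SM count) is made up for illustration:

Code:
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fma_chain(float *out, int iters, float a, float b)
{
    // Two independent dependency chains per thread -> ILP of 2.
    // Comparing this against a single-chain variant (ILP of 1) at different
    // warp counts per block shows whether ILP alone can reach peak FP32 rate.
    float x = threadIdx.x * 0.001f;
    float y = x + 1.0f;
    for (int i = 0; i < iters; ++i) {
        x = fmaf(x, a, b);   // chain 1
        y = fmaf(y, a, b);   // chain 2, independent of chain 1
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = x + y;
}

int main()
{
    const int threads = 32;    // one warp per block
    const int blocks  = 328;   // ~one block per SM partition on GA102 (assumed)
    const int iters   = 1 << 20;
    float *out;
    cudaMalloc(&out, blocks * threads * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    fma_chain<<<blocks, threads>>>(out, iters, 1.000001f, 0.5f);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // 2 chains * 2 flops per FMA * iterations * total threads
    double flops = 2.0 * 2.0 * (double)iters * blocks * threads;
    printf("%.1f GFLOP/s\n", flops / (ms * 1e6));

    cudaFree(out);
    return 0;
}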
 
The individual peak rates for each data type don’t mean that the chip can sustain all of those rates concurrently. Clearly gaming Ampere can’t run peak FP32 at the same time as peak INT32 because they share the same execution units. What we don’t know is which pipes can run concurrently with tensor FP16.
I did not mean to imply that this was the case. Rather, that to achieve 78 FP16-TFlops on GA100, you would need to utilize the FP32 and the FP64 units simultaneously with packed math.
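For the arithmetic behind that (using the A100 product figures of 108 SMs and ~1.41 GHz boost, which I'm assuming here): 78 TFLOPS works out to 108 SMs × 256 FP16 FMAs/clock × 2 flops × 1.41 GHz. 256 FMAs per SM per clock is 4× the 64 FP32 lanes per SM, so 2-way packed FP16 on the FP32 units alone would only get you to ~39 TFLOPS; the rest has to come from somewhere else, hence the FP64 units pulling packed-FP16 duty as well.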
But since DegustatoR confirmed that he was not referring to Ampere in general but to Gaming-Ampere, my posting is not relevant here anyway.
 
How relevant though? Are FP16 ops a significant percentage of total gaming flops?
Apparently so, because it's been in GPUs for a while now...

I know that's glib. I don't know how we can "measure" that. Percentage of game optimisation slide decks that mention FP16?

The irony is that Shader Model 2 had "half" as an intrinsic format, and it was spurned once floats got to 24- and 32-bit. There were rendering problems with widespread use of half back in those days, so selective use appears to be the norm.

First Steps When Implementing FP16 - GPUOpen

At present, FP16 is typically introduced to a shader retrospectively to improve its performance. The new FP16 code requires conversion instructions to integrate and coexist with FP32 code. The programmer must take care to ensure these instructions do not equal or exceed the time saved. It is important to keep large blocks of computation as purely FP16 or FP32 in order to limit this overhead. Indeed, shaders such as post-process or G-buffer exports as FP16 can run entirely in FP16 mode.
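To make the conversion-overhead point concrete in code (CUDA half2 intrinsics here rather than HLSL min16float, purely as an illustration; the kernel and its constants are invented): the conversions at the block boundaries are real instructions, so the FP16 region has to be big enough to pay for them.

Code:
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Illustrative tone-mapping-style kernel: convert once on entry, do a block of
// packed FP16 math (two values per __half2), convert once on exit.
__global__ void tonemap_fp16(const float2 *in, float2 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // FP32 -> FP16 conversion (overhead)
    __half2 c = __float22half2_rn(in[i]);

    // Pure FP16 block: packed multiply-adds, two values per op
    const __half2 gain = __float2half2_rn(1.2f);
    const __half2 lift = __float2half2_rn(0.05f);
    for (int k = 0; k < 8; ++k)
        c = __hfma2(c, gain, lift);

    // FP16 -> FP32 conversion (overhead)
    out[i] = __half22float2(c);
}

int main()
{
    const int n = 1 << 20;
    float2 *in, *out;
    cudaMallocManaged(&in, n * sizeof(float2));
    cudaMallocManaged(&out, n * sizeof(float2));
    tonemap_fp16<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    cudaFree(in); cudaFree(out);
    return 0;
}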
 
I did not mean to imply that this was the case. Rather, that to achieve 78 FP16-TFlops on GA100, you would need to utilize the FP32 and the FP64 units simultaneously with packed math.
But since DegustatoR confirmed that he was not referring to Ampere in general but to Gaming-Ampere, my posting is not relevant here anyway.

Got it.
 
There is no such question. You forget that a 75 TF top end means ~25 TF at the low end, and even that will not be enough to run last year's games at maximum settings. The lineup isn't made up of one GPU.

And even beyond that, scaling RT and compute-based raster is far from over. Games aren't really hitting the point at which we can say "well, we don't need better graphics now".
Agreed with all of that as well.
What's new now is this: prev gen it was 2 TF vs. 10 TF PC high end; current gen it's 5 TF vs. 75 TF PC high end.
We can (and probably will) ignore this and just crank up all settings as usual until the GPU is at its limit. Easy and done. But will it be enough of a visual improvement to justify the premium price? Surely not. The true potential won't be utilized.
We'll never be at the point where better gfx won't show improvements. But IMO we are already at the point where that high-end gfx is just too expensive.
"Will people spend increasing amounts of money on HW, or not?" If you say there is no such question, then you can just answer me this.
 
10TF for the visuals we're currently getting is fucking miserable.
So, all games of the current gen will look shitty? Maybe. But it's what we have to deal with, and it's what we sell to people while promising it's top-notch gfx for years to come.
Ray tracing is probably going to look like a saviour in a few years' time, because it forced devs to actually pay attention to the hardware.
What? How does RT force or even help devs to understand / optimize HW? It's all blackboxed. It's slow by definition. The devs' contribution is zero. They only use it; it is developed by the HW vendors.
Sure, it's a way to bring any HW to its knees with little effort. But I'd rather see RT used at a reasonable cost / benefit ratio, and some other things besides it if possible.
 
10TF for the visuals we're currently getting is fucking miserable.

Ray tracing is probably going to look like a saviour in a few years' time, because it forced devs to actually pay attention to the hardware.

I've been working through my Steam backlog. Mostly PS3-era games, and I've just started getting into the PS4 generation. There are clear improvements, but the visual upgrade is nowhere near the increase in shading horsepower over the same timeframe. I doubt it will be any different in the PS5 generation. Games will not take full advantage of PC hardware.
 
10TF for the visuals we're currently getting is fucking miserable.

Ray tracing is probably going to look like a saviour in a few years' time, because it forced devs to actually pay attention to the hardware.
I'd say the visuals we are getting for 1.8 TF are very impressive. That's where the target has been since 2013.
 
Using mixed precision in cases where FP64 isn't necessary increases efficiency by x-times.
Yes, but MI200 doesn't have 20 or 30% higher FP64 performance than A100; it's almost 5 times higher. Mixed precision is fine, but it's worthless once the task requires a specific precision.

now with GA100 they can tackle every workload with one product.
That's wishful thinking. GA100 is AI-focused. Its compute power is barely better than MI60 from 2018. Not only will MI200 be several times faster in general compute, but A100 will lose the crown even in FP16 tensor, BF16 tensor and FP32 tensor (MI200 is almost 5 times faster). According to some leaks (not sure how reliable), MI200 will also be ~2.4 times faster in FP64 tensor. A100 will keep its position in INT4, INT8 and TF32 tensor.

That makes single purpose products like AMD's CDNA less competitive and cost ineffective for most companies and cloud providers.
So why are there companies waiting for MI200 and not buying A100? Why does Oak Ridge prefer MI200 over A100 for the world's fastest and world's first exascale supercomputer? Why did Pawsey order MI200 and not A100, if MI200 is less competitive and less cost effective? Maybe they are completely stupid and should visit Beyond3D more often for helpful advice :smile2:
 
10TF for the visuals we're currently getting is fucking miserable.

Ray tracing is probably going to look like a saviour in a few years' time, because it forced devs to actually pay attention to the hardware.

Good thing we're not stuck on 10 TF hardware.

I'd say the visuals we are getting for 1.8 TF are very impressive. That's where the target has been since 2013.

That's your opinion, of course.
 
Yes, but MI200 doesn't have 20 or 30% higher FP64 performance than A100; it's almost 5 times higher. Mixed precision is fine, but it's worthless once the task requires a specific precision.

Sure. 5x better with FP64...

That's wishful thinking. GA100 is AI-focused. Its compute power is barely better than MI60 from 2018. Not only will MI200 be several times faster in general compute, but A100 will lose the crown even in FP16 tensor, BF16 tensor and FP32 tensor (MI200 is almost 5 times faster). According to some leaks (not sure how reliable), MI200 will also be ~2.4 times faster in FP64 tensor. A100 will keep its position in INT4, INT8 and TF32 tensor.

Sure. Reality again. GA100 delivers 320 TFLOPS of FP16 performance. So no, AMD won't deliver more performance. I can't even believe that you think AMD would even be able to do it.

So why are there companies waiting for MI200 and not buying A100? Why does Oak Ridge prefer MI200 over A100 for the world's fastest and world's first exascale supercomputer? Why did Pawsey order MI200 and not A100, if MI200 is less competitive and less cost effective? Maybe they are completely stupid and should visit Beyond3D more often for helpful advice :smile2:

Those aren't companies. Companies don't wait; they buy real products.
 
Why does Oak Ridge prefer MI200 over A100 for the world's fastest and world's first exascale supercomputer?
Government grants?
I always thought top-10 HPC was all about dick metering on a global scale, no?
The one who promises the most FP64 flops at the smallest energy footprint wins.
Which is something nice to brag about in the press, but kind of useless even for the scientists who will use this supercomputer. DGEMM has little to do with real HPC tasks, which are mostly bandwidth- and scaling-bound.
 
Yes, but MI200 doesn't have 20 or 30% higher FP64 performance than A100; it's almost 5 times higher. Mixed precision is fine, but it's worthless once the task requires a specific precision.
Is the FP64 theoretical peak rate ~5x higher when compared against A100 using its tensor cores (~20 TF/s) or without them (~10 TF/s)?

Also, what's MI200's TDP?
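For what it's worth, taking the figures quoted in this thread: ~5× the ~10 TF/s non-tensor rate gives ~48 TF/s, while the rumoured ~2.4× against the ~20 TF/s tensor rate gives roughly the same number, so the "almost 5x" presumably refers to A100's plain FP64 rate rather than its tensor-core rate.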
 