NVidia Ada Speculation, Rumours and Discussion

Where did I mention L2 cache? I said that in a 3D stack you can easily add not only the connection from the compute die to the cache die but also synchronization signals that can propagate through the entire stack, because a 3D stack is basically equivalent to one very big IC with some latency penalty at the interconnection points. I wonder why I even try to explain trivial concepts to someone who thinks a +30% gain in perf/W on the same process node is bad. It's clear that either 1) you don't understand what a 3D stack is, and therefore don't understand that every single signal can be passed through the TSVs, not just the memory interface ones, or 2) you are trying to spread FUD. As your post history makes the second option by far the more probable one, let me put you on ignore so my eyes are spared further FUD.

Then don't quote me. I mentioned the L2 cache because it is used to share information between compute units on a chip. With a chiplet design that traffic has to go over the shared "off-chip" cache. Off-chip bandwidth is not the problem; it is always efficiency and data locality.
 
H100 has no display outs or RT gear.
Impossible.
Please do not selectively quote partial sentences to fit whatever point you want to make.
I explicitly said "If the constellation is such, that only with a heavily overclocked Hopper they can claim perf kingship in the desktop and/or gaming space, …" right before the part you quoted. Just saw this now.
 
NVIDIA Lovelace vs AMD RDNA 3, what has not been told about their GPUs (techunwrapped.com)

But the article speculates that Lovelace has been confused with Hopper, and then it seemingly goes wild: "Hence, we think that the configuration of 144 Shader Units could correspond to Hopper and not Lovelace, since it is said that Lovelace will be a multi-chip GPU."

Separately, there's noise in the Twitterverse that Lovelace will have much higher clocks, similar to RDNA 2/3. That's one way of getting away with fewer SMs, which is the central bone of contention the article focuses on.

That article is all over the place. This bit doesn't make any sense -

"There is another rumor that speaks of change in the organization of your GPUs in the next generation by NVIDIA, where the minimum unit will be the SM and the subcores will disappear, so the SM unit will have a general scheduler instead of having one in each subcore, in that aspect it will look much more like the architecture from AMD where the lowest level cache is shared for all SM equally."

A partition within an SM is generally equivalent to a SIMD within a CU. Each has its own execution units, wavefront scheduler and register file. If Nvidia gets rid of partitions it will make their architecture less like RDNA, not more. Also, I have no idea what they mean by AMD's architecture sharing the lowest-level cache across all SMs; that's definitely not true. Each CU has a private L0.
 
Think he's referring to the Mac lineup based on RDNA2, where they have a dual-GPU card.
But you're right, TFLOPS rarely tell the whole story. Especially in realtime graphics, there's so much more.

That's why it'll be very interesting to see whether MCM will really behave like a single large GPU on the first try, or whether it will take a few iterations. I guess it'll already be a load better than SLI or Crossfire ever were.
 
Yes, the W6800X Duo has a clock rate of less than 2000 MHz, around 15% higher than the A6000, with less memory and bandwidth per chip. Even a 3090 with nearly 1 TB/s of off-chip bandwidth delivers more compute performance at 50 W less power.

nVidia doubled compute throughput with Ampere over Turing and didn't scale every (fixed) function with it. It was a genius move. Yet people here think that AMD can just improve efficiency at the architecture level by 4x+ to put three "RDNA2" chips on a package for 75 TFLOPs, while at the same time nVidia would struggle to even double compute performance with Lovelace despite using 33%+ more power than Ampere.

These speculations are not grounded in reality; they are just baseless.
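For scale, the napkin math behind that 75 TFLOPs figure, assuming three Navi21-class chiplets (80 CUs each) at roughly 2.4 GHz - both numbers are my own assumptions, nothing confirmed:

80 CUs x 64 lanes x 2 FLOP/FMA x ~2.44 GHz ≈ 25 TFLOPS per chiplet
3 chiplets x 25 TFLOPS ≈ 75 TFLOPS per package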
 
nVidia doubled compute throughput with Ampere over Turing
They doubled FP32 units, not compute power. That makes teraflops an even worse unit for comparing different architectures. The same goes for benchmarks, depending on how much math they do.
 
That's not a solution. AMD has moved in exactly the opposite direction with RDNA for a reason - it had failed with GCN at finding async workloads on PC. Straightforward ports from consoles simply didn't have enough work to fill PC GPUs.
I don't think AMD has moved in the opposite direction in the sense that async compute works any less well than it did on GCN?
I also think it's the devs' fault that they never really utilized compute, aside from some culling, binning lights to a grid, and other trivial stuff. But AMD was obviously wrong in assuming they just would. If it had invaded studios with researched applications and support, like NV does for RT for example, it would not now need to build a monster GPU just to take the lead.
So if we get a 75 TF GPU now, 7 times more powerful than the consoles, I don't see why we worry that it could not scale those console games to run 7 times faster, or at 7 times the pixel count - both being fairly pointless anyway.
No, if we want to utilize this monster, we likely have to add more features to the game, and those features surely use some compute, so they are well suited to async workloads to compensate for the speculated shortcomings. And I guess we may use more parallelism than just one single compute workload beside the gfx work - something like the sketch below.
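Roughly what I mean, as a CUDA sketch rather than the D3D12 multi-queue setup a game would actually use - kernels and sizes are made up for illustration:

#include <cuda_runtime.h>

__global__ void mainPass(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = out[i] * 1.5f + 0.5f;   // stand-in for the heavy "gfx" work
}

__global__ void sidePass(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = out[i] * out[i];        // stand-in for an independent compute job
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Independent work in different streams may overlap on the GPU, filling
    // units the other kernel leaves idle - the async compute idea.
    mainPass<<<n / 256, 256, 0, s0>>>(a, n);
    sidePass<<<n / 256, 256, 0, s1>>>(b, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(a);
    cudaFree(b);
    return 0;
}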

But that's just my opinion. If you ask me whether we really need such GPUs for games, I'm honestly not sure.
 
nVidia doubled compute throughput with Ampere over Turing and didn't scale every (fixed) function with it. It was a genius move.
As far as I remember, the common consensus from reviews was that this doubling of throughput... didn't really work and didn't scale well. The lower-throughput Navi21 was on par with the 3090 (excluding ray tracing, of course). Ampere probably rocks in special scenarios, but in regular games it's on par.
 
No shit.
Twice the FMA throughput over the same number of register read/write ports is what client Ampere is.
Ampere didn't double the ports and register file because Turing already had them sized to run INT32 in parallel.
Ampere hits its FP32 peak fine when the code is pure FP32. Gaming code isn't, though, and thus it doesn't show double throughput.
If we assume that Lovelace will be just Ampere scaled up, then it will scale just as well as RDNA2 did relative to RDNA1.
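A minimal sketch of the kind of pure-FP32 loop that gets near peak on GA10x - iteration counts and constants are arbitrary, and you'd time it externally (e.g. with Nsight Systems) to work out the FLOPS:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void fp32_fma(float* out, int iters) {
    // Two independent accumulators so FMA latency can be hidden.
    float a = threadIdx.x * 0.25f + 1.0f;
    float b = threadIdx.x * 0.50f + 1.0f;
    for (int i = 0; i < iters; ++i) {
        a = fmaf(a, 1.0001f, 0.0001f);   // pure FP32 FMA chain
        b = fmaf(b, 0.9999f, 0.0001f);   // independent chain
    }
    // Keep the results live so the compiler can't drop the loop.
    out[blockIdx.x * blockDim.x + threadIdx.x] = a + b;
}

int main() {
    const int blocks = 1024, threads = 256, iters = 1 << 16;
    float* out;
    cudaMalloc(&out, blocks * threads * sizeof(float));
    fp32_fma<<<blocks, threads>>>(out, iters);
    cudaDeviceSynchronize();
    printf("done\n");
    cudaFree(out);
    return 0;
}

Replace one of the fmaf chains with integer math and throughput should drop, since one of the two datapaths also serves INT32.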
 
They doubled FP32 units, not compute power. That makes teraflops an even worse unit for comparing different architectures. The same goes for benchmarks, depending on how much math they do.

What's the definition of compute power?

If you're talking about the register file, IIRC it has been providing operands for two instructions per clock since the introduction of GV100.

Yeah, where is this idea coming from that Ampere doesn't have enough regfile bandwidth to feed two 16-wide FP32 pipes? The register file has been providing enough operands for 32 FMAs per clock since forever, back when the pipes were 32-wide. Each Ampere FP32 pipe is a SIMD-16 and takes 2 clocks to execute each instruction over the 32-wide warp, so the operand collector just alternates between pipes every other clock. It's the same total width as a Pascal SM: 128 FMAs per clock.
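Putting numbers on that with RTX 3090-ish figures (82 SMs, ~1.7 GHz boost):

4 partitions x 2 pipes x 16 lanes = 128 FP32 FMAs per SM per clock
128 FMA x 2 FLOP x 82 SMs x ~1.7 GHz ≈ 35.6 TFLOPS

which lines up with the paper TFLOPS number the card is marketed at.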
 
Is there a tool that lists out the exact machine code that is scheduled on Ampere? Like the various AMD tools that list out the ISA code?
 
Is there a tool that lists out the exact machine code that is scheduled on Ampere? Like the various AMD tools that list out the ISA code?
I am not 100% sure, but if the 2 instructions per clock come from different warps, what you're asking for probably doesn't exist.
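For the static side at least, the CUDA toolkit ships disassemblers that dump the SASS the hardware actually executes - they just won't show you any runtime interleaving between warps. A trivial example (file names made up):

// saxpy.cu - a throwaway kernel to have something to disassemble
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
// Build for GA10x and dump the machine code:
//   nvcc -arch=sm_86 -cubin saxpy.cu -o saxpy.cubin
//   cuobjdump --dump-sass saxpy.cubin    (or: nvdisasm saxpy.cubin)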
 