> Did they reduce the clocks a lot (~1.4GHz), go to 4MB slices and double the width per slice (24 slices, 128B/clock)? Or 192B/clock, 12x 8MB (same as RDNA2), 1.88GHz?

2304 bytes/clock for the 96MB Infinity Cache, i.e. 192B/slice assuming the same number of slices. That'll be 1536 bytes/clock for N32 (1.5x N21, 2x N22) and 768 bytes/clock for N33 (same as N22, 3x N23), so crazy bandwidth numbers all around. N31 has more L3 bandwidth per cycle than N21's L2 (2048 bytes/clock).
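Since this is just slices x width-per-slice x clock, here is a quick sketch of that arithmetic. The 8MB slice size, 192B/clock per slice, the 1.88GHz cache clock and the 64MB/32MB capacities for N32/N33 are all the assumptions from the posts above, not confirmed specs.

```python
# Back-of-the-envelope Infinity Cache bandwidth: slices * bytes/clock/slice * clock.
# Slice size, per-slice width, clock and the N32/N33 capacities are the assumptions
# discussed above, not confirmed figures.

def infinity_cache_bw(cache_mb, slice_mb=8, bytes_per_slice_clk=192, clock_ghz=1.88):
    slices = cache_mb // slice_mb
    bytes_per_clk = slices * bytes_per_slice_clk
    gb_per_s = bytes_per_clk * clock_ghz   # bytes/clock * clocks/ns = bytes/ns = GB/s
    return slices, bytes_per_clk, gb_per_s

for name, mb in [("N31", 96), ("N32", 64), ("N33", 32)]:
    slices, bpc, gbps = infinity_cache_bw(mb)
    print(f"{name}: {slices} slices, {bpc} B/clk, ~{gbps:.0f} GB/s")
```

For what it's worth, the other layout in the quoted question (24 x 128B/clock at ~1.4GHz) lands on roughly the same ~4.3 TB/s total, which is presumably why both organisations were floated.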
> Nvidia is going to charge far more than they used to for things that carry the exact same names. Literally all they had to do was type a different set of names in some marketing doc and they'd have solved these unnecessary problems, but somehow they couldn't manage it. Baffling.

It's not baffling. Moore's law is dead, so the only way to get serious speedups is to have seriously higher prices as well.
> That'll be 1536 bytes/clock for N32 (1.5x N21, 2x N22) and 768 bytes/clock for N33 (same as N22, 3x N23), so crazy bandwidth numbers all around.

Bumped L0/L1 capacity also means smaller LLC parts like N33 have considerably higher memperf over N23.
> so the only way to get serious speedups is to have seriously higher prices as well.

Or you can do what AMD did and focus on tiles and raw PPA.
> The problem is Nvidia and Nvidia fans using the crypto-priced Ti versions of Ampere as comparisons to justify the new pricing, ignoring the rest of the stack as it's not convenient for them.

That would be true if the 6800 XT didn't exist.
> The problem is Nvidia and Nvidia fans using the crypto-priced Ti versions of Ampere as comparisons to justify the new pricing, ignoring the rest of the stack as it's not convenient for them.

Exactly.
Those who are able to spend $1200 on a GPU will spend that much on a GPU no matter what it's named. Those who are able to spend $800 would not spend $1200 on a GPU just because it was named differently. It's weird to even suggest otherwise.
> Or you can do what AMD did and focus on tiles and raw PPA.

Tiles are the same one-trick pony as the magic™ Infinity Cache, and PPA is a zero-sum game where you mostly trade one parameter for another, with rare exceptions.
> Tiles are the same one-trick pony

Oh no, my boy.
> Moreover, there is just one way of rendering things right and that's RT.

Good news: we're not moving away from hybrid-renderer garbage for at least a decade to come.
> Any ideas where it comes from?

The tech brief slide deck, which is somewhere here.
> Or you can do what AMD did and focus on tiles and raw PPA.

Yes. From my perspective AMD always does the right thing. The origin of my NV attitude was the day I bought some midrange GCN card to estimate console performance of my stuff, and it blew even a Titan out of the water in terms of performance. I felt like I'd been scammed all the years before.
There's a limit to that, obviously, but it's still a way saner approach than burning N5/N3E/N2/you-name-it wafers for lulz.
Just fool them with enough RT and AI snake oil.
> On the SIMDs, is there any idea here on how often it will dual issue? I mean vaguely, of course. I assume it's most of the time or it would hardly seem worth it.

VOPD - see the speculation thread, there's nothing more to say, frankly.
Is wave32 the norm?
How can wave64 do any type of operation across all the ALUs, but in wave32 mode half the ALUs can do only floats?
What exactly is VOPD?
This is sort of funny because some places are saying that Navi 31 has 6,144 shaders, and others are saying 12,288 shaders, so I specifically asked AMD's Mike Mantor — the Chief GPU Architect and the main guy behind the RDNA 3 design — whether it was 6,144 or 12,288. He pulled out a calculator, punched in some numbers, and said, "Yeah, it should be 12,288." And yet, in some ways, it's not.
AMD's own slides in a different presentation (above) say 6,144 SPs and 96 CUs for the 7900 XTX, and 84 CUs with 5,376 SPs for the 7900 XT, so AMD is taking the approach of using the lower number. However, raw FP32 compute (and matrix compute) has doubled. Personally, it makes more sense to me to call it 128 SPs per CU rather than 64, and the overall design looks similar to Nvidia's Ampere and Ada Lovelace architectures. Those now have 128 FP32 CUDA cores per Streaming Multiprocessor (SM), but also 64 INT32 units.
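To make the 6,144 vs 12,288 bookkeeping concrete, here is the FP32 arithmetic both counts resolve to; the ~2.5GHz boost clock is the commonly quoted 7900 XTX figure and should be treated as approximate.

```python
# FP32 throughput for the 7900 XTX under either counting convention:
# 96 CUs * 64 SPs * dual-issue, or 96 CUs * 128 SPs - same result either way.
cus           = 96
sps_per_cu    = 64     # AMD's slide convention (6,144 total)
dual_issue    = 2      # fold this into 128 SPs/CU for the "12,288" convention
flops_per_fma = 2      # one fused multiply-add counts as 2 FLOPs
boost_ghz     = 2.5    # approximate boost clock, not an exact spec

tflops = cus * sps_per_cu * dual_issue * flops_per_fma * boost_ghz / 1000
print(f"{cus * sps_per_cu * dual_issue} 'SPs' -> ~{tflops:.1f} TFLOPS FP32")  # ~61.4
```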
> VOPD - see the speculation thread, there's nothing more to say, frankly.

That they haven't stated it can is probably an answer.
Pixel shaders seem likely to be wave64.
RDNA 2 seems to perform best with wave32 for ray tracing. Compute shader appears to have been the original motivation for AMD to introduce native 32-work item hardware threads in RDNA. It also seems to go well with vertex shading, as far as I can tell (which isn't much).
The asymmetric float/int + float thing across the 32 + 32 lanes is confirmed here:
which seems to marry up with Super-SIMD as I wrote about here:
AMD: RDNA 3 Speculation, Rumours and Discussion (forum.beyond3d.com)
Here we can see discussion of a side ALU and a full ALU, which aligns with the float/int + float arrangement. Notice this slide refers to the Compute Unit Pair and does not mention "Work Group Processor":
This slide makes me think that AMD is describing ALU functions as separate units, when in fact they are existing units that are repurposed. I think that's what "64 Multi-Precision" is describing as well as "Multi-Purpose ALU".
The slides look like a hurried mess and the discussion doesn't seem to clarify much:
In terms of "FLOPS", this does look like it's designed to achieve maximum floating-point throughput whether in wave32 or wave64 mode, falling back to VOPD when co-issue from a wave32 hardware thread isn't otherwise available.
What is unclear is whether two independent hardware threads (2 separate wave32) can dual-issue. AMD has a history of misusing "dual-issue" so we'll have to wait to find out, I suppose.
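As a toy illustration of why the packing matters: this is only a sketch of the general idea, since the real VOPD pairing rules (specific opcode lists, register-bank constraints) are far stricter, and whether two separate wave32s can share a cycle is exactly the open question above.

```python
# Toy model: how many issue cycles a run of FP32 ops from a single wave32 needs
# if adjacent *independent* ops can be packed VOPD-style onto the 32+32 lanes.
# Illustrative only; real pairing constraints are much tighter than "independent".

def issue_cycles(ops):
    """ops: list of (dst, src_a, src_b) register names for FP32 ops, in program order."""
    cycles, i = 0, 0
    while i < len(ops):
        dst = ops[i][0]
        # Pack the next op alongside only if it doesn't consume this op's result.
        if i + 1 < len(ops) and dst not in ops[i + 1][1:]:
            i += 2    # both 32-lane halves busy this cycle
        else:
            i += 1    # lone op: only half the lanes do useful work
        cycles += 1
    return cycles

independent = [("v0", "v1", "v2"), ("v3", "v4", "v5")]
dependent   = [("v0", "v1", "v2"), ("v6", "v0", "v5")]  # second op reads v0
print(issue_cycles(independent))  # 1 cycle: dual-issued
print(issue_cycles(dependent))    # 2 cycles: no independent partner to pack
```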
> That they haven't stated it can is probably an answer.

Yes, that's what I tend to think.
> Yes, that's what I tend to think.

I wonder if their 17% CU improvement in IPC includes their expectation of 2x FP32 rates. That loosely aligns with the 40-50% perf increase they have shown in rasterized games.
> I wonder if their 17% CU improvement in IPC includes their expectation of 2x FP32 rates. That loosely aligns with the 40-50% perf increase they have shown in rasterized games.

I suspect you're right, but it wildly contradicts the "extract maximum value from each transistor" statement on that first slide I included.
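For what it's worth, the arithmetic roughly closes even before deciding what the 17% does or does not include. The CU counts below are the announced 96 vs the 6950 XT's 80; the clock uplift is purely an assumption.

```python
# Sanity check: per-CU gain * CU-count ratio (* assumed clock bump) vs the
# quoted 40-50% raster uplift. The clock factor is a placeholder, not a spec.
ipc_gain   = 1.17       # claimed per-CU improvement
cu_ratio   = 96 / 80    # 7900 XTX vs 6950 XT CU count
clock_gain = 1.05       # assumed small game-clock uplift

print(f"{ipc_gain * cu_ratio:.2f}x before clocks")                 # ~1.40x
print(f"{ipc_gain * cu_ratio * clock_gain:.2f}x with clock bump")  # ~1.47x
```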