AMD Execution Thread [2023]

pTmdfx · Dec 28, 2023

Arun said:
It sounds like it should be possible but how exactly would you do it in practice? You'd need full knowledge of the entire hashing algorithm *and* you'd need the virtual=>physical memory mapping to remain consistent across way more than 4KiB.

I just realized that optimizing for LLC locality via access pattern seems like a long stretch for MI300A, where one-fourth of the LLC and memory channels is not bound to any GPU XCD. It really seems like the partitioned modes are the only way to make full use of the LLC peak bandwidth.

Either way, given that the LLC is memory-side and that everything is power-of-two, it could be as simple as a power-of-two scheme where PA bits 14:13 select a home IOD and PA bits 12:8 select a channel within an IOD. Though there seems to be a plausible argument as well for striping across IODs first, which would give better power performance probabilistically (and, to a lesser extent, better local effective LLC bandwidth).
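To make the bit-slicing idea concrete, here is a minimal sketch of the purely power-of-two mapping guessed at above. The function name `decode_pa` is made up for illustration, and the field positions are only the speculation from this post (bits 14:13 for the IOD, bits 12:8 for the channel, which implies a 256-byte interleave granularity). It is not AMD's documented mapping.

```python
# Hypothetical decode of the speculated power-of-two scheme.
# Assumption: PA bits 14:13 -> home IOD (4 IODs),
#             PA bits 12:8  -> channel within that IOD (32 channels).
# Bits 7:0 are the byte offset inside one 256-byte interleave unit.

def decode_pa(pa: int) -> tuple:
    """Return (iod, channel) for a physical address under the guessed scheme."""
    iod = (pa >> 13) & 0b11         # 2 bits: which IOD owns the line
    channel = (pa >> 8) & 0b11111   # 5 bits: channel within the IOD
    return iod, channel


if __name__ == "__main__":
    for pa in (0x0000, 0x0100, 0x2000, 0x7FFF, 0x8000):
        print(hex(pa), decode_pa(pa))
```

Under this guess the pattern repeats every 32 KiB, so any access stride that is a multiple of 32 KiB would keep landing on the same IOD and channel.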

Arun · Dec 28, 2023

pTmdfx said:
I just realized that optimizing for LLC locality via access pattern seems like a long stretch for MI300A, where one-fourth of the LLC and memory channels is not bound to any GPU XCD. It really seems like the partitioned modes are the only way to make full use of the LLC peak bandwidth.

Yeah, for inference workloads where the NVIDIA 48GB L40/L40S is suitable, splitting MI300X into 4 partitions with 2 XCDs and 48GB each seems like it would be optimal. However, each partition then only has 64MiB of LLC, instead of the 256MiB you get when everything runs the same workload/kernel in non-partitioned mode, so it's not an obvious win.

pTmdfx said:
Either way, given that the LLC is memory-side and that everything is power-of-two, it could be as simple as a power-of-two scheme where PA bits 14:13 select a home IOD and PA bits 12:8 select a channel within an IOD. Though there seems to be a plausible argument as well for striping across IODs first, which would give better power performance probabilistically (and, to a lesser extent, better local effective LLC bandwidth).

I'd be incredibly surprised if there was basically no real hashing whatsoever (which is what you're describing)... that's just asking for trouble!

You'd want to at least use an extra bit (e.g. bit 15) for hashing to avoid simple pathological cases, and ideally probably way more bits than that (it's cheap hardware-wise). Either way, the hashing algorithm may or may not be trivial to invert in software; hard to say without actually knowing what it is...
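As a rough illustration of why some hashing helps, the sketch below compares a plain 7-bit slice (IOD plus channel, 128 slots) against an XOR-fold of all the upper address bits. Both functions and the XOR-fold itself are assumptions made up for this example; the real hardware hash is unknown and may look nothing like this.

```python
# Hypothetical comparison: plain bit slice vs. a simple XOR-fold hash.
# Assumption: 128 (IOD, channel) slots selected at 256-byte granularity.

def plain_slot(pa: int) -> int:
    """Slot chosen purely by PA bits 14:8 (no hashing)."""
    return (pa >> 8) & 0x7F


def hashed_slot(pa: int) -> int:
    """Slot chosen by XOR-ing every 7-bit group of the address above bit 7."""
    x = pa >> 8
    slot = 0
    while x:
        slot ^= x & 0x7F
        x >>= 7
    return slot


def slots_touched(fn, stride: int, n: int = 1024) -> int:
    """Count distinct slots hit by n accesses spaced `stride` bytes apart."""
    return len({fn(i * stride) for i in range(n)})


if __name__ == "__main__":
    stride = 0x8000  # 32 KiB: the pathological stride for the plain slice
    print("plain :", slots_touched(plain_slot, stride))
    print("hashed:", slots_touched(hashed_slot, stride))
```

With the plain slice, a 32 KiB stride hammers a single slot, while the XOR-fold spreads the same accesses over all 128. An XOR-fold like this is also easy to invert in software once you know it, which is exactly the part that can't be assumed for the real thing.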

I'm tempted to try to make a test to determine the hashing algorithm of GPUs now, and see whether making requests for the entire DRAM row/page consecutively improves how close to peak bandwidth I can get (and/or whether it reduces power)... not sure how much time it'd take though.

del42sa · Dec 30, 2023

https://videocardz.com/newz/gigabyte-leak-lists-unreleased-amd-radeon-rx-7600-xt-with-16gb-of-memory

RX7600XT will feature 16GB of RAM

digitalwanderer · Dec 30, 2023

del42sa said:
https://videocardz.com/newz/gigabyte-leak-lists-unreleased-amd-radeon-rx-7600-xt-with-16gb-of-memory

RX7600XT will feature 16GB of RAM

That's some good news, here's hoping for a 256-bit bus!

DegustatoR · Dec 30, 2023

digitalwanderer said:
That's some good news, here's hoping for a 256-bit bus!

It will be the same 128 bit as on non XT of course.

digitalwanderer · Dec 30, 2023

DegustatoR said:
It will be the same 128 bit as on non XT of course.

I really hope you're wrong and it's at least 192-bit. If they're going to increase the memory, it makes sense in every regard except for the product stack, but that's because the product stack at that tier currently sucks so badly.

DegustatoR · Dec 30, 2023

digitalwanderer said:
I really hope you're wrong and it's at least 192-bit. If they're going to increase the memory, it makes sense in every regard except for the product stack, but that's because the product stack at that tier currently sucks so badly.

It's the same Navi 33 which has a 128 bit bus. 7700XT is above it and that one has 192 bits.

I'd argue that Navi 33 doesn't really need more than 128 bits. It also remains to be seen how much faster than the non-XT it will be to justify these 16GB.

Seanspeed · Dec 30, 2023

DegustatoR said:
It's the same Navi 33 which has a 128 bit bus. 7700XT is above it and that one has 192 bits.

I'd argue that Navi 33 doesn't really need more than 128 bits. It also remains to be seen how much faster than the non-XT it will be to justify these 16GB.

We really don't know. I'd say it's equally nonsensical for this to still be Navi 33, given that a 7600 is already a fully enabled Navi 33. There's very little performance left to extract from it, and similarly very little reason for it to need 16GB when it's already unsuitable for high-resolution gaming.

The 16GB part is hard to reconcile either way. One will step on the 7700XT's toes for bandwidth, the other will be a pretty much meaningless improvement to the 7600.

Another, perhaps even less plausible, option is that it's ported to 5nm and clocked like crazy. Or, maybe more plausibly, it could be Navi 32 with only two MCDs instead of 3 or 4, and with cores further reduced from the 7700XT. The latter is what I'd guess.

Qesa · Dec 31, 2023

The only plausible configurations are an overclocked 7600 or further cut down 7700XT with 2 MCDs

Frenetic Pony · Dec 31, 2023

Qesa said:
The only plausible configurations are an overclocked 7600 or further cut down 7700XT with 2 MCDs

Besides, RDNA4 is appearing faster than one might expect. I've heard a June launch (not for the big halo product but the smaller ones) isn't out of the question, or at least that it would be possible by then even if that's not the plan.

Thus the 7600XT is a custom-only non-launch; AMD have moved on to preparing for the next round.
