AMD Execution Thread [2023]

Status
Not open for further replies.
I expect a ~52CU 7700XT to perform similar to a 6800 for about $500, again representing a minimal improvement in performance per dollar from previous gen.
This is what I'd expect and it would be another signal that the stagnation in perf/price is spreading into high end now.
 
If it's only niche, and we're talking sort of custom SFF PC's, I'd worry about the cost issue.

I'm just talking early examples of where this notebook-focused APU may be used in 'desktops'. Strix Halo will primarily be a laptop chip, and laptops aren't niche. But yes, the cost is the big question with any powerful APU.

I mean, if you can build a faster traditional PC for the same money, then we're dwindling the advantage down to almost just form factor alone, while also losing all ability to upgrade CPU, memory or GPU in the future which definitely has value.

All depends on what's actually available in the low/midrange discrete market. The hollowing out of that segment is part of the reason why a (relatively) powerful APU might actually be cost-efficient right now. When we were in the era of sub-$300 GPUs increasing performance 30-50% every 2-3 years, the value case for a desktop APU was far more difficult to make. Now...?

Maybe Valve could try and bring back the 'Steam Machine', except do what they did with Steam Deck and subsidize the cost as much as practically possible in order to push adoption? That could maybe help get around lower volume cost issues.

Hard to say, I think the novelty of the portability factor allows people to overlook deficiencies with game support on the Steam Deck that they would be far less forgiving on a desktop.
 
If it's only niche, and we're talking sort of custom SFF PC's, I'd worry about the cost issue. I mean, if you can build a faster traditional PC for the same money, then we're dwindling the advantage down to almost just form factor alone, while also losing all ability to upgrade CPU, memory or GPU in the future which definitely has value.

Maybe Valve could try and bring back the 'Steam Machine', except do what they did with Steam Deck and subsidize the cost as much as practically possible in order to push adoption? That could maybe help get around lower volume cost issues.
It would be smarter to make a Steam Deck 2 based around a future AMD APU and use the same APU in a small box the size of a Fire TV or NUC. You should be able to release it for a lower price than a Steam Deck 2 since you wouldn't have the screen or battery to price in. You might even be able to run the APU at higher clocks due to better cooling.
 
will have to try to hide the MI200's poor sales until Q4 of this year.

And they're actually doing it.🫣

Lisa Su -- President and Chief Executive Officer

Yeah. Vivek, I don't know that I would go into quite that granularity. What we will say is the GPU sales in the first half of the year were very low as we were sort of in a product transition timing as we go into the second half of the year. In particular, the fourth quarter, we'll have MI300 ramp.
 

About the same as 6950XT on average.
I didn't realize the article was in German from the blurb and thought you had just jumped the shark over the 7900 GRE. My apologies, I blame my meds, but it cracked me up so hard I had to share. :ROFLMAO:
 
The 4090 being only 6% faster than 3090Ti just means they are using a ton of unoptimized code.

The blogpost says:

4090 has 2x more FP16 performance than 7900 XTX, while 3090 Ti has 1.3x more FP16 performance than 7900 XTX. Latency sensitive LLM inference is mostly memory bound, so the FP16 performance is not a bottleneck here.
 
The blogpost says:
That's one form of inference workload, using this alone to imply that performance is close between vendors is misleading, you need to use a compute bound workload to show whether vendors are really close to each other or not, using a pure memory bound workload is just equalizing all variables to a single bottleneck shared by all GPUs and doesn't indicate that AMD is close to NVIDIA at all, since a 4090 is barely any faster than a 3090Ti.
 
That's one form of inference workload, using this alone to imply that performance is close between vendors is misleading, you need to use a compute bound workload to show whether vendors are really close to each other or not, using a pure memory bound workload is just equalizing all variables to a single bottleneck shared by all GPUs and doesn't indicate that AMD is close to NVIDIA at all, since a 4090 is barely any faster than a 3090Ti.
You need to use real world workload, not compute bound or memory bound or whatever bound.
Asking for compute bound benchmark is just as dishonest as asking whatever bound that favors whateveryourbrandhappenstobe
Not all loads are created equal and you can't dismiss them based on whatever works well/bad on your preferred manufacturer
 
Workloads, as in plural and preferably a mix of memory and compute, not a single memory bound workload and then use that to claim some sweeping conclusions regarding performance parity.
The point was that some are memory bound, some compute bound, you can't just pick whatever suits your personal preference as ground truth.
Take real world bench applicable to your chosen task, it's irrelevant what bound it is
 
That's one form of inference workload, using this alone to imply that performance is close between vendors is misleading, you need to use a compute bound workload to show whether vendors are really close to each other or not, using a pure memory bound workload is just equalizing all variables to a single bottleneck shared by all GPUs and doesn't indicate that AMD is close to NVIDIA at all, since a 4090 is barely any faster than a 3090Ti.
LLMs that are all the rage these days are extremely membound; it's basically the ETH mining of mathmonkey workloads.
 
The point was that some are memory bound, some compute bound, you can't just pick whatever suits your personal preference as ground truth.
Take real world bench applicable to your chosen task, it's irrelevant what bound it is
Inference uses low-precision datatypes, so it's mem bound.
 
LLMs that are all the rage these days are extremely membound; it's basically the ETH mining of mathmonkey workloads.
The lack of any significant gains of the 4090 over the 3090 Ti in the graphs above shows that it certainly isn't memory-bound.

NNs are mostly matrix multiplication workloads, and LLMs are huge matrix multiplication workloads, which are perfect for tiling optimizations and caching to save bandwidth. There are no random memory accesses, and data types are slim compared to those used for training. A large on-chip cache of 4090 should provide a significant bandwidth amplification in such mem-bound cases, yet it doesn't provide anything here. I would rather assume the performance numbers are limited by whatever they do in their code rather than by anything else.
 
The blogpost says: 4090 has 2x more FP16 performance than 7900 XTX, while 3090 Ti has 1.3x more FP16 performance than 7900 XTX. Latency sensitive LLM inference is mostly memory bound, so the FP16 performance is not a bottleneck here.

Now i'm again confused about this fp16 topic.
Looking up techPowerUp i see:

4090: 82.58 TFLOPS fp16 (same as fp32)

7900XTX: 122.8 TFLOPS fp16 (twice as fp32)

So why do they say 4090 is 2x 7900XTX, when it's actually less?

I've asked about this here before, because i thought RDNA3 has ditched general double rate fp16. And the given number is still high only because of a single dot product instruction which is still double rate, and aims for AI acceleration.
But this would not make so much sense - likely the needed conversions fp32->fp16->fp32 would kill the benefit. And somebody has replied RDNA3 still has full double rate fp16 like former gens.

And for the 4090 i could imagine it has a faster fp16 path, but maybe only for AI instructions, which maybe does not show up on techPowerUp.

Maybe somebody can clarify. (I'm interested in fp16 for usual shader code, not AI acceleration. If IHVs move away from fp16, i would remove related optimizations from my todo list.)
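
For reference, spec-sheet TFLOPS figures like the two quoted above are usually just lanes x clock x FLOPs per lane per clock. Here is a rough sketch of that arithmetic. The shader counts, boost clocks and per-clock rates below are assumptions taken from public spec listings, not from the blog post or this thread, and the function name is made up for illustration. It only reproduces the vector (non-matrix) numbers and says nothing about tensor/WMMA throughput.

```python
def peak_tflops(lanes, boost_ghz, flops_per_lane_per_clock):
    """Theoretical peak = lanes * clock * FLOPs issued per lane per clock."""
    return lanes * boost_ghz * flops_per_lane_per_clock / 1000.0


# Assumed spec-sheet values (not from the thread):
# RTX 4090: 16384 shader lanes, ~2.52 GHz boost, fp16 at the fp32 rate
#   -> 2 FLOPs per lane per clock (one FMA).
rtx_4090_fp16 = peak_tflops(16384, 2.52, 2)

# RX 7900 XTX: 6144 stream processors, ~2.5 GHz boost,
#   FMA (2) * dual-issue (2) * packed fp16 (2) -> 8 FLOPs per lane per clock.
rx_7900xtx_fp16 = peak_tflops(6144, 2.5, 8)

if __name__ == "__main__":
    print(f"4090 fp16 (vector):     {rtx_4090_fp16:.2f} TFLOPS")
    print(f"7900 XTX fp16 (vector): {rx_7900xtx_fp16:.2f} TFLOPS")
```

The 8 FLOPs per clock on the AMD side is a best-case figure that assumes both dual-issue and packed fp16 are fully used, which is part of why the listed number and real shader throughput can diverge.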
 
So why do they say 4090 is 2x 7900XTX, when it's actually less?
Because they compare matrix multiplication instructions.
I thought the 4x FP16 is available only for WMMA instructions in RDNA 3?
It doesn't look like it can be used outside of MMs given that there is a large math density in MMs with operand sharing, which you wouldn't typically find in shaders.
 
That's one form of inference workload, using this alone to imply that performance is close between vendors is misleading, you need to use a compute bound workload to show whether vendors are really close to each other or not, using a pure memory bound workload is just equalizing all variables to a single bottleneck shared by all GPUs and doesn't indicate that AMD is close to NVIDIA at all, since a 4090 is barely any faster than a 3090Ti.
You are the one who is misleading.
You first claimed that the code is unoptimized, then you switched your claim (do I need to say this shows bad form?) around, saying that the benchmark is poorly chosen.

Also mind the context; these are scientists who are happy they succeeded in achieving good performance with AMD GPUs in a scenario that is probably relevant to them.
I guess they now also need to send you a formal letter of apology for being misleading
 
I would rather assume the performance numbers are limited by whatever they do in their code rather than by anything else.
I just read through the blog and noticed that they do single-batch inference, which is often not enough to saturate a GPU, especially for less complex models. So, it might be bandwidth-limited after all in that specific use case.
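
A back-of-the-envelope way to check the single-batch point is a roofline-style estimate: compare the FLOPs per byte of one weight matmul against the GPU's FLOPs-per-byte balance. This is only a sketch. The function names, the 4096x4096 layer size and the 100 TFLOPS / 1 TB/s device are hypothetical round numbers, not measurements from the blog post or specs of any real card.

```python
def arithmetic_intensity(batch, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte moved for y = x @ W, x: (batch, d_in), W: (d_in, d_out)."""
    flops = 2 * batch * d_in * d_out  # one multiply + one add per weight per row
    # Weights are read once; activations are read and results written once.
    bytes_moved = bytes_per_elem * (d_in * d_out + batch * d_in + batch * d_out)
    return flops / bytes_moved


def limiting_resource(batch, d_in, d_out, peak_flops, bandwidth_bytes_per_s):
    """Return 'memory' or 'compute' depending on which roof the matmul hits."""
    machine_balance = peak_flops / bandwidth_bytes_per_s  # FLOPs per byte
    intensity = arithmetic_intensity(batch, d_in, d_out)
    return "memory" if intensity < machine_balance else "compute"


if __name__ == "__main__":
    # Hypothetical device: 100 TFLOPS fp16, 1 TB/s memory bandwidth.
    peak, bw = 100e12, 1e12
    for batch in (1, 8, 512):
        ai = arithmetic_intensity(batch, 4096, 4096)
        print(batch, round(ai, 1), limiting_resource(batch, 4096, 4096, peak, bw))
```

At batch 1 every fp16 weight is fetched for a single multiply-add, so intensity sits near 1 FLOP per byte and the matmul is bandwidth-limited on any device with a balance well above that. It only crosses into compute-bound territory once the batch is large enough to reuse each weight many times.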
 
That's one form of inference workload, using this alone to imply that performance is close between vendors is misleading
If performance is close between vendors in that workload then there is nothing misleading in posting that.

Workloads, as in plural and preferably a mix of memory and compute, not a single memory bound workload and then use that to claim some sweeping conclusions regarding performance parity.
And who made these sweeping conclusions based on a single memory bound workload that you are talking about?
 