Why Barts is really VLIW4, not VLIW5 (and more on HD 5830/6790 being mainly 128-bit)

Thanks for that answer mczak, I've asked the question a couple of times but never really got an answer.
So, in simple terms, VLIW4 is groups of 4 while GCN are single entities?

Technically, GCN's shaders would qualify as scalars, yes. But they can get starved by the front end quite easily in graphics contexts, so you only get the full benefits of them being scalar with very long shader programs. I didn't try compute mode yet, but some results indicate that they do better using the ACEs (Asynchronous Compute Engines; I think it's really just a fast path bypassing the rasterizer) than the rasterizers.


edit:
Right, since Gipsel chose to be nitpicky:
GCN cores are four 16-wide vector units, scheduled in a round-robin fashion for a four-cycle execution time for SPFP-math (Add, Mul etc.) out of a private pool of up to 10 wavefronts each, taking longer on DPFP-math or special functions (either 8 or 16 clocks, depending on whether or not there's a Mul involved), and being able to execute scalar workloads with no loss in efficiency under specific circumstances, supported by a variety of SRAM arrays (organized in register files, r/w caches and data shares) and a real scalar coprocessor which can share resources over four GCN cores.

I chose to reduce the bolded part to the max, as I thought appropriate for the discussion, assuming people would know what I meant by that. My mistake probably.
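
To put rough numbers on that dense sentence, here's a quick back-of-the-envelope sketch (Python). The Tahiti/HD 7970 figures (32 CUs, 925 MHz) are the usual reference-card values, used here purely as my own example, not something taken from the post above.

Code:
# Peak-throughput sketch for a GCN compute unit (CU), per the description
# above: four 16-wide vector SIMDs, each issuing one wavefront (64 work
# items) instruction every 4 cycles.
SIMDS_PER_CU = 4
LANES_PER_SIMD = 16
ISSUE_CADENCE = 4                                  # cycles per vector issue per SIMD
WAVEFRONT_SIZE = LANES_PER_SIMD * ISSUE_CADENCE    # 64 work items

# A single-precision FMA counts as 2 FLOPs per lane per clock.
flops_per_cu_per_clock = SIMDS_PER_CU * LANES_PER_SIMD * 2   # 128

# Example: Tahiti / HD 7970 reference card (assumed figures, not from the post).
cus = 32
clock_ghz = 0.925
peak_tflops = cus * flops_per_cu_per_clock * clock_ghz / 1000
print(f"wavefront size: {WAVEFRONT_SIZE}")
print(f"peak SP rate:   {peak_tflops:.2f} TFLOPS")  # ~3.79 TFLOPS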
 
Math does not lie.
According to this book, it does.
Proofiness: The Dark Arts of Mathematical Deception

Why does Barts XT perform so well in games against Cayman specs-wise if it's VLIW5 rather than VLIW4?
I don't know what's happening in your example, but a VLIW5 SIMD has more shader power than a VLIW4 SIMD, so it could perform faster if the shaders co-issue well and have transcendental instructions. The advantage of VLIW4 is smaller area, and much of the time the extra unit doesn't provide an advantage.
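
For a concrete sense of the per-SIMD difference, here's a small sketch using the published SIMD counts for Cypress, Barts and Cayman (16 lanes per SIMD, 5 or 4 slots per lane):

Code:
# Per-SIMD ALU ("stream processor") counts: 16 lanes per SIMD,
# 5 slots per lane for VLIW5, 4 slots per lane for VLIW4.
def stream_processors(simds, slots_per_lane, lanes=16):
    return simds * lanes * slots_per_lane

chips = {
    "Cypress (HD 5870), VLIW5": stream_processors(20, 5),   # 1600 SPs
    "Barts XT (HD 6870), VLIW5": stream_processors(14, 5),  # 1120 SPs
    "Cayman (HD 6970), VLIW4": stream_processors(24, 4),    # 1536 SPs
}
for name, sps in chips.items():
    print(f"{name}: {sps} SPs")

So a VLIW5 SIMD packs 80 ALUs against 64 for a VLIW4 one; whether that extra slot does anything useful depends entirely on how well the shader co-issues.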
 
Sorry to quote myself, but I'm a bit confused about the parallel discussion going on in these two threads:
http://forum.beyond3d.com/showpost.php?p=1623427&postcount=156

Short excerpt, so that there's at least something posted on the internet.
A lengthy shader with a roughly even mixture of MUL, MADD, MIN, MAX and SQRT, not doing anything useful (an AMD program from the HD 2900 launch, basically):

HD 5870: 1,206 GI/s (Giga-Instructions per second)
HD 6870: 893 GI/s
HD 6970: 877 GI/s
HD 7970: 1,101 GI/s
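
Purely as a back-of-the-envelope exercise, here's how those numbers look when normalized against each card's nominal SP count and reference clock. The clocks below are the reference-card values, and whether "GI/s" counts VLIW slots or whole instructions isn't stated, so treat the percentages as a relative yardstick only:

Code:
# Measured throughput from the post (GI/s) vs. a crude upper bound of
# SPs * reference clock, just to compare per-slot utilization.
cards = {
    # name: (measured GI/s, stream processors, reference clock in GHz)
    "HD 5870": (1206, 1600, 0.850),
    "HD 6870": ( 893, 1120, 0.900),
    "HD 6970": ( 877, 1536, 0.880),
    "HD 7970": (1101, 2048, 0.925),
}
for name, (gi_s, sps, clk) in cards.items():
    bound = sps * clk                        # G slot-issues per second
    print(f"{name}: {gi_s} GI/s of {bound:.0f} -> {gi_s / bound:.0%}")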
 

Ah HA! That proves it, Tahiti is actually VLIW5. Or maybe it's the 5870 that is scalar? :p After all, the maths don't lie. :LOL:

Regards,
SB

PS - if anyone took this seriously slap yourself now. :)
 
Bo_Fox: Your logic is wrong. You don't take into account that gaming performance of AMD's DX11 VLIW5 GPUs doesn't scale well with the number of SIMDs. E.g. HD 5850 with 1440 SPs and HD 5870 with 1600 SPs: set the same clock for both of them and you'll see that, despite >11 % higher computing power, real-game performance will be just 2-3 % better. That's the main reason why Barts (14 SIMDs) is almost as fast as Cypress with 20 SIMDs.

As for the HD 5830 - the GPU has a 256-bit memory interface, but as far as I remember, by disabling half of the ROPs, the interface between the ROPs and the memory controller (or was it the L2?) was effectively halved. All functional units, with the exception of the ROPs, could utilize all the available bandwidth.
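
A quick sketch of the scaling point (clock-normalized; the 2-3 % real-game figure is the one quoted above, the SP/SIMD counts are the published specs):

Code:
# Compute-throughput ratios vs. the small real-game gains cited above.
hd5870_sps, hd5850_sps = 1600, 1440
cypress_simds, barts_simds = 20, 14

extra_compute = hd5870_sps / hd5850_sps - 1       # ~11.1 % more ALUs
print(f"HD 5870 vs HD 5850 at the same clock: +{extra_compute:.1%} compute, "
      f"~2-3% real-game gain (per the post above)")

barts_share = barts_simds / cypress_simds         # 70 % of Cypress' SIMDs
print(f"Barts has {barts_share:.0%} of Cypress' SIMDs, yet lands close to it in games")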
 
Hopefully, 7 days is enough for Bo_Fox to examine the new information presented within the thread instead of resorting to tit-for-tat, knee-jerk, highly agitating communication.

Thank you for your patience,
AlS
 
As for the HD 5830 - the GPU has a 256-bit memory interface, but as far as I remember, by disabling half of the ROPs, the interface between the ROPs and the memory controller (or was it the L2?) was effectively halved. All functional units, with the exception of the ROPs, could utilize all the available bandwidth.
The ROPs do in fact still have access to all the memory partitions, though each one is now communicating with two partitions rather than just one.
 
Technically, GCN's shaders would qualify as scalars, yes. But they can get starved by the front end quite easily in graphics contexts, so you only get the full benefits of them being scalar with very long shader programs.
I think my past crusade against the ridiculousness of terming something a "scalar vector unit" or "scalar SIMD unit" was not as successful as I wished. GCN's SIMDs are of course vector units (the wavefront size is the vector size), and AMD also named them simply "vector ALUs". There is no reason to be confused about the terms as in the past (or with nVidia's terminology; they use vector units too). The GCN architecture does feature some real scalar units, but there is only a single one in each CU (shared between 4 vector ALUs), which doesn't exactly qualify as a "shader unit" by the usual terms.
 
Why does Barts XT perform so well in games against Cayman specs-wise if it's VLIW5 rather than VLIW4?
It doesn't. In some games one architecture is more efficient, and in others vice versa (which, BTW, is very clear evidence that Barts and Cayman have different architectures). Look at Crysis and Stalker, where the 6950 is 23-30% and 29-40% faster, respectively, than the 6870.

Why does Barts XT absolutely destroy HD 5850 and HD 5870, specs-wise, by a ridiculous margin?
In what world does that happen? Or are you normalizing performance to shader count?

You're making the same mistake that many people do: Shader performance is just one part of overall performance, and often less than half of a game's rendering time is limited by shaders. This is very clear when you compare the 9600GT to the 9800GT. Both are 256-bit, 16 ROP cards with equal bandwidth and similar clocks. However, the 9600GT has only 64 SPs to the 8800GT/9800GT's 112, yet the former is almost as fast as the latter in games. That's why the 7950 gets crushed by the 7970 in compute benchmarks, but only lags a bit in most games. By your logic, then, the 7950 and 9600GT are more efficient than the 7970 and 9800GT, and must have a better architecture.
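
To make the "shaders are only part of the frame" argument concrete, here's a toy Amdahl-style sketch; the shader-bound fractions are purely illustrative assumptions, while the 64 vs 112 SP counts are the real 9600GT/9800GT figures:

Code:
# Toy model: only a fraction of frame time scales with shader power; the
# rest is bound by ROPs, bandwidth, setup, etc. (identical on both cards).
def overall_speedup(shader_ratio, shader_bound_fraction):
    f = shader_bound_fraction
    return 1.0 / ((1.0 - f) + f / shader_ratio)

shader_ratio = 112 / 64                 # 9800GT vs 9600GT SP count: 1.75x
for f in (0.2, 0.3, 0.5):               # assumed shader-bound fractions
    print(f"shader-bound {f:.0%}: 9800GT comes out {overall_speedup(shader_ratio, f):.2f}x faster")

Even a 1.75x shader advantage translates into only a ~9-27 % frame-rate gain under these assumptions, which is roughly the picture the benchmarks paint.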

Why does HD 6790 perform about the same as HD 5830 if the latter has 33% more shader and texturing power, with other specs being roughly the same - if BOTH are VLIW5?
The 5830 has always been an underperformer, taking a bigger hit vs the 5850 than the 6790 takes vs the 6850, despite similar handicaps. It's an outlier, so that comparison is meaningless.
 
I think my past crusade against the ridiculousness of terming something a "scalar vector unit" or "scalar SIMD unit" was not as successful as I wished. GCN's SIMDs are of course vector units (the wavefront size is the vector size), and AMD also named them simply "vector ALUs". There is no reason to be confused about the terms as in the past (or with nVidia's terminology; they use vector units too). The GCN architecture does feature some real scalar units, but there is only a single one in each CU (shared between 4 vector ALUs), which doesn't exactly qualify as a "shader unit" by the usual terms.

Yeah sorry. I did it on purpose so you could storm in ranting about how dumb I am. ;)

So: GCN cores are four 16-wide vector units, scheduled in a round-robin fashion for a four-cycle execution time for SPFP-math (Add, Mul etc.) out of a private pool of up to 10 wavefronts each, taking longer on DPFP-math or special functions (either 8 or 16 clocks, depending on whether or not there's a Mul involved), and being able to execute scalar workloads with no loss in efficiency under specific circumstances, supported by a variety of SRAM arrays (organized in register files, r/w caches and data shares) and a real scalar coprocessor which can share resources over four GCN cores.
 
I guess I forgot the smiley in the post above. :LOL:

I know that you know it. And also that it is kind of hard not to use these stupid (in my opinion) marketing-driven terms from time to time. ;)
 
Yeah sorry. I did it on purpose so you could storm in ranting about how dumb I am.

So: GCN cores are four 16-wide vector units, scheduled in a round-robin fashion for a four-cycle execution time for SPFP-math (Add, Mul etc.) out of a private pool of up to 10 wavefronts each, taking longer on DPFP-math or special functions (either 8 or 16 clocks, depending on whether or not there's a Mul involved), and being able to execute scalar workloads with no loss in efficiency under specific circumstances, supported by a variety of SRAM arrays (organized in register files, r/w caches and data shares) and a real scalar coprocessor which can share resources over four GCN cores.

The presentation said at least 10 wavefronts per SIMD.
I'm not sure how you are characterizing the scalar unit as a coprocessor sharing resources over four cores. There is a read-only scalar cache that is shared between the CUs, but this is not the only shared cache. The scalar unit itself is closely tied to its own CU.
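
For what it's worth, here's the simple arithmetic behind the round-robin cadence and the wavefront pool; the 10-deep pool is the figure from the post above (with the "at least 10" caveat), and the latency figure is just that arithmetic, not a measurement:

Code:
# One vector instruction issued per SIMD every 4 cycles, four SIMDs
# serviced round-robin: the CU starts one vector instruction per cycle.
simds_per_cu = 4
issue_cadence = 4                  # cycles between issues on one SIMD
wavefronts_per_simd = 10           # "at least 10" per the correction above

cu_issue_rate = simds_per_cu / issue_cadence        # 1 vector instr/cycle
# With the pool full, a given wavefront only needs to issue roughly every
# pool-size * cadence cycles, so that much latency per instruction can be hidden.
latency_cover_cycles = wavefronts_per_simd * issue_cadence
print(f"CU vector issue rate: {cu_issue_rate:.0f} instruction/cycle")
print(f"latency coverable per wavefront instruction: ~{latency_cover_cycles} cycles")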
 
I guess I forgot the smiley in the post above. :LOL:

I know that you know it. And also that it is kind of hard not to use these stupid (in my opinion) marketing-driven terms from time to time. ;)

You know, they're not quite as marketing-driven as they might seem. Michael Shebanow, for instance, insists that SIMD lanes in Tesla/Fermi ought to be called cores.
 
You know, they're not quite as marketing-driven as they might seem. Michael Shebanow, for instance, insists that SIMD lanes in Tesla/Fermi ought to be called cores.
Where does he do that? And what is the reasoning? I can't think of any reasonable ones (besides artificially inflating the "core" count). It is simply silly to call an SIMD lane a core.

Edit: I hope you don't mean the presentations linked there. I stopped reading when I saw that the definition of "SIMT" and "threads" is not even self-consistent ("threads" within a warp [vector] don't execute independently as claimed, because they all share a single instruction pointer, which is stated in the exact same sentence; case closed). That terminology is just a huge pile of crap and confuses people. :(

Edit2: Hennessy and Patterson's "Computer Architecture" would be a good start for this nV fellow. :D
 
Where does he do that? And what is the reasoning? I can't think of any reasonable ones (besides artificially inflating the "core" count). It is simply silly to call an SIMD lane a core.

Edit: I hope you don't mean the presentations linked there. I stopped reading when I saw that the definition of "SIMT" and "threads" is not even self-consistent ("threads" within a warp [vector] don't execute independently as claimed, because they all share a single instruction pointer, which is stated in the exact same sentence; case closed). That terminology is just a huge pile of crap and confuses people. :(

Edit2: Hennessy and Patterson's "Computer Architecture" would be a good start for this nV fellow. :D

Sorry, I don't have any links. His argument (which I didn't get firsthand), as I understood and remember it, is that in multi-core systems cores always share resources. It might be just RAM, it could include a memory controller (K8, Conroe, etc.), a last-level cache (Barcelona, Nehalem…) or a lot more (Bulldozer), but fundamentally, as long as you have distinct execution units executing instructions from distinct threads, no matter how much logic and memory they may share, they're cores.

Hopefully that is at least close to his original rationale.

For the record, I don't agree either, but I can see where he's coming from.
 
but fundamentally, as long as you have distinct execution units executing instructions from distinct threads, no matter how much logic and memory they may share, they're cores.
As said above, that's exactly what is missing. ;)
One has a vector ALU, where each physical lane processes an element (or two or four) of a vector. There is one single instruction (from a single instruction pointer) for the whole vector. That's a single thread calculating something on an SIMD processor. Nothing else.

I mean, if well-known textbooks on the matter already complain about the "quirky jargon" (yes, that is written in it!) used by nVidia, we don't have to think much further.
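
To illustrate the single-instruction-pointer point, here's a minimal sketch of how lane "divergence" on such a machine is really just masking applied to one instruction stream; the 8-lane width and the toy branch are made up for illustration:

Code:
# One instruction stream, one program counter, many lanes: "threads"
# within a warp/wavefront don't branch independently, they get masked.
lanes = 8
x = list(range(lanes))                    # per-lane data
mask = [xi % 2 == 0 for xi in x]          # "if (x % 2 == 0)" becomes a lane mask

# Both sides of the branch are executed by the whole vector; the mask
# decides which lanes keep their results. There is still only one PC.
then_result = [xi * 2 for xi in x]        # taken path
else_result = [xi + 100 for xi in x]      # not-taken path
result = [t if m else e for t, e, m in zip(then_result, else_result, mask)]
print(result)                             # [0, 101, 4, 103, 8, 105, 12, 107]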
 