Why Barts is really VLIW4, not VLIW5 (and more on HD 5830/6790 being mainly 128-bit)

iMacmatician · Feb 29, 2012

DarthShader said:
He has one point though, his trolling was quite elaborate and insisted on numbers and math. So why not simply post some benchmark numbers, like Carsten tried, or where the difference between VLIW 4 and 5 is shown, to stuff this guy's mouth with crow?

Apparently just posting benchmark numbers won't do for him. But since he wants math, how about a more rigorous approach, using a targeted shader-only benchmark (for example, CarstenS's data would work if we have theoretical maximums too), to show that Barts is VLIW5?

First I'm going to make some necessary (and hopefully safe) assumptions: Barts (XT) has 2016 SP GFLOPS; Cayman (XT) has 2703.36 SP GFLOPS; the clocks stay at the stated numbers throughout the benchmark; the theoretical maximum benchmark numbers (the numbers that would be obtained given no bottlenecks) for both chips are of the form # = K(SP count)(clock speed), where K is some constant; and the benchmark software gives accurate numbers.

Let B = (Barts benchmark score)/2016, C = (Cayman benchmark score)/2703.36, Bt = (Barts theoretical maximum score on that benchmark)/2016, and Ct = (Cayman theoretical maximum score on that benchmark)/2703.36.

Pick a benchmark such that Bt < C. This result shows that Barts cannot be (Cayman-type) VLIW4. Why? Assume Barts is VLIW4. We know that C < Ct. Since Barts and Cayman are both the same VLIW4, Bt = Ct, so C < Bt, which is a contradiction since the benchmark said that C > Bt.

Alternatively, pick a benchmark such that Ct < B. This result shows that Barts cannot be (Cayman-type) VLIW4. Why? Assume Barts is VLIW4. We know that B < Bt. Since Barts and Cayman are both the same VLIW4, Bt = Ct, so B < Ct, which is a contradiction since the benchmark said that B > Ct.
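The two contradiction tests above can be sketched in a few lines of Python. This is only an illustration under the stated assumptions: the function name is made up, and the scores and theoretical maxima in the usage lines are hypothetical placeholders, not measured data.

```python
# Peak single-precision GFLOPS assumed in the post.
BARTS_GFLOPS = 2016.0
CAYMAN_GFLOPS = 2703.36


def rules_out_shared_architecture(barts_score, cayman_score,
                                  barts_max, cayman_max):
    """Return True if the normalized scores contradict Bt = Ct.

    barts_max and cayman_max are the theoretical maximum scores for the
    chosen benchmark (hypothetical inputs in this sketch).
    """
    b = barts_score / BARTS_GFLOPS     # B
    c = cayman_score / CAYMAN_GFLOPS   # C
    bt = barts_max / BARTS_GFLOPS      # Bt
    ct = cayman_max / CAYMAN_GFLOPS    # Ct
    # A measured score cannot exceed its own theoretical maximum.
    if b > bt or c > ct:
        raise ValueError("measured score above theoretical maximum")
    # If both chips shared one architecture, Bt == Ct would force
    # C <= Bt and B <= Ct. Either inequality failing is a contradiction.
    return bt < c or ct < b


# Hypothetical numbers, only to exercise both outcomes.
print(rules_out_shared_architecture(1800.0, 2000.0, 2016.0, 2100.0))   # True
print(rules_out_shared_architecture(1500.0, 2000.0, 2016.0, 2703.36))  # False
```

A False result does not show the two chips are the same architecture; it only means this particular benchmark cannot separate them.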

Did I miss anything? (I'm not an architecture expert.)

Davros · Feb 29, 2012

Thanks guys, bloody fantastic answers..............

ps: what does != mean? (yes, I am that thick)

Zaphod · Feb 29, 2012

Usually "does not equal". Programming languages, limiting themselves to ASCII, can't use the proper "≠".

humbertklyka · Feb 29, 2012

!= means "does not equal", same with ~=

Gipsel · Feb 29, 2012

Davros said:
what does != mean ?

Not equal. An exclamation mark denotes a negation.

@iMacmatician:
I doubt something like this will be proof for him, as he will simply say AMD stated the number of units, the number of SIMD engines, the peak performance numbers, or all of them together wrong. Let's face it: Barts officially has 1120 SPs and 14 SIMD engines, now known as CUs. A single division already tells you that each CU comprises 80 SPs, which equals 5*16, not 4*16.
14*4*16 would be 896 SPs (with lower peak performance) and 1120/64 would be 17.5 CUs.
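The division argument can be checked with a short Python sketch. The helper names are hypothetical, and the Cayman line uses its publicly listed 1536 SPs and 24 SIMD engines, which are not stated in this thread.

```python
LANES_PER_SIMD = 16  # the "16" in the 5*16 and 4*16 figures above


def sps_per_simd(total_sps, simd_count):
    """Stream processors per SIMD engine (CU)."""
    return total_sps / simd_count


def vliw_width(total_sps, simd_count):
    """ALUs per VLIW unit implied by the official unit counts."""
    return sps_per_simd(total_sps, simd_count) / LANES_PER_SIMD


print(sps_per_simd(1120, 14))    # 80.0 SPs per CU on Barts
print(vliw_width(1120, 14))      # 5.0
print(vliw_width(1536, 24))      # 4.0 (Cayman, assumed public specs)
print(14 * 4 * LANES_PER_SIMD)   # 896 SPs if Barts were 14 VLIW4 CUs
print(1120 / (4 * LANES_PER_SIMD))  # 17.5 CUs if 1120 SPs were VLIW4
```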

We should simply not invest any more time into this, he is obviously either completely clueless or a troll.

3dilettante · Feb 29, 2012

Has AMD retroactively applied the CU nomenclature to previous designs?
The abbreviation only seemed to apply to GCN.

iMacmatician · Feb 29, 2012

Gipsel said:
@iMacmatician:
I doubt something like this will be proof for him, as he will simply say AMD stated the number of units, the number of SIMD engines, the peak performance numbers, or all of them together wrong. Let's face it: Barts officially has 1120 SPs and 14 SIMD engines, now known as CUs. A single division already tells you that each CU comprises 80 SPs, which equals 5*16, not 4*16.
14*4*16 would be 896 SPs (with lower peak performance) and 1120/64 would be 17.5 CUs.

We should simply not invest any more time into this, he is obviously either completely clueless or a troll.

A "896 SPs, 1612.8 GFLOPS" claim can be disposed of using the same method as earlier, and it's still possible to use a number other than 2016 GFLOPS to show that Barts is VLIW5.
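For reference, the GFLOPS figures used in this thread follow from SP count and clock. A minimal sketch, assuming 2 FLOPs per SP per clock (one multiply-add), a 900 MHz Barts XT clock, and Cayman XT at 1536 SPs and 880 MHz; the clocks and the Cayman SP count come from public spec sheets rather than from this thread, and the function name is made up:

```python
def peak_sp_gflops(sp_count, clock_mhz):
    """Peak single-precision GFLOPS, assuming 2 FLOPs per SP per clock."""
    return sp_count * 2 * clock_mhz / 1000


print(peak_sp_gflops(1120, 900))  # 2016.0   Barts XT as officially specified
print(peak_sp_gflops(896, 900))   # 1612.8   the hypothetical 896-SP Barts
print(peak_sp_gflops(1536, 880))  # 2703.36  Cayman XT
```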

But I agree now, there's no sense in continuing this as by doing so, he'll just find yet another can of worms to open up.

swaaye · Feb 29, 2012

I has question.

Why is Barts VLIW5 instead of 4? Or more to the point perhaps, why is Cayman the only VLIW4 chip? Why bother with VLIW4 when GCN was right around the corner? Maybe the loss of 32nm messed up the perception of all of this.

Though I am aware that Trinity is said to be VLIW4.

fellix · Feb 29, 2012

The VLIW4 parts were delayed due to TSMC's 32nm cancellation. GCN was still too far off and something had to be released during the time frame.
Barts was a midrange market "storm trooper", so to speak. The GPU is even pin-compatible with the Cypress SKU, so the board partners had the shortest time-to-market delay possible to build, qualify and release the complete product.
If it wasn't for the 32nm fiasco, Barts would probably have been another story.

Gipsel · Feb 29, 2012

3dilettante said:
Has AMD retroactively applied the CU nomenclature to previous designs?

They have to some degree. Especially in connection to OpenCL, they also name the SIMD engines of the VLIW4/5 GPUs as CUs. Compute Unit itself is an OpenCL term (and one SIMD engine is a CU there, too). AMD appears to align their nomenclature with it a bit, lately. I think it is a good move (as the OpenCL terms make more sense than their CUDA equivalents in my opinion and are additionally agnostic to the underlying hardware).

Davros · Feb 29, 2012

ps: I did Pascal, where <> was "not equal"

I.S.T. · Mar 2, 2012

fellix said:
The VLIW4 parts were delayed due to TSMC's 32nm cancellation. GCN was still too far off and something had to be released during the time frame.
Barts was a midrange market "storm trooper", so to speak. The GPU is even pin-compatible with the Cypress SKU, so the board partners had the shortest time-to-market delay possible to build, qualify and release the complete product.
If it wasn't for the 32nm fiasco, Barts would probably have been another story.

Indeed, Cayman is a straight port to 40nm from 32nm.

I suspect it'd have been faster due to higher clock if 32nm hadn't been cancelled.

Dave Baumann · Mar 2, 2012

fellix said:
The VLIW4 parts were delayed due to TSMC's 32nm cancellation.

Actually, only one chip was "affected" by the 32nm cancellation, and that was Ibiza/Cayman. For the rest, the decision to stay on 40nm had already been made (which also predicated why they existed with the VLIW5 architecture).

fellix · Mar 2, 2012

Thank you for the clarification.

CarstenS · Mar 2, 2012

Any chance of getting an answer why exactly a one-time move was made for VLIW4 for a rather low-volume product (compared to Turks and "lesser") which had a lifespan of only one year? Even if I count in Trinity, that's only one more product, basically, before APUs move to GCN as well.

From what I understand, altering an architecture ties up quite a few engineering resources and thus is quite costly.

Dave Baumann · Mar 2, 2012

Because a lot of the work was already baked and done. As mentioned, we had already taken the option of moving back to 40nm for the lower portion of the stack, but not for Ibiza (what became Cayman) - this being the lead SKU, it had already spent a lot of time on the simulator, lots of software work had already been done, etc., which was not the case for the other products on 32nm. When 32nm was no longer an option for this we had two options - create a different (bigger) configuration on VLIW5 or backport the entire thing to 40nm; both options were looked at, but seeing as so much validation had occurred on Ibiza it was judged to be lower cost, lower risk and a shorter schedule just to take that design back. Hitting first rev as production on Cayman was an indication of how much validation had already occurred.

CarstenS · Mar 2, 2012

Thanks a lot Dave!

AlNom · Mar 2, 2012

I don't suppose you're allowed to say how big or what the specs of Cayman/Ibiza would have been on 32nm had it happened.

LordEC911 · Mar 3, 2012

Dave Baumann said:
Because a lot of the work was already baked and done. As mentioned, we had already taken the option of moving back to 40nm for the lower portion of the stack, but not for Ibiza (what became Cayman) - this being the lead SKU, it had already spent a lot of time on the simulator, lots of software work had already been done, etc., which was not the case for the other products on 32nm. When 32nm was no longer an option for this we had two options - create a different (bigger) configuration on VLIW5 or backport the entire thing to 40nm; both options were looked at, but seeing as so much validation had occurred on Ibiza it was judged to be lower cost, lower risk and a shorter schedule just to take that design back. Hitting first rev as production on Cayman was an indication of how much validation had already occurred.

Interesting. So GCN was not slated for 32nm then.

Bo_Fox · Mar 7, 2012

Hello again, after a 7-day vacation! Beyond3D needs some more "excitement" from me..

Just kidding (I have a twisted sense of humor)!!!!!!!!!! But I'll go ahead and ruffle things a bit just for the heck of it.

AlStrong said:
Hopefully, 7 days is enough for Bo_Fox to examine the new information presented within the thread instead of resorting to tit-for-tat, knee-jerk, highly agitating communication.

Thank you for your patience,
AlS

Beyond3D suddenly came to life for a short while with this thread. Too bad it couldn't continue.. I know, you're welcome! BTW, I usually did not initiate the disrespectful, denigrating insults, only defending myself with a shield that automatically does a recoiling knee-jerk. Apoppin, who used to be a moderator for a long time at a much bigger forum (Anandtech) said that the posting etiquette by some posters here in this thread was much worse than my posting etiquette, and that it's an AMD-biased forum (hence ruffling your feathers).. but it does not seem to be that bad here with the AMD bias (unlike [H] and TPU which express the fanboyism in an all-out infantile expression).

One guy (itsmydamnation) was asking: "so how exactly does Barts work then if the shader compiler sends transcendentals to the T unit?" I could very well ask the same thing, "so how exactly does Cayman work then if the shader compiler sends transcendentals to the T unit?", without offering anything concrete - there's no real insight or evidence being presented.
(Sarcastically, I was just proving a point that I could have also claimed Cayman to be VLIW5-based, with the exact same words they all have been using throughout the entire thread, without ever giving some real evidence - anything concrete at all, for goodness' sake).

Lightman said:
Barts has among many other improvements:
- new front end, boosting utilization of shader core
- improved tessellation
- 32 fully functional ROPs, compared to the 5830

On top of that, game code is rarely limited by shader performance!
There are few purely shader-limited tests; Perlin Noise from 3DMark is one of them. Just look at the results from this page and think about them for a second:
http://techreport.com/articles.x/20126/7

Barts performs exactly as you would expect from VLIW5 there.
Look at more complex tests to see how Barts is in line with Cypress on Particle, Cloth, and other tests where Cayman is showing good progress.

Lightman's contention falls flat on all counts:
Improved tessellation comes nowhere near making up for the overall performance discrepancy in games that do not use tessellation.
HD 6790 (Barts) has only 16 fully functional ROPs, just like HD 5830. Lightman's very poorly thought-out comments do nothing to explain the discrepancy: HD 5830 has 33% greater shader -AND- texturing power than HD 6790, yet it performs no more than 2-3% better.

Game code is usually limited by shader performance, which was the case with HD 3870 vs HD 4870 back when games were less shader-heavy!

Regarding Perlin Noise performance (in the same Techreport link given by Lightman), Barts XT actually performs in line with HD 6950 as well as HD 5870, while the 2GB performs differently along with HD 6950, proving nothing.
As for GPU Particles, HD 6870 performs nearly identically to HD 5870 despite having FAR less FLOPS capacity. (OUCH, Lightman, did you really look at it yourself and think about them for a second?)..
As for GPU Cloth, HD 6870 performs much, much more in line with its Cayman cousins rather than with HD 5870, shader-wise.

Even Shader Toy shows HD 6870 to perform much more in line with HD 6950:
Barts XT: 180 / 2016 GFLOPS = 0.0893
HD 5870: 206 / 2720 GFLOPS = 0.0757
HD 6950: 201 / 2253 GFLOPS = 0.0892
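The score-per-GFLOPS ratios for Shader Toy are easy to recompute. A small sketch using only the scores and GFLOPS figures listed above (the dictionary layout is just for illustration):

```python
# (Shader Toy score, peak SP GFLOPS) as listed in the post.
shader_toy = {
    "Barts XT": (180, 2016),
    "HD 5870": (206, 2720),
    "HD 6950": (201, 2253),
}

ratios = {name: score / gflops for name, (score, gflops) in shader_toy.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.4f}")
# Barts XT: 0.0893
# HD 5870: 0.0757
# HD 6950: 0.0892
```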

3dcgi said:
I don't know what's happening in your example, but a VLIW5 SIMD has more shader power than a VLIW4 SIMD so it could perform faster if the shader's co-issue well and have transcendental instructions. The advantage of VLIW4 is smaller area and much of the time the extra unit doesn't provide an advantage.

Yet Barts XT performs as if the "extra unit" of VLIW5 (if it really had VLIW5) were always providing an advantage, rather than "much of the time not".

CarstenS said:
Short excerpt, so that there's at least something posted on the internet.
A roughly even mixture of a lengthy shader not doing anything useful with MUL, MADD, MIN, MAX and SQRT (an AMD program from the HD 2900 launch, basically)

HD 5870: 1,206 GI/s (Giga-Instructions per second)
HD 6870: 893 GI/s
HD 6970: 877 GI/s
HD 7970: 1,101 GI/s

Thank you!!!!! If HD 7970 were VLIW5-based, it should have had 39% more GI/s (or GFLOPS), then minus 20% due to the missing VLIW5 unit (1,224 GI/s if the performance were in line with the FLOPS capacity, which is more than a 10% discrepancy, especially given that Tahiti has so many other things going for it like bandwidth, L2, etc.). Other factors could be in play here, so I cannot rule that out yet.
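One way to put CarstenS's four results on a common scale is to divide each by a peak instruction rate. The sketch below is a rough illustration, not a definitive analysis: it assumes one instruction per SP per clock as the peak, reads the HD 5870 and HD 7970 figures as 1206 and 1101 GI/s, and takes SP counts and clocks from public spec sheets (none of these are given in the thread).

```python
# name: (measured GI/s, SP count, clock in MHz); specs are assumptions
cards = {
    "HD 5870": (1206, 1600, 850),
    "HD 6870": (893, 1120, 900),
    "HD 6970": (877, 1536, 880),
    "HD 7970": (1101, 2048, 925),
}


def peak_gips(sp_count, clock_mhz):
    """Peak giga-instructions/s, assuming one instruction per SP per clock."""
    return sp_count * clock_mhz / 1000


utilization = {
    name: measured / peak_gips(sps, mhz)
    for name, (measured, sps, mhz) in cards.items()
}

for name, frac in utilization.items():
    print(f"{name}: {frac:.1%} of assumed peak")
```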

Thank you, CarstenS, for the first "something posted on the internet" that actually leans toward Barts being VLIW5-based. It could be interpreted as Carsten saying: "Since you B3D guys posted nothing yet on the internet, I'm the first one to take the positive initiative."

Carsten gets the love:

Mintmaster said:
Bo_Fox said:

Why does Barts XT perform so well in games against Cayman specs-wise if it's VLIW5 rather than VLIW4?

It doesn't. In some games one architecture is more efficient, and in others vice versa (which, BTW, is very clear evidence that Barts and Cayman have different architectures). Look at Crysis and Stalker, where the 6950 is 23-30% and 29-40% faster, respectively, than the 6870.

Looking at Crysis and Stalker does not give "very clear evidence" sufficient to draw such an absolute conclusion. The 6950 has 39% greater texturing power, 19% more bandwidth, etc. than the 6870, so it is to be expected that in some games it really shines through. You do present a strong case, though, since both games are heavily shader-bound. However, the Cypress cards still do not pull ahead of the 6870 enough for things to make absolute sense just yet.

The reason:
Later drivers show the 6870 to be within 15-20% of the 6950, rather than up to 40% as was indicated with the early benchmarks (at the time of launch), using the same settings:
http://www.anandtech.com/show/5153/nvidias-geforce-gtx-560-ti-w448-cores-gtx570-on-a-budget/4
Note that the article states: "On cards with 1GB of VRAM or less it can be overly taxing, but with more than 1GB of VRAM the bottleneck shifts to rendering."

Thanks for trying to make a good find, though. :good:

Mintmaster said:
Bo_Fox said:

Why does Barts XT absolutely destroy HD 5850 and HD 5870, specs-wise, by a ridiculous margin?

In what world does that happen? Or are you normalizing performance to shader count?

You're making the same mistake that many people do: Shader performance is just one part of overall performance, and often less than half of a game's rendering time is limited by shaders. This is very clear when you compare the 9600GT to the 9800GT. Both are 256-bit, 16 ROP cards with equal bandwidth and similar clocks. However, the 9600GT has only 64 SPs to the 8800GT/9800GT's 112, yet the former is almost as fast as the latter in games. That's why the 7950 gets crushed by the 7970 in compute benchmarks, but only lags a bit in most games. By your logic, then, the 7950 and 9600GT are more efficient than the 7970 and 9800GT, and must have a better architecture.

The 9600GT is an excellent example, thank you very much. :smile: But you must not forget that the 9600GT has higher clocks, etc... far from "similar" as you put it. The 8800GT was already somewhat more bottlenecked by the bandwidth and ROPs (which still gave an 8800GTX about 20% advantage overall). The 8800GT has 25% greater overall gaming performance than the 9600GT, so it's not "almost as fast". By the way, I do notice the 7770 Cape Verde being astoundingly efficient, at 94 Voodoopower, compared against HD 7970 with roughly 3x the specs on paper, but only 220 Voodoopower.

It's just that Barts XT has something rather magical in it.

Perhaps Barts XT really has 1280 VLIW5 shaders, or what is it EXACTLY about Barts' improved front end that makes it perform amazingly well given the specs?

Mintmaster said:
Bo_Fox said:

Why does HD 6790 perform about the same as HD 5830 if the latter has 33% more shader and texturing power, with other specs being roughly the same - if BOTH are VLIW5?

The 5830 has always been an underperformer, taking a bigger hit vs the 5850 than the 6790 takes vs the 6850, despite similar handicaps. It's an outlier, so that comparison is meaningless.

Hardly, since HD 5830 actually has a whopping 33% more FLOPS capability and 33% more texturing power than the 6790. Both handicaps are pretty similar, given that both have the same cache structure, and the same VLIW5 architecture as is claimed. The 5830 actually has just as many shaders and TMUs as the highest-end Barts! To say that it's an outlier after considering this is just as meaningless in that context.
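The 33% figures can be reproduced from unit counts and clocks. A minimal sketch, assuming the public specs of HD 5830 (1120 SPs, 56 TMUs, 800 MHz) and HD 6790 (800 SPs, 40 TMUs, 840 MHz), which are not quoted in this thread; the helper name is hypothetical:

```python
def advantage_percent(units_a, mhz_a, units_b, mhz_b):
    """Percent by which card A's units*clock throughput exceeds card B's."""
    return (units_a * mhz_a / (units_b * mhz_b) - 1) * 100


# HD 5830 vs HD 6790, using the assumed public specs.
shader_adv = advantage_percent(1120, 800, 800, 840)   # SPs
texture_adv = advantage_percent(56, 800, 40, 840)     # TMUs

print(f"shader: {shader_adv:.1f}%  texturing: {texture_adv:.1f}%")
# shader: 33.3%  texturing: 33.3%
```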
