Why Barts is really VLIW4, not VLIW5 (and more on HD 5830/6790 being mainly 128-bit)

Bo_Fox · Feb 28, 2012

First, let's look at the 128-bit behavior of the HD 5830 and get that out of the way, so that we can better understand the behavior of the HD 6790. (AMD would probably like us to believe that they had already decoupled the memory controllers from the ROPs in an efficient manner by then, which does not seem to be the case.)

If we look at the 5830 in the first bar chart on this page:

http://techreport.com/articles.x/18521/5

We see that the 3DMark Vantage color fill test is strongly correlated with the available bandwidth. If the 5830 were truly 256-bit, it would have the same bandwidth as the 5850. However, the performance shows that this is not the case: it manages barely half of the 5850's result, and it is also much slower than the HD 4870, which has just 16 ROPs at a lower clock.

Next, if we look at BeHardware's ultimate scrutiny:

http://www.behardware.com/articles/783-3/preview-radeon-hd-5830.html

We see that the 5830 has far lower FP16 and FP32 GPixel/s write rates than not only the 5770, which has a slightly higher fill rate, but also (to a far greater degree) the 4890. The test scales linearly with the available bandwidth, as the 4890 is so much faster than the 5770, let alone the 5830, in that respect.

Both the 5830 and 6790 behave as if the memory were only 128 bits wide. So, these cards should be called 128-bit. Otherwise, AMD could just as well say that the 5770 is 256-bit and we'd believe them.

One more thing is that as with Barts architecture, it really does look like it is based on VLIW4 architecture, not the traditional VLIW5 one. It seems that AMD wanted to save the "thunder" for Cayman's launch by reserving the announcement for what desperately needed as much thunder as possible. Just look at how close 6870's performance is to 5870 while comparing 6870's 1120sp and 4.2GHz memory to 5870's 1600sp and 4.8GHz memory.

__________

Now, let's compare the 6790 to the similarly-spec'ed 5830:

We can see that the 5830 has far higher numbers in these areas:
44.8 GT/s
1790 GFLOPs

And the 6790 has only
33.6 GT/s
1344 GFLOPs

The numbers above are 33% greater for the 5830.
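For anyone who wants to check the arithmetic, here is a minimal sketch that uses only the four theoretical figures quoted above. The dictionary keys and the helper name `percent_advantage` are made up for illustration.

```python
# Theoretical peak figures exactly as quoted in the post.
hd5830 = {"gtexels_per_s": 44.8, "gflops": 1790.0}
hd6790 = {"gtexels_per_s": 33.6, "gflops": 1344.0}


def percent_advantage(a: float, b: float) -> float:
    """Return how much larger a is than b, in percent."""
    return (a / b - 1.0) * 100.0


# Compare each metric of the HD 5830 against the HD 6790.
advantage = {key: percent_advantage(hd5830[key], hd6790[key]) for key in hd5830}

for key, value in advantage.items():
    print(f"{key}: HD 5830 ahead by {value:.1f}%")
```

Both ratios land at roughly 33%. These are paper numbers only; they say nothing about how well either chip keeps its units busy.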

By the way, the 6790 has just 5% greater pixel fillrate and memory bandwidth than the 5830. 5% in ROP pixel fillrate and bandwidth translates to no more than 2% in overall performance.

If the 6790 is also VLIW5-based, why is the 5830 only 2-3% faster than the 6790 in all of the Anandtech benchmarks ( http://www.anandtech.com/show/4260/amds-radeon-hd-6790-coming-up-short-at-150 )? Why isn't the 5830 at least 30% faster, as the numbers above would suggest?

Do you think all of the Barts-derived GPUs behave as if they're VLIW4 instead of VLIW5 (with the 6870 being so close to the 5870, which has 43% more shaders and TMUs)?

____________

Well, as to the "WHY", my guess is that AMD just wanted to adhere to their traditional naming structure where all x8xx (3870, 4850, 5850, etc..) cards were 256-bit. In the past, whenever a card had the ROPs cut in half, like the HD 4730, AMD acknowledged that it has 128-bit memory.

See:

http://www.hexus.net/content/item.php?item=19175&page=2

But the 5830 was not named 5730 in that respect, like the 4730 derivative of RV770XT. The 5830 was of Cypress (not Juniper) branch of Evergreen, since it was a cut-down version of RV870XT chip.

Also, with no DX11 competition at that time, the 5830 was introduced at a much higher-than-expected price, so simply announcing a 256-bit bus helped to justify the $240 price and helped the higher-end 5850 maintain its inflated price at $30 over the original MSRP by serving as a silly buffer. (Woe to those who had only $240 and took the plunge for the 5830 rather than the faster 4890 that was $75-100 less.)

I guess AMD decided that since it still used the same PCB as the 5870, with the same number of memory traces, they could still claim "256-bit bus" with its high 100+ GB/s bandwidth (without reminding us that the halved ROPs means that the access to available bandwidth is effectively "halved"). This 256-bit claim sure did get past most of us video card enthusiasts. Such sawed-off cards were a tad bit scarce anyways, so it did not create too much of a fuss.. the prices were outrageous anyways.

(Never mind that the tradition of the numbering structure is now changed with the 6790, 68xx, and 69xx cards, etc.. but I would prefer straight-up honesty like with the HD 4730 and its 128-bit memory. In the link, do you see how we are told that the 4730 uses 128-bit bus with its halved ROPs?)

About Barts really being VLIW4.. I guess that AMD just wanted to save the thunder for the Cayman cards, which still fell short of what most AMD fans expected. I just remembered that the DP (double precision, FP64) capability was left out of Barts cards, so it's harder to prove whether it's really VLIW4 instead of VLIW5, but come on, just look at the numbers. Why is the 6790 only 2-3% slower than the 5830, which has 40% more shaders and TMUs while the rest of the specifications are practically identical, if both are indeed VLIW5-based?

___________________

But is it just a conspiracy theory? Let's look at the 5830 again..

Why is it actually slower than the 4890 in DX9/10 games if:

The 5830 has 40% more shaders, 40% more TMU's, 32% more GFLOPs output, 32% more texels per second..

while both have the same number of ROPs and the same "alleged 256-bit bus"? The 5830's memory is actually clocked higher than the 4890.

____________

Ryan Smith posted a reply over at Anandtech comments section of the 6790 review:

Ryan Smith said:
From a graphics point of view it's not possible to separate the performance of the ROPs from memory bandwidth. Color fill, etc are equally impacted by both. To analyze bandwidth you'd have to work from a compute point of view. However with that said I don't have any reason to believe AMD doesn't have a 256-bit; achieving identical performance with half the L2 cache will be harder though.

The wording is rather confusing, as if a marketing PR guy were trying to defend AMD by saying things that defend the company but do not make clear sense.
1) If it's not possible to separate the performance from a "graphics" rather than "compute" point of view, then should not the performance be linked for all "graphics" points of view (as it is a "graphics" card to begin with)? Even the "compute" applications (FP16 and FP32 analysis at http://www.behardware.com/articles/783-3/preview-radeon-hd-5830.html ) show the card behaving as if it's 128-bit, so Ryan is only correct in the part where it's not possible to separate the performance of sawed-off halved ROPs from memory bandwidth.
2) Why does Ryan not have any reason to believe.. because AMD said so? If a manufacturer of a LCD panel advertises 1ms G2G response time, but it looks like 16ms, does he still have no reason to believe it's 16ms just because the manufacturer said so?
3) If the L2 cache is cut down in proportion with the castrated shaders/TMUs/ROPs, then it should not affect performance, let alone "harder though".
____________

Just ONE of the many proofs that Barts XT is based on VLIW4 architecture like its HD 6930 cousin:

HD 6930 has 14% more bandwidth than HD 6870, to begin with.

Both have the same 32 ROPs, but HD 6870 has higher 900MHz clock vs HD 6930's 750 MHz clock. However, HD 6870 has drastically lower number of TMUs, at only 56 vs HD 6930's 80 TMUs. Realistically, Barts XT already has plenty of ROPs for its capacity. See more specs here, like the Gpixels/sec and Gtexels/sec as related to the ROPs and TMUs.

http://www.xbitlabs.com/articles/graphics/display/radeon-hd-6930.html

Now, let's look at the shader performance:
HD 6930 with 1280sp at 750MHz yields 1920 GFLOPs
HD 6870 with 1120sp at 900MHz yields 2016 GFLOPs

The 6870 actually has 5% more shader power.
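The two shader figures can be reproduced with a few lines. This is only a sketch: it assumes the usual peak-rate convention of 2 FLOPs (one multiply-add) per stream processor per clock, which the post does not spell out, and `peak_gflops` is a hypothetical helper name. The results come out in GFLOPs.

```python
def peak_gflops(stream_processors: int, clock_mhz: float,
                flops_per_sp_per_clock: int = 2) -> float:
    """Theoretical peak in GFLOPs.

    Assumption: each stream processor retires one multiply-add
    (2 FLOPs) per clock. Real shader code rarely sustains this.
    """
    return stream_processors * flops_per_sp_per_clock * clock_mhz / 1000.0


hd6930 = peak_gflops(1280, 750)  # 1280sp at 750 MHz
hd6870 = peak_gflops(1120, 900)  # 1120sp at 900 MHz

print(f"HD 6930: {hd6930:.0f} GFLOPs")
print(f"HD 6870: {hd6870:.0f} GFLOPs")
print(f"HD 6870 advantage: {(hd6870 / hd6930 - 1) * 100:.1f}%")
```

Under that assumption the 6870 leads by 5% on paper, matching the figure above.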

However, the 6930 comes out about 4-5% faster overall, according to Xbit Labs' first chart here: http://www.xbitlabs.com/articles/graphics/display/radeon-hd-6930_8.html#sect0

It is already anticipated that the 6870 is moderately bandwidth-bottlenecked. With 14% more bandwidth, the 6930 is probably making up for the deficit in shader power with the bandwidth alone.

The number of ROPs is already bountiful for both cards, so the difference in theoretical maximum pixel fillrate yields less than 0.5% real-world performance difference (if it were 16 ROPs, then a 100% increase to 32 ROPs would have yielded up to 5% overall). The "GPU 101" course teaches that the pixel fillrate is not a factor here, since neither card is bottlenecked by its 32 ROPs in 99.8+% of cases.

The 6930 has a 31% increase in texturing power. While having 31% more texturing power, the 14% increase in bandwidth more than allows for the card to make up for the deficit in shader power against Barts XT, in that it actually ends up around 4-5% faster overall.

If Barts XT were VLIW5-based rather than VLIW4-based, Barts XT would have to have considerably more shader power just to maintain this position. It would have to have roughly the same number of shaders as HD 5850.
____

Let's see..

HD 5850 has 1440sp, 72 TMUs, 32 ROPs, all clocked at 725 MHz for 2.09 TFLOPs, 52.2 GTexels/s, 23.2 GPixels/s. The memory is 4.0Gbps, for 128 GB/s bandwidth. It's rated at 112 VP. Increasing the clocks (both GPU and memory) by 8% in order to equal Barts XT (HD 6870) would cause it to have 2.26 TFLOPs, 56.4 GTexels/s, 25 GPixels/s and nearly identical bandwidth to HD 6870.

Now, that is 12% more shader power than Barts XT, while all of the other specs are nearly identical (except that Barts XT is still lagging behind in the TMU department by well over 10%) just for HD 5850 to be equal to Barts XT in performance.
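The 8% scaling exercise can be sketched as follows. The unit counts, clock and bandwidth are the HD 5850 figures quoted above, and 2.016 TFLOPs is the HD 6870 figure from earlier in the post. The 2 FLOPs per stream processor per clock factor is an assumption (the usual peak-rate convention), and `theoretical_rates` is a hypothetical helper name.

```python
# HD 5850 figures as quoted in the post.
SHADERS, TMUS, ROPS = 1440, 72, 32
CLOCK_MHZ = 725.0
BANDWIDTH_GB_S = 128.0
FLOPS_PER_SP_PER_CLOCK = 2  # assumption: one multiply-add per SP per clock

BARTS_XT_TFLOPS = 2.016  # HD 6870 peak, from earlier in the post


def theoretical_rates(clock_mhz: float, bandwidth_gb_s: float) -> dict:
    """Peak paper rates for the fixed unit counts at a given clock."""
    return {
        "tflops": SHADERS * FLOPS_PER_SP_PER_CLOCK * clock_mhz / 1e6,
        "gtexels_per_s": TMUS * clock_mhz / 1e3,
        "gpixels_per_s": ROPS * clock_mhz / 1e3,
        "bandwidth_gb_s": bandwidth_gb_s,
    }


stock = theoretical_rates(CLOCK_MHZ, BANDWIDTH_GB_S)
# Raise both the GPU clock and the memory clock by 8%.
scaled = theoretical_rates(CLOCK_MHZ * 1.08, BANDWIDTH_GB_S * 1.08)

shader_advantage_pct = (scaled["tflops"] / BARTS_XT_TFLOPS - 1) * 100

for name, value in scaled.items():
    print(f"{name}: {stock[name]:.2f} -> {value:.2f}")
print(f"Scaled HD 5850 vs HD 6870 shader power: +{shader_advantage_pct:.1f}%")
```

This only reproduces the paper figures; whether equal paper figures should mean equal game performance is exactly what the thread is arguing about.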

In conclusion, let's just say that a VLIW5-based card with identical specs to Barts XT (HD 6870 with only 56 TMUs) needs about 14-15% more shaders just in order to match HD 6870 in performance. How could that be if AMD claimed Barts XT to be VLIW5-based in one of their slideshows, along with the rest of the world believing it?

I have already done several examples just like this one in the other thread. Why am I the only one who has really pointed this out, out of thousands of video card enthusiasts posting on the interwebs (and telling ATF repeatedly that it's VLIW4-based while providing proof, yet they still state it as VLIW5)? :scratch:

(I could break these into separate posts, but I don't know if that's against the rules.. Heard that B3D was really strict with the banning and stuff..)

It's the nerdy stuff, so I thought you "aficionados" here would enjoy some hard-core stuff! :smile:

Bo_Fox · Feb 28, 2012

Edit- fixed a link in OP

UPDATE:

The 5830 / 6790 cards might indeed have a 256-bit bus (with a 256-bit total of pads and on a PCB with 256-bit traces), but they are unable to effectively access the bandwidth with sawed-off ROPs, so they behave as if they're 128-bit in 95% of real-world applications (games) and theoretical benchmarks.

The 6790 with VLIW4 and 128-bit bus performs exactly where it is supposed to be according to my mathematical calculations. Also, the 5830 performs exactly where it should be with 128-bit bus.

Usually, the retail box states the bus width in a large font somewhere on the front or the rear. Of course, the VRAM size is in the biggest font, but the next thing is usually the bus width. There have been enough tests done to get it out of the way for the main debate at hand:

Barts being VLIW4-based rather than VLIW5.

Pitcairn (hopefully an HD 7850 that closely matches HD 6870 in specs) will bring some more useful comparisons against the Barts GPU, for another round of empirical comparisons to extrapolate from (as if it's not enough for some of you guys to be satisfied - you guys probably want some more of this addictive stuff)!

mczak · Feb 28, 2012

Too lengthy post to read.
You have some point about the 5830/6790 being 128bit in practice, but you're REALLY on crack about barts being vliw4.
It looks like somehow the ROPs can't make use of all the bandwidth indeed, though the reasons are unclear.
As to why 5830 isn't much faster than 6790 it's the same reason 5850 isn't really much slower than 5870 at the same clocks: these cards don't scale well at all with more simds, they scale very well with higher clocks, and the 6790 has higher clocks than 5830.
As for proof that Barts is vliw5, you could essentially run any synthetic benchmark tuned for that purpose, or you could take a quick look at the linux opensource driver.

Bo_Fox · Feb 28, 2012

mczak said:
Too lengthy post to read.
You have some point about the 5830/6790 being 128bit in practice, but you're REALLY on crack about barts being vliw4.
It looks like somehow the ROPs can't make use of all the bandwidth indeed, though the reasons are unclear.
As to why 5830 isn't much faster than 6790 it's the same reason 5850 isn't really much slower than 5870 at the same clocks: these cards don't scale well at all with more simds, they scale very well with higher clocks, and the 6790 has higher clocks than 5830.
As for proof that Barts is vliw5, you could essentially run any synthetic benchmark tuned for that purpose, or you could take a quick look at the linux opensource driver.

If it's that ineffective, then the 5850 still needs even more SPs (than the 14-15% as I proved at the bottom of the OP, which is already 1600sp) in order to match up against HD 6870 and its 1120sp, if the rest of clocks and specs were all scaled up linearly together to make up for the overall performance difference. Actually, VLIW5 scaling is not THAT ineffective - it's just VLIW5 vs VLIW4.

About the 6790 having higher clocks, ever think about why it's actually slower than the 5830 if it's also VLIW5, with far fewer shaders (along with less FLOPS performance - see the 2nd part of the OP for the comparison)? This alone should already have presented a STRONG case to even the most skeptical.

Has anybody really thought about how Barts XT came pretty close to HD 6950, without needing anywhere nearly as many shaders (or TFLOPs theoretical output) or texturing power or bandwidth as the 5870? Sure, just increase the bandwidth and the TMUs, and also the shaders by a linear amount, and drop the core clock to 800 MHz, and yet it won't need much more than 1408 shaders to already beat a 6950 and also a 5870 for that sake.

:idea:

mhouston · Feb 28, 2012

Barts is VLIW5. Your analysis is arguing that the chip is balanced differently than others which is true as the 5830 is a bit of an odd balance compared to the rest of the generation.

Bo_Fox · Feb 28, 2012

mhouston said:
Barts is VLIW5. Your analysis is arguing that the chip is balanced differently than others which is true as the 5830 is a bit of an odd balance compared to the rest of the generation.

Please elaborate. Anything at all, thanks.

The 6790 shouldn't be much different than the 5830 if both are VLIW5. I challenge you to explain why the 5830 needs gobs more shaders then (as discussed in 2nd part of OP). Otherwise, your statement holds little water without an explanation.

EDIT: Barts is VLIW4 until you prove otherwise, because my math proves this. You could try to answer just the 2nd part of the OP (the parts are separated by lines, while ignoring the rest of the post) : why is HD 5830 only 2-3% faster than HD 6790 if it has 33% greater shader power, with the rest of the specs being roughly equal? This is an absolutely MASSIVE difference in shader power for both cards to be roughly equal in performance, with nearly identical clock, ROPs, bandwidth, etc.. with the only big difference being that HD 5830 already has 33% greater texturing power, also. Anybody with an IQ of over 100 should figure this out after reading the OP and following the calculations.

OpenGL guy · Feb 28, 2012

Bo_Fox said:
Please elaborate. Anything at all, thanks.

The 6790 shouldn't be much different than the 5830 if both are VLIW5. I challenge you to explain why the 5830 needs gobs more shaders then (as discussed in 2nd part of OP). Otherwise, your statement holds little water without an explanation.

EDIT: Barts is VLIW4 until you prove otherwise, because my math proves this.

What sort of logic is this?

Barts is VLIW5. Download AMD GPU Shader Analyzer or AMD APP Kernel Analyzer and see for yourself.

Bo_Fox · Feb 28, 2012

Umm, elementary school logic? Thanks for the compliment..

All Barts GPUs behave as if they have the exact advantage of their Cayman cousins with the more efficient VLIW4 shaders (both being of the Northern Islands family, while AMD probably wanted to save the "thunder" for Cayman which they knew would disappoint).

Please prove otherwise, so that we can see for ourselves why it's so important for us to believe that it's VLIW5. I do not care if the programs label Barts GPU as VLIW5 - it proves nothing, even if it's in the algorithm to base Barts upon VLIW5 calculations. As an analogy, GPUz also had some errors, calculating the bandwidth according to GDDR3 rather than GDDR5 when the video card really had GDDR5.

UniversalTruth · Feb 28, 2012

Bo_Fox said:
...why is HD 5830 only 2-3% faster than HD 6790 if it has 33% greater shader power, with the rest of the specs being roughly equal?...

Do you consider software drivers as a factor when comparing two cards?

OpenGL guy · Feb 28, 2012

Bo_Fox said:
Umm, elementary school logic? Thanks for the compliment.

All Barts GPUs behave like as if they have the exact advantage of their Cayman cousins with more efficient VLIW4 shaders.

No, they don't.

Bo_Fox said:
Please prove otherwise, so that we can SEE for ourselves why it's so important for us to believe that it's VLIW5. Until then, it's nonsense.

I don't have to prove anything. You're the one jumping to conclusions without actually verifying that your assumptions are correct when the information is publicly available. I already pointed you to two publicly available tools you can use to see what sort of code is generated for Barts.

It's not open for debate: Barts is VLIW5. If you run some compute apps it will be quite clear.

Bo_Fox · Feb 28, 2012

UniversalTruth said:
Do you consider software drivers as a factor when comparing two cards?

Yeah, maybe by about 5% at most, but not included in the calculations for convenience's sake. If both are indeed VLIW5-based, then the driver differences should be even less of a factor. It still comes nowhere near making up for the 33% increase in shader and texturing power that the 5830 has over the 6790, with the rest of clocks and specs being roughly equal, for the 5830 to be only 2-3% faster overall. It still does not even make up for other calculations that I have done with other cards too.. There's just too much here to believe that Barts could possibly be VLIW5-based at all.

OpenGL guy said:
No, they don't.

I don't have to prove anything. You're the one jumping to conclusions without actually verifying that your assumptions are correct when the information is publicly available. I already pointed you to two publicly available tools you can use to see what sort of code is generated for Barts.

It's not open for debate: Barts is VLIW5. If you run some compute apps it will be quite clear.

You don't have to debate here if you are not going to prove anything. It's ok if you do not want to. Math does not lie.

itsmydamnation · Feb 28, 2012

Bo_Fox said:
You don't have to debate here if you are not going to prove anything. It's ok if you do not want to. Math does not lie. Your words... maybe.

epic LOLZ....... new guy questions word of Guy who's proven his worth and knowledge to this forum on hundreds if not thousands of posts.

this seems to be happening a lot of Late.........

damn FOB's

word to new guy....... maybe figure out who mhouston is before questioning his authoritar /cartman

OpenGL guy · Feb 28, 2012

Bo_Fox said:
You don't have to debate here if you are not going to prove anything. It's ok if you do not want to. Math does not lie. Your words... maybe.

You're not going to enjoy your time here with an attitude like that.

From your earlier post:

Bo_Fox said:
Why am I the only one who has really pointed this out, out of thousands of video card enthusiasts posting on the interwebs (and telling ATF repeatedly that it's VLIW4-based while providing proof, yet they still state it as VLIW5)? :scratch:

Occam's Razor says that the simplest answer is the correct one and here the simplest answer is that you are wrong. That's why you are the only person on the web making this "discovery".

P.S. If GPU performance were simply about the number of ALUs then you might be onto something. Unfortunately (or fortunately for some), it isn't that simple. Check out one of the block diagrams AMD has posted and you'll see there is much more involved, even at the coarse level published in marketing docs.

Bo_Fox · Feb 28, 2012

itsmydamnation said:
epic LOLZ....... new guy questions word of Guy who's proven his worth and knowledge to this forum on hundreds if not thousands of posts.

this seems to be happening a lot of Late.........

damn FOB's

word to new guy....... maybe figure out who mhouston is before questioning his authoritar /cartman

Yeah, like you don't dare question the slideshow stating that Bulldozer has 2B transistors, until a couple months later AMD admitted that it has 1.2B transistors.

I have the right to question things - especially when the math does not add up at all. And especially when the ONLY source of Barts being VLIW5-based is also a simple slideshow presentation by AMD right before the time of launch. Amen to that.

Bo_Fox · Feb 28, 2012

OpenGL guy said:
You're not going to enjoy your time here with an attitude like that.

From your earlier post:

Occam's Razor says that the simplest answer is the correct one and here the simplest answer is that you are wrong. That's why you are the only person on the web making this "discovery".

P.S. If GPU performance were simply about the number of ALUs then you might be onto something. Unfortunately (or fortunately for some), it isn't that simple. Check out one of the block diagrams AMD has posted and you'll see there is much more involved, even at the coarse level published in marketing docs.

More words and still nothing concrete. You type quite a bit. I typed out the math for you. Maybe you were a bit offended to start the attitude. Me having an attitude? Nah, not half as bad as yours, I honestly think, IMHO.

I just want to have a concrete discussion -

Not for the discussion to be:

"You better take the BIG GUY's word as the truth, and swallow it hard, and shut up." Why can't it be a more intelligent discussion, rather than resorting to the bully tactics?

itsmydamnation · Feb 28, 2012

Bo_Fox said:
Yeah, like you don't dare question the slideshow stating that Bulldozer has 2B transistors, until a couple months later AMD admitted that it has 1.2B transistors.

I have the right to question things - especially when the math does not add up at all. And especially when the ONLY source of Barts being VLIW5-based is also a simple slideshow presentation by AMD right before the time of launch. Amen to that.

how doesn't the math add up?

so how exactly does Barts work then if the shader compiler sends transcendentals to the T unit?

OpenGL guy · Feb 28, 2012

Bo_Fox said:
More words and still nothing concrete. You type quite a bit. I typed out the math for you. Maybe you were a bit offended to start the attitude. Me having an attitude? Nah, not half as bad as yours, I honestly think, IMHO.

I just want to have a concrete discussion -

Not for the discussion to be:

"You better take the BIG GUY's word as the truth, and swallow it hard, and shut up." Why can't it be a more intelligent discussion, rather than resorting to the bully tactics?

People have already provided feedback, but you're not listening.

mczak said:
As for proof that Barts is vliw5, you could essentially run any synthetic benchmark tuned for that purpose, or you could take a quick look at the linux opensource driver

OpenGL guy said:
Barts is VLIW5. Download AMD GPU Shader Analyzer or AMD APP Kernel Analyzer and see for yourself.

OpenGL guy said:
It's not open for debate: Barts is VLIW5. If you run some compute apps it will be quite clear.

How can anyone respond to "My math proves it."? More faulty math?

Davros · Feb 28, 2012

steampoweredgod is that you

can someone provide some test data proving what barts is otherwise this will go on and on

Bo_Fox · Feb 28, 2012

itsmydamnation said:
how doesn't the math add up?

so how exactly does Barts work then if the shader compiler sends transcendentals to the T unit?

You want more math?

Ok let's try something else, that's new..

Barts XT vs HD 6950:

About 12% overall performance difference. All right? Everybody ok with that?

Now, increase all of Barts specs by 12% while leaving the clocks alone, to try to make Barts XT equal to HD 6950 in overall performance. Or should I do it by 13%, just for the heck of it? Let's just do 13% then.

shader power: 2016 GFLOPs plus 13% = 2278 GFLOPs
texturing: 50.4 Gtexels/sec plus 13% = 57 GT/s
bandwidth: 134.4 GB/s plus 13% = 151.9 GB/s

Ignore the Gpixels/sec as Barts XT already has 32 ROPs and higher clocks than Cayman, which means almost nothing (less than 1% practical real-world difference in gaming).

HD 6950 has:
shader power: 2253 GFLOPs
texturing power: 70.4 Gtexels/sec
bandwidth: 160 GB/s

Whoa! Wait a minute here.. increasing ALL of Barts XT specs by 13% in order to increase overall perf by 13% only gives it about 1% more GFLOPS than HD 6950, while STILL having considerably LESS texturing power and bandwidth.
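Here is the same comparison as a short sketch, using only the figures listed in this post. The 13% factor is the poster's choice, and the dictionary layout is just for illustration.

```python
# Theoretical figures for both cards, as listed in the post.
barts_xt = {"gflops": 2016.0, "gtexels_per_s": 50.4, "bandwidth_gb_s": 134.4}
hd6950 = {"gflops": 2253.0, "gtexels_per_s": 70.4, "bandwidth_gb_s": 160.0}

SCALE = 1.13  # scale every Barts XT spec up by 13%

scaled = {key: value * SCALE for key, value in barts_xt.items()}

# Ratio of the scaled-up Barts XT to the HD 6950 for each metric.
relative = {key: scaled[key] / hd6950[key] for key in scaled}

for key in scaled:
    print(f"{key}: scaled Barts XT {scaled[key]:.1f} vs HD 6950 {hd6950[key]:.1f} "
          f"({(relative[key] - 1) * 100:+.1f}%)")
```

The scaled card ends up about 1% ahead in GFLOPs, about 19% behind in texel rate and about 5% behind in bandwidth. The sketch assumes, as the post does, that overall performance scales linearly with all three figures at once, which is the very assumption being contested in this thread.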

How does that add up?

Even worse yet - how does that add up if Barts XT is based upon VLIW5 rather than VLIW4?

Somebody here, please have the integrity to answer how it is possible for Barts XT to be that AMAZINGLY efficient if it's actually VLIW5? It would mean that VLIW4 is meaningless for Cayman, netting zero overall gaming performance benefits over Barts XT.

Alexko · Feb 28, 2012

How about the bleeding obvious: 1120/64 = 17.5?
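The divisibility point can be spelled out in a short sketch. It assumes that each SIMD is 16 lanes wide, so a VLIW4 SIMD holds 16 x 4 = 64 stream processors and a VLIW5 SIMD holds 16 x 5 = 80. The thread only implies this through the "64", so treat the lane count as an assumption. The 1408 figure for the HD 6950 is likewise taken as its shader count, and `simd_count` is a hypothetical helper name.

```python
LANES_PER_SIMD = 16  # assumption: 16 VLIW lanes per SIMD on these chips


def simd_count(total_sps: int, vliw_width: int) -> float:
    """Number of SIMDs implied by a shader count for a given VLIW width."""
    return total_sps / (LANES_PER_SIMD * vliw_width)


for name, sps in (("HD 6870 (Barts XT)", 1120), ("HD 6950 (Cayman)", 1408)):
    for width in (4, 5):
        count = simd_count(sps, width)
        verdict = "whole number" if count.is_integer() else "not a whole number"
        print(f"{name}: {sps} SPs as VLIW{width} -> {count} SIMDs ({verdict})")
```

Under that assumption, 1120 stream processors divide evenly only as VLIW5 (14 SIMDs), while 1408 divide evenly only as VLIW4 (22 SIMDs).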
