Why Barts is really VLIW4, not VLIW5 (and more on HD 5830/6790 being mainly 128-bit)

mczak said:
was such a fun thread.
Same here, it was fun for me too... funny, in fact, as I felt like I was doing this to the AMD engineers before knowing who they were:
:mrgreen:

rpg.314 said:
There are better ways to handle noise than locking up *potentially* promising threads.

Bo deserves a chance and some sound advice before harsher sanctions are applied.
Duly noted, thanks. You're such a good person.

Now, it really looks like Barts is VLIW5-based, due to further explanation by DarthShader and iMacmatician.
DarthShader said:
He has one point though: his trolling was quite elaborate and insisted on numbers and math. So why not simply post some benchmark numbers, like Carsten tried, or ones where the difference between VLIW4 and VLIW5 is shown, to stuff this guy's mouth with crow? Shaders with lots of transcendentals, like the Mineral and Fire shaders mentioned by Jawed here: http://forum.beyond3d.com/showthread...de#post1422548 would do the trick, as other appeals to rationality apparently don't hit home. I got one:

[attached benchmark screenshot: 34659.png]


Tried looking for more, but googling relevant keywords made this thread appear as the first result. :LOL:
Thank you!!! After CarstenS, you're the second person to post something real that actually shows something. :smile:

I stand down from my conspiracy theories that Barts is VLIW4-based. Beyond3D ultimately wins 1-0 against Bo_Fox, after a good, long match! :oops:

iMacmatician said:
Apparently just posting benchmark numbers won't do for him. But since he wants math, how about a more rigorous approach, using a targeted shader-only benchmark (for example, CarstenS's data would work if we have theoretical maximums too), to show that Barts is VLIW5?

First I'm going to make some necessary (and hopefully safe-to-make) assumptions: Barts (XT) has 2016 SP GFLOPS; Cayman (XT) has 2703.36 SP GFLOPS; the clocks stay constant at the stated numbers during the benchmark; the theoretical maximum benchmark numbers (the numbers that would be obtained given no bottlenecks) for both chips are of the form score = K × (SP count) × (clock speed), where K is some constant; and the benchmark software gives accurate numbers.

Let B = (Barts benchmark score)/2016, C = (Cayman benchmark score)/2703.36, Bt = (Barts theoretical maximum score on that benchmark)/2016, and Ct = (Cayman theoretical maximum score on that benchmark)/2703.36.

Pick a benchmark such that Bt < C. This result shows that Barts cannot be (Cayman-type) VLIW4. Why? Assume Barts is VLIW4. We know that C < Ct. Since Barts and Cayman are both the same VLIW4, Bt = Ct, so C < Bt, which is a contradiction since the benchmark said that C > Bt.

Alternatively, pick a benchmark such that Ct < B. This result shows that Barts cannot be (Cayman-type) VLIW4. Why? Assume Barts is VLIW4. We know that B < Bt. Since Barts and Cayman are both the same VLIW4, Bt = Ct, so B < Ct, which is a contradiction since the benchmark said that B > Ct.

Did I miss anything? (I'm not an architecture expert.)
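
For concreteness, here is a minimal Python sketch of the first test, using the B/C/Bt/Ct definitions above. The scores fed in at the bottom are invented placeholders (not real measurements), purely to show how the contradiction check works:

```python
# Minimal sketch of the first contradiction test; scores are made up.
BARTS_GFLOPS = 2016.0      # Barts XT theoretical SP throughput
CAYMAN_GFLOPS = 2703.36    # Cayman XT theoretical SP throughput

def barts_could_be_vliw4(cayman_score, barts_theoretical_max):
    """Return False when the numbers contradict 'Barts is VLIW4'.

    Normalize per theoretical GFLOP. If Barts were Cayman-style VLIW4,
    then Bt = Ct, and since a measured score can never exceed the
    theoretical maximum, C <= Ct = Bt. Observing C > Bt is therefore
    a contradiction.
    """
    C = cayman_score / CAYMAN_GFLOPS            # measured, normalized
    Bt = barts_theoretical_max / BARTS_GFLOPS   # theoretical ceiling, normalized
    return C <= Bt

# Hypothetical placeholder numbers, NOT real benchmark results:
print(barts_could_be_vliw4(cayman_score=2400.0,
                           barts_theoretical_max=1600.0))  # -> False
```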
The "benchmark" numbers were nice, but I wanted a screenshot or something, to show evidence, that's all. This kind of effort should be accompanied with a screenshot, like most HW benchers provide with their data. But now that DarthShader showed something to really back it up, it is now clear that Barts is VLIW5-based.

Your "whole-hearted" explanation is a bit abstract, though! :LOL:

Ok, maybe my "conspiracy theorist" rating just went up from a 2-3 to a 5 on a scale of 1-10. Because now I am about to conjecture that, even though Barts is indeed VLIW5-based, it still needs to have 1280 shaders instead of just 1120 for things to make sense in MOST gaming benchmarks.

I should have compared it against the HD 6930 instead of the 6950, because the 6930 has 1280 VLIW4 shaders (for a clearer comparison).
Let's see:

Barts XT
GFLOPs: 2016 (VLIW5)
GT/s: 50.4
GP/s: 28.8
GB/s: 134.4

HD 6930
GFLOPs: 1920 (VLIW4)
GT/s: 60.0
GP/s: 24.0
GB/s: 153.6

With the HD 6930 only performing about 4-5% faster than Barts XT overall, while having 14.3% greater bandwidth and ~19% greater texturing power, plus a VLIW4 architecture that should be at the VERY least 11-15% more efficient overall, how is it possible for Barts XT to come that close to the 6930 with these specifications (only 1120 VLIW5 vs. 1280 VLIW4 shaders)? The 32 ROPs certainly do not bottleneck the 6930 at all, as you all should already know.
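
To make those percentages easy to check, here is a quick Python snippet using nothing but the spec numbers listed above:

```python
# Verifying the ratios from the listed specs.
barts_xt = {"GFLOPS": 2016.0, "GT/s": 50.4, "GP/s": 28.8, "GB/s": 134.4}
hd_6930  = {"GFLOPS": 1920.0, "GT/s": 60.0, "GP/s": 24.0, "GB/s": 153.6}

print(hd_6930["GB/s"] / barts_xt["GB/s"] - 1)      # 0.1428... -> 14.3% more bandwidth
print(hd_6930["GT/s"] / barts_xt["GT/s"] - 1)      # 0.1904... -> ~19% more texturing
print(barts_xt["GFLOPS"] / hd_6930["GFLOPS"] - 1)  # 0.05 -> Barts XT has 5% more FLOPS
```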

It either means that Barts' VLIW5 architecture is that much more efficient than Cayman's new VLIW4 architecture, or that the number of shaders was artificially lowered on paper to save the thunder for Cayman, which came out 2 months later (and which AMD knew would disappoint the fans). The numbers would have made LOGICAL sense if Barts XT were actually running 1280 VLIW5 shaders instead of just 1120.

I'm not trying to be dodgy at all. (BTW, those who suspected me of being somebody else 2 times in this thread are just as much conspiracists themselves! :blush: :p :taunt: ) It's just a major discrepancy... almost a severe one that makes no sense at all against the 'supposed' benefits of VLIW4.

-----MAYBE it's just the front end that has something magical in it?

To the AMD engineers, sorry for getting off on the wrong foot here. If you could explain the "magic", my respect and gratitude for your kindness will go a long way.

EDIT --- AND THANK YOU to those here who helped answer my questions about Barts. I appreciate it. It has helped me understand the workings of video cards more keenly, as I am currently working on a project (Voodoopower ratings - http://alienbabeltech.com/abt/viewtopic.php?f=6&t=21797&start=0#p41174 ). Thanks again, guys.
 
One guy (itsmydamnation) was asking: "so how exactly does Barts work then if the shader compiler sends transcendentals to the T unit?" I could very well ask the same thing, "so how exactly does Cayman work then if the shader compiler sends transcendentals to the T unit?", without offering anything concrete - there's no real insight or evidence being presented.
You got that question wrong.

To understand it, you have to look a bit at the inner workings of AMD's VLIW architectures. Each (very long) instruction (word) consists of 4 or 5 "slots" and executes a maximum of 4 or 5 operations in parallel (1 per slot; a slot can be empty if it can't be filled with useful work). The 5 slots of the VLIW5 instructions are named x, y, z, w, and t (x to w mainly for historical reasons, designating the x, y, z, w homogeneous coordinates in a vertex shader; but since we have almost freely programmable shaders, the names are just arbitrary today).

Each instruction slot is rigidly tied to a set of ALUs in the SIMD engines of the GPU (carrying the same names as the instruction slots). The GPU doesn't reorder operations between slots or anything like that, which means the assignment of an operation to a slot is done statically during compilation of the shader program (this allows the GPU hardware to be quite a bit simpler, the very reason to go the VLIW route). The x, y, z, and w slots are for practical purposes almost identical, but the t slot is a different animal. It can execute quite a few more instruction types than the other slots (and lacks some others), especially the transcendental functions, which is where its name actually comes from.

As the GPU itself can't redirect operations between slots, the compiler has to take the capabilities of each slot into account when packing the instructions. Or, the other way around: just by looking at the code created by the compiler, one can tell exactly which operation is executed in which slot.

That was the background for the (ironic) question: "so how exactly does Barts work then [if it were VLIW4] if the shader compiler sends transcendentals to the T unit?"
Because one can indeed just look at the code generated by the shader compiler (integrated into the driver) for different GPUs. Doing that reveals that for Barts GPUs, transcendentals get sent to the t slot. That slot therefore has to exist, as otherwise the code would simply not run. When looking at the code generated for Cayman (a VLIW4 GPU), one sees a different behaviour (the one you see in my sig: the transcendental operation gets distributed to the x, y, and z slots, which cooperate to carry out the operation formerly done by the t slot alone). When one tries to execute such code compiled for Cayman on a Barts GPU, it simply fails to run because of the different architecture. The mapping of the operations to the slots doesn't fit.
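
To make the "static assignment at compile time" point concrete, here is a toy Python sketch of the packing rules described above. This is a simplified illustration, not AMD's actual compiler logic, and the opcode names are arbitrary:

```python
# Toy model of static VLIW slot assignment; a simplified illustration only.
TRANSCENDENTALS = {"SIN", "COS", "LOG", "EXP", "RCP", "RSQ"}

def pack_bundle(ops, arch):
    """Assign a list of scalar ops to the slots of one VLIW bundle.

    arch == "VLIW5": x, y, z, w are general slots; a transcendental
                     can only go to the dedicated t slot.
    arch == "VLIW4": there is no t slot; a transcendental occupies
                     three general slots (x, y, z cooperate on it).
    """
    bundle = {}
    general = ["x", "y", "z", "w"]
    for op in ops:
        if op in TRANSCENDENTALS:
            if arch == "VLIW5":
                if "t" in bundle:
                    raise ValueError("only one t-slot op per bundle")
                bundle["t"] = op
            else:  # VLIW4
                for slot in ("x", "y", "z"):
                    if slot in bundle:
                        raise ValueError("x/y/z already occupied")
                    bundle[slot] = op
        else:
            slot = next(s for s in general if s not in bundle)
            bundle[slot] = op
    return bundle

print(pack_bundle(["MUL", "ADD", "SIN"], "VLIW5"))
# -> {'x': 'MUL', 'y': 'ADD', 't': 'SIN'}
print(pack_bundle(["SIN", "MUL"], "VLIW4"))
# -> {'x': 'SIN', 'y': 'SIN', 'z': 'SIN', 'w': 'MUL'}
```

Code packed with the VLIW5 rules refers to a t slot that a VLIW4 chip simply does not have, which is why a binary compiled for Cayman will not run on Barts, and vice versa.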
So if you don't believe in seeing the difference in benchmarks, the (binary) code run on the GPU reveals the different architecture, specifically that Barts is VLIW5. That is definitive proof.
 
Thank you, sir. It was a nice read. You sound like a good engineer/programmer.

It's just that I'm not an engineer or a programmer. If you were not an engineer/programmer, and I was telling you that Cayman had a VLIW5-based architecture, you'd want to see at least a screenshot of a compiler program that shows the code actually being used. You don't know if you can believe AMD's own engineers, after the Bulldozer "2 billion transistors" claim that was not corrected until a month or two after the product launched, so you want to SEE the evidence with your own eyes.

That's just what I wanted to see, and you were wrong there: DarthShader actually showed me something I could SEE with my own eyes. This Civilization V texture decompression benchmark, unlike 99% of the benchmarks out there, finally showed me something that I had badly overlooked:
@DarthShader:
Just looking at the (disassembled) ISA code sent to the GPU for execution is definitive proof, and this was mentioned in the first answers in the thread by OpenGL_guy. If that doesn't shut down Bo_Fox (looks like he didn't get the argument for some reason), I can't help him.
Is there even a screenshot of a compiler program with some code that is using VLIW4/5 architectures? I'm just curious what it looks like, that's all - no pressure, but it'd be interesting at least.
 
A screenshot of the "Stream KernelAnalyzer", which calls the compiler and is able to show the disassembly of the generated binary code, can be found here:


I don't have a screenshot of that program with some VLIW4/5 code at hand (the screenshot shows the OpenCL code translated to an intermediate language (IL) AMD uses, not the actual ISA code; one can choose the compilation target in that dropdown menu), but you can find some copy&paste lines here (comparing VLIW4 and VLIW5 code generated from the same IL code; look how the instructions for the t slot in the VLIW5 code get distributed to the other slots, like in my sig) or some other code here (comparing VLIW4 and Tahiti, which looks completely different).
The first example was actually not done with the Stream KernelAnalyzer but came from a real running program, manually feeding the binary created by the shader compiler to the disassembler.
 
It's just that I'm not an engineer or a programmer. If you were not an engineer/programmer, and I was telling you that Cayman had a VLIW5-based architecture, you'd want to see at least a screenshot of a compiler program that shows the code actually being used. You don't know if you can believe AMD's own engineers, after the Bulldozer "2 billion transistors" claim that was not corrected until a month or two after the product launched, so you want to SEE the evidence with your own eyes.
A published transistor number is really a marketing number - it has no practical use, and the reported number will make no difference to how an application works; it is immaterial to all intents and purposes (the difference in numbers just relates to what is actually counted and whether that is consistent with previously released numbers). Discussing and detailing how an architecture works, on the other hand, does have real implications for how an application performs; although the operation is abstracted away from developers, documentation and tools are provided so that developers can accurately tune their code to a given target, or at least understand application behaviour from one solution to another.
 
A screenshot of the "Stream KernelAnalyzer", which calls the compiler and is able to show the disassembly of the generated binary code, can be found here:


I don't have a screenshot of that program with some VLIW4/5 code at hand (the screenshot shows the OpenCL code translated to an intermediate language (IL) AMD uses, not the actual ISA code; one can choose the compilation target in that dropdown menu), but you can find some copy&paste lines here (comparing VLIW4 and VLIW5 code generated from the same IL code; look how the instructions for the t slot in the VLIW5 code get distributed to the other slots, like in my sig) or some other code here (comparing VLIW4 and Tahiti, which looks completely different).
The first example was actually not done with the Stream KernelAnalyzer but came from a real running program, manually feeding the binary created by the shader compiler to the disassembler.
Wow, that's some hard-core programming! It's like me trying to learn how to speak and write Japanese fluently - which would take over 10 years of complete exposure and immersion! Thanks! :cool:

You are assuming too much about VLIW4.
Hmm, yeah, many review sites were saying that about Cayman at the time of launch (usually 15%, even 20% in some cases). As one guy here said, the t unit (the 5th, "extra" unit of VLIW5) does not get used very efficiently in most games.

AMD didn't even need VLIW4 at all, given the magic of Barts! :p
It's pretty variable anyway... in the article below, more games should have been tested to get a better "overall" estimate (even though it was 11 games and 34 different settings). I'm quoting the below from the ABT forums:
apoppin said:
You might all find this interesting: GCN compared with the VLIW architecture
http://ht4u.net/reviews/2012/amd_radeon_hd_7700_test/index4.php

This poster sums it up pretty well
http://www.xtremesystems.org/forums/showthread.php?279169-GCN-vs.-VLIW5-performance-improvements
They took an HD 5770 and a new HD 7770. Both cards have 40 TMUs and 16 ROPs, with a 128-bit memory interface and GDDR5.
And both have 10 shader clusters.

The only difference is that while the HD 5770 has 160 5D shaders, the HD 7770 uses 640 1D GCN shaders.

Now they clocked the HD 7770 down to the same clocks as the HD 5770 and compared the GPUs.

HD 5770: 1360 GFLOP/s, 34.0 GTex/s, 13.6 GPix/s, 76.8 GB/s
HD 7770: 1088 GFLOP/s, 34.0 GTex/s, 13.6 GPix/s, 76.8 GB/s

You can see that the HD 7770 has about 20% fewer FLOPs; the rest of the specs are almost identical (and keep in mind that the HD 7770 has the better AF).

The results are great: the HD 7770 is up to 37% faster in some games while losing to the HD 5770 in only two games, Dirt 3 & DA2. And there it's at most 1.8% slower.
Plus, drivers for the VLIW5 arch are mature, while GCN is quite new yet; we'll see some improvements here for sure.

So, despite having about 20% less raw power, the HD 7770 is faster in most games.

And the google translate:
http://translate.google.com/transla...views/2012/amd_radeon_hd_7700_test/index4.php
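
For reference, here is a short Python snippet re-deriving the clock-normalized GFLOPS figures in that quote. The SP counts and the shared 850 MHz clock are my assumptions based on the commonly listed reference specs:

```python
# Re-deriving the clock-normalized throughput numbers in the quote above.
def sp_gflops(sp_count, clock_ghz):
    return sp_count * 2 * clock_ghz   # 2 FLOPs (multiply-add) per SP per clock

print(sp_gflops(800, 0.85))   # HD 5770: 160 VLIW5 units x 5 = 800 SPs -> 1360.0
print(sp_gflops(640, 0.85))   # HD 7770 downclocked to 850 MHz -> 1088.0
print(1 - 1088 / 1360)        # -> 0.2, i.e. 20% fewer FLOPs for the HD 7770
```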
 
Even Shader Toy shows the HD 6870 performing much more in line with the HD 6950:
Barts XT: 180 / 2016 GFLOPS = 0.0893
HD 5870: 206 / 2720 GFLOPS = 0.0757
HD 6950: 201 / 2253 GFLOPS = 0.0892
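
Spelled out, the per-theoretical-GFLOP arithmetic behind those three lines:

```python
# Performance per theoretical GFLOP, from the numbers quoted above.
print(180 / 2016)   # Barts XT (HD 6870) -> ~0.0893
print(206 / 2720)   # HD 5870            -> ~0.0757
print(201 / 2253)   # HD 6950            -> ~0.0892
```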
You're doing the same stupid metric of perf/flops. That always goes up for the same architecture when you only reduce shader count and nothing else.

The 9600GT is an excellent example, thank you very much. :smile: But you must not forget that the 9600GT has higher clocks, etc... far from "similar", as you put it. The 8800GT was already somewhat more bottlenecked by its bandwidth and ROPs (which still gave the 8800GTX about a 20% advantage overall). The 8800GT has 25% greater overall gaming performance than the 9600GT, so it's not "almost as fast".
It's not even close to 25%. Look at this review:
http://techreport.com/articles.x/14168/5
The 8800GT is only 10-15% faster than the stock 9600GT, despite having 60% more shading and texturing power. That includes the clock difference of 8%, and both have the same architecture and bandwidth.
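
For anyone checking that 60% figure, the arithmetic is below. The unit counts and reference clocks (8800GT: 112 SPs / 56 TMUs at 600/1500 MHz; 9600GT: 64 SPs / 32 TMUs at 650/1625 MHz) are my assumptions from the commonly listed specs:

```python
# Rough source of the "60% more shading and texturing power" figure.
shading   = (112 * 1500) / (64 * 1625)   # SPs x shader clock -> ~1.615
texturing = (56 * 600) / (32 * 650)      # TMUs x core clock  -> ~1.615
print(shading, texturing)                # both ~60% more for the 8800GT
```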
Perhaps Barts XT really has 1280 VLIW5 shaders; or what is it EXACTLY about Barts' improved front end that makes it perform amazingly well given the specs?
Amazing compared to what? The 5870? It's the same thing that makes a 9600GT "more efficient" than a 8800GT: faster clock and fewer shaders, despite having the same architecture.
Hardly, since HD 5830 actually has a whopping 33% more FLOPS capability and 33% more texturing power than the 6790.
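
For reference, the 33% figures do follow from the reference specs. The SP/TMU counts and clocks below (HD 5830: 1120 SPs, 56 TMUs at 800 MHz; HD 6790: 800 SPs, 40 TMUs at 840 MHz) are my assumption from the commonly listed specs:

```python
# Checking the 33% claims from assumed reference specs.
print((1120 * 2 * 0.800) / (800 * 2 * 0.840) - 1)  # FLOPS:  ~0.333 -> 33% more
print((56 * 800) / (40 * 840) - 1)                 # texels: ~0.333 -> 33% more
```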
Read my post again. You're comparing the wrong cards. Fill in the blanks:
-5830 is a cut down version of _____, and is __% slower
-6790 is a cut down version of _____, and is __% slower
The 5830 is an underperformer and an efficiency outlier in the Evergreen family.
 
Thank you, sir. It was a nice read. You sound like a good engineer/programmer.

It's just that I'm not an engineer or a programmer.

Then you should stop talking about things that only engineers and programmers understand.

Learn these things, and start posting again when you understand something about the things you are now posting about.
 
Then you should stop talking about things that only engineers and programmers understand.

Learn these things, and start posting again when you understand something about these things.

Nah, the thing is that we now have a screenshot to look at. I simply asked for it, and now, we're able to SEE something. It's a good thing for non-engineers/programmers to be able to see what it even looks like. It's pretty cool, huh?

I'm learning as I inquire into the behavior of video cards, and it's already helping me with my project on rating the GPUs (thanks again to those who actually explained a thing or two).

[meme image]


Just kidding, but your post is a bit disheartening, so I guess I should try to remind you to just smile!!!

You're doing the same stupid metric of perf/flops. That always goes up for the same architecture when you only reduce shader count and nothing else.

It's not even close to 25%. Look at this review:
http://techreport.com/articles.x/14168/5
The 8800GT is only 10-15% faster than the stock 9600GT, despite having 60% more shading and texturing power. That includes the clock difference of 8%, and both have the same architecture and bandwidth.
Amazing compared to what? The 5870? It's the same thing that makes a 9600GT "more efficient" than a 8800GT: faster clock and fewer shaders, despite having the same architecture.
Read my post again. You're comparing the wrong cards. Fill in the blanks:
-5830 is a cut down version of _____, and is __% slower
-6790 is a cut down version of _____, and is __% slower
The 5830 is an underperformer and an efficiency outlier in the Evergreen family.

Well, without AA, the 8800GT is 26.9% faster, and that's across 16 games (at 1600x1200).
http://www.computerbase.de/artikel/...e-9600-gt-sli/24/#abschnitt_performancerating
This benchmark reflects normal playable settings using a 9600GT - I positioned the card so that it was barely faster than the HD 3870, because that's what I clearly remember. It was in pretty much the same price category, and nobody wanted to use AA with the HD 3870 due to its horrible performance in most then-current games (especially Unreal Engine 3 games like Bioshock).
With AA enabled, the 8800GT does seem to be somewhat bottlenecked by its ROPs in comparison, though.

As to other issues that you mentioned, I'll just leave you alone with them. J/K.. It's a good point about the 9600GT and about the 5830. I do realize how the 5830 is pretty disappointing compared against HD 5770, with tons more shaders and stuff but less bandwidth.

Thanks for trying to point it out.

On the OTHER hand, look at how well HD 7750 (castrated Cape Verde) is performing against HD 7770. Look at HD 7950 vs HD 7970. Look at HD 6950 vs HD 6970. Look at HD 5850 vs HD 5870. See how HD 5850 is actually much more efficient specs-wise than its fully unlocked bigger brother. There are Geforce Tesla cards with half the shaders of the fully unlocked counterparts, and they still perform reasonably in line. GTX 465 is even more castrated than HD 5830 in some ways.

Plus, you are not taking into account the "amazing" efficiency of Barts architecture to begin with, given the specs on paper.

Nothing too serious, man. Let the mood be light and enlightened!

[meme image]


Oh yeah, you better smile!!!
 
Tongue-in-cheek is nice and all, when well positioned and not excessively employed. When it's constantly used to handwave this or that, and is coupled with 1000+ word posts and all sorts of meme-ist demotivational images inserted directly, it needs to stop. I reckon this level of discourse may be adequate elsewhere (as you so kindly pointed out upstream), but it's not adequate here. I would politely ask you to ponder a bit about this.
 
Tongue-in-cheek is nice and all, when well positioned and not excessively employed. When it's constantly used to handwave this or that, and is coupled with 1000+ word posts and all sorts of meme-ist demotivational images inserted directly, it needs to stop. I reckon this level of discourse may be adequate elsewhere (as you so kindly pointed out upstream), but it's not adequate here. I would politely ask you to ponder a bit about this.

Please keep the posts on-topic, or somebody please lock this thread that I started.

I do not need more whining (completely off-topic) posts from people who are so serious and depressing.

And I only used the memes in one post out of 40, and you're already so uptight about it. Why didn't you tell hkulata that "it needs to stop" instead of me? Posts like this, which have nothing constructive to offer to the discussion (and which obviously have malicious intent toward the OP), need to stop, get it?

This is the last time I'm replying to malicious, completely off-topic posts.

A published transistor number is really a marketing number - it has no practical use, and the reported number will make no difference to how an application works; it is immaterial to all intents and purposes (the difference in numbers just relates to what is actually counted and whether that is consistent with previously released numbers). Discussing and detailing how an architecture works, on the other hand, does have real implications for how an application performs; although the operation is abstracted away from developers, documentation and tools are provided so that developers can accurately tune their code to a given target, or at least understand application behaviour from one solution to another.

Your point is completely understood.

But (yes, there's a but...) misleading people with the "reported transistor number" is not quite ideal for those who have been discussing the transistor density of chips such as the HD 7970 vs. HD 7770 vs. HD 7870 back and forth in other threads. Not to be argumentative... just thought I should add that, that's all.
 
So basically rather than intelligent engineers explaining to you why you are wrong in detail, you want a pretty powerpoint with bar graphs and unicorns because the latter is far more believable.
 
Never mind calling the site owner a malicious, depressing, off-topic whiner after a polite point of moderation. Oh, my...
 
Ah, but you forget, supposedly Beyond3D is totally an AMD fan-site as evidenced by Bo's rantings on whatever that other forum was. Because obviously we never have any NVIDIA engineers on here, nor do we ever have VIA engineers, or ARM, or ... well, you get the point.

It's sad that so many people assume that Beyond3D is just some random sounding board full of fanboys, and forget that a significant chunk of this forum is the people building the hardware, and writing the code, and doing this level of work.
 
It's sad that so many people assume that Beyond3D is just some random sounding board full of fanboys, and forget that a significant chunk of this forum is the people building the hardware, and writing the code, and doing this level of work.

It's why I love this forum :D

I think people think this site is pro-AMD because the AMD ALU architecture has been more interesting to talk about, thus it gets talked about more; it's changed so much from the 5870 to the 7970.
 
Ah, but you forget, supposedly Beyond3D is totally an AMD fan-site as evidenced by Bo's rantings on whatever that other forum was. Because obviously we never have any NVIDIA engineers on here, nor do we ever have VIA engineers, or ARM, or ... well, you get the point.

It's sad that so many people assume that Beyond3D is just some random sounding board full of fanboys, and forget that a significant chunk of this forum is the people building the hardware, and writing the code, and doing this level of work.
In my experience with observing and participating in a number of forums, Beyond3D is about the least fanboy-ish forum out of the ones I've seen.
 
In my experience with observing and participating in a number of forums, Beyond3D is about the least fanboy-ish forum out of the ones I've seen.

I quite agree. It is also the most serious and mature forum, usually, for the reasons below.


Ah, but you forget, supposedly Beyond3D is totally an AMD fan-site as evidenced by Bo's rantings on whatever that other forum was. Because obviously we never have any NVIDIA engineers on here, nor do we ever have VIA engineers, or ARM, or ... well, you get the point.

It's sad that so many people assume that Beyond3D is just some random sounding board full of fanboys, and forget that a significant chunk of this forum is the people building the hardware, and writing the code, and doing this level of work.

I lurked here for a few years before even working up the courage to start posting, since much of the in-depth discussion here was way over my head.
Some of it still is, but I can generally work out a very basic understanding of what is being discussed.

Nah, the thing is that we now have a screenshot to look at. I simply asked for it, and now, we're able to SEE something. It's a good thing for non-engineers/programmers to be able to see what it even looks like. It's pretty cool, huh?
That's part of the problem. You act like you are on a lone crusade to bring knowledge and justice to the GPU world, but the problem is... that information is already out there. It had been discussed here before, and rather than take the time to research and find what you were looking for, you went conspiracy theory and created long, rambling posts "proving" your point without any real understanding of what you were discussing. Your attitude, after some were even trying to help you understand how baseless your argument was, turned into "you vs. B3D", and it didn't matter what was being posted: they were wrong and you were right.

Flamebaiting and ignorance are acceptable on some forums but you won't last long here with that posting style.
 
Then you should stop talking about things that only engineers and programmers understand.

Learn these things, and start posting again when you understand something about the things you are now posting about.

If everybody had to be an expert before discussing, then how would anybody learn?

@Bo: Calling people names at the drop of a hat, needlessly antagonizing the people who know more (a LOT more) and being shrill is not the best way to learn.
 
So basically rather than intelligent engineers explaining to you why you are wrong in detail, you want a pretty powerpoint with bar graphs and unicorns because the latter is far more believable.

Clearly, the man is upper management material.
 