Why Barts is really VLIW4, not VLIW5 (and more on HD 5830/6790 being mainly 128-bit)

Alexko · Feb 29, 2012

Gipsel said:
As said above, that's exactly what is missing.

I mean, if even well-known textbooks on the subject complain about the "quirky jargon" (yes, that is actually written in there!) used by nVidia, we don't have to think much further.

Yeah but with execution masks you can effectively execute different instructions. Of course, there's a performance penalty, but still.

Gipsel · Feb 29, 2012

Alexko said:
Yeah but with execution masks you can effectively execute different instructions.

No, you cannot in the general case, not even "effectively".
A divergent warp/wavefront/vector doesn't spawn a new one (at that point one may start to discuss the matter, but it would break the SI part of SIMT). At any given time, you have only a single instruction for all elements of the warp/wavefront/vector which gets executed. You can't synchronize some "threads" on one side of a branch to some "threads" on the other side of a branch (if they belong to the same warp/wavefront/thread) for this exact reason. They are simply not independent. They cannot be as long as they are executed just in SIMD fashion. And for this reason it makes no sense to make up a whole new terminology just to confuse people.
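To make the masking argument concrete, here is a minimal Python sketch of how a single toy "wavefront" might handle an if/else. The function name, the 4-lane width and the branch bodies are all made up for illustration; real hardware uses 32 or 64 lanes and a hardware mask register. The point is only that both sides of the branch are issued one after the other for the whole vector, with the mask deciding which lanes keep the result.

```python
def run_masked_branch(values):
    """Toy model of `if v % 2 == 0: v //= 2 else: v = 3 * v + 1` on one wavefront.

    Hypothetical illustration: the lanes are plain list elements and the
    execution mask is a list of booleans.
    """
    lanes = list(values)
    # The condition is evaluated once for all lanes and becomes the execution mask.
    mask = [v % 2 == 0 for v in lanes]
    issued = []  # one entry per instruction issued to the whole wavefront

    # "Then" side: a single instruction for all lanes; masked-off lanes keep their value.
    issued.append("then: v //= 2")
    lanes = [v // 2 if m else v for v, m in zip(lanes, mask)]

    # "Else" side: the mask is inverted and the other instruction is issued.
    issued.append("else: v = 3 * v + 1")
    lanes = [v if m else 3 * v + 1 for v, m in zip(lanes, mask)]

    return lanes, issued


result, issued = run_masked_branch([1, 2, 3, 4])
print(result)       # [4, 1, 10, 2]
print(len(issued))  # 2
```

Both instructions are issued even though each lane only needs one of them, and at no point do two lanes execute different instructions at the same time.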

iMacmatician · Feb 29, 2012

Well, Bo_Fox is continuing his crusade over at ABT forums by creating this post and this thread to reply to B3D posters (am I allowed to post this here since he's in a time-out?).

mczak · Feb 29, 2012

Oh, too bad. While totally pointless, it was such a fun thread.

Arty · Feb 29, 2012

I think the mods should swoop in and do a mercy kill on this thread and lock it away. While we're at it, increase Bo_Fox's vacation period as it seems he's not going to be civil when he comes back.

3dilettante · Feb 29, 2012

Alexko said:
Yeah but with execution masks you can effectively execute different instructions. Of course, there's a performance penalty, but still.

As currently implemented, the SIMT abstraction breaks down when facing synchronization operations and irreducible control flow.
With more straightforward code, multiple threads and masked SIMD units produce consistent behavior.
In more complex cases, SIMT implementations lock up or fail.

rpg.314 · Feb 29, 2012

Arty said:
I think the mods should swoop in and do a mercy kill on this thread and lock it away. While we're at it, increase Bo_Fox's vacation period as it seems he's not going to be civil when he comes back.

There are better ways to handle noise than locking up *potentially* promising threads.

Bo deserves a chance and some sound advice before harsher sanctions are applied.

CarstenS · Feb 29, 2012

3dilettante said:
The presentation said at least 10 wavefronts per SIMD.

10 are private to each 16-wide vector unit, for a total of up to 40 in flight per CU, fewer if register pressure is high.
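As a rough back-of-the-envelope sketch of that arithmetic: the limit of 10 per vector unit and the total of 40 per CU are from the post above, while the budget of 256 vector registers per lane and the simple integer division (ignoring allocation granularity, scalar registers and LDS limits) are assumptions made only for illustration.

```python
MAX_WAVES_PER_SIMD = 10  # from the post above
SIMDS_PER_CU = 4         # implied by 40 in flight per CU / 10 per vector unit
VGPRS_PER_LANE = 256     # assumption: per-lane vector register budget of one SIMD


def waves_per_cu(vgprs_per_work_item):
    """Estimate wavefronts in flight per CU for a given register usage."""
    # Registers are the only limit modelled here; real occupancy has more limits.
    by_registers = VGPRS_PER_LANE // vgprs_per_work_item
    return SIMDS_PER_CU * min(MAX_WAVES_PER_SIMD, by_registers)


print(waves_per_cu(24))   # 40
print(waves_per_cu(64))   # 16
print(waves_per_cu(128))  # 8
```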

3dilettante said:
I'm not sure how you are characterizing the scalar unit as a coprocessor sharing resources over four cores. There is a read-only scalar cache that is shared between the CUs, but this is not the only shared cache. The scalar unit is tied closely in each CU.

The cache is shared between four GCNs/CUs.

iMacmatician said:
Well, Bo_Fox is continuing his crusade over at ABT forums by creating this post and this thread to reply to B3D posters (am I allowed to post this here since he's in a time-out?).

Oh dear... and he's totally misinterpreting there what I've posted.

DarthShader · Feb 29, 2012

iMacmatician said:
continuing his crusade

He has one point though: his trolling was quite elaborate and insisted on numbers and math. So why not simply post some benchmark numbers, like Carsten tried, or something where the difference between VLIW4 and VLIW5 is shown, to stuff this guy's mouth with crow? Shaders with lots of transcendentals, like the Mineral and Fire shaders mentioned by Jawed here: http://forum.beyond3d.com/showthread.php?p=1422548&highlight=code#post1422548 would do the trick, as other calls to rationality apparently don't hit home. I got one:

Tried looking for more, but googling relevant keywords made this thread appear as the first result.

Rodéric · Feb 29, 2012

I think we should settle on 1 Core = 1 Instruction pointer, makes more sense to me too.

Gipsel · Feb 29, 2012

@DarthShader:
Just looking at the (disassembled) ISA code sent to the GPU for execution is definite proof and was mentioned in the first answers in the thread by OpenGL_guy. If that doesn't shut down Bo_Fox (looks like he didn't get the argument for some reason), I can't help him.

@Roderic:
Basically yes. But you have to think about the fact that multithreaded architectures often maintain several instruction pointers per core. So I would rather define a core as the smallest entity that is able to execute a thread independently (for the most part, so excluding I/O and such stuff). Btw., this is OT here!

Alexko · Feb 29, 2012

Gipsel said:
No, you cannot in the general case, not even "effectively".
A divergent warp/wavefront/vector doesn't spawn a new one (at that point one may start to discuss the matter, but it would break the SI part of SIMT). At any given time, you have only a single instruction for all elements of the warp/wavefront/vector which gets executed. You can't synchronize some "threads" on one side of a branch to some "threads" on the other side of a branch (if they belong to the same warp/wavefront/thread) for this exact reason. They are simply not independent. They cannot be as long as they are executed just in SIMD fashion. And for this reason it makes no sense to make up a whole new terminology just to confuse people.

Well sure, you can't execute different instructions simultaneously, but you can still have different threads within a warp executing different instructions, they just have to wait their turn. One might also argue that it doesn't matter and that the way "cores" are exposed to the software is what matters; that the SIMD execution is just a detail of the implementation.

But in any case, I don't agree with Michael Shebanow and I've already put more words in his mouth than I'm comfortable with, so perhaps I should leave it at that.

DarthShader said:
Tried looking for more, but googling relevant keywords made this thread appear as the first result.

Wow, looks like BoFox single-handedly killed the S/N ratio of the entire Internet!

Davros · Feb 29, 2012

Reading CarstenS's and similar posts, I'm getting lost. Is there any sort of online resource that explains what the following are?
Wiki is not helping me here.
Transcendentals (the properties of being, according to wiki)
scalars
vectors (thought a vector was a speed + direction)
etc.
Thanks...

Man from Atlantis · Feb 29, 2012

The 5830 is actually not that bad a performer. Most reviews were done with older drivers; it's just lazy reviewers who don't bench all cards with the same drivers.

Overall, the 5830 is definitely faster than the 6790 and mostly beats the 6850 as well, if there is no tessellation.

5830: 104.26%
6850: 100%
6790: 95.75%

CarstenS · Feb 29, 2012

Davros said:
Reading CarstenS's and similar posts, I'm getting lost. Is there any sort of online resource that explains what the following are?
Wiki is not helping me here.
Transcendentals (the properties of being, according to wiki)
scalars
vectors (thought a vector was a speed + direction)
etc.
Thanks...

Transcendentals in "our" sense are special functions that are usually carried out via macros over multiple cycles. Examples are sine/cosine, exponentials, reciprocals and stuff like that.

Vectors are one-dimensional arrays of data and, by extension, processors specializing in those kinds of workloads. A scalar is a vector with a single lane (tm) (and in German a kind of fish). But wait for Gipsel & Co., they can probably give a much better and more accurate definition.

fellix · Feb 29, 2012

Vector types are simply packed data structures with explicit ordering (RGBA != ABGR) and length.
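A tiny sketch of that idea in Python, using the standard `struct` module to pack four 8-bit channels. The helper names and the sample colour are made up for illustration.

```python
import struct


def pack_rgba(r, g, b, a):
    """Pack four 8-bit channels in R, G, B, A order (hypothetical helper)."""
    return struct.pack("4B", r, g, b, a)


def pack_abgr(r, g, b, a):
    """Pack the same channels in A, B, G, R order (hypothetical helper)."""
    return struct.pack("4B", a, b, g, r)


rgba = pack_rgba(255, 128, 0, 64)
abgr = pack_abgr(255, 128, 0, 64)

# Same values, same length, different ordering: not the same vector.
print(len(rgba), len(abgr))  # 4 4
print(rgba == abgr)          # False
```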

Gipsel · Feb 29, 2012

Alexko said:
Well sure, you can't execute different instructions simultaneously, but you can still have different threads within a warp executing different instructions, they just have to wait their turn. One might also argue that it doesn't matter and that the way "cores" are exposed to the software is what matters; that the SIMD execution is just a detail of the implementation.

Well, it does matter. It is not just a (transparent) detail of the physical implementation of an ISA. The actual ISA of current GPUs is SIMD based at its core. With the individual elements of a Warp/Wavefront (or work elements in OpenCL slang, see the similarity of the wording to vector elements!) you simply can't do everything which you are used to from real threads. A whole class of control structures simply doesn't work (irreducible control flow). That is a fundamental difference, not an implementation detail. The SIMD nature of the underlying processor is not transparent. A Warp/Wavefront is a thread for the hardware, not a single element of it.

What is true is that the higher-level GPU languages force you to express the problem in an implicitly parallel way (if you don't, and want to extract something from general-purpose C code, it basically degenerates to autovectorization by the compiler). But confusing this with each element of a warp/wavefront being independent (it isn't) is quite a bad misconception, and it is why quite a few beginners have trouble understanding the performance pitfalls, for instance.

MDolenc · Feb 29, 2012

Davros said:
Wiki is not helping me here.
Transcendentals (the properties of being, according to wiki)

Maybe you're looking at a wrong entry? This one should answer your questions in detail: Transcendental function
Basically, addition, subtraction and multiplication are examples of algebraic operations. They are also implemented directly in computer hardware. For example, that's about all those 512 SPs in the GTX 580 can do. These functions are generally fast and low latency in hardware.

Transcendental functions are those that can't be expressed with a polynomial, i.e. they can't be EXACTLY represented as a finite series of adds, subtractions and multiplications. But we can approximate them to some degree of precision on some interval with a polynomial. Some general approaches to do this are Taylor series or Fourier transform. To compute these, hardware has to approximate them with a series of algebraic operations. They are computed in the 64 SFUs in the GTX 580. Transcendental functions are not necessarily macros (as CarstenS said), as they can still run at one instruction per clock throughput (on much fewer units), but latency is much higher.
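As a rough illustration of the polynomial idea, here is a sketch that approximates sin(x) with a truncated Taylor series evaluated by multiply-adds only. This is not how any particular SFU works (real hardware typically uses table lookups plus low-order interpolation); the number of terms and the function name are arbitrary choices for the example.

```python
import math

# Coefficients of x, x^3, x^5, ... in the Taylor series of sin(x) around 0.
# They are constants, so they would be precomputed, not evaluated at run time.
COEFFS = [(-1) ** k / math.factorial(2 * k + 1) for k in range(6)]


def sin_poly(x):
    """Approximate sin(x) using only multiplications and additions."""
    x2 = x * x
    acc = 0.0
    for c in reversed(COEFFS):
        acc = acc * x2 + c  # Horner step: one multiply-add per coefficient
    return acc * x


for x in (0.5, 1.0, math.pi / 2):
    print(x, sin_poly(x), math.sin(x))
```

On a small interval around zero the result is very close to `math.sin`, but the error grows quickly outside of it, which is why the precision is only guaranteed "on some interval".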

A macro would IMO be div, which for example is not present in the Tesla/Fermi ISA, but gets replaced by an rcp/rsq combo by the compiler.
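A sketch of what such a macro could look like: a division expanded into a reciprocal followed by a multiply. The `rcp` below is only a software stand-in for a native reciprocal instruction, emulated with Newton-Raphson iterations; the initial guess, the iteration count and the restriction to positive inputs are assumptions for illustration and say nothing about what an actual compiler emits.

```python
import math


def rcp(b, iterations=4):
    """Approximate 1/b for b > 0 (stand-in for a hardware rcp instruction)."""
    m, e = math.frexp(b)                 # b = m * 2**e with 0.5 <= m < 1
    r = 48.0 / 17.0 - (32.0 / 17.0) * m  # linear first guess for 1/m
    for _ in range(iterations):
        r = r * (2.0 - m * r)            # Newton-Raphson step, only mul and sub
    return math.ldexp(r, -e)             # undo the scaling: 1/b = (1/m) * 2**-e


def div(a, b):
    """The 'macro': a / b expanded into a multiply by the reciprocal."""
    return a * rcp(b)


print(div(10.0, 4.0))
print(div(1.0, 3.0))
```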

Gipsel · Feb 29, 2012

Davros said:
Reading CarstenS's and similar posts, I'm getting lost. Is there any sort of online resource that explains what the following are?
Wiki is not helping me here.
Transcendentals (the properties of being, according to wiki)
scalars
vectors (thought a vector was a speed + direction)

Transcendentals:
As Carsten said already, basically logarithms, exponentials, trigonometric functions and in this context also square roots and divisions/reciprocals, even though those are technically (I should say mathematically) not transcendental functions. In general, transcendental functions cannot be expressed by an algebraic equation (a reciprocal can quite easily, for instance: f(x)=1/x). But as said, in this context one often subsumes everything which is more "complicated" than the basic operations of addition, multiplication, multiply-adds, bit manipulations and such stuff.

scalar:
A quantity which can be represented by a single value (number).

vector:
A quantity, which is represented by a list of values (numbers).
Historically, vector means "carrier" (carrying something from one point to another; in biology it still has this meaning). In geometry, it gives a direction and a distance in some space with some number of dimensions (independent axes in space). In practice, this can be expressed as a list of numbers, one number for each dimension (the distance along the corresponding axis). In this sense, its meaning got generalized to name either something which points somewhere (can even be a scalar value) or basically just something which is represented by a list of values (like a column of a table).
As you mentioned speed: velocity is a vector quantity in physics, as it is given by its absolute value (the magnitude, i.e. the speed) and a set of angles in space, or alternatively by the components of the velocity along each axis of the space. That means the velocity is only completely given if you use a list of values, a.k.a. a vector.
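A small 2D sketch of those two equivalent descriptions, converting between "magnitude plus angle" and "components along each axis". The function names and the sample numbers are made up for illustration.

```python
import math


def to_components(speed, angle_rad):
    """Magnitude and direction -> components along the x and y axes."""
    return (speed * math.cos(angle_rad), speed * math.sin(angle_rad))


def to_polar(vx, vy):
    """Components along the x and y axes -> magnitude and direction."""
    return (math.hypot(vx, vy), math.atan2(vy, vx))


speed, angle = to_polar(3.0, 4.0)
print(speed)                        # 5.0
print(to_components(speed, angle))  # approximately (3.0, 4.0)
```

Either way it takes two numbers to describe the velocity in two dimensions; the speed alone (a scalar) is not enough.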

homerdog · Feb 29, 2012

iMacmatician said:
Well, Bo_Fox is continuing his crusade over at ABT forums by creating this post and this thread to reply to B3D posters (am I allowed to post this here since he's in a time-out?).

Lol! "Comments vs the Beyond3D Wimps"
