Why Barts is really VLIW4, not VLIW5 (and more on HD 5830/6790 being mainly 128-bit)

Discussion in 'Architecture and Products' started by Bo_Fox, Feb 28, 2012.

  1. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,541
    Likes Received:
    964
    Yeah but with execution masks you can effectively execute different instructions. Of course, there's a performance penalty, but still.
     
  2. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    No, you cannot in the general case, not even "effectively".
    A divergent warp/wavefront/vector doesn't spawn a new one (at that point one could start to discuss the matter, but it would break the SI part of SIMT). At any given time, only a single instruction is executed for all elements of the warp/wavefront/vector. For this exact reason you can't synchronize some "threads" on one side of a branch with some "threads" on the other side of the branch (if they belong to the same warp/wavefront/vector). They are simply not independent, and they cannot be as long as they are executed in SIMD fashion. That is why it makes no sense to make up a whole new terminology just to confuse people.
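
    To make the point about a single instruction per wavefront concrete, here is a minimal sketch in Python. It is a toy model, not real GPU ISA: the 8-lane width, the function name `run_divergent_branch` and the two branch bodies are all assumptions made up for illustration. Both sides of the branch are issued one after the other for the whole wavefront, and the execution mask only decides which lanes keep the result.

```python
# Toy model of one wavefront running an if/else under an execution mask.
# WIDTH and the two branch bodies are illustrative assumptions, not real ISA.
WIDTH = 8

def run_divergent_branch(values):
    """Execute `if v is even: v *= 2 else: v += 100` on every lane."""
    cond = [v % 2 == 0 for v in values]
    result = list(values)
    issued = []  # one entry per instruction issued for the WHOLE wavefront

    # Pass 1: the "then" side. Only lanes whose mask bit is set commit a result.
    mask = cond
    if any(mask):
        issued.append("mul")
        result = [r * 2 if m else r for r, m in zip(result, mask)]

    # Pass 2: the "else" side, run afterwards with the inverted mask.
    mask = [not c for c in cond]
    if any(mask):
        issued.append("add")
        result = [r + 100 if m else r for r, m in zip(result, mask)]

    return result, issued

# Lanes disagree on the condition: both sides are issued, one after the other.
divergent, issued_divergent = run_divergent_branch([0, 1, 2, 3, 4, 5, 6, 7])

# All lanes agree: only one side is issued.
uniform, issued_uniform = run_divergent_branch([0, 2, 4, 6, 8, 10, 12, 14])
```

    In the divergent case the wavefront pays for both sides of the branch; in the uniform case it pays for one. At no point do two lanes execute different instructions at the same time.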
     
  3. iMacmatician

    Regular

    Joined:
    Jul 24, 2010
    Messages:
    797
    Likes Received:
    223
    Well, Bo_Fox is continuing his crusade over at ABT forums by creating this post and this thread to reply to B3D posters (am I allowed to post this here since he's in a time-out?).
     
    #63 iMacmatician, Feb 29, 2012
    Last edited by a moderator: Feb 29, 2012
  4. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,022
    Likes Received:
    122
    Oh too bad. While totally pointless, was such a fun thread :).
     
  5. Arty

    Arty KEPLER
    Veteran

    Joined:
    Jun 16, 2005
    Messages:
    1,906
    Likes Received:
    55
    I think the mods should swoop in and do a mercy kill on this thread and lock it away. While we're at it, increase Bo_Fox's vacation period as it seems he's not going to be civil when he comes back.
     
  6. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    As currently implemented when facing synchronization operations and irreducible control flow, the SIMT abstraction breaks down.
    With more straightforward code, multiple threads and masked SIMD units produce consistent behavior.
    In more complex cases, SIMT implementations lock up or fail.
     
  7. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    There are better ways to handle noise than locking up *potentially* promising threads.

    Bo deserves a chance and some sound advice before harsher sanctions are applied.
     
  8. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    10 are private to each 16-wide vector unit, for a total of up to 40 in flight per CU, fewer if register pressure is high.
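
    The arithmetic behind "up to 40, fewer under register pressure" can be sketched as follows. The figures of 4 vector units per CU and 256 vector registers per lane on each unit are assumptions added here, not stated in the post, and real hardware allocates registers in coarser steps than this model does.

```python
# Assumptions (not from the post): 4 SIMDs per CU, 256 vector registers
# available per lane on each SIMD, and a hard cap of 10 wavefronts per SIMD.
SIMDS_PER_CU = 4
MAX_WAVES_PER_SIMD = 10
VGPRS_PER_SIMD_LANE = 256

def waves_in_flight_per_cu(vgprs_per_wave):
    """Wavefronts a CU can keep in flight for a kernel using this many VGPRs."""
    per_simd = min(MAX_WAVES_PER_SIMD, VGPRS_PER_SIMD_LANE // vgprs_per_wave)
    return per_simd * SIMDS_PER_CU
```

    Under these assumptions a kernel using 24 registers still reaches the full 40, while one using 64 registers drops to 16.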

    The cache is shared between four GCNs/CUs.


    Oh dear... and he's totally misinterpreting there what I've posted. :(
     
    #68 CarstenS, Feb 29, 2012
    Last edited by a moderator: Feb 29, 2012
  9. DarthShader

    Regular

    Joined:
    Jul 18, 2010
    Messages:
    350
    Likes Received:
    0
    Location:
    Land of Mu
    He has one point though: his trolling was quite elaborate and insisted on numbers and math. So why not simply post some benchmark numbers, like Carsten tried, or numbers where the difference between VLIW4 and VLIW5 is shown, to stuff this guy's mouth with crow? Shaders with lots of transcendentals, like the Mineral and Fire shaders mentioned by Jawed here: http://forum.beyond3d.com/showthread.php?p=1422548&highlight=code#post1422548 would do the trick, as other calls to rationality apparently don't hit home. I got one:

    [​IMG]

    Tried looking for more, but googling relevant keywords made this thread appear as first results. :lol:
     
  10. Rodéric

    Rodéric a.k.a. Ingenu
    Moderator Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,080
    Likes Received:
    997
    Location:
    Planet Earth.
    I think we should settle with 1 Core = 1 Instruction pointer, makes more sense to me too.
     
  11. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    @DarthShader:
    Just looking at the (disassembled) ISA code sent to the GPU for execution is definite proof, and it was mentioned in the first answers in the thread by OpenGL_guy. If that doesn't shut down Bo_Fox (looks like he didn't get the argument for some reason :roll:), I can't help him.

    @Roderic:
    Basically yes. But you have to think about the fact that multithreaded architectures often maintain several instruction pointers per core. So I would define a core more like the smallest entity, which is able to execute a thread independently (for the major part, so excluding IO and such stuff). Btw., this is OT here! ;)
     
    #71 Gipsel, Feb 29, 2012
    Last edited by a moderator: Feb 29, 2012
  12. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,541
    Likes Received:
    964
    Well sure, you can't execute different instructions simultaneously, but you can still have different threads within a warp executing different instructions, they just have to wait their turn. One might also argue that it doesn't matter and that the way "cores" are exposed to the software is what matters; that the SIMD execution is just a detail of the implementation.

    But in any case, I don't agree with Michael Shebanow and I've already put more words in his mouth than I'm comfortable with, so perhaps I should leave it at that. :razz:

    Wow, looks like BoFox single-handedly killed the S/N ratio of the entire Internet! :D
     
    #72 Alexko, Feb 29, 2012
    Last edited by a moderator: Feb 29, 2012
  13. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    17,879
    Likes Received:
    5,330
    Reading CarstenS's and similar posts, I'm getting lost. Is there any sort of online resource that explains what the following are?
    Wiki is not helping me here.
    Transcendentals (the properties of being, according to wiki ;) )
    scalars
    vectors (thought a vector was a speed + direction)
    etc.
    tnx...
     
  14. Man from Atlantis

    Regular

    Joined:
    Jul 31, 2010
    Messages:
    960
    Likes Received:
    853
    The 5830 is actually not that bad a performer. Most reviews were done with older drivers; it's just lazy reviewers who don't bench all cards with the same drivers.
    Overall, the 5830 is definitely faster than the 6790 and mostly beats the 6850 as well, if there is no tessellation.

    5830: 104.26%
    6850: 100%
    6790: 95.75%
     
  15. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    Transcendentals in "our" sense are special functions that are usually carried out via macros over multiple cycles. Examples are sine/cosine, exponent, reciprocal and stuff like that.

    Vectors are one-dimensional arrays of data and, by extension, processors specializing in those kinds of workloads. A scalar is a vector with a single lane (tm) (and in German a kind of fish ;)). But wait for Gipsel & Co., they can probably give a much better and more accurate definition.
     
  16. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,552
    Likes Received:
    514
    Location:
    Varna, Bulgaria
    Vector types are simply packed data structures with explicit ordering (RGBA != ABGR) and length.
     
  17. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    Well, it does matter. It is not just a (transparent) detail of the physical implementation of an ISA. The actual ISA of current GPUs is SIMD based at its core. With the individual elements of a warp/wavefront (or work elements in OpenCL slang, see the similarity of the wording to vector elements!) you simply can't do everything you are used to from real threads. A whole class of control structures simply doesn't work (irreducible control flow). That is a fundamental difference, not an implementation detail. The SIMD nature of the underlying processor is not transparent. For the hardware, a thread is a whole warp/wavefront, not a single element of it.

    What is true is that the higher-level GPU languages force you to express the problem in an implicitly parallel way (if you don't, and want to extract something from general purpose C code, it basically degenerates to autovectorization by the compiler). But confusing this with each element of a warp/wavefront being independent (it isn't) is quite a bad misconception, and it is the reason quite a few beginners have trouble understanding the performance pitfalls, for instance.
     
  18. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    696
    Likes Received:
    446
    Location:
    Slovenia
    Maybe you're looking at the wrong entry? This one should answer your questions in detail: Transcendental function
    Basically, addition, subtraction and multiplication are examples of algebraic operations. They are also implemented directly in computer hardware. For example, that's about all those 512 SPs in GTX 580 can do. These operations are generally fast and low latency in hardware.

    Transcendental functions are those that can't be expressed as a polynomial, i.e. they can't be EXACTLY represented as a finite series of adds, subtractions and multiplications. But we can approximate them to some degree of precision on some interval with a polynomial. Some general approaches to do this are Taylor series or Fourier transform. To compute these, hardware has to approximate them with a series of algebraic operations. They are computed in the 64 SFUs in GTX 580. Transcendental functions are not necessarily macros (as CarstenS said), as they can still run at one instruction per clock throughput (on much fewer units), but latency is much higher.
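
    As a rough illustration of approximating a transcendental with nothing but multiplies and adds, here is a minimal Python sketch of a truncated Taylor polynomial for sine. The name `sin_poly` and the choice of six terms are assumptions for this example; it is not a description of how an SFU actually evaluates sine.

```python
import math

# Precomputed Taylor coefficients of sin(x)/x in powers of x^2:
# 1, -1/3!, 1/5!, -1/7!, 1/9!, -1/11!
COEFFS = [(-1) ** k / math.factorial(2 * k + 1) for k in range(6)]

def sin_poly(x):
    """Approximate sin(x) near 0 using only multiplies and adds."""
    x2 = x * x
    acc = 0.0
    for c in reversed(COEFFS):
        acc = acc * x2 + c  # Horner's scheme: one multiply-add per coefficient
    return acc * x

# Error against the library sine at two sample points.
err_small = abs(sin_poly(0.5) - math.sin(0.5))
err_large = abs(sin_poly(math.pi / 2) - math.sin(math.pi / 2))
```

    The error grows as x moves away from 0, which is why such a polynomial is only good "on some interval" and the argument has to be reduced to that interval first.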

    A macro would IMO be div, which for example is not present in the Tesla/Fermi ISA, but gets replaced by an rcp/rsq combo by the compiler.
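
    The idea of div as a macro can be sketched like this in Python: a division expanded into a reciprocal followed by a multiply. The `rcp` below is a software stand-in (a linear first guess refined by Newton-Raphson steps) and is purely an assumption for illustration; it is not the actual instruction sequence a Tesla/Fermi compiler emits.

```python
import math

def rcp(b, iterations=4):
    """Approximate 1/b for b > 0 with a first guess plus Newton-Raphson steps."""
    m, e = math.frexp(b)                 # b = m * 2**e with 0.5 <= m < 1
    r = 48.0 / 17.0 - 32.0 / 17.0 * m    # linear first guess for 1/m
    for _ in range(iterations):
        r = r * (2.0 - m * r)            # each step roughly squares the error
    return math.ldexp(r, -e)             # undo the scaling: 1/b = (1/m) * 2**-e

def div(a, b):
    """The 'macro': a / b expanded into a multiply by the reciprocal."""
    return a * rcp(b)
```

    The result can differ from a correctly rounded division in the last bits, which is one reason a compiler may add correction steps on top of such a sequence.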
     
  19. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    Transcendentals:
    As Carsten said already, basically logarithms, exponentials, trigonometric functions and, in this context, also square roots and divisions/reciprocals, even though those are technically (I should say mathematically) not transcendental functions. In general, transcendental functions cannot be expressed by an algebraic equation (a reciprocal quite easily can, for instance: f(x)=1/x). But as said, in this context one often subsumes everything which is more "complicated" than the basic operations of addition, multiplication, multiply-adds, bit manipulations and such stuff.

    scalar:
    A quantity which can be represented by a single value (number).

    vector:
    A quantity, which is represented by a list of values (numbers).
    Historically, vector means "carrier" (carrying something from one point to another; in biology it still has this meaning). In geometry, it gives a direction and a distance in some space with some number of dimensions (independent axes in space). In practice, this can be expressed as a list of numbers, one number for each dimension (the distance along the corresponding axis). In this sense, its meaning got generalized to name either something which points somewhere (can even be a scalar value ;)) or basically just something which is represented by a list of values (like a column of a table).
    As you mentioned speed: velocity is a vector quantity in physics, as it is given by the absolute value (magnitude, i.e. the speed) and a set of angles in space, or alternatively by the components of the velocity along each axis of the space. That means the velocity is only completely given if you use a list of values, a.k.a. a vector.
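
    A small Python sketch of the two equivalent descriptions, with made-up numbers: the velocity as a list of components, and the same velocity as a magnitude (the scalar speed) plus a direction.

```python
import math

velocity = (3.0, 4.0, 0.0)  # made-up components along the x, y and z axes

# The scalar part: magnitude of the vector, i.e. the speed.
speed = math.sqrt(sum(c * c for c in velocity))

# The direction: the same vector scaled to length 1.
direction = tuple(c / speed for c in velocity)

# Magnitude and direction together give back the original components.
rebuilt = tuple(speed * d for d in direction)
```

    The speed alone is a single number; it takes the whole list to say where the thing is actually going.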
     
  20. homerdog

    homerdog donator of the year
    Legend Subscriber

    Joined:
    Jul 25, 2008
    Messages:
    6,294
    Likes Received:
    1,075
    Location:
    still camping with a mauler
    Lol! "Comments vs the Beyond3D Wimps" :grin:
     