What gives ATI's graphics architecture its advantages?

Discussion in 'Architecture and Products' started by Kaotik, Apr 5, 2010.

  1. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,018
    Likes Received:
    582
    Location:
    Taiwan
    PTX (the pseudo-assembly language for CUDA) is scalar. NVIDIA does not expose the SIMD nature of its GPUs in PTX; the only hint of it is the warp vote functions.
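    The warp vote functions mentioned here can be illustrated with a small sketch. This is Python standing in for CUDA's `__any()`/`__all()` intrinsics — the helper names and the simulation are illustrative, not NVIDIA's implementation; only the behaviour (a uniform result across all lanes of a warp) is the point:

    ```python
    # Sketch: warp vote functions (__any / __all in CUDA) are the one place
    # the otherwise scalar PTX model reveals that 32 threads share a warp.
    WARP_SIZE = 32

    def warp_any(predicates):
        """True for every lane if ANY lane's predicate is true."""
        return [any(predicates)] * WARP_SIZE

    def warp_all(predicates):
        """True for every lane only if ALL lanes' predicates are true."""
        return [all(predicates)] * WARP_SIZE

    # Each lane computes a scalar predicate independently...
    lane_ids = list(range(WARP_SIZE))
    preds = [tid % 7 == 0 for tid in lane_ids]
    # ...but the vote result is uniform across the warp, betraying the SIMD width.
    print(warp_any(preds)[0], warp_all(preds)[0])
    ```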
     
  2. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    I knew about the abysmal branching perf, but why was VTF so bad, especially compared to the competing solutions of that time?
     
  3. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    Fitting scalar kernel threads into VLIW is not a vectorization problem, it's a scheduling problem.
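    The scheduling problem MfA describes can be sketched as greedy packing of independent scalar ops into bundles. This is a toy over a hypothetical IR (each op is a name plus the set of ops it depends on); the 5-slot width is the R600-era example, and real compilers use far more sophisticated list scheduling:

    ```python
    # Sketch: greedy packing of scalar ops into VLIW bundles. An op can only
    # join a bundle if all of its dependencies have already issued in an
    # earlier bundle -- dependent ops can never share a bundle here.
    WIDTH = 5  # e.g. one VLIW5 slot as in R600-class hardware

    def pack_bundles(ops):
        """ops: list of (name, deps) where deps is a set of earlier op names."""
        bundles, issued = [], set()
        pending = list(ops)
        while pending:
            bundle, rest = [], []
            for name, deps in pending:
                if deps <= issued and len(bundle) < WIDTH:
                    bundle.append(name)
                else:
                    rest.append((name, deps))
            assert bundle, "cyclic or unsatisfiable dependency"
            issued |= set(bundle)
            bundles.append(bundle)
            pending = rest
        return bundles

    # Four independent adds pack into one bundle; a dependency chain cannot.
    independent = [("a", set()), ("b", set()), ("c", set()), ("d", set())]
    chain = [("a", set()), ("b", {"a"}), ("c", {"b"})]
    print(pack_bundles(independent))  # [['a', 'b', 'c', 'd']]
    print(pack_bundles(chain))        # [['a'], ['b'], ['c']]
    ```

    The point of the sketch: the compiler never has to *vectorize* anything — it only reorders already-scalar operations into slots, which is a much better understood problem.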
     
  4. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,022
    Likes Received:
    122
    Compared to the competition, it rocked :).
    It had a lot of limitations (no filtering, for instance, and not many texture formats). It looks like using "normal" TMUs wasn't feasible, hence NVIDIA ended up implementing it with a minimalistic vertex TMU, whereas ATI just said "screw it", apparently thinking it wasn't worth investing transistors in a bad solution.
     
  5. Novum

    Regular

    Joined:
    Jun 28, 2006
    Messages:
    335
    Likes Received:
    8
    Location:
    Germany
    There are no SIMD instructions in the NVIDIA architectures. Multiple threads in a warp are executed at once, but that doesn't change the fact that the threads themselves are executed in a strictly scalar manner.

    Neither the CUDA to PTX compiler nor the PTX to GPU bytecode compiler needs to do any vectorization.
     
  6. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    Nor is it needed for ATI's OpenCL to IL compiler, for instance. The difference is that it *can* be done (you can use r#.xyzw, or just r#, to apply an instruction to all components; it also works with just 2 or 3 components), but this is mostly something to make the code more readable or shorter. It doesn't make a difference whether you pass

    add r100.x, some_register
    add r101.x, another_register
    add r102.x, yet_another_value
    or
    add r100.xyz, r25.wzx

    to the IL compiler. Both are handled as three independent additions, which may or may not end up in the same VLIW bundle. The slot the hardware uses for an instruction is also not determined by the IL code (that is only the case for double precision adds, but it will likely be fixed in the future). The Brook+ and OpenCL to IL compilers don't do any vectorization or other fancy stuff. They simply translate the code to IL in the simplest way possible (it looks terrible most of the time). The optimization is done by the IL compiler, but as MfA already said, this optimization is about scheduling, finding ILP and packing the VLIW bundles, not about vectorization.
    If a programmer chooses instructions using 2, 3 or 4 components, that only means he structures the code to be shorter and more readable, and to have more parallelism readily exposed; it doesn't magically lead to higher utilization compared to the same algorithm written with more scalar instructions.
     
    #86 Gipsel, Apr 9, 2010
    Last edited by a moderator: Apr 9, 2010
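    Gipsel's equivalence above can be sketched mechanically: expanding the swizzled vector form component by component yields exactly the scalar form. This is a toy expansion over made-up register strings, not AMD's actual IL compiler:

    ```python
    # Sketch: vector-form IL is just shorthand. A swizzled write like
    # "add r100.xyz, r25.wzx" decomposes into the same independent
    # scalar additions as three single-component instructions would.
    def expand(dst_reg, dst_mask, src_reg, src_swizzle):
        """Expand a vector add into per-component (dst, src) scalar adds."""
        return [(f"{dst_reg}.{d}", f"{src_reg}.{s}")
                for d, s in zip(dst_mask, src_swizzle)]

    vector_form = expand("r100", "xyz", "r25", "wzx")
    scalar_form = [("r100.x", "r25.w"), ("r100.y", "r25.z"), ("r100.z", "r25.x")]
    # Both forms hand the same three independent adds to the VLIW scheduler.
    print(vector_form == scalar_form)  # True
    ```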
  7. Novum

    Regular

    Joined:
    Jun 28, 2006
    Messages:
    335
    Likes Received:
    8
    Location:
    Germany
    My point was that a VLIW bundle is much easier for a compiler to fill than code is to vectorize (as you said). It doesn't matter at all where this happens in the case of the ATI driver, but it needs to happen.

    So I think we have nothing to argue about :)
     
  8. dkanter

    Regular

    Joined:
    Jan 19, 2008
    Messages:
    360
    Likes Received:
    20
    VLIW vs. SIMD is irrelevant.

    AFAIK, ATI has a SIMD array of VLIWs. That's right, it's like SIMD, except each 'lane' of the SIMD is a VLIW.

    This is the reverse of what is usually done in CPUs (i.e. you might have a VLIW where one of the functional units is a SIMD unit).

    I doubt that you can pack dependent operations in the same bundle. Perhaps the compiler can do something interesting to enable scheduling tricks...but by definition if two operations form a dependency chain...one must execute before the other.

    Anyway, the real point is that you actually need to have multiple instructions that can be packed in the same bundle for the VLIW or your utilization suffers. If you only have one independent instruction in each bundle...you are limited to 25% of peak performance.

    However, to fill a bundle you need to do all the normal scheduling wizardry required for a VLIW. That's not too bad for graphics, but it can be difficult for other workloads. NV does not require that you find any ILP; you just need to find a lot of data. That is by definition easier than finding both a lot of data and ILP in the instruction stream that operates on said data.

    David
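    The utilization arithmetic in the post above is simple enough to write down. The width is left as a parameter since it changes the figure — the 25% number corresponds to a 4-wide bundle, while R600-era VLIW5 would give 20% in the same single-instruction case:

    ```python
    # Sketch: with a w-wide VLIW, issuing n independent instructions per
    # bundle yields n/w of peak ALU throughput.
    def vliw_utilization(instrs_per_bundle, width):
        return instrs_per_bundle / width

    # One instruction per 4-wide bundle -> 25% of peak.
    print(vliw_utilization(1, 4))  # 0.25
    # Half-full bundles already give 50% -- relevant to Gipsel's later point
    # about matching a narrower competitor's theoretical peak.
    print(vliw_utilization(2, 4))  # 0.5
    ```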
     
  9. EduardoS

    Newcomer

    Joined:
    Nov 8, 2008
    Messages:
    131
    Likes Received:
    0
    Is it so hard to read the manual?

    So I have a tip for you: do 2 or 4 work-items per lane and the bundles will be full. It's not black magic; it's easy enough for the compiler to do it alone, but since it still doesn't, we programmers still have to do it by hand in "low ILP" code...
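    The tip can be sketched as follows — in Python rather than OpenCL C, and with a made-up toy arithmetic chain. The transformation itself is the one described: manually processing two work-items per lane so that two independent dependency chains are exposed to the bundle packer:

    ```python
    # Sketch: if one work-item's code is a pure dependency chain (no ILP),
    # processing two items per lane hands the scheduler two independent
    # chains whose steps can pair up inside VLIW bundles.
    def kernel_one_item(x):
        # a serial chain: each step depends on the previous one
        x = x * 2
        x = x + 1
        x = x * x
        return x

    def kernel_two_items(x0, x1):
        # same chain, manually doubled up: the x0 and x1 steps are
        # independent, so 2 of 4 slots fill per bundle instead of 1 of 4
        x0, x1 = x0 * 2, x1 * 2
        x0, x1 = x0 + 1, x1 + 1
        x0, x1 = x0 * x0, x1 * x1
        return x0, x1

    # Results are identical; only the exposed parallelism changes.
    print(kernel_two_items(3, 5) == (kernel_one_item(3), kernel_one_item(5)))
    ```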
     
  10. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,976
    Likes Received:
    5,213
    Thanks for the detailed answer. I remember reading about the problems of long data buses on motherboards; I read that one of the reasons FSBs and HT links have limited frequency is that they tend to turn into an EM antenna if the clocks reach a certain limit: the signal is wasted and could potentially interfere with other connections.

    So I understand that this is a common practice even in the CPU industry? But then again, some CPUs like the Core i7 920 are overclockable from 2.66GHz to 4.0GHz, which is a very large clock jump.

    I guess this has something to do with yields and quality control rather than a specific feature of the transistors themselves. I mean, Intel must have branded these CPUs as mainstream products and clocked them modestly to address certain market slices, even though they are capable of much higher clock speeds; perhaps they even failed some stiff quality control tests that would never manifest in real-world scenarios.

    Excellent clarification, thanks.
     
  11. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    Actually, you can do it on the Evergreen series for some of the most used instructions (add, mul, madd, and some other stuff like sad).
    Partly true, but normally you have some ILP to extract. And as said, a 50% utilization rate is already enough to reach the theoretical peak performance of the competition. It's often not a big deal.
    If a problem is suited to GPUs (the implicit vectorized execution in the SIMD units of GPUs), you can almost by definition process several work items in a single VLIW SIMD slot, as already said by EduardoS. It is often very easy to do and just increases your effective vector length, i.e. it is the equivalent of finding a lot of data.
    It requires that the programmer is aware of the issue, true. But even when not doing it, you will be hard pressed to find a problem where the disadvantage is larger (if it exists at all; for medium and high ILP workloads ATI's shader performance is higher either way) than the potential advantage ATI has when the programmer pays attention.
     
  12. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,018
    Likes Received:
    582
    Location:
    Taiwan
    It's still not scalar, it just looks scalar. To put it simply, every thread in a warp has to be running the same instruction. That's the definition of SIMD. Vectorization is easy because its SIMD supports gather, scatter, and proper masked execution (just like LRBni). That's also why you can't have irreducible control flow in CUDA or OpenCL on NVIDIA's hardware (or on ATI's, for the same reason).
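    Masked execution as pcchen describes it can be sketched in miniature — Python lists standing in for SIMD lanes, and `simd_select` as a made-up helper, not any real intrinsic:

    ```python
    # Sketch: how SIMD hardware runs a scalar-looking if/else. All lanes
    # execute both sides; a per-lane mask decides whose results are kept.
    def simd_select(cond, then_vals, else_vals):
        """Per-lane select: keep then_vals where cond is true, else else_vals."""
        return [t if c else e for c, t, e in zip(cond, then_vals, else_vals)]

    xs = [1, -2, 3, -4]
    cond = [x > 0 for x in xs]          # per-lane predicate
    then_side = [x * 10 for x in xs]    # executed by ALL lanes
    else_side = [-x for x in xs]        # also executed by ALL lanes
    print(simd_select(cond, then_side, else_side))  # [10, 2, 30, 4]
    ```

    Each thread's source stays scalar, yet every lane pays for both branch sides — which is why the warp-wide lockstep is not really scalar execution.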
     
  13. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Compared to the competition at launch, NVidia's dynamic branching was great as well. :wink: Still pretty much useless, though.

    NV40 VTF was clocked at 10 or 20 million verts per second. R2VB was orders of magnitude faster.
     
  14. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    It is scalar vertically (no ILP), unlike AMD. It is SIMD horizontally (aka hardware vectorization), just like NV.

    Also, packing 4 work-items into VLIW lanes is a VERY BAD idea at the compiler level. It breaks the conceptual simplicity of the OpenCL programming model. It is also akin to doing the same thing in two different ways: you have both hardware vectorization and VLIW on top of it. The VLIW is meant for extracting ILP, and that's that.

    A better way is to pack vec2 registers together, and vec3 and scalar registers together, into a single vec4 physical register. This should be enough to minimize the register waste, which would be a big start.
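    The register-packing suggestion can be sketched as first-fit bin packing over 4-component physical registers. This is a toy allocator under that one assumption, not AMD's actual register allocator:

    ```python
    # Sketch: pack narrow logical registers (vec2, vec3 + scalar, etc.)
    # into 4-component physical registers to cut wasted components.
    def pack_registers(widths):
        """First-fit-decreasing packing of logical register widths into
        vec4 slots. Returns the number of physical vec4 registers used."""
        slots = []  # remaining free components per physical register
        for w in sorted(widths, reverse=True):
            for i, free in enumerate(slots):
                if free >= w:
                    slots[i] -= w
                    break
            else:
                slots.append(4 - w)
        return len(slots)

    # Naively, four vec2 registers would occupy four vec4s (wasting half
    # the components); packed, they fit in two.
    print(pack_registers([2, 2, 2, 2]))  # 2
    # A vec3 + scalar pair shares one register, the two vec2s share another.
    print(pack_registers([3, 1, 2, 2]))  # 2
    ```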
     
  15. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,018
    Likes Received:
    582
    Location:
    Taiwan
    Yes, and it's an easily vectorizable SIMD. Which is my point, i.e. a proper SIMD is not harder to vectorize than VLIW.

    AMD's compiler is already doing this very well IMHO. In my n-queen kernel, a 2D vectorized version has very similar ALU packing rate compared to a 4D vectorized version.
     
  16. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    That's interesting; I really seem to be lacking knowledge on this topic, too. :) Maybe you can tell me/us some more about it in a separate thread?
     
  17. EduardoS

    Newcomer

    Joined:
    Nov 8, 2008
    Messages:
    131
    Likes Received:
    0
    But if there isn't ILP, the compiler could do it; if the compiler doesn't, then the programmer has to do it to get decent performance. Why not give this option at the compiler level?

    AMD already does this, and their register file is so big that they don't actually have to put too much effort into it...
     
  18. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
    We did lots of R2VB vs VTF evaluation years ago. A quick link to a post describing the DX9 support/perf of both (with more discussion of the topic in the thread): http://forum.beyond3d.com/showpost.php?p=1160683&postcount=33. Naturally things have changed since DX10 and unified shaders. VTF is now the faster and more flexible approach for the majority of the algorithms out there.
     
  19. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    Having a 64 wide SIMD running the kernels instead of a 16 wide one doesn't impact the programming model.
     
