What gives ATI's graphics architecture its advantages?

I don't know if the architecture is SIMD.

The microarchitecture is definitely SIMD.

There's a really big difference.

David

PTX (the pseudo-assembly instruction set for CUDA) is scalar. NVIDIA does not expose the SIMD nature of its GPUs in PTX; the only hint of it is the warp vote functions.
 
True, but NV40's implementations of PS3.0 and VS3.0 were almost useless due to bad branching and bad vertex texturing.
I knew about abysmal branching perf, but how was VTF so bad, especially compared to the competing solutions at that time?
 
Unfortunately, compilers are not perfect at all when it comes to vectorization. For graphics, it can be pretty easy to vectorize, but that's a function of the workload. Not all HPC is trivial to vectorize, and a lot is downright difficult or impossible.
Fitting scalar kernel threads into VLIW is not a vectorization problem, it's a scheduling problem.
 
I knew about abysmal branching perf, but how was VTF so bad, especially compared to the competing solutions at that time?
Compared to the competition, it rocked :).
It had a lot of limitations (no filtering, for instance, and not many texture formats). It looks like using "normal" TMUs wasn't feasible, so NVIDIA ended up implementing it with a minimalistic vertex TMU, whereas ATI just said "screw it", apparently thinking it wasn't worth investing transistors in a bad solution.
 
Not if your SIMD is a complete SIMD with gather and scatter, though (see LRBni for an example). CUDA does automatic vectorization for its SIMD because of this (it looks scalar, but the underlying architecture is actually a SIMD).
There are no SIMD instructions in the NVIDIA architectures. There are multiple threads in a warp executed at once, but that doesn't change that the threads themselves are executed in a strictly scalar manner.

Neither the CUDA to PTX nor the PTX to GPU bytecode compiler needs to do any vectorization.
 
Neither the CUDA to PTX nor the PTX to GPU bytecode compiler needs to do any vectorization.
It isn't needed for ATI's OpenCL to IL compiler either, for instance. The difference is that it *can* be done (you can use r#.xyzw or just r# to apply an instruction to all components; it also works with just 2 or 3 components), but this is mostly something to make the code more readable or shorter. It doesn't make a difference whether you pass

add r100.x, some_register
add r101.x, another_register
add r102.x, yet_another_value
or
add r100.xyz, r25.wzx

to the IL compiler. Both are handled as three independent additions, which may or may not end up in the same VLIW bundle. The slot the hardware uses for an instruction is also not determined by the IL code (that's only the case for double precision adds, but it will likely be fixed in the future). The Brook+ or OpenCL to IL compilers don't do any vectorization or other fancy stuff. They simply translate the code to IL in the simplest way possible (it looks terrible most of the time). The optimization is done by the IL compiler, but as MfA already said, this optimization is about scheduling, finding ILP and packing the VLIW bundles, not about vectorization.
If a programmer chooses instructions using 2, 3 or 4 components, that only means he structures the code to be shorter, more readable, and to have more parallelism readily exposed; it doesn't magically lead to higher utilization compared to the same algorithm written with more scalar instructions.
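Purely as an illustration of that point (a hypothetical OpenCL C kernel of my own, not anything from AMD's toolchain): the two variants below express the same four independent multiply-adds, once with scalar statements and once with float4 syntax, and neither form gives the IL compiler more packing freedom than the other.

// Hypothetical OpenCL C sketch: both kernels express four independent MADs.
// The vector syntax in the second one is just shorthand; it does not by itself
// improve VLIW slot utilization.
__kernel void mad4_scalar(__global const float4 *a, __global const float4 *b, __global float4 *out)
{
    int i = get_global_id(0);
    float x = a[i].x * b[i].x + 1.0f;
    float y = a[i].y * b[i].y + 1.0f;
    float z = a[i].z * b[i].z + 1.0f;
    float w = a[i].w * b[i].w + 1.0f;
    out[i] = (float4)(x, y, z, w);
}

__kernel void mad4_vector(__global const float4 *a, __global const float4 *b, __global float4 *out)
{
    int i = get_global_id(0);
    out[i] = a[i] * b[i] + 1.0f;   // the same four independent operations
}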
 
My point was that filling a VLIW bundle is much easier for a compiler than vectorizing code (as you said). It doesn't matter at all where this happens in the case of the ATI driver, but it needs to happen.

So I think we have nothing to argue about :)
 
So I think we have nothing to argue about :)

VLIW vs. SIMD is irrelevant.

AFAIK, ATI has a SIMD array of VLIWs. That's right, it's like SIMD, except each 'lane' of the SIMD is a VLIW.

This is the reverse of what is usually done in CPUs (i.e. you might have a VLIW where one of the functional units is a SIMD unit).

I doubt that you can pack dependent operations in the same bundle. Perhaps the compiler can do something interesting to enable scheduling tricks...but by definition if two operations form a dependency chain...one must execute before the other.

Anyway, the real point is that you actually need to have multiple instructions that can be packed in the same bundle for the VLIW or your utilization suffers. If you only have one independent instruction in each bundle...you are limited to 25% of peak performance.

However, to fill a bundle you need to do all the normal scheduling wizardry required for a VLIW. That's not too bad for graphics, but can be difficult for other workloads. NV does not require that you find any ILP - you just need to find a lot of data. That is easier by definition than finding a lot of data and ILP in the instruction stream that operates on said data.

David
 
I doubt that you can pack dependent operations in the same bundle. Perhaps the compiler can do something interesting to enable scheduling tricks...but by definition if two operations form a dependency chain...one must execute before the other.
Is it so hard to read the manual?

However, to fill a bundle you need to do all the normal scheduling wizardry required for a VLIW. That's not too bad for graphics, but can be difficult for other workloads. NV does not require that you find any ILP - you just need to find a lot of data. That is easier by definition than finding a lot of data and ILP in the instruction stream that operates on said data.
So here's a tip for you: do 2 or 4 work-items per lane and the bundles will be full. It's not black magic; it's easy enough for the compiler to do on its own, but since it still doesn't, we programmers still have to do it by hand in "low ILP" code...
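As a rough sketch of that tip in OpenCL C (a hypothetical kernel of my own; it assumes the global work size is halved accordingly): each work-item handles two elements, so even code with little per-element ILP hands the compiler two independent chains to pack into the VLIW bundles.

// Hypothetical sketch of "2 work-items per lane": one work-item processes two
// elements, creating two independent dependency chains for the VLIW packer.
// Assumes the NDRange global size is half the number of elements.
__kernel void scale_two_per_lane(__global const float *in, __global float *out, float k)
{
    int i = get_global_id(0) * 2;
    float a = in[i];
    float b = in[i + 1];
    // The two chains are independent of each other, so their operations can
    // land in different slots of the same VLIW bundle.
    a = a * k + 1.0f;
    b = b * k + 1.0f;
    out[i]     = a;
    out[i + 1] = b;
}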
 
Because you have several gates in series within a single clock cycle. The maximum clock you can reach is also limited by how long these chains are (the delay grows roughly linearly with chain length), though I'm not really up to date here and have no idea what the length actually is for modern chips.
Generally, you can make them shorter by using more gates in parallel, but unfortunately that will increase the gate and hence transistor count very significantly.
Thanks for the detailed answer. I remember reading about the problems of long data buses on motherboards; I read that one of the reasons FSBs and HT links have limited frequencies is that they tend to turn into EM antennas once the clock passes a certain limit, so the signal is wasted and could potentially interfere with other connections.

larger transistors, parallel logic, more pipelining, etc.
Thanks for the answer. So I understand this is a common practice even in the CPU industry? But then again, some CPUs like the Core i7 920 are overclockable from 2.66 GHz to 4.0 GHz, which is a very large clock jump.

I guess this has something to do with yields and quality control rather than a specific feature of the transistors themselves. I mean, Intel must have branded these CPUs as mainstream products and clocked them modestly to address certain market segments, even though they are capable of much higher clock speeds; perhaps they even failed some strict quality-control tests that would never manifest in real-world scenarios.

Voltage and heat are not the only limiting factors. As you say, the cooler the chip, the higher it can go. But no matter how cool the chip is, there is a limit to how fast a signal can travel: the speed of light. The maximum possible frequency of a chip is determined by the maximum time needed for the signals to propagate in one clock tick. Because the speed of light is a constant, the only thing you can do is reduce the distance the signal has to travel. For this you can use pipelining, and that requires extra transistors for buffering and for splitting the work into stages.
Excellent clarification, thanks.
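For a rough sense of the scale involved (purely illustrative numbers, not from this thread):

t = 1/f = 1/(3 GHz) ≈ 333 ps
d ≤ c · t ≈ (3 × 10^8 m/s) × (333 ps) ≈ 10 cm

And that is the vacuum limit; signals going through on-chip gates and wires are far slower, so the usable distance per cycle is much shorter, which is why shortening the per-stage path via pipelining buys clock speed.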
 
I doubt that you can pack dependent operations in the same bundle.
Actually, you can do it with the Evergreen series for some of the most used instructions (add, mul, madd, and some other stuff like sad).
Anyway, the real point is that you actually need to have multiple instructions that can be packed in the same bundle for the VLIW or your utilization suffers. If you only have one independent instruction in each bundle...you are limited to 25% of peak performance.
Partly true, but normally you have some ILP to extract. And as said, a 50% utilization rate is already enough to reach the theoretical peak performance of the competition. It's often not a big deal.
NV does not require that you find any ILP - you just need to find a lot of data. That is easier by definition than finding a lot of data and ILP in the instruction stream that operates on said data.
If a problem is suited to GPUs (i.e. to the implicit vectorized execution in their SIMD units), you can almost by definition process several work items in a single VLIW SIMD slot, as EduardoS already said. It is often very easy to do and simply increases your effective vector length, i.e. it is the equivalent of finding a lot of data.
It requires that the programmer is aware of the issue, true. But even when not doing it, you will be hard pressed to find a problem where the disadvantage is larger than the potential advantage ATI has when the programmer pays attention (if that disadvantage exists at all; for medium and high ILP workloads, ATI's shader performance is higher either way).
 
There are no SIMD instructions in the NVIDIA architectures. There are multiple threads in a warp executed at once, but that doesn't change that the threads themselves are executed in a strictly scalar manner.

Neither the CUDA to PTX nor the PTX to GPU bytecode compiler needs to do any vectorization.

It's still not scalar, it just looks scalar. To put it simply, every thread in a warp has to be running the same instruction. That's the definition of SIMD. It's easy to do vectorization because its SIMD supports gather, scatter, and proper masked execution (just like LRBni). That's also why you can't have irreducible control flow in CUDA or OpenCL on NVIDIA's hardware (or ATI's hardware, for the same reason).
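A small hypothetical OpenCL C sketch of why that holds (my own example, nothing vendor-specific): each work-item below reads like a scalar program, but across a warp/wavefront the indexed load becomes a gather, the indexed store a scatter, and the branch is handled via execution masks.

// Hypothetical kernel: written as if scalar, executed as SIMD across the warp/wavefront.
__kernel void gather_scatter_mask(__global const int *idx, __global const float *src, __global float *dst)
{
    int i = get_global_id(0);
    float v = src[idx[i]];   // per-lane indices -> a gather across the SIMD width
    if (v < 0.0f)            // divergent branch -> handled with execution masks
        v = 0.0f;
    dst[idx[i]] = v;         // per-lane addresses -> a scatter across the SIMD width
}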
 
I knew about abysmal branching perf, but how was VTF so bad, especially compared to the competing solutions at that time?
Compared to the competition at launch NVidia's dynamic branching was great as well. ;) Still pretty much useless, though.

NV40 VTF was clocked at 10 or 20 million verts per second. R2VB was orders of magnitude faster.
 
It's still not scalar, it just looks scalar. To put it simply, every thread in a warp has to be running the same instruction. That's the definition of SIMD. It's easy to do vectorization because its SIMD supports gather, scatter, and proper masked execution (just like LRBni). That's also why you can't have irreducible control flow in CUDA or OpenCL on NVIDIA's hardware (or ATI's hardware, for the same reason).

It is scalar vertically (no ILP), unlike AMD. It is SIMD horizontally (aka hardware vectorization), just like AMD.

Also, packing 4 work-items into VLIW lanes is a VERY BAD idea at the compiler level. It breaks the conceptual simplicity of the OpenCL programming model. It is also akin to doing the same thing in two different ways: you have both hardware vectorization and VLIW on top of it. The VLIW is meant for extracting ILP, and that's that.

A better way is to pack vec2 registers together and vec3 and scalars together into a single vec4 physical register. This should be enough to minimize the register waste, which would be a big start.
 
It is scalar vertically (no ILP), unlike AMD. It is SIMD horizontally (aka hardware vectorization), just like AMD.

Yes, and it's an easily vectorizable SIMD. Which is my point, i.e. a proper SIMD is not harder to vectorize than a VLIW.

A better way is to pack vec2 registers together and vec3 and scalars together into a single vec4 physical register. This should be enough to minimize the register waste, which would be a big start.

AMD's compiler is already doing this very well IMHO. In my n-queen kernel, a 2D vectorized version has a very similar ALU packing rate to a 4D vectorized version.
 
Compared to the competition at launch NVidia's dynamic branching was great as well. ;) Still pretty much useless, though.

NV40 VTF was clocked at 10 or 20 million verts per second. R2VB was orders of magnitude faster.
That's interesting, I really seem to be lacking in this topic, too. :) Maybe you can tell me/us some more about it in a separate thread?
 
Also, packing 4 work-items into VLIW lanes is a VERY BAD idea at the compiler level. It breaks the conceptual simplicity of the OpenCL programming model. It is also akin to doing the same thing in two different ways: you have both hardware vectorization and VLIW on top of it. The VLIW is meant for extracting ILP, and that's that.
But if there isn't any ILP, the compiler could do it; if the compiler doesn't, then the programmer has to do it to get decent performance. Why not give this option at the compiler level?

A better way is to pack vec2 registers together and vec3 and scalars together into a single vec4 physical register. This should be enough to minimize the register waste, which would be a big start.
AMD already does this, and their register file is so big that they don't actually have to put too much effort into it...
 
That's interesting, I really seem to be lacking in this topic, too. :) Maybe you can tell me/us some more about it in a separate thread?
We did lots of R2VB vs VTF evaluation years ago. A quick link to a post describing the DX9 support/perf of both (with more discussion of the topic in the thread): http://forum.beyond3d.com/showpost.php?p=1160683&postcount=33. Naturally things have changed since DX10 and unified shaders. VTF is now the faster and more flexible approach for the majority of the algorithms out there.
 
Also, packing 4 work-items into VLIW lanes is a VERY BAD idea at the compiler level. It breaks the conceptual simplicity of the OpenCL programming model.
Having a 64-wide SIMD running the kernels instead of a 16-wide one doesn't impact the programming model.
 
Raw power

That's raw power:

[image: computemarkBench.jpg (ComputeMark benchmark results)]


Download: http://www.friendsea.com/ComputeMark or http://www.czechgamer.com/download/3617/ComputeMark-v12-DX11-ComputeShader-benchmark.html
 