ATi presentation on future GPU design

Also, for long shaders, can't you break the shader into several smaller sub-shaders and operate in a data-parallel fashion on those sub-shaders?
The execution of each sub-shader would transform an input stream element into an output stream element of data-dependent type. Run one sub-shader at a time until the number of buffered output stream elements has reached a size that allows massively parallel execution of the next sub-shader.
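Just to make the idea concrete, here is a minimal C++ sketch of that scheme; the element type, the two sub-shader bodies and the batch threshold are all invented for illustration:
Code:
#include <cstddef>
#include <vector>

struct Element { float x, y, z, w; };              // one stream element

// First and second halves of a long shader (bodies are placeholders).
Element subShaderA(const Element& e) { return { e.x * 2.0f, e.y, e.z, e.w }; }
Element subShaderB(const Element& e) { return { e.x + e.y, e.y, e.z, e.w }; }

std::vector<Element> runSplitShader(const std::vector<Element>& input,
                                    std::size_t batch) {
    std::vector<Element> intermediate, output;

    // Pass 1: run sub-shader A over the input stream, buffering its outputs.
    for (const Element& e : input)
        intermediate.push_back(subShaderA(e));

    // Pass 2: only start once enough elements have accumulated to keep a
    // wide machine busy; the serial loop stands in for that parallel launch.
    if (intermediate.size() >= batch)
        for (const Element& e : intermediate)
            output.push_back(subShaderB(e));

    return output;
}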
 
arjan de lumens said:
A poor fit for software rendering, perhaps, but having a hardware renderer fork off a hundred threads in order to process a hundred vertices or a hundred pixels doesn't sound that hard to me,...
But you are not taking storage into account. Having hundreds of vertices in flight would require a similar increase in register costs etc. If we just consider the input "V" registers, off the top of my head there are 16 per vertex, each 128 bits wide. A few hundred vertices (e.g. 300) would require 75kB of V-reg storage alone (300 × 16 × 16 bytes = 76,800 bytes). If you include temps etc. it'd be a huge lump of silicon.
 
aaronspink said:
MfA said:
What's in a name ...

NVIDIA has said this about their vertex shader :

The floating-point core is a multi-threaded vector processor operating on quad-float data. Vertex data is read from the input buffers and transformed into the output buffers (OB). The latency of the vector and special function units are equal and multiple vertex threads are used to hide this latency.

Which is why, in a lot of cases, CPUs can outperform a vertex shader...
I'm not necessarily disputing that it may occur in practice; I just don't understand how you get to that conclusion from MfA's statement.
 
Simon, please use pixel shaders as an example instead ... to me it seems that overcoming branching the same way memory latency is overcome should not add that much to the storage requirements. It takes much longer to fetch a texel than it does to evaluate a conditional and start fetching instructions from the branch target (assuming the program is in cache).

Of course I'm assuming modern hardware can execute pixel shaders with a reasonable number of temps before dependent texture reads at decent speeds :) I have no clue if that is true.
 
MfA said:
Simon, please use pixel shaders as an example instead ... to me it seems that overcoming branching the same way memory latency is overcome should not add that much to the storage requirements. It takes much longer to fetch a texel than it does to evaluate a conditional and start fetching instructions from the branch target (assuming the program is in cache).
Sorry about that, but (as Jodi 'kindly' pointed out :rolleyes:) I'm a VGP (vertex) bloke, not a pixel shader one. Having said this, the issues are surely going to be similar. If you want to increase the number of "threads" (i.e. verts/pixels) in flight, then you have to have more storage for when those threads are suspended.
 
Simon, yes, but the point is that for pixel shaders (and in the future for vertex shaders too) some of that storage will already be there ... if you hide the delay of the branch-target fetch the same way you hide texture-read latency, then it would be up to the developer to decide whether a given shader increases the pressure on storage too much (but as I indicated before, you would think the effect of branching would be on a smaller scale).
 
aaronspink said:
Pixel shaders DO NOT have more than one context. They are NOTHING like SMT at either the architecture or micro-architecture level.
That's entirely implementation specific.
GPUs have aspects of vector and stream processors, and are multithreaded. These naming schemes are not mutually exclusive (you can do stream processing with or without multithreading; Stanford's, for instance, does without).
The Pixel shaders are NOT multithreaded. Their performance characteristics prove this beyond a shadow of a doubt. The pixel shaders in all modern GPUs resemble either a Vector processor with SIMD operations or a Stream processor with SIMD operations.
What you see as a "stream processor with SIMD operations" may be far more complex. You can't know the internal workings just based on how it performs.
While some people like to do a lot of grouping of VLIW/SMT/CMT/Vector/Stream into one category, it is incorrect. The performance and design characteristics of a Stream processor and an SMT processor are worlds apart, most apparently in their ability to change control flow.

And while Stream processors could perhaps be multi-threaded, they would lose a large amount of their efficiency and performance. The best way to think of a Stream processor is as an application-specific FPGA datapath: shove data in and it comes out with complex operations performed on it. Just don't try to be dynamic in what those complex operations are, and the performance will be good.
You're wrong. Simple example. Take the following shader:
Code:
mad r0, c0, t0
tex r1, r0, s0
mov o1, r1
So when pixel n comes along and hits that tex read, what does the HW do? Wait potentially hundreds of cycles while the texture unit does the look up? No, the shader can work on pixel n+1. When pixel n+1 hits the texture read, the HW can either go back to pixel n (if the data is ready) or start working on pixel n+2... etc. This is very much like a multithreaded environment where each thread is a pixel. Far from losing efficiency and performance, you gain them.

Again, implementations vary, but I see evidence that other vendors are doing this as well. The question is whether you can juggle enough threads to hide all of the latencies in different parts of the chip. Obviously, not all designs are created equal in this regard.
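For what it's worth, a toy C++ model of that "work on pixel n+1 while the texture unit is busy" behaviour might look like this; the pixel count, instruction count and fetch latency are invented, and it only illustrates the scheduling:
Code:
// Each pixel carries its own tiny context: an id, a program counter and the
// cycle at which its outstanding texture fetch completes.
#include <cstdio>
#include <queue>
#include <vector>

struct PixelCtx {
    int  id;
    int  pc = 0;           // next instruction for this pixel
    long readyAt = 0;      // cycle at which its pending fetch completes
};

int main() {
    const int kNumInstr   = 3;     // mad, tex, mov (as in the shader above)
    const int kTexLatency = 100;   // cycles a texture fetch takes (made up)

    std::queue<PixelCtx> runnable;
    for (int i = 0; i < 4; ++i) runnable.push({i});

    std::vector<PixelCtx> waiting;
    long cycle = 0;

    while (!runnable.empty() || !waiting.empty()) {
        // Wake up pixels whose texture data has arrived.
        for (auto it = waiting.begin(); it != waiting.end();) {
            if (it->readyAt <= cycle) { runnable.push(*it); it = waiting.erase(it); }
            else ++it;
        }
        if (runnable.empty()) { ++cycle; continue; }   // bubble: nothing ready

        PixelCtx p = runnable.front(); runnable.pop();
        ++cycle;
        if (p.pc == 1) {                       // the "tex" instruction
            p.readyAt = cycle + kTexLatency;   // park this pixel, switch away
            ++p.pc;
            waiting.push_back(p);
        } else if (++p.pc < kNumInstr) {
            runnable.push(p);                  // more work left for this pixel
        } else {
            std::printf("pixel %d finished at cycle %ld\n", p.id, cycle);
        }
    }
}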
 
Simon F said:
arjan de lumens said:
A poor fit for software rendering, perhaps, but having a hardware renderer fork off a hundred threads in order to process a hundred vertices or a hundred pixels doesn't sound that hard to me,...
But you are not taking storage into account. Having hundreds of vertices in flight would require a similar increase in register costs etc. If we just consider the input "V" registers, off the top of my head there are 16 per vertex, each 128 bits wide. A few hundred vertices (e.g. 300) would require 75kB of V-reg storage alone (300 × 16 × 16 bytes = 76,800 bytes). If you include temps etc. it'd be a huge lump of silicon.
You only need enough active 'threads' to mask texture-fetch + instruction-fetch latency (assuming you want both texturing AND branching without pipeline bubbles), which apparently amounts to about 150-180 at 500 MHz; 300 is excessive and would roughly match 1 GHz GPU operation speed. And 75kB is still 'only' about 4 million transistors if implemented in SRAM; you can afford many of those on the ~200 million transistor budgets of next-gen designs. Or you can do like Nvidia did in NV30 and trade off the number of per-thread registers against the number of threads you can keep running (170 pipeline steps, 170 threads, 340 registers -> once you exceed 2 working registers per thread, the number of threads the hardware can handle is suddenly halved. And halved again if you need >4 registers, and so on).
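As a back-of-the-envelope check on those numbers, a tiny C++ sketch; it takes the 170-thread / 340-register figures above as given and assumes, as the halving pattern implies, that each thread's register allocation is rounded up to the next power of two:
Code:
#include <algorithm>
#include <cstdio>

// How many threads fit, given a shared register file and a per-thread
// allocation rounded up to a power of two (an assumption, not a spec).
int threadsSupported(int regsPerThread,
                     int totalRegs = 340, int maxThreads = 170) {
    int alloc = 1;
    while (alloc < regsPerThread) alloc *= 2;   // 3 regs -> allocation of 4, etc.
    return std::min(maxThreads, totalRegs / alloc);
}

int main() {
    for (int r : {2, 3, 4, 5, 8})
        std::printf("%d regs/thread -> %d threads in flight\n",
                    r, threadsSupported(r));
}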
 
OpenGL guy said:
aaronspink said:
Pixel shaders DO NOT have more than one context. They are NOTHING like SMT at either the architecture or micro-architecture level.
That's entirely implementation specific.

Oh, I agree that eventually, as the pixel pipeline becomes more programmable, that may change, but I am unaware of any current GPU that has more than one control-flow context per pipe. Perhaps you'll educate me, as it is fairly hard to get concrete info since ATI/NV/etc. seem to have an absolute aversion to publishing technical papers on their designs.

What you see as a "stream processor with SIMD operations" may be far more complex. You can't know the internal workings just based on how it performs.

I agree that figuring out all the internal workings of a given design from how it performs can be difficult, but you can make educated guesses based on the programming model and performance in combination.



You're wrong. Simple example. Take the following shader:
Code:
mad r0, c0, t0
tex r1, r0, s0
mov o1, r1
So when pixel n comes along and hits that tex read, what does the HW do? Wait potentially hundreds of cycles while the texture unit does the look up? No, the shader can work on pixel n+1. When pixel n+1 hits the texture read, the HW can either go back to pixel n (if the data is ready) or start working on pixel n+2... etc. This is very much like a multithreaded environment where each thread is a pixel. Far from losing efficiency and performance, you gain them.

What you describe is what I assume is currently done, which is effectively creating additional virtual pipeline stages via some method of delay buffering. That is a far different thing from multi-threading.

The main differentiator is that a multi-threaded architecture has independent control flow for each stream within the architecture. This allows a far greater amount of programming flexibility while at the same time requiring more hardware resources to implement. Given the great aversion to state changes and branches in both current and future GPU designs from both of the market leaders, I can only assume that they don't really support independent flow control and instead act more like stream processors with conditional execution and reconfiguration for different instruction streams/workloads. In general it appears that the pixel pipelines are even more limited in their designs, working on a quad as the minimum unit of computation instead of an individual pixel.
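To illustrate the distinction being drawn (this is a generic sketch, not any particular vendor's hardware): with one control-flow context per quad, a branch can be handled by evaluating both sides for all four pixels and selecting per pixel, i.e. conditional execution rather than an independent program counter per pixel. In C++:
Code:
#include <array>

struct Pixel { float value; bool condition; };
using Quad = std::array<Pixel, 4>;

float thenSide(const Pixel& p) { return p.value * 2.0f; }  // placeholder work
float elseSide(const Pixel& p) { return p.value + 1.0f; }

void shadeQuad(Quad& q) {
    // Both paths are evaluated for every pixel; the "branch" is really a
    // conditional selection of results (predication), so the quad never
    // diverges and a single instruction pointer suffices for all four.
    for (Pixel& p : q) {
        float a = thenSide(p);
        float b = elseSide(p);
        p.value = p.condition ? a : b;
    }
}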

I'm not arguing that there aren't similarities between how GPU/stream processors operate and how a multi-threaded architecture operates; I am merely arguing that there are significant differences and that the appropriation of the terms thread/multi-threading is incorrect.

Again, implementations vary, but I see evidence that other vendors are doing this as well. The question is whether you can juggle enough threads to hide all of the latencies in different parts of the chip.

I would assume that with current workloads you should be able to juggle enough pixels to keep the pipeline at near peak utilization. But I don't believe this will be the case for future workloads that use significant flow control. Already there are significant programming-model limitations in place to try to keep the hardware from becoming too complex or from degrading in performance.

A lot of the reason that pixel engines perform so well at what they do is that the programming model is kept deliberately simple (full pre-allocation of resources, limitations on flow-control changes, etc.), allowing a lot of parallel resources to be used.

Aaron Spink
speaking for myself inc.
 
If there is no stack (meaning no function calls) in the pixel shader, then all a pixel needs is an integer storing the instruction position it is currently at. A more-or-less pipelined, parallel-and-serial architecture can then handle the job quite well, with a "pixel queue" for pixels waiting on texture fetches and on branching results (the actual determination of where to jump, and whether to jump at all, isn't a big issue: after evaluating the condition, all you do is change the instruction pointer and re-loop). The full program is in cache all the time anyway, since at every point of the program there is some pixel flowing through.

There is NO need for threads at all; this holds, as said, as long as there is no need for some sort of stack, which isn't there (yet :D), if I remember the NV40 documentation right.
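A rough picture of what such a per-pixel record might hold; the register count and the float4 type are just illustrative assumptions:
Code:
#include <array>
#include <queue>

struct float4 { float x, y, z, w; };

struct PixelRecord {
    int instructionPointer;           // where this pixel resumes
    std::array<float4, 12> temps;     // its working registers (no stack)
    int pendingTexUnit;               // which fetch it is waiting on
};

// Pixels parked here until their texture data comes back; once ready they
// simply re-enter the pipe at instructionPointer.
std::queue<PixelRecord> waitingOnTexture;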
 
There is a stack in NV40 with pusha/popa functions. The stack only holds the PC register for call returns (and I think loop addresses), not data registers. Registers are still globally shared (i.e. you don't spill them to the stack). It's purely there to deal with nested calls and possibly loops.
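As a toy illustration (depth and field types are arbitrary choices, not NV40 specifics), such a stack stores only program counters for call/return while data registers stay globally shared:
Code:
#include <cstdint>

struct AddressStack {
    static constexpr int kDepth = 4;     // assumed nesting depth
    std::uint32_t entries[kDepth];
    int top = 0;

    void call(std::uint32_t& pc, std::uint32_t target) {
        entries[top++] = pc;             // push the return address only
        pc = target;
    }
    void ret(std::uint32_t& pc) {
        pc = entries[--top];             // restore the caller's PC
    }
};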

Here's what I don't get in this discussion: the insistence on arguing over the definition of "thread" and trying to exclude pixel shaders from it. To me, a thread is a sequential list of instructions to be executed. The ability to suspend execution (cooperatively or preemptively, asymmetrically or symmetrically) and execute a different sequence of instructions is multithreading.

It's true that in GPUs most pipes are running the exact same sequence of instructions, while in CPUs they might be running radically different program code, but "thread" has never been defined to require running arbitrary code.

In fact, it's perfectly possible in a highly multithreaded CPU app (say, a server) to have 10,000 threads, all of them executing the same I/O loop, and all of them handled with synchronous round-robin scheduling. Seems like aaron is just engaging in semantic quibbling to avoid being incorrect.
 
A thread is a context, which stores the whole state of the CPU: where you are, what you have on the stack, and all that. It's a thing with quite a lot of overhead, which you don't need most of the time, especially not in pixel shaders.
 
A pixel needs its full context too - which might be quite substantial if you are allowing pixels with different states to flow through the shader simultaneously.
 
davepermen said:
A thread is a context, which stores the whole state of the CPU: where you are, what you have on the stack, and all that. It's a thing with quite a lot of overhead, which you don't need most of the time, especially not in pixel shaders.

Not true. There are CPUs without stacks, and there are machines that implement multithreading with stack contexts. Yes, if you've been in the industry long enough, you'll encounter them.

Secondly, pixel shaders have way more registers than the average CPU, and hence their context overhead is much larger. We're talking 32 read/write temp registers, plus lots of other state (texture coordinate iterators, etc.).

Third, that's a *thread on a CPU*. There is no definition of "thread" laid down in stone. Google returns something like 20 differing definitions alone.
 
Of course. Still, people here normally refer to CPU-style threads. Context-saving threads have no business being on the GPU.
 
That kinda sounds like a haiku; it makes less sense than most, though.

Saving contexts is exactly what a pixel shader does when it puts aside a quad while a texture fetch is in flight.
 
Don't the R350 and higher already have a buffering mechanism for full pixel state to implement the F-Buffer? How many different sets of state can this mechanism store? Wouldn't this qualify under the full CPU definition of multi-threading, even if the granularity is somehow fixed by the instruction limit of the R3xx series?
 
If I understand this thread correctly (which is a big if), would Mr. Spink be trying to say that one method of multi-threading is more akin to time slicing?
 