Stream Processing

trinibwoy

Since those G80 specs tipped up, the term "stream processor" has come into play, but what exactly does it mean?

Nvidia makes the following distinctions between GPU types in a recent patent application:

Thus, in one embodiment, a programmable GPU may comprise a fragment processing stage that has a simple instruction set. Fragment program data types may primarily comprise fixed point input textures. Output frame buffer colors may typically comprise eight bits per color component. Likewise, a stage typically may have a limited number of data input elements and data output elements, a limited number of active textures, and a limited number of dependent textures. Furthermore, the number of registers and the number of instructions for a single program may be relatively short. The hardware may only permit certain instructions for computing texture addresses only at certain points within the program. The hardware may only permit a single color value to be written to the frame buffer for a given pass, and programs may not loop or execute conditional branching instructions. In this context, an embodiment of a GPU with this level of capability or a similar level of capability shall be referred to as a fixed point programmable GPU.

In contrast, more advanced dedicated graphics processors or dedicated graphics hardware may comprise more enhanced features. The fragment processing stage may be programmable with floating point instructions and/or registers, for example. Likewise, floating point texture frame buffer formats may be available. Fragment programs may be formed from a set of assembly language level instructions capable of executing a variety of manipulations. Such programs may be relatively long, such as on the order of hundreds of instructions or more. Texture lookups may be permitted within a fragment program, and there may, in some embodiments, be no limits on the number of texture fetches or the number of levels of texture dependencies within a program. The fragment program may have the capability to write directly to texture memory and/or a stencil buffer and may have the capability to write a floating point vector to the frame buffer, such as RGBA, for example. In this context, an embodiment of a GPU with this level of capability or a similar level of capability may be referred to as a floating point programmable GPU.

Likewise, a third embodiment or instantiation of dedicated graphics hardware shall be referred to here as a programmable streaming processor. A programmable streaming processor comprises a processor in which a data stream is applied to the processor and the processor executes similar computations or processing on the elements of the data stream. The system may execute, therefore, a program or kernel by applying it to the elements of the stream and by providing the processing results in an output stream. In this context, likewise, a programmable streaming processor which focuses primarily on processing streams of fragments comprises a programmable streaming fragment processor. In such a processor, a complete instruction set and larger data types may be provided. It is noted, however, that even in a streaming processor, loops and conditional branching are typically not capable of being executed without intervention originating external to the dedicated graphics hardware, such as from a CPU, for example. Again, an embodiment of a GPU with this level of capability or a similar level comprises a programmable streaming processor in this context.

Based on these definitions I would categorize current hardware as programmable streaming processors + looping/branching. I see kernel/stream as analogous to shader/batch. Is that wrong? Is there something about current architectures that prevents them from being classified as stream processors?
 
A modern GPU as a whole can be considered a stream processor, Trini, but nV is saying the G80 has 128 streaming processors. That pretty much rules out one ALU per streaming processor; it has to have more than one.

Added to that, each ALU in a streaming processor must have the same capabilities as the others.
 
In this context it appears that the only differentiation is that the input data format is "undefined".

You could take this to its logical extreme. For example, a shader program clause (not an entire program) is provided with a stack of data as input; it has access to a set of registers to hold temporary values while it executes; and it either pushes its output onto a stack (or just a blob of memory, not necessarily "registers") or writes it to a queue.

So imagine a pixel shader program that wants to take a lighting constant (bright red, say) and apply it to a two-layer texture.

Both texels need to be fetched from memory and filtered (if need be). The constant, "red", needs to be fetched from the memory of constants (a huge, irregular lump of memory, populated by thousands of constants for all sorts of tasks).

So, the thread scheduler issues requests to the appropriate units to fetch the texels and retrieve the constant. These units dump the results in one or more buffers (one buffer for texels, another for constants). The scheduler then issues the program clause for execution, which takes the buffers (texels + constant) as input and puts the result in a queue (say the pixel is now finished and so needs to go to the ROP).

If the shader program has not finished at this point (say some more textures need to be fetched, or more constants are required) the thread scheduler will get textures/constants fetched and then issue the next clause in the program when all the data is ready.

So that's a streaming view of shader execution, in a fairly strict sense. All the source data arrives directly; there is no "latency hiding" within the clause as such, because the latency is hidden by pre-fetching all the required data beforehand. Plainly, the output of data is also a stream, either to a ROP queue in this case, or to a "stack" or to another buffer (e.g. the post-vertex-transform cache, if we're talking about vertex shading rather than pixel shading).
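
To make the ordering concrete, here's a trivial Python sketch of that loop (all the names, fetch_texel and so on, are invented for illustration; it only shows the prefetch-then-execute ordering, not any real hardware interface):

```python
from collections import deque

# Invented example clauses: each lists what it needs before it can run.
clauses = [
    {"texels": ["layer0", "layer1"], "constants": ["red_light"], "ops": 8},
    {"texels": [],                   "constants": ["fog_colour"], "ops": 4},
]

texel_buffer = {}       # filled by the texture fetch/filter units
constant_buffer = {}    # filled by the constant fetch path
rop_queue = deque()     # finished results stream out to the ROP

def fetch_texel(name):      # stand-in for the texture pipeline
    return "filtered(" + name + ")"

def fetch_constant(name):   # stand-in for the constant fetch path
    return "value(" + name + ")"

for clause in clauses:
    # 1. The scheduler has the fetch units fill the buffers first...
    for t in clause["texels"]:
        texel_buffer[t] = fetch_texel(t)
    for c in clause["constants"]:
        constant_buffer[c] = fetch_constant(c)

    # 2. ...then issues the clause: every read hits an on-die buffer,
    #    so the ALU work proceeds with no memory latency to hide.
    result = {"inputs": dict(texel_buffer, **constant_buffer),
              "ops_run": clause["ops"]}

    # 3. The output is itself a stream, here a queue feeding the ROP.
    rop_queue.append(result)

print(len(rop_queue), "clauses retired")
```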

---

You could describe texel fetching/filtering and constant fetching as two streaming processes, too. Each is supplied with the necessary "coordinates" and then pulls in the required data. Most of their time is spent waiting for memory to respond. Ideally you keep these units running at 100% utilisation by keeping their input queues full.
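
As a toy model of one such unit (queue contents invented): it just drains a queue of coordinates and emits results, so its utilisation depends only on the input queue staying non-empty.

```python
from collections import deque

# Invented request stream: texture coordinates waiting to be serviced.
request_queue = deque((u, v) for u in range(4) for v in range(4))
results = deque()

# Each iteration stands in for one memory transaction; the unit never
# idles as long as request_queue still has entries in it.
while request_queue:
    coord = request_queue.popleft()
    results.append(("texel_at", coord))   # stand-in for read + filter

print("fetched", len(results), "texels")
```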

---

GPUs already operate pretty close to this model anyway, with texture caches being the primary "buffer" for texture fetches (fetches from memory into cache then provide "instantaneous", latency-hidden texels). And out-of-order threading in R580 seems extremely close to this model.

I dare say where the fun comes in for NVidia is that there's an opportunity to "gather" all the memory reads (textures, constants, vertex buffers, texture buffers) into "one place" which is then easily accessed by a clause of code. And by scheduling all instances of clauses, NVidia can build a finely-grained latency-hiding, high-utilisation GPU.

Jawed
 
A modern GPU as a whole can be considered a stream processor, Trini, but nV is saying the G80 has 128 streaming processors. That pretty much rules out one ALU per streaming processor; it has to have more than one.

Added to that, each ALU in a streaming processor must have the same capabilities as the others.

Razor, thanks for the feedback, but I don't think that's accurate. There is no restriction on the number of ALUs per processor. And ALUs can certainly have different capabilities - see Imagine.
 
Oh yeah, I see what they are saying, but again each ALU cluster has to have the same functionality. The way I'm taking it is kinda like taking one pipeline of an NV40, or a better example would be a G70, but with independently working ALUs (no TMU usage lock-ups in any circumstances); that would be considered a stream processor.

If each pipeline has only one ALU then the entire chip can be considered a stream processor, but that's what is confusing: I don't think nV would call something a stream processor if it didn't have the functionality of one. Could be using it as a buzzword, though.
 
Jawed -- that's pretty much my thinking, too.

Given the description of branching being external, the thread scheduler might wind up in charge of that as well....

The only missing element in my mind is whether streams propagate from unit to unit (is there a real producer-consumer model here?), and whether shaders could be repurposed to take advantage of it even if it did exist. Can thread-state size be reduced if we separate memory-intensive shader clauses from non-memory intensive clauses? That sort of thing....

Also, in a high-unit stream model, caching might move closer to the processing unit in question to reduce internal bandwidth requirements. Intelligent prefetching and local caching, possibly with high bandwidth unit<-->unit comm channels.

:slobber:

-Dave [whaddya mean *only* 5-700M transistors ;^/]
 
I dare say where the fun comes in for NVidia is that there's an opportunity to "gather" all the memory reads (textures, constants, vertex buffers, texture buffers) into "one place" which is then easily accessed by a clause of code. And by scheduling all instances of clauses, NVidia can build a finely-grained latency-hiding, high-utilisation GPU.

Jawed

Thanks Jawed. I think I follow your example but you seem to be describing "streaming" as a data retrieval process. I was phrasing the question in the context of a stream of fragments/vertices/triangles etc to be processed, not necessarily the additional data required to process them.
 
Oh yeah, I see what they are saying, but again each ALU cluster has to have the same functionality. The way I'm taking it is kinda like taking one pipeline of an NV40, or a better example would be a G70, but with independently working ALUs (no TMU usage lock-ups in any circumstances); that would be considered a stream processor.

Gotcha.

If each pipeline has only one ALU then the entire chip can be considered a stream processor, but that's what is confusing: I don't think nV would call something a stream processor if it didn't have the functionality of one. Could be using it as a buzzword, though.

Well that's kinda what I'm curious about - what is the functionality of one and how's that different to what we have now? :)
 
I'm not so sure "stream processor" in this context is much more than a vehicle to avoid talking about "ps" and "vs" and "gs" units. I.e. it's unification-speak, going up one level of generality (in other words, ps, vs, and gs are all subset examples of stream processors).
 
I'm not so sure "stream processor" in this context is much more than a vehicle to avoid talking about "ps" and "vs" and "gs" units. I.e. it's unification-speak, going up one level of generality (in other words, ps, vs, and gs are all subset examples of stream processors).

Well that's no fun. But you're probably right.
 
Thanks Jawed. I think I follow your example but you seem to be describing "streaming" as a data retrieval process. I was phrasing the question in the context of a stream of fragments/vertices/triangles etc to be processed, not necessarily the additional data required to process them.
No, the main part of streaming here is that the data is already "packetised" ready for execution in the clause.

So the clause has a very simple access pattern for all its data sources: buffers for texture and constant data, all of which have "0 latency".

If you examine shader code you will find that some "constants" (whether texels or actual constants in code) have the same value for a swathe of pixels (e.g. these triangles are lit by a red light), while other constants vary by pixel (which is the normal case for texels, but could easily be the case if you take per-triangle attributes and interpolate them across the triangle). Logically, though, all these constants (per-pixel, per-triangle, per-"state") actually have an unchanging value for the duration of the shader program. The fiddly bit comes when some constants can only be identified after some calculations have been done; e.g. the shader might decide to evaluate which are the four closest lights and then use them for lighting. The colours of those lights are constant, but which lights they are is not known in advance.

So depending on the frequency of constant variation, you organise a buffer for per-pixel constants, another for per-triangle constants and another for per-state constants. This makes it much easier to build the logic that sorts out all these data access patterns and gets the data to the execution pipeline. The resulting packet is just enough (no more, no less) for the clause to complete execution.
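
A throwaway sketch of that bucketing (the constant names and their update frequencies are invented): the clause sees one flat, zero-latency view assembled from buffers that are rebuilt at different rates.

```python
# Constants bucketed by how often their values change, so only the
# cheapest possible buffer has to be rebuilt at each boundary.
constant_buffers = {
    "per_state":    {"light_colour": (1.0, 0.0, 0.0)},   # set once per draw state
    "per_triangle": {"face_normal":  (0.0, 1.0, 0.0)},   # refilled per triangle
    "per_pixel":    {"interp_uv":    (0.25, 0.75)},      # refilled per pixel
}

def constants_for_clause():
    # The execution pipeline just sees one merged, zero-latency packet.
    merged = {}
    for bucket in ("per_state", "per_triangle", "per_pixel"):
        merged.update(constant_buffers[bucket])
    return merged

print(constants_for_clause())
```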

If you like, shader execution is proactively scheduled, rather than reactively scheduled. When you look at a shader you can identify every point where latency will occur (branches, texture lookups, constant or other fetches), split the code up into clauses bounded by latency and issue as many of the latency-inducing instructions as possible, ahead of time. This clausing then allows the data required by each clause to be packetised.
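
Roughly, in Python, and assuming (my assumption, for the sake of the sketch) that a shader is just a flat list of opcodes and that texture/constant fetches are the only latency points:

```python
LATENCY_OPS = {"TEX", "FETCH_CONST"}   # assumed latency-inducing opcodes

def split_into_clauses(shader):
    """Cut a flat instruction list at every latency-inducing op.
    Each clause records the fetches it depends on, so those can be
    issued ahead of time and the ALU work then runs stall-free."""
    clauses, fetches, body = [], [], []
    for instr in shader:
        if instr[0] in LATENCY_OPS:
            if body:                   # close the current ALU clause
                clauses.append({"needs": fetches, "alu": body})
                fetches, body = [], []
            fetches.append(instr)
        else:
            body.append(instr)
    if body or fetches:
        clauses.append({"needs": fetches, "alu": body})
    return clauses

shader = [("TEX", "t0"), ("MUL", "r0", "t0", "c0"),
          ("TEX", "t1"), ("MAD", "r0", "t1", "c1", "r0")]
for i, clause in enumerate(split_into_clauses(shader)):
    print(i, clause)
```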

This doesn't solve the problem of what's the best order to perform texture fetches (because they can cause thrashing against memory if the ordering is bad) or how to efficiently combine "clauses in flight" (since they'll consume varying amounts of memory - a clause might consume 2 FP32s at the start of the shader but by the end, on clause 10, say, there might be 6 FP32s).

Obviously as clauses get very short, you end up with intensive swapping. You can minimise the amount of swapping by reducing per-thread instruction-level parallelism - i.e. you don't mind running your code through a scalar pipeline, because it means that the minimum per-clause execution time is long enough for a stall-free swap-in and then swap-out. So while G71 can execute two successive MAD and MUL instructions in a single cycle, say, G80 will happily take at least two clocks (though as many as eight).

It's hard to say what the minimum clause length will turn out to be. It's worth remembering that with batching, i.e. each instruction-issue taking multiple cycles to complete, there'll be at least two objects per pipeline per instruction.

Jawed
 
Jawed -- that's pretty much my thinking, too.

Given the description of branching being external, the thread scheduler might wind up in charge of that as well....
That's the way it works in Xenos as far as I can tell.

The only missing element in my mind is whether streams propagate from unit to unit (is there a real producer-consumer model here?), and whether shaders could be repurposed to take advantage of it even if it did exist. Can thread-state size be reduced if we separate memory-intensive shader clauses from non-memory intensive clauses? That sort of thing....
It seems unlikely that there's any need to erect a physical producer-consumer pipeline. I think we're looking at nothing more than a logical packetisation. Except, of course, where there are dedicated pipelines for memory and texel fetch and ROP (which may re-use the texel pipeline) - so you might argue that the GPU as a whole functions as a working-cell network rather than a strict pipeline. Not to mention fixed function units, e.g. input assembler (though who knows to what degree NVidia has cut-out fixed-function?...).

Also, in a high-unit stream model, caching might move closer to the processing unit in question to reduce internal bandwidth requirements. Intelligent prefetching and local caching, possibly with high bandwidth unit<-->unit comm channels.
Yes this is basically what I mean. In G71, for example, there's an L2 cache shared across the quad-pipes, with L1 caches operating per-quad or per-pixel (can't remember). You can liken this organisation to the packets I'm referring to, with an L2 cache for memory access and L1 to hold a packet close to the execution pipeline. This could mean that data appears multiple times across the set of L1s (which is certainly the case in ATI's L1 texture caches). This makes it more important not to implement a physical producer-consumer pipeline across the set of shader ALU pipes.

Jawed
 
No, the main part of streaming here is that the data is already "packetised" ready for execution in the clause.

I guess what's not clear in your example is whether this packetization happens for all elements in the stream before processing starts or if it happens before the processing of each individual element.

What exactly do you consider a "stream"? Is it a single pre-fetched packet or is it an array of such packets? Maybe that's where I'm not following you.
 
Stream processing is basically non-stop execution of shader code. The execution pipeline doesn't have to organise any latency-hiding because all of the data (and only that data) it wants to access is in a place/format that provides for 0-latency access.

The packetisation avoids latency and the pipeline just takes the input data from the packet and produces output data.

You want to maximise packet size simply to avoid swapping, but on the other hand a packet that consumes a vast amount of 0-latency memory might be better broken down by splitting the clause into shorter sections of code (assuming that longer sections of code tend to use more input data and more temporary registers). That's the special sauce in making a GPU: what variables to account for in your thread-scheduling model and how to weight them.

Basically you end up with chunks of work that are self-contained so that the execution pipeline is guaranteed never to stall. 16 texels and 4 constants might be needed to execute the next 12 instructions - all 12 instructions will proceed without a hitch. The degree of filtering required for those texels has no effect on the ALU pipeline.
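
As a toy illustration of that guarantee (the 16/4/12 numbers are just the ones above, everything else is invented): the packet is only handed over once it is complete, so the inner execution loop cannot stall.

```python
# Invented packet: everything the next clause needs, and nothing more.
packet = {
    "texels":       ["texel_%d" % i for i in range(16)],   # 16 pre-filtered texels
    "constants":    ["const_%d" % i for i in range(4)],    # 4 constants
    "instructions": 12,                                     # length of the clause
}

def execute_clause(pkt):
    # The pipeline only ever reads from the packet, never from memory,
    # so all 12 instructions retire without a stall.
    assert len(pkt["texels"]) == 16 and len(pkt["constants"]) == 4
    for _ in range(pkt["instructions"]):
        pass   # ALU work would happen here
    return "clause complete"

print(execute_clause(packet))
```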

The GPGPU guys really should be very happy with this kind of architecture.

(It's not a million miles away from Xenos, for what it's worth, at the clause-scheduling level: latency from texel or constant fetches is hidden, meaning that each clause has its packet of data in-register to work from. Xenos doesn't offer support for the full range of constant buffer and texture buffer features in D3D10, so the access paths are, I guess, easier to implement.)

---

If by element you mean "pixel" or "vertex" or "primitive" then that's a slightly different question. That's more of a batching question. Implicit in my ideas is that a single pixel, for example, is actually part of a larger batch.

So when a pixel shader requests 16 texels to be fetched, there's really tens or hundreds (or more) parallel fetches - each pixel will have its own set of 16 texels, and a lot of those texels will be common to more than one pixel. So the resulting packet will be quite sizable (because of the batch size) and then it's a matter of organising access to that data across all the pipelines executing that batch.
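
A small sketch of that (pixel and texel names invented): each pixel lists its texel footprint, one fetch is issued per unique texel, and the resulting packet is shared across the whole batch.

```python
# Invented batch: each pixel asks for its own texels, but neighbours overlap.
batch = {
    "pixel_0": ["t0", "t1", "t4", "t5"],
    "pixel_1": ["t1", "t2", "t5", "t6"],
    "pixel_2": ["t4", "t5", "t8", "t9"],
}

unique = sorted({t for texels in batch.values() for t in texels})
packet = {t: "filtered(" + t + ")" for t in unique}   # one fetch per unique texel

print(sum(len(v) for v in batch.values()), "requests,",
      len(packet), "actual fetches shared across the batch")
```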

Jawed
 
If by element you mean "pixel" or "vertex" or "primitive" then that's a slightly different question. That's more of a batching question. Implicit in my ideas is that a single pixel, for example, is actually part of a larger batch.

So when a pixel shader requests 16 texels to be fetched, there's really tens or hundreds (or more) parallel fetches - each pixel will have its own set of 16 texels, and a lot of those texels will be common to more than one pixel. So the resulting packet will be quite sizable (because of the batch size) and then it's a matter of organising access to that data across all the pipelines executing that batch.

Jawed

There we go. I understood your other point about setting up zero-latency data retrieval for streaming processing but you were kinda answering a question I wasn't asking :) Think I'm clear on the context of your comments now. Thanks. And if it helps explain my confusion - I refer to your "batch" as a "stream".
 
What exactly do you consider a "stream"? Is it a single pre-fetched packet or is it an array of such packets? Maybe that's where I'm not following you.
In general, in stream programming, you'll find that there's more than one stream of input data being consumed.

A nice example (at a higher level) is geometry instancing. In geometry instancing you have a model defined (a flower) which is a stream of vertex data that defines how the vertices are connected together. Then you have an instance stream which defines how the flowers vary (colour, height, which way they're facing...). So, you run through the instance stream for all the flowers in the scene (hundreds) and loop over the model stream to actually construct the final set of hundreds of flowers.

So in that example the two streams are read with completely different access patterns.

For the sake of this example, you could describe a packet as the entire flower model and one set of instance data (colour, height, facing). So, while the ALU pipeline is constructing flower 87, the constant pipeline is fetching the instance data for flower 88.
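
In code form (flower data invented), the two access patterns look something like this: the instance stream is walked once, while the model stream is re-read in full for every instance.

```python
# Model stream: read repeatedly, one full pass per instance.
flower_model = ["petal_v0", "petal_v1", "stem_v0", "stem_v1"]

# Instance stream: read once, sequentially.
instances = [
    {"colour": "red",    "height": 1.0, "facing": 0},
    {"colour": "yellow", "height": 1.3, "facing": 90},
    {"colour": "white",  "height": 0.8, "facing": 45},
]

output_stream = []
for inst in instances:              # single sequential pass over the instances
    for vertex in flower_model:     # full re-read of the model per instance
        output_stream.append((vertex, inst["colour"], inst["height"], inst["facing"]))

print(len(output_stream), "output vertices")   # 3 instances x 4 model vertices
```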

Jawed
 
This seems to be an arbitrary interpretation of "stream processing". My historical intuition about the abstract concept is simply that it seeks to differentiate from models of computation that require full random access and destructive in-place update. Instead, it replaces traditional processing with a semi-functional notion of an input stream of data (which cannot be randomly seeked or modified), loaded in chunks (where you may randomly address or destructively update *within* the kernel), followed by stream output (as opposed to writing back to the source location). It's kind of like a two-tape Turing machine, where one tape is readable, the other is writable, and neither tape head can back up or move forward more than N positions.
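
A toy rendering of that two-tape picture (the chunk size and the kernel are arbitrary choices of mine): the kernel may read and overwrite freely inside its current chunk, but the input is consumed strictly forward and the output is append-only.

```python
def run_stream(kernel, input_stream, chunk_size):
    """Apply `kernel` to fixed-size chunks of a forward-only input stream.
    Random access and in-place update are allowed only *inside* a chunk;
    results go to a separate, append-only output stream."""
    output_stream = []
    for start in range(0, len(input_stream), chunk_size):
        chunk = list(input_stream[start:start + chunk_size])   # local, mutable copy
        output_stream.extend(kernel(chunk))                    # write head only moves forward
    return output_stream

def normalise(chunk):
    # The kernel is free to index and overwrite within its own chunk.
    peak = max(chunk) or 1
    for i in range(len(chunk)):
        chunk[i] = chunk[i] / peak
    return chunk

print(run_stream(normalise, [1, 2, 3, 4, 8, 2], chunk_size=3))
```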

I don't see where latency hiding is ruled out. First, there's no reason why any stream processor has to be "pure" and stick rigidly to some arbitrary definition or model. It can violate such definitions with hybrid design. Where does the definition of stream processing = 0 latency come from? I've never seen it before.
 
Normally GPUs make random accesses to textures. That isn't usually within the remit of a stream processor (~serial in, ~serial out - based on chunks of data), as you've said yourself.

So I was interpreting the "streaming processor" (ALU pipeline) to be a unit that doesn't do random accesses from/to memory. The best way to do this is to break up the code into clauses where packets of data can be accessed "once" (at clause start-up, in effect) while the latency-inducing random reads are performed by other units within the GPU, producing packets (in on-die memory) for the stream processors to consume.

Hence the ALU pipelines, considered as a whole, have a very "pure" streaming model of execution, with no random accesses to video memory. Hence the pipeline itself doesn't have any latency-hiding stages (unless you consider reading/writing on-die memory or registers as needing latency-hiding).

It's just my interpretation of NVidia's terminology for an ALU pipeline.

Jawed
 
But isn't this the case already? Latency-inducing reads on current GPUs are dealt with by a specialized unit outside the ALU (processor = ALU in this terminology), and latency is hidden simply by switching contexts.

You seem to be suggesting taking a shader and splitting it along TEX boundaries. TEX instructions turn into "metadata" attached to the piece of code following them. Each piece won't have any texture loads. Each piece then marks what input data it requires in order to run (fragment input, texture, registers from the previous code clause). The GPU then picks fragments and feeds them to the "stream processors" only when data is available, initiating loads for fragments whose metadata requirements are satisfiable.
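
Something like this, in toy form (clause records and load names invented): each piece of code carries metadata naming the loads it needs, and the scheduler only issues it once those loads have landed.

```python
from collections import deque

# Invented clauses: each carries metadata naming the loads it needs.
clauses = deque([
    {"id": 0, "needs": {"tex0"},           "alu": ["MUL", "ADD"]},
    {"id": 1, "needs": {"tex1", "const3"}, "alu": ["MAD"]},
])

landed = set()       # loads whose data has arrived on-die
in_flight = set()    # loads issued but not yet returned

def issue_load(name):       # stand-in for kicking off a texture/constant fetch
    in_flight.add(name)

def memory_returns():       # stand-in for memory responding (here: instantly)
    landed.update(in_flight)
    in_flight.clear()

while clauses:
    clause = clauses[0]
    missing = clause["needs"] - landed
    if missing:
        for load in missing:        # start the fetches...
            issue_load(load)
        memory_returns()            # ...in hardware, other work would run here
    else:
        print("issue clause", clause["id"], clause["alu"])
        clauses.popleft()
```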

My problem with this "split up the shader and only run segments when the data is there" approach is that it reminds me of OOOE instead of threading. With threading, you run until you are about to block for I/O, and then yield to someone else to do work. It dynamically adjusts and is very simple to implement.

On the other hand, splitting a shader into pure-functional, I/O-less chunks and scheduling data loads and packets seems to require a lot more logic, because of the potential for out-of-order execution. You can't run chunk N+1 if it depends on registers in chunk N, and registers in chunk N were dependent on calculations from a texture load. Etc. etc. We then have to do a lot of bookkeeping.

Seems to me that GPUs are embarrassingly threadable, so I think they will keep the same thread-based latency-hiding design, and TEX loads will simply be a point of hybrid design within a stream processor. Random non-stream access is allowed, but only from an auxiliary input.
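
For contrast, the thread-switching scheme I have in mind is as simple as this sketch (batch names and programs invented): each batch runs until it is about to block on a TEX, yields, and is resumed once its data is back.

```python
from collections import deque

# Each "thread" is a batch of fragments; a TEX op is where it would block.
ready = deque([
    {"name": "batch_A", "ops": ["MUL", "TEX", "MAD"], "pc": 0},
    {"name": "batch_B", "ops": ["ADD", "TEX", "MUL"], "pc": 0},
])
waiting = deque()    # batches blocked on an outstanding texture fetch

while ready or waiting:
    if not ready:                        # nothing runnable: memory "returns"
        ready.append(waiting.popleft())
        continue
    batch = ready.popleft()
    while batch["pc"] < len(batch["ops"]):
        op = batch["ops"][batch["pc"]]
        batch["pc"] += 1
        if op == "TEX":                  # about to block: yield to another batch
            waiting.append(batch)
            break
        print(batch["name"], "executes", op)
    # leaving the inner loop without hitting TEX means the batch is finished
```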
 