But isn't this the case already? Latency-inducing reads on current GPUs are dealt with by a specialized unit outside the ALU (processor = ALU in this terminology), and latency is hidden simply by switching contexts.
R580 does. NVidia's current GPUs don't. Actually, NVidia blurs this distinction because, in its view, the single long ALU/TEX pipeline inside G71's quads, say, is actually executing hundreds of threads, one after another (each quad of pixels in a quad-pipeline counts as a thread in NVidia's terms). ATI and NVidia don't see eye to eye on this terminology.
You seem to be suggesting taking a shader and splitting it along TEX boundaries. Each TEX instruction turns into "metadata" attached to the piece of code that follows it, so no piece contains a texture load. Each piece then marks what input data it requires in order to run (fragment inputs, texture results, registers from the previous code clause). The GPU then picks fragments and feeds them to the "stream processors" only when their data is available, initiating loads for fragments whose metadata requirements can be satisfied.
Yeah. R580 and Xenos already do this.
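As a toy sketch of that scheme (everything here is invented for illustration - `alu.run` and `tmu.fetch` stand in for the real execution units, and no GPU exposes anything like this directly):

```python
from dataclasses import dataclass, field

@dataclass
class Clause:
    instructions: list                 # ALU-only code, no texture loads inside
    needs_tex: int = -1                # id of the TEX fetch gating this clause (-1: none)

@dataclass
class Fragment:
    clause_idx: int = 0                          # next clause to run
    tex_ready: set = field(default_factory=set)  # fetches that have landed
    tex_requested: set = field(default_factory=set)

def schedule_pass(fragments, clauses, alu, tmu):
    """One scheduling pass: issue a fragment's next clause only if its
    gating texture result is present; otherwise kick off the fetch."""
    for frag in fragments:
        if frag.clause_idx >= len(clauses):
            continue                              # shader finished
        clause = clauses[frag.clause_idx]
        if clause.needs_tex < 0 or clause.needs_tex in frag.tex_ready:
            alu.run(clause.instructions, frag)    # data ready: run the clause
            frag.clause_idx += 1
        elif clause.needs_tex not in frag.tex_requested:
            tmu.fetch(clause.needs_tex, frag)     # initiate the load, once
            frag.tex_requested.add(clause.needs_tex)
```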
My problem with this "split up the shader and only run segments when the data is there" approach is that it reminds me of out-of-order execution (OOOE) rather than threading. With threading, you run until you're about to block on I/O, then yield so someone else can do work. That adjusts dynamically and is very simple to implement.
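To make the distinction concrete, here's a toy cooperative-threading sketch using Python generators (nothing here is how any GPU actually implements it):

```python
# Each "thread" runs until it issues a texture fetch, then yields;
# the scheduler resumes it once the data has arrived.

def shader(frag_id):
    # ...some ALU work...
    texel = yield ("fetch", frag_id)    # about to block on I/O: yield
    # ...more ALU work using texel...
    return texel

def run(num_fragments):
    threads = {i: shader(i) for i in range(num_fragments)}
    pending = {i: next(t) for i, t in threads.items()}   # run each to its first yield
    for i, request in pending.items():
        try:
            threads[i].send(f"texel for {request}")      # data arrived: resume
        except StopIteration:
            pass                                         # shader finished

run(4)
```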
In G71, say, the threading model is driven by the size of the register file. If there's space for 4 FP32s for each of 880 fragments in a batch, then there's alternatively space for 2 FP32s for each of 1760 fragments, or 8 FP32s for each of 440. The quad-pipeline is 220 clocks long and processes a quad of 4 fragments per clock, hence 880 is the default batch size.
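Back-of-envelope, using those figures (4 fragments per quad per clock is the assumption):

```python
REGISTER_FILE = 4 * 880      # FP32 slots available (4 regs x 880 fragments)

for regs_per_fragment in (2, 4, 8):
    batch = REGISTER_FILE // regs_per_fragment
    print(f"{regs_per_fragment} FP32 regs/fragment -> {batch} fragments/batch")
# 2 -> 1760, 4 -> 880, 8 -> 440
```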
If the shader contains some seriously awkward texture fetches/filtering then you'll lose performance, as the latency can only be partly hidden. There's no easy way to predict that, since it depends on cache-thrashing and bandwidth. Latency hiding depends on there being enough pixels in the batch, because batch size determines the count of clock cycles from one instruction to the next for a given pixel.
So the 440-fragment batch will run at roughly half-performance on bilinearly filtered textures, because a 440-fragment batch only covers 110 of the 220 clocks and the other half are bubbles. If the shader has no texture fetches, then there's no effect on performance.
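The same arithmetic as a sketch - again assuming the quad-pipeline issues 4 fragments per clock, so a batch covers batch/4 clocks of the 220-clock latency:

```python
PIPELINE_DEPTH = 220          # clocks from one instruction to the next

def utilisation(batch_size, frags_per_clock=4):
    clocks_covered = batch_size / frags_per_clock    # issue time of one batch
    return min(1.0, clocks_covered / PIPELINE_DEPTH)

for batch in (880, 440):
    print(f"{batch} fragments -> {utilisation(batch):.0%} of peak")
# 880 -> 100%, 440 -> 50%
```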
On the other hand, splitting a shader into pure-functional, I/O-less chunks and scheduling data loads and packets seems to require a lot more logic, because of the potential for out-of-order execution. You can't run chunk N+1 if it depends on registers written in chunk N, and those registers were in turn dependent on calculations from a texture load. And so on. We then have to do a lot of bookkeeping.
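A minimal sketch of that bookkeeping (the `Chunk` shape and register names are invented): per fragment, you need a scoreboard of registers still in flight from texture loads:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    reads: set    # registers this chunk consumes
    writes: set   # registers it produces

def can_issue(chunk, outstanding):
    """Chunk N+1 must stall while any register it reads is still being
    filled by a texture load issued from an earlier chunk."""
    return not (chunk.reads & outstanding)

# r2 is still in flight from a TEX load in the previous chunk:
print(can_issue(Chunk(reads={"r2"}, writes={"r5"}), outstanding={"r2"}))  # False
```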
This is what Xenos and R5xx do. Even the cheap $50 X1300HM does this.
Not only do you fill your ALU pipeline (no bubbles) but you can minimise the bubbles in the (decoupled) TMU pipeline.
G71's TMU pipeline, not being decoupled from the ALU pipeline, doesn't offer any flexibility in latency-hiding. So the batch size remains high to hide the typical worst-case latency. R5xx doesn't depend on a single batch executing for enough cycles to hide latency, it swaps batches repeatedly. Which is how it can get away with batch sizes of 16 or 48 - and then you get into the whole argument about efficient dynamic branching requiring small batches.
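A crude model of why small batches help branching - assume a branch whose outcome flips at one screen position, and batches built from contiguous pixels; only batches straddling the edge pay for both paths (numbers are purely illustrative):

```python
EDGE = 500     # hypothetical pixel where the branch outcome flips
WIDTH = 1024   # pixels in the region being shaded

def straddling(batch_size):
    """Fraction of batches containing pixels on both sides of the branch."""
    starts = range(0, WIDTH, batch_size)
    return sum(1 for s in starts if s < EDGE < s + batch_size) / len(starts)

for batch in (16, 48, 880):
    print(f"batch {batch}: {straddling(batch):.1%} of batches run both paths")
# batch 16: ~1.6%, batch 48: ~4.5%, batch 880: 50%
```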
The problem that D3D10 introduces is that textures are not the only source of latency. The best example of this is "constant buffers". SM4 allows devs to create fantastically complex structures as constants. You could easily have one structure amounting to tens of KB, and SM4 supports an effectively unlimited count of these constant structures. The point being that constants are simply too large to keep entirely on die (in much the same way as shader programs can't be, if they're very long). So the GPU now has to implement some kind of latency hiding when referring to apparently innocuous constants - by default, all constants live in video RAM.
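Illustrative numbers only (the 256-bone palette is a made-up example, not an SM4 limit):

```python
FLOAT4 = 16                      # bytes per float4 constant
MATRIX = 4 * FLOAT4              # one 4x4 matrix = 64 bytes

palette = 256 * MATRIX           # e.g. a 256-bone skinning palette
print(palette // 1024, "KB in one constant buffer")       # 16 KB
print(64 * palette // 1024, "KB across 64 such buffers")  # 1024 KB - far beyond on-die storage
```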
Clearly there'll be some kind of caching for constants. I think it's fair to say that a small population of constants will probably fit entirely in cache.
Jawed