I think I'd call what you're describing "distributed scheduling" and not locality in the traditional sense of the word, at least with respect to the memory hierarchy.
But the point is that this isn't just about memory, it's about three asynchronous pipelines that feed off memory (or on-die memories): ALU, TMU and ROP (four if you include early-Z). These pipelines have been asynchronous since R300, with independent queues etc.
Those pipelines are basically subdividing and distributing work according to a fixed mapping function (if I read you correctly) that maps the fragment at X,Y to group N of resources. Yeah, you don't need those databits to be capable of going to any of N GPU resources (e.g. whichever one is free to do more work), but that to me is an issue of routing streamed data around the chip, not necessarily one of ensuring cache locality, memory locality or FIFO locality.
Bingo: Dave and I were discussing why you don't want to route data around the chip.
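To make that concrete, here's a minimal sketch (plain C) of the sort of fixed coordinate-to-resource mapping being described. The 16x16 tile size and the simple round-robin pattern are my own assumptions for illustration - the real R300/R420 mapping isn't public:

[code]
/* Fixed screen-space mapping: fragment (x,y) -> quad pipe.
 * Assumes 16x16-pixel tiles dealt out round-robin across 4 quad pipes;
 * illustrative only, not ATI's actual pattern. */
#include <stdio.h>

#define TILE_SIZE  16
#define NUM_PIPES  4

static unsigned pipe_for_fragment(unsigned x, unsigned y)
{
    unsigned tx = x / TILE_SIZE;      /* tile column */
    unsigned ty = y / TILE_SIZE;      /* tile row    */
    return (tx + ty) % NUM_PIPES;     /* fixed, data-independent routing */
}

int main(void)
{
    /* Every fragment inside a given tile lands on the same pipe, so there's
     * no need for a crossbar to route post-rasterisation work around the chip. */
    printf("fragment (100, 35) -> pipe %u\n", pipe_for_fragment(100, 35));
    printf("fragment (101, 36) -> pipe %u\n", pipe_for_fragment(101, 36));
    return 0;
}
[/code]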
IMHO, what's important is that the "kernel" of data being passed around is "local" and not dependent on other packets of data, not that the packets themselves are scheduled by a fixed distribution. Seems to me to be just like the arguments over other network architectures: time division/reservation vs collision-detect/queue. There are arguments pro/con to each. Sounds to me like the Rxxx argument is based on saving transistors, optimizing chip layout, and avoiding more complex data routing.
Well, without incredibly advanced simulators we're ultimately in the dark about this kind of architecture versus one that schedules all fragments/pixels uniformly.
These GPUs appear to have a single-tier cache architecture: L1 solely at the fragment level. And I guess the colour buffer and z/stencil buffer caches are also simpler to implement.
I don't necessarily think that dividing the screen up into W x W chunks and mapping each chunk to a specific resource based on its coordinates, versus putting the chunks into multiple work queues and letting chip resources dequeue work as needed, will ultimately determine performance. It's just a choice of scheduling algorithm, and both master-worker and deterministic scheduling have their tradeoffs.
Maybe the relevant patents provide some metrics/motivations that are convincing...
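For what it's worth, here's a toy contrast of those two scheduling choices, just to pin down the terminology - neither function is claimed to reflect any actual hardware:

[code]
/* Deterministic vs master-worker scheduling of tiles, as a toy contrast.
 * Purely illustrative; neither reflects any vendor's actual design. */
#include <stdatomic.h>

#define NUM_PIPES 4

/* Deterministic: the tile's coordinates alone decide which pipe shades it,
 * so no arbitration and no routing of work to arbitrary pipes. */
static unsigned schedule_static(unsigned tile_x, unsigned tile_y)
{
    return (tile_x + tile_y) % NUM_PIPES;
}

/* Master-worker: whichever pipe is free pulls the next tile off a shared
 * queue. Load balances better in pathological cases, but the work now has
 * to be routed to whichever pipe happened to grab it. */
static _Atomic unsigned next_tile;

static unsigned schedule_dynamic(unsigned total_tiles)
{
    unsigned t = atomic_fetch_add(&next_tile, 1);
    return (t < total_tiles) ? t : (unsigned)-1;   /* (unsigned)-1 => no work left */
}
[/code]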
All we can say is that as the flexibility (programmability) of shader pipelines increases, along with the sheer count of them, the desire for efficient fine-grained scheduling increases. Some kind of distributed scheduling becomes more and more important. R300 etc. packetise by screen-space. G80 may well packetise by batch-ID. Who knows?...
I am speaking from ignorance of the details of the mapping, but it seems to me that any screen-space mapping would also have pathological cases where (depending on the "pattern" used in the map) you could get an uneven distribution of work. Of course, one would try to design it so that the statistical majority of cases ends up with a uniform distribution (e.g. a hash function with avalanche behaviour on the coordinates).
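A sketch of what I mean by hashing the coordinates - this uses MurmurHash3's 32-bit finaliser purely as an example of an avalanching mix; I'm not suggesting any shipping GPU does this:

[code]
/* Hash-based tile distribution: mix the tile coordinates through an
 * avalanching integer finaliser (MurmurHash3's) so adjacent tiles are
 * spread statistically evenly across pipes. Illustrative only. */
#include <stdint.h>

#define NUM_PIPES 4

static uint32_t mix32(uint32_t h)
{
    h ^= h >> 16;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}

static unsigned pipe_for_tile_hashed(uint32_t tile_x, uint32_t tile_y)
{
    /* Pack the coordinates, then rely on avalanche: flipping one bit of
     * either coordinate can flip any bit of the hash. */
    return mix32((tile_y << 16) | (tile_x & 0xffffu)) % NUM_PIPES;
}
[/code]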
The tiles are small. I've never seen dimensions stated, but they appear to be nominally 16x16 pixels - this size then determines the batch size. Since R420, the tiles can vary in size. Bigger tiles increase cache coherency. Smaller tiles reduce the number of "null" pixels that end up being uselessly shaded when a triangle doesn't entirely cover the tile (obviously that happens quite a lot, but it falls off as screen resolution is increased). I guess that R300 etc. can only shade one triangle per tile at any one time, so a four "quad pipeline" GPU such as R420 or R580 can pixel-shade up to four triangles simultaneously. Plainly, one triangle can cover hundreds of tiles.
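A quick back-of-the-envelope on the tile-size trade-off (tiles_touched is a made-up helper, and a bounding-box count obviously overstates coverage for thin triangles):

[code]
/* Count how many W x W tiles a triangle's screen-space bounding box touches.
 * Bigger W -> fewer tiles (better coherency); smaller W -> fewer "null"
 * pixels shaded in partially covered tiles. Hypothetical helper, not hardware. */
static unsigned tiles_touched(unsigned min_x, unsigned min_y,
                              unsigned max_x, unsigned max_y,
                              unsigned tile_w)
{
    unsigned tx0 = min_x / tile_w, tx1 = max_x / tile_w;
    unsigned ty0 = min_y / tile_w, ty1 = max_y / tile_w;
    return (tx1 - tx0 + 1) * (ty1 - ty0 + 1);
}

/* e.g. a triangle with bounding box (0,0)-(199,99):
 *   tiles_touched(0, 0, 199, 99, 16) = 13 * 7 = 91 tiles at 16x16
 *   tiles_touched(0, 0, 199, 99, 32) =  7 * 4 = 28 tiles at 32x32 */
[/code]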
I just think that the "tile" terminology is confusing. We have two existing "tile" nomenclatures out there, both of which refer to rasterization order (tile-based deferred renderers, and tiled scan conversion), so it's confusing to start talking about "tiled locality on the physical layout of the chip".
There are plenty of other tilings in computer graphics, e.g. textures are tiled across memory to maximise bandwidth utilisation.
You can even get non-rectangular tilings, such as this hexagonal render-target/texture tiling:
http://www.graphicshardware.org/previous/www_2005/presentations/bando-hexagonal-gh05.pdf
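And for the rectangular texture case above, the usual trick looks something like this generic block-linear addressing - small blocks of texels stored contiguously so a bilinear fetch tends to hit one DRAM burst. Block size and layout vary per GPU; this is just the general idea:

[code]
/* Generic block-linear texture addressing: 4x4-texel blocks stored
 * contiguously in memory. Assumes tex_width is a multiple of BLOCK.
 * Layout details differ per GPU; illustrative only. */
#include <stddef.h>

#define BLOCK 4                                /* 4x4 texels per block */

static size_t tiled_texel_offset(unsigned x, unsigned y,
                                 unsigned tex_width, size_t texel_bytes)
{
    unsigned blocks_per_row = tex_width / BLOCK;
    unsigned bx = x / BLOCK, by = y / BLOCK;   /* which block          */
    unsigned lx = x % BLOCK, ly = y % BLOCK;   /* texel inside block   */
    size_t block_index  = (size_t)by * blocks_per_row + bx;
    size_t texel_in_blk = (size_t)ly * BLOCK + lx;
    return (block_index * BLOCK * BLOCK + texel_in_blk) * texel_bytes;
}
[/code]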
The fact is that the distributed scheduling in R300 etc. is based upon screen-space tiling. It results in a physical locality of fragment/pixel processing which affects not just one type of memory access, but the entire workload post-rasterisation.
Jawed