Larrabee at GDC 09

I thought Jawed was still referring to the rasterization process when he said "shading", since the 80 pixels in figure 29 obviously won't all be sent on to the pixel shading stage regardless of the rasterization algorithm employed. There should only be 7 quads, or 2 qquads, sent to the pixel shading stage for this triangle.
I guess Jawed needs to clarify what he was referring to. On the other hand, if we talk about the efficiency of hierarchical SIMD rasterization, you have to keep in mind that if you reduce your SIMD width you also need more steps (which can be mapped to more branches or more computations per hierarchical level) to get down to a pixel or subpixel level.
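To make that tradeoff concrete, here's a minimal scalar sketch of the point-in-triangle (half-plane) test such a hierarchical rasterizer evaluates at each level; on Larrabee the same edge equations would be evaluated 16-wide over a block of sample positions per vector op. All names here are assumptions for illustration, not Abrash's code.

[code]
// Sketch only: Pineda-style edge function E(x,y) = a*x + b*y + c, set up
// once per edge in fixed-point screen space. The sign convention assumes a
// consistent (here counter-clockwise) winding order.
struct Edge { int a, b, c; };

Edge makeEdge(int x0, int y0, int x1, int y1) {
    Edge e;
    e.a = y0 - y1;
    e.b = x1 - x0;
    e.c = x0 * y1 - x1 * y0;
    return e;
}

// A sample is inside the triangle if it lies on the non-negative side of
// all three edges. A 16-wide SIMD version evaluates this for a whole 4x4
// block at once and returns a coverage mask instead of a bool.
bool inside(const Edge e[3], int x, int y) {
    for (int i = 0; i < 3; ++i)
        if (e[i].a * x + e[i].b * y + e[i].c < 0)
            return false;
    return true;
}
[/code]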
 
that triangle covers 5 tiles, 15 pixels and 7 quads. It's shading 80 pixels to produce 15 results, 19% utilisation. If the quads were packed (i.e. a bit of conditional routing) then it'd be shading 32 pixels for 15 results, 47% utilisation.
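(To unpack the quoted numbers: 5 qquads × 16 pixels = 80 pixels shaded for 15 covered, 15/80 ≈ 19%; packing the 7 non-empty quads needs ⌈7/4⌉ = 2 qquads, i.e. 32 pixels shaded, 15/32 ≈ 47%.)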
It's only the method of determining which pixels are inside the triangle that's being addressed here. One of the pseudo-code bits indicates that once it's been determined that a pixel lies inside the triangle, that job is added to a queue somewhere. I can fairly easily imagine a setup where they identify pixels (or quads or something) that are using the same shader (and hence micro-) code and run those in a group as a single thread. Considering the amount of work that RAD's doing on LRB, I find it hard to imagine that they'd be happy with even 50% utilization of the vector unit for the heavy-duty shading work.
 
Does TBR even work on their IGPs to date? AFAIK it's even a hw implementation (and one that sucks).
That is not a tiler AFAIK. Intel had their own tiler development in the past though, there are papers on it (probably never made it to silicon).

As for the people not being the same, I know that. I think I can expect people of this calibre to spend a while on literature research and tapping internal know-how ... if they really had to reinvent the wheel in this regard that is a huge strike against them, no matter how smart they are.

PS. as for similarities and differences ... the inventive step in this regard is performing point-in-triangle tests for each pixel individually for parallelization at the pixel level, while performing binning with more serial but also more efficient code ... that's the inspiration, the rest is just perspiration (such as adding an extra level of binning in the hierarchy). Although I don't think it's a huge deal (hence my earlier post questioning how non-obvious the potential for parallelization of rasterization is).
 
The rasterization algorithm already determines which quads are empty and which are not :) So it's just a matter of filling a qquad with non-empty quads (as much as possible) and gathering/scattering (or performing multiple loads/stores, one per quad, whichever is faster) the quads into the proper place. It could potentially be a big win if you have lots of small triangles.
It seems to me this is essential, not optional. This is a conditional routing problem, at least if working in terms of shading a single triangle at a time.

The setup engine thread has to issue qquads to the 3 shader threads. The qquads it issues aren't, generally-speaking, allowed to contain triangles that overlap any other triangles that are already being shaded. So the setup thread has to spend time both sorting for non-overlapping triangles and packing valid quads into qquads.

So, all I'm alluding to is the extra computational cost before shading can start, and that with lots of small triangles the cost is going to be quite high.
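A minimal sketch of that packing pass (all names are mine, and it ignores the non-overlap sorting above, which is the harder part):

[code]
#include <cstdint>
#include <vector>

struct Quad  { uint32_t x, y; uint8_t coverage; }; // 2x2 pixel block + 4-bit mask
struct QQuad { Quad quads[4]; int count = 0; };    // 16-pixel shading batch

// Compact the non-empty quads the rasterizer reports into full qquads,
// emitting a batch every time 4 quads have accumulated.
void packQuads(const std::vector<Quad>& rasterized, std::vector<QQuad>& out) {
    QQuad current;
    for (const Quad& q : rasterized) {
        if (q.coverage == 0) continue;             // rasterizer already knows this
        current.quads[current.count++] = q;
        if (current.count == 4) {                  // full batch: ship it to a shader thread
            out.push_back(current);
            current.count = 0;
        }
    }
    if (current.count > 0) out.push_back(current); // partially filled tail
}
[/code]

Even in this simple form the bookkeeping is per-quad, which is exactly the extra pre-shading cost I'm alluding to.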

Maybe the fact that most triangles arrive as strips provides an easy way to optimise this. In theory it's needed even more there, because triangles that are smaller than the example triangle in fig 29 will end up wasting even more of a qquad. The packing implicit in triangle strips makes it easier to fill qquads.

Wasn't there a theory that NVidia packs triangles into batches?

Not sure what the right approach for attribute interpolation is. Once multiple triangles occupy a qquad the data structure for attributes associated with the qquad becomes more complicated and could become really rather large, unless this is deferred until shading in a JIT-interpolation fashion, again like NVidia does... :???:

Jawed
 
I thought Jawed was still referring to the rasterization process when he said "shading", since the 80 pixels in figure 29 obviously won't all be sent on to the pixel shading stage regardless of the rasterization algorithm employed. There should only be 7 quads, or 2 qquads, sent to the pixel shading stage for this triangle.
Actually I was referring to pixel shading on qquads. The default with this triangle is 5 qquads. The article doesn't talk about any optimisation into 2 qquads.

The only optimisation Abrash refers to is the removal of the 6th qquad (on the extreme left) because it's entirely empty. I'm not saying Intel won't pack quads into qquads, merely noting that this is a "conditional routing" (dynamic warp formation, in CUDA parlance) problem.

Jawed
 
Maybe someone can shed light on whether this would also affect multisampling, since the render targets have to be sampled a lot more fine-grained and in that case 32 bits might not prove enough. Or would the max. tile size just have to be reduced by a factor according to the MSAA level?
I think reduction will happen. The increased data volume associated with MSAA means the screenspace tile size (stored in L2) will shrink anyway, generally speaking.
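A back-of-the-envelope sketch of that shrinkage (the L2 budget and per-pixel formats are assumptions for illustration, not figures from the article):

[code]
#include <cmath>
#include <cstdio>

int main() {
    const int l2Budget   = 128 * 1024; // assumed bytes of L2 reserved for the tile
    const int bytesPerPx = 4 + 4;      // assumed 32-bit colour + 32-bit depth
    for (int msaa = 1; msaa <= 8; msaa *= 2) {
        int pixels = l2Budget / (bytesPerPx * msaa);
        // largest power-of-two square tile that fits the budget
        int side   = 1 << (int)std::floor(std::log2(std::sqrt((double)pixels)));
        std::printf("%dx MSAA: %5d pixels fit -> %3dx%d tile\n",
                    msaa, pixels, side, side);
    }
    return 0;
}
// Prints 128x128, 64x64, 64x64, 32x32: every 4x increase in per-pixel
// data halves the tile side, given power-of-two tiles.
[/code]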

Jawed
 
Once multiple triangles occupy a qquad the data structure for attributes associated with the qquad becomes more complicated
Don't see what is so complicated about using 4 pointers to quads of interpolated parameters.
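One way to read that, as a sketch (the type names are mine): interpolate per quad and let a packed qquad carry four pointers, one per quad, so mixing triangles adds nothing beyond the indirection.

[code]
#include <cstdint>

// Interpolated parameters for one 2x2 quad; 8 scalar attributes assumed
// here purely for illustration (the API allows far more).
struct QuadParams {
    float attr[8][4];        // [attribute][pixel within the quad]
};

// A packed 16-pixel batch: 4 quads, possibly from 4 different triangles.
struct PackedQQuad {
    uint8_t     coverage[4]; // per-quad coverage masks
    QuadParams* params[4];   // the "4 pointers" to each quad's parameters
};
[/code]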
 
Don't see what is so complicated about using 4 pointers to quads of interpolated parameters.
I'm thinking there may be an interaction here between the up to 128 scalar attributes per vertex, the timing of interpolation (before shading starts or while shading?) and the number of triangles in a qquad. In theory you can have 16 triangles in a qquad being simultaneously shaded, if you take packing to the extreme. Though I accept that for texturing reasons you'd prolly stick with only 4 triangles, even if each triangle is only covering a single pixel.

Is there a synthetic test out there that uses single-pixel-sized triangles?

---

I wonder if Swiftshader does rasterisation in a similar way to what Abrash has documented?

Jawed
 
Actually I was referring to pixel shading on qquads. The default with this triangle is 5 qquads. The article doesn't talk about any optimisation into 2 qquads.

True but there was an explicit disclaimer that he wasn't going into anything but rasterization. There was no mention of shading at all. So I don't think you can assume anything about what's gonna happen after coverage is determined. Like you said, it's pretty much expected that they build qquads from non-empty quads before shading.
 
unless this is deferred until shading in a JIT-interpolation fashion
Why wouldn't they BTW? Seeing as the interpolation is almost certainly done independently for each pixel it just doesn't make much sense to do it in the setup thread.
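A minimal sketch of what that deferral looks like (names assumed): the shader prologue evaluates each attribute's plane equation at the pixel, so only attributes the shader actually reads ever get interpolated.

[code]
// attr(x, y) = a*x + b*y + c, with a, b, c computed once per triangle at setup.
struct PlaneEq { float a, b, c; };

// Evaluated per pixel in the shader prologue; on Larrabee this is one
// fused multiply-add chain over a 16-wide vector of pixel positions.
inline float interpolateAttr(const PlaneEq& p, float x, float y) {
    return p.a * x + p.b * y + p.c;
}
[/code]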
 
PS. as for similarities and differences ... the inventive step in this regard is performing point-in-triangle tests for each pixel individually for parallelization at the pixel level, while performing binning with more serial but also more efficient code ... that's the inspiration, the rest is just perspiration (such as adding an extra level of binning in the hierarchy). Although I don't think it's a huge deal (hence my earlier post questioning how non-obvious the potential for parallelization of rasterization is).

Well, realistically speaking, I doubt we'll see anything embedded based on LRB anyway, or, turned around by 180 degrees, any high-end GPU from PowerVR, so it's rather moot to even attempt any comparisons. But once there, I'm not so sure PVR has killed as much ff hw in its SGX as Intel has in LRB, and I don't think they use fixed tile sizes anymore either; in fact I wouldn't be surprised if they've seriously revamped their HSR unit in their next generation.

Since you folks here are debating the possible efficiencies of the LRB sw rasterizer and/or sw TBR, I've got a better question: assuming LRB manages to come up with a sw rasterizer that is, if not on the same level as the other ff rasterizers, at least damn close to it, what is it going to look like if ATI/NVIDIA have implemented more than one geometry unit on each core?
 
True but there was an explicit disclaimer that he wasn't going into anything but rasterization. There was no mention of shading at all. So I don't think you can assume anything about what's gonna happen after coverage is determined. Like you said, it's pretty much expected that they build qquads from non-empty quads before shading.
My point is that shading performance hangs on the mother of all dynamic branching divergence penalties. Current GPUs have it bad if the shader has incoherent control flow. Without conditional routing this looks untenable in Larrabee when triangles are small.

Marco says that conditional routing is easy. If it's that easy, does that mean conditional routing is generally easy, any time that divergence is encountered? Or is this a special case of easy where there's no per pixel (fragment) state at this point in rendering, as pixels don't gain any state until setup is complete?

Maybe this just isn't worth dwelling on, given the 4x4 tiles, or smaller with MSAA.

Jawed
 
Jawed, you are making a big fuss out of it when probably all GPUs do it anyway.
This is not some general mechanism that you can apply on anything at anytime, it's just an optimization.
 
Since you folks here are debating the possible efficiencies of the LRB sw rasterizer and/or sw TBR, I've got a better question: assuming LRB manages to come up with a sw rasterizer that is, if not on the same level as the other ff rasterizers, at least damn close to it, what is it going to look like if ATI/NVIDIA have implemented more than one geometry unit on each core?

I just skimmed the article, but the recursive process to get the quad coverage of a triangle takes a number of recursive steps.
I haven't put much thought into what the cycle count would be, but this would probably take a fair number of cycles per triangle.
Of course, Larrabee's going to have a significant number of cores per chip and clock higher.

The multiple setup units on ATI/Nvidia hardware are an unknown quantity, and some of the patents show some interesting attempts at increasing throughput.
The setup units may also be charged with doing more than just coverage calculations, so the two schemes are not completely equivalent.

I'd suppose the high-end GPUs will still have peak rates in excess of Larrabee's.
The downside, as the article states, is that in times where setup rate exceeds other bottlenecks, those units will be twiddling their thumbs, while Larrabee's cores can just switch to other workloads.

Given the current fraction of die space given over to the setup pipeline, that could be something like 5% of the die.
If I had a *shrug* smiley, I'd use it.

The more interesting thing is the other bells and whistles a software approach might offer, like the reduced cost of calculating MSAA that was posited.
 
The more interesting thing is the other bells and whistles a software approach might offer, like the reduced cost of calculating MSAA that was posited.
Meh, like angle-dependent pixel sampling. I mean, a line at 45 degrees (relative to screen space) definitely doesn't need a gazillion samples to smooth it out the way one projected at a much steeper angle does... :oops:
 
It's a mathematically cheap way of shaving off a few cycles in other parts of the software pipeline and could be elaborated or modified. It does point out one avenue available to Larrabee that a fixed-function setup engine wouldn't offer.

The tradeoff is that going hog wild with extras that don't have commensurate cycle savings elsewhere would be a performance negative, since the effectiveness of Larrabee's parallel triangle setup across all the cores in software is inversely related to the complexity of the setup loop.
 
I just skimmed the article, but the recursive process to get the quad coverage of a triangle takes a number of recursive steps.
I haven't put much thought into what the cycle count would be, but this would probably take a fair number of cycles per triangle.
Of course, Larrabee's going to have a significant number of cores per chip and clock higher.
At the end of the article Abrash also mentions that triangles that fit in a 16x16 tile can directly skip a level, while very small triangles can be directly rasterized skipping the whole hierarchical process.
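Presumably selection logic along these lines (the thresholds and names are my guesses at what the article describes, not Abrash's code):

[code]
#include <algorithm>

enum class RasterPath { FullHierarchy, SkipToTile16, Direct };

// Pick the rasterization entry point from the triangle's bounding box:
// tiny triangles bypass the hierarchy entirely, mid-sized ones enter at
// the 16x16 level, and only large ones pay for the full descent.
RasterPath choosePath(int minX, int minY, int maxX, int maxY) {
    const int extent = std::max(maxX - minX, maxY - minY); // bbox size in pixels
    if (extent <= 4)  return RasterPath::Direct;
    if (extent <= 16) return RasterPath::SkipToTile16;
    return RasterPath::FullHierarchy;
}
[/code]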

The multiple setup units on ATI/Nvidia hardware are an unknown quantity, and some of the patents show some interesting attempts at increasing throughput.
The setup units may also be charged with doing more than just coverage calculations, so the two schemes are not completely equivalent.
Do you know any GPU that has multiple setup units? Multiple rasterization units on the other hand should be fairly easy to implement as long as they are independently assigned to different screen tiles.

Given the current fraction of die space given over to the setup pipeline, that could be something like 5% of the die.
If I had a *shrug* smiley, I'd use it.
No idea if that number is close to reality or not but keep in mind that we haven't seen any programmable rasterizer yet, while LRB should make practical a lot of stuff that right now only exists on graphics research papers.
 
At the end of the article Abrash also mentions that triangles that fit in a 16x16 tile can directly skip a level, while very small triangles can be directly rasterized skipping the whole hierarchical process.
Skipping one level in a multilevel process saves cycles, but that still leaves several steps that consume some number of cycles each.
The process of skipping the hierarchy for very small triangles still involves the necessary size check and then a branch.
I'd be curious to see how a code implementation would check for the very small case in a way that avoids too much overhead in the general case and avoids a branch mispredict.

Do you know any GPU that has multiple setup units? Multiple rasterization units on the other hand should be fairly easy to implement as long as they are independently assigned to different screen tiles.
Not any current GPU, just speculation and some possibly wishful thinking in the rumor mill.

No idea if that number is close to reality or not but keep in mind that we haven't seen any programmable rasterizer yet, while LRB should make practical a lot of stuff that right now only exists on graphics research papers.
Various estimates on this board came to the conclusion that the area of the units is small. Part of my guesstimate is based on the premise that increasing transistor budgets for other features with process transitions leads to a decrease in the proportion of die area left for the fixed-function section.
 