Pixel fill rate vs. texel fill rate

Tonyo, Saem

How do you know there is data-dependent branching in the fragment shaders - link? I figured this to be the case for the vertex processors, as Dave's preview mentions that each vertex unit has its own program storage. But is it really realistic to assume that all 128 of the pixel processors have their own program/temp storage?

Tonyo, I'm not sure if I understand what you mean with the tiles - here is what my guess was: take a tri, split it into 8x8 tiles. For the 64 pixels in each tile, run the same pixel program on each pixel inside the triangle.
Is this what you're saying?
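
In made-up C, roughly this (all the names and the edge-function test are just mine for illustration, nothing from the P10 material):

Code:
/* Rough sketch of the tile idea -- nothing here comes from 3Dlabs,
   the names and the edge-function test are just illustrative. */
#include <stdio.h>

#define TILE 8

typedef struct { float x0, y0, x1, y1, x2, y2; } Tri;

/* is pixel centre (px+0.5, py+0.5) inside the triangle? */
static int covers(const Tri *t, int px, int py)
{
    float x = px + 0.5f, y = py + 0.5f;
    float e0 = (t->x1 - t->x0) * (y - t->y0) - (t->y1 - t->y0) * (x - t->x0);
    float e1 = (t->x2 - t->x1) * (y - t->y1) - (t->y2 - t->y1) * (x - t->x1);
    float e2 = (t->x0 - t->x2) * (y - t->y2) - (t->y0 - t->y2) * (x - t->x2);
    return (e0 >= 0 && e1 >= 0 && e2 >= 0) || (e0 <= 0 && e1 <= 0 && e2 <= 0);
}

/* the "pixel program" -- in hardware this would be the same instruction
   stream for every covered pixel in the tile */
static void pixel_program(int px, int py) { printf("shade %d,%d\n", px, py); }

/* one 8x8 tile's worth of work for a single triangle */
static void shade_tile(const Tri *t, int tx, int ty)
{
    for (int j = 0; j < TILE; j++)
        for (int i = 0; i < TILE; i++) {
            int px = tx * TILE + i, py = ty * TILE + j;
            if (covers(t, px, py))
                pixel_program(px, py);  /* covered: this unit does real work */
            /* not covered: that unit just sits idle for this tile */
        }
}

int main(void)
{
    Tri t = { 0.0f, 0.0f, 6.0f, 0.0f, 0.0f, 5.0f };
    shade_tile(&t, 0, 0);   /* tile (0,0) of the screen */
    return 0;
}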

Saem, what confuses me the most in that diagram is the 4 "texture pipes". Each has a "setup" stage - does this mean each can handle pixels from a different triangle?
 
psurge,

Well, the thing is that there doesn't need to be that much of a "program" for the shading pipes - they might just get one instruction and then get another in the next clock; I'm not sure where the control logic is. All that needs to happen is they get a pixel, execute all the work necessary - whatever they're told to do - for that pixel and say "done". When "done" is said, they get more work, lather, rinse and repeat. It doesn't matter where that work comes from, if I understand things correctly - this is handled by the elusive control logic. >=|
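
Very roughly, the loop I'm picturing for one pipe (this is entirely made up, just to show the lather-rinse-repeat part):

Code:
/* Entirely made-up model of one shading pipe: take work, do whatever
   it says, report "done", repeat.  Where the work comes from is the
   control logic's problem, not the pipe's. */
#include <stdio.h>

typedef struct { int pixel_id; int num_instructions; } Work;

/* stand-in for the control logic handing out per-pixel work */
static int get_work(Work *w, int *remaining)
{
    if (*remaining == 0) return 0;      /* nothing left to do */
    w->pixel_id = *remaining;
    w->num_instructions = 4;            /* "what they're told to do" */
    (*remaining)--;
    return 1;
}

int main(void)
{
    int remaining = 3;
    Work w;
    while (get_work(&w, &remaining)) {                        /* get a pixel */
        for (int i = 0; i < w.num_instructions; i++)
            printf("pixel %d: instruction %d\n", w.pixel_id, i);
        printf("pixel %d: done\n", w.pixel_id);               /* say "done"  */
    }
    return 0;
}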

The "setup" stage is bugging me as well. It says it's house keeping, but I'm not sure if that's the whole story. I could be some sort of program setup and evaluation. Pixel operations might be rather numerous and some if not all will require some special provisions, this could be an area where these are taken care of. Perhaps, one can even program the "house keeping."

As for data-dependent branching, I'm not sure. It could be that the "setup" stage actually evaluates a program and then runs it. Again, this setup stage could be large or small; I'm not sure what to think of it right now. They might have basically recycled the RISC cores they used for the vertex shaders and use them here for the extra logic to handle housekeeping tasks. ARGH, what the heck does that thing do? *poke Mr Baumann*
 
psurge said:
How do you know there is data-dependent branching in the fragment shaders - link?

Well, I know it ;), but I don't have links to back it up :"(

psurge said:
But - is it really realistic to assume that all 128 of the pixel processors have their own program/temp storage?
The instruction storage could be shared across all the SIMD processors (Single Instruction), but I guess that would complicate the routing :?.

Note that all the processors in a tile work on the same primitive, so they all are executing the same program, or as you put it:

psurge said:
Tonyo, I'm not sure if i understand what you mean with the tiles - here is what my guess was : take a tri, split it into 8x8 tiles. For the 64 pixels in each tile, run the same pixel program on each pixel inside the triangle.
Is this what you're saying?

Yes, it's exactly that.

One of the slides says "Rasterize triangle into tiles" and then, with those tiles:
The 64 processor arrays through the pixel and texture pipelines are arranged in an 8x8 block, which is the basic unit of processing and memory transfer - 3Dlabs refer to this block as a 'tile' or 'patch'.
http://www.beyond3d.com/articles/p10tech/index.php?page=page3.inc
 
Tonyo - if the instruction storage is shared amongst the pixel units (64 of them), then I think it's likely that each unit is executing the same instruction every cycle (I've never heard of any kind of cache with 64 read ports). Data-dependent branching means that different pixels running the same program will potentially take different execution paths, i.e. be executing different instructions at a given clock cycle than neighbouring units.

I suppose, as Basic said in a previous thread, you can still make each pipe execute an identical instruction by issuing NOPs to the pipes whose pixels don't take the branch the other pipes are executing.

So if they are in fact processing 64 pixels at a time, it seems like an enormous waste of resources when a triangle is just a few pixels large - in that case fewer pipes with more ALUs each would be much more efficient (assuming my speculation is correct, of course).
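
To show what I mean by the NOP trick (this is just generic SIMD predication in C, not anything 3Dlabs have described):

Code:
/* Generic SIMD predication sketch: 64 lanes share one instruction stream,
   so "if (r < 0) r = -r; else r = r * 2;" becomes two passes with a
   per-lane mask -- lanes on the wrong side of the branch effectively NOP. */
#include <stdio.h>

#define LANES 64

int main(void)
{
    float r[LANES];
    int   took[LANES];                     /* per-lane branch predicate */

    for (int i = 0; i < LANES; i++)
        r[i] = (float)i - 32.0f;

    /* evaluate the condition once, before either side modifies r */
    for (int i = 0; i < LANES; i++)
        took[i] = (r[i] < 0.0f);

    /* pass 1: the "then" side -- lanes with !took[i] just burn the slot */
    for (int i = 0; i < LANES; i++)
        if (took[i]) r[i] = -r[i];

    /* pass 2: the "else" side -- now the other lanes burn the slot */
    for (int i = 0; i < LANES; i++)
        if (!took[i]) r[i] = r[i] * 2.0f;

    /* every lane paid for both sides, even though each pixel needed one */
    printf("r[0] = %g, r[40] = %g\n", r[0], r[40]);
    return 0;
}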
 
Tonyo said:
Note that all the processors in a tile work on the same primitive, so they all are executing the same program, or as you put it:

First off, you say "processors in a tile"? The tile is the data - do you mean that when a tile is being rendered, all the processors are working on the same primitive?

Also, what causes you to think this?
 
Saem - there are 64 pixels in a tile... 64 texture coordinate processors, and 64 pixel shader units.

What I was speculating and what I think Tonyo is confirming is that rendering occurs one triangle and one tile at a time, and that each pixel in the tile has a static association to a texture-coordinate/pixel processor pair.

I.e. the rasterizer outputs, for a tile, which pixels a given triangle covers.
Then the 8x8 "tile" of processors works on those pixels. Once they are done with this set of pixels, they process the next tile of pixels output by the rasterization stage.

In the diagram below, interpret x's to mean covered pixels <=> active pixel units.


Code:
xxx.....     
xxxx....
xxx.....        
xx......
x.......
 
psurge,

If I understand your idea correctly, it sounds like it's going to be rather inefficient.

To me it makes more sense to simply look at a patch. But then we're back to: what's up with the SIMD wording?
 
psurge said:
Tonyo - if the instruction storage is shared amongst the pixel units (64 of them), then I think it's likely that each unit is executing the same instruction every cycle (i've never heard of any kind of cache with 64 read ports).

First, note that I said the instruction storage could be shared across all the SIMD processors, so I'm not really confirming or denying anything about whether the instruction storage is shared or not.
Regarding the execution flow, isn't a SIMD layout, by definition, a layout with a single flow of execution for all the PEs in that unit, working on multiple data?

psurge said:
Data dependent branching means that different pixels running the same program will potentially take different execution paths in the program, i.e. be executing different instructions at a given clock cycle than neighbouring units.

The factor missing in this equation is the granularity at which you can have data-dependent flow of execution. In this case, and with all the available data, it could be concluded that the granularity is a tile. The other option I see is to allow data-dependent flow of execution at the PE level (but then we are no longer talking SIMD, but MIMD); even in that case, you would still have to wait for the last PE in this grid of PEs to finish before you could process the next tile.

psurge said:
What I was speculating and what I think Tonyo is confirming is that rendering occurs one triangle and one tile at a time, and that each pixel in the tile has a static association to a texture-coordinate/pixel processor pair.

What I can confirm is what I think you can derive from Wavey's article: primitives are split into 8x8 tiles and there's an 8x8 SIMD grid of processors, each PE working on one fragment of the tile, so if the primitive doesn't span all the fragments in the tile, you will end up with idle PEs.
This is a tradeoff of any parallel design, exactly the same as a texture cache fetching texels of the texture that you may never use at all.
But note that this says nothing about whether two different grids of SIMD processors have to be working on the same primitive.

EDIT: I'm being a little vague on all this, I know; please understand that I don't really know how much information on specifics I can publicly share :-m
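
To make the granularity point concrete with a toy example (nothing P10-specific in here):

Code:
/* Toy illustration of tile-granularity cost: even if every PE could take
   its own branch, the grid can't start the next tile until the slowest
   PE in the current one finishes. */
#include <stdio.h>

#define TILE_PIXELS 64

int main(void)
{
    int cycles[TILE_PIXELS];
    int worst = 0;

    /* pretend half the pixels take a short path (10 cycles) and the
       other half take a long path (40 cycles) */
    for (int i = 0; i < TILE_PIXELS; i++) {
        cycles[i] = (i % 2 == 0) ? 10 : 40;
        if (cycles[i] > worst) worst = cycles[i];
    }

    /* the whole 8x8 grid is tied up for 'worst' cycles on this tile */
    printf("tile costs %d cycles, not the 25-cycle average\n", worst);
    return 0;
}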
 
All right, so let me take another whack at this.

These 64 processors are spread among the 4 pipes, right? So what you're (Tonyo) saying is that one could have idle processing elements in a pipe because part of the tile it's processing isn't part of the primitive that is set up for that pipe. This is because each pipe can only work on one primitive, but the 4 pipes can work on different primitives because each has its own setup stage, which - according to Dave's tech preview - does the necessary primitive plane calculations. All of this within one tile.
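
Or, as throwaway code (the 4x16 split and the coverage are pure guesswork on my part, just to show the idea):

Code:
/* Guesswork model of the above: 4 pipes, each with its own setup stage,
   so each pipe can hold a different primitive while the 64 PEs between
   them chew through one tile.  Nothing here is from 3Dlabs. */
#include <stdio.h>

#define PIPES        4
#define PES_PER_PIPE 16     /* 4 * 16 = 64 PEs, one per pixel of an 8x8 tile */

typedef struct { int id; } Primitive;

/* each pipe's own "setup": primitive plane calculations for its primitive */
static void pipe_setup(int pipe, const Primitive *p)
{
    printf("pipe %d: setup for primitive %d\n", pipe, p->id);
}

static void pe_shade(int pipe, int pe, const Primitive *p, int covered)
{
    if (covered)
        printf("pipe %d, PE %d: shade pixel of primitive %d\n", pipe, pe, p->id);
    /* not covered: this PE idles -- its pixel isn't part of the pipe's primitive */
}

int main(void)
{
    /* within one tile, the 4 pipes might hold up to 4 different primitives */
    Primitive prim[PIPES] = { {7}, {7}, {8}, {9} };

    for (int pipe = 0; pipe < PIPES; pipe++) {
        pipe_setup(pipe, &prim[pipe]);
        for (int pe = 0; pe < PES_PER_PIPE; pe++)
            pe_shade(pipe, pe, &prim[pipe], pe % 3 != 0 /* fake coverage */);
    }
    return 0;
}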
 
Saem said:
These 64 processors are spread among the 4 pipes, right? So what you're (Tonyo) saying is that one could have idle processing elements in a pipe because part of the tile it's processing isn't part of the primitive that is set up for that pipe. This is because each pipe can only work on one primitive, but the 4 pipes can work on different primitives because each has its own setup stage, which - according to Dave's tech preview - does the necessary primitive plane calculations. All of this within one tile.

Couldn't have said that better myself :)
 
Then how come there is also a setup stage at the end of the rasterization stage (right before the pixel stage)? And why are they called "texture pipes"?

Anyhow - seems like Tonyo is in the know - can each of these 4 pipes run different shader programs?

Regards,
Serge
 
psurge,

A patch is created in one area - it has a setup stage. Then the patch is sent to another area where stencil, GID and depth are all taken care of - this also has a setup stage. Finally, it's sent to the pipes, each of which has a setup stage that does the primitive plane calculation.

The first two setup stages are covered in the large block diagram at the end of the first page. The last part is covered in the "texture processor" section of the article on the third page, IIRC.
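
In case that helps, here's the flow as I read it, in throwaway code (the stage names come from the article and diagrams; everything else is invented):

Code:
/* Flow of the three "setup" stages as I read the diagrams.  The stage
   names are from the article; the functions themselves are invented. */
#include <stdio.h>

typedef struct { int id; } Patch;

static void patch_setup(Patch *p)       { printf("patch %d: setup #1 (patch creation)\n", p->id); }
static void stencil_gid_depth(Patch *p) { printf("patch %d: setup #2 (stencil/GID/depth)\n", p->id); }
static void pipe_plane_setup(Patch *p, int pipe)
{
    printf("patch %d: setup #3 in pipe %d (primitive plane calculations)\n", p->id, pipe);
}

int main(void)
{
    Patch p = { 1 };
    patch_setup(&p);                      /* first area */
    stencil_gid_depth(&p);                /* second area */
    for (int pipe = 0; pipe < 4; pipe++)  /* finally, the 4 pipes */
        pipe_plane_setup(&p, pipe);
    return 0;
}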
 