Pixel fill rate vs. texel fill rate...

Pixel fill rate or texel fill rate?

  • A GPU of Configuration X (tell us below)

    Votes: 0 0.0%
  • A GPU with 8 pixel pipes and 1 TMU per pipe

    Votes: 0 0.0%

  • Total voters
    191
Given today's games, and perhaps games coming out over the next year (but no further)...

All theoretical cards have the same core clock, comparable memory bandwidth (by whatever means), and similar pixel/vertex shader performance. Would you rather have...
 
There is only one thing that you didn't mention: Independent of textures, how much math can one pixel pipeline perform?

If the 8x1 pipelines can actually perform all the same math as each of the 4x4 pipelines (as appears to be the case with the R300...), then the 8x1 would definitely be better, for use with anisotropic filtering.
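For reference, the usual back-of-the-envelope numbers behind an "AxB" configuration can be sketched as below. This is only a rough sketch of theoretical peaks: the 300 MHz core clock is a made-up value (the poll only says all cards share the same clock), and `fill_rates` is a hypothetical helper, not anything from a vendor spec.

```python
def fill_rates(pipes, tmus_per_pipe, core_mhz):
    """Return theoretical peak (pixel, texel) fill rates in millions per second."""
    pixel = pipes * core_mhz                  # one pixel per pipe per clock
    texel = pipes * tmus_per_pipe * core_mhz  # one texel per TMU per clock
    return pixel, texel

CORE_MHZ = 300  # hypothetical clock, identical for every card as the poll assumes

for name, pipes, tmus in [("8x1", 8, 1), ("4x4", 4, 4)]:
    pixel, texel = fill_rates(pipes, tmus, CORE_MHZ)
    print(f"{name}: {pixel} Mpixels/s, {texel} Mtexels/s")
```

At the same clock, the 8x1 layout has twice the pixel rate and the 4x4 layout twice the texel rate, which is the tradeoff the poll is asking about.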
 
Call me crazy, but I wouldn't mind seeing a good implementation of 16x0. If you had good loopback capabilities, 16x0 could do a lot of damage in current games. Of course, the tradeoff would most definitely be trilinear/anisotropic filtering that is slower, or harder to implement at decent speeds.
 
Multigl2,

Not only would you need 16X0, you'd likely need multiple triangle setup engines and the setup in a non-fixed rendering pattern (say 4*4). Otherwise the diminishing returns would really hose your performance.

Personally, I like the super generalized P10 architecture. General execution units, geared for the problems you'll usually encounter.
 
Most definitely, Saem... but from my just-toying-with-shaders perspective:

It would be really nice to see what a 16x0 math powerhouse could do... I mean, it should technically (setup and bandwidth permitting) be as fast as 8x1 in multitexturing duties, but it could do some serious shaders if the pipelines were set up nicely. For instance, if the setup permitted, you could treat it as a 4x2 card with 2 free pipes to help process shaders :) Again, setup permitting.
 
Sorry for being ignorant but how exactly does a 16*0 setup work? I mean wouldn't that end up being a bunch of untextured polies (obviously not, so please explain ;) ).
 
The poll is too simplistic IMO, but if I have lots of shaders in my game, I'd probably prefer a card with more pipes. However, given the differences in architectures (which will probably always exist), the bottom line is performance: it won't matter to me whether it is 8x1 or 4x4 or whatever, since this is transparent to a developer.
 
Sorry for being ignorant but how exactly does a 16*0 setup work? I mean wouldn't that end up being a bunch of untextured polies (obviously not, so please explain ).

Think PS2. It can do something on the order of 2400 Mpixels/s, all untextured. You cut that number in half when adding a texture layer. The point of this setup is doing things that don't involve texturing (I'm guessing stencil buffers would be one); this ends up being more efficient since you're not using the TMU anyway. People can argue that the returns provided by a TMU are huge. One could allow for significant loopback, and this would be less of a problem; a simplification of the circuit could also lead to higher clocks. Though I'm guessing TMUs aren't your big inhibitors.
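To put rough numbers on the loopback idea, here is a toy model. It assumes (my assumption, not a description of any real chip) that a 16x0 part pays one extra pass per texture layer, which matches the "cut that number in half" figure for a single layer, while an 8x1 part at the same clock gets its first layer for free and pays a pass for each additional one. Both function names are hypothetical.

```python
def mpix_16x0(peak_mpix, texture_layers):
    # Assumed loopback model: every texture layer costs one extra pass per pixel.
    return peak_mpix / (1 + texture_layers)

def mpix_8x1(peak_mpix, texture_layers):
    # One TMU per pipe: the first layer is free, each further layer is another pass.
    return peak_mpix / max(1, texture_layers)

PEAK_16X0 = 2400  # Mpixels/s untextured, the figure quoted above
PEAK_8X1 = 1200   # same clock, half the pipes (assumption)

for layers in range(4):
    a = mpix_16x0(PEAK_16X0, layers)
    b = mpix_8x1(PEAK_8X1, layers)
    print(f"{layers} layers: 16x0 {a:.0f} Mpix/s, 8x1 {b:.0f} Mpix/s")
```

Under these assumptions, the 16x0 layout never falls behind 8x1 and wins clearly on untextured work, but the model ignores filtering cost, setup limits, and bandwidth entirely.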
 
8 pixel pipes and two TMUs per pipe

I do have to agree with Saem though, P10's architecture is really flexible in many ways and is targeted towards generalization of everything.

I'm very interested in the flexibility of NV30's architecture, since it's been suggested that it might be even more flexible than P10's! (Not in all areas, obviously, but in most of them.)
 
8*2 would then require very high memory bandwidth to take advantage of, something a lot higher than the 19 GB/sec of the Radeon 9700 Pro.
 
Yawn. Let's get back on topic, shall we?

Sure Dave! Now where were we? Oh yeah, 8 pipes and 2 TMUs on each pipe would be great with approximately 25-30 GB/s of bandwidth and, of course, a 256-bit memory bus.
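As a rough sanity check on those bandwidth figures, peak memory bandwidth is just bus width times effective data rate. The sketch below assumes a 620 MHz effective (DDR) memory clock for the 9700 Pro, which is my assumption and lands near the 19 GB/s quoted above; the helper names are made up for illustration.

```python
def bandwidth_gb_s(bus_bits, effective_mhz):
    """Peak bandwidth in GB/s: bytes per transfer times transfers per second."""
    return bus_bits / 8 * effective_mhz * 1e6 / 1e9

def required_effective_mhz(bus_bits, target_gb_s):
    """Effective memory clock (MHz) needed to reach a target peak bandwidth."""
    return target_gb_s * 1e9 / (bus_bits / 8) / 1e6

# Assumed 256-bit bus at 620 MHz effective, close to the 19 GB/s figure above.
print(f"{bandwidth_gb_s(256, 620):.2f} GB/s")

for target in (25, 30):
    mhz = required_effective_mhz(256, target)
    print(f"{target} GB/s on a 256-bit bus needs about {mhz:.2f} MHz effective")
```

On a 256-bit bus, 25 to 30 GB/s works out to roughly 781 to 938 MHz effective, so the wish above is mostly a wish for much faster memory.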
 
Question on the pixel pipes of the p10.

Notice that they rasterize tris into 8x8 tiles (and perform visibility culling at this level).

On top of that they have 64 (= 8*8) texture coordinate processors and 64 pixel shading ALUs.

To me this says: 64 pipe card, with each pipe locked to a specific pixel in an 8x8 tile? I haven't seen any claims that p10 can do data-dependent branching in pixel programs (there does appear to be some form of loop support for texture sampling), or that it can handle programs of arbitrary length, or that its pixel pipes can operate on arbitrary pixels, pixels from different triangles, or even pixels with different shaders.

IMO if this kind of thing were possible with p10, wouldn't the performance numbers reflect it?

(Before you all say, 64 pipes! no way! - note that the p10 ALUs are not SIMD - i.e. they process 1 float/int at a time as opposed to 4.)

However, they do describe their programmable units as "SIMD vertex texture and pixel arrays". That would tend to indicate that each ALU is executing the same instruction as all the other ones each cycle.

So why exactly does everyone think p10 is "so flexible" compared to say r300?
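To make the "64 pipes, but scalar" point concrete, here is a tiny sketch of per-clock arithmetic throughput. The 8 pipes with 4-wide ALUs used for comparison are a hypothetical configuration for illustration only, not a claim about how r300 is actually built.

```python
def component_ops_per_clock(alus, lanes_per_alu):
    """Scalar (single float/int) operations issued per clock across all ALUs."""
    return alus * lanes_per_alu

def vec4_ops_per_clock(alus, lanes_per_alu):
    """Equivalent number of full 4-component operations per clock."""
    return alus * lanes_per_alu / 4

configs = [
    ("64 scalar ALUs (as described for p10)", 64, 1),
    ("8 pipes with 4-wide ALUs (hypothetical)", 8, 4),
]
for name, alus, lanes in configs:
    comp = component_ops_per_clock(alus, lanes)
    vec4 = vec4_ops_per_clock(alus, lanes)
    print(f"{name}: {comp} component ops/clk, {vec4:.0f} vec4 ops/clk")
```

So 64 scalar ALUs are worth about 16 "pipes" when every operation is a full 4-component vector, and look better than that when shaders mostly do scalar math.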
 
psurge said:
So why exactly does everyone think p10 is "so flexible" compared to say r300?

Points to marketing material from 3DLabs. It says so right there. :p

--|BRiT|
 
psurge said:
Question on the pixel pipes of the p10.
[...]
To me this says: 64 pipe card, with each pipe locked to a specific pixel in an 8x8 tile? I haven't seen any claims that p10 can do data-dependent branching in pixel programs (there does appear to be some form of loop support for texture sampling), or that it can handle programs of arbitrary length, or that its pixel pipes can operate on arbitrary pixels, pixels from different triangles, or even pixels with different shaders.
[...]
However, they do describe their programmable units as "SIMD vertex texture and pixel arrays". That would tend to indicate that each ALU is executing the same instruction as all the other ones each cycle.

So why exactly does everyone think p10 is "so flexible" compared to say r300?

Because it is :).
Yes, P10 has data-dependent branching and looping in the fragment shader. Regarding the relationship between shaders and pixels/fragments: at rendering time, a primitive (say, a triangle) is decomposed into the tiles that the projected 2D primitive touches, and the shaders are run for each tile. In that sense, the shader cannot displace pixels around the screen, and the shader run is the same for the whole primitive.
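The tile decomposition described above can be sketched roughly like this. It is a conservative bounding-box version, assuming non-negative integer pixel coordinates and 8x8 tiles; it is not P10's actual rasterizer, which would presumably also reject tiles inside the box that the triangle never touches. `tiles_touched` is a hypothetical helper.

```python
TILE = 8  # tile edge in pixels, per the 8x8 tiles mentioned in the thread

def tiles_touched(tri, tile=TILE):
    """Return (tile_x, tile_y) for every tile overlapping the triangle's bounding box.

    Conservative sketch: assumes non-negative integer pixel coordinates and
    may include tiles the triangle itself does not actually cover.
    """
    xs = [x for x, _ in tri]
    ys = [y for _, y in tri]
    tx0, tx1 = min(xs) // tile, max(xs) // tile
    ty0, ty1 = min(ys) // tile, max(ys) // tile
    return [(tx, ty) for ty in range(ty0, ty1 + 1) for tx in range(tx0, tx1 + 1)]

triangle = [(2, 3), (20, 5), (9, 17)]  # made-up screen-space vertices
for tx, ty in tiles_touched(triangle):
    # The same shader would be run once per tile for this primitive.
    print(f"run shader on tile ({tx}, {ty}) covering pixels "
          f"x {tx * TILE}-{tx * TILE + TILE - 1}, y {ty * TILE}-{ty * TILE + TILE - 1}")
```

Each tile stays bound to its own screen region, which is the sense in which the shader cannot move pixels around.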

From Wavy's P10 preview:
The maximum number of instructions that the vertex processor can handle at a time is 256 instructions (per unit); but, as mentioned before, the processors can use loops and subroutines so it can be much more efficient in the use of the 256 instructions.
http://www.beyond3d.com/articles/p10tech/index.php?page=page2.inc

I haven't been able to find any source disclosing the number of instructions in any of the fragment-pixel units (coordinate, shader, address and pixel), though :(
 
psurge,

In this thread over here, I asked Dave Baumann about whether the "pixel pipes" were fixed, and he felt that they weren't.

When looking at the diagram at the end of the page here, which describes the P10 microarchitecture, it seems that it is possible to load more than one triangle and have the pixel processing (of course this will take more cycles). I feel this is the case because, as Dave mentioned in his P10 technology preview, the P10 uses a lot of multilevel cache, so the P10 could easily have the ability to cache a few tiles or patches. As pixels are processed, the cache (FIFO buffer) would spit out another pixel onto the chopping block.

As for the "SIMD arrays", this could be very much like the vertex processor, where this is simply an abstracted look and, in actuality, the pipelines are independently executing.
 