GeForce FX: 8x1 or 4x2?

Discussion in 'General 3D Technology' started by Dave Baumann, Feb 10, 2003.

  1. demalion

    Veteran

    Joined:
    Feb 7, 2002
    Messages:
    2,024
    Likes Received:
    1
    Location:
    CT
    pcchen,

    That sounds like my description of 8 register sets with 4 effective sets of processing units (proxel pipelines), with the distinction that each unit is restricted to use by one of only two pixels. This is more of a scheduling nightmare, since if you do the same processing on each pixel you'll always get a stall unless you've scheduled that occurrence away. But, let me step back from my assumptions a bit:


    For the sake of my description, a SIMD unit capable of processing the output for "more than one" pixel simultaneously is "more than one" processing unit for that circumstance.

    What is a "pixel unit"? I'm assuming you mean what I'm calling a processing unit above, but do you mean a complete set of processing units able to handle all shader functions? It doesn't seem to make sense to transfer one pixel's processing to another in that case, so I'd presume not, and will continue with my assumption.

    So, are you saying there are (in effect) 4 instances of the complete shader functionality, and referring to a "pixel unit" as something able to perform certain operation(s) on the two pixels in your "4x2" example? If so, as I mentioned at the beginning, that fits my "8 sets of registers, 4 proxel pipelines" description except with even more restricted scheduling than I originally proposed, and I wonder if there are significant transistor savings (in contrast to allowing all units to be freely allocated over multiple proxel pipelines) in return for increasing the difficulty of optimization.
    It seems to me that significant opportunities to schedule more complex (non-repetitive) shaders effectively would likely be removed (since optimization opportunities would be limited to the 2 pixel space instead of the virtual 8 pixel space if all units could be freely allocated).

    If you replicate registers for two "proxels" in one pixel pipeline, the only benefit I can see in this description of "4x2" in terms of transistor savings is if the ability to freely allocate the processing units to only one of two different pixels saves significant complexity (i.e., for example to facilitate branching).

    Are you saying it is something else I haven't quite figured out yet?

    In my description, you could stall a pixel easily if you repeat the use of the same processing unit over several pixels, as the processing capability would be saturated and you'd only have 4 pipelines ("4x1" you'd call it if I understand correctly) for the duration of the stall (what would really suck is if fp32 both stalled and saturated... :-? ).

    Note also that it is my understanding that the R300 acts like an 8 proxel pipeline lock-step processor... I think it would be a matter of the organization of the pixels being rendered (4x2? 2x4? I could swear someone actually said at some point) and of statistics whether it ever performed lower than the "4x2", as you are calling it, for non-branching shaders.
     
  2. Dave H

    Regular

    Joined:
    Jan 21, 2003
    Messages:
    564
    Likes Received:
    0
    Hmm...so what you're saying is that a "4x2" shader pipeline really behaves sort of like two "4x1" pipelines in series. At first I thought this was a silly idea, because it's something you would never, ever see in a CPU. (Why put functional units in series when you can put them in parallel?) I know GPUs are highly pipelined, but putting one functional unit after another??

    Thinking about it more, it seems like my problem was failing to realize an important difference between a GPU and a superscalar CPU: CPUs work on instructions; GPUs work on pixels. In a superscalar CPU, fetch and execute are decoupled; in a GPU, they're linked in rigid pixel pipelines, because you know each pixel is not dependent on another. In a CPU there's no concept of one instruction needing to do two things at once (ignoring VLIW, in which words are made up of non-dependent instructions), only separate instructions which may be dependent on each other (in which case the compiler and/or the OoO engine kick in to separate them). In a GPU, a pixel can carry lots of instructions together.

    The point is, where the smartest design for a CPU is to keep fetch resources to a minimum and present all available functional units in parallel to a unified buffer of all the instructions that need to use them, in a GPU you have separate fetch functionality for every pixel pipeline and thus it makes sense to put execution units in series rather than in parallel.

    I had assumed 4x2 meant the two ops/lookups would happen in parallel in order to save state. But considering pixels must drag their state along with them through the huge GPU pipeline, I suppose an extra couple stages isn't going to hurt. I had figured that if you expend the transistors to save that extra state you might as well go to an 8x1, but that disregards the savings in fetch and retire units (again, they're decoupled on a CPU so it doesn't matter) and possibly from saving on a bit of math due to having the second TMU/execution unit work on the same pixel as the first (certainly in the case of the TMU).

    Finally, the big point is that while the unit of work in a GPU is the pixel, you can still resolve dependencies within a series of instructions that work on that pixel. This is simple in hindsight because each pixel has its own separate instruction stream, but when you're used to thinking of CPU pipelines, where it's a lot more complicated (not least because the results of the instructions themselves affect state, instead of merely being guaranteed temps contributing to the calculation of a pixel which is the only piece of permanent state), it's not that obvious.

    Um, so this is all a giant rambling way to say: you're right, I'm wrong, and I should have gone to bed a long time ago. Well, at least I learned something new...
     
  3. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,018
    Likes Received:
    582
    Location:
    Taiwan
    I'll make it clear: a 4x2 architecture has 8 complete pixel shading units. Any one of them is capable of doing any color blending instruction (not necessarily in one cycle, though). They need to maintain their own non-constant register file.

    Since you have two units per pixel pipeline, you have two ways of using them: the first way is to connect them into a lock-step pipeline; the second way is to use the first one to handle the first pixel and the second one to handle the second pixel (on different clock cycles).

    In the DX8 era, pixel shading units were relatively cheap (compared to interpolators for color and Z, and other units required for per-pixel processing), so it made sense to put two pixel shading units in a pixel pipeline. However, with floating point, pixel shading units are going to be expensive. Therefore, one may want to increase the number of pixel pipelines while maintaining the number of pixel shading units.
     
  4. demalion

    Veteran

    Joined:
    Feb 7, 2002
    Messages:
    2,024
    Likes Received:
    1
    Location:
    CT
    So, pcchen:

    Your "pixel pipelines" = my "pixel registers" (i.e., separate state for each pixel), then?

    ...and...

    Your "pixel shading unit" = my "proxel pipeline"?
     
  5. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,018
    Likes Received:
    582
    Location:
    Taiwan
    Hmm... I think that's correct? I was a bit confused by your post :oops:

    I don't know how R300's 8 pixel pipelines are designed. It is probably a 4x2 or 2x4 fused design. It could also be a double 2x2 design, but I doubt that.
     
  6. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    955
    Likes Received:
    52
    Location:
    LA, California
    pcchen - I'm a little confused, hope you have the time to explain :)

    What exactly do you mean by a "pixel shading unit"? (a set of execution units, execution units with register file, etc...)

    Are you saying that each pipe can issue 1 instruction per clock, but has context for up to N different pixels and issues from the N contexts in round robin fashion (thus hiding up to N cycles of latency)?
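
    As a rough illustration of the round-robin idea in the question above (a toy model, not a description of any real chip): one pipe issues at most one instruction per clock, cycling strictly through N pixel contexts, and each instruction is assumed to depend on the previous instruction of the same pixel. The function name, parameters, and the strict round-robin policy are all assumptions made for the sketch.

```python
def simulate_round_robin(num_contexts, instrs_per_pixel, latency):
    """Toy model: one pipe, strict round-robin issue over pixel contexts.

    Assumes every instruction depends on the previous instruction of the
    same pixel, so a context cannot issue again until `latency` cycles
    after its last issue.
    Returns (cycles_until_last_issue, wasted_issue_slots).
    """
    remaining = [instrs_per_pixel] * num_contexts  # instructions left per pixel
    ready_at = [0] * num_contexts                  # earliest cycle each context may issue
    cycle = 0
    stalls = 0
    while any(remaining):
        ctx = cycle % num_contexts                 # strict round-robin pick
        if remaining[ctx] > 0:
            if cycle >= ready_at[ctx]:
                remaining[ctx] -= 1                # issue one instruction
                ready_at[ctx] = cycle + latency    # result is back `latency` cycles later
            else:
                stalls += 1                        # previous result not back yet: slot wasted
        cycle += 1
    return cycle, stalls


if __name__ == "__main__":
    # Same work per pixel (3 instructions, 4-cycle latency), more contexts.
    for n in (1, 2, 4):
        print(n, simulate_round_robin(n, instrs_per_pixel=3, latency=4))
```

    In this model the wasted slots disappear once the number of contexts reaches the latency, which is the "hiding up to N cycles of latency" behaviour being asked about.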

    Serge
     
  7. UncleSam DL iXBT

    Newcomer

    Joined:
    Feb 27, 2003
    Messages:
    36
    Likes Received:
    0
    FX pixel pipe

    It seems that alexok likes to translate my sentences from the iXBT forum and post them here...

    Maybe it is better if I start writing my opinion myself :lol:

    I am now investigating the architecture of the pixel pipe. As was already quoted, it is a kind of extremely wide VLIW (let's call it a megapipe :wink:), so there is no need to always reach for a scheme like 8x1 or 4x2, or a more stupid analogy like 4(1+1).

    The main question is how many pixels and temporary register sets this megapipe tracks when it runs through the shader microcode. Until now I thought it was 8.

    (The strange gain for paired TMUs can be explained in terms of tex/arithmetic instruction pairing for integer or stage execution, as used by the microcode compiler in the driver. This is proved by the absence of the difference that a strictly paired scheme would have to show for 5 vs 6 and 7 vs 8 textures, because here you can interleave tex and calc commands in a more sophisticated way and optimize this!!!)

    Now, after deeper study, this is in question: it may be that there are only 4 FULL register sets, but in some cases you can compile microcode that runs through and emits 8 values (for 8 pixels) in parallel.

    So I simply count the constraints that we have for this pipeline:

    1. a big common temporary register set for storing probably 4 or 8 full ps20 register sets
    2. 8 load/store/tex access/move execution units, so you can probably write out only 8 different values per clock from a shader, hence 8 Z or 4 Z+C. This also means 8 accesses to the texture fetching/filtering blocks per clock
    3. 32 or 64 (seems to be 32) I12 ALUs
    4. 16 or 32 (seems to be 16) F16 ALUs that can be paired into 8 or 16 F32 ALUs

    The driver can compile highly optimized code for different shaders or stage setups, but it seems the main constraint is that we have slots in the VLIW code, and each slot can hold 2 tex/load/store/move operations for a [4] register, or one int12[8] (== two int12[4]), or one f16[4], or one f32[2] (!) operation.

    Operations probably have a per-component definition of swizzle and math op, so you can sometimes compile two or even more (??) shader commands into one such slot:

    c[3]=a[3]+b[3]
    d[1]=e[1]-f[1]

    can sometimes be packed into one micro-op, and so on.
    All of this, and everything else, is decided by the compiler in the driver!
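
    To make the packing idea concrete, here is a minimal sketch of how a compiler might co-issue narrow operations in one 4-component slot. Everything in it is hypothetical (the `Op` record, the 4-lane slot width, the greedy in-order policy); it only illustrates the 3-component add plus 1-component subtract example above and says nothing about what the real driver does.

```python
from dataclasses import dataclass

LANES_PER_SLOT = 4  # assumption: one slot covers a 4-component register


@dataclass(frozen=True)
class Op:
    dst: str      # destination register name
    srcs: tuple   # source register names
    width: int    # number of components the op touches


def pack(ops):
    """Greedy, in-order packing of ops into slots.

    An op joins the most recent slot only if enough lanes are free and it
    neither reads nor overwrites a register written in that same slot.
    """
    slots = []
    for op in ops:
        if slots:
            last = slots[-1]
            used = sum(o.width for o in last)
            written = {o.dst for o in last}
            independent = not (written & set(op.srcs)) and op.dst not in written
            if used + op.width <= LANES_PER_SLOT and independent:
                last.append(op)
                continue
        slots.append([op])  # start a new slot
    return slots


if __name__ == "__main__":
    program = [
        Op("c", ("a", "b"), 3),  # c[3] = a[3] + b[3]
        Op("d", ("e", "f"), 1),  # d[1] = e[1] - f[1]
        Op("g", ("c", "d"), 4),  # uses both results, so it needs a later slot
    ]
    for i, slot in enumerate(pack(program)):
        print(i, [o.dst for o in slot])
```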

    So the count of "virtual pixel pipes" can be anything; it is only a logical thing now.

    Not only 8 or 4,

    but maybe 2 or 1 if there is a need for that (I can't imagine such a case yet with the current capabilities of shaders, but maybe such a scheme can be generated for internal OGL purposes).

    What about pairing of texture fetches?

    For example, it can try to represent texture stages like (tex,tex); (i12[4],i12[4]) or use more intermixed schemes for 5 or more textures.

    But we know that in modern games the use of 3 is rare, so maybe the compiler prefers grouping by two tex loads because it gives the texture units additional time to fetch the next values.

    It also seems that we can emit 8 32-bit values per clock,
    so it may be slowed down considerably by an F32[4] or F16[4] render target.

    Discuss?
     
  8. Sabastian

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    991
    Likes Received:
    2
    Location:
    Canada
    Hi UncleSam DL iXBT, 8)

    OK, first off, from what I am reading, you are saying that the NV30 pixel pipeline is a logical one, without any fixed pipeline like the Radeon 9700's 8x1. Or not? You are saying that it has no fixed pixel pipeline and that it uses a "virtual pixel pipeline" that can emulate any number of pixel pipes you like, not relying on a fixed pixel pipeline structure at all? Please do correct me if I am wrong here.

    I have struggled through this entire thread and ended with you. :D While I do realise that English is not native to you, I found your explanation even more cryptic. No insult intended. Thanks for your input here.
     
  9. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,018
    Likes Received:
    582
    Location:
    Taiwan
    The "pixel shading unit" I talked about is a unit capable of running one pixel shader instruction, like a register combiner (register combiner is actually a bit more powerful).

    Since a pixel pipeline can compute only one pixel state per cycle, it can issue only one pixel shader per cycle at most. However, if you have two pixel shading units per pixel pipeline, you'll be able to issue a two-instruction pixel shader per cycle.
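
    A back-of-the-envelope version of that statement, as a sketch only: assume every shading unit retires one instruction per clock, the units of a pipe are chained in lock-step on the same pixel, and nothing else (texturing, bandwidth) is a bottleneck. The function name and parameters are made up for the illustration.

```python
from fractions import Fraction
from math import ceil


def pixels_per_clock(pipelines, units_per_pipe, shader_instrs):
    """Peak pixels per clock for a `pipelines` x `units_per_pipe` layout.

    A pipe needs ceil(instrs / units) clocks per pixel, and never fewer
    than one, because a pipe starts at most one pixel per clock.
    """
    clocks_per_pixel = max(1, ceil(shader_instrs / units_per_pipe))
    return Fraction(pipelines, clocks_per_pixel)


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        print(n, "instr:",
              "4x2 ->", pixels_per_clock(4, 2, n),
              "| 8x1 ->", pixels_per_clock(8, 1, n))
```

    Under these assumptions the two layouts tie for even instruction counts and the 8x1 layout is ahead for odd ones, which is the trade-off being argued about in this thread.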


    And I have a question for UncleSam DL iXBT :)
    If NV30 can write 8 values per cycle (8 Z or 4 Z+C), why can't it write 8 C? IIRC 3DMark's fillrate test does not use Z buffer.
     
  10. tb

    tb
    Newcomer

    Joined:
    Feb 7, 2002
    Messages:
    241
    Likes Received:
    0
    Location:
    Germany / Thuringia
    Wrong. Two sided stencil extension is/was supported in the alpha, you could deactivate it and see the poly count goes up...

    Thomas
     
  11. UncleSam DL iXBT

    Newcomer

    Joined:
    Feb 27, 2003
    Messages:
    36
    Likes Received:
    0
    pixel megapipeline

    Sabastian

    Yep, you understood it all right.

    One very wide, flexible pipeline with many ALUs that can work simultaneously on a set of pixels.

    In R300 you always have an 8x1 scheme. Even if it is also a kind of wide VLIW unit array, you can talk about pipelines because the scheme is always the same and never changes, and a TMU is coupled with its own logical/physical pixel pipe. In NV30 it is not.

    As I heard from ATI guys, R300 has stages in something like a 4*4 scheme, i.e. it is more like a big 1.4 shader with 4 phases of up to 4 texture fetches each.

    For that you pay with delays when you do dependent texture fetching (more than one phase). From that comes the DX9 restriction of no more than 4 levels of dependency and no more than 16 fetches.

    The shader is split into phases automatically by the compiler, not by hand as in 1.4.
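
    A minimal sketch of what such automatic phase splitting could look like, assuming a phase boundary is needed whenever a texture fetch uses coordinates that depend on an earlier fetch. The tuple format, register names, and the function are hypothetical; this is not ATI's compiler, just the dependency-level idea written out.

```python
def split_into_phases(program, inputs):
    """Assign each instruction to a phase by texture-dependency depth.

    `program` is a list of (dst, kind, srcs) tuples in program order, where
    kind is "tex" or "alu" and every instruction has at least one source.
    `inputs` are registers available at depth 0 (interpolants, constants).
    Returns {phase_number: [dst, ...]} with phases numbered from 1.
    """
    depth = {name: 0 for name in inputs}
    phases = {}
    for dst, kind, srcs in program:
        d = max(depth[s] for s in srcs)
        if kind == "tex":
            d += 1                      # a fetch starts one dependency level deeper
        depth[dst] = d
        phases.setdefault(max(d, 1), []).append(dst)
    return phases


if __name__ == "__main__":
    program = [
        ("r0", "tex", ["t0"]),          # plain fetch
        ("r1", "alu", ["r0", "c0"]),    # perturb the coordinates
        ("r2", "tex", ["r1"]),          # dependent fetch, forces a second phase
        ("r3", "alu", ["r2", "r0"]),
    ]
    print(split_into_phases(program, inputs=["t0", "c0"]))
```

    A shader whose result needs more than 4 phases in this model would exceed the 4-level dependency limit mentioned above.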

    English is not my native language, and the quality of my explanation depends very much on how much I have slept today ;)

    Unfortunately I got back to St. Petersburg from the USA yesterday (10 + 6 hours of flight + 8 hours of train + 5 hours of connections), so I am definitely not in a good state now :(

    pcchen

    Yep, I also think that it PROBABLY can write 8 color-only values, but:

    1. I can't check it for sure now because of memory bandwidth limitations
    2. maybe it depends on the number of registers used, i.e. for a 2.0 shader we have to see whether 8 USED register sets for this shader fit into our common register file or not, but for DX7 stages they can probably always fit
    3. it is unknown when NV30 is capable of starting 8 pixels per clock, for example how many parameters can be interpolated per clock. In the full case you need 8+2 4D interpolators per pixel per clock, a really big amount for the full 8-pixel case

    I have now asked John Spit by email about that and am waiting for an answer.
     
  12. RoOoBo

    Regular

    Joined:
    Jun 12, 2002
    Messages:
    308
    Likes Received:
    31
    So the NV30 pixel shader is just an Ultra Long Instruction Word architecture that can 'emulate' Nx pixel shader pipelines like the 'old' ones.

    A ULIW that would in fact still be a microcode-programmed processor. The instructions are so long precisely because they are just all the values for the control signals, already decoded, which is different from a modern CPU where there is a decode stage.

    The physical register bank would also be unified, but I guess that there wouldn't be hardware renaming. The compiler in the driver would do the renaming and scheduling, filling the microcode slots for each functional unit each cycle. So in my opinion that compiler would make IA64 compilers seem like easy stuff. That could explain the problems they say they have in the drivers with the pixel shaders (but if that was the problem I would say that they deserve the problems they are having now; doing everything in a compiler is rarely the right decision).

    There would be 8 TMUs for texture operations and 4 ports for color output and 8 for Z output. Or maybe 8 output ports that could be either 4 color + 4 Z or 8 Z (but it doesn't seem to work as 8 color). In any case those aren't ports to memory but to the later pipeline stages: fog (or could that be implemented as a shader program, at reduced throughput?); Z and stencil tests; alpha tests; and final blending with the color buffer.
     
  13. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,090
    Likes Received:
    694
    Location:
    O Canada!
    I don't think so. And I think it relates back to the comments about an integer pipeline and an FP pipeline. It's likely that the integer pipeline is fixed, and probably fixed as a GeForce 4 configuration; reengineering this would take time. What's going on with the FP pipeline is probably very different from the configuration we are seeing with the integer pipeline, and I suspect there is a larger element of decoupling going on.

    However, NVIDIA have already said that it only does 4 colour writes, so I don't buy that it will ever be able to actually write more than 4 colour values per cycle.

    I still do not get the idea of having the integer pipeline at all though. When the FP pipe is being used you have a large chunk of silicon going to waste, and when the integer pipe is used you have a large chunk of FP silicon unused.
     
  14. 2B-Maverick

    Newcomer

    Joined:
    Feb 10, 2002
    Messages:
    49
    Likes Received:
    0
    But this fits perfectly with NVIDIA's modular way of building GPUs:
    reuse as much as possible, extend the known with a bit of new stuff.
    So you can easily reuse the old drivers and you know what you have .....

    This doesn't have to seem logical in any way; it's just the way I see NV building its chips.
     
  15. UncleSam DL iXBT

    Newcomer

    Joined:
    Feb 27, 2003
    Messages:
    36
    Likes Received:
    0
    It is not integer and float PIPELINES; it is better to call it one wide pipeline with 4 * (2 tex and one float32/float16/int12 ALU, and then 2 integer stages). All of this works in parallel.

    [tex] [tex]   or   [ALU f16/f32/i12]
      |     |                 |
      ----------/or/----------
                 |
           [stage 1 i12]
                 |
           [stage 2 i12]

    So I received an answer from John. Here is what we have:

    If we calculate color in _any way, we always have 4 pixels in process,
    8 for Z/S only!

    For each of the 4 pixels (state and register sets) we have

    2 tex fetchers + 1 floating math ALU capable of one arbitrary operation (_any _precision; John didn't say whether F32 is twice as slow as F16 or not, but from tests it seems so)

    and, downstream, you have 2 integer combiners (stages) capable of 2 operations per clock

    so for 4 pixels we can do in each clock cycle

    4 * 2 = 8 texture lookups
    and
    4 * 2 = 8 int12 operations

    (TEX/TEX/BLEND/BLEND)

    "this is what our marketing group called 8 texture "pipes", since you can
    do 8 lookups, and 8 blend operations based upon them"

    or we can do

    4 * 1 = 4 arbitrary precision (fp32/fp16/fx12) operations
    and
    4 * 2 = 8 fixed12 precision operations on blend stages

    but then with no tex lookups at that clock cycle

    so the maximum amount you can do per cycle - 12 operations.
    at 500 MHz, comes to 6 Gops/s for fx12, 2 Gops/s for fp32/fp16 (since
    the combiners are currently limited to fx12 precision)
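
    The arithmetic behind those peak figures can be written out as a quick check. The per-pixel unit counts and the 500 MHz clock are the ones quoted above; the function itself is only an illustrative sketch with made-up names.

```python
def peak_rates(clock_hz=500_000_000, pixels=4, alu_ops=1, combiner_ops=2):
    """Peak arithmetic ops per second in the 'no texture lookup' mode.

    Per clock, each of the `pixels` pixels gets `alu_ops` operations on the
    fp32/fp16/fx12 ALU plus `combiner_ops` fx12 operations on the combiners.
    Returns (fx12_ops_per_second, fp_ops_per_second).
    """
    fp_ops_per_clock = pixels * alu_ops                     # 4 * 1 = 4
    fx12_ops_per_clock = pixels * (alu_ops + combiner_ops)  # 4 * (1 + 2) = 12
    return fx12_ops_per_clock * clock_hz, fp_ops_per_clock * clock_hz


if __name__ == "__main__":
    fx12, fp = peak_rates()
    print(fx12 / 1e9, "Gops/s fx12,", fp / 1e9, "Gops/s fp32/fp16")
```

    The fx12 figure counts the main ALU as an fx12 unit too, which is how 12 operations per clock are reached; only 4 of those 12 can be floating point.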

    fufff...

    So much depends on HOW the compiler tries to utilize these 12 slots (operations) per clock, but it is always better to do x2 textures per pass and always better (for NV30) to use blend stages or shaders up to 1.3.
     
  16. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,090
    Likes Received:
    694
    Location:
    O Canada!
    Sorry, I'm at a loss to see how that differs from what we've been saying?
     
  17. UncleSam DL iXBT

    Newcomer

    Joined:
    Feb 27, 2003
    Messages:
    36
    Likes Received:
    0
    The difference does not seem big,

    but it is not clear to me: were you talking about separate integer and FP pipes?
    They are not separate FP and integer pipes; it is one pipe with an fp/i stage and i stages,

    and the stages can be used every clock; they are integrated into the pipe, not on the back side.

    On GF4 you simply have this conveyor (* 4):

    [tex][tex]
    [stage alu]
    [stage alu]

    Here, this was added (* 4):

    [tex][tex] -or- [fp/i alu]
    [stage i alu]
    [stage i alu]

    but in the same pipe.

    Later they will probably go to
    (this is my estimate):

    [tex][tex] -or- [fp/i alu]
    [stage fp/i alu]
    [stage fp/i alu]
     
  18. Hyp-X

    Hyp-X Irregular
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    1,170
    Likes Received:
    5
    So the 2 texture lookups and the 1 floating point operation per pipe are mutually exclusive.

    The texture lookup/fp operation is parallel to the integer operation (register combiner), but he didn't say the result of the integer unit can be fed back into the beginning of the pipeline; in fact "downstream" indicates otherwise.
    Note that this is how we (or at least I) already imagined it.
     
  19. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,090
    Likes Received:
    694
    Location:
    O Canada!
    Well, they said they have dedicated support for integer, which is as you describe. However, your diagrams indicated that the texture stages are coupled in groups of two. Is there a case where you think that they operate independently?

    (Given the performance we've seen so far and what was said at the London D2D, I would say not.)
     
  20. Hyp-X

    Hyp-X Irregular
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    1,170
    Likes Received:
    5
    Actually GF4 is more like the latter: it has an fp unit to calculate things for dependent texturing.
     