Performance hit for 64-bit and 128-bit rendering?

Discussion in 'General 3D Technology' started by BRiT, Sep 26, 2002.

  1. Demirug

    Veteran

    Joined:
    Dec 8, 2002
    Messages:
    1,326
    Likes Received:
    69
    Over the last few days I have taken a closer look at the new PS API in DX9.

    IMO the R300 PS (2.0) is only an improved version of the R200 PS (1.4).

    We all remember that PS 1.4 works this way:

    1. Texture reads (6)
    2. Arithmetic (8)
    phase
    3. Texture reads (6)
    4. Arithmetic (8)

    PS 2.0 does not have a phase command, but I believe it is still there, just hidden in the driver. In PS 2.0 there is a limit of 4 dependent reads. That's the reason why I believe the R300 works in this manner:

    1. Texture reads
    2. Arithmetic
    phase
    3. Texture reads
    4. Arithmetic
    phase
    5. Texture reads
    6. Arithmetic
    phase
    7. Texture reads
    8. Arithmetic
    phase
    9. Texture reads
    10. Arithmetic

    If you use fewer than 4 dependent reads, the number of steps is reduced.

    Steps 1, 3, 5, 7, and 9 run in the texture unit
    Steps 2, 4, 6, and 8 run in the address processor
    Step 10 runs in the color processor
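
This split could be sketched roughly like so; a toy model of my own (including all names), not ATI's actual driver logic:

```python
# Toy model: split a flat PS 2.0 instruction stream into R200-style
# phases. A texture read that follows arithmetic starts a new phase,
# since its coordinates may depend on ALU results (a dependent read).
def split_into_phases(instructions):
    """instructions: list of ("tex", name) or ("alu", name) tuples."""
    phases = []
    current = {"tex": [], "alu": []}
    for op in instructions:
        kind = op[0]
        if kind == "tex" and current["alu"]:
            phases.append(current)
            current = {"tex": [], "alu": []}
        current[kind].append(op)
    phases.append(current)
    return phases

# One dependent read -> two phases, just like PS 1.4.
prog = [("tex", "t0"), ("alu", "mul"), ("tex", "t1"), ("alu", "add")]
print(len(split_into_phases(prog)))  # 2
```

With 4 dependent reads this yields the 5 texture/arithmetic blocks listed above.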

    The only thing I am not sure about is the number of pixels that run at the same time in one pixel pipeline. IMO 3 is the best way to make sure that the PS is used as much as possible. But even in that case, each of these pixels can block another one.

    At the moment I am doing some theoretical tests with this configuration.
     
  2. SA

    SA
    Newcomer

    Joined:
    Feb 9, 2002
    Messages:
    100
    Likes Received:
    2
    By the way, just thought it was a good time to mention that deferred rendering tilers have a big advantage over IMRs for high precision floating point pipelines (as most here already know).

    The major problem with floating point textures and pixel pipelines at the moment is that they do not implement what has become the essential features of the integer pipeline (texture filtering, antialiasing, etc.). This substantially limits their usefulness as a general pipeline for realtime 3d graphics. They are, however, very useful for offline work where filtering can be done in (shader) software and performance is not an issue. They are also currently useful for some specialized realtime pixel shaders.

    With integers it is easy to implement a large amount of computation directly in hardware in parallel. Floating point requires far more transistors to implement. As a result, it makes no sense to dedicate all those transistors to fixed functions. Allocating all those transistors to floating point only makes sense if you make the pipeline programmable.

    The problem is that while you might do dozens of operations in parallel in a single clock per pipe for integer vectors, you are generally limited to as little as one (or a few) for floating point.

    This was a lesson learned long ago for other types of hardware. As soon as you increase the sophistication of the data types and the computations, direct hardware implementation is no longer feasible. You must rely on software. As soon as you do this the entire hardware design picture changes dramatically. It becomes extremely important to maximize frequency, data availability, and software computation parallelism to squeeze the most out of all those transistors in each pipe. Since you can perform far fewer operations per pipe per cycle, you must increase the number of cycles and the number of pipes dramatically.

    In the future, I expect to see much more emphasis on frequency and the number of pipelines than in the past.
     
  3. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    There are also high precision integer formats up to 64-bit (maybe even 128-bit). 2 or 4 16-bit integers per texture sample can be very useful, and they can be filtered just fine. Because they are higher precision, you can still do a scale & bias in the pixel shader to get HDR, especially if you store HDR values non-linearly in the texture and linearize them in the pixel shader. I remember hearing (maybe at Siggraph?) about storing all values as s = 1/(h+1) in a texture, then expanding them as h = 1/s - 1 in the pixel shader (h is the HDR value, s is the texture sample). Another method is the RGBScale method mentioned in ATI's paper, although I like the first method better.
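
That reciprocal packing is easy to check numerically; a quick sketch (function names of my own choosing):

```python
# Non-linear HDR packing: store s = 1/(h+1), recover h = 1/s - 1.
def encode(h):
    return 1.0 / (h + 1.0)   # maps h in [0, inf) to s in (0, 1]

def decode(s):
    return 1.0 / s - 1.0

for h in [0.0, 1.0, 100.0]:
    assert abs(decode(encode(h)) - h) < 1e-9   # round trip is exact-ish
print(encode(0.0), encode(1.0), encode(100.0))  # 1.0 0.5 0.0099...
```

Large HDR values get squeezed toward s = 0, which is why pairing this with the higher precision 16-bit integer formats mentioned above makes sense.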
     
  4. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Welcome to the forum Demirug!

    Agreed. PS 1.4 was quite a big step from PS 1.3, although technically you could pretty much convert anything from PS 1.4 to multiple PS 1.3 passes. ATI was saying in their PS 1.4 papers that this is where we're headed with PS 2.0, and they weren't unjustly promoting R200 by doing so.

    I don't really think there is a separate address and colour processor, not even in R200. I think the GF3/4 worked the way you described, though, which is why they had arithmetic operations like texdp3 or texm3x3. I believe there is only one shader processor that does arithmetic, and a texture unit that fetches texture samples based on values from either texture coordinates or the shader.

    An interesting point here is that it seems R300 is relatively evolutionary from R200, as compared to NV30 being a significant architectural overhaul from NV25/NV20. Maybe this is one of the reasons that R300 came so much sooner.

    I'm not sure what you mean by this, but I'm pretty sure only one pixel "runs at the same time" per pipeline. Between "phases", a group of pixels is buffered while the dependent texture lookup is performed, and then those pixels are injected back into the pipeline in a cycle. If that's what you mean, then the number of pixels is probably something like 10 or 20 per pipe, but this is a pure guess. You need enough time to ride out the latency of a texture fetch.
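
The guess can be framed as simple arithmetic; a sketch with made-up numbers (not vendor figures):

```python
# How many pixels must be buffered per pipe to hide texture fetch
# latency, if each in-flight pixel contributes `work_per_pixel`
# cycles of ALU work while another pixel's fetch is outstanding.
def pixels_in_flight(fetch_latency, work_per_pixel):
    return -(-fetch_latency // work_per_pixel)   # ceiling division

# e.g. a 30-cycle fetch with 2 ALU ops between fetches -> 15 pixels
print(pixels_in_flight(30, 2))   # 15
```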
     
  5. Demirug

    Veteran

    Joined:
    Dec 8, 2002
    Messages:
    1,326
    Likes Received:
    69
    http://www.digit-life.com/articles2/radeon/ati-r300.html

    If you look at the picture of the pixel shader, you can see that there are 3 parts:

    -Texture Unit
    -Address Processor
    -Color Processor

    OK, maybe the "Address Processor" does something different, but then why would they call it "Color Processor" if it calculated texture coordinates for dependent reads?

    Yes, I mean something like this, but I think "run" is the right term if I count the TMU as part of the pixel shader. I do not think that a pixel shader buffers many pixels, because the register set is not small at all. And we know that the NV30 can work on 32 pixels with 8 pipes, which means every pipe works on 4 pixels at the same time.

    IMO, while the TMUs fetch the samples for one pixel, the processor works on another.

    Something like this:

    T A C
    1 - -
    2 1 -
    3 2 1
    4 3 2

    T = TMU
    A = Address Processor
    C = Color Processor

    In reality it is trickier, because the time a pixel spends in each of the pixel shader parts may differ. As I said before, I am playing a little with this model, and with a model of the NV30 pipeline, too.
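
The staggering in the table can be simulated; a toy model of my own with idealized single-cycle stages:

```python
# Three-stage pipeline: each cycle a pixel advances TMU -> address
# processor -> color processor, so three pixels are in flight at once.
def simulate(num_pixels, cycles):
    t = a = c = None          # contents of the T, A, C stages
    nxt, rows = 1, []
    for _ in range(cycles):
        # tuple assignment advances all stages simultaneously
        c, a, t = a, t, (nxt if nxt <= num_pixels else None)
        nxt += 1
        rows.append((t, a, c))
    return rows

for row in simulate(4, 4):
    print(row)   # (1, None, None), (2, 1, None), (3, 2, 1), (4, 3, 2)
```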
     
  6. arjan de lumens

    Veteran

    Joined:
    Feb 10, 2002
    Messages:
    1,274
    Likes Received:
    50
    Location:
    gjethus, Norway
    Where does this information come from?

    NV30's pixel shaders use 32 registers per pixel, each 128 bits wide. Such a register file need not take more than ~33K transistors, so I would find it perfectly feasible to implement 30+ such register files in each pixel pipe (which would still amount to only ~6% of the total transistor count in NV30). If you want an efficient pipeline, you pretty much need a pipeline that can actually absorb texture cache misses (several clock cycles + 1 full DRAM latency, in sum ~50-80 ns = 25-40 cycles @ 500 MHz) without stalling.
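
The arithmetic behind those estimates can be laid out explicitly (the ~8 transistors per SRAM bit is my assumption, chosen to match the ~33K figure):

```python
regs_per_pixel   = 32
bits_per_reg     = 128            # 4 x FP32
t_per_bit        = 8              # assumed SRAM transistors per bit
pixels_per_pipe  = 30             # register files per pipe, as estimated
pipes            = 8
nv30_transistors = 125_000_000    # ~125M total

per_file = regs_per_pixel * bits_per_reg * t_per_bit
total    = per_file * pixels_per_pipe * pipes
print(per_file)                                   # 32768, i.e. ~33K
print(round(100 * total / nv30_transistors, 1))   # 6.3 (% of the chip)
```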
     
  7. Demirug

    Veteran

    Joined:
    Dec 8, 2002
    Messages:
    1,326
    Likes Received:
    69
    The information is from an interview with David Kirk.
    http://www.extremetech.com/article2/0,3973,710384,00.asp

    This is the same technique that was used before in the vertex shader, but there NVIDIA uses only 3 stages. This is done to make sure that every vertex operation can effectively be handled in one cycle per vertex. Some ops need up to 3 cycles, but the vertex shader works on 3 vertices at the same time, so every cycle one operation is finished.

    In the new pixel shader it looks like they use 4 stages and a decoupled TMU to reduce the dependence on latency.

    Your calculation misses something: the input and output registers are part of the per-pixel register set, too. The only thing I ask myself is: "Is it a good idea to use such a huge part of the die only to reduce the effect of texture fetch latency?"
     
  8. arjan de lumens

    Veteran

    Joined:
    Feb 10, 2002
    Messages:
    1,274
    Likes Received:
    50
    Location:
    gjethus, Norway
    OK, although I still think that 32 pixels in flight sounds awfully little for an 8-pipeline part, given that it would allow for only 4 cycles latency per TMU lookup before fillrate takes a nosedive - TMUs that are *that* fast are hard to make, even if you don't count the texture cache miss latency.
    Sounds reasonable.
    Input and output registers add about 50% to that register count. 12 or so million transistors out of 125 doesn't sound *that* large to me, given that it can be done in rather compact SRAM. And for what happens when the pipeline is not optimized like that, it look like the SiS Xabre chip is a rather good example: it needs a huge positive LOD bias (=> blurry textures) to reach competitive performance levels, and its dependent texturing/EMBM performance is *horrible*.
     
  9. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,902
    Likes Received:
    218
    Location:
    Seattle, WA
    Yes, I believe I remember reading not too long ago that modern processors have hundreds of pipeline stages (300-400 was what I thought I remembered reading). This makes perfect sense to me, as the static nature of 3D graphics just makes such huge pipelines an obvious optimization.
     
  10. Demirug

    Veteran

    Joined:
    Dec 8, 2002
    Messages:
    1,326
    Likes Received:
    69
    If you can build a TMU that can handle a lookup with 4 cycles of latency, it will be fast enough for any case, because a pixel needs 4 cycles in the pixel pipeline, too. The other thing I believe is that the NV30 driver tries to optimize the shader program. "Fetch early, use late" is the right way to do this. The number of cycles between the start of the fetch and the use of the sample is the key. If you can do some other work after starting the lookup and before you need the sample, you win some cycles. Each operation gives you 4 additional cycles. OK, in the case where you are limited by the TMUs because you use a high AF mode, the pixel shader processor will not run at maximum performance, but this is an old story and nothing new.
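
A toy illustration of "fetch early, use late" (my own model with a single pixel and a 4-cycle fetch, not NV30's actual scheduler):

```python
# Count stall cycles spent waiting on texture samples. Ops are
# ("fetch", dst) or ("alu", dst, src, ...), one cycle each.
def stall_cycles(program, fetch_latency=4):
    t, ready, stalls = 0, {}, 0
    for op in program:
        if op[0] == "fetch":
            ready[op[1]] = t + fetch_latency   # sample arrives later
        else:
            for src in op[2:]:
                if src in ready and ready[src] > t:
                    stalls += ready[src] - t   # used too early: stall
                    t = ready[src]
        t += 1
    return stalls

late  = [("fetch", "t0"), ("alu", "r0", "t0"), ("alu", "r1", "c0")]
early = [("fetch", "t0"), ("alu", "r1", "c0"), ("alu", "r0", "t0")]
print(stall_cycles(late), stall_cycles(early))   # 3 2
```

Every independent op hoisted between the fetch and its first use buys back one cycle in this single-pixel model; with 4 pixels interleaved per pipe it would buy 4, as described above.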


    Sure, but we still need the same number of transistors for the 32 pixels that are active in the pipelines. IMO it is better to use those transistors, if you have any left over, as texture cache to improve the hit rate.
     
  11. Demirug

    Veteran

    Joined:
    Dec 8, 2002
    Messages:
    1,326
    Likes Received:
    69
    The number of stages seems a little high. AFAIK it is more in the range of 20-40.
     
  12. Althornin

    Althornin Senior Lurker
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    1,326
    Likes Received:
    5
    Yeah, from what I remember from reading my computer architecture book, modern processors have more like 7-20 pipeline stages. Otherwise, a branch mispredict would cost even more.
    Quick googling seems to say that the Athlon XP has 10 pipeline stages, and the Pentium 4 has 20. IIRC, the Pentium II has 7.
     
  13. RoOoBo

    Regular

    Joined:
    Jun 12, 2002
    Messages:
    308
    Likes Received:
    31
    I found long ago a reference for 600 - 800 pipeline stages here:

    http://www.ce.chalmers.se/undergraduate/D/EDA425/lectures2002/gfxhw.pdf

    Not that I think it is a fully credible source, because I don't remember any NVidia document talking about hundreds of stages in GeForce 3. Maybe it is counting all the stages across all the pipelines (for example, 20 for each pixel pipe: 4x20 = 80).

    I think around 50 - 100 is a more reasonable number ;).

    Now we could start a poll about the number of stages in a GPU :lol: .
     
  14. RoOoBo

    Regular

    Joined:
    Jun 12, 2002
    Messages:
    308
    Likes Received:
    31
    Well, but a GPU is a lot different from a CPU. Shader units (the ones that are CPU-like) should have around 5-8 stages, taking into account short latency instructions (integer, branches) and long latency instructions (FP). But that doesn't take into account the stages in the fixed-function hardware, which can be a lot more.

    Perhaps the guy in the pdf I posted before means by 600-800 the number of cycles it takes a vertex or a triangle from start to end, not the real number of stages in the hardware pipeline. That could include the fact that a single triangle can produce multiple pixels.

    BTW, the 20 stages in the P4 are just from trace cache to write-back (from what I recall). That doesn't include instruction fetch and decode from L2, or commit, so the actual number is perhaps a bit larger (though Intel doesn't seem to like to talk about it). It is a hyperpipelined, superscalar, out-of-order architecture, and now two-way multithreaded as well ;). I still recall the last PACT having three papers on how many stages a CPU could have before becoming inefficient (around 40 in some of the papers).
     
  15. Demirug

    Veteran

    Joined:
    Dec 8, 2002
    Messages:
    1,326
    Likes Received:
    69
    RoOoBo, yes, I think the 600-800 is the number of cycles needed from start to end.
     
  16. Basic

    Regular

    Joined:
    Feb 8, 2002
    Messages:
    846
    Likes Received:
    13
    Location:
    Linköping, Sweden
    A GPU isn't like one CPU; it's instead several CPUs connected in series with FIFOs between them (VS, triangle setup, possibly some rasterization stuff, PS). If the 600-800 stage number is right, then I'd guess it refers to the sum of all pipes plus the depth of the FIFOs. That does not by any means mean that the pipeline depth for just the PS is anywhere near these numbers.

    I assume that most GPUs are heavily dependent on texture coherency, even in dependent reads, so that precaching of textures is possible. The PS should get its data directly from the caches in most cases, and cache misses should be rare enough that their high cost doesn't matter that much.

    It would be interesting to test dependent texture reads where the accesses to the second texture are rather random, and with mipmapping turned off on that texture. If it doesn't take a deep performance plunge, I'll be impressed.
     
  17. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    939
    Likes Received:
    35
    Location:
    LA, California
    Demirug - from the extreme-tech article, it doesn't sound like there are necessarily 8 decoupled pipelines at all. Kirk simply states that there are 32 functional units, and that 8 pixels can be written per cycle.

    http://www.extremetech.com/article2/0,3973,710352,00.asp


    Something else that's interesting (from the same article):
     
  18. Demirug

    Veteran

    Joined:
    Dec 8, 2002
    Messages:
    1,326
    Likes Received:
    69
    psurge, the information about the decoupled TMUs was in another article, but I cannot find it at the moment.
     
  19. RoOoBo

    Regular

    Joined:
    Jun 12, 2002
    Messages:
    308
    Likes Received:
    31
    In fact, instructions in a CPU can also remain in the pipe for hundreds of cycles: a load that misses in L2, for instance. And now think about a chain of dependent loads that remain in the instruction queue ;). So you could say that a CPU has *cough* hundreds *cough* of stages. But it doesn't make any sense...
     
  20. Althornin

    Althornin Senior Lurker
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    1,326
    Likes Received:
    5
    I understand that; my post was in relation to Chalnoth's comment on modern processors having hundreds of stages.
     