Performance hit for 64bit and 128bit rendering?

Discussion in 'General 3D Technology' started by BRiT, Sep 26, 2002.

  1. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,903
    Likes Received:
    218
    Location:
    Seattle, WA
    Sorry, I meant modern graphics processors. Of course you're right, CPU's have far less.
     
  2. JohnH

    Regular

    Joined:
    Mar 18, 2002
    Messages:
    586
    Likes Received:
    2
    Location:
    UK
To maintain peak fillrate, you want to be able to absorb the latency of your pipeline plus the latency of a cache miss, which can add up to a lot of clocks. In a pre-DX8 processor this latency could be absorbed by adding a large FIFO between the texture fetch unit and the colour processing unit. However, as soon as you allow a texture to be looked up using the result of another lookup and/or some arithmetic, that solution no longer works: you have to either duplicate execution units (with FIFOs in between, i.e. think phase marker) or make your temp regs as deep as the latency you want to absorb. The latency of a typical GPU texturing pipeline (e.g. issue->cache->filter->PS) is probably of the order of 30-50 clocks, but a cache miss can add hundreds of cycles to this, which potentially makes temp registers very expensive!
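The cost JohnH describes can be sketched with some back-of-the-envelope arithmetic. This is a toy model with made-up numbers (the function, latency, and register counts are illustrative assumptions, not figures from any real design):

```python
# Rough model of why deep temp registers get expensive: to keep the
# pipeline full, one pixel must be in flight for every clock of latency,
# and each in-flight pixel needs its own copy of the temp registers.
def temp_reg_storage_bits(latency_clocks, regs_per_pixel, bits_per_reg=128):
    """Register storage needed to absorb a given latency at 1 pixel/clock."""
    return latency_clocks * regs_per_pixel * bits_per_reg

# Absorbing a 50-clock pipeline plus a 300-clock miss with, say,
# 4 live 128-bit temp registers per pixel:
bits = temp_reg_storage_bits(50 + 300, regs_per_pixel=4)
print(bits // 8 // 1024, "KB of register storage per pipeline")  # → 21 KB
```

Even with these modest assumptions the storage runs to tens of kilobytes per pipeline, which is why a simple FIFO was so much cheaper before dependent reads were allowed.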

    Incidentally, modern CPUs generally grind to a halt on a 'seen' cache miss, e.g. try reading from uncached memory (such as the FB; this typically yields around 5MB/s!). This is one of the things hyperthreading helps with, i.e. try and use my processing resources for something else while I'm stalled waiting for external resources to come back...

    Hmm, seem to be rambling.

    John.
     
  3. RoOoBo

    Regular

    Joined:
    Jun 12, 2002
    Messages:
    308
    Likes Received:
    31
    I wonder how many cycles the max latency for a texture load really is. I don't think it is as big as a hundred cycles. You have to remember that video memory is way faster than main memory, and that the GPU is a lot slower than the CPU. The CPU-to-main-memory clock ratio is around 10-20 CPU cycles per bus cycle (the address clock is still 133 MHz even if the data can be read/written at double speed). The GPU-to-video-memory ratio is 1:1. That's without counting AGP texture reads, but I don't think anyone takes those into account while designing the optimum path in the pipeline (the latency would be way too large).

    Of course a texture read can be more than a 32-bit data read when using filtering (or textures with more precision), but the data burst (reading a whole line) is faster than the address setup, and if the texture is stored correctly all the data is fetched in a few cycles with just one or two accesses.

    The fixed cost due to the filtering algorithm shouldn't be that large: bilinear is 3 adds (32-bit, as I think FP filtering isn't supported yet) and one division/multiplication (probably something way more efficient than a real division), trilinear is 9 adds and 3 div/muls. AF could be more expensive, but that is expected in any case (it isn't supposed to be single cycle). I don't think it is that much larger than the FMAD latency.
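The operation count above follows from writing bilinear filtering as three lerps. A toy scalar sketch (real hardware does this per colour channel, in fixed point, and folds the weights differently, but the structure is the same):

```python
def lerp(a, b, t):
    """Linear interpolation: one subtract, one multiply, one add."""
    return a + (b - a) * t

def bilinear(t00, t10, t01, t11, fx, fy):
    """Bilinear filter of a 2x2 texel footprint: three lerps total."""
    top = lerp(t00, t10, fx)      # blend along x, top row
    bottom = lerp(t01, t11, fx)   # blend along x, bottom row
    return lerp(top, bottom, fy)  # blend the two rows along y

print(bilinear(0.0, 1.0, 0.0, 1.0, 0.25, 0.5))  # → 0.25
```

Trilinear is just two of these plus one more lerp between the two mip levels, hence the roughly 3x operation count quoted above.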
     
  4. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,903
    Likes Received:
    218
    Location:
    Seattle, WA
    Well, one thing you do need to consider is that the memory controller will need to cache reasonably large blocks of data for there to be any kind of reasonable efficiency in using the memory bandwidth. So while the memory may only take a few clocks to access, the memory controller may require more so that it can make the best possible use of the total memory bandwidth.

    Still, I don't really know how this can translate to 100's of clocks...
     
  5. Kristof

    Regular Alpha

    Joined:
    Jan 30, 2002
    Messages:
    733
    Likes Received:
    1
    Location:
    Abbots Langley
    RoOoBo, it's all about pipelining and latency. When you do a texture read you go through the following steps:

    PS -> Texture Address Generator -> Cache <=> External Video Memory
    Cache -> Texture Filter -> PS

    You need to remember that all these stages are pipelined so they can execute a full operation per clock. Even though they deliver a result each clock, this does not mean that an operation is only worked on for one clock. Each of these blocks is a set of pipeline stages, and each stage adds another clock of latency. So even though filtering might not sound expensive because it's pipelined, it's going to add quite a few stages. Also don't forget that there are other things involved in all of this, for example decompression: DXT1 is not going to magically decompress itself, lower bit-depth formats will get upsampled to the internal format (conversion to float, for example), and all of this takes time and stages and adds latency.

    If you have a true miss then you'll need to access external memory, and you don't just read the texels you need: you read a burst, which is a whole block of texture data, and just reading this will take quite a few cycles. But the real issue is that your memory interface is doing a lot of things (not just texture fetches), so all of this goes into a buffer with several stages (you need to wait your turn; you can't break, for example, a vertex buffer burst just because you need a texture). Just because you have a miss does not mean you can stop all the other data that needs to get onto or off the chip, and then you'll most likely have a page break (another wad of cycles lost) or some read/write turnaround. Another thing to remember is that you have a lot of data that needs to flow into and out of the cache: think 8 pipelines all supporting trilinear, that's 8x8=64 texels flowing at 32 bits each, so that's 64*4 bytes or 256 bytes of data flowing out of your cache per clock. That's not a trivial thing to do, so the cache itself is quite a complex beast. Anyway, all in all you might be quite surprised how quickly latency adds up in a complex design.
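Kristof's cache-bandwidth figure, spelled out as arithmetic (the 8-pipeline trilinear setup is his example, not any specific chip):

```python
# Per-clock data flowing out of the texture cache in the example above.
pipelines = 8
texels_per_trilinear = 8   # two bilinear footprints of 4 texels each
bytes_per_texel = 4        # 32-bit texels

bytes_per_clock = pipelines * texels_per_trilinear * bytes_per_texel
print(bytes_per_clock, "bytes per clock out of the texture cache")  # → 256
```

256 bytes per clock is a 2048-bit-wide read path, which is why the cache itself ends up being such a complex piece of hardware.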
     
  6. JohnH

    Regular

    Joined:
    Mar 18, 2002
    Messages:
    586
    Likes Received:
    2
    Location:
    UK
    The other thing to bear in mind is that the cache won't necessarily get the bus the instant it requests it, as other ports may be using it, e.g. FB R/W, DAC, vertex fetch etc., all of which will be trying to supply as large a burst as possible in order to maintain efficiency. Add in the time to burst-fill the cache itself and then page breaks, and you start having to allow for an awful lot of clocks.

    There's also a pile of latency sitting behind the cache in the form of arbitration logic, synchronisers (for async mem) and the memory interface. This all adds up very quickly...

    John.
     
  7. PSarge

    Newcomer

    Joined:
    May 7, 2002
    Messages:
    147
    Likes Received:
    0
    Location:
    UK
    Exactly!!!

    Consider that there might be 10-20 different areas of the chip all wanting access to memory bandwidth (video displays, writes from the host loading geometry/textures, DMA engines, texture reads, Z reads, Z writes, framebuffer reads, framebuffer writes, and then add in all the clever stuff :) ). Then think that each time a request for data goes out of a cache it has to be arbitrated, and then wait for all the other higher-priority requests to complete before it gets hold of the memory bus.
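The "wait your turn" step can be sketched as a fixed-priority arbiter. This is a toy model with hypothetical client names and priorities, just to show where the queueing delay comes from:

```python
# Toy fixed-priority bus arbiter: each clock, the highest-priority client
# with a pending burst wins the bus; everyone else keeps waiting.
def arbitrate(pending):
    """pending: list of (priority, name); lower number = higher priority.
    Returns the name of the winning client, or None if the bus is idle."""
    if not pending:
        return None
    return min(pending)[1]

# Display refresh typically can't be stalled, so it outranks everything:
clients = [(0, "display refresh"), (2, "texture fetch"), (1, "z write")]
print(arbitrate(clients))  # → display refresh
```

A texture-cache miss sitting at priority 2 here waits for every display and Z burst ahead of it, plus the page breaks between them, before its own burst even starts.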

    Time just flies while you're having fun 8)
     
  8. JohnH

    Regular

    Joined:
    Mar 18, 2002
    Messages:
    586
    Likes Received:
    2
    Location:
    UK
    Actually, going back to the original subject of this thread: neither the 9700 nor the NV30 will have sufficient memory BW to support 128-bit surfaces at their max fillrate. E.g. at the 9700's 2.6GPix/s fill rate, 128BPP requires ~40GB/s just for the data write-out alone, i.e. >2x peak memory BW; add texturing to this, and well...
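The ~40GB/s figure checks out directly from the numbers quoted in the post:

```python
# Write-out bandwidth for a 128-bit colour buffer at the quoted fillrate.
fillrate_gpix = 2.6    # GPix/s (9700 figure from the post)
bytes_per_pixel = 16   # 128 bits per pixel

write_bw = fillrate_gpix * bytes_per_pixel
print(write_bw, "GB/s just for colour write-out")  # → 41.6
```

Against the 9700's roughly 20GB/s of total memory bandwidth, that is indeed more than double the peak before a single texel has been fetched.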

    John.
     
  9. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,903
    Likes Received:
    218
    Location:
    Seattle, WA
    Yeah, if the cards can actually calculate a 128-bit pixel in a single clock per pipeline. Remember what these things are going to be used for? Primarily for intermediate passes of multipass algorithms.

    Think about that for a moment. If you have to resort to multiple passes on DX9 hardware, how long is each shader going to be? I don't think memory bandwidth is going to be much of a concern for 128-bit PS, usually (it might be a concern for people attempting to use nVidia's packed pixel format... not sure...).
     
  10. ERP

    ERP Moderator
    Moderator Veteran

    Joined:
    Feb 11, 2002
    Messages:
    3,669
    Likes Received:
    49
    Location:
    Redmond, WA
    I would have thought that in most of the cases where you're writing out a 128BPP buffer it's going to be an intermediate result, and by and large you're going to be spending a lot more than one clock computing the results.

    So I don't think you have to approach the necessary bandwidth required for peak fill.

    Actually I think that any time you're stressing the R300 or NV30 feature-wise, you're probably going to end up more worried about pixel computations than raw framebuffer bandwidth.
     
  11. Simon F

    Simon F Tea maker
    Moderator Veteran

    Joined:
    Feb 8, 2002
    Messages:
    4,560
    Likes Received:
    157
    Location:
    In the Island of Sodor, where the steam trains lie
    You might still get bottlenecks. The "average" rate may be within the limits but if the peak rate is too high and the FIFOs too small then you won't get near that average.
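Simon F's peak-vs-average point can be illustrated with a tiny queue simulation. The traffic pattern and FIFO depths are made up; the point is only that an average rate within capacity still stalls when the peaks overflow a small FIFO:

```python
# Toy model: one request serviced per clock; bursty arrivals must fit in
# the FIFO or the producer stalls for the overflow.
def stalls(arrivals, fifo_depth):
    """Count stall clocks for a given arrival pattern and FIFO depth."""
    queued = lost_clocks = 0
    for n in arrivals:
        queued += n
        if queued > fifo_depth:           # overflow: producer must stall
            lost_clocks += queued - fifo_depth
            queued = fifo_depth
        queued = max(0, queued - 1)       # service one request this clock
    return lost_clocks

bursty = [4, 0, 0, 0] * 8   # average 1 request/clock, but peaks of 4
print(stalls(bursty, fifo_depth=2))  # → 16 (small FIFO stalls)
print(stalls(bursty, fifo_depth=8))  # → 0  (deep FIFO rides out the peaks)
```

Both runs carry exactly the same average load; only the FIFO depth differs, which is the whole point.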
     
  12. JohnH

    Regular

    Joined:
    Mar 18, 2002
    Messages:
    586
    Likes Received:
    2
    Location:
    UK
    Yep, the typical algorithms we've come up with that use these do tend to have a reasonable number of instructions, so, as you say, the average rate tends to be quite a bit slower, although I can probably come up with scenarios in which passes with relatively few instructions are required.

    The other way of looking at this is how much memory are these surfaces going to take? E.g. many algorithms might want a 1:1 ratio with the "real" render target, so, say, 1024x768 with 4xAA @ 128BPP requires 48MB.
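The 48MB figure follows directly from the numbers in the post:

```python
# Memory footprint of a 1:1 intermediate surface at the quoted settings.
width, height = 1024, 768
aa_samples = 4          # 4x AA
bytes_per_sample = 16   # 128 bits per sample

size_mb = width * height * aa_samples * bytes_per_sample / 2**20
print(size_mb, "MB")  # → 48.0
```

On a 128MB card of that era, a couple of such intermediate surfaces plus the front/back buffers and textures eats the whole board.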

    John.
     