[B3D Analysis] R600 has been unleashed upon an unsuspecting enthusiast community

Discussion in '3D Hardware, Software & Output Devices' started by Farid, May 14, 2007.

  1. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,299
    Likes Received:
    137
    Location:
    On the path to wisdom
    The context implied getting samples into the shader, which most likely means texture reads.

    To have the MSAA buffer as just a simple texture would require disabling color compression, which could substantially reduce the usefulness of MSAA.
     
  2. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
    That would be a naive implementation, a better implementation would give you all the subsamples you need in your pixel shader automatically uncompressed: I expect R600 to implement something like that.
     
  3. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    I don't see why the RBEs can't have a "backward" datapath to the shader pipeline. After all, there's a forward path.

    I'm sure I saw somewhere a hint that the 8KB of read/write cache, per SIMD, is used as the path for this. I interpret this to mean that the RBEs have a write path to the R/W cache and this cache is then used as the input for the AA-resolve shader, as though they are per-pixel attributes. The R/W cache is also designed to share data between "neighbouring pixels", a feature of tented-AA - though obviously texture-based AA can do the same.

    The RBEs are the perfect units to fetch the render target because they're designed to do this anyway, and because the RBEs will use the least amount of bandwidth doing so (since the RBEs have access to the compression flags). Though the bandwidth saving would diminish as per-frame triangle count increases (assuming ~constant overdraw factor).

    Which would run faster? The texture units, in point-sampling mode, can fetch 80 samples per clock. What's the sample rate for the RBEs? 64 colour samples per clock?

    Clearly, though, the programmer-exposed concept of shader AA in D3D10 is texture-based, e.g. for HDR or deferred-rendering where each sample is accessible. That fact does rather undermine my argument.
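To make the texture-based path concrete, here is a minimal sketch of the box filter such a resolve pass computes, written as plain Python over a dict standing in for the multisampled surface (the sample layout and names are illustrative only, not R600's actual datapath):

```python
# Box-filter MSAA resolve with per-sample access, in the spirit of
# D3D10's Texture2DMS.Load(). Pure-Python sketch; the buffer is a dict
# keyed by (x, y, sample) purely for illustration.
def resolve_pixel(msaa_buffer, x, y, num_samples):
    # Average all sub-samples of one destination pixel.
    r = g = b = 0.0
    for s in range(num_samples):
        sr, sg, sb = msaa_buffer[(x, y, s)]
        r += sr; g += sg; b += sb
    inv = 1.0 / num_samples
    return (r * inv, g * inv, b * inv)

def resolve(msaa_buffer, width, height, num_samples):
    # One "full-screen pass": resolve every destination pixel.
    return {(x, y): resolve_pixel(msaa_buffer, x, y, num_samples)
            for y in range(height) for x in range(width)}
```

An HDR or deferred-rendering resolve would swap the plain average for a tone-mapping-aware or edge-aware filter at the same point, which is exactly why per-sample access matters.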

    Also, I'm thinking there's a very strong likelihood that ATI has used the multiple-concurrent context support of R600 to effect the shader AA pass:

    Code:
       Frame 0 rendering   AA
    |---------------------|--|
                             Frame 1 rendering   AA
                          |---------------------|--|
    
    so that AA resolve for frame 0 is able to perform texture fetches while frame 1 is also able to perform texture fetches as it starts rendering.

    Jawed
     
  4. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,357
    Likes Received:
    28
    This would require double buffering the AA buffers as you can't still be resolving the old frame while beginning rendering on the new.
     
  5. pelly

    Newcomer

    Joined:
    Sep 5, 2002
    Messages:
    159
    Likes Received:
    5
    Location:
    San Jose, CA
    I'd love to see the second part of B3D's coverage for this card.....I know staff said they're waiting on a few items.....but couldn't the article be posted now (or soon) with the "special bits" being added as an update once the info is available? :sad:

    By the time we see things, there will likely be some major driver changes....with some possible "key" performance enhancements...
     
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Hmm, so R600 can't use regions of memory for distinct render targets, in order to "double-buffer"? The hierarchical-Z and -stencil buffers aren't regionalised either?

    What happens when Aero is running? e.g. you have the 3D interface running, which is a z-buffered render target and within that multiple 3D applications each of which has its own "private" z-buffered render target.

    Or, erm, isn't that possible? I thought it was...

    Jawed
     
  7. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,357
    Likes Received:
    28
    You lost me. If you have a single AA buffer, you can't resolve it and render to it at the same time. You don't know where the application is going to render, so it may corrupt data that still needs to be resolved. In any event, as Eric mentioned, we are using shader resolve, which means the 3D pipeline is being used for the resolve operation.
    That's no problem at all because everything has its own private buffers. Aero has one Z buffer, each 3D app has its own, etc.
     
  8. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Ah, I was guessing that because each application can have its own private buffers, a game could use two buffers and flip between them for the purposes of AA resolve, so that there'd be no AA clash.

    Oh well.

    So what you're saying is that this is a texture-unit-based resolve, much the same as D3D10 exposes for programmers. Presumably this means there are no bandwidth or ALU-cycle savings to be made by detecting compressed tiles. Or does R600 dump the compression information out to a second texture, allowing the AA-resolve shader to navigate the AA data efficiently?

    Jawed
     
  9. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,299
    Likes Received:
    137
    Location:
    On the path to wisdom
    Yes, of course. I was referring to Deano's "MSAA buffers being just linear textures" which isn't true when compression comes into play.

    You can have a data path from anywhere to anywhere. I didn't mean to imply any limitation of that kind.
     
  10. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,357
    Likes Received:
    28
    The app could do this, but there'd be no savings as the pipeline would still be in use for the resolve.
    I'm not at liberty to go into specifics, but we do take steps to do things efficiently.
     
  11. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
    Let's say the R600 shader core can read back a compression flag which tells us whether a tile's worth of subsamples is compressed (all samples equal) or not; one would need dynamic branching to dynamically skip the extra texture reads/extra math, which imho does not sound like a good idea.
    What if we could read such a flag and fill a render target with a mask that we can use later to automatically early-skip every pixel that has compressed subsamples (or the other way around)?
    Then our resolve pass could be decoupled into 2 full-screen passes, one processing every subsample belonging to a pixel and the other one sampling just one subsample per pixel.
    AA resolves wider than one pixel would need some special care at mask generation time though.
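The decoupled scheme can be sketched in Python, with a dict standing in for the mask render target and a per-tile flag table standing in for whatever the hardware actually exposes (all names and layouts here are hypothetical):

```python
# Sketch of the two-pass resolve described above: pass 1 fully resolves
# only the pixels whose tile is uncompressed; pass 2 touches only the
# compressed pixels, where all sub-samples are equal so one read suffices.
# The flag/mask representation is a made-up stand-in, not real hardware state.
def build_mask(compressed_flags, tile_size, width, height):
    # One entry per pixel: True if the pixel's tile is compressed.
    return {(x, y): compressed_flags[(x // tile_size, y // tile_size)]
            for y in range(height) for x in range(width)}

def two_pass_resolve(samples, mask, num_samples):
    out = {}
    # Pass 1: average every sub-sample, uncompressed pixels only.
    for (x, y), compressed in mask.items():
        if not compressed:
            acc = [0.0, 0.0, 0.0]
            for s in range(num_samples):
                for c in range(3):
                    acc[c] += samples[(x, y, s)][c]
            out[(x, y)] = tuple(v / num_samples for v in acc)
    # Pass 2: a single sample read per compressed pixel.
    for (x, y), compressed in mask.items():
        if compressed:
            out[(x, y)] = samples[(x, y, 0)]
    return out
```

On hardware the skip would come from early rejection against the mask rather than a Python `if`, which is the whole point: no per-pixel dynamic branch inside the shader.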
     
    #871 nAo, Jun 19, 2007
    Last edited: Jun 19, 2007
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    If the compression flags (which are just a 2D structure) are dumped into memory to be used as a texture, it should be simple to do this in a single pass. In the AA-resolve pixel shader, each pixel knows whether it's part of a compressed tile or not. The thread size of the GPU (16 in RV610, 32 in RV630, 64 in R600) determines coherence, which means that the execution time is constant for uncompressed tiles (it's also constant for compressed tiles). We don't know the size of the compression tiles...
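In the single-pass form, each pixel's shader would simply branch on the flag it fetches; a toy Python rendering of that branch (the flag texture is again a hypothetical stand-in):

```python
# Single-pass resolve branching on dumped per-tile compression flags:
# one sample read for a compressed tile, a full average otherwise.
# Because the branch granularity is a whole tile, neighbouring pixels in
# a hardware thread tend to take the same side, which keeps it coherent.
def resolve_with_flags(samples, flags, tile_size, x, y, num_samples):
    if flags[(x // tile_size, y // tile_size)]:
        # Compressed tile: every sub-sample is equal, fetch just one.
        return samples[(x, y, 0)]
    acc = [0.0, 0.0, 0.0]
    for s in range(num_samples):
        for c in range(3):
            acc[c] += samples[(x, y, s)][c]
    return tuple(v / num_samples for v in acc)
```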

    This patent document:

    Method and apparatus for anti-aliasing using floating point subpixel color values and compression of same

    refers to tiles of 2x2 and 4x4 pixels. The size of tiles may depend on the colour precision in the render target... The number of samples per pixel (2, 4 or 8) may also affect the tile size.

    Also, it uses a 3-level compression scheme (uncompressed, partially compressed and fully compressed) and seems to be designed around the concept of an fp16 pixel format (i.e. 64 bits per pixel).

    So, ahem, juggling this data is a bit more complicated than I was thinking, :lol:

    ---

    If the render target is in a 32-bit per pixel format, presumably the texture fetch for a destination pixel can be performed in a single texture load operation using a "32-bit fetch". This will actually fetch 128-bits, which is 4 AA samples' worth of colour data. The AA resolve shader then needs to "unpack" these samples.
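The unpacking step might look like this in a CPU-side sketch, with Python's `struct` standing in for the raw 128-bit fetch (byte order and packing are assumptions for illustration):

```python
import struct

# Unpack one 128-bit fetch into four 32-bit RGBA8 sub-samples and average
# them per channel. Little-endian packing is assumed for this sketch.
def unpack_and_resolve(fetched_16_bytes):
    # Split the 16 fetched bytes into four 32-bit samples.
    samples = struct.unpack('<4I', fetched_16_bytes)
    acc = [0, 0, 0, 0]
    for s in samples:
        for c in range(4):
            acc[c] += (s >> (8 * c)) & 0xFF
    # Integer average of the four samples per 8-bit channel.
    return tuple(v // 4 for v in acc)
```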

    Jawed
     
  13. ShaidarHaran

    ShaidarHaran hardware monkey
    Veteran

    Joined:
    Mar 31, 2007
    Messages:
    3,984
    Likes Received:
    34
    Rendering errors in CoJ, eh?
    http://www.geforce3d.net/index/node/35?page=0%2C5
    R600's not alone there.
     
  14. zealotonous

    Newcomer

    Joined:
    Feb 23, 2007
    Messages:
    29
    Likes Received:
    0
    I can't seem to locate the rendering issues you are presumably referencing. Can you be a bit more specific and tell us where to look?
     
  15. AlexV

    AlexV Heteroscedasticitate
    Moderator Veteran

    Joined:
    Mar 15, 2005
    Messages:
    2,528
    Likes Received:
    107
    Screen-door artifacts that no one seems to understand are inherent to the algorithm employed for anti-aliasing alphas and have jacksquat to do with this or that card. The R600's artifacts in CoJ were of a different sort (with the early drivers used for the first batch of reviews, it was blurrier than the G80 in a number of scenes, the most apparent being the one with the fireplace at the end), but I think they're fixed/significantly reduced now, as the benchmark produces fairly clean graphics with the R600 nowadays, in my experience.
     
  16. ShaidarHaran

    ShaidarHaran hardware monkey
    Veteran

    Joined:
    Mar 31, 2007
    Messages:
    3,984
    Likes Received:
    34
    Actually, after seeing the latest review of CoJ @ Geforce3D with screenshots comparing DX9 and 10, it looks like the screen-door/cheesecloth issue can be attributed to "soft edges", as this effect simply isn't seen in DX9 mode.

    Guess I jumped the gun on that one :oops:
     
  17. thambos

    Newcomer

    Joined:
    Sep 29, 2007
    Messages:
    194
    Likes Received:
    0
    Just popping in to say thanks.:smile2:

    Great article Rys, enjoyed reading it.
     
  18. Geeforcer

    Geeforcer Harmlessly Evil
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,297
    Likes Received:
    464
    Well, while the thread got bumped, I might as well ask the question: A year later, what do we make of R600? Back in the day, there was a lot of hissing whenever someone mentioned NV30, but in the grand scheme of things, the similarity between the two products, right down to the short lifespan, has proven to be remarkable. I guess the biggest difference is in how the two companies handled the situation: ATI seems to have admitted defeat right away and priced R600 to move, while Nvidia attempted to alter the perception of their products by means WaltC has written many a novel about.
     
  19. Arnold Beckenbauer

    Veteran

    Joined:
    Oct 11, 2006
    Messages:
    1,415
    Likes Received:
    348
    Location:
    Germany
  20. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Woah that's quite tasty.

    68-84% utilisation. Is it just coincidence that as the shader count per game goes up the utilisation goes up?

    One interpretation would be that the games with fewer pixel shaders tend to use static or dynamic branching (these games tending towards the use of "uber shaders"?). The branching may be directly hurting instruction-level parallelism due to short-ish blocks in the branches.

    Against this, if the population of shaders is very high, that would imply that each shader is short. In which case I would tend to expect relatively limited instruction-level parallelism, and therefore lower utilisation. But it isn't :???:
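For reference, those utilisation percentages are just filled ALU slots over total slots across R600's 5-wide instruction groups; a toy calculation of the arithmetic (the example schedule is invented):

```python
# Toy R600-style VLIW utilisation: each instruction group has 5 ALU slots,
# utilisation = filled slots / total slots. The schedule below is made up
# purely to illustrate the calculation.
def vliw_utilisation(slots_filled_per_group, slots_per_group=5):
    total = slots_per_group * len(slots_filled_per_group)
    return sum(slots_filled_per_group) / total

# A branchy shader with short basic blocks might only co-issue 2-4 ops:
print(round(vliw_utilisation([3, 4, 3, 5, 2]) * 100))  # → 68
```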

    Jawed
     

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.