ATI's idea on transistor budgets

Discussion in 'Architecture and Products' started by superguy, Feb 22, 2006.

  1. Bob

    Bob
    Regular

    Joined:
    Apr 22, 2004
    Messages:
    424
    Likes Received:
    47
    For all intents and purposes, 1 fragment == 1 thread and 1 vertex == 1 thread. This is how people code their shaders. Let's not mix up terminology here.

    If you think about it for a second, you'll see that this pattern is not very efficient: All L1 texture caches will contain roughly the same data. Instead of having the effect of adding up L1 texture cache sizes, you'll just be taking the max texture cache size as the only cache size of the system.

    More practically, if you have 4 quad pipes and each one has a 1 MB L1 texture cache (for the sake of argument), you want your access pattern to turn those 4 L1 caches into effectively a large 4 MB cache. This is much more effective than making all 4 1 MB caches act like just a single 1 MB cache (thus wasting ~50 million transistors).
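    To make that concrete, here's a rough sketch of the kind of interleaved tile-to-pipe assignment that keeps the per-pipe L1 working sets disjoint (the mapping and the pipe count are purely illustrative, not any real chip's scheme):

    ```cpp
    #include <cstdio>

    constexpr int kNumQuadPipes = 4;

    // Interleaved assignment: neighbouring screen tiles go to different pipes,
    // so the four L1 caches hold mostly disjoint working sets and act more like
    // one larger cache than like four copies of the same small one.
    int PipeForTile(int tileX, int tileY) {
        return (tileX + 2 * tileY) % kNumQuadPipes;   // simple illustrative mapping
    }

    int main() {
        // Print which pipe owns each tile of an 8x8-tile screen region.
        for (int y = 0; y < 8; ++y) {
            for (int x = 0; x < 8; ++x)
                std::printf("%d ", PipeForTile(x, y));
            std::printf("\n");
        }
        return 0;
    }
    ```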
     
  2. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    In deference to RoOoBo, I write "thread" these days instead of "batch" :grin: XB360/Xenos terminology is "vector", as a matter of interest - but that's hella confusing, I think. "Thread" is common in discussion of R5xx, too.

    I know this isn't the real pattern being used - but the trouble is, what is the real pattern? I was hoping you'd say... I've seen a fancy "snake-walk" somewhere but I can't put my finger on it.

    Hmm, I think there's a consensus round here that texture caches are in the KB, not MB. I saw a figure of 8KB for L2 in NV40 the other day (on another webby) - I can imagine that L1 is something like 1KB...

    The suggestion was also that L2 contains compressed texture data, while L1 holds the data in un-compressed form. Additionally, how is L1 organised? Is L1 actually distinct per fragment-texturing pipe, or is L1 shared for all fragments in the quad?

    Any good info on cache size and organisation?

    Jawed
     
  3. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,436
    Likes Received:
    264
    Jawed, he's just throwing out a number (1MB) for the sake of discussion. I wouldn't view it as anything other than that.
     
  4. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,436
    Likes Received:
    264
    I've heard Nvidia has up to 1024 pixels in a batch, but where did the 6 threads come from?
     
  5. JF_Aidan_Pryde

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    601
    Likes Received:
    3
    Location:
    New York
    Each quad gets its own thread. G70 has six quads.
     
  6. Dave B(TotalVR)

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    491
    Likes Received:
    3
    Location:
    Essex, UK (not far from IMGTEC:)
    Or vertices.
     
  7. Humus

    Humus Crazy coder
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    3,217
    Likes Received:
    77
    Location:
    Stockholm, Sweden
    :cool:
    Yes, with R2VB. Actually, depending on how many components you need for your vertices you can work in parallel with more. If you're working on a single channel attribute you can work on 192 vertices in parallel (well, actually even 768 if we consider MRTs).
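    Just to spell out where those figures would come from (my reading, assuming an R580-class part with 48 pixel shader units, 4-channel packing and 4 MRTs - the post doesn't state the counts):

    ```cpp
    #include <cstdio>

    int main() {
        // Assumed figures (not stated in the thread): R580-class part with
        // 48 pixel shader units, 4 channels per output, 4 render targets.
        const int pixelShaderUnits = 48;
        const int channelsPerUnit  = 4;   // pack a single-channel attribute into RGBA
        const int mrtCount         = 4;   // one packed output per render target

        const int perPass  = pixelShaderUnits * channelsPerUnit;  // 192 scalar attributes per pass
        const int withMrts = perPass * mrtCount;                  // 768 if every MRT carries more packed data

        std::printf("packed per pass: %d, with %d MRTs: %d\n", perPass, mrtCount, withMrts);
        return 0;
    }
    ```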
     
  8. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,928
    Likes Received:
    230
    Location:
    Seattle, WA
    I'm not sure you can really say that, Humus. Going beyond 96 parallel ops requires that you can pack your operations into 4-vectors, and you can't do that with every operation. Going beyond 192 would require that once you have calculated one attribute, calculating another is a simple manipulation, and even then it's not really 768 completely in parallel.
     
  9. jpaana

    Newcomer

    Joined:
    Jul 31, 2002
    Messages:
    154
    Likes Received:
    2
    Location:
    Tampere, Finland
    Morton and Peano-Hilbert orders are pretty nice for texture cache hit ratios and such.
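    For anyone curious, a minimal sketch of Morton (Z-order) addressing: interleave the x and y bits of the texel coordinates so that spatially close texels land close together in memory. Hilbert order serves the same purpose with a more involved recursive mapping.

    ```cpp
    #include <cstdint>
    #include <cstdio>

    // Spread the low 16 bits of v apart so there is a zero bit between each bit.
    static uint32_t Part1By1(uint32_t v) {
        v &= 0x0000ffff;
        v = (v | (v << 8)) & 0x00ff00ff;
        v = (v | (v << 4)) & 0x0f0f0f0f;
        v = (v | (v << 2)) & 0x33333333;
        v = (v | (v << 1)) & 0x55555555;
        return v;
    }

    // Morton index: y bits in the odd positions, x bits in the even positions.
    static uint32_t MortonEncode(uint32_t x, uint32_t y) {
        return (Part1By1(y) << 1) | Part1By1(x);
    }

    int main() {
        // A 4x4 block of texels maps to addresses 0..15 in Z-order.
        for (uint32_t y = 0; y < 4; ++y) {
            for (uint32_t x = 0; x < 4; ++x)
                std::printf("%2u ", MortonEncode(x, y));
            std::printf("\n");
        }
        return 0;
    }
    ```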
     
    Jawed likes this.
  10. Bob

    Bob
    Regular

    Joined:
    Apr 22, 2004
    Messages:
    424
    Likes Received:
    47
    Indeed. It doesn't matter what the actual numbers are. I was just trying to illustrate a point: you'd see pretty much the same result no matter how large (or small) your caches were.
     
  11. vember

    Newcomer

    Joined:
    Mar 20, 2004
    Messages:
    48
    Likes Received:
    2
    so, where is this R2VB I keep hearing about?
     
  12. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,928
    Likes Received:
    230
    Location:
    Seattle, WA
    Render to Vertex Buffer? It's been supported by ATI since the R300, I believe.
     
  13. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
  14. Basic

    Regular

    Joined:
    Feb 8, 2002
    Messages:
    846
    Likes Received:
    13
    Location:
    Linköping, Sweden
    Do they really use Peano-Hilbert? At first glance it looks a bit better than Morton. But when thinking some more about it, I just see more complex addressing without any practical benefit.
     
  15. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,303
    Likes Received:
    137
    Location:
    On the path to wisdom
    L1 is likely 256 bytes per quad.
     
  16. arjan de lumens

    Veteran

    Joined:
    Feb 10, 2002
    Messages:
    1,274
    Likes Received:
    50
    Location:
    gjethus, Norway
    As far as I can see, Peano-Hilbert is 'better' in that diagonally opposite blocks are not stored next to each other in memory, thus avoiding the most severe locality problem of Morton. This may have a small beneficial effect if the texture map is not aligned to DRAM or virtual-memory pages (fewer page breaks/page misses) or if the size of the individual texel is not a power of 2, but other than that, the effect should be zero.
     
  17. Basic

    Regular

    Joined:
    Feb 8, 2002
    Messages:
    846
    Likes Received:
    13
    Location:
    Linköping, Sweden
    Yes, that's what I meant. At first it seems as if the better coherency would help. But when thinking more about it, it seems it'll only help in cases that never happen. Are there any non-power-of-2 texture formats? Will a gfx card ever do reads, writes or memory allocation for textures that aren't well aligned with power-of-2 blocks? I doubt it, even for virtual memory.

    The only exception I can think of is non-power-of-2 sized textures. But address swizzling doesn't work well with those anyway.
     
  18. RoOoBo

    Regular

    Joined:
    Jun 12, 2002
    Messages:
    308
    Likes Received:
    31
    Hilbert order? There was a paper at Graphics Hardware 2001 from McCool and others about implementing a recursive rasterization method that made use of Hilbert order. But it didn't go into the benefits of Hilbert order as a pattern for texture or framebuffer access. The key reason for Hilbert order was that the recursive algorithm could be implemented without a large stack, using an automaton that would walk the whole framebuffer along a Hilbert space-filling curve.

    For the simulator I also use recursive rasterization, but as I gave up on implementing something that realistic (researching rasterization wasn't that interesting), the algorithm and the recursion live only on the emulation side; the simulator just requests n tiles of fragment quads per cycle and expects, on average, to get them. I think the different levels of tiles (down to the quad) are generated in Morton order now.

    After lazily implementing round robin for 'better' shader workload distribution, I moved to a workload distribution based on tiles (as reported for ATI) and later implemented a Morton distribution of those tiles in memory and between the quad pipelines. That significantly reduced the extra texture bandwidth consumption and the imbalance in memory accesses.

    The textures were always implemented with multiple levels of tiling and stored using Morton order at each level (down to the texel level).

    As for using the name 'threads' or whatever, it becomes really confusing. When discussing with the other people working on the simulator I end up talking about shader inputs (as we also have vertices and potentially any kind of input going into the shader units), fragments (since it's hard to remember that there's more than just fragments), threads, quads and groups. I tried to avoid 'batches' because the other people working here weren't reading Beyond3D or similar sources that use 'batch' for 'fragments in flight in the shader', and the use of 'batch' for primitives and for the unit of work sent to the GPU with no internal state changes was already established.

    Now I tend to use quad as the minimum work unit for fragment processing, with the shader 'thread' being a multiple of a quad. I then use 'thread', 'thread group' or just 'group' (tending to the latter to avoid confusion) for the group of quads that are scheduled together on a shader and take n cycles to complete. I would say that 'thread' for me is more the hardware concept of having a PC, a position in a schedule window and other related state that is shared by all the quads in a group. A group is just the workload assigned to a thread, which determines how the register storage is reserved. The available threads define the 'thread window' from which work is scheduled. You could have an architecture with the same number of threads but a different group size (R520 vs R580). Then another parameter would be how many of those quads in a group are processed in parallel in a shader (the number of ALUs).

    For very large 'batches' of quads that are scheduled together (what old ATI GPUs and NVidia GPUs still seem to use) I don't like the terms thread or group, because there isn't any scheduling or true grouping. So I prefer to talk about queues (of quads), even if an implementation may be using and scheduling between a couple of those large queues.

    But those are just my personal preferences.
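    As a rough illustration of that terminology (the names, fields and numbers below are mine, not the simulator's):

    ```cpp
    #include <cstdint>
    #include <vector>

    struct Quad {                    // minimum unit of fragment work: 2x2 fragments
        uint32_t fragmentIds[4] = {};
    };

    struct ThreadState {             // what the hardware tracks per thread
        uint32_t pc = 0;             // program counter shared by the whole group
        uint32_t scheduleSlot = 0;   // position in the schedule window
        uint32_t registerBase = 0;   // where the group's register storage starts
    };

    struct Group {                   // the workload assigned to one thread
        ThreadState state;
        std::vector<Quad> quads;     // group size: how many quads share the thread state
    };

    int main() {
        // Same number of threads, different group size (the R520 vs R580 contrast
        // above); all numbers here are made up for illustration.
        const int threadWindow   = 128;
        const int quadsPerGroupA = 4;    // "R520-like" small groups
        const int quadsPerGroupB = 12;   // "R580-like" larger groups
        std::vector<Group> windowA(threadWindow), windowB(threadWindow);
        for (auto& g : windowA) g.quads.resize(quadsPerGroupA);
        for (auto& g : windowB) g.quads.resize(quadsPerGroupB);
        return 0;
    }
    ```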
     
    Pete likes this.