Larrabee's Rasterisation Focus Confirmed

Discussion in 'Rendering Technology and APIs' started by B3D News, Apr 24, 2008.

  1. nAo

    nAo Nutella Nutellae Veteran

    You misunderstood me: I'm not talking about how many quads from different triangles can be used to fill a batch.
    What I'm talking about is how many fragments (if any) belonging to different primitives can be shaded within the same quad (when possible).
     
  2. Jawed

    Jawed Legend

    :???: I'm still confused over the point you're trying to make... A quad (consisting of two triangles) can be as small as a single fragment or as big as the largest possible render target :???:

    Jawed
     
  3. Scali

    Scali Regular

    Quad as in 2x2 fragment block.
     
  4. Jawed

    Jawed Legend

    OK, so that's from 1 to 4 fragments and in theory could consist of as many as 4 independent triangles. I still don't see how this is distinct from batching, if older ATI hardware was only rasterising 1 triangle per batch. Oh well.

    Maybe he's referring to a case where say 20 triangles all fit within a quad and ATI GPUs were faster at weeding out the 16 triangles that are irrelevant? I hadn't thought of that and that's indeed interesting, particularly if the NVidia hardware is ~1/4 the speed in this case, say.

    Jawed
     
  5. 3dilettante

    3dilettante Legend Alpha

    What is a core then?

    Larrabee's individual cores do seem to fully meet the idea of what a core would be, but G80's clusters aren't fully equivalent to an independent core.
    It is also possible that G80 has buffers and local store in areas that are not assigned per-cluster, but I don't know how to quantify them.

    I'm not sure.
    It may not be a matter of any one big problem as opposed to dealing with a lot of small stubborn ones.
    If the engineers see a huge chunk of Larrabee's cores are doing the exact same thing and taking a while to do it, the temptation will exist to optimize something in the successor.

    That's just about one major break with the past in the last 5-6 years, right? Intel's done about as many in the same time frame, from a hardware POV.

    Depending on how Intel manages its threading, it might be able to sustain the performance in the neighborhood of Silverthorne per clock, if it is truly optimized for SMT and can sustain a full dual-issue in integer code. Vector code is something difficult to compare as of yet, though it seems obvious Larrabee would be massively better.

    If Larrabee uses a coarser threading, it could be something like a quarter of that, in the case of fine-grained round-robin threading, or something in the middle, if it has issue restrictions that limit the mix of instructions that can execute simultaneously.

    It seems unlikely Larrabee can sustain 4 threads-worth of instruction issue in SMT mode, not without seriously over-porting the instruction and data caches (there's that port thing again).
    Perhaps Larrabee's designers assume 2 scalar and 2 vector threads, with stalls sufficient to make it so only one of each type is ever full bore.
    Another possibility is a 1 vector + 1 scalar thread combo, with the other two thread slots being almost "spares" that allow for switchouts in the background.

    Z was one of them. I was also thinking of buffers like the transfer buffer R600 uses to feed data back into the pipeline without going to the ROPs, and even G80's parallel data cache. The latter isn't necessarily specialized, but it is explicitly addressed and separate from the cache hierarchy.
    Any other buffers that allow GPUs to feed results back into the graphics pipeline without going to memory would count as well.
    A general cache hierarchy can do the same, of course, but it becomes a question of where on the cost/utility curve an implementation would fall.
    If future workloads truly penalize GPUs for their inflexibility, Larrabee will be compelling.
    If the shift takes too long, silicon implementations of the GPU pipeline will have time to evolve.

    Depends on what you define as a core. I've shied away from making comparisons on a per-core as opposed to per-chip basis because I don't consider G80's clusters to be full cores.
    Figuring out what is comparable is a doozy of a problem.
    G80 has 64 TUs, but only sets up 32 separate addresses, and those interface with a pretty small L1.
    I guess I'd say the 32 separate addresses point to a capability of a combination of porting and banking leading to 32 half-clocked read ports, as they run outside of the shader domain.
    If I were to say each ROP counted as a write port, it's about 24 at the base clock, perhaps?
    There are buffers on-chip that are even more restrictive in use, but which would take up a non-zero number of cache ports on an emulating core, but may do so only for a fraction of the time.
     
    Last edited by a moderator: May 5, 2008
  6. TimothyFarrar

    TimothyFarrar Regular

  7. trinibwoy

    trinibwoy Meh Legend

    Well of course the ALU's don't know or care where data is coming from or going to as they're not responsible for instruction issue.

    This patent goes into a lot of detail on instruction issue on G80. It clearly states that instructions to the MAD, TEX and SF pipelines are issued independently. You can simultaneously have a vertex thread instruction in the MAD and a pixel thread instruction in the SF.
     
  8. TimothyFarrar

    TimothyFarrar Regular

    Yeah, but how often does fragment re-packing into 2x2 quads actually happen? It does seem as if merging fragments from the same triangle would be possible (same plane equation), but what about other cases?
     
  9. Scali

    Scali Regular

    Don't the 2x2 blocks need to be all from the same triangle by definition, for the ddx/ddy things to work?
    Even if some of the fragments aren't actually inside the triangle?
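    The derivative mechanism being asked about can be sketched like this (an illustrative Python model, not any vendor's actual hardware): ddx/ddy come from finite differences between neighbouring fragments of the 2x2 quad, which is why all four fragments run the shader even when some lie outside the triangle ("helper" fragments).

    ```python
    # Sketch: how a 2x2 quad yields screen-space derivatives by finite
    # differences. All four fragments must evaluate the shader -- even
    # "helper" fragments outside the triangle -- so every fragment has
    # a horizontal and a vertical neighbour to difference against.

    def quad_derivatives(values):
        """values: 2x2 list [[top_left, top_right], [bottom_left, bottom_right]]
        of a shader quantity evaluated at each fragment of the quad.
        Returns (ddx, ddy) computed coarsely for the whole quad."""
        (tl, tr), (bl, br) = values
        ddx = tr - tl   # difference along x within the quad
        ddy = bl - tl   # difference along y within the quad
        return ddx, ddy

    # A texture coordinate u = 0.25*x sampled at the quad's pixel centres:
    vals = [[0.0, 0.25],
            [0.0, 0.25]]
    print(quad_derivatives(vals))  # (0.25, 0.0): u changes 0.25/pixel in x
    ```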
     
  10. Jawed

    Jawed Legend

    If you look at that presentation Trinibwoy linked you'll see that each multiprocessor contains an operand window. Going outwards one step, there's RF, constant cache and parallel data cache. Going outwards another step you get instruction cache.

    Separately the TMUs are connected to L1 cache and the register file(s) - if not the register files then the operand windows, hard to be sure. All the multiprocessors then share access to the L2s which are located in the ROP partitions. There's a suggestion that there's also memory in each ROP partition separate from L2.

    Clearly there'll be some similarities to this hierarchy in Larrabee, but for example I see it as unlikely there'd be a dedicated constant cache. Larrabee might lock lines for things like constants though :?:


    From those slides, here's all the points about caching:
    • Caches with decent behaviour
      • They are coherent
      • Relatively low latency (~10 clock L2)
    • Reasonable communication latency
      • <20 clock locks, semaphores
      • Primitives for thread synchronisation
    • Good bandwidth
      • 1TB/s on die aggregate bandwidth
      • >150GB/s off die bandwidth
    • Unified cache
      • Data sharing
      • Efficient, in-memory communication
      • Dynamic sizing (it is a cache)
    • Dynamic Cache partitioning
      • Private caches for high, aggregate bandwidth
      • Arbitrary data replication
      • Data swapping between partitions
      • Heuristic based configuration changes
      • Will likely come with non uniform access latencies
    Some diagrams show the L2 as being monolithic, outside of each of the cores. Separately some diagrams show L2 partitioned, per core (though still outside). I doubt it's truly a monolithic L2 with 10s of ports, instead it seems to me it'll consist of per-core L2s with the cores interconnected by the ring bus to effect sharing of L2 data to non-native cores.

    Yeah, just like cache behaviour improves with successive generations. I was reading this rather sobering diatribe the other day:

    http://x264dev.blogspot.com/2008/05/cacheline-splits-aka-intel-hell.html

    I can't remember where I read it but I discovered recently that Core 2's TLBs run out of entries rather easily, the worst case with the smallest entries being that only half of L2 ends up being used. Maybe that's a common issue with CPU caches :?:

    Presumably Nehalem does the job properly and Larrabee won't be so afflicted either.

    Yeah you could align ATI's breaks with major D3D revisions, 9, now 10 and perhaps another for 11 (though I'm doubtful there'll be much revision there - but I think NVidia's got quite an overhaul ahead of it). After 11 there may not be many more so Larrabee-architecture's adaptability could be moot...

    Presumably short term Intel doesn't have any interest in single-thread performance from the point of view of a general user. Single-thread performance only needs to keep up with what's needed to make their threading model work for GPU/GPGPU. So coarse grained, perhaps not even hyper-threaded :?:

    The brunt of utilisation/throughput focus will presumably be on the SIMD.

    G80 uses wide reads into its operand window. It fetches 16 operands per clock (16 instances of r0, say) while the SIMD can only consume 8 instances of r0 per clock. There's no reason why Larrabee can't window data in the same way. So the cache ports get wider rather than increasing in number.
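    A toy model of that windowing idea (the 16-fetch/8-consume numbers are taken from the post above; everything else is an assumption for illustration): one wide register-file read fills a small operand window, and the SIMD drains it over two clocks, so a single wide port substitutes for two narrow ones.

    ```python
    # Toy sketch of wide operand fetches feeding a narrower SIMD.
    FETCH_WIDTH = 16   # operands read from the register file per clock
    SIMD_WIDTH = 8     # operands the ALUs consume per clock

    def simulate(total_operands):
        """Count clocks the fetch port is busy vs clocks the ALUs run."""
        window, fetch_clocks, alu_clocks, fetched = [], 0, 0, 0
        while fetched < total_operands or window:
            if not window and fetched < total_operands:
                # one wide read refills the operand window
                n = min(FETCH_WIDTH, total_operands - fetched)
                window = list(range(fetched, fetched + n))
                fetched += n
                fetch_clocks += 1
            # the SIMD drains SIMD_WIDTH operands per clock
            window = window[SIMD_WIDTH:]
            alu_clocks += 1
        return fetch_clocks, alu_clocks

    print(simulate(32))  # (2, 4): the wide port is busy only half the clocks
    ```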

    For what it's worth I assume that the scalar and vector pipelines are practically independent of each other.

    Interestingly enough R600 operates fully independent vector and scalar pipelines. I imagine the instruction cache is common (since code is small anyway), but I presume that register file and memory read/write cache ports are dedicated for each type of pipeline.

    Everything appears to point towards Intel dedicating itself to a hardcore "super-L2". It might be version 2 or later before the damn thing actually works convincingly.

    I agree. What's going to be interesting is how much of a leading hand Intel takes in the functionality of D3D. Intel can bring pressure to bear, pulling graphics features forwards, forcing the fixed-function GPUs to spread their transistors too thinly. Then again, I'm sceptical that beyond D3D11 there's much in the way of fixed-function hardware that's going to be added to GPUs.

    Jawed
     
  11. Jawed

    Jawed Legend

    No the patent suggests that as a possible embodiment.

    If MAD and SF were able to be issued independently, why design a compiler to work around the co-issue, register-bandwidth and dependency-chain constraints of their design:

    Dynamic instruction sequence selection during scheduling

    Jawed
     
  12. Jawed

    Jawed Legend

  13. TimothyFarrar

    TimothyFarrar Regular

    TLB misses can be a big issue on some PPC chips as well. Easy enough to work around with larger page sizes. Also helps reduce the problem of ignored prefetch on TLB miss.
     
  14. trinibwoy

    trinibwoy Meh Legend

    Wow, that's some good work. It's crazy how those corner cases can kill performance across the whole chip. I wonder if anything similar ever happens in real workloads - hard to imagine one or two TPs getting stuck with much more expensive shaders than the rest of the chip.
     
  15. trinibwoy

    trinibwoy Meh Legend

    The patent I linked is far more detailed and specific to G80 than this one so I'm not sure how you can arrive at your conclusion while dismissing my own. Independent instruction issue to each of the execution units may be a possible embodiment but it's also the only embodiment referred to in the text and it forms the basis for everything that follows.

    It's also plainly stated that each issued instruction makes use of only one of the available execution resources so co-issue isn't in play at all. In particular this excerpt emphasizes the issue of instructions from different threads to ensure that there are instructions available for each execution unit.

    The patent you linked seems to be very generalized and not specifically related to enabling co-issue or alleviating register bandwidth. Any particular reason why you think this patent is even related to the G8x compiler?

    This should be pretty easy to test right - to see if dependent SF instructions from warp A fill the SF pipe while MADs are running from warp B. If I knew the first thing about writing a simple shader I'd do it myself but alas....
     
  16. Jawed

    Jawed Legend

    This patent's main concept is that two or more pipelines, each having either different latency or issue rates or both can be issued from the same instruction buffer.

    The clearest, simplest embodiment that is known to meet this is an ALU pipeline + a TMU pipeline.

    A co-issue across two ALUs that have the same issue rate is logically the same as a single instruction. The ALUs have an issue rate of 1 every 2 ALU clocks. ALU + TEX issue is the classic asynchronous texturing case.

    Additionally it's not clear how instructions are scheduled when considering the cluster as a whole. Since a cluster contains two ALU SIMDs, two TA SIMDs and one TF SIMD (G80) or two TF SIMDs (everything else), how are operands for the TAs sourced (since they're known to come from the register file corresponding to the originating batch) and how are returned results from the TMUs mapped to registers? Is there a hierarchy of instruction buffers, one for TMUs and another for ALUs or are all of these mangled together into one ginormous instruction buffer for all SIMDs? Can't say I like the sound of either...

    It talks about choosing optimal instruction sequences e.g. it can be advantageous to break a MAD into MUL + ADD. You can also read this as implying the use of the "missing MUL". Instruction dependency and read-after-write hazards both come into play (though the latter doesn't directly impact compilation - merely a good compilation can reduce RAW by increasing instruction-level parallelism).

    Remember that a pixel shader batch (32 fragments) actually consists of two true warps - warps are really 16 wide in G80. Ever since my original discussion with Bob on the subject of instruction issue my view has been that the two warps that make up a pixel shader batch (or CUDA "warp") issue as mirror pairs. So that, indeed, Warp A issues on MAD while Warp B issues on SF and then they mirror so that Warp A issues on SF while B issues on MAD. But the point I'm making is that this is a static optimisation - it doesn't allow for random pairs of warps to mirror at any given time. Two warps are bound together for the lifetime of the shader on either warp.
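    That static mirroring hypothesis (Jawed's reading, not documented behaviour) can be written down as a trivial schedule: the two 16-wide warps swap units every issue slot, so the MAD and SF pipes each stay occupied without any dynamic pairing of random warps.

    ```python
    # Sketch of the static "mirrored pair" issue described above: two
    # 16-wide warps bound together for the shader's lifetime alternate
    # between the MAD and SF pipes each issue slot.

    def mirrored_issue(n_slots):
        schedule = []
        for slot in range(n_slots):
            if slot % 2 == 0:
                schedule.append(("warp_A:MAD", "warp_B:SF"))
            else:
                schedule.append(("warp_A:SF", "warp_B:MAD"))
        return schedule

    for slot in mirrored_issue(4):
        print(slot)
    # Every slot keeps both pipes busy; A and B simply swap units,
    # so no scheduling decision is needed beyond the initial pairing.
    ```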

    This mirroring may be related to why only 4 points, not 8 points, actually make up a pixel shader batch according to that page Timothy linked. Dunno.

    Jawed
     
  17. 3dilettante

    3dilettante Legend Alpha

    The instruction cache is populated by the setup hardware in the global scheduler.
    The setup and context management are not managed by the clusters, which means they are not fully realized as cores.
    That presentation is pretty focused on the ALU portion of the clusters, and it doesn't really cover the more specialized graphics hardware such as the rasterization and ROP portions of the chip. In addition it doesn't cover the scheduling FIFOs that exist above the per-cluster schedulers.
    Those other storage sections on the chip would map to a generalized cache port at least part of the time while emulating, if no specialized hardware exists.

    That might help with keeping constants on chip. Using them still involves taking an explicitly separate register pool and folding it into the x86 register set and cache storage. This still would involve a demand load that could have gone to another operand fetch, though I'm not clear on whether G80 and R600 can do much better.

    One possibility could be to use dynamic compilation to feed constants into the code stream and fold them into the instruction cache.

    I wouldn't say it was discovered recently, and it doesn't prevent half the L2 from being used.
    The TLB entry count wasn't really top-secret, and AMD has touted for some time that their bigger TLB setup was better for large workloads as a way to differentiate Opteron over Woodcrest.

    Whether it's common for a desktop app these days to go for the tiny page sizes is a good question. I think for average desktop and portable loads it isn't that bad, otherwise Intel would have expanded the TLBs.

    Anyway, the full penalty of having small TLBs isn't that half the L2 can't be used, but that there is an additional latency penalty when a non-covered address is accessed.
    There's a very good chance the evicted TLB entry is in the L1 or L2 anyway. So it just means Intel's otherwise very fast cache is in rare instances slower.

    Nehalem pretty much removes Opteron's TLB advantage.
    As for Larrabee, I'm not even sure its workload is going to use so many pages anyway. Shader code could probably sit in a few large pages, and a lot of textures are bigger than the minimum page size anyway.
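    The large-page point is just arithmetic. A back-of-envelope calculation (entry counts here are illustrative round numbers, not specific to any stepping or to the thread) shows why a few big pages cover far more of the working set than many small ones:

    ```python
    # Back-of-envelope TLB reach: address space covered without a miss.
    # Figures are illustrative only.

    def tlb_coverage(entries, page_size):
        """Bytes of address space covered by a fully populated TLB."""
        return entries * page_size

    KB, MB = 1024, 1024 * 1024
    small = tlb_coverage(256, 4 * KB)   # 256 entries x 4 KiB pages
    large = tlb_coverage(32, 4 * MB)    # 32 entries x 4 MiB pages
    print(small // MB, "MiB vs", large // MB, "MiB")  # 1 MiB vs 128 MiB
    ```

    With 4 KiB pages the TLB reach can end up smaller than the L2 itself, which is where the "only half the L2 gets used" impression comes from; shader code in a few large pages sidesteps the whole problem.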

    Larrabee's vector registers are already 512 bits wide, which does match the width of an L1 cache line very nicely.
    For a single register fetch, we already have an awfully wide (physically) port.
    The individual threads are independent, so in an SMT case, there's no reason why they couldn't try to issue a vector load each.
    To really service that, Larrabee's cache would need 4 512-bit ports on an awfully small core.
    I think that the more likely design decision would be to limit the sustained throughput of the design to maybe 2 threads-worth at most.

    The physical constraints of pulling that much data in at once would be troublesome.

    Given all the things it has to do, the L2+ring bus control hardware is sounding downright specialized by this point.
     
    Last edited by a moderator: May 5, 2008
  18. Jawed

    Jawed Legend

    I was just trying to point out that G80 has a hierarchy of port operations and it time-splats wide reads in order to get around port-count limitations in general. I don't think a port-analysis of G80 is particularly transferable to Larrabee, though.

    R600 supports in-opcode literals (4x 32-bit per VLIW instruction usable across any of the 5 lanes). It also supports reading either from a relatively small buffer of indexed constants - or it supports reading constants from a set of cache lines that are "locked" for the duration of an ALU clause (i.e. for a period lasting from 1 to 128 ALU clocks). The Sequencer is responsible for setting up the cache lock.

    The idea is to hide fetch latency for constant cache lines (just like hiding texture latency) and to provide a degree of constant indexing.

    As far as I can tell G80 does pretty much the same.

    Dynamic compilation is the approach used by NV40. This causes huge state change overheads (recompilation of the entire shader) when a constant is changed and apparently tended to restrict the use of constants (since the performance hit was quite noticeable).

    By contrast D3D10 constant buffers are meant to support large data structures and varied scopes (frequency of update) in the rendering pipeline. Indexing into constant buffers (there are 4096 elements allowed per constant buffer) is a key concept.

    :lol: I discovered this recently in some article, no big deal :lol:

    I dare say it seems that most apps are so undemanding that no-one seems to know better.

    Did you read that blog entry? I know it's about unaligned accesses, but the fact is C2 had half the performance it should have had - while Opteron whizzed through with barely a bump. It just seems Intel has got away with things like this in the consumer space because it's not very demanding...

    And if each L2 is localised to a core the TLBs will be localised and reasonably efficient I guess.

    R600 has an enormous operand bandwidth in comparison :smile: But then each SIMD is way wider.

    Don't forget that there's some kind of gather mechanic in Larrabee (and scatter). I presume this goes between L1 and the SIMD.

    I dare say I'm blasé simply because this whole thing is about throughput.

    So it should be!

    Jawed
     
    Last edited by a moderator: May 6, 2008
  19. trinibwoy

    trinibwoy Meh Legend

    You're ignoring the text which clearly identifies three different types of operations corresponding to the TEX, ALU and SF pipelines. I'm not sure how much more explicit it can get than this.

    Sure, you can read it as many things, but the concepts described apply to the general case of instruction re-ordering and aren't particularly focused on anything related to the "missing MUL" co-issue.

    Ok, I'll try to track down that exchange. Haven't seen any documentation that points to that though.....
     
  20. Mintmaster

    Mintmaster Veteran

    When you scan convert a triangle into quads, lots of the quads don't have all four pixels (or, in the case of 4xAA, all 16 subsamples) inside the triangle. A coverage mask is sent with each quad so that the ROPs only update the appropriate pixels.

    When you have two connected triangles, quads intersecting the common edge will be generated in both scan conversions, with opposite coverage masks. nAo is saying that if the two triangles have near-equal derivatives, one of these quads gets merged into the other to eliminate redundant work. I'm surprised that this is true, but I presume he has good reason to make this claim.

    Bob is talking about something different from what nAo is. A pixel shader has to run through all the quads of a batch in lock-step. The quads can, however, come from different triangles - according to Bob, up to 20 different triangles. Basically, this means that the iterators can access up to 20 different sets of interpolant data.

    nAo is talking about the case where several of these quads are actually located at the same spot on the screen and simply have different coverage masks. Sometimes they can be coalesced into fewer quads if the conditions are right.
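    The coalescing nAo describes can be sketched in a few lines (a simplified model, assuming each quad carries a 4-bit coverage mask, one bit per fragment; a real rasteriser would also have to check that the two triangles' derivatives are close enough):

    ```python
    # Sketch of quad merging: two quads generated at the same 2x2
    # screen location by adjacent triangles carry complementary
    # coverage masks and can sometimes be coalesced into one quad.

    def try_merge(quad_a, quad_b):
        """Each quad is (screen_pos, coverage_mask). Merge when the
        positions match and the coverage masks don't overlap."""
        pos_a, mask_a = quad_a
        pos_b, mask_b = quad_b
        if pos_a == pos_b and (mask_a & mask_b) == 0:
            return (pos_a, mask_a | mask_b)  # one quad instead of two
        return None

    # Two triangles sharing an edge through the quad at (10, 4):
    left  = ((10, 4), 0b0011)  # triangle 1 covers two fragments
    right = ((10, 4), 0b1100)  # triangle 2 covers the other two
    print(try_merge(left, right))  # ((10, 4), 15): half the shading work
    ```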
     