Larrabee's Rasterisation Focus Confirmed

Discussion in 'Rendering Technology and APIs' started by B3D News, Apr 24, 2008.

  1. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    Sorry, no. Just pointing out that I think porting on L2 is less exciting than it first appears.

    I think you can partition this problem into per-core.

    I'm now fairly confused :oops: over what's going to be fixed function in Larrabee (rasterisation or texture addressing/fetching or texture addressing/fetching/filtering). Clearly the fixed function stuff depends on the cache hierarchy.

    I certainly won't disagree.

    Apart from performance?

    Presumably Intel has its eye on D3D11, so the question is, after D3D11 what kind of new fixed-function units are going to turn up that break Larrabee performance so badly (when implemented in software) that it needs reworking. Anyone got any ideas?...

    For the GPUs? Interestingly if you examine AMD's IL (for CAL) and Nvidia's PTX (CUDA) there's a lot more general purpose capability in there than might be expected.

    R600 evolved pretty much every unit specifically for D3D10. There are plenty of breaks from the past (e.g. the entire L2-based texture system). And the virtual memory system affects every part.

    Ah, I wasn't thinking of x86 baggage. It'll be interesting to see what single-thread performance is like, that's for sure.

    Video decode apparently looks better on R600 than UVD-equipped GPUs because of its muscle and programmability.

    It'll be interesting to see whether programmable rasterisation comes along any time soon:

    http://gamma.cs.unc.edu/logpsm/logRasterizationGH.ppt

    so, once again, the question needs asking: why have fixed function rasterisation when advanced algorithms want anything but?

    I've seen various papers on the subject of compression for floating-point textures. Again, something that's going to work immediately on Larrabee but is perhaps years away on a traditional GPU (ok, Larrabee's currently years away).

    Some aspects of compression (e.g. texel de-/compression) will prolly find themselves as "fixed function" instructions - so yeah, that's fixed function in the strict sense. "Compression" for colour/Z in memory is arguable.

    I'm not sure what kind of specialised buffers you're thinking of. Hierarchical-Z?

    It's a tricky question actually as ASPs for GPUs appear to have been falling dramatically (i.e. unexpectedly) for a while now. That's definitely something that's going to put Larrabee on its back foot - something that Intel couldn't have planned for when it started out.

    Jawed
     
  2. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    so how many cache ports do you think G80 has per core?

    Aaron Spink
    speaking for myself inc.
     
  3. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    I wonder how they will handle the vectorized loads (i.e. texture loads). Lots of 64-bit wide ports? Or perhaps vectorized loads are serviced from the LSU? (Having it act as an L0 texture cache.) They have enough threads to hide the cost of filling the LSU with wider loads from L1.
     
  4. Barbarian

    Regular

    Joined:
    Jun 27, 2005
    Messages:
    289
    Likes Received:
    15
    Location:
    California, USA
    Somehow I doubt the rasterizer will be fixed function.
    Back in the Pentium 1 days people had scanline DDA rasterizers that took just a few cycles per fragment. The limiting factor was the divide, which I'm certain would be solved with some reciprocal estimate in the vector unit. I hear Michael Abrash is on the team, and he is definitely the guy who can pull it off (for reference, check out the Quake 1 source code and his insane 4-cycle-per-pixel-fdiv-in-flight render loop).

    Additionally, Larrabee doesn't have to rasterize at all and can instead just subdivide to subpixel level. It's not even limited to just triangles.

    On the other hand I do hope they have some FF to help with texture address generation - something that can convert a normalized UV to a swizzled texture address in a few cycles and maybe compute bilinear weights, etc.
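    The scanline DDA idea above can be made concrete with a toy sketch (my own illustration, not anything from the Larrabee team): sort the vertices top-to-bottom, compute one slope per edge - the only divides, and exactly the spot where a vector reciprocal estimate plus refinement would go - then step the left/right x-intercepts down the scanlines:

```python
def rasterize_scanline(v0, v1, v2):
    # Toy scanline DDA rasteriser: flat fill only, no sub-pixel
    # correction, no attribute interpolation, no fill-rule handling.
    v0, v1, v2 = sorted((v0, v1, v2), key=lambda v: v[1])  # top-to-bottom
    (x0, y0), (x1, y1), (x2, y2) = v0, v1, v2

    def slope(xa, ya, xb, yb):
        # One divide per edge - the natural candidate for a vector
        # reciprocal estimate plus a Newton-Raphson refinement step.
        return (xb - xa) / (yb - ya) if yb != ya else 0.0

    long_dx = slope(x0, y0, x2, y2)  # edge spanning the full height
    top_dx = slope(x0, y0, x1, y1)
    bot_dx = slope(x1, y1, x2, y2)

    pixels = []
    for y in range(int(y0), int(y1)):        # upper trapezoid
        lo, hi = sorted((x0 + top_dx * (y - y0), x0 + long_dx * (y - y0)))
        pixels += [(x, y) for x in range(int(lo), int(hi) + 1)]
    for y in range(int(y1), int(y2) + 1):    # lower trapezoid
        lo, hi = sorted((x1 + bot_dx * (y - y1), x0 + long_dx * (y - y0)))
        pixels += [(x, y) for x in range(int(lo), int(hi) + 1)]
    return pixels
```

    The inner loop is pure adds and compares per scanline, which is why per-fragment cost could be so low on a Pentium once the per-edge setup divides were hidden.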
     
  5. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,724
    Likes Received:
    195
    Location:
    Stateless
    A lot of people here seem anxious about the x86 overhead.
    How much improvement (for the MIMD part of the core) would you expect if the design were based on PPC (or a VLIW design, like ARM)?
    (We have information about Atom; some must have information about, say, Xenon (in-order PPC) - maybe someone could extrapolate?)
     
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    Just noticed this thread's article mentioned logarithmic rasterisation :smile:

    Yeah, that's what I've been assuming - due to the previous round of discussions on the subject of FF hardware a while back.

    Jawed
     
  7. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,057
    Likes Received:
    3,114
    Location:
    New York
  8. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    That's very useful, ta. Doesn't include things like early-Z rejection and MSAA but it's a nice snapshot of a "pre-DX8" GPU, I guess.

    155 G operations per second for a 2001-era GPU. I dare say it's misleading to contemplate the programmable capability from back then since GPUs' programmability was extremely limited.

    Assuming the figure stated for 7900GTX is modelled the same way, 2100 Gops/s compares with 302GFLOPs of vertex+pixel shading. One could argue that G71's math ALU utilisation was so poor that 302 GFLOPs should be treated as being more like 175-200.

    Jawed
     
  9. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    Jawed, what you have been saying is starting to make more and more sense. If we assume that Larrabee has minimal fixed function and go with ~20% in-shader work combined with ~80% overhead, the idea of hiding texture cache misses behind overhead work with only 4 hyperthreads now almost makes sense to me.

    Presumably, there is no point in adding much fixed function when it would only reduce the work load which is running (for free) in the background of (texture) cache misses. Address generation / fetch really doesn't map well to software, while filtering does, so perhaps filtering stays in software.

    But, speaking of single float instructions/sec (peak unobtainable capacity), at the high end we are left with a ~960M instructions/sec part which provides only ~192M instructions/sec for actual rendering shader work. Which would only be comparable with a 9800 GTX (not even top end now) at about ~216M instructions/sec. At Larrabee launch time I think we can expect at least >432M instructions/sec on top end GPUs.

    Which makes Larrabee a lot less interesting for rendering.
     
  10. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    I can see one aspect of fixed function being very very important in the future. With cache line size, SIMD vector width, and main memory granularity all increasing, fixed function does provide an important feature: grouping tasks into vectorized bundles with high data locality, scheduling between multiple concurrent programs so that memory+ALU latency can remain hidden, and un-grouping/re-grouping the output of vectorized tasks (ROP/output merger). All of this is VERY useful for GP computation.

    With performance increasing and display resolution growth slowing down, micro-polygons become more important. I could see FF evolving into something which handles grouping of individual pixels into SIMD groups (say, 32 pixel-sized micro-poly "threads" per vector). This type of work would be tough to do well in SIMD vectorized software.
     
  11. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    I think it's best to think of "80% overhead" as the worst case. Did you look at the pipeline utilisation plots I referenced earlier? There are significant gains to be had in "overhead throughput" when practically the entire rendering pipeline is in software. The stencil shadow volume rendering example has 40% higher throughput.

    I don't know what a realistic throughput gain for Larrabee would be in typical games but I'm sure there'll be substantial gains simply because so many rendering passes in modern games utilise the entire pipeline poorly. The rendering passes for shadowing and post-processing both present a rather uneven workload and deferred rendering also presents the pipeline with great swings in utilisation over the course of rendering a frame.

    Those figures are in line with what I posited earlier, about 2x the performance of 9600GT.

    But, if the throughput gains due to the software pipeline amount to only 20%, then Larrabee has just doubled its programmable capability, from ~200M instructions per second to ~400M. That's what makes Larrabee interesting :grin:

    Even if the bandwidth won't be there :razz:

    I imagine that more traditional GPUs could see a big boost in throughput if they were to trade ROPs for programmable ALUs - colour fillrate is very rarely maxed out and blending is avoided like the plague. R600 took a step here with MSAA resolve but there's lots more to do. NVidia, for a long time now, has been running attribute interpolation on the programmable ALUs.
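    To illustrate how little math the ROP blend stage actually is (a hypothetical sketch, not any vendor's implementation): source-over blending is just a handful of multiply-adds per pixel - exactly the kind of straight-line work that could migrate onto the programmable ALUs:

```python
def blend_over(src_rgba, dst_rgba):
    # Classic (non-premultiplied) source-over alpha blend, written as
    # plain shader math: out = src * sa + dst * (1 - sa).  A ROP does
    # this in fixed function today; nothing here needs dedicated logic.
    sr, sg, sb, sa = src_rgba
    dr, dg, db, da = dst_rgba
    inv = 1.0 - sa
    return (sr * sa + dr * inv,
            sg * sa + db * 0 + dr * 0 + dg * inv if False else sg * sa + dg * inv,
            sb * sa + db * inv,
            sa + da * inv)
```

    Four multiply-adds per channel-group - trivially vectorisable across pixels, which is why software blending on a wide SIMD is plausible whenever colour fillrate isn't the bottleneck.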

    Jawed
     
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    This is interesting. What you seem to be referring to is a FF unit that manages the creation/lifetime of objects in each thread (vector, batch). This kind of fixed function unit doesn't sound graphics-specific to me. I know that sounds like quibbling, but I think it's important to separate graphics-specific FF units from stuff that makes a data-parallel processor perform efficiently. Caches are fixed function too.

    I imagine you're prolly thinking that this unit could also be used to shuffle objects from one thread to another, e.g. to take account of culled/killed objects (re-packing 15 threads down to 6, say) or to reduce the impact of dynamic branching divergence by sorting objects by predicate. Dunno if this kind of conditional routing unit would end up tied in knots though.

    Generally speaking I view Larrabee as running "pipeline OS" threads on one or more scalar pipes whose task is to evaluate workload and distribute work. Presumably one of the important features of the pipeline OS will be to manage the operation of the L2s, e.g. pooling of L2s or tiling of resources across several L2s.

    Jawed
     
  13. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,493
    Likes Received:
    474
    SIMD Ray Stream Tracing - SIMD Ray Traversal with Generalized Ray Packets and On-the-fly Re-Ordering
     
  14. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    Very nice, ta :grin:

    One very key sentence there:

    In general I think this is required for all SIMD computation in the future: the SIMD executes the same instruction across all lanes, but the program counter and operands "seen" by each lane can be different (the lanes are actually agnostic about both the PC and the origin of their operands). It is up to the gather and scatter units to deal with the time-and-space transformations that enable the SIMD to run at full utilisation.

    Jawed
     
  15. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    The PC is the same for each slot in a SIMD register.

    Frankly their study is so obvious that I find it almost embarrassing - I guess RT research reached its pinnacle many years ago.
    While this approach may very well be a winning one, it still requires a hardware architecture that supports it and is fast at it at the same time.
     
  16. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    All I'm saying is that instruction execution, multiply, compare etc. doesn't use PC - it's gather/fetch and scatter/store that worry about PC, indirection etc.

    I'm simply suggesting that the boundaries of a SIMD will be tightened to instruction execution only and data manipulation happens separately in a quasi-pipelined fashion in units sitting in front of and behind the execution SIMD.

    There's already a simple version of this going on in G80, where operands are fetched out of order for submission to the execution SIMD. The ALUs don't interact with the register file directly.
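    The split being described - a "dumb" execution SIMD with a gather unit in front and a scatter unit behind - can be sketched in a few lines (purely illustrative; the function and its layout are my own invention, not a description of any real pipeline):

```python
import operator

def simd_exec(op, idx_a, idx_b, idx_out, memory):
    # Gather unit: fetch each lane's operands from per-lane addresses.
    a = [memory[i] for i in idx_a]
    b = [memory[i] for i in idx_b]
    # Execution SIMD: one instruction across all lanes; the ALUs never
    # see the PC or touch memory/the register file directly.
    results = [op(x, y) for x, y in zip(a, b)]
    # Scatter unit: write each lane's result to a per-lane address.
    for i, r in zip(idx_out, results):
        memory[i] = r

# e.g. 4 lanes adding memory[4..7] into memory[0..3] in place:
mem = list(range(8))
simd_exec(operator.add, [0, 1, 2, 3], [4, 5, 6, 7], [0, 1, 2, 3], mem)
# mem[:4] is now [4, 6, 8, 10]
```

    The point of the structure: all the irregularity (per-lane addresses, divergence, re-packing) lives in the gather/scatter stages, while the ALU array stays perfectly regular.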

    Jawed
     
  17. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    What I am talking about is very graphics related, and would simply be an evolution of current hardware to better support very small triangles.

    Current GPUs are somewhat good at packing 2x2 quads of triangles into SIMD vectors for fragment computation, but not very good at packing points. I believe someone experimentally figured out that G80 only packs 4 points into a "32 thread warp", which hints that perhaps the ROP/OM works at 8-pixel granularity.

    Now take your current triangle setup and rasterisation pipeline and make it more efficient at fragment packing, such that it could actually pack 32 pixels into a "warp", or better pack fragments for very small triangles. It would still be up to the program (i.e. software) to ensure that the primitive input stream had good data locality (especially for points), else the output merger would be very cache-unfriendly.

    Who knows, in hardware what I am suggesting might not even be possible or desired from an efficiency perspective (losing the fixed 2x2 quad granularity could be bad for the texture samplers).

    One thing I am NOT suggesting is re-grouping (during thread scheduling) of fragments based on very expensive divergent branches. Still think it would be better to use the proposed better setup packing and a second pass (stencil) to deal with those fragments.

    In GPGPU terms, this provides a very fast scatter, which is needed: for example, currently on GPUs it is faster to work backwards and search (gather) to do data compaction than it is to scatter (google histopyramids). I literally see the fixed function GPU pipeline, for GPGPU, as a way to group and start highly parallel tasks by data locality.

    And yes, a side effect of this hardware would be that it is easier to raytrace on GPUs if you actually wanted to do such a thing...
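    The occupancy argument above can be put into rough numbers with a toy packing model (the quad granularity and the per-warp primitive cap are assumptions chosen for illustration, not measured G80 behaviour):

```python
import math

def warp_utilisation(frag_counts, warp=32, quad=4, max_prims=None):
    # Toy model: each primitive's fragment count is rounded up to whole
    # 2x2 quads (`quad` lanes), and optionally at most `max_prims`
    # primitives may share one warp (a G80-like restriction).
    # Returns (warps_used, live_fragments / total_lanes_issued).
    lanes_in_warp, prims_in_warp, warps = 0, 0, 0
    live = sum(frag_counts)
    for frags in frag_counts:
        lanes = math.ceil(frags / quad) * quad  # quad-granular footprint
        full = lanes_in_warp + lanes > warp
        capped = max_prims is not None and prims_in_warp >= max_prims
        if full or capped:                      # close warp, start another
            warps += 1
            lanes_in_warp, prims_in_warp = 0, 0
        lanes_in_warp += lanes
        prims_in_warp += 1
    if lanes_in_warp:
        warps += 1
    return warps, live / (warps * warp)

# Eight 1-fragment points with a 4-primitives-per-warp cap: 2 warps at
# 12.5% utilisation.  Remove the cap and it's 1 warp at 25% - still far
# from the 100% a 32-pixel packer would reach.
```

    Even the uncapped case is throttled by the quad rounding, which is exactly why micro-polygon workloads punish today's packing schemes.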
     
  18. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    NV4x etc. could apparently aggregate up to ~20 triangles into a batch. On those older GPUs, of course, a batch was considerably bigger (hundreds of fragments).

    So, are you saying that G80 can only pack 2 triangles (as a quad) at most into a fragment batch? If so, that's a hardware limitation. I wouldn't expect Larrabee to be stuck with such a restriction. It's certainly not a restriction defined by D3D.

    So, I'm still struggling to understand the justification for a fixed function unit for packing in Larrabee.

    Yeah with CUDA there's no access to the stencil/Z-buffer hardware. Oh well eventually there prolly won't be any anyway...

    Jawed
     
  19. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    All modern GPUs have (more or less effective) coalescing units that can pack fragments generated by different primitives into the same quad when possible (i.e. when the interpolants still make sense..)
    Actually I got the impression that ATI has always been better than NVIDIA at this coalescing game, though I don't know what improvements G8x has, if any, over G7x from this standpoint.
     
  20. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    :lol: I've long thought ATI was crap at small triangles:

    http://forum.beyond3d.com/showpost.php?p=596933&postcount=337

    while NV4x was quite happy:

    http://forum.beyond3d.com/showpost.php?p=597388&postcount=345

    But, I admit I've got no ideas about G80 and R600.

    Jawed
     