Larrabee's Rasterisation Focus Confirmed

Discussion in 'Rendering Technology and APIs' started by B3D News, Apr 24, 2008.

  1. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    Larrabee will have plenty of work to do to fill its ALUs while it's waiting for memory accesses. Seemingly the only fixed function hardware it'll have is texture addressing or texture addressing+filtering. Everything else will be running as code.

    And as I said earlier, "everything else" should swamp the ALU code that a developer writes for a GPU, so there's no need for a large number of threads.

    Well, if you think of the register file in a GPU as a cache, then they seem much closer. The problem GPUs have is that they have so many threads that the per-thread register-file space is seriously constrained. Admittedly, this doesn't hurt graphics as much as it hurts GPGPU, but GPUs are in what appears to be a losing war unless they can freely allocate registers as though they are memory locations.

    R600 treats its register file as a cache against memory - though normally the entire per-thread allocation of registers stays on die. It uses the read/write cache and DMAs controlled by the sequencer to manage register file swaps to/from memory, doing so without incurring ALU stalls as long as there are enough threads.

    As far as I can tell both G80 and R600 actually have ALU instructions for reading/writing video memory using absolute (per context) and indexed addressing. R600 appears to cache those accesses (not going to work well if they're at all random though), while G80 doesn't. G80 has the parallel data cache for sharing data amongst threads, obviating the use of video memory. But, like the register file, this quickly runs out.

    Jawed
     
  2. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,057
    Likes Received:
    3,114
    Location:
    New York
If that's the case I fail to see how it's going to be competitive. It will be hiding memory latency by doing non-trivial work that GPUs get "for free", so its real throughput will be far lower than theoretical numbers suggest.
     
  3. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    The theoretical numbers being ~2TFLOPs and 150GB/s, I believe.

    http://forum.beyond3d.com/showpost.php?p=1156120&postcount=82

    Bear in mind that in a GPU there's always some part of fixed function hardware that's going under-utilised (or idling). In Larrabee there'll be practically none. This is a repeat of the old "unified shader" argument, which justifies unified shaders in preference to discrete vertex and pixel shader units due to load-balancing and utilisation. No matter how many threads the discrete GPU throws at the problem (before running out of die space) it'll average worse performance per unit than the unified GPU.

    Jawed
     
  4. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    The flip side to that is that there's always general hardware that's working 2-10 times harder, taking up more space, or working more often than it needs to.

    I have a soft spot for the "hurry up and wait" approach, when it makes sense.
    It's much easier to gate off a small unit that is often unused rather than a larger bulk of more general hardware that is often woken up.

    Or is it a repeat of the old "FP unit versus software emulation" argument? ;)

    I don't expect the eventual truth to be wholly of either, so the interplay will be interesting.
     
  5. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
    I think it makes sense.
    In the early days of 3d acceleration, rasterizing was pretty much all that was done. But these days we have quite sophisticated per-pixel shading, and we use lots of textures per pixel...
This means that the balance between rasterization and shading/texturing has changed completely, and the rasterization portion is not at all the bottleneck. So there is a good chance that a less efficient software solution will still be 'fast enough' with modern technology. And of course it can be reprogrammed/reused for other tasks, which could be an advantage in the other areas that Intel wants to explore: raytracing and GPGPU.

    As for the FP unit vs software emulation... Floating point itself was performance-critical, and the workloads increased as software became more advanced. Rasterization will only get less performance-critical as shading becomes more advanced, and rasterization workload is pretty much 'fixed'.
     
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    And, due to the mix of data formats that are common in graphics, 8-bit to 32-bit (per channel), the fixed function units that work on those formats normally slow down with increasing size, or grow considerably in order to retain performance. So the factor of 2-10 the FF hardware started with takes a fair hit.

    D3D is littered with creeping featurism, e.g. the whole >8-bit filtering/blending question - still not fully resolved in D3D10.1 (as int32/fp32 blending isn't required). The funny thing is, we're expecting GPUs to go to a fully programmable OM, at which point the question of supporting int32/fp32 blending is moot.

    Additionally FF units require dedicated buffering on their inputs/outputs and use their own scratchpads and buffers. None of this is necessarily large, but the fragmentation of all these bits of memory begs for a single cache-based memory hierarchy. That way the capacity of units (e.g. hierarchical Z, which is currently limited to 4MP, seemingly) isn't arbitrarily limited, enforcing performance cliffs and other gotchas. Or like the rather arbitrary 1024x 32-bit elements limitation on the output of GS, or the ability to write to a maximum of 4 streams. Or having z fillrate constrained by triangle setup/rasterisation rate.

    Unlike GPUs, Larrabee isn't designed as a set of roadblocks with a long-term project plan to tear them down (in some yet to be determined order). Sure, Larrabee's road is narrower than we were hoping, but once Intel's on a roll...

    It's not just the occasional small unit though, there's a pesky little army of them.

    I do expect software rendering/pipeline to be the endgame, but the interregnum will be fascinating.

    Jawed
     
  7. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    The advantage to the "hurry up and wait" approach isn't whether it performs well enough, but whether the design runs cool enough.

    If the expectation is that only 20% of Larrabee's resources are devoted to ALU shader code, then the upshot is that a significant fraction of Larrabee's TDP and peak resources is not devoted to shading.

    In effect, we're saying that out of 24 cores, Larrabee's using the equivalent of 19.2 of them for anything but shading.
    That leaves less than 5 cores'-worth for general shading, and 120 watts used up on emulation.

    I hope the overhead for emulating GPU hardware is not as high as that estimate.
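
    The arithmetic behind that estimate can be spelled out. Every input here is a speculative number from this thread (the 24-core count, the 80% overhead fraction, the ~150W TDP), not anything Intel has published:

    ```python
    # Back-of-envelope for the overhead estimate above. All inputs are
    # this thread's speculative figures, not Intel data.
    cores = 24
    overhead_fraction = 0.8   # share of resources assumed spent emulating fixed function
    tdp_watts = 150           # posited TDP, chip only

    overhead_cores = cores * overhead_fraction      # 19.2 "cores" of emulation
    shading_cores = cores - overhead_cores          # 4.8 cores'-worth left for shading
    overhead_watts = tdp_watts * overhead_fraction  # 120 W burned on emulation

    print(overhead_cores, shading_cores, overhead_watts)  # 19.2 4.8 120.0
    ```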
     
  8. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    If the biggest GPU that man can ever build is limited to, say, 3 billion transistors and 300W (pick your numbers) then I think that argument would hold true.

    In terms of today's performance, say 9600GT, ~200GFLOPs is adequate. So about 400 GFLOPs for Larrabee in 2 years' time. Not brilliant, but hardly laughable.

    Jawed
     
    #168 Jawed, May 2, 2008
    Last edited by a moderator: May 2, 2008
  9. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
    But since Larrabee will get 'regular' x86-like ALUs alongside the SIMD units, and SMT to make use of all the units, this logic does not apply.
    Because of these 'extra' ALUs, the fixed-function operations don't have to take away any shading power from the SIMD units at all; with SMT they'll just operate in parallel, just as they do on a regular GPU with only SIMD and fixed-function units and no 'general purpose' ALUs.

    What Intel is basically doing is trading transistors used on fixed-function operations for transistors used on x86-like ALUs and SMT.
    We really cannot estimate how well this works out until we have more details on what kind of x86 subset these cores actually use, what their SIMD units will be capable of, and how everything will be tied together by the 'rendering software'.
    It's just a different approach altogether. While I doubt that the first generation of this technology will outperform 'regular' GPUs, I don't think performance and power usage will be that far apart (for the mid-range market, which Intel is currently aiming at). Given Intel's various advantages in x86 ALUs, high-clockspeed designs, and process technology, they may well compensate in other ways and actually turn the tables in their favour.

    And as we all seem to agree, the future is indeed more software and less fixed function... so Intel's design should be quite future-proof for now... The question is what nVidia is going to do.
     
  10. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    That's okay, we're expecting them to be idling most of the time anyway, so it just means they'll spend less time idling. ;)

    Inflexibility aside, dedicated storage isn't always a bad thing.
    The other side of the issue is that dedicated, or at least non-general, storage allows a much higher total count of memory ports. There are examples of SSE code limited by read/write bandwidth from the L1, in particular by the 1 or 2 cache ports serving the entire core.

    If we assume Larrabee's data cache is dual-ported, that leaves the 24-core variant with a possible maximum of 48 cache ports.

    Looking at R600: with point sampling it can produce 16 values per clock, which I'll charitably label as 16 read ports.
    If I count the ROPs as write ports, that's another 16 right there.
    And just how many local stores, special caches, and whatnot are there besides?
    They all have ports capable of generating concurrent accesses.
    In raw port count, a rather dated GPU is approaching Larrabee's count.
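
    Tallying it up under the assumptions above (a dual-ported L1 per Larrabee core is an assumption, and calling R600's samplers and ROPs "ports" is the charitable labelling just described):

    ```python
    # Rough port tally from the discussion above. Every number is an
    # assumption or a charitable labelling, not a datasheet figure.
    larrabee_cores = 24
    ports_per_l1 = 2                                 # assumed dual-ported L1 data cache
    larrabee_ports = larrabee_cores * ports_per_l1   # 48 possible concurrent accesses

    r600_tex_reads = 16    # point-sampled values per clock, labelled read "ports"
    r600_rop_writes = 16   # ROPs counted as write "ports"
    r600_ports = r600_tex_reads + r600_rop_writes    # 32, before local stores etc.

    print(larrabee_ports, r600_ports)  # 48 32
    ```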

    Larrabee's instruction cache port count will be interesting, as it could significantly weight things in x86's favor, but instruction fetch isn't a limiting factor for in-order designs, or at least hasn't been yet.

    The debate is still the same, as it is with hardware.
    Specialized memory ports are inflexible, but they are cheaper in hardware, power, and space.
    General ports are universal, but they are expensive in all three.

    If we expect to use the "good enough" argument for x86, then it can readily be applied to the other side.

    Larrabee's designed to carry several decades of other designs' roadblocks with no end in sight.
    So a GPU in 2020 might be a lot slimmer.
    That's not necessarily a bad thing.
    By that time, Larrabee could probably emulate apps from 2015, and have AVX3 on top of SSE7.

    But armies are obedient and regimented. General cores doing things as they please will be more like herding cats when it comes to gating.

    I expect generalization to push forward until it hits an area that is capable of being optimized by more specialized hardware, and this will seesaw over time.

    The transistor count limit is more flexible than the power density and overall TDP limits.
    Transistors are approaching negligible costs per unit. Heat removal and power supplies are not getting cheaper at the same rate, or even appreciably improving in effectiveness.

    Larrabee's TDP (and Intel may have quoted this for the chip only, not the card, the RAM, or the VRMs) was posited to be north of 150 watts.

    http://www.techreport.com/articles.x/14168/9

    An entire system with the 9600GT burns ~190 Watts loaded.

    Perhaps those Larrabee TDP numbers were for an exemplar at 45nm, or at least I hope so.
    Otherwise, two 9600GTs in SLI in 2008 at load are only about 50 watts worse than Larrabee in 2010 (going SLI seemed to add about 100 watts), which my future self would find a little funny.

    That wasn't the way I was using the percentage, which was a percentage of all ALUs and bandwidth, not just selected portions. It did not appear to me that there was anything more selective in the post I responded to.
    If the non-vector ALUs are free, it would mean the overhead percentage is smaller.
     
  11. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
    Well you were assuming that the shading power was mutually exclusive:
    "In effect, we're saying that out of 24 cores, Larrabee's using the equivalent of 19.2 of them for anything but shading.
    That leaves less than 5 cores'-worth for general shading, and 120 watts used up on emulation."

    I'm pointing out that this will not be the case. Shading will mostly be done on the SIMD units, but rasterization can probably mostly be done on the ALU and whatever other x86-units it inherits. Combine that with SMT, and running rasterization threads on a core will not affect its shading performance much, if at all. Nor will it mean that ALU threads will use the maximum theoretical rated power, because they are only a small portion of the entire core.

    So by the looks of it, even if it were to use '19.2 cores' for other tasks, it could still use ~24 cores for shading, because of SMT. Because technically it will have 24*4 = 96 logical cores to distribute its workload over. Intel just has to make sure that workloads are divided properly, so you get the most out of the execution units with SMT.
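
    The claim being made can be modelled in a few lines. This is a toy dual-issue model, not Larrabee's actual microarchitecture: it assumes the scalar and SIMD pipes can each issue from a different hardware thread in the same cycle, and the 1-rasteriser/3-shader thread split is invented for illustration:

    ```python
    # Toy model of the dual-issue SMT argument above: each cycle the scalar
    # pipe and the SIMD pipe each pick from a different ready thread, so a
    # rasterisation thread on the scalar pipe costs the shading threads
    # nothing. Dual issue and the thread mix are assumptions, not Intel specs.
    def issue_slots(threads, cycles):
        used = {"scalar": 0, "simd": 0}
        for _ in range(cycles):
            if "scalar" in threads:   # some thread has scalar (raster) work ready
                used["scalar"] += 1
            if "simd" in threads:     # some thread has SIMD (shader) work ready
                used["simd"] += 1
        return used

    # 1 rasterisation thread + 3 shading threads per core:
    print(issue_slots(["scalar", "simd", "simd", "simd"], 100))
    # both pipes stay fully issued: {'scalar': 100, 'simd': 100}
    ```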
     
  12. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Only if you don't count the x86 ALUs as ALU resources.
    As shaders are not pure vector code, the scalar units would be used as well, counting towards the 80%.
    If they are not involved, the overhead percentage would drop below 80%.
     
  13. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
    I'd say that depends on how they are implemented.
    If they are implemented 'vertically' over 4x4 pixel blocks, processing the elements of a vector serially with each element spread across the pixels, then they'll run nearly entirely on the SIMD unit, even if the operations are scalar... scalar or vector won't matter then, it will all run on the SIMD unit.
    This is also what I think is the most likely scenario, since it has other advantages as well.
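
    A minimal sketch of that 'vertical' layout, with each SIMD op modelled as a 16-wide lane loop (the shader maths, x * 0.5 + 0.25, is made up purely for illustration):

    ```python
    # "Vertical" SIMD over a 4x4 pixel block: each *scalar* instruction of the
    # shader executes once across all 16 pixels, so scalar shader code still
    # lives entirely on a 16-wide SIMD unit.
    quad = [i / 15.0 for i in range(16)]   # one channel of a 4x4 pixel block

    def simd_mad(vec, a, b):
        # One SIMD multiply-add: 16 lanes, one instruction's worth of work.
        return [x * a + b for x in vec]

    # The scalar shader statement "colour = colour * 0.5 + 0.25" becomes a
    # single 16-wide op instead of 16 scalar ops:
    shaded = simd_mad(quad, 0.5, 0.25)
    print(len(shaded), shaded[0])  # 16 0.25
    ```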
     
  14. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Bookkeeping, pointer math, and some memory accesses would still run on the integer pipeline.
    The scalar pipeline still plays a role.

    I've been going by the assumption that an ALU is an ALU, regardless of how it's structured.
    If we go by how you are defining things, the overhead percentage using my math would be lower than the 80%, which I said I hoped it would be.
     
  15. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    I dare say that Larrabee's cache is supposedly the key secret sauce. As far as I can tell L2 is segregated per core, so each portion of L2 has to feed just one core, and each core has 4 threads. There'll be data and instruction L1s between L2 and each core I presume, too.

    Separately, presumably, the L2s do clever stuff, e.g. texel data will be tiled across L2s much like texel data is often tiled across memory channels in GDDR.

    In Larrabee the "ROP" program and the point-sampling (vertex fetch) program are both running time-sliced on the same core and take it in turns with the cache. At most there might be one thread on the scalar pipe and another thread in the SIMD pipe each simultaneously accessing cache.

    But there'll be fewer of them, and with throughput concentrated in only a few types of unit (texture sampling, x86 cores, caches), utilisation will be higher across the board. Clearly a generalised cache system needs to be designed to cope, requiring greater set-associativity, space, and bandwidth. I'm not going to argue that. But generally speaking cache is dense and cheap - look at the gobs of it Intel is putting into current CPUs, and the performance is generally regarded as very impressive.

    x86 needs to be good enough to start competing. Intel doesn't have a third party (TSMC) trying to eke out a 50% gross margin... And, further, Intel's design cycles will be concentrated on software running on ever-scaling versions of Larrabee. After version 1, what kind of architectural revolutions will be needed? I expect Intel to concentrate on scaling it up, rather than re-inventing every 2-3 years like GPUs currently do.

    No, the software it'll run is required to emulate those roadblocks :wink:

    Meanwhile, real time graphics algorithms are champing at the bit to leave behind the regimented "fixed-function" pipeline - but are still hobbled by sometimes bizarre limitations. I agree regimentation works, e.g. GPUs neatly dodge around the concurrency problem by giving programmers none that they can write for themselves.

    I'm interested in what kinds of new, specialised hardware you think might be introduced in the future of GPUs. I'm looking beyond D3D11, because I think it's reasonable to assume that D3D11 is founded on the fixed-function mindset.

    Assuming Larrabee is 2x 9600GT in performance (or perhaps 2.5x or more?) I'm not really sure what you're getting at.

    I will say one thing though: as NVidia plays catch-up with the architectural pieces that are currently missing from G80 I think performance per watt/mm (assuming constant process) is going to get worse.

    Jawed
     
  16. Rufus

    Newcomer

    Joined:
    Oct 25, 2006
    Messages:
    246
    Likes Received:
    61
    Jawed: could you give some examples of this 80% overhead that needs to be done? I just can't think of what on earth it could be. For example, how much work (cycles * threads) does it take to rasterize a triangle into pixels, even if you're doing TBDR? How much work does it take to shade those pixels with a 100-1000 instruction long Crysis-in-DX11 shader? One of those has an order of magnitude more work than the other, and it's not the one you're indicating.

    You're basically arguing that Larrabee is going to be massively overbuilt and massively inefficient, which will cancel each other out and make a reasonable chip. I can't see how that would make any competitive sense unless Intel likes throwing fab capacity away.

    Anyway, this still doesn't answer my question of how Larrabee will handle latency hiding in the shading threads. With only 4 threads, the chance of hitting the ~200-cycle main memory latency has to be extremely small, or the 4 threads will all be stalled quite often. Every triangle you're pulling in a lot of new texture data (possibly with a bit of overlap from a previous triangle), using the data a lot, and then never touching it again.

    A normal cache with this data pattern will almost always miss on the 1st access to a line, killing performance with only 4 threads. My speculation: Larrabee needs some sort of very advanced, possibly generally programmed, prefetching. The question is how do you implement the prefetching. Hardware-based could probably work fairly well within a tile, but I doubt it would work at prefetching the next tile to be rendered's textures (which I'm guessing is how far ahead you need to be running). The other option is software prefetching, but that requires you to know the texture addresses by which time it's way too late. Maybe every thread will actually be working on 2 pipelined tiles at a time. Something like Calculate addresses/Prefetch textures for A, Prefetch for B, Shade A, Prefetch C, Shade B, etc. Problem with that is obviously any sort of dependent texture lookups.
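
    The pipelined-tiles idea speculated above can be sketched like this. `prefetch`/`shade` here are just logged stand-ins for cache hints and shader work; nothing in it is a real Larrabee API:

    ```python
    # Sketch of the software-pipelined prefetch scheme described above: while
    # one tile is being shaded, the next tile's texture addresses are computed
    # and its texels prefetched, hiding the first-access miss latency.
    def render(tiles):
        log = []
        prefetched = None
        for tile in tiles:
            log.append(f"prefetch {tile}")         # issue prefetches for this tile
            if prefetched is not None:
                log.append(f"shade {prefetched}")  # shade the previous tile while
                                                   # the new tile's texels arrive
            prefetched = tile
        if prefetched is not None:
            log.append(f"shade {prefetched}")      # drain the pipeline
        return log

    print(render(["A", "B", "C"]))
    # ['prefetch A', 'prefetch B', 'shade A', 'prefetch C', 'shade B', 'shade C']
    ```

    As noted above, dependent texture lookups break this schedule, since their addresses aren't known until mid-shade.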
     
    #176 Rufus, May 2, 2008
    Last edited by a moderator: May 2, 2008
  17. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Are you saying the L2s can be accessed concurrently with the L1s?
    Otherwise, the L1 port count is the limit.

    I'm trying to find a rough number for the number of simultaneous accesses to storage a chip can make.
    Since so much high-throughput code runs into L/S limitations, I'm trying to find some way to normalize the various kinds of ports available.

    And longer word lines, more TLBs, more coherency updates, a larger minimum capacity to capture wider workload ranges, a single (possibly suboptimal) cache line length, more fully-featured load-store units on the critical path (even if 90% of the time they don't use most of their capabilities), more potential simultaneous memory exceptions, fewer cache ports meaning fewer simultaneous accesses, etc.

    I've gone over in one of the other Larrabee threads that I hope Intel's put some serious thought into making their cache control more flexible, which would at least help mitigate some of the coherency issues.

    I guess that kind of depends on what inadequacies we find when it's released.
    What kind of microarchitectural revolution is needed to bridge the gap between the SIMD-based multithreaded execution done by GPUs and the SIMD-based multithreaded execution done by Larrabee?

    Hooray for drivers.
    The underlying hardware has been rather evolutionary. Just how many gigantic dislocations do you see in GPU lines' hardware between generations?

    There's no software emulation for x86 backwards compatibility.
    Either it's stuck in the more complex decoder or in a bigger microcode ROM.
    That long history for x86 leaves a mark.

    Off the top of my head, I can't give concrete examples beyond the maintaining of at least some specialization in the units already present.
    I haven't been clear, but I believe specialization has its own place on a continuum from general to specialized to fixed.
    I'd expect that the video decode block will stick around, just because the cost is so minor.
    I'd expect a fair amount of rasterization functionality will remain just because the workload will be dominated by it for much of the future.
    I'm expecting compression hardware will remain.
    Specialized buffers will remain (can't do that with a fully generalized cache hierarchy).
    I'll put some thought into what else could be done, but I believe past history indicates we'll find something.

    Depends on what you expect Nvidia's product will be in the same market segment in two years.
    I'm not guessing it will also be a 2-slot cooler card burning something in the neighborhood of 200 watts, but I could be wrong.

    Some of the recent analysis of future processes suggests more of a premium on the watt and less on the mm, at least for TSMC.
     
  18. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
    Larrabee will not be fully x86-compatible. Intel clearly stated it will have a *subset* of x86.
    So all the irrelevant legacy crud will already be stripped from the x86 ISA.
    I suppose it's similar to what Motorola did with their DSP line (ColdFire and such: based on 68k, but stripped of archaic nonsense such as BCD support and 80-bit floating point).
     
  19. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Well good, instead of a whole pile of crap, we'll have a pile of crap with holes in it.
    ;)

    The rumors I've seen about it being a subset in the other Larrabee thread seemed to indicate it was a matter of leaving out MMX and the older SSE instructions.

    It doesn't seem like Intel's ditching x87 entirely, as the Larrabee slides said it could produce 2 DP non-SSE operations a cycle, which seems to point to x87 if Larrabee is truly a subset (aside from texturing and other specialty instructions).

    Other quirks of the ISA won't go away with a few omissions.
     
  20. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    Start here

    http://www.graphicshardware.org/previous/www_2005/presentations/moreton-presentation-gh05.pdf

    page 9. Now that's a 1:1 ALU:TEX ratio, and G71 obviously increased the pixel-shader FLOP capability (the second ALU gained MAD). Also G70/71 are pretty appalling in general terms for ALU and TEX throughput. But NVidia's only counting headline operations such as floating-point texture filtering and render-target blending, as well as the math required to set up filtering:


    Render target blending is simpler. Then there are Z tests, such as hierarchical-Z in R600, where a tile of new pixels is tested against the minimum and maximum already recorded for that tile in screen space, and the Z testing done as each sample is written to the render target during MSAA.

    Rasterisation itself is reckoned to be about 10x as efficient per unit of die space as general ALUs. But rasterisation in absolute terms is trending towards a vanishingly small portion of total frame computation. Sadly a nice presentation called "Implementing the Graphics Pipeline on a Heterogeneous Multicore", by Jiawen Chen and Jonathan Ragan-Kelley, seems to have gone AWOL behind the ACM firewall. ARGH. It consists of a mesh of compute cores + a rasterisation core :smile:
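
    For scale, the inner loop being argued over is small when written as plain code. A naive per-pixel half-space rasteriser (no hierarchical tiles, no SIMD, no fill-rule tie-breaking; triangle winding chosen so all edge functions are non-negative inside; purely illustrative) looks like this:

    ```python
    # Minimal half-space (edge-function) rasteriser of the kind a software
    # pipeline would run as ordinary code on general ALUs.
    def edge(ax, ay, bx, by, px, py):
        # Signed area of (a, b, p): non-negative when p is inside relative
        # to edge a->b, given a consistent vertex winding.
        return (bx - ax) * (py - ay) - (by - ay) * (px - ax)

    def rasterise(tri, width, height):
        (x0, y0), (x1, y1), (x2, y2) = tri
        covered = []
        for y in range(height):
            for x in range(width):
                px, py = x + 0.5, y + 0.5          # sample at pixel centres
                w0 = edge(x1, y1, x2, y2, px, py)  # one edge test per edge
                w1 = edge(x2, y2, x0, y0, px, py)
                w2 = edge(x0, y0, x1, y1, px, py)
                if w0 >= 0 and w1 >= 0 and w2 >= 0:
                    covered.append((x, y))
        return covered

    # A right triangle covering the lower-left half of an 8x8 block:
    pixels = rasterise([(0, 0), (8, 0), (0, 8)], 8, 8)
    print(len(pixels))  # 36
    ```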

    The utilisation plots here are very interesting:

    http://cag.lcs.mit.edu/commit/papers/05/streamit-graphics-slides.pdf

    :grin:

    More info here:

    http://people.csail.mit.edu/jiawen/gh05/gh05.pdf

    Yep, prefetching. R600 does it, with the rasteriser driving.

    3-D rendering texture caching scheme

    Having a fair amount of L2 cache (256KB) obviously helps.

    Larrabee, being so flexible, is going to have a ball. You can prefetch a texel as soon as you know the coordinates of a fragment, i.e. during rasterisation. See slides 65 onwards:

    http://www-csl.csres.utexas.edu/use...ics_Arch_Tutorial_Micro2004_BillMarkParts.pdf

    Strangely enough in my travels through a couple of dozen PDFs I found a note suggesting that Larrabee's only fixed function hardware is a rasteriser. Not sure if I believe that...

    Jawed
     