LRB - ditching x86?

Discussion in 'Architecture and Products' started by rpg.314, Jun 9, 2009.

  1. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    Yep I've read it, but they aren't exactly hitting a real-time implementation, although that's not their focus. That said, it's hard to argue for the "efficiency" of their implementation for real-time rendering since they're burning a hell of a lot of hardware transistors for "seconds per frame" rates at best. Again, this isn't their focus, but they certainly don't attempt to prove anything about the efficiency of the implementation on GT200/CUDA vs. the ideal.

    This paper is more relevant for real-time, data-parallel REYES implementations, I think, and it suggests that an efficient REYES implementation may benefit from some fixed-function hardware, as just the rasterization part would take around 11 "Larrabee units" at 1080p-ish with 4x MSAA.

    No. The "subroutines" support is syntactic suger that requires you to declare all your possible permutations up front when compiling the shader.
     
    #41 Andrew Lauritzen, Jun 10, 2009
    Last edited by a moderator: Jun 10, 2009
  2. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Let's see. Suppose your 32 cores of lrb map to the 30 MPs of GT200. Now each 256k L2 cache of lrb maps to 16k shared memory. The cooperative fibers in Intel speak become warps. If I read the lrb SIGGRAPH paper correctly, which step is it that is not implementable efficiently on GT200? In rasterization, some lrb-specific instructions would be needed to make it just as efficient on GT200, but that is a minor detail when we are comparing whole architectures.

    You'll have smaller tiles as the shared memory is smaller compared to the cache. And yes you should be able to read your render targets in this hypothetical rasterizer. :wink:

    Regarding function pointers, polymorphism etc.? Well, my question was about workloads, and not programming constructs.
     
  3. Cypher

    Newcomer

    Joined:
    Jun 28, 2005
    Messages:
    85
    Likes Received:
    1
    The part where you're using more than 16KB of memory on a given task (well, on a set of 16 simultaneous tasks), e.g. for rasterization, working on a tile larger than 32x32.
     
  4. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    The whole rendering pipeline is organized differently and uses a lot more hardware implementation on GT200. Like I said, it'd be cool if you could implement an efficient LRB-like tile-based binning hierarchical rasterizer on GT200 in CUDA, but I doubt it'd be real-time for even a moderately complex scene... again, I'd love to be proven wrong though :) Even just on the cache issue though, I don't see how you could pull it off very well on GT200... you could barely hold a single 32x32 tile w/ a single 32-bit render target and no MSAA in local memory, and the norm is actually multiple, wider render targets nowadays meaning that you'd be down to binning, what, 16x16 tiles at best? At that level you'd be stressing the efficiency of the rasterization algorithm even with small triangles.

    Also note that there's a pile of LRBni instructions that are pretty useful for this purpose (such as the ability to efficiently do "horizontal"-style SIMD operations and pack/unpack masked lanes) that don't exist in CUDA.
     
    #44 Andrew Lauritzen, Jun 11, 2009
    Last edited by a moderator: Jun 11, 2009
  5. crystall

    Newcomer

    Joined:
    Jul 15, 2004
    Messages:
    149
    Likes Received:
    1
    Location:
    Amsterdam
    Not really; after a ridiculous number of iterations SSEx remains terribly non-orthogonal. Heck, there's a lot of stuff which was in AltiVec in '99 which is not yet in SSEx, and instead we got all kinds of horizontal operations which are useless except for a couple of applications which end up in benchmark suites.
     
  6. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Small tiles may be inefficient, I agree. But on DX11 hardware you have a minimum of 32K shared memory, so a 32x32 render tile can fit with 8 floats/pixel. If you use 4 floats for colour and, say, 4 floats for depth, you can do 4xMSAA, or with half of that for depth, even 8x. For horizontal ops you'll need extra support from hardware, I agree, or perhaps the shared registers in RV770 can help. But with 64K shared mem you could do horizontal ops using shared mem, or maybe with a 32x24 tile it would be doable in 32K too.
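    Just to spell out the arithmetic (nothing assumed beyond the numbers above):

    Code:
    #include <stdio.h>

    /* Footprint of a render tile kept in shared memory, using the figures
       above: 8 x 32-bit values per pixel (4 for colour, 4 for depth). */
    int main(void)
    {
        const int values_per_pixel = 8;                     /* 4 colour + 4 depth */
        const int bytes_32x32 = 32 * 32 * values_per_pixel * 4;
        const int bytes_32x24 = 32 * 24 * values_per_pixel * 4;
        printf("32x32 tile: %d KB\n", bytes_32x32 / 1024);  /* 32 KB -> needs the 32K minimum */
        printf("32x24 tile: %d KB\n", bytes_32x24 / 1024);  /* 24 KB -> leaves headroom in 32K */
        return 0;
    }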
     
  7. dkanter

    Regular

    Joined:
    Jan 19, 2008
    Messages:
    360
    Likes Received:
    20
    Fundamentally, any algorithm which uses indirection is going to do vastly better on LRB than a GPU with no caches.

    Complex data structures that use indirection do horribly on GPUs. So if you want to traverse a doubly linked list - best use LRB. Or a b-tree, or hash table, etc. etc.

    So your question boils down to: "What workloads use indirect data structures or algorithms?"

    Databases. Filesystems. Route planning. Network traffic analysis. Even stuff like Finite Element Analysis can use complicated data structures that would do far better on LRB than the alternatives.
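    For illustration, a minimal C sketch of the kind of pointer chasing being described (types and names are made up). Every load depends on the result of the previous one, so without a cache each hop is a full round trip to memory, and there's no coalescing to be had:

    Code:
    #include <stddef.h>

    struct node {
        struct node *next;   /* the indirection: where the next element lives
                                is only known after this load completes */
        int key;
    };

    /* Walk the list and count matching keys: a serial chain of dependent loads. */
    int count_matches(const struct node *head, int key)
    {
        int n = 0;
        for (const struct node *p = head; p != NULL; p = p->next)
            if (p->key == key)
                n++;
        return n;
    }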


    Additionally, there are all the benefits of using a well-known, universal instruction set architecture. x86 is really the first-class citizen for most software, not just Windows.

    And CUDA - don't get me started on proprietary languages. It was learned a long time ago that proprietary languages are a bad idea (TM). It's just as true today as it was then.

    DK
     
  8. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,850
    Likes Received:
    285
    Location:
    Taiwan
    I think that depends. If the data structure is larger than the small cache on LRB, it's probably not going to be much better than on a GPU. Actually, if you can get the number of threads large enough, GPUs are not that susceptible to memory latency. Of course, in terms of ease of programming, caches are better than software-managed scratch pad memory.
     
  9. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,771
    Likes Received:
    905
    Location:
    New York
    Ummm why are we comparing Larrabee (slideware at that) to 3 year old GPU technology? Also, is it accurate to say that most of the arguments for Larrabee seem to revolve around the larger cache size and not the programming model per se?
     
  10. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Complex data structures that use indirection do horribly on GPUs. So if you want to traverse a doubly linked list - best use LRB. Or a b-tree, or hash table, etc. etc.

    Assuming that the workload scales to 32 cores x 4 hardware threads per core in the first place, I'd have thought that it would also scale to, say, 1024 threads, which GPUs are good at. :wink: Or are you referring to using lrb serially, but leveraging its O(10 MB) L2 cache?

    I disagree on this particular point. What is the advantage of writing C for x86 over writing C for PPC, arm, sparc, mips etc.?
     
  11. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    Actually you need to leave a chunk of room - probably at least half - for streaming through vertex data, coverage, etc. and other local data structures that need to sit there.

    Anyways I'd definitely be interested if someone wanted to try and implement that on GT200/R770 and see how it works out, but honestly I'd be surprised if it was at all efficient/fast.
     
  12. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,850
    Likes Received:
    285
    Location:
    Taiwan
    I think these two deserve to be discussed separately: the hardware-managed cache, and the programming model.

    If we compare Larrabee and current GPUs (from both ATI and NVIDIA), these are the two major differences. GPUs do not have a hardware-managed cache. Of course, they do have texture caches, but those are read-only, and the same goes for the constant cache. On the other hand, Larrabee has a CPU-style cache, and it's coherent, which makes atomic/locking operations much more efficient. This can be very important for some applications.

    The other difference is the programming model. Larrabee can support two models: the first is similar to a GPU's, through LRBni, i.e. SIMD with gather/scatter, or the so-called "many threads" model. The second is the more traditional SMP style, i.e. just using Larrabee as a multi-core CPU. Of course, to utilize most of Larrabee's power you need to use the first model, but the point is that if your problem is not suitable for the first model, you still have the second. You can't say the same for current GPUs.

    If we are going for traditional SMP-style multi-threaded programming, then I agree that cache is the way to go; you really can't expect to hide a good amount of memory latency with just that number of threads. But since current GPUs can't do this style at all, I think the advantage here is clearly in favor of Larrabee, although some may argue that using this model on Larrabee is probably not going to be better than just using a normal CPU.

    However, if we go for the "many threads" model, or the vector model, the benefit of a hardware-managed cache is not that clear. The idea behind the vector model (and the old-style vector computers) is that using a large vector allows you to hide memory latency, so you don't need a cache. However, to be able to hide the latency well, you need a relatively nice memory access pattern. For example, even with gather/scatter support, if your vector loads data from a different memory location for each element, you are not going to get good performance. But even with a cache, I don't see how you would get good performance from that either.

    Of course, if your data structure happens to fit inside the cache, that would be quite helpful. However, in real-world applications this is rare, and there are a few different possible situations. It's possible that your data access pattern is very cache friendly and the data fits into the cache nicely; then the cache wins hands down. Another possibility is that the data access pattern is very random and data dependent, so it's almost impossible to do anything about it; then the cache is not helpful at all. The third possibility is that you'll need to do some "blockification" to make your data access more cache friendly (generally this has to take the size of the cache into consideration). This is probably the most common situation (for vector-friendly code), and in this case it's almost always possible to "blockify" the data access pattern so that a software-managed scratch pad can handle it well enough.
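    A minimal sketch of that "blockification" pattern, with made-up names and a made-up block size chosen to fit whatever local store or cache is available: stage a block locally, do all the work on local data, then write it back.

    Code:
    #define BLOCK 4096   /* elements per block; pick this to fit the local store / cache */

    /* Squares an array in cache/scratchpad-sized blocks. On Larrabee the "local"
       buffer would simply stay cache-resident; on a GPU/Cell-style scratchpad the
       copy-in/copy-out loops become explicit transfers. */
    void process_blocked(const float *in, float *out, int n)
    {
        float local[BLOCK];
        for (int base = 0; base < n; base += BLOCK) {
            int len = (n - base < BLOCK) ? (n - base) : BLOCK;
            for (int i = 0; i < len; i++)        /* copy-in */
                local[i] = in[base + i];
            for (int i = 0; i < len; i++)        /* compute: all accesses are local */
                local[i] = local[i] * local[i];
            for (int i = 0; i < len; i++)        /* copy-out */
                out[base + i] = local[i];
        }
    }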

    Of course, it's still possible that future GPUs may converge with Larrabee a bit. For example, a future GPU may have a few scalar processing units with nearly full CPU functionality (and maybe with caches!) to control its vector units.
     
  13. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    This was a very good post here, pcchen.

    Here, I think, lies the crucial point. LRB has far more hardware threads than anything built for mainstream machines before: 32 x 4 = 128, by my estimate. If an application scales to that many threads, then there seems little doubt that it will scale to O(10^3) threads, where you are back to the many-threads model.

    And when your app scales to O(10^3) threads, you lose the benefits of cache.
     
  14. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    Not necessarily. The key to efficient parallel programming is expressing your algorithm with "just as much parallelism as you absolutely have to" to keep all of the processors busy. Efficient parallel prefix sum implementations are a great example of this: you block things up the absolute minimum amount needed to keep your ALUs busy, and then run the serial algorithm in core. Being forced to use more threads => less efficiency in these cases, so hardware on which you can use hundreds of threads will be more efficient than hardware on which you need thousands. The key to these algorithms is using computed data immediately once it is available (i.e. hiding as little latency as you can get away with), and splitting these things up wider actually increases the total amount of work that you have to do.
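    To make the prefix-sum example concrete, here is a rough sketch of that structure in plain C (written serially; in practice each block would go to its own core or hardware thread, and the names are purely illustrative). Each block runs the ordinary serial scan, and splitting into more blocks than you strictly need just adds more combine work in phases 2 and 3.

    Code:
    /* Blocked exclusive prefix sum: per-block serial scan, scan of block totals,
       then a per-block fix-up pass. */
    void scan_blocked(const int *in, int *out, int n, int nblocks)
    {
        int block = (n + nblocks - 1) / nblocks;   /* elements per block */
        int totals[nblocks];                       /* per-block sums (C99 VLA) */

        /* Phase 1: each block does a plain serial exclusive scan. */
        for (int b = 0; b < nblocks; b++) {
            int sum = 0;
            int end = (b + 1) * block < n ? (b + 1) * block : n;
            for (int i = b * block; i < end; i++) { out[i] = sum; sum += in[i]; }
            totals[b] = sum;
        }
        /* Phase 2: tiny serial scan over the block totals. */
        int offset = 0;
        for (int b = 0; b < nblocks; b++) { int t = totals[b]; totals[b] = offset; offset += t; }

        /* Phase 3: each block adds its starting offset to its local results. */
        for (int b = 0; b < nblocks; b++) {
            int end = (b + 1) * block < n ? (b + 1) * block : n;
            for (int i = b * block; i < end; i++) out[i] += totals[b];
        }
    }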

    Similarly there are big advantages in dynamic branching when using fewer threads.

    I'm also not convinced that's true. Caches tend to "soften the edges" on performance cliffs, and programming for GPUs is a literal minefield of them (see the "optimization space pruning in CUDA" paper from last year's HPC literature IIRC for instance).

    Also memory accesses can be coherent, but unpredictable. i.e. it's not abnormal to have an algorithm for which the memory access pattern is data-dependent (and thus you can't explicitly block up/prefetch data into local memory), but extremely coherent. In these cases caches are close to ideal, and the best you can do on GPUs/Cell/etc. is implement a software cache, which is always outperformed by a hardware implementation.
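    For what it's worth, a software cache ends up looking roughly like this (a bare-bones direct-mapped sketch, all names and sizes made up). The tag check that every access pays for in instructions here is exactly what a hardware cache does for free:

    Code:
    #define SC_LINES      64
    #define SC_LINE_WORDS 16

    struct sw_cache {
        unsigned tags[SC_LINES];                /* line index held in each slot; init to ~0u */
        float    data[SC_LINES][SC_LINE_WORDS];
    };

    /* Read one word (by word index) through the software cache backed by 'mem'. */
    float sc_read(struct sw_cache *c, const float *mem, unsigned addr)
    {
        unsigned line = addr / SC_LINE_WORDS;
        unsigned slot = line % SC_LINES;
        if (c->tags[slot] != line) {            /* miss: refill the whole line */
            for (int i = 0; i < SC_LINE_WORDS; i++)
                c->data[slot][i] = mem[line * SC_LINE_WORDS + i];
            c->tags[slot] = line;
        }
        return c->data[slot][addr % SC_LINE_WORDS];
    }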

    Algorithms that have fairly low contention, but some data-dependent sharing are another example where local memories fail and global cache coherency is a big win. OpenCL/ComputeShader global atomics help these cases somewhat, but don't always provide the nicest performance paths in low-coherence cases compared to hardware caches (which don't need to go to global memory to figure out that there is no contention for instance).

    So while I agree that there are trade-offs, and I personally have enjoyed my experience writing high-performance GPU computing code (I've done quite a bit), I think the advantages of typical CPU-like caches and threading shouldn't be trivially dismissed.
     
  15. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    This isn't and hasn't been true. The old-style vector machines used SRAM as main memory, and modern ones use large caches as well. Fundamentally, caches only provide a performance enhancement and a reduction in bandwidth requirements; they are applicable to pretty much any hardware architecture.

    Unfortunately, with a software-managed data store you have significant overhead in managing it, both in code AND, more fundamentally, in the programming model.
     
  16. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    They are not just programming constructs. You can use vtables in C too to achieve polymorphism, but only on a CPU, not a GPU. Similarly for the call stack. Recursion and deep calls are not just a gimmick of certain programming languages; they really enable the use of certain algorithms. And sure, there are probably ways to process the same 'workload' with a different algorithm that doesn't require function pointers or a call stack, but that algorithm is very likely not to be optimal. So Larrabee gives the programmer full freedom to do what he intended to do.
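    For the record, this is all that "vtables in C" amounts to: a table of function pointers plus an indirect call through it (the names below are just for illustration).

    Code:
    struct entity;

    struct entity_ops {                      /* the hand-rolled vtable */
        void (*update)(struct entity *self);
    };

    struct entity {
        const struct entity_ops *ops;        /* each entity carries a pointer to its table */
        int state;
    };

    /* One possible entity "class": its behaviour lives behind the table. */
    static void soldier_update(struct entity *self) { self->state++; }
    static const struct entity_ops soldier_ops = { soldier_update };

    /* Polymorphic dispatch: every element may run completely different code. */
    void update_all(struct entity *const *list, int n)
    {
        for (int i = 0; i < n; i++)
            list[i]->ops->update(list[i]);   /* indirect call through a function pointer */
    }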

    Talking about specific workloads is a bit pointless in my opinion. It's designed to be able to run anything, and only time will tell exactly what it excels at. Nobody has really programmed anything like this before, but it enables entirely new possibilities. Anyway one example (out of hundreds) would be artificial intelligence. Function pointers and call stacks allow each entity to run fully independent of each other, with arbitrarily complex algorithms. Take your CPU code and run it on Larrabee without any changes, for hundreds of unique entities. With all due respect that's a whole lot more exciting than the Froblins demo in my opinion.

    So function pointers and call stacks are something to get very excited about even though we don't know the exact limits of the possibilities yet. Unfortunately for Intel I expect that GPUs will support them too once it starts to get traction. But they'll likely end up being a strong third player.
     
  17. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,850
    Likes Received:
    285
    Location:
    Taiwan
    To my understanding only the earliest Crays used SRAM. Later Crays don't, and of course that makes them less efficient (but not necessarily slower). More recent vector computers such as the NEC SX series only have caches for their scalar processors, not the vector units. The multi-threaded Tera MTA (later merged with Cray) also has no data cache, only an instruction cache.
     
  18. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    Andrew, thanks for that REYES paper link, and BTW I'd also like to see NVidia (and AMD) provide something like LRB's COMPRESS/EXPAND opcodes which would work out of shared memory (fast scan)!
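    For reference, COMPRESS boils down to something like the following, written out serially (illustrative only); the output index is exactly a running prefix sum over the mask, which is why a fast scan out of shared memory would effectively buy you the same thing.

    Code:
    /* Pack the elements whose mask bit is set into a contiguous output.
       Returns the number of packed elements. */
    int compress(const float *in, const int *mask, float *out, int n)
    {
        int k = 0;                 /* k is the running prefix sum of mask[0..i-1] */
        for (int i = 0; i < n; i++)
            if (mask[i])
                out[k++] = in[i];
        return k;
    }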

    pcchen, "On the other hand, Larrabee has CPU style cache, and it's coherent, which makes atomic/locking operations much more efficient." I don't think this is correct. First, "shared memory" atomics (or some atomics between the 4 hyperthread's local only data) should be roughly the same (assuming a L1 hit on LRB). As for "global atomics" (atomics with data shared between cores), under cache line sharing with atomic operations, the CPU stalls since the Atomic ALU operation is done on the CPU. Only having 4-way hyperthreading doesn't help much here. Under "cache line" sharing with atomic operations the GPU, ALU units keep processing and just hide the latency like any other latent operation (global Atomic ALU NOT done on the GPU's "CPUs"). This is a very important difference.
     
  19. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,402
    Likes Received:
    4,111
    Location:
    Well within 3d
    The behavior of global atomics also depends on where the lock's cache line is.

    Depending on the workload, it is possible that there is a high probability that the next core to access the line is the one that had it previously. I don't know if this is the case for graphics, but other workloads sometimes demonstrate this behavior.

    If that line is still in cache and still exclusive, the operation can continue without delay.
     
  20. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    Yes clearly lines tagged as exclusive aren't a problem, but good algorithm design (IMO) makes those cases non-atomic or "shared memory" cases.
     