LRB - ditching x86?

Andrew, on GPU REYES, you might find the RenderAnts paper interesting.
Yep I've read it, but they aren't exactly hitting a real-time implementation, although that's not their focus. That said, it's hard to argue for the "efficiency" of their implementation for real-time rendering since they're burning a hell of a lot of hardware transistors for "seconds per frame" rates at best. Again, this isn't their focus, but they certainly don't attempt to prove anything about the efficiency of the implementation on GT200/CUDA vs. the ideal.

This paper is more relevant for real-time, data-parallel REYES implementations I think, and it suggests that an efficient REYES implementation may benefit from some fixed-function hardware, as just the rasterization part would take around 11 "Larrabee units" at 1080p-ish with 4x MSAA.

Function pointers are in DX11 right?
No. The "subroutines" support is syntactic suger that requires you to declare all your possible permutations up front when compiling the shader.
 
I clearly don't understand your question... all of the things that I mentioned you can do with LRB1 and not (efficiently) with CUDA/OpenCL. Render target read is a prime example, which is basically "free" on LRB and ridiculously expensive - if it's ever even implemented - on current NVIDIA/AMD-style architectures.

Uhuh, and I can run REYES on my phone if I bothered to code it up, but it's not going to be fast or efficient. All of the rasterization-based things that I mentioned can be implemented orders of magnitude more efficiently on LRB than on competing hardware. Go implement log rasterization - or any rasterization - in CUDA for instance and let me know how fast it is...

I'm not trying to be coy or dismiss OpenCL/ComputeShader, but are you guys really trying to argue that they are currently suitable for writing - for instance - an efficient software rasterizer in? I'd love to be convinced, but I'll wait until I see it. And if someone does rewrite the graphics pipeline in an efficient, programmable way on that hardware then maybe I can finally read my render targets :)

Let's see. Suppose your 32 cores of lrb map to the 30 MPs of GT200. Now each 256k L2 cache of lrb maps to 16k of shared memory. The cooperative fibers, in Intel speak, become warps. If I read the lrb siggraph paper, which step is it that is not implementable efficiently on GT200? In rasterization, some lrb-specific instructions would be needed to get it to be just as efficient on GT200, but that is a minor detail when we are comparing whole architectures.

You'll have smaller tiles as the shared memory is smaller compared to the cache. And yes you should be able to read your render targets in this hypothetical rasterizer. ;)

Regarding function pointers, polymorphism etc.? Well, my question was about workloads, and not programming constructs.
 
which step is it that is not implementable efficiently on GT200?
The part where you're using more than 16KB of memory on a given task (well, on a set of 16 simultaneous tasks), e.g. for rasterization, working on a tile larger than 32x32.
 
If I read the lrb siggraph paper, which step is it that is not implementable efficiently on GT200?
The whole rendering pipeline is organized differently and uses a lot more hardware implementation on GT200. Like I said, it'd be cool if you could implement an efficient LRB-like tile-based binning hierarchical rasterizer on GT200 in CUDA, but I doubt it'd be real-time for even a moderately complex scene... again, I'd love to be proven wrong though :) Even just on the cache issue though, I don't see how you could pull it off very well on GT200... you could barely hold a single 32x32 tile w/ a single 32-bit render target and no MSAA in local memory, and the norm is actually multiple, wider render targets nowadays meaning that you'd be down to binning, what, 16x16 tiles at best? At that level you'd be stressing the efficiency of the rasterization algorithm even with small triangles.

Also note that there's a pile of LRBni instructions that are pretty useful for this purpose (such as the ability to efficiently do "horizontal"-style SIMD operations and pack/unpack masked lanes) that don't exist in CUDA.
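For those who haven't looked at them, the pack/compact operation amounts to something like the following scalar sketch, done across a whole vector register in a single instruction (this is just the semantics spelled out in plain C, not LRBni code):

```c
/* Scalar illustration of what a "compress"-style pack of masked lanes
 * does: keep only the elements whose mask bit is set, packed to the
 * front. LRBni-style hardware does this across a whole vector in one
 * instruction; this just spells out the semantics. */
#include <stddef.h>
#include <stdint.h>

size_t compress_u32(uint32_t *dst, const uint32_t *src,
                    const uint8_t *mask, size_t n)
{
    size_t out = 0;
    for (size_t i = 0; i < n; ++i)
        if (mask[i])
            dst[out++] = src[i];   /* packed, order-preserving */
    return out;                    /* number of surviving lanes */
}
```

On current GPUs you'd typically have to emulate this with a prefix scan plus a scatter through shared memory.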
 
It's far from perfect, but I wouldn't say clueless. Everything you need for multimedia and scientific computing is there.
Not really; after a ridiculous number of iterations SSEx remains terribly non-orthogonal. Heck, there's a lot of stuff which was in AltiVec in '99 that is still not in SSEx, and instead we got all kinds of horizontal operations which are useless except for a couple of applications that end up in benchmark suites.
 
Small tiles may be inefficient, I agree. But on DX11 hardware you have a minimum of 32K shared memory, so a 32x32 render tile fits with 8 floats/pixel. If you use 4 floats for colour and, say, 4 floats for depth, you can do 4x MSAA or, with half for depth, even 8x. For horizontal ops you'll need extra support from the hardware, I agree, or perhaps the shared registers in RV770 can help. But with 64K of shared memory you could do horizontal ops using shared memory, or maybe with a 32x24 tile it would be doable with 32K too.
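For what it's worth, the arithmetic for the 4x case works out like this (a back-of-the-envelope sketch in C, assuming 32-bit colour and depth samples):

```c
/* Back-of-the-envelope tile footprint, assuming a 32x32 tile and
 * 32-bit colour/depth samples as in the post above. Numbers are
 * illustrative only. */
#include <stdio.h>

int main(void)
{
    const int tile_w = 32, tile_h = 32;
    const int msaa = 4;                 /* samples per pixel        */
    const int colour_bytes = 4;         /* one 32-bit colour sample */
    const int depth_bytes  = 4;         /* one 32-bit depth sample  */

    int per_pixel  = msaa * (colour_bytes + depth_bytes);
    int tile_bytes = tile_w * tile_h * per_pixel;

    printf("%d bytes/pixel, %d KB per tile\n", per_pixel, tile_bytes / 1024);
    /* 4x MSAA -> 32 bytes/pixel -> 32 KB: the 32K shared memory
       minimum is filled exactly, with nothing left over for
       vertex/coverage data or other local structures. */
    return 0;
}
```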
 
Fundamentally, any algorithm which uses indirection is going to do vastly better on LRB than a GPU with no caches.

Complex data structures that use indirection do horribly on GPUs. So if you want to traverse a doubly linked list - best use LRB. Or a b-tree, or hash table, etc. etc.

So your question boils down to: "What workloads use indirect data structures or algorithms?"

Databases. Filesystems. Route planning. Network traffic analysis. Even stuff like Finite Element Analysis can use complicated data structures that would do far better on LRB than the alternatives.
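To make "indirection" concrete, the inner loops of these workloads boil down to something like this generic sketch (plain C, nothing platform-specific):

```c
/* Generic sketch of the pointer-chasing that dominates linked-list /
 * tree / hash-table style workloads: each step depends on a load whose
 * address came from the previous load, so there is no coalesced or
 * predictable access pattern to exploit. */
#include <stddef.h>

struct node {
    int          key;
    struct node *next;
};

int list_contains(const struct node *head, int key)
{
    for (const struct node *n = head; n != NULL; n = n->next)
        if (n->key == key)      /* every iteration waits on n->next */
            return 1;
    return 0;
}
```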


Additionally, there are all the benefits of using a well-known, universal instruction set architecture. x86 is really the first-class citizen for most software. Not just Windows, but most software in general.

And CUDA - don't get me started on proprietary languages. It was learned a long time ago that proprietary languages are a bad idea (TM). It's just as true today as it was then.

DK
 
Fundamentally, any algorithm which uses indirection is going to do vastly better on LRB than a GPU with no caches.

I think that depends. If the data structure is larger than the small cache on LRB, it's probably not going to be much better than on a GPU. Actually, if you can get the number of threads high enough, GPUs are not that susceptible to memory latency. Of course, in terms of ease of programming, caches are better than software-managed scratchpad memory.
 
Ummm, why are we comparing Larrabee (slideware at that) to 3-year-old GPU technology? Also, is it accurate to say that most of the arguments for Larrabee seem to revolve around the larger cache size and not the programming model per se?
 
Fundamentally, any algorithm which uses indirection is going to do vastly better on LRB than a GPU with no caches.

Complex data structures that use indirection do horribly on GPUs. So if you want to traverse a doubly linked list - best use LRB. Or a b-tree, or hash table, etc. etc.

Assuming that the workload scales to 32 cores x 4 hardware threads per core in the first place, I'd have thought that it would also scale to, say, 1024 threads, which GPUs are good at. ;) Or are you referring to using lrb serially, but leveraging its O(10 MB) L2 cache?

Additionally, there are all the benefits of using a well-known, universal instruction set architecture. x86 is really the first-class citizen for most software. Not just Windows, but most software in general.

I disagree on this particular point. What is the advantage of writing C for x86 over writing C for PPC, ARM, SPARC, MIPS, etc.?
 
So a 32x32 render tile fits with 8 floats/pixel. If you use 4 floats for colour and, say, 4 floats for depth, you can do 4x MSAA or, with half for depth, even 8x.
Actually you need to leave a chunk of room - probably at least half - for streaming through vertex data, coverage, etc. and other local data structures that need to sit there.

Anyways I'd definitely be interested if someone wanted to try and implement that on GT200/R770 and see how it works out, but honestly I'd be surprised if it was at all efficient/fast.
 
Also, is it accurate to say that most of the arguments for Larrabee seem to revolve around the larger cache size and not the programming model per se?

I think these two deserve to be discussed separately: the hardware-managed cache, and the programming model.

If we compare Larrabee and current GPUs (from both ATI and NVIDIA), these are the two major differences. GPUs do not have a hardware-managed cache. Of course, they do have texture caches, but those are read-only. The same goes for the constant cache. On the other hand, Larrabee has a CPU-style cache, and it's coherent, which makes atomic/locking operations much more efficient. This can be very important for some applications.

Another difference is the programming model. Larrabee can support two models: the first is similar to a GPU through LRBni, i.e. SIMD with gather/scatter, or the so-called "many threads" model. The second is the more traditional SMP style, i.e. just use Larrabee as a multi-core CPU. Of course, to get the most out of Larrabee's power you need to use the first model, but the point is that if your problem is not suitable for the first model, you still have the second. You can't say the same for current GPUs.

If we are going for traditional SMP-style multi-threaded programming, then I agree that a cache is the way to go. You really can't expect to hide a good amount of memory latency with just this number of threads. But since current GPUs can't do that at all, I think the advantage here is clearly in favor of Larrabee. Although some may argue that using this model on Larrabee is probably not going to be better than just using a normal CPU.

However, if we go for the "many threads" model, or the vector model, the benefit of a hardware-managed cache is not that clear. The idea behind the vector model (and the old-style vector computers) is that using a large vector allows you to hide memory latency, so you don't need a cache. However, to be able to hide the latency well, you need a relatively nice memory access pattern. For example, even with gather/scatter support, if your vector loads data from a different memory location for each element, you are not going to get good performance. But even with a cache, I don't see how you could get good performance from that either.

Of course, if your data structure happens to fit inside the cache, it would be quite helpful. However, in real-world applications this is very rare. There are a few different possible situations. It's possible that your data access pattern is very cache friendly and the data fits into the cache nicely; then the cache wins hands down. Another possibility is that the data access pattern is very random and data dependent, so it's almost impossible to do anything about it; then the cache is not helpful at all. The third possibility is that you'll need to do some "blockification" to make your data access more cache friendly (generally this has to take the size of the cache into consideration). This is probably the most common situation (for vector-friendly codes). In this case, it's almost always possible to "blockify" the data access pattern so that a software-managed scratchpad can handle it well enough.
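As a concrete illustration of "blockification", a tiled transpose is the classic example (a generic sketch; the block size is just a placeholder, not tuned for any particular chip):

```c
/* Minimal sketch of "blockification": a tiled matrix transpose that
 * touches the source and destination in cache-sized blocks instead of
 * striding through whole rows/columns at a time. BLOCK is illustrative:
 * a 32x32 float tile is 4 KB, small enough for a cache or scratchpad. */
#include <stddef.h>

#define BLOCK 32

void transpose_blocked(float *dst, const float *src, size_t n)
{
    for (size_t bi = 0; bi < n; bi += BLOCK)
        for (size_t bj = 0; bj < n; bj += BLOCK)
            /* Work entirely within one BLOCK x BLOCK tile. */
            for (size_t i = bi; i < bi + BLOCK && i < n; ++i)
                for (size_t j = bj; j < bj + BLOCK && j < n; ++j)
                    dst[j * n + i] = src[i * n + j];
}
```

The same restructuring works whether the tile lands in a hardware cache automatically or gets copied into a scratchpad explicitly, which is why the scratchpad is usually good enough once this work has been done.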

Of course, it's still possible that future GPU may converge with Larrabee a bit. For example, future GPU may have a few scalar processing units with nearly full CPU functions (and maybe with cache!) to control its vector units.
 
This was a very good post here, pcchen.

I think these two deserve to be discussed separately: the hardware-managed cache, and the programming model.

If we compare Larrabee and current GPUs (from both ATI and NVIDIA), these are the two major differences. GPUs do not have a hardware-managed cache. Of course, they do have texture caches, but those are read-only. The same goes for the constant cache. On the other hand, Larrabee has a CPU-style cache, and it's coherent, which makes atomic/locking operations much more efficient. This can be very important for some applications.

Another difference is the programming model. Larrabee can support two models: the first is similar to a GPU through LRBni, i.e. SIMD with gather/scatter, or the so-called "many threads" model. The second is the more traditional SMP style, i.e. just use Larrabee as a multi-core CPU. Of course, to get the most out of Larrabee's power you need to use the first model, but the point is that if your problem is not suitable for the first model, you still have the second. You can't say the same for current GPUs.

If we are going for traditional SMP-style multi-threaded programming, then I agree that a cache is the way to go. You really can't expect to hide a good amount of memory latency with just this number of threads. But since current GPUs can't do that at all, I think the advantage here is clearly in favor of Larrabee. Although some may argue that using this model on Larrabee is probably not going to be better than just using a normal CPU.

Here, I think, lies the crucial point. LRB has far more hardware threads than anything built earlier for mainstream machines: 32x4 = 128, by my estimate. If an application scales to this many threads, then there seems little doubt that it will scale to O(10^3) threads, where you are back to the many-threads model.

However, if we go for the "many threads" model, or the vector model, the benefit of a hardware-managed cache is not that clear. The idea behind the vector model (and the old-style vector computers) is that using a large vector allows you to hide memory latency, so you don't need a cache. However, to be able to hide the latency well, you need a relatively nice memory access pattern. For example, even with gather/scatter support, if your vector loads data from a different memory location for each element, you are not going to get good performance. But even with a cache, I don't see how you could get good performance from that either.

And when your app scales to O(10^3) threads, you lose the benefits of cache.
 
Here, I think, lies the crucial point. LRB has far more hardware threads than anything built earlier for mainstream machines: 32x4 = 128, by my estimate. If an application scales to this many threads, then there seems little doubt that it will scale to O(10^3) threads, where you are back to the many-threads model.
Not necessarily. The key to efficient parallel programming is expressing your algorithm using "just as much parallelism as you absolutely have to" to keep all of the processors busy. Efficient parallel prefix sum implementations are a great example of this: you block things up the absolute minimum amount needed to keep your ALUs busy, and then run the serial algorithm in-core. Being forced to use more threads => less efficiency in these cases, so hardware on which you can use hundreds of threads will be more efficient than hardware that needs thousands of threads. The key to these algorithms is immediately using computed data once it is available (i.e. hiding as little latency as possible), and splitting these things up wider actually increases the total amount of work that you have to do.
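Roughly the structure I mean, as a serial sketch (the per-block loops are the parts that would run one per core; P is just a placeholder for the number of workers):

```c
/* Rough sketch of a blocked scan: each of P blocks is scanned with the
 * plain serial algorithm, the P block totals get a tiny serial scan of
 * their own, and then each block adds its offset. The two per-block
 * passes are the parts that would run one-per-core in a parallel
 * version; the dispatch itself is omitted. */
#include <stdlib.h>

void prefix_sum_blocked(float *a, size_t n, size_t P)
{
    size_t block = (n + P - 1) / P;
    float *totals = calloc(P, sizeof *totals);

    /* Pass 1 (parallelizable, one block per core): serial inclusive
       scan inside each block, recording the block's total. */
    for (size_t b = 0; b < P; ++b) {
        size_t lo = b * block, hi = (lo + block < n) ? lo + block : n;
        float sum = 0.0f;
        for (size_t i = lo; i < hi; ++i)
            a[i] = (sum += a[i]);
        totals[b] = sum;
    }

    /* Tiny serial exclusive scan over the P block totals. */
    float offset = 0.0f;
    for (size_t b = 0; b < P; ++b) {
        float t = totals[b];
        totals[b] = offset;
        offset += t;
    }

    /* Pass 2 (parallelizable): each block adds its offset. */
    for (size_t b = 0; b < P; ++b) {
        size_t lo = b * block, hi = (lo + block < n) ? lo + block : n;
        for (size_t i = lo; i < hi; ++i)
            a[i] += totals[b];
    }

    free(totals);
}
```

Only P-way parallelism is needed to keep the machine busy; making P larger than that just adds more fix-up work.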

Similarly there are big advantages in dynamic branching when using fewer threads.

And when your app scales to O(10^3) threads, you lose the benefits of cache.
I'm also not convinced that's true. Caches tend to "soften the edges" on performance cliffs, and programming for GPUs is a literal minefield of them (see the "optimization space pruning in CUDA" paper from last year's HPC literature IIRC for instance).

Also memory accesses can be coherent, but unpredictable. i.e. it's not abnormal to have an algorithm for which the memory access pattern is data-dependent (and thus you can't explicitly block up/prefetch data into local memory), but extremely coherent. In these cases caches are close to ideal, and the best you can do on GPUs/Cell/etc. is implement a software cache, which is always outperformed by a hardware implementation.

Algorithms that have fairly low contention, but some data-dependent sharing are another example where local memories fail and global cache coherency is a big win. OpenCL/ComputeShader global atomics help these cases somewhat, but don't always provide the nicest performance paths in low-coherence cases compared to hardware caches (which don't need to go to global memory to figure out that there is no contention for instance).

So while I agree that there are trade-offs, and I personally have enjoyed my experience writing high-performance GPU computing code (I've done quite a bit), I think the advantages of typical CPU-like caches and threading shouldn't be trivially dismissed.
 
However, if we go for the "many threads" model, or the vector model, the benefit of a hardware-managed cache is not that clear. The idea behind the vector model (and the old-style vector computers) is that using a large vector allows you to hide memory latency, so you don't need a cache. However, to be able to hide the latency well, you need a relatively nice memory access pattern. For example, even with gather/scatter support, if your vector loads data from a different memory location for each element, you are not going to get good performance. But even with a cache, I don't see how you could get good performance from that either.

This isn't and hasn't been true. The old-style vector machines used SRAM as main memory, and modern ones use large caches as well. Fundamentally, caches just provide a performance enhancement and a reduction in bandwidth requirements; they're applicable to pretty much any hardware architecture.

Of course, if your data structure happens to fit inside the cache, it would be quite helpful. However, in real-world applications this is very rare. There are a few different possible situations. It's possible that your data access pattern is very cache friendly and the data fits into the cache nicely; then the cache wins hands down. Another possibility is that the data access pattern is very random and data dependent, so it's almost impossible to do anything about it; then the cache is not helpful at all. The third possibility is that you'll need to do some "blockification" to make your data access more cache friendly (generally this has to take the size of the cache into consideration). This is probably the most common situation (for vector-friendly codes). In this case, it's almost always possible to "blockify" the data access pattern so that a software-managed scratchpad can handle it well enough.

Unfortunately, with a software-managed data store you have a significant overhead in managing that store, both in code AND, more fundamentally, in the programming model.
 
Regarding function pointers, polymorphism etc.? Well, my question was about workloads, and not programming constructs.
They are not just programming constructs. You can use vtables in C too to achieve polymorphism, but only on a CPU, not a GPU. The same goes for the call stack. Recursion and deep call chains are not just a gimmick of certain programming languages; they really enable the use of certain algorithms. And sure, there are probably ways to process the same 'workload' with a different algorithm that doesn't require function pointers or a call stack, but that algorithm is very likely not optimal. So Larrabee gives the programmer full freedom to do what he intended to do.
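For reference, the C-with-vtables pattern I mean boils down to something like this (the type names are made up purely for illustration):

```c
/* Bare-bones sketch of polymorphism via a hand-rolled vtable in C.
 * The Entity/Wolf names are hypothetical; the point is just the
 * indirect call through a function pointer. */
#include <stdio.h>

typedef struct Entity Entity;

typedef struct {
    void (*think)(Entity *self);   /* "virtual" method */
} EntityVTable;

struct Entity {
    const EntityVTable *vt;
    float x, y;
};

static void wolf_think(Entity *self)
{
    self->x += 1.0f;               /* wander right, say */
    printf("wolf at %.1f,%.1f\n", self->x, self->y);
}

static const EntityVTable wolf_vt = { wolf_think };

int main(void)
{
    Entity e = { &wolf_vt, 0.0f, 0.0f };
    e.vt->think(&e);               /* dispatch through the vtable */
    return 0;
}
```

The indirect call through e.vt->think(&e) is exactly the part you can't express in today's shading/compute languages.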

Talking about specific workloads is a bit pointless in my opinion. It's designed to be able to run anything, and only time will tell exactly what it excels at. Nobody has really programmed anything like this before, but it enables entirely new possibilities. Anyway one example (out of hundreds) would be artificial intelligence. Function pointers and call stacks allow each entity to run fully independent of each other, with arbitrarily complex algorithms. Take your CPU code and run it on Larrabee without any changes, for hundreds of unique entities. With all due respect that's a whole lot more exciting than the Froblins demo in my opinion.

So function pointers and call stacks are something to get very excited about even though we don't know the exact limits of the possibilities yet. Unfortunately for Intel I expect that GPUs will support them too once it starts to get traction. But they'll likely end up being a strong third player.
 
This isn't and hasn't been true. The old-style vector machines used SRAM as main memory, and modern ones use large caches as well. Fundamentally, caches just provide a performance enhancement and a reduction in bandwidth requirements; they're applicable to pretty much any hardware architecture.

To my understanding, only the earliest Crays used SRAM. Later Crays don't, and of course that makes them less efficient (but not necessarily slower). More recent vector computers such as the NEC SX series only have a cache for their scalar processors, not the vector units. The multi-threaded Tera MTA (Tera later merged with Cray) also has no data cache, only an instruction cache.
 
Andrew, thanks for that REYES paper link, and BTW I'd also like to see NVidia (and AMD) provide something like LRB's COMPRESS/EXPAND opcodes which would work out of shared memory (fast scan)!

pcchen, "On the other hand, Larrabee has CPU style cache, and it's coherent, which makes atomic/locking operations much more efficient." I don't think this is correct. First, "shared memory" atomics (or some atomics between the 4 hyperthread's local only data) should be roughly the same (assuming a L1 hit on LRB). As for "global atomics" (atomics with data shared between cores), under cache line sharing with atomic operations, the CPU stalls since the Atomic ALU operation is done on the CPU. Only having 4-way hyperthreading doesn't help much here. Under "cache line" sharing with atomic operations the GPU, ALU units keep processing and just hide the latency like any other latent operation (global Atomic ALU NOT done on the GPU's "CPUs"). This is a very important difference.
 
The behavior of global atomics also depends on where the lock's cache line is.

Depending on the workload, it is possible that there is a high probability that the next core to access the line is the one that had it previously. I don't know if this is the case for graphics, but other workloads sometimes demonstrate this behavior.

If that line is still in cache and still exclusive, the operation can continue without delay.
 