LRB - ditching x86?


I don't like the idea of x86 compatibility, but yes, an ARM LRB would be great. :D But even there, there are many questions. I.e., would you like to have Jazelle, Thumb, NEON, etc.?

Well, let's say I'm totally not into this field, so you could stun us with your techie-talk. But a quick wiki visit suggests that all of these are just improvement techniques added to the ARM core over time, or improvements to existing ones, much like MMX -> SSE.

Anyway, LRBni (even SSE) outperforms all that NEON talk, not to mention the radix-4 engine in C2D. Thumb is an advanced power-saving technique through reduced instruction width, and Jazelle (or Thumb-2EE) is the only thing that would be great to see on any processor, not just ARM. And because Larrabee targets some future power-monster rigs, Thumb is obsolete from Intel's standpoint, just like when they claim great Atom performance over ARM, where something like Thumb would be really useful. And a JIT-compiler improvement (Jazelle) on the desktop/workstation... we'll probably see that from an AMD lab before we see it from Intel. (It would be great to see that Thumb-2 in some UVD power-saving technique, where we don't even need ~240 SPs @ 300 MHz for some future OHD (2160p) resolutions in MPEG-4.10, and for now ~80 SPs @ 300 MHz are more than enough for 1080p.)
Intel doesn't need to prove anything to anyone; from their standpoint, marketing will just flood the market with Larrabee.
 
Yes clearly lines tagged as exclusive aren't a problem, but good algorithm design (IMO) makes those cases non-atomic or "shared memory" cases.
In a lot of cases that isn't possible, though, without effectively writing a software cache, as I mentioned above. For instance, consider something like histogram generation with some local coherence in bin scatter (a very common case). In this example cache lines will "tend" to stay and be well-utilized on one processor/core, with the cache mechanism taking care of the bits that are incoherent. Certainly there's a tradeoff to be made in creating local histograms for each core/thread/SIMD lane, but if you're forced all the way to the granular end (as you often are on GPUs), you end up wasting a ridiculous amount of memory and bandwidth, not to mention algorithmic efficiency.

That's a pretty useful memory access pattern IMHO and applies to a lot of problems.
 
Andrew, I'm missing something here: in your example, how are you suggesting histogram generation be done on LRB?

Divide up the domain across the cores (each processes a subset of the entire data set), do a local histogram per core, then later do a parallel reduction (add) on the intermediate results, or do a colliding global atomic add? Local histograms would be "shared memory cases" and thus fast on either arch (with differences as to bank conflicts vs. cacheline hits). The final summation of the local results (non-LRB: each core doing a set of vector global atomic adds to the solution histogram) would be fast on GPUs.
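For the reduction step I'm picturing something like this (untested, CUDA-flavoured sketch just to be concrete; NUM_BINS, the names and the contiguous layout of the per-core results are all made up for illustration):

#define NUM_BINS 256

// Sum numHists per-core partial histograms (stored one after another)
// into a single result histogram; one thread per bin, no atomics needed.
__global__ void mergeHistograms(const unsigned int *partial, // numHists * NUM_BINS entries
                                unsigned int *result,
                                int numHists)
{
    int bin = blockIdx.x * blockDim.x + threadIdx.x;
    if (bin >= NUM_BINS)
        return;

    unsigned int sum = 0;
    for (int h = 0; h < numHists; ++h)        // walk the per-core partials
        sum += partial[h * NUM_BINS + bin];   // and add up this bin
    result[bin] = sum;
}

(The colliding-global-atomic-add alternative would just be each core atomicAdd()ing its local bins straight into the result instead of writing out a partial.)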

Or perhaps you are talking about a histogram which wouldn't fit in the 32KB DX11 local store size?
 
Or perhaps you are talking about a histogram which wouldn't fit in the 32KB DX11 local store size?
Yeah, exactly. Or similarly, it's a waste to split it into that many histograms - one per core, SIMD lane, or whatever - if you expect each core to only touch subsets of the histogram, as is the case when there is some coherence in the input data. It's a waste both in terms of memory bandwidth (writing out all of those separate histograms from local memory) and general processing efficiency (the "reduction" across all sub-histograms).

This is a case where there are few data dependencies and lots of coherence in the input data, but you don't know that statically, and thus you need a cache (software or hardware) to approach the ideal work efficiency for the data set.
 
Render target read. That alone is the only example you should need, and it's just the tip of the iceberg :)

Exactly! In other words programmable ROPS (or alternatively pixel shaders with RT input) is all LRB needs to have better performance/quality than anything with CUDA/OpenCL even with DX11.
For example, programmable ROPS give you the possibility to define your own render format (like LogLUV for HDR), you can do your own blending, your own MSAA resolves (for correct HDR, gamma etc), you can have double/triple/N-tuple Z-buffers for depth peeling, soft shadows, correctly sorted and lit translucency (even with deferred lighting). You will no longer be hardware limited on MRT count or MSAA samples, the whole notion of predefined render targets becomes obsolete.
These alone take LRB into a league of its own.
Technically you can attempt to implement these with CUDA via writing your own rasterization/shading/blending but this will be a monumental waste of time - you'll be rediscovering LRB at this point.
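Just to make it concrete, what I mean by a programmable blend is essentially a per-pixel read-modify-write that you write yourself - something like this trivial (untested, CUDA-style) sketch, where the buffer layout and the blend op are purely illustrative:

__global__ void customBlend(float4 *renderTarget,  // your own (e.g. HDR) format
                            const float4 *src,     // shaded fragments for this pass
                            int numPixels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numPixels)
        return;

    float4 dst = renderTarget[i];   // the render target *read*
    float4 s   = src[i];

    // any blend you like, e.g. additive in linear HDR space
    dst.x += s.x;  dst.y += s.y;  dst.z += s.z;  dst.w = 1.0f;

    renderTarget[i] = dst;          // write back
}

On LRB this would just be an ordinary loop in its software ROP; the read, the format and the blend are all yours.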
 
Andrew, I think I will remain doubtful that I'd want to do a histogram calculation only via global atomic operations on a shared global histogram on LRB (or a multi-core CPU) until I see someone time it with various input data distributions ;) Clearly there are also program-managed local store tricks to be done even in the case where the histogram doesn't fit in LS, if you have a little knowledge about its distribution (perhaps from the previous pass).

I think it is somewhat interesting that we are talking about an isolated task with very low ALU work per MEM access here. Also things like traditional image-space post-processing reductions, which would typically be done in one TEX-bound pass, would fall into the same class.

Depending on how MPMD the other NVidia/AMD DX11 hardware ends up, LRB could have an advantage here regardless of the cache issue, by running other ALU bound tasks at the same time to completely hide the MEM bound nature of this class of problem. Also probably not an issue to only run the MEM bound job on a few cores instead of distributing the job across the entire chip. I'm more excited about this than I am about the cache!
 
Andrew, I think I will remain doubtful that I'd want to do a histogram calculation only via global atomic operations on a shared global histogram on LRB (or a multi-core CPU) until I see someone time it with various input data distributions ;) Clearly there are also program-managed local store tricks to be done even in the case where the histogram doesn't fit in LS, if you have a little knowledge about its distribution (perhaps from the previous pass).
Oh I'm not saying you necessarily want to do it globally, but you do want to split it up as little as you can get away with while avoiding significant contention. If your input data is fairly coherent (come now, you must be able to imagine a lot of these cases in graphics :)), at the very most you'll want histogram copies per core... that's at least an order of magnitude fewer copies than you'd need to keep a typical GPU busy.

I think it is somewhat interesting that we are talking about an isolated task with very low ALU work per MEM access here. Also things like traditional image-space post-processing reductions, which would typically be done in one TEX-bound pass, would fall into the same class.
Well it's the place where cache vs. local memory matters most. Hell, math is practically free on these types of processors nowadays anyway :)

Depending on how MPMD the other NVidia/AMD DX11 hardware ends up, LRB could have an advantage here regardless of the cache issue, by running other ALU bound tasks at the same time to completely hide the MEM bound nature of this class of problem. Also probably not an issue to only run the MEM bound job on a few cores instead of distributing the job across the entire chip. I'm more excited about this than I am about the cache!
Sure thing, but it's all somewhat related. Having a larger cache means that you can more often choose to run fewer "blocks" and directly operate on a minimal data-set at the "core" level rather than running thousands of fibers/"work items"/whatever which buys you something in memory efficiency.

And yeah, the option to do some task parallelism as well is really important IMHO... you almost always want a mix of data and task-parallelism, so having both be efficient is desirable.

That said, I'd be surprised if you could find a lot of ALU-bound workloads on modern GPUs... in my experience almost everything is memory-bound nowadays. The only good example I have is a pixel shader that solves for ray-patch collisions by Newton iteration and direct evaluation of the patch function, but even that one *flies* on R770 (which is somewhere near 4x as fast as G80 in this example actually - haven't tried GT200).
 
Exactly! In other words programmable ROPS (or alternatively pixel shaders with RT input) is all LRB needs to have better performance/quality than anything with CUDA/OpenCL even with DX11.
For example, programmable ROPS give you the possibility to define your own render format (like LogLUV for HDR), you can do your own blending, your own MSAA resolves (for correct HDR, gamma etc), you can have double/triple/N-tuple Z-buffers for depth peeling, soft shadows, correctly sorted and lit translucency (even with deferred lighting). You will no longer be hardware limited on MRT count or MSAA samples, the whole notion of predefined render targets becomes obsolete.
These alone take LRB into a league of its own.
Technically you can attempt to implement these with CUDA via writing your own rasterization/shading/blending but this will be a monumental waste of time - you'll be rediscovering LRB at this point.

Even with DX9 many of us have our own render formats; it's just that we cannot do all blending operations with them. So programmable ROPs provide the ability to blend into a special format. What's actually likely better is not to blend at all and instead do everything (lighting/etc) in one pass sans ROPs (which should be possible post MSAA raster via DX11 CS or OpenCL too BTW, and also is done in at least one DX9 level engine that I know of). As for MSAA post raster, you can fetch those samples to my knowledge in DX11 from CS. Also I thought render targets were sample-able in DX11 from PS or CS?

Besides, is N-buffer Z-buffering really practical even in the DX11 lifetime? We are talking about some awfully thick RTs (deferred lighting), and thus, with N peels/pixel, a really small tile in whatever percentage of 256KB is available for RT storage? The answer to this might be yes, if you dynamically trade MSAA for peels per sub-tile.
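(For instance, assuming something like 16 bytes/pixel of RT data per peel and 4 peels, that's 64 bytes/pixel - even with the full 256KB that's only ~4K pixels, i.e. a 64x64 tile, before you've spent a single byte on MSAA samples.)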

I would NOT discount either LRB or NV/ATI in regards to the games we get to play with programmable "render targets" in DX11 hardware.
 
Oh I'm not saying you necessarily want to do it globally, but you do want to split it up as little as you can get away with while avoiding significant contention. If your input data is fairly coherent (come now, you must be able to imagine a lot of these cases in graphics :)), at the very most you'll want histogram copies per core... that's at least an order of magnitude fewer copies than you'd need to keep a typical GPU busy.

I think you can do a single histogram per core via CUDA (and DX11 CS or CL), with one block per core, and don't use sync threads until done with ALL processing on the core. Warps can scatter atomic add to one local memory histogram just fine.
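Something along these lines, say (untested sketch; 256 bins of 8-bit input are assumed purely for illustration, and the shared-memory atomicAdd needs GT200-class hardware as noted further down):

#define NUM_BINS 256

__global__ void blockHistogram(const unsigned char *data, int n,
                               unsigned int *blockHist) // one NUM_BINS histogram per block
{
    __shared__ unsigned int hist[NUM_BINS];

    // clear this block's ("core's") histogram; one sync after the clear is unavoidable
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        hist[i] = 0;
    __syncthreads();

    // warps scatter atomic adds into the single shared-memory histogram
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&hist[data[i]], 1u);

    // only sync again once ALL processing on the core is done
    __syncthreads();

    // write the local result out for the later reduction pass
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        blockHist[blockIdx.x * NUM_BINS + i] = hist[i];
}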

That said, I'd be surprised if you could find a lot of ALU-bound workloads on modern GPUs...

Well I know of a bunch of core ALU-bound stuff, and ironically one which is a highly optimized post processing pass (which you would think would be TEX bound)...
 
I think you can do a single histogram per core via CUDA (and DX11 CS or CL), with one block per core, and don't use sync threads until done with ALL processing on the core. Warps can scatter atomic add to one local memory histogram just fine.
Yeah for sure, but in my experience CUDA isn't exactly efficient when only running with, say, 240 "threads", 8 "warps", etc. That may have changed though, which would be cool.

Well I know of a bunch of core ALU-bound stuff, and ironically one which is a highly optimized post processing pass (which you would think would be TEX bound)...
Good to hear that there's still stuff like that, because it's definitely getting few and far between, to the point where I suspect it won't be very long before performance is entirely dictated by an algorithm's/processor's ability to efficiently move memory around.

What's actually likely better is not to blend at all and instead do everything (lighting/etc) in one pass sans ROPs (which should be possible post MSAA raster via DX11 CS or OpenCL too BTW, and also is done in at least one DX9 level engine that I know of).
Sure, deferred shading style. What true RTR provides though is the ability to construct per-pixel data structures over all of the triangles that hit a given pixel. In the deferred shading case, this equates to being able to do all of your lighting from an arbitrary number of lights while only reading in the G-buffer once, which is a huge bandwidth savings. There are more complicated algorithms that need to R/W the render target per pixel/triangle as well that simply can't be expressed efficiently without RTR... you'd need to do one pass per triangle :S
 
Exactly! In other words programmable ROPS (or alternatively pixel shaders with RT input) is all LRB needs to have better performance/quality than anything with CUDA/OpenCL even with DX11.
Even assuming it will be competing with just slightly upgraded present-generation architectures, programmable FLOPs still matter just a bit. If the difference is large enough then no amount of flexibility can overcome it, especially since not being able to beat them at their own game will hurt adoption and developer support.

In the end, plain DX11 competitiveness is going to decide its early success, and I'm afraid that means Intel's persistence will decide its success in the long run.
 
Yeah for sure, but in my experience CUDA isn't exactly efficient when only running with, say, 240 "threads", 8 "warps", etc. That may have changed though, which would be cool.

Per block, if you want to hide normal ALU latency, about 64 threads are good enough. If you want to hide global memory access latency (coalesced), you need about 192 or more threads per block. So it basically depends on what task you are doing.

For histograms, you'll want to hide global memory latency so you need 192 or 256 threads per block. Therefore you'll want to use atomic shared memory access (which is supported by GT200 only). It'd be impossible for each thread to have their own histogram in shared memory since it's not big enough.

The histogram sample in CUDA SDK uses a specific trick to avoid using atomic shared memory access but that's not really portable. It also has a "proper" sample using atomic shared memory access and the performance is similar.
 
Even assuming it will be competing with just slightly upgraded present-generation architectures, programmable FLOPs still matter just a bit.
Certainly true, but it's not hard to come up with algorithms that are thousands of times or more efficient with RTR vs. without, and I doubt the FLOP/memory bandwidth differences will be that large :)

I do agree that doing well in DX11 is a key point though. Note that even through that API I suspect there will be "different" tradeoffs between the various architectures, so software will certainly still have space to innovate and optimize.
 
Therefore you'll want to use atomic shared memory access (which is supported by GT200 only). It'd be impossible for each thread to have their own histogram in shared memory since it's not big enough.
Yep makes sense, although with "wide" histogram bins (i.e. structures) or high enough resolution, I suspect you'd be somewhat restricted by local memory even with just one histogram per block.

The histogram sample in CUDA SDK uses a specific trick to avoid using atomic shared memory access but that's not really portable. It also has a "proper" sample using atomic shared memory access and the performance is similar.
Cool, I'll check that out when I get a chance. Curiously how many "blocks" do you end up having to use on GT200? 30? Dozens? Hundreds? Presumably if there are too many it'll hurt memory bandwidth and the reduction cost at the end, no?

I'm also curious as to the efficiency of shared memory atomics with collisions. If all of the SIMD lanes scatter/atomic to the same location, does it get purely serialized, or is there some cleverness in "horizontal" log(n)-style reduction across the SIMD lanes?
 
Yep makes sense, although with "wide" histogram bins (i.e. structures) or high enough resolution, I suspect you'd be somewhat restricted by local memory even with just one histogram per block.

Sure. 16KB shared memory is not very big.
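(256 bins of 32-bit counters is already 1KB, so a 4096-bin histogram of plain counters fills the whole 16KB by itself, and anything with wider per-bin structures hits the wall much sooner.)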

Cool, I'll check that out when I get a chance. Curiously how many "blocks" do you end up having to use on GT200? 30? Dozens? Hundreds? Presumably if there are too many it'll hurt memory bandwidth and the reduction cost at the end, no?

GT200 has 30 multiprocessors, so it should use at least 30 blocks. The default value in the sample is 64, and that results in about 6.6GB/s for computing the histogram. The number of threads per block is 192.

I fiddled with the number of blocks. Using only 32 blocks the result is worse, about 5.2GB/s. Using 128 blocks the result is better, at about 8GB/s. It maxed out at about 256 blocks, which gives ~9GB/s. Of course, the best number of blocks greatly depends on the hardware and sometimes has to be found through profiling.

I'm also curious as to the efficiency of shared memory atomics with collisions. If all of the SIMD lanes scatter/atomic to the same location, does it get purely serialized, or is there some cleverness in "horizontal" log(n)-style reduction across the SIMD lanes?

I'm not sure about this. The shared memory is banked, so accesses get serialized if there's a bank conflict. It supports broadcast if all threads (in a warp) are reading from the same memory address, but that does not apply to writes or atomic operations AFAIK.
 
I'm also curious as to the efficiency of shared memory atomics with collisions. If all of the SIMD lanes scatter/atomic to the same location, does it get purely serialized, or is there some cleverness in "horizontal" log(n)-style reduction across the SIMD lanes?

For atomics, I don't see how anything other than serialization would be considered correct.
 
For atomics, I don't see how anything other than serialization would be considered correct.

I think Andrew is talking about cases of collisions on atomic operations when the atomic operation does NOT require a return value. Then one could do a parallel reduction in some cases. For example a 4 way colliding atomicAnd on address ptr could be done like this, with the two inner And()s in parallel,

atomicAnd(ptr, And(And(a, b), And(c, d)))
 
I think Andrew is talking about cases of collisions on atomic operations when the atomic operation does NOT require a return value. Then one could do a parallel reduction in some cases. For example a 4 way colliding atomicAnd on address ptr could be done like this, with the two inner And()s in parallel,

atomicAnd(ptr, And(And(a, b), And(c, d)))

The direction I'm coming from might be a bit low-level, so I'm not sure how to parse this.

atomicAnd loads a value at the ptr address and then performs a nested And over four components in that value?

I'm trying to figure out what parts of the operation may fail, if this is not a read/modify/write operation.
 
The direction I'm coming from might be a bit low-level, so I'm not sure how to parse this.

atomicAnd loads a value at the ptr address and then performs a nested And over four components in that value?

I'm trying to figure out what parts of the operation may fail, if this is not a read/modify/write operation.

d = And(a,b) // would compute d = a&b; in C syntax

// atomicAnd would be, in C-like syntax:
int AtomicAnd(int* ptr, int src) { int old = ptr[0]; ptr[0] &= src; return old; }

So what I was suggesting (in the prior case of a 4 way colliding address without needing the atomic return value) was a parallel non-atomic And reduction followed by one single atomicAnd. Easy to do in software if you know in advance which addresses are going to collide. Seems really messy to do this in general in hardware when the software doesn't know which addresses are going to collide.
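In CUDA-ish terms the 4-way case would boil down to something like this (just a sketch of the pattern, names made up):

// four values from lanes/threads that would all have hit the same address
__device__ void collidingAnd(int *ptr, int a, int b, int c, int d)
{
    int ab = a & b;           // these two ANDs are independent,
    int cd = c & d;           // so they can go in parallel (log2(4) = 2 steps)
    atomicAnd(ptr, ab & cd);  // one atomic instead of four colliding ones
}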
 
So what I was suggesting (in the prior case of a 4 way colliding address without needing the atomic return value) was a parallel non-atomic And reduction followed by one single atomicAnd. Easy to do in software if you know in advance which addresses are going to collide. Seems really messy to do this in general in hardware when the software doesn't know which addresses are going to collide.
Yeah that's what I meant. My point was that "coherent"/contended atomic operations almost always reduce to a serialization of all of the values in hardware, which gets very slow in some cases, whereas the ideal is actually a log(n)-style reduction, which can often be implemented pretty efficiently in software with a few horizontal operations, and caches let you soften the blow a bit too if there's lots of coherence.

The GPU histogram generation papers all cover this issue - in the case where you're scatter/writing to the same address, things slow to a crawl. There are various things you can do to make the situation better, but they all involve increasing the number of independent histograms and thus the total cost/efficiency of the operation.
 