Intel working on Larrabee 2.

Jawed · Oct 9, 2008

Taking GT200 and RV770 as baselines, each GPU dedicates about 30% of the die to ALUs including register file and scheduling. Over the next 18 months these designs will both see a massive increase in computing power. So the question there is, what will be the rate of increase in RBEs and MCs?

We can be reasonably sure that ATI is sticking with 4:1 ALU:TEX for at least one generation and prolly the one after, so TUs will increase in count in proportion to the increase in ALUs.

I'm thinking that over the next 18 months we'll see something like a 4x increase in ALUs in ATI GPUs. RBEs and MCs will not increase by anything like this number, double at most. So the ALUs will be taking a dominant portion of the die.

In NVidia's case I suspect the ALUs won't increase so quickly, but there'll be a radical gain in performance per TMU and per ROP, i.e. as a proportion of the die, the ALUs will increase much like we'll see with ATI.

So ALUs will be ~60% of the die in 18 months? And the generation after that? 70%+?

Assuming a 256-bit bus and fixed function RBEs, are we looking at about a minimum of 25% of the die being fixed function - since I/O consumes a lot of area? Will a 512-bit bus ever become the norm?

So while Larrabee could initially look "unbalanced" being so ALU-heavy (though some of that is scalar core ALUs) we'll start seeing the other GPUs catching up over the next 2 to 3 generations.

Additionally, Larrabee promises to use much less bandwidth due to its tiled pixel shading approach. Perhaps half. That again increases the proportion of Larrabee that is ALUs rather than fixed function.

The other killer, particularly for NVidia, is that Larrabee's true-scalar per element (pixel) ALU design and it's four hardware thread and "dumb context switching" software threading model means that the control overhead on Larrabee's vector ALUs is way way lower than in NVidia's design. Larrabee also uses a 16-lane ALU design (as opposed to NVidia's 8), again radically reducing the control overhead per element.

So, as ALUs naturally become the dominant part of all high-end GPUs, Larrabee will be right near the front in terms of performance per mm for its ALUs.

Also we're expecting the other GPUs to replace fixed function units (e.g. ROPs) with programs running on the ALUs. This will further increase the proportion of the die that is ALUs.

We know that all of NVidia's major functional units, ALUs, TMUs, RBEs and MCs are horribly inefficient per mm right now. I see hope for all but the ALUs, but that'll only come in time for the era when ALUs become dominant. If NVidia is building ALUs that are half the performance per mm of Intel then how's it going to compete?

Finally, no-one in their right mind buys version 1.0 of anything. So, I'm very much looking forward to Larrabee 2. That should be the GPU that dishes out the pain, top to bottom. Can't wait

Jawed

Megadrive1988 · Oct 11, 2008

Thinking very optimistically for a moment, Larrabee 2 and Larrabee 3 could be the new 'golden age' of desktop 3D much like Voodoo and Voodoo 2 in 1996-1999.

CouldntResist · Oct 11, 2008

Jawed said:
Also we're expecting the other GPUs to replace fixed function units (e.g. ROPs) with programs running on the ALUs.

Why ROPs wouldn't stay?
- ROPs could be useful outside of the realm of rasterisation APIs. Atomic operations are already there in GPGPU.
- Texture sampling is going to stay fixed function, so why not ROPs? ROPs are complementary to TUs: one is used in read-only fashion, the other is write-only (almost). Perhaps, it might be more efficient to keep this dualism, rather than follow Larrabee's way and it's fully shared & writable cache?

Jawed · Oct 11, 2008

CouldntResist said:
Why ROPs wouldn't stay?

Key reasons:

As rendering gets more complex ROP utilisation falls, i.e. frame rendering becomes less fillrate bound with increasing proportions of ALU-bound, TEX-bound or bandwidth bound operations. Additionally ROPs have multiple modes, each of which can be independently used or used in conjunction with other modes, i.e. colour write with/without MSAA write or z-only write. Any time only one mode is used, the hardware dedicated to other modes goes idle.
Developers want to perform more complex operations on render target data - these operations generally aren't available with current fixed function hardware or require multiple passes. e.g. write their own blending functions. This k-buffer stuff came up before in a previous Larrabee thread:

http://www.sci.utah.edu/~bavoil/research/kbuffer/

- ROPs could be useful outside of the realm of rasterisation APIs. Atomic operations are already there in GPGPU.

Generally there's a desire to dissociate atomic memory operations from ROP hardware - since being forced to rasterise (as older GPGPU programmers were) is a pain in the arse.

Plus Larrabee's L2 cache neatly performs at the heart of atomicity for RBE (ROP) operations. All current GPUs have some form of cache for render target manipulation, so what makes them really different is their fixed function nature.

- Texture sampling is going to stay fixed function, so why not ROPs? ROPs are complementary to TUs: one is used in read-only fashion, the other is write-only (almost). Perhaps, it might be more efficient to keep this dualism, rather than follow Larrabee's way and it's fully shared & writable cache?

L2 in Larrabee is effectively the gather/scatter portal. Other GPUs are forced to work around their current cache configurations (and helping the situation by adding additional, dedicated, cache: parallel data cache in G80 and later GPUs and local data share/global data share in RV770 and later GPUs) in order to provide this functionality.

Texture decompression/filtering are something like an order of magnitude more computationally intensive than current ROP actions.

D3D11 doesn't directly target programmable output merger (instead of fixed function ROPs) but it relies upon pixel shading to create source data that is consumed by compute shaders that perform "post processing" of the render target/data structure - so it's a step in that direction. It's hard to tell whether ATI and NVidia GPUs are on the cusp of programmable OM or if it'll be sometime after D3D11 arrives.

Jawed

Geo · Oct 14, 2008

Megadrive1988 said:
Thinking very optimistically for a moment, Larrabee 2 and Larrabee 3 could be the new 'golden age' of desktop 3D much like Voodoo and Voodoo 2 in 1996-1999.

I think I want to see a software strategy that makes sense over time, with increasing ISV buy-in, before I start getting all starry-eyed.

Mars3D · Oct 14, 2008

Megadrive1988 said:
Thinking very optimistically for a moment, Larrabee 2 and Larrabee 3 could be the new 'golden age' of desktop 3D much like Voodoo and Voodoo 2 in 1996-1999.

Initially I think we will see a golden age for hobbyists. Larrabee may also be a great opportunity for low budget developers. The triple A developers are stuck developing for a large userbase. In the longer term I'm hoping we'll see a return of emphasis on programming innovation and less reliance on media content. There may be an explosion of diversity in rendering techniques.

It's a great time to be a graphics programmer, if only I didn't need to sleep...

3dilettante · Oct 14, 2008

Mars3D said:
Initially I think we will see a golden age for hobbyists. Larrabee may also be a great opportunity for low budget developers.

What exactly is the source of standard GPUs' high cost of entry versus writing one's whole software rendering stack?

The triple A developers are stuck developing for a large userbase. In the longer term I'm hoping we'll see a return of emphasis on programming innovation and less reliance on media content.

The trends in costs show content creation is an ever-expanding portion of budgets, but that seems almost agnostic to the hardware using it.

TimothyFarrar · Oct 14, 2008

Jawed said:
L2 in Larrabee is effectively the gather/scatter portal. Other GPUs are forced to work around their current cache configurations (and helping the situation by adding additional, dedicated, cache: parallel data cache in G80 and later GPUs and local data share/global data share in RV770 and later GPUs) in order to provide this functionality.

Texture decompression/filtering are something like an order of magnitude more computationally intensive than current ROP actions.

D3D11 doesn't directly target programmable output merger (instead of fixed function ROPs) but it relies upon pixel shading to create source data that is consumed by compute shaders that perform "post processing" of the render target/data structure - so it's a step in that direction. It's hard to tell whether ATI and NVidia GPUs are on the cusp of programmable OM or if it'll be sometime after D3D11 arrives.

Jawed

Or will L2 function more as a register spill scratchpad, due to the amount of "threads" which are going to need to be running to hide latency. So I'm not convinced that really using scatter dominate algorithms is going to be a good idea on Larrabee (similar reason one wouldn't do this on CUDA). Also from the SigGraph Paper, Larrabee can only access one cacheline per clock for vector scatter/gather. So it is not like scatter is free.

BTW, CUDA 2.0 PTX ISA 1.2 (have to download the newest CUDA toolkit to get the PDF) refers to a .surf memory space (which is not currently supported). This .surf is documented as R/W, context shareable, and accessible via surf instructions (ie like texture instructions). Would effectively be a high latency coherent cache. If this is the case perhaps ROP gets generalized to this in the future.

Jawed · Oct 15, 2008

TimothyFarrar said:
Or will L2 function more as a register spill scratchpad, due to the amount of "threads" which are going to need to be running to hide latency.

I guess L2 (and implicitly video memory - just like in R600 and later) will function as register spill and this will be a primary function of L2.

Additionally, though, I was hinting at the idea that L2 as registers and L2 as scatter/gather portal are "interchangeable" concepts. Algorithms may be able to use L2-as-registers instead of doing gather/scatter, exchanging "register index arithmetic" for "address indirection".

I'm just guessing - and I don't know what techniques/situations might make one method more preferable than the other. Though I think it's reasonable to guess that an algorithm that has a wide "indexing radius" will only work using scatter/gather.

So I'm not convinced that really using scatter dominate algorithms is going to be a good idea on Larrabee (similar reason one wouldn't do this on CUDA). Also from the SigGraph Paper, Larrabee can only access one cacheline per clock for vector scatter/gather. So it is not like scatter is free.

No. But the bank conflicts that come with "irregular" register/cache-line/PDC accesses in CUDA are not free either.

BTW, CUDA 2.0 PTX ISA 1.2 (have to download the newest CUDA toolkit to get the PDF) refers to a .surf memory space (which is not currently supported). This .surf is documented as R/W, context shareable, and accessible via surf instructions (ie like texture instructions). Would effectively be a high latency coherent cache. If this is the case perhaps ROP gets generalized to this in the future.

Yes, I've seen several references to .surf and people pining to be able to use it.

What you say about ROPs here is very pertinent - I think you're prolly right. And I've been studying the Gamefest D3D11 material some more and it seems to me that pixel shaders will have read/write access to resources (i.e. "textures" and the like). So this is a kind of back-door programmable OM, rather than the distinct programmable shader stage that I, at least, was expecting.

So NVidia's .surf and the ability of pixel shaders to read/write like this seem to be the same thing.

This'll be home territory for Larrabee, of course.

Oh, another interesting thing in D3D11 is that a thread group can consist of 1024 threads. The intention is that with each iteration of D3D thereafter, the group size will increase. Also, get ready to be boggled: each thread must be able to share 8192 scalar 32-bit values (or 2048 vec4s) to any other thread in the group.

Jawed

TimothyFarrar · Oct 15, 2008

I think this means share up to 8192 scalers in the same way that if one tried this in CUDA that the number of threads to hide latency drops like a stone. So I'm not to surprised at that number.

As for read/write access in compute and pixel shaders, I was under the impression that these were effectively the equivalent to global memory access in CUDA. With GT2xx the hardware is much better at gather/scatter than prior G8x/G9x, but not yet at the point where ROP would be a good idea to do in software. Besides the hardware Z path is so important (ie how it ties to rasterization) that ROP removal I don't see being done unless raster goes software. So I think .surf on NVidia GPUs literally would have to be an interface to the ROP and accessible via instructions separate from global memory accesses. Actually I would rather have .surf as separate instructions, and have triangle setup get parallelized. Then one would have an insanely fast GP binning interface.

Anyway if there is anything to take home from this, perhaps if non-Intel GPUs get coherent caches, they will just be smaller and higher latency --- still relying on global memory access, shared registers and latency hiding for core parallel computation. Which is in IMO a good idea.

3dilettante · Oct 15, 2008

Jawed said:
No. But the bank conflicts that come with "irregular" register/cache-line/PDC accesses in CUDA are not free either.

There's a possible wrinkle with Larrabee's L2, or at least Larrabee I.
David Kanter at RWT said that the L2 was blocking, which if I can find more confirmation would mean an L2 miss can have a significant impact on any scheme that leverages the L2 heavily. The cache control instructions and non-temporal prefetches might be able work around it.

TimothyFarrar · Oct 15, 2008

Or perhaps L1 miss is detected early enough and assumed to be a L2 hit in the pipeline, so if L2 misses only then the entire pipeline for that hyperthread needs to be flushed and blocked until the mem->L2->L1 transfers are finished.

3dilettante · Oct 15, 2008

TimothyFarrar said:
Or perhaps L1 miss is detected early enough and assumed to be a L2 hit in the pipeline, so if L2 misses only then the entire pipeline for that hyperthread needs to be flushed and blocked until the mem->L2->L1 transfers are finished.

I don't follow what you're getting at with this.

An L1 miss goes to the L2 anyway. There's no need to flush the pipeline, the load just stalls if there is a miss, and the thread stalls like any other time a hazard is encountered.
Stalling until the L2 miss is serviced is what a conservative in-order core will do.

The blocking L2 means that the L2 will not service any access from any thread until the miss is satisfied. Best to hope not any of the other threads running wants to go to the L2 in that time.

Jawed · Oct 15, 2008

L2 blocking should be mitigated by 3 factors:

L1 miss percentage, 5%?, independently on all four hardware threads - so if one hardware thread suffers an L2 miss, the chances of all 3 of the other threads suffering a miss on L1 depends on the length to service the L2 miss
L2 misses are potentially serviced from other L2s, reducing the L2 latency
Some L2 misses are serviced from the texture unit - texture results are always dumped into L2, if I remember right - so this should also help to reduce the L2 latency (though, of course, texture fetch/filtering latency can itself be longer than memory fetch latency - depends on TU cache hit performance)

Making L2 blocking like this is, I guess, a compromise in the face of the large number of concurrent demands on L2 - i.e. peers, texture unit(s), L1 as well as memory reads + writes.

---

The only reference to "L2 blocking" I can find on RWT is:

http://www.realworldtech.com/forums/?action=detail&id=93319&threadid=93146&roomid=2

Looking at the context:
http://www.realworldtech.com/forums/?action=detail&id=93318&threadid=93146&roomid=2
http://www.realworldtech.com/forums/?action=detail&id=93317&threadid=93146&roomid=2

it appears that "L2 blocking" is actually referring to the blocking of an L2 shared by TMU L1s in a conventional GPU. It doesn't appear to be a reference to blocking in Larrabee.

So, do you have an explicit link about L2 blocking in Larrabee?

Jawed

TimothyFarrar · Oct 15, 2008

3dilettante said:
I don't follow what you're getting at with this.

An L1 miss goes to the L2 anyway. There's no need to flush the pipeline, the load just stalls if there is a miss, and the thread stalls like any other time a hazard is encountered.
Stalling until the L2 miss is serviced is what a conservative in-order core will do.

The blocking L2 means that the L2 will not service any access from any thread until the miss is satisfied. Best to hope not any of the other threads running wants to go to the L2 in that time.

After your last comment, I think we are both saying the same thing. My comment was that L2 miss would only stall one hyperthread as long as the other threads didn't have an L2 miss. Also that an L2 miss probably includes a pipeline flush of all previous instructions which have not yet completed but only from those issued in the L2 missed hyperthread.

3dilettante · Oct 15, 2008

Jawed said:
L2 blocking should be mitigated by 3 factors:

L1 miss percentage, 5%?, independently on all four hardware threads - so if one hardware thread suffers an L2 miss, the chances of all 3 of the other threads suffering a miss on L1 depends on the length to service the L2 miss

Reference?
I've seen 95% L1 hit rate numbers quoted for single-threaded CPUs.

http://www.realworldtech.com/forums/index.cfm?action=detail&id=93319&threadid=93146&roomid=2
The only reference to "L2 blocking" I can find on RWT is:

Click to expand...

That was the one, I must have misunderstood which chip's L2 he was talking about.

Jawed · Oct 15, 2008

TimothyFarrar said:
I think this means share up to 8192 scalers in the same way that if one tried this in CUDA that the number of threads to hide latency drops like a stone. So I'm not to surprised at that number.

Agreed, it's a "high number" designed not to present too much of a false ceiling, just like D3D10 requires support for 4096 vec4s.

As for read/write access in compute and pixel shaders, I was under the impression that these were effectively the equivalent to global memory access in CUDA.

There's two separate concepts here: reading/writing a defined resource (a surface, texture or render target, if you like) and making arbitrary access to memory.

But there's not much to go on...

With GT2xx the hardware is much better at gather/scatter than prior G8x/G9x, but not yet at the point where ROP would be a good idea to do in software.

Really the key here is to hide the latency of these memory accesses, which is a "threads in flight" problem. Gather/scatter is obviously more troublesome than accessing a defined resource (such as a 2D surface), which is more amenable to rasteriser-based chunking, something that Larrabee does for all pixel shading.

Pixel shaders can also write to irregular data structures, but I'm unclear on whether they can read and update them.

But, hopefully, pixel shader updating of a 2D resource should be the highest-performance variant of these techniques - requiring the least number of threads in flight.

Besides the hardware Z path is so important (ie how it ties to rasterization) that ROP removal I don't see being done unless raster goes software.

Not being a developer, I'm not sure if there's much demand for writing code that reads and modifies Z or stencil values - perhaps ROP-style hardware is destined to stay for quite a while for Z/stencil - at least until the fixed-function GPUs move towards a tiled forward rendering style?

So I think .surf on NVidia GPUs literally would have to be an interface to the ROP and accessible via instructions separate from global memory accesses.

If .surf is read/write for "colour" only, it could be implemented solely through a latency-hidden cache against memory, leaving all the optimisations for early-Z intact working on the fixed-function rasteriser/early-Z/ROP units.

Anyway if there is anything to take home from this, perhaps if non-Intel GPUs get coherent caches, they will just be smaller and higher latency --- still relying on global memory access, shared registers and latency hiding for core parallel computation. Which is in IMO a good idea.

I think that's the only route to scalability. But then you get the separate problem of how close to peak performance can you tweak? In theory, when faced with 3 or more GPU architectures that all offer the same functionality, what kind of "peak" performance can you reasonably expect for this kind of hairy memory usage?

We're already familiar with features on current GPUs that can vary by 5x in their performance based solely on which GPU architecture you choose. Arguably the Larrabee fan club would say that this is the specific reason why everything other than Larrabee is doomed...

Jawed

Jawed · Oct 15, 2008

3dilettante said:
Reference?
I've seen 95% L1 hit rate numbers quoted for single-threaded CPUs.

It was a guess, hence the question mark. I was guessing that in throughput computing, L1 hitrates should be pretty good.

Though we're talking of throughput computing with arbitrary gathers/scatters mixed-in, so that's not the same. I dunno. Presumably, with working hardware, the Larrabee guys have a rough idea now...

That was the one, I must have misunderstood which chip's L2 he was talking about.

If Larrabee's L2 is blocking, it'd be "disappointing". That may be version 1.0-itis or just a reality of the complexity of L2 architecture.

Jawed

3dilettante · Oct 15, 2008

To possibly add more weight to the position that I misread the RWT post, I'll note that the Pentium's L2 was non-blocking.
Since Larrabee leverages so much of that core, it would seem a big change if L2 became blocking.

TimothyFarrar · Oct 15, 2008

Jawed said:
There's two separate concepts here: reading/writing a defined resource (a surface, texture or render target, if you like) and making arbitrary access to memory.

I took another look at the DX11 stuff this time from the SigGraph papers and http://s08.idav.ucdavis.edu/boyd-dx11-compute-shader.pdf describes a RWTexture2D and RWBuffer in Optimized I/O Intrinsics. So perhaps my assumptions are wrong.

Without some kind of support for fast swizzle instruction for address generation, it wouldn't make any sense to do a CUDA like write into anything but a linear texture. However it looks as if DX11 has ability to access "any" resource directly (including swizzled textures?). So perhaps this RWTexture2D business is the DX11 equivalent of the .surf in PTX.

Not being a developer, I'm not sure if there's much demand for writing code that reads and modifies Z or stencil values - perhaps ROP-style hardware is destined to stay for quite a while for Z/stencil - at least until the fixed-function GPUs move towards a tiled forward rendering style?

I can see a huge reason to want to write Z in the fragment shader. And in fact DX11 includes oDepth which is designed for this very purpose. Ability to write depth such that depth is father out than the plane equation depth for the triangle and still have fast Z culling in raster stage.

Intel working on Larrabee 2.

Jawed

Megadrive1988

CouldntResist

Jawed

Geo

Mostly Harmless

Mars3D

3dilettante

TimothyFarrar

Jawed

TimothyFarrar

3dilettante

TimothyFarrar

3dilettante

Jawed

TimothyFarrar

3dilettante

Jawed

Jawed

3dilettante

TimothyFarrar

Similar threads