Larrabee: Samples in Late 08, Products in 2H09/1H10

Ok, I have a question for you all. What functions on a modern GPU are handled by dedicated hardware?

I'm mostly familiar with CPU hardware and know only a bit about GPUs. Hence my question: what remains in dedicated fixed function hardware on a modern GPU? I assume that some sort of pixel/fragment processing and/or z-buffering is done in fixed function hardware, but I'm actually not sure. I'm even less sure of the exact computation being done. What are the inputs? What are the outputs?
 
Maybe someone else has time to add to this list, but I'd like to comment that with each successive DX generation GPUs have added more fixed function hardware. The rate is lower than the rate at which programmable processors have been added, but it's definitely there. I may be forgetting something, but the biggies are texture filtering, depth ops, color ops, triangle clipping, triangle setup, and scan conversion.

Edit: Oh and lots of FIFOs between and around blocks. For example, if you have early depth testing you must store the result until after the pixel shader.
 

Here's a list of the fixed function blocks, including specialized caches:

  1. pre-transform vertex cache
  2. input assembly (CPU friendly)
  3. post-transform vertex cache
  4. primitive assembly (CPU friendly)
  5. primitive culling (backface culling, viewport culling, etc..) (CPU friendly)
  6. primitive setup (CPU friendly)
  7. coarse grained rasterizer
  8. hierarchical z-buffer (CPU friendly)
  9. fine grained rasterizer
  10. interpolators (CPU friendly)
  11. texture address generator
  12. texture filtering (CPU friendly, kind of..)
  13. texture cache(s)
  14. ROPs (a mixed bag, part of the tasks implemented in these units map very well to CPUs)
  15. fixed function tessellator (CPU friendly?)

CPU friendly means that the computation should algorithmically map well to a common CPU core; other requirements, such as precision and accuracy, still need to be defined.
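
As an aside, to make the "CPU friendly" tag concrete, here is a minimal C++ sketch (my own illustration; the types and names are invented) of stage 5, primitive culling: per triangle it's just a signed-area test for winding plus a trivial viewport reject, exactly the kind of branchy, low-arithmetic work a general-purpose core handles without any dedicated hardware.

#include <cstddef>
#include <vector>

struct Vec2 { float x, y; };
struct Triangle { Vec2 v[3]; };

// Signed area in screen space; the sign encodes the winding order.
static float signedArea(const Triangle& t) {
    return (t.v[1].x - t.v[0].x) * (t.v[2].y - t.v[0].y)
         - (t.v[2].x - t.v[0].x) * (t.v[1].y - t.v[0].y);
}

// Backface culling plus trivial viewport rejection on a plain CPU core.
std::vector<Triangle> cullPrimitives(const std::vector<Triangle>& in,
                                     float viewportW, float viewportH) {
    std::vector<Triangle> out;
    out.reserve(in.size());
    for (const Triangle& t : in) {
        if (signedArea(t) <= 0.0f)             // back-facing or degenerate
            continue;
        bool left = true, right = true, above = true, below = true;
        for (std::size_t i = 0; i < 3; ++i) {
            left  = left  && t.v[i].x < 0.0f;
            right = right && t.v[i].x > viewportW;
            above = above && t.v[i].y < 0.0f;
            below = below && t.v[i].y > viewportH;
        }
        if (left || right || above || below)    // entirely outside one edge
            continue;
        out.push_back(t);
    }
    return out;
}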
 
You're missing vertex shader FLOPs there. That's an extra 50 GFLOPs for G71 in the comparison to G80.

That's true, but the architectures are so different that FLOP != FLOP anymore. In a twisted scenario someone could leave the MUL unit out of the calculation on G80 and simply say that it's barely 45 GFLOPs more compared to G71.

As I said in my original post in this thread, it comes down to what each FLOP exactly stands for.

nao,

What about a programmable tessellator? I doubt there's going to be fixed function tessellation in D3D11.

By the way, albeit probably an oversimplification of a diagram, how would this one correlate to your list? (honest question also)

http://www.imgtec.com/images/PowerVR/images/BlockDiagrams/SGX.gif
 
Frankly I would be very surprised if someone would say that a programmable tessellator wouldn't be "CPU friendly".

By the way:

ROPs (a mixed bag, part of the tasks implemented in these units map very well to CPUs)

How bout dumping those into a memory controller?
 
Frankly I would be very surprised if someone would say that a programmable tessellator wouldn't be "CPU friendly".
Right, but for me it's not so much a processing problem to solve, rather it's a memory one. Programmable tessellation will be in DX11, for absolute sure.
Ailuros said:
How bout dumping those into a memory controller?
A lot of it already is, especially the blender.
 
A couple of thoughts:

1) If Intel is not suicidal they will put some sort of fixed function rasterizer on Larrabee in order to help speed up rasterization. Unless they don't care about games.
They need something that can perform coarse and fine grained rasterization so that it can also efficiently run early rejection algorithms.
I guess all sorts of different rendering pipelines will be possible on Larrabee; the question is how efficient such an open/reconfigurable architecture is going to be compared to what the competition will offer in a couple of years.

Do you mean something like this:

http://csg.csail.mit.edu/6.375/projects/group1_final_report.pdf
http://csg.csail.mit.edu/6.375/projects/group1_final_presentation.pdf

That's something I found a week ago while searching for additional information about MSAA.
 
From what I've heard about Larrabee, the vector extensions are wildly different (which is one of the reasons that the rest of Intel doesn't like it). Larrabee doesn't support MMX or any of the SSE instructions.

I agree. As part of their work on Ct, Intel has defined the Virtual Intel Platform (VIP), an abstract vector architecture with 75 instructions and vectors up to 1,024 bits long. At runtime Ct will compile VIP code (via VCG, the VIP Code Generator) for the current target, be it SSE3, SSE4 or, it would seem, Larrabee.
 
  1. pre-transform vertex cache
  2. input assembly (CPU friendly)
  3. post-transform vertex cache
  4. primitive assembly (CPU friendly)
  5. primitive culling (backface culling, viewport culling, etc..) (CPU friendly)
  6. primitive setup (CPU friendly)
  7. coarse grained rasterizer
  8. hierarchical z-buffer (CPU friendly)
  9. fine grained rasterizer
  10. interpolators (CPU friendly)
  11. texture address generator
  12. texture filtering (CPU friendly, kind of..)
  13. texture cache(s)
  14. ROPs (a mixed bag, part of the tasks implemented in these units map very well to CPUs)
  15. fixed function tessellator (CPU friendly?)
I'd tag 1, 3, 7, 9 and 13 as CPU friendly too.

I see little need for specialized caches if every CPU core has a high bandwidth L1 cache. But maybe I'm missing why you call them "specialized" caches.

Furthermore, whereas a GPU would sometimes only have one or a few specialized units, a massively multi-core CPU would always have all its cores available. So bottlenecks are impossible (at this level).

I do believe the number one hotspot for CPUs is the texture sampling. However, SIMD scatter/gather could already make a huge difference there. Those instructions would also help a lot with implementing transcendental functions (by bringing lookup tables back in the game). And they would be useful for a lot more than just graphics. They would finally make well optimized auto-vectorization achievable.
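
To show what I mean by gather making a difference for texture sampling, here's a rough C++ sketch. It uses AVX2-style gather intrinsics purely as a stand-in for "some future gather instruction" (nothing like this exists in SSE today), and the row-major float texture layout and names are invented for the illustration: eight pixels' (u, v) coordinates become eight texel addresses fetched with a single gather instead of eight scalar loads.

#include <immintrin.h>

// Fetch one point-sampled texel for each of 8 pixels.
// u, v are normalized [0,1) coordinates; texels is a width*height float image.
__m256 gatherTexels(const float* texels, int width, int height,
                    __m256 u, __m256 v)
{
    // Scale normalized coordinates to integer texel positions.
    __m256i x = _mm256_cvttps_epi32(_mm256_mul_ps(u, _mm256_set1_ps((float)width)));
    __m256i y = _mm256_cvttps_epi32(_mm256_mul_ps(v, _mm256_set1_ps((float)height)));

    // Clamp to the texture bounds (no wrapping in this sketch).
    x = _mm256_max_epi32(_mm256_min_epi32(x, _mm256_set1_epi32(width  - 1)),
                         _mm256_setzero_si256());
    y = _mm256_max_epi32(_mm256_min_epi32(y, _mm256_set1_epi32(height - 1)),
                         _mm256_setzero_si256());

    // Row-major address: index = y * width + x.
    __m256i index = _mm256_add_epi32(_mm256_mullo_epi32(y, _mm256_set1_epi32(width)), x);

    // One gather replaces eight scalar loads from scattered addresses.
    return _mm256_i32gather_ps(texels, index, 4);
}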
 
I see little need for specialized caches if every CPU core has a high bandwidth L1 cache. But maybe I'm missing why you call them "specialized" caches.

I was going to make this same point. I think the point is even stronger considering the caches in Larrabee are all kept coherent by the hardware. No need to perform explicit extra data movement or flushing of caches (or whatever). In that sort of system, any part of the address space can become a flexible buffer for storing intermediate results on chip.

Furthermore, whereas a GPU would sometimes only have one or a few specialized units, a massively multi-core CPU would always have all its cores available. So bottlenecks are impossible (at this level).

I think this is also a key point. The more specialized units you have, the more likely one of them will become the bottleneck while others go idle. You've statically allocated the resources. With a multi-core CPU, you can dynamically load balance to apply computation to just where you need it.

I do believe the number one hotspot for CPUs is the texture sampling. However, SIMD scatter/gather could already make a huge difference there. Those instructions would also help a lot with implementing transcendental functions (by bringing lookup tables back in the game). And they would be useful for a lot more than just graphics. They would finally make well optimized auto-vectorization achievable.

I don't know much about texture sampling. I could certainly imagine scatter/gather type operations (or at least prefetching) that could really help. It seems that if textures don't hit in the on-chip memories, then you're going to be bandwidth bound. If they fit on-chip, then you can likely sample/fetch them reasonably quickly. Then again, I really don't know that much about texture sampling. Perhaps someone else can say more on that.
 
I'd tag 1, 3, 7, 9 and 13 as CPU friendly too.

I see little need for specialized caches if every CPU core has a high bandwidth L1 cache. But maybe I'm missing why you call them "specialized" caches.
I didn't assign the CPU friendly tag to the specialized caches because I took for granted that it's trivial to replace them, at least the vertex-related ones.
Specialized means that they serve only particular graphics pipeline stages and that they might be tweaked (different ways to tag specialized data, different eviction policies..) to be more efficient at what they do.

And they would be useful for a lot more than just graphics. They would finally make well optimized auto-vectorization achievable.
Can you elaborate a bit more on this?
 
From what I've heard about Larrabee, the vector extensions are wildly different (which is one of the reasons that the rest of Intel doesn't like it).
Sounds like the ones most angry are the main x86 designers, who might have wanted to integrate Larrabee alongside the big cores without a heterogeneous multicore solution. Both Intel and AMD initially thought specialized and software-incompatible cores were the next step, but both have backed off on that for the time being.

Larrabee doesn't support MMX or any of the SSE instructions. They actually went back to the microcode from the Pentium. They did extend it to 64-bit x86, but without SSE.
Sounds like somebody needed the opcode space and didn't want to follow AMD's Another Damned Prefix Byte approach.

Most of my speculation works so long as Larrabee's new instruction set has not done the following:

1) Gone totally Load/Store
2) Gone with non-destructive operands
3) Junked x86 memory and addressing semantics
4) Added 2-4 bytes to every instruction (which kind of goes against the point of avoiding Another Damned Prefix Byte) to the detriment of its instruction cache.

1, 2, and 3 would go a long way in invalidating a lot of the work in established x86 vectorizing compilers Intel was supposedly leveraging for Larrabee to entice software developers.

Such a departure isn't unprecedented. The IBM Cell SPEs use a different set of SIMD instructions (and a different number of registers) than the normal PowerPC Altivec stuff.
The particulars vary, but the overall tenor is not massively different, save one interesting exception.
It's not like the SPE decided to go CISC with destructive operands and reg-memory operations.
The big difference is that the SPEs and their local store and DMA to main memory required a reworking of a lot of the memory addressing behavior and a removal of all software permissions checking.

Larrabee, if it supports any x86 at all, cannot do that (unless there's some wacky scheme where there are two decoders and two control units, for an x86 mode that can handle virtual memory and permissions and a vector mode that does something entirely different).

They basically re-designed the vector instructions from the ground up to be graphics specific. That is how they plan to get away with not having any other specialized graphics hardware on the chip. Just these special vector ALUs. Seems like a big gamble, but I am convinced by the pitch, frankly.

If that means they broke x86 semantics and are moving away from 2 decades of handling x86's baggage, I'd be happy from a philosophical standpoint.
That doesn't jibe with the idea that Intel wants to leverage the weight of x86 to make headway in new markets, though.
Actually, with a very GPU-like extension, it might even open up an avenue for GPUs to partly mitigate Larrabee's x86 advantage, especially since Larrabee doesn't support x86 instructions still covered by copyright. Perhaps Intel Legal and Intel Marketing are also a little cheesed at Larrabee's team because of this.

In many ways, perhaps Larrabee is Cell "done right".
No, 1-2 Gesher cores and 8-16 fully compatible Larrabee cores on the ring bus would be Cell "done right". Larrabee's designers apparently went out of their way to screw that.

Transistor count isn't the most relevant issue anymore. The two most important issues are (1) power and (2) die area. Granted, these are related to transistor count, but not always one-for-one.
Any differential in transistor count per core is going to have an effect on die area and power consumption. It isn't one-for-one, but it almost never is a negative relationship, with one exception being that sleep transistors can cut power consumption for idle units.

edit: To complete this point: Any differential is scaled by a factor of 24. Unless you really expect a significant benefit from the additional transistors, sometimes leaving them out is the better option.

Intel's 45nm process has very small SRAMs and it has very low power SRAM transistors (by using special low-leak transistors). You can get lots of L2 cache on a chip without burning much power or taking up that much die area. Once you're basically power limited by your ALUs, why not throw some extra cache on the chip if you have enough die area?
The question becomes why you're power-limited at a certain level by the ALUs.
There's a difference between working hard and working smart.
If a certain amount of specialized hardware on a common workload task with the footprint of 1 general ALU and supporting network can do the work of 10 ALUs at the same or less power consumption, that's enough to either cut 10 ALUs or add 9 more.

edit: to continue
A specialized unit might have a massive effect on the target workload.
A few million extra L2 transistors in each core that wind up yielding a few percent in a few workloads (thank you diminishing returns), not so much.

Conventional wisdom is that caches don't work for graphics computations. Perhaps Intel has found more locality in graphics applications (in the multi-MB range of caching) than previously thought.
That's an interesting question, and I'm sure research is active in this area.
The conventional wisdom has worked well thus far.
Some low-hanging fruit with large caches is perhaps loading up whole tables and textures, but in many complex scenarios the amount needed for a given pixel can go up and down wildly.

I was going to make this same point. I think the point is even stronger considering the caches in Larrabee are all kept coherent by the hardware. No need to perform explicit extra data movement or flushing of caches (or whatever). In that sort of system, any part of the address space can become a flexible buffer for storing intermediate results on chip.
I hope that can be turned off at will, given how much of the workload doesn't need full coherency. Those nifty scatter and gather operations would, in really bad cases, generate 24 times their number of accesses in coherency updates, to the exclusion of actual data being passed around.
If you know you are working intermediate results, you shouldn't have to care about updating 24 separate caches and 24 separate TLBs.
On the flip side, Intel would likely have to implement some way of maintaining processor affinity. Passing any amount of thread context is enough to saturate a good fraction of the bus for a good amount of time.

I think this is also a key point. The more specialized units you have, the more likely one of them will become the bottleneck while others go idle. You've statically allocated the resources. With a multi-core CPU, you can dynamically load balance to apply computation to just where you need it.
That is one thing the generalized hardware has for a bottleneck: that shared ring bus. It's likely overspecced at 256B/cycle for this reason.
Try load balancing by passing an 8KiB vector thread context between cores more than a couple times.
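(For scale, using the numbers above: an 8 KiB context over a 256 B/cycle ring is 32 bus cycles just to move one thread's state, before any coherency traffic on top.)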

At 24 cores, even if each core has a low probability of requiring this, the aggregate probability is higher.
God forbid either the OS or the driver software does the willy-nilly thread thrashing current x86 multicore does.
 
Right, but for me it's not so much a processing problem to solve, rather it's a memory one. Programmable tessellation will be in DX11, for absolute sure.

Anything else to make Simon once more proud of its grandchild?
 
Specialized means that they serve only determined graphics pipeline stages and that they might be tweaked (different ways to tag specialized data, different eviction policies..) to be more efficient at what they do.
I see. Well, the question then becomes whether it's more efficient (area/performance/power) to have a dozen different caches or to have one type of highly optimized cache. I believe that the latter, while not specialized, can perform just as well because a lot more effort can be put into its design. Also, generic caches would automatically balance, while specialized caches can be underused (wasting transistors) or overused (a bottleneck).
Can you elaborate a bit more on this?
With scatter/gather operations you could take any program loop that processes independent elements, and vectorize it by simply emitting vector operations instead of scalar operations. Without scatter/gather, all the accessed data would have to be stored sequentially to allow vectorization. Programmers without SIMD experience typically won't write vectorization friendly code. You have to convert an array-of-structures to a structure-of-arrays yourself. But with scatter/gather a lot of scalar code could be converted to vector code in a straightforward way.
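
A tiny example of that layout issue (my own, just for illustration; the particle structs and function names are invented). Without gather, only the SoA layout lets a compiler emit straight vector loads; the AoS loop needs either a manual layout change or a gather instruction to pick the x fields out of the interleaved structs.

#include <cstddef>

struct ParticleAoS { float x, y, z, mass; };   // array of structures

struct ParticlesSoA {                          // structure of arrays
    float* x;
    float* y;
    float* z;
    float* mass;
};

// AoS: the x values sit 16 bytes apart, so a vector unit without gather
// cannot load several of them with one contiguous vector load.
void scaleX_AoS(ParticleAoS* p, std::size_t n, float s) {
    for (std::size_t i = 0; i < n; ++i)
        p[i].x *= s;
}

// SoA: the x values are contiguous, so this loop auto-vectorizes trivially.
void scaleX_SoA(ParticlesSoA& p, std::size_t n, float s) {
    for (std::size_t i = 0; i < n; ++i)
        p.x[i] *= s;
}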
 
Sounds like somebody needed the opcode space...

Intel would be *insane* to reuse the same opcodes for the Larrabee vectors and existing instructions. Just as Intel used available opcode space to add MMX and SSEx, they likely did the same with these instructions. One of the nice things about a variable length instruction set is that you can always make more opcode space.

The big difference is that the SPEs and their local store and DMA to main memory required a reworking of a lot of the memory addressing behavior and a removal of all software permissions checking.

I agree. I think Cell's biggest mistake was the explicit DMA stuff. Seems pretty hard for the programmer to get it right (and fast).

...especially since Larrabee doesn't support x86 instructions still covered by copyright. Perhaps Intel Legal and Intel Marketing are also a little cheesed at Larrabee's team because of this.

You surely mean patents (not copyright). That aside, I would imagine that Intel would be pretty aggressive about getting patents on the new vector instructions. Nobody without a patent deal with Intel will be able to make a binary-compatible Larrabee.

No, 1-2 Gesher cores and 8-16 fully compatible Larrabee cores on the ring bus would be Cell "done right". Larrabee's designers apparently went out of their way to screw that.

I totally agree. This isn't in place for the first version of Larrabee, but this is obviously where things are headed. In fact, if Larrabee is a success, I expect that some distant version of it will re-unify the two divergent x86 ISAs (SSE vs Larrabee vectors), giving all the cores on the chip the same ISA, but different implementations (in-order vs out-of-order cores, for example).

Once Intel does as you suggest and combines some Larrabee cores with a traditional x86 core, that will be hard for NVIDIA to compete with. Intel will eat up more and more of the mid-range graphics market, slowly squeezing NVIDIA into a niche high-end role.

How about this for a bold statement: I predict that NVIDIA will eventually be forced to re-invent itself as a software-only company working on middleware and graphics engines for game developers. It will fail at that and follow SGI into oblivion.

A few million extra L2 transistors in each core that wind up yielding a few percent in a few workloads (thank you diminishing returns), not so much.

I think this is where you missed my full point about not all transistors being equal from a power perspective (and about being ALU power limited).

Those extra transistors in the L2 cache won't burn very much power at all. In Intel's 45nm, transistors don't leak that much and not that many transistors in the cache are active (switching) in any given cycle. However, if you took that area and turned it into some fixed-function pipeline, those transistors would be switching like mad each cycle, consuming much more of your power budget than cache SRAM.

For dynamic power, it isn't the number of transistors. It is the number of times a transistor switches. That is why cache memory is reasonably power efficient. You can increase the size of a second-level cache without it using that much more of your power budget.
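
For a rough first-order check on that: the textbook dynamic power model is P_dyn ≈ α·C·V²·f, where α is the fraction of the capacitance C that actually switches each cycle. A multi-megabyte L2 has a huge C in total but a tiny α (only the accessed line, its tags and decoders toggle), while a fixed-function block running flat out has α close to 1 across its whole area, which is the asymmetry described above. (That formula is the standard CMOS rule of thumb, not anything from Intel's disclosures.)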

That said, you could argue that fixed-function hardware could do the same computation with fewer transistors switching (as compared to Larrabee's vector units). That is likely true for some operations. How much, I really don't know.

I hope that can be turned off at will, given how much of the workload doesn't need full coherency... If you know you are working intermediate results, you shouldn't have to care about updating 24 separate caches and 24 separate TLBs.

This is *not* how cache coherence works. It is much more efficient than that. First, TLBs are totally unaffected by cache coherence. Second, the cache coherence protocol doesn't update all the other caches in the system. The coherence protocol just enforces the invariant that a given 64B block of data is either (1) read-only in one or more caches or (2) read/write in a single cache. It does this by invalidating any other copies of the block before a processor is allowed to write its copy of the block. No data is transferred until another processor reads the block (which just becomes a normal miss).

This is how cache coherence works in Intel's existing multi-core systems (in fact, this is basically how all cache coherent multiprocessor systems work from AMD, Sun, IBM, HP, etc.)

There is no need to "turn it off". Once you've burned the design complexity to implement it, it really is just a win and doesn't really get in the way. If all processors are working on private data, it has basically zero impact. Only when true sharing of data happens does it kick in.
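
Here is a toy model of that invariant in C++ (mine, purely illustrative of the rule stated above, not how the hardware is actually organized): at any time a 64B block is either Shared read-only in any number of caches, or Modified read-write in exactly one, and a write simply invalidates the other copies.

#include <array>
#include <cassert>

enum class LineState { Invalid, Shared, Modified };

constexpr int kNumCores = 24;   // Larrabee-ish core count, per this thread

struct BlockDirectory {
    std::array<LineState, kNumCores> state{};  // all Invalid by default

    // A core that wants to write gains Modified; everyone else is invalidated.
    void write(int core) {
        for (int c = 0; c < kNumCores; ++c)
            state[c] = (c == core) ? LineState::Modified : LineState::Invalid;
    }

    // A core that wants to read downgrades any Modified copy to Shared
    // (this is the point where dirty data actually gets transferred).
    void read(int core) {
        for (auto& s : state)
            if (s == LineState::Modified) s = LineState::Shared;
        if (state[core] == LineState::Invalid) state[core] = LineState::Shared;
    }

    // The invariant: at most one Modified copy, and never Modified + Shared.
    void checkInvariant() const {
        int modified = 0, shared = 0;
        for (auto s : state) {
            if (s == LineState::Modified) ++modified;
            if (s == LineState::Shared)   ++shared;
        }
        assert(modified <= 1 && (modified == 0 || shared == 0));
    }
};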

On the flip side, Intel would likely have to implement some way of maintaining processor affinity. Passing any amount of thread context is enough to saturate a good fraction of the bus for a good amount of time.... Try load balancing by passing an 8KiB vector thread context between cores more than a couple times.

Why would it need to pass thread contexts? A GPU might, but Larrabee won't. I suspect that most of the algorithms in Larrabee will be work-list type algorithms. In the simplest implementation, each thread just pulls off the next task from the queue, does the task, repeat. In more sophisticated implementations, the work list is hierarchical (one per core, plus a global queue that supports work stealing). This allows for great load balance. Plus, the task information probably fits in a single 64B cache block (a few pointers, maybe a few index or loop counts).

The "queue" above is just a simple software data structure protected by a lock for synchronization. Such a data structure is easy to implement (and reasonably efficiently) in all software. No need for a special hardware queue structure. There is also no need to have a complicated in-hardware global thread scheduler.

That is one thing the generalized hardware has for a bottleneck: that shared ring bus. It's likely overspecced at 256B/cycle for this reason.

Having a big, fast, shared way for the components of the chip to talk to each other sounds like a *good* thing. It is optimized for 64B block transfers, making it easy to transfer data between caches or to and from memory. I don't see a problem here. Remember, cache hits don't touch the ring, so assuming some locality of reference, this should be plenty of bandwidth.

The more we chat back and forth (which I'm really enjoying, BTW), the more I'm beginning to realize how radical a departure Larrabee is from what is currently done in GPUs. Like I said earlier, my background is on the general-purpose multi-core side of things. I'm only beginning to realize how GPUs have forced game developers and such to think in one specific way about computation. I really do think that something like Larrabee is going to spur innovation as those constraints are lifted.
 
Cache coherence and such

There is no need to "turn [cache coherence] off". Once you've burned the design complexity to implement it, it really is just a win and doesn't really get in the way. If all processors are working on private data, it has basically zero impact. On when true sharing of data happens does it kick in.

I wanted to add one thing to what I said above. I said that coherence has zero impact when working on private data. As evidence of that, have you ever heard of anyone running single-threaded code on a Core 2 Duo (a multi-core) complaining that coherence was getting in their way? It is always on (even when you don't need it), yet nobody complains. The same will be true for Larrabee.
 
Conventional wisdom is that caches don't work for graphics computations. Perhaps Intel has found more locality in graphics applications (in the multi-MB range of caching) than previously thought.
That's an interesting question, and I'm sure research is active in this area.
The conventional wisdom has worked well thus far.
Some low-hanging fruit with large caches is perhaps loading up whole tables and textures, but in many complex scenarios the amount needed for a given pixel can go up and down wildly.
You don't need to 'load' whole textures. With mipmapping every pixel needs on average just one new texel; the rest is in its close surroundings and thus in the L1 cache (even for anisotropic filtering). So the amount of cache you need just depends on the resolution and how many textures are read.

You can keep resolution low by rendering tiles and using frame-to-frame coherence. You can keep the number of texture reads low with deferred shading (doing a z-only pass first).
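
For anyone wondering where the "one new texel per pixel" comes from, this is roughly the standard LOD selection rule, written out by me as an illustration (names invented): the mip level is picked so that one pixel step maps to about one texel step, which keeps a pixel's neighbourhood warm in the L1.

#include <algorithm>
#include <cmath>

// du_dx etc. are the texture-coordinate derivatives across one pixel,
// already scaled to texels of the base (level 0) texture.
float selectMipLevel(float du_dx, float dv_dx, float du_dy, float dv_dy)
{
    // Footprint of one pixel in base-level texels, per screen axis.
    float lenX = std::sqrt(du_dx * du_dx + dv_dx * dv_dx);
    float lenY = std::sqrt(du_dy * du_dy + dv_dy * dv_dy);
    float rho  = std::max(lenX, lenY);

    // log2(rho) picks the level where that footprint shrinks to ~1 texel,
    // so neighbouring pixels mostly re-read texels already in the cache.
    return std::max(0.0f, std::log2(rho));
}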
 