Larrabee: Samples in Late 08, Products in 2H09/1H10

With predication on GPUs you're running both sides of the branch too, and in those cases part of the SIMD hardware is idle/empty also.
I agree with that, but with predication you're only executing wasted (masked-off) instructions during a branch. For parts of the code outside a branch all units are working. I don't know how significant this is, though.

If I understand it right, there won't be a vector branch instruction in Larrabee, so there's no chance of having both a taken and a not-taken result for the same branch. Instead, you'll have vector comparison instruction(s) that compute essentially a 1-bit result for each vector element, and a way to tell whether any of those results are true or any false.

Then a potentially-divergent branch on Larrabee would look like this:
Code:
vcond = <vector comparison>
push current vector predicate mask
if (any bits set in vcond)
    AND vcond into vector predicate mask
    run "true" side of branch
else if (any bits set in NOT(vcond))
    AND NOT(vcond) into vector predicate mask
    run "false" side of branch
pop vector predicate mask
The branch predictor would predict the result of the two branches in that code, each of which is either taken or not taken.

I'd expect there to be specialized instructions for making this pattern (and similar patterns for loops) efficient.
What about the case where bits are set in both vcond and NOT(vcond)? Won't it need to run both sides of the branch like a GPU?

Code:
if ( all bits set in vcond )
    run "true" side of branch
else if (all bits set in NOT(vcond))
    run "false" side of branch
else
    run both sides

edit: The above pseudocode might illustrate my question better. Even ignoring that branch predictors are never 100% correct in practice, if the architecture works like a GPU then it seems impossible for the branch predictor to be correct all the time.

I guess it would pick one side and, if necessary, run the other side as well, either throwing away the original side's results or keeping the results from both with a predicate mask. If both sides are frequently taken, then a branch predictor seems like a waste of silicon.

Or is a GPU mindset making me miss how the SIMD unit will work?
 
No, you're right. Remove the else. I remember thinking that when I wrote it, but my fingers must have typed the else out of habit..

Code:
vcond = <vector comparison>
push current vector predicate mask
if (any bits set in vcond)
    AND vcond into vector predicate mask
    run "true" side of branch
if (any bits set in NOT(vcond))
    AND NOT(vcond) into vector predicate mask
    run "false" side of branch
pop vector predicate mask
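To make the difference from GPU-style predication concrete, here's a minimal scalar-emulation sketch of that corrected pattern in C, assuming a 16-lane vector held in an array and the predicate mask held in a 16-bit word (the lane operations are made up purely for illustration):
Code:
#include <stdint.h>

#define LANES 16

/* Scalar emulation of the corrected pattern above. Each side of the branch
 * runs only if at least one lane wants it, under a refined predicate mask.
 * The lane operations (x*2, -x) are made up for illustration only. */
static void predicated_if_else(float x[LANES], float y[LANES])
{
    uint16_t mask = 0xFFFFu;             /* current predicate mask: all lanes active */
    uint16_t vcond = 0;

    for (int i = 0; i < LANES; i++)      /* vcond = <vector comparison> */
        if (x[i] > 0.0f)
            vcond |= (uint16_t)(1u << i);

    const uint16_t saved = mask;         /* push current vector predicate mask */

    if (vcond) {                         /* any bits set in vcond? */
        mask = saved & vcond;            /* AND vcond into predicate mask */
        for (int i = 0; i < LANES; i++)
            if (mask & (1u << i))
                y[i] = x[i] * 2.0f;      /* run "true" side of branch */
    }
    if ((uint16_t)~vcond) {              /* any bits set in NOT(vcond)? */
        mask = saved & (uint16_t)~vcond; /* AND NOT(vcond) into predicate mask */
        for (int i = 0; i < LANES; i++)
            if (mask & (1u << i))
                y[i] = -x[i];            /* run "false" side of branch */
    }

    mask = saved;                        /* pop vector predicate mask */
    (void)mask;
}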
 
One obvious thing he brought up which I completely forgot about is that while caches (and branch prediction, for that matter) make things easier for some programmers, when trying to extract peak performance per watt (or per ALU capacity of the machine) you nearly always have to have an intimate knowledge of the cache design and modify the algorithm to suit the cache. The same goes for branch prediction (to a lesser extent), in optimizing by fanning out branch paths to suit the branch predictor. Both techniques require expert-level programming skills to take advantage of. At least with a program-managed local store (non-cache-coherent) you can also easily optimize the memory bandwidth (and memory access patterns) as well as the ALU side of the algorithm, resulting in better performance per watt.

Local stores always sound great until you have to use them. Pretty much anything a local store can do, a cache can do, with much more flexibility and much less complex programming. While it's true that knowledge of the prediction algorithms and cache structure helps when going for the absolute best performance, it generally isn't required, and it's not that complex to get to 95% of perfect.
1.) Registers become a scarce resource. If you are interleaving 4 16-wide shader programs to hide texture fetch latency, you are down to 8 regs per shader program (32/4).

The "register space" for x86 is as large as the L1 cache because of mem->mem and reg->mem operations.

2.) You won't be able to use the standard x86 integer or double-precision SSE2 regs with this formula (with any hope of good performance). Double precision might not integrate well into standard unified vertex programs on Larrabee. The problem here is that if you add 8-wide DP support in the 16-wide vectors, you break the formula for 16 simultaneous programs. One option would be to split DP into register pairs (hi 32 bits 16-wide in one even register, and lo 32 bits 16-wide in the next odd register), keeping the same ISA opcode structure. Still wonder what NVidia is going to do here as well... if done this way, NVidia's banked register file actually looks like an asset, compared to a MAC (mul+add) on double precision requiring an extra register file cycle (stalling access for another ALU op) due to the limited number of register file ports on Larrabee (3 vs. the needed 6 for split regs).

Um, what? You aren't making sense. Doing DP float in hi/lo halves is a recipe for disaster. Plus, if you were going to interleave, you would time-multiplex, not bit-slice. In hardware that supports both SIMD DP and SP you generally get 1/2 the DP performance with the exact same number of register ports. For a MAC that would be: MulValA, MulValB, AddVal, and Dest. Some of these may or may not be the same, depending on whether the hardware supports only destructive ops. For a 3-op MAC, you would likely do A * B + C -> C.

3.) Same as above, Larrabee will probably need integer support in the 16-wide reg extension to support unified shaders.

Not even an issue, really. All SIMD extensions implemented or proposed for pretty much every architecture support both, with most supporting 2x16 for every SP float slice.

4.) Branch granularity will probably have to match the amount of shader program interleaving. So Larrabee's possible 16 wide branch granularity (for fragment programs) probably will effectively be 32, 48, or 64 wide depending on how the compiler interleaves. Branch direction is likely highly uncorrelated to anything other than data and best predicted with a simple program hint indicating which path is most common.

Again this assumes bit slice interleaving when time domain is most likely.

Aaron Spink
speaking for myself inc.
 
Or is a GPU mindset making me miss how the SIMD unit will work?

Historically, when working with SIMD units the best practice has always been to replace if/else statements with Select, Min, Max, Clamp, Abs, etc., leaving real branches for when they are absolutely necessary, such as loops or when the incorrect side would cause a hardware crash if executed. Then you definitely don't want the hardware to take both branches unless it has special support for suppressing invalid operations. AFAIK Itanium has support for executing invalid operations and then discarding the results with no side effects. As another example, SPUs never crash on an invalid read address (the address simply gets wrapped modulo 256K), so you can execute loads speculatively. Hopefully Larrabee will have some support along these lines, which would help tremendously with optimizations.
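As a purely illustrative example of that style, here is a branchless select plus clamp written with plain SSE intrinsics instead of an if/else — the kind of transformation being described, not actual Larrabee code:
Code:
#include <xmmintrin.h>   /* SSE1 intrinsics */

/* Branchless per-lane select: out = (x > threshold) ? a : b, followed by a
 * clamp to [lo, hi]. No branch is taken, so no lanes diverge; the results
 * from the "wrong" side are simply masked away. */
static __m128 select_and_clamp(__m128 x, __m128 threshold,
                               __m128 a, __m128 b,
                               __m128 lo, __m128 hi)
{
    __m128 cond = _mm_cmpgt_ps(x, threshold);        /* all-ones or all-zeros per lane */
    __m128 sel  = _mm_or_ps(_mm_and_ps(cond, a),     /* take a where cond is true  */
                            _mm_andnot_ps(cond, b)); /* take b where cond is false */
    return _mm_min_ps(_mm_max_ps(sel, lo), hi);      /* clamp: max with lo, min with hi */
}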
 
The VMX128 implementation includes an added dot product instruction, common in graphics applications. The dot product implementation adds minimal latency to a multiply-add by simplifying the rounding of intermediate multiply results. The dot product instruction takes far less latency than discrete instructions.​


Dot product is a limited form of tree-ops. Tree-ops are pretty cool, don't really require much hardware to be added, and can generally be done in the same pipeline/latency. General operations are of the type: A = SUM(MulVecA * MulVecB)
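For the record, that general operation written out as a pairwise adder tree in plain C (just a sketch of the idea, not any particular ISA's instruction):
Code:
#define WIDTH 8   /* any power of two */

/* A = SUM(MulVecA * MulVecB), evaluated as a tree: multiply element-wise,
 * then add pairs, then pairs of pairs, and so on (log2(WIDTH) levels). */
static float dot_tree(const float a[WIDTH], const float b[WIDTH])
{
    float t[WIDTH];
    for (int i = 0; i < WIDTH; i++)
        t[i] = a[i] * b[i];            /* MulVecA * MulVecB */

    for (int step = WIDTH / 2; step > 0; step /= 2)
        for (int i = 0; i < step; i++)
            t[i] = t[i] + t[i + step]; /* one level of the adder tree */

    return t[0];                       /* A */
}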

Aaron Spink
speaking for myself inc.​
 
AFAIK Itanium has support for executing invalid operations and then discarding the results with no side effects.
You remember correctly. Itanium supports predication, and I believe the compiler uses it when the number of instructions in a branch is below a threshold.
 
I can't see anything new, but something there makes me think Larrabee will be on 45nm...

Yea, I agree. It seems like 32nm would be too bleeding edge (and 65nm wouldn't be competitive) in that time frame. So 45nm seems reasonable. I'm sure we'll see a Larrabee follow-on in 32nm not that long after Larrabee (maybe a year or so). It might even be that Larrabee-45nm is used mostly for software development and some early adopters. The 32nm shrink should help with cost, power, and performance. Combined with a mature software stack, Larrabee II might be the one that becomes a high-volume discrete GPU.

Then again, Larrabee-45nm might also be successful, but my point is that we likely won't really know if Intel's gamble on Larrabee will pay off until we've seen the 32nm part.
 
Then again, Larrabee-45nm might also be successful, but my point is that we likely won't really know if Intel's gamble on Larrabee will pay off until we've seen the 32nm part.

We will probably be able to extrapolate the performance of a move to 32nm (assuming no major architectural changes) from a 45nm part. Further, performance per mm² and performance per watt should give pretty good signals of how well the products might do in the market.

If it does not outperform competitors in these areas I'm not sure how great uptake would be. They can bundle all they want and I'm sure they will try, but the fact is that GPUs are cheap and there is not much room to undercut things.

Which brings me to another point, and I'm sorry if it's tangential to the technical scope of this thread, but GPUs are cheap and CPUs are expensive. So as integration progresses, which makes more economic sense to move what into what? If you believe in cheaper products that do the same thing as more expensive products, then the obvious answer is to move the CPU into the GPU and not the other way around.
 
We will probably be able to extrapolate the performance of a move to 32nm (assuming no major architectural changes) from a 45nm part. Further, performance per mm² and performance per watt should give pretty good signals of how well the products might do in the market.

I agree that we would have a basic idea of how it would do. However, the problem with such analysis is that cost per mm² is highly non-linear with the overall area of the chip. If you make a chip twice the size, it will likely cost a lot more than twice as much (all depending on the yields). As well, the price an end-user is willing to pay for a unit of performance is also non-linear (and greatly depends on what else is out there).
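To make that non-linearity concrete, here's a toy calculation using a simple Poisson defect-yield model with a made-up defect density (illustrative numbers only):
Code:
#include <math.h>
#include <stdio.h>

/* Toy model: yield = exp(-area * defect_density); relative silicon cost per
 * good die = area / yield. Doubling the die area more than doubles the cost,
 * because yield drops at the same time. All numbers are assumed. */
int main(void)
{
    const double defect_density = 0.5;            /* defects per cm^2 (assumed) */
    const double areas_cm2[]    = { 1.0, 2.0, 4.0 }; /* ~100, 200, 400 mm^2 dies */

    for (int i = 0; i < 3; i++) {
        double area  = areas_cm2[i];
        double yield = exp(-area * defect_density);
        double cost  = area / yield;              /* silicon per good die */
        printf("area %.0f mm^2: yield %.2f, relative cost %.2f\n",
               area * 100.0, yield, cost);
    }
    return 0;
}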

In addition, Intel will likely be the first to 32nm, so Larrabee could make the jump a year or so earlier than the other GPU companies that use TSMC.

Finally, even if the hardware is just a shrink, Larrabee is going to depend more on software than most GPUs, so the software might mature quite a bit, too.

Perhaps the biggest uncertainty isn't the "cost" to Intel but the "price" it will set for the part. If Intel is willing to give up profit margins in the short term for building market share, it might be able to undercut NVIDIA, forcing NVIDIA's prices (and profit margins) down, too.

Of course, this would be a bastardly and anti-competitive thing for Intel to do (basically subsidizing its GPUs with its CPU profits), but they might do it anyway.

Which brings me to another point, and I'm sorry if it's tangential to the technical scope of this thread, but GPUs are cheap and CPUs are expensive.

What do you base this statement on? Looking at Amazon, you can buy an Intel Core 2 Quad Q6600 quad-core processor (2.40 GHz, 8M L2 cache) for under $300. Many of the GeForce 8800 boards were in the $250 to $350 range. I'm not exactly sure which two CPUs vs. GPUs to compare, but I'm not so sure that your generalization about CPUs being expensive and GPUs cheap holds true. Sure, you can get GPU boards for $100, just as you can get older, cheaper CPUs from Intel, too.

Perhaps someone that knows the market dynamics of GPUs could chip in (and correct me if necessary), but I'm not sure your generalization holds.
 
Which brings me to another point, and I'm sorry if it's tangential to the technical scope of this thread, but GPUs are cheap and CPUs are expensive. So as integration progresses, which makes more economic sense to move what into what? If you believe in cheaper products that do the same thing as more expensive products, then the obvious answer is to move the CPU into the GPU and not the other way around.

Ahh, that old argument? It's quite simple: CPUs are hard; GPUs are easy, only needed by a small portion of the market, and competing with free!

From development cycles, verification, post-si validation, performance requirements, etc, CPUs are a lot harder to do than GPUs.

And if you don't have a tier 1 fab, don't even think about getting into the X86 CPU game as you won't make it. And no TSMC isn't a tier 1 fab...


Aaron Spink
speaking for myself inc.
 
I have the feeling that some NVIDIA or ATI guy would disagree on a few points here and there..;)
 
I have the feeling that some NVIDIA or ATI guy would disagree on a few points here and there..;)


It's too bad we can't see the credentials of ALL the posters in this thread. It would be interesting to see what backgrounds they come from, and it would give more weight to their words if a few "GPU" guys were able to say as much about where they come from.
 
Um, what? You aren't making sense. Doing DP float in hi/lo halves is a recipe for disaster. Plus, if you were going to interleave, you would time-multiplex, not bit-slice. In hardware that supports both SIMD DP and SP you generally get 1/2 the DP performance with the exact same number of register ports. For a MAC that would be: MulValA, MulValB, AddVal, and Dest. Some of these may or may not be the same, depending on whether the hardware supports only destructive ops. For a 3-op MAC, you would likely do A * B + C -> C.

Aaron,

Hi/lo DP worked fine on MIPS ;) no problem for a programmer, or compiler.

The reason I thought 8-wide DP in a 16-wide SP register isn't a good idea is because then you need two distinct predicate systems: one for the 16-wide case and another for the 8-wide case. Seems like Larrabee is going to need a fully predicated 16-wide ISA. This is especially important for memory or texture fetch instructions and stores, because when you are running 16 scalar shader programs in parallel you don't want those instructions on the "divergent branch" to modify or fetch memory.
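For illustration, a predicated (masked) store might behave like this scalar sketch in C — only lanes whose predicate bit is set ever touch memory (hypothetical behavior, not an actual Larrabee instruction):
Code:
#include <stdint.h>

#define LANES 16

/* Sketch of a predicated (masked) store: lanes whose predicate bit is clear
 * never write, so a lane on the "not taken" side can't fault or clobber data. */
static void masked_store(float *dst, const float src[LANES], uint16_t mask)
{
    for (int i = 0; i < LANES; i++)
        if (mask & (1u << i))
            dst[i] = src[i];
}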
 
Aaron,

Hi/lo DP worked fine on MIPS ;) no problem for a programmer, or compiler.

The reason I thought 8-wide DP in a 16-wide SP register isn't a good idea is because then you need two distinct predicate systems: one for the 16-wide case and another for the 8-wide case. Seems like Larrabee is going to need a fully predicated 16-wide ISA. This is especially important for memory or texture fetch instructions and stores, because when you are running 16 scalar shader programs in parallel you don't want those instructions on the "divergent branch" to modify or fetch memory.

Hi/Lo DP is deprecated on MIPS :)
and requires either additional ports, or additional sequencing and therefore non-pipelined FP ops.

As I said, it really doesn't make sense to run 16 in parallel. With the vector structure, it would make much more sense to do a 4 pixel quad and time division multiplex between 4 pixel quads based on external latencies.

Also in general, SIMD is ALU only, while loads and stores are done via normal loads and stores, ie, 1 at a time. This has been pretty standard since the inception of SIMD instructions into mainstream CPUs, so I don't see any reason why that would change. I pretty much have no knowledge of what they are planning as far as SIMD extensions for Larrabee.

I don't know all the details on the internal architecture of G80, but I would be surprised if it's actually a SIMD pipe and not really 16 independent pipes, each with register offsets but a single instruction sequencer...

Then again it may also be that both G80 and R600 do serialized texture fetches as well instead of actual parallel texture fetches from a single SIMD unit.

Part of the problem is that both Nvidia and ATI pretty much don't talk about their architectures in anything actually approximating detail. CPU details tend to be much more defined and much more public. And we've even seen ATI/Nvidia post diagrams that later turned out to be false or at least highly misleading.

Aaron Spink
speaking for myself inc.
 
Call me crazy, but if Intel wants to perform complex raytracing with Larrabee samples at the end of 2008:

1. I don't understand how a company that is doing 640 3DMarks in 3DMark06 with their top graphics product (the X3000 series IGPs) intends to do that... On the other hand, my old Intel 740 is in the garage... You know what I mean: a company cannot go from 640 3DMarks to 640K in one day!

2. Perhaps very simple raytracing of a million-poly scene at HD resolution and 30 FPS could be possible with GDDR5/XDR2 plus 48 cores of 16-way SIMD running at 2 GHz... but fear the price, the size and the heat.

3. Simple raytracing is not enough to surpass the current rasterization technologies... I don't think hard shadows with hard reflections running at 30 FPS could look much better than Crysis.

To surpass rasterization we need global illumination added to the simple raytracing... and those are major words. That implies not only closest/furthest triangle hits... but also nearest-point search for photon mapping, complex BRDFs and glossy reflections, irradiance and volumetric light transport, SSS and caustics, complex motion-blurred shadows, etc., etc... Let's keep our feet on the ground... MentalRay using a rendering farm finishes a simple 1M-poly scene with final gather and complex shadows/reflections in nearly 15 minutes... and we are trying to do that in 1/30 of a second... no way.

I don't think we currently have the technology to do all this... unless they extracted one of those 500 GHz transistors from Area 51 ( http://www.xbitlabs.com/news/cpu/display/20060621235014.html ) or they are using a modified 1024-qubit D-Wave/NASA thing. In fact, I think we will need another way to compute all that... for example, optronics.

4. We've explored almost all the available hardware raytracing and computing systems (Saarcor, AR500, ClearSpeed...) and also GPGPU (CUDA, Brook, ...), and I'd say the closest things to rendering a complex million-poly scene are some PS3 Cells in parallel or big dedicated server clusters.

So... I'm really skeptical about Larrabee... I think they could be very happy if they even reach a GF9800 or an ATI HD 4000 using just rasterization!
 
Call me crazy, but if Intel wants to perform complex raytracing with Larrabee samples at the end of 2008... Simple raytracing is not enough to surpass the current rasterization technologies...

Larrabee is primarily a rasterization-based solution. I'm not sure where all these raytracing rumors come from. Sure, the hardware could do raytracing (the hardware *is* general-purpose after all), but that isn't what Larrabee will mostly be doing.
 
Larrabee is primarily a rasterization-based solution. I'm not sure where all these raytracing rumors come from. Sure, the hardware could do raytracing (the hardware *is* general-purpose after all), but that isn't what Larrabee will mostly be doing.

That’s simple. When Intel talks publicly about 3D graphics, they talk about raytracing and how much better it is compared to rasterization. This makes people believe that their upcoming hardware is built to do raytracing and not rasterization.
 
2. Perhaps very simple raytracing of a million-poly scene at HD resolution and 30 FPS could be possible with GDDR5/XDR2 plus 48 cores of 16-way SIMD running at 2 GHz... but fear the price, the size and the heat.
I get 30-60 FPS at 640x480 on a Q6600 for Jacco Bikker's Arauna demo. So if it scales perfectly with the number of GFLOPS, that would be 300-600 FPS at 1280x960 on a 48-core 16-way Larrabee... Clearly there is some potential to be explored here.
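For the curious, the back-of-the-envelope arithmetic behind that estimate, using the rumored 48-core / 16-wide / 2 GHz figures, and assuming one 16-wide MAC per clock for Larrabee and 8 SP FLOPs per clock per core for the Q6600:
Code:
#include <stdio.h>

/* Back-of-the-envelope check of that scaling claim; all figures are the
 * rumored/assumed ones from the discussion, not measured numbers. */
int main(void)
{
    /* Q6600: 4 cores x 2.4 GHz x 8 SP FLOPs/clock (4-wide SSE mul + add) */
    double q6600_gflops    = 4 * 2.4 * 8;            /*   76.8 GFLOPS */
    /* Larrabee (rumored): 48 cores x 2 GHz x 32 SP FLOPs/clock (16-wide MAC) */
    double larrabee_gflops = 48 * 2.0 * 32;          /* 3072   GFLOPS */

    double flop_ratio  = larrabee_gflops / q6600_gflops;     /* ~40x */
    double pixel_ratio = (1280.0 * 960.0) / (640.0 * 480.0); /*   4x */
    double fps_scale   = flop_ratio / pixel_ratio;           /* ~10x */

    printf("FPS scale factor: %.1fx -> 30-60 FPS becomes %.0f-%.0f FPS\n",
           fps_scale, 30 * fps_scale, 60 * fps_scale);
    return 0;
}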
3. Simple raytracing is not enough to surpass the current rasterization technologies... I don't think hard shadows with hard reflections running at 30 FPS could look much better than Crysis.
We got so used to the capabilities and limitations of rasterization that we hardly see what more could be offered by ray tracing. Also, some effects are just trivial with raytracing while rasterization requires lots of complicated hacks.

For starters I have yet to see something in Crysis that is semi-transparent, reflects the surrounding, refracts multiple light sources, and casts colored shadows - at the same time.
To surpass rasterization we need global illumination added to the simple raytracing... and those are major words. That implies not only closest/furthest triangle hits... but also nearest-point search for photon mapping, complex BRDFs and glossy reflections, irradiance and volumetric light transport, SSS and caustics, complex motion-blurred shadows, etc., etc... Let's keep our feet on the ground... MentalRay using a rendering farm finishes a simple 1M-poly scene with final gather and complex shadows/reflections in nearly 15 minutes... and we are trying to do that in 1/30 of a second... no way.
It's very hard to achieve high quality for any of that with rasterization, let alone combined. It's true that we're still a huge factor away from achieving all this in real-time with raytracing, but at least we know that scaling performance will get us there.

In fact, ironically, when you do some of these things with rasterization the used techniques start to show some striking resemblance to raytracing...
4. We've explored almost all the available hardware raytracing and computing systems (Saarcor, AR500, ClearSpeed...) and also GPGPU (CUDA, Brook, ...), and I'd say the closest things to rendering a complex million-poly scene are some PS3 Cells in parallel or big dedicated server clusters.
As it stands, CPUs are far more suited for raytracing than GPUs. And it gets even more spectacular when looking at per-GFLOP effective performance. So imagine a "big dedicated server cluster" on one chip...
So... I'm really skeptical about Larrabee... I think they could be very happy if they even reach a GF9800 or an ATI HD 4000 using just rasterization!
I have to agree on that for rasterization. It's obvious that if NVIDIA/AMD explore all possibilities they'll end up with a faster solution dedicated to rasterization.

But they would be stupid not to start looking at raytracing before the end of the decade. If Intel can demonstrate something on Larrabee that NVIDIA/AMD can only have wet dreams of, that would give them a huge head start in an entirely new market. It's a big investment and a huge gamble, but Intel is in an almost ideal situation to try it. And if it fails to impress on one front, it will likely still open up new possibilities for something different!
 
Ahh, that old argument? It's quite simple: CPUs are hard; GPUs are easy, only needed by a small portion of the market, and competing with free!

From development cycles, verification, post-si validation, performance requirements, etc, CPUs are a lot harder to do than GPUs.
I'm not going to argue for or against that here, but the fact of the matter is it's completely unrelated to Voltron's claim. Why? Because the cost difference happens on the income statement *before* operating expenses, so it excludes R&D!

As for ArchitectureProfessor's claim that CPUs and GPUs sell at the same price; remember that discussion of wafer prices we had? Either way, G80 (which is a 480mm² chip on 90nm) sold at an ASP of $125 or so in Q4 2006, including NVIO (50mm², pad-limited, on 110nm). The rest of the cost is the memory, the PCB, the cooler, etc.

As a sidenote, this is also why prices have gone down so much in the low end, AFAIK: DDR2 prices crashed. Eventually GDDR3 prices also did (which arguably made the HD3850/8800GT possible), but that's not a commodity market, so it took a bunch more time and wasn't as extreme. The basic idea is this: a lot of the cost of a modern graphics card is not the GPU itself.

This is also what prompts me to like TBDR and embedded-memory approaches so much; it's not just an engineering question, it's also a financial question of estimating how much money that allows you to 'steal' from Samsung and Foxconn.
 