Larrabee at Siggraph

You mean the reduction in the framebuffer?
No, I mean that your engine can simply render the screen in tiles of say 256x256 size (you don't need to be superstrict about the internal boundaries ... the tiler could just save the bins for the other tiles till you start rendering those, so there are no overlap costs). So you get increased culling costs (meh) and reduced memory usage (or if the tiler starts rasterizing early when buffers fill up, increased efficiency).
Yes, you can dump tiles before binning is finished, but then in the most poly-heavy situations (AFAIK the biggest culprit for framerate dips) you not only have a large rendering load, but you lose efficiency too.
Shrug, if it's a win it's a win.
 
No, I mean that your engine can simply render the screen in tiles of say 256x256 size (you don't need to be superstrict about the internal boundaries ... the tiler could just save the bins for the other tiles till you start rendering those, so there are no overlap costs). So you get increased culling costs (meh) and reduced memory usage (or if the tiler starts rasterizing early when buffers fill up, increased efficiency).
I'm not even considering overlap costs. I think the Intel paper says that it's around 5% for their implementation, so I don't see why larger tiles makes much difference.
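For reference, here's a minimal sketch of the kind of screen-tile binning being discussed; the Triangle layout, the 256x256 tile size and the function names are just assumptions for illustration, not Larrabee's actual pipeline:

```
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical triangle record with a precomputed screen-space bounding box.
struct Triangle {
    float x0, y0, x1, y1;   // bounding box in pixels
    // ... vertex data, shader state, etc.
};

constexpr int kTileSize = 256;  // the 256x256 tiles mentioned above

// Bin each triangle into every tile its bounding box touches. A triangle that
// straddles a tile boundary is simply referenced by both bins, which is where
// the small "overlap cost" comes from. Off-screen culling omitted for brevity.
std::vector<std::vector<const Triangle*>>
binTriangles(const std::vector<Triangle>& tris, int screenW, int screenH) {
    const int tilesX = (screenW + kTileSize - 1) / kTileSize;
    const int tilesY = (screenH + kTileSize - 1) / kTileSize;
    std::vector<std::vector<const Triangle*>> bins(
        static_cast<std::size_t>(tilesX) * tilesY);

    for (const Triangle& t : tris) {
        const int tx0 = std::max(0, static_cast<int>(t.x0) / kTileSize);
        const int ty0 = std::max(0, static_cast<int>(t.y0) / kTileSize);
        const int tx1 = std::min(tilesX - 1, static_cast<int>(t.x1) / kTileSize);
        const int ty1 = std::min(tilesY - 1, static_cast<int>(t.y1) / kTileSize);
        for (int ty = ty0; ty <= ty1; ++ty)
            for (int tx = tx0; tx <= tx1; ++tx)
                bins[ty * tilesX + tx].push_back(&t);
    }
    return bins;  // each bin can later be rasterized independently, tile by tile
}
```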

The only way to save space is to have tiles so large that object-level binning is doable without obscene overlap costs.

Shrug, if it's a win it's a win.
If what is a win? Seems like a disadvantage to me.
 
Yes, it's all software, but if your software doesn't even try to retrieve a previously transformed vertex you won't have any (implicit) post-transform cache in place :)
Who needs a cache when all vertices are processed and stored in a buffer?

Either way, it seems pretty pointless to me to discuss things that might change before release. Heck, they might even find better approaches after release. I'm even slightly surprised that they pinned down that it's a TBDR. Given the hardware architecture, that's not entirely illogical, but still, I'm pretty sure some games don't benefit from it.
 
Who needs a cache when all vertices are processed and stored in a buffer?
Because it doesn't make any sense to fully transform a vertex buffer when part of it might not even be indexed at all (different LODs embedded in the same vertex buffers, progressive meshes, etc.)
What you want to do is transform them as a GPU does. To avoid retransforming them, you might want to check whether a vertex has already been transformed.
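A rough sketch of that idea, assuming a hypothetical transformVertex() and a small FIFO keyed on the vertex index, roughly how a GPU's post-transform cache behaves (sizes and names are illustrative):

```
#include <cstddef>
#include <cstdint>
#include <vector>

struct VertexOut { float x, y, z, w; };  // plus whatever varyings you need

// Hypothetical vertex shader stand-in (placeholder math).
VertexOut transformVertex(std::uint32_t index) {
    return VertexOut{static_cast<float>(index), 0.0f, 0.0f, 1.0f};
}

// Tiny FIFO post-transform cache keyed on the vertex index, similar in
// spirit to the hardware cache discussed above. Capacity is illustrative.
class PostTransformCache {
public:
    VertexOut fetch(std::uint32_t index) {
        for (const Entry& e : entries_)            // linear scan is fine at this size
            if (e.index == index) return e.value;  // hit: reuse the transformed vertex
        VertexOut v = transformVertex(index);      // miss: run the "vertex shader"
        if (entries_.size() == kCapacity)
            entries_.erase(entries_.begin());      // evict the oldest entry
        entries_.push_back({index, v});
        return v;
    }
private:
    static constexpr std::size_t kCapacity = 32;
    struct Entry { std::uint32_t index; VertexOut value; };
    std::vector<Entry> entries_;
};
```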

Either way, it seems pretty pointless to me to discuss things that might change before release. Heck, they might even find better approaches after release. I'm even slightly surprised that they pinned down that it's a TBDR. Given the hardware architecture, that's not entirely illogical, but still, I'm pretty sure some games don't benefit from it.
BTW, it's a TBR, not a TBDR :) Within a tile, primitives are processed as an IMR would, according to the paper.
 
Despite the problems generated by a TBR, do you think they really had any other option?
It seems to me that this is by far the most reasonable decision they could have made with regards to their rendering architecture.
For a CPU designed to do general computation I agree.

Most of my points were about the cost effectiveness of Larrabee compared to other GPUs, particularly when discussions about the console space started. I'm not saying binned rendering is bad for Larrabee; rather, I'm just questioning whether it's a good approach to graphics in general.

Nonetheless, I'm still quite impressed with what's been presented so far. For each core to be so capable at only 1GHz and be so small is quite amazing. 30M transistors for a general purpose CPU core with 16 FP MADs, 16 load/store, 256KB cache and a texture unit? Wow.
 
I'm plenty sceptical still. ;) I'm not convinced this kind of architecture can beat a specialized GPU design on its home turf. There are probably areas within GPGPU where it'll rock, but I don't think it can beat AMD or Nvidia on graphics in raw performance, performance/watt or performance/mm^2.

One also has to keep in mind that Intel wasn't first at attempting this. Sony's original plan for PS3 was for it to not have any GPU, but the SPUs would do the graphics. It ended up shipping with a traditional GPU.

Intel has said from the get-go that they're aiming at mid-range performance.
So they don't plan on beating AMD or nVidia just yet.
But this design obviously has some extra potential outside legacy D3D/OGL acceleration.
It will be interesting to see if Intel can get enough leverage through D3D/OGL to unlock that extra potential.
 
- The vector units support full scatter/gather and masking. When combined, this allows "16 shader instances to be run in parallel, each of which appears to run serially, even when performing array access with computed indices". As predicted, this is how Larrabee gets around the problem of hard-to-vectorize code. This is true SIMD (more like a GPU) rather than the SSE-like vectors on current x86 processors.

Since the SIMD units are 16-wide and they say they run 16 shader instances (aka threads in nVidia-speak?), doesn't that imply that they treat the SIMD units like arrays of scalar processors, much like how nVidia does it?
That could be extremely efficient.
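A scalar emulation of that model, under the assumption (not confirmed by the paper's ISA details) that divergence is handled with per-lane masks: 16 "shader instances" map onto 16 lanes, and a branch is driven by a mask so each instance appears to run serially on its own data:

```
#include <array>
#include <cstdio>

constexpr int kLanes = 16;               // one vector register = 16 shader instances
using Vec  = std::array<float, kLanes>;
using Mask = std::array<bool, kLanes>;

// Emulates "if (v < 0) v = -v;" across 16 instances at once: a vector
// compare produces a mask, and the write-back is masked, so every lane
// appears to have executed the branch serially.
void absIfNegative(Vec& v) {
    Mask negative{};
    for (int i = 0; i < kLanes; ++i) negative[i] = (v[i] < 0.0f);  // compare -> mask
    for (int i = 0; i < kLanes; ++i)
        if (negative[i]) v[i] = -v[i];                             // masked write-back
}

int main() {
    Vec v{};
    for (int i = 0; i < kLanes; ++i)
        v[i] = (i % 2 ? -1.0f : 1.0f) * static_cast<float>(i);
    absIfNegative(v);
    for (float x : v) std::printf("%.0f ", x);
    std::printf("\n");
    return 0;
}
```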
 
Just to clarify something: I was watching a developer interview where they were moving the camera around the intensive outdoor scenes, and he said they push around a million polys per frame.

If you run the included benchmark .bat files for Crysis, you get an on-screen display of the average polycount per frame.
On High it is about 0.8 million; on Very High it is around 1.2 million tris per frame, if I'm not mistaken.
So indeed, around a million polys per frame.
 
Since the SIMD units are 16-wide and they say they run 16 shader instances (aka threads in nVidia-speak?), doesn't that imply that they treat the SIMD units like arrays of scalar processors, much like how nVidia does it?
That could be extremely efficient.

If I understand correctly, it can load from/store to an array of pointers to do full gather/scatter, and a mask can be used to mask out "unwanted" elements. So, in theory, it can easily support a CUDA-style programming model. Furthermore, the load/store performance is limited only by the cache, which is generally one cycle per cache line.
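In scalar C++ terms, the masked gather being described would behave roughly like this sketch (semantics only, not Larrabee's instruction encoding):

```
#include <array>
#include <cstdint>

constexpr int kLanes = 16;

// Semantics of a masked gather: each active lane loads data[indices[i]];
// inactive lanes keep their previous contents. A masked scatter is the
// mirror image on the store side.
std::array<float, kLanes> maskedGather(const float* data,
                                       const std::array<std::uint32_t, kLanes>& indices,
                                       const std::array<bool, kLanes>& mask,
                                       std::array<float, kLanes> previous) {
    for (int i = 0; i < kLanes; ++i)
        if (mask[i]) previous[i] = data[indices[i]];
    return previous;
}
```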
 
Because it doesn't make any sense to fully transform a vertex buffer when part of it might not even be indexed at all (different LODs embedded in the same vertex buffers, progressive meshes, etc.)
Just process those actually needed. Set a flag in the output buffer to indicate which vertices have already been processed. Like an infinite cache (from a software view, in hardware the actual cache keeps frequently used vertices close).
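A sketch of that flag-per-vertex scheme, again assuming a hypothetical transformVertex(): only vertices actually referenced by the index buffer are shaded, and each is shaded once:

```
#include <cstddef>
#include <cstdint>
#include <vector>

struct VertexOut { float x, y, z, w; };

// Hypothetical vertex shader stand-in (placeholder math).
VertexOut transformVertex(std::uint32_t index) {
    return VertexOut{static_cast<float>(index), 0.0f, 0.0f, 1.0f};
}

// Transform only the vertices the index buffer actually references, using a
// per-vertex flag as the "infinite cache": each referenced vertex is shaded
// exactly once, and unreferenced ones (unused LODs etc.) are never touched.
void transformIndexed(const std::vector<std::uint32_t>& indexBuffer,
                      std::size_t vertexCount,
                      std::vector<VertexOut>& out) {
    out.resize(vertexCount);
    std::vector<bool> done(vertexCount, false);   // the flag buffer
    for (std::uint32_t idx : indexBuffer) {
        if (!done[idx]) {
            out[idx] = transformVertex(idx);      // first reference: shade it
            done[idx] = true;                     // later references are "hits"
        }
    }
}
```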
BTW, it's a TBR, not a TBDR :) Within a tile, primitives are processed as an IMR would, according to the paper.
That's what they're telling you today... ;)
 
Anand said:
Well, it is important to keep in mind that this is first and foremost NOT a GPU. It's a CPU.

Based on what's been revealed so far I tend to agree. Larrabee seems to be a many-core x86 CPU + texture units + ring bus. One thing I don't get though - is the instruction set for the new vector ALUs x86 based as well?
 
Based on what's been revealed so far I tend to agree. Larrabee seems to be a many-core x86 CPU + texture units + ring bus. One thing I don't get though - is the instruction set for the new vector ALUs x86 based as well?

What exactly is 'x86 based'?
I'd say that SSE is already quite distinct from legacy x86 instructions anyway.
You have new registers, new instructions, etc.; not much depends on x86 (and with the new scatter/gather memory access, it will rely on x86 even less).
It's just that it's encoded inside the x86 instruction set, but that will obviously go for Larrabee as well.
 
~1 million triangles/frame may be about right for current games, but tessellation dramatically boosts that, & in a way that doesn't hurt the basic setup engine.
Should presumably be well in use by the time Larrabee comes out?

Software API & scheduling etc. are confirmed, but what bits of hardware is that running on? The main CPU?

30M transistors for a general purpose CPU core with 16 FP MADs, 16 load/store, 256KB cache and a texture unit?
Since a Pentium is only about 3M, the x86 core must be just a runt beside that Vec16.

RV770: 956M/10 = 95.6M transistors for 80 FP MADs (including shares of cache, tex & scheduling/setup),
or 956M/(10*5) = ~19.1M for 16 FP MADs (+ shares of the other bits).

I don't quite understand how anyone can categorise a Vec16 FP unit as 'general'.
 
I don't quite understand how anyone can categorise a Vec16 FP unit as 'general'.
Because it can be used for anything that needs math? Since there's a generic scatter/gather, practically any loop can be made to run 16 times faster by processing 16 iterations in parallel.
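As an idealized illustration of that claim: a scalar loop with an indexed load, and the same loop processed 16 iterations per pass, where gather covers the table[idx[i]] access and a mask covers the remainder (the lane loops stand in for single vector instructions):

```
#include <cstddef>

constexpr int kLanes = 16;

// Scalar version: one iteration at a time. The indexed load table[idx[i]]
// is what blocks straightforward SSE-style vectorization.
void scaleIndexedScalar(float* out, const float* table, const int* idx, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = table[idx[i]] * 2.0f;
}

// 16-wide version: each pass handles 16 iterations; gather covers the
// indexed load and a mask covers the final partial pass. The inner lane
// loops stand in for single vector instructions.
void scaleIndexedWide(float* out, const float* table, const int* idx, std::size_t n) {
    for (std::size_t base = 0; base < n; base += kLanes) {
        for (int lane = 0; lane < kLanes; ++lane) {
            const std::size_t i = base + lane;
            if (i < n)                            // remainder handled by masking
                out[i] = table[idx[i]] * 2.0f;
        }
    }
}
```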
 
Larrabee seems to be a many-core x86 CPU + texture units + ring bus.
I was expecting something like a bunch of SIMD x86 cores, but it looks a lot more like a straight GPU: a bunch of 1*16 GPU-type SIMDs with little rump x86 units attached + various other GPU-type bits.
 
practically any loop can be made to run 16 times faster by processing 16 iterations in parallel.
Doesn't that require having 16 iterations that are ready to be processed & aren't dependent? A fairly un-general situation outside embarrassingly parallel things like graphics.
 
Doesn't that require having 16 iterations that are ready to be processed & aren't dependent? A fairly un-general situation outside embarrassingly parallel things like graphics.
Yes to the first question, no to the second.

Anything that really needs the massive throughput is going to have largely independent data. That includes graphics but also video encoding/decoding, sound processing, physics, etc. And tons of algorithms have parallel variants. We're only on the brink of researching and implementing them (before multi-core CPUs there wasn't any point). Basically, anything that previously required dedicated hardware to obtain parallelism can now be done on a more general-purpose chip.

If you know of any application that needs this level of performance but doesn't have a high number of independently processable elements, let me know. Else I think it's fair to call it general-purpose...
 
Just process those actually needed. Set a flag in the output buffer to indicate which vertices have already been processed. Like an infinite cache (from a software view, in hardware the actual cache keeps frequently used vertices close).

That's what they're telling you today... ;)


Binning renderer?
 
Anything that really needs the massive throughput is going to have largely independent data. That includes graphics but also video encoding/decoding, sound processing, physics, etc. And tons of algorithms have parallel variants. We're only on the brink of researching and implementing them (before multi-core CPUs there wasn't any point). Basically, anything that previously required dedicated hardware to obtain parallelism can now be done on a more general-purpose chip.

I think you're confusing thread-level parallelism and data-level parallelism (SIMD). They are two completely different forms of parallelism.
You don't need multi-core for SIMD, and you don't need SIMD for multi-core, and SIMD was available about a decade before multi-core was (at least in terms of consumer hardware).
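To make the distinction concrete, a toy sketch tied to no particular architecture: the same array sum expressed with data-level parallelism (many elements per operation within one thread) and with thread-level parallelism (independent threads on separate cores); the two can of course be combined:

```
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Data-level parallelism (SIMD-style): one instruction stream, many elements
// per operation. The fixed 16-wide inner loop is what a 16-lane vector unit
// would execute as single instructions.
float sumDLP(const std::vector<float>& v) {
    float acc[16] = {};                           // 16 partial sums, one per "lane"
    std::size_t i = 0;
    for (; i + 16 <= v.size(); i += 16)
        for (int lane = 0; lane < 16; ++lane) acc[lane] += v[i + lane];
    float total = std::accumulate(acc, acc + 16, 0.0f);
    for (; i < v.size(); ++i) total += v[i];      // scalar remainder
    return total;
}

// Thread-level parallelism: independent instruction streams on separate
// cores, each summing its own chunk. Assumes threads >= 1.
float sumTLP(const std::vector<float>& v, unsigned threads) {
    std::vector<float> partial(threads, 0.0f);
    std::vector<std::thread> pool;
    const std::size_t chunk = (v.size() + threads - 1) / threads;
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back([&, t] {
            const std::size_t begin = t * chunk;
            const std::size_t end = std::min(v.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i) partial[t] += v[i];
        });
    for (std::thread& th : pool) th.join();
    return std::accumulate(partial.begin(), partial.end(), 0.0f);
}
```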
 