Larrabee at SIGGRAPH

One also has to keep in mind that Intel wasn't the first to attempt this. Sony's original plan for the PS3 was for it not to have a GPU at all; the SPUs would do the graphics. It ended up shipping with a traditional GPU.

If GPUs keep getting more general, and CPUs like Larrabee keep getting more parallel with some special GPU-like fixed-function sauce bolted on, one would think there will eventually be a crossover point where they are at least in the same ballpark as each other.

PS3's Cell was too early, but maybe Larrabee isn't. I hope it isn't. It seems to be addressing Cell's big shortcomings for competitive rendering performance (e.g. texturing and some other fixed-function logic; PS3's Cell had NO graphics-specific logic). I would love to see Larrabee be at least reasonably competitive with high-end GPUs of its own generation. Particularly for the next consoles: two identical ~200-250mm^2 processors in the box instead of one CPU and one GPU could be very, very nice indeed.
 
Again, one thing that Intel omitted in that paper was RAM usage for binning. FEAR and HL2 are extremely low poly, and even Gears doesn't have that many polygons when looking at a 2010 timeframe. You can see how the BW usage is way higher for Gears than FEAR with binned rendering but not with immediate mode. A game like Crysis would take gobs of RAM and BW to bin.
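
To put a rough number on it, here's a back-of-the-envelope sketch of how bin space scales with polygon count. The triangle count, per-triangle record size and tile-overlap factor are all made-up illustrative values, not figures from the paper:

[code]
// Rough bin-space estimate for a binned (tile-based) renderer.
// The numbers below are illustrative assumptions, not figures from Intel's paper.
#include <cstdio>

int main() {
    const double triangles_per_frame = 3.0e6;   // assumed for a Crysis-class scene
    const double bytes_per_bin_entry = 64.0;    // assumed: post-transform verts + state refs
    const double tile_overlap_factor = 1.5;     // assumed: triangles spanning multiple tiles

    double bin_bytes = triangles_per_frame * bytes_per_bin_entry * tile_overlap_factor;
    std::printf("bin space per frame: ~%.0f MB\n", bin_bytes / (1024.0 * 1024.0));
    // ~275 MB with these numbers; doubling poly count or record size doubles it.
}
[/code]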

RAM is a significant cost for video cards. We know Intel has a process advantage to help Larrabee make up for the per-transistor inefficiency compared to GPUs, but even better BW efficiency is not going to allow them to use cheap enough RAM to make up the difference.

One thing is for sure: If Larrabee has the same price as GPUs with equal DirectX performance, Intel will lose money instead of making the obscene margins on its CPUs that GPU makers can only dream of.

Titanio: For your projected FPS, remember that you're supposed to average render times, not FPS. Equivalently, you can average the number of cores needed and do your scaling from there.
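
To make that concrete, a toy example with two frames (one fast, one slow) shows why averaging FPS overstates performance:

[code]
// Why you average render times (or core counts), not FPS.
// Two frames: one at 100 fps (10 ms) and one at 20 fps (50 ms).
#include <cstdio>

int main() {
    double fps[2]     = {100.0, 20.0};
    double time_ms[2] = {1000.0 / fps[0], 1000.0 / fps[1]};     // 10 ms, 50 ms

    double naive_avg_fps = (fps[0] + fps[1]) / 2.0;             // 60 fps (misleading)
    double avg_time_ms   = (time_ms[0] + time_ms[1]) / 2.0;     // 30 ms
    double true_avg_fps  = 1000.0 / avg_time_ms;                // ~33.3 fps

    std::printf("naive: %.1f fps, correct: %.1f fps\n", naive_avg_fps, true_avg_fps);
    // Averaging core counts works the same way, because the cores needed
    // scale with render time, not with FPS.
}
[/code]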
 
If they don't provide automated tools to fiberize your code it will be a bloody fucking nightmare. Writing as close to the hardware as you would on a GPU while trying to get good efficiency is far harder on Larrabee, because the latency hiding is done by software pipelining (essentially software multithreading) rather than handled by hardware multithreading.

The DX/GL raster design is in general also just what I was expecting, which brings up the core questions of how many vector registers they have and how many simultaneous (pre)fetches a core can handle at once (which determines how much software vectorization, i.e. fibers, they can do to hide latency).
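
For what it's worth, here's a rough sketch of the kind of hand-rolled fibering this implies. Every name in it is hypothetical; the point is only that the switch between pixel batches has to be scheduled in software rather than by a hardware thread scheduler:

[code]
#include <cstdio>

// Hypothetical stand-ins for an asynchronous texture fetch and the shader math.
void issue_texture_fetch(int batch) { std::printf("fetch issued for batch %d\n", batch); }
void shade_batch(int batch)         { std::printf("shading batch %d\n", batch); }

struct Fiber {
    int  pixel_batch;    // which group of 16 pixels this fiber shades
    bool fetch_issued;   // texture data requested but possibly not arrived yet
};

// Issue a long-latency fetch for one batch, then move on to another batch
// instead of stalling; come back once the data should be in the cache.
void shade_with_fibers(Fiber* fibers, int num_fibers) {
    for (int pass = 0; pass < 2; ++pass) {
        for (int i = 0; i < num_fibers; ++i) {
            Fiber& f = fibers[i];
            if (!f.fetch_issued) {
                issue_texture_fetch(f.pixel_batch);  // non-blocking request
                f.fetch_issued = true;
            } else {
                shade_batch(f.pixel_batch);          // data should have arrived by now
            }
        }
        // By the time we loop back to fiber 0, its fetch has (hopefully) landed.
        // Getting this interleaving right by hand is exactly the nightmare above.
    }
}

int main() {
    Fiber fibers[4] = {{0, false}, {1, false}, {2, false}, {3, false}};
    shade_with_fibers(fibers, 4);
}
[/code]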
 
Again, one thing that Intel omitted in that paper was RAM usage for binning. FEAR and HL2 are extremely low poly, and even Gears doesn't have that many polygons when looking at a 2010 timeframe. You can see how the BW usage is way higher for Gears than FEAR with binned rendering but not with immediate mode. A game like Crysis would take gobs of RAM and BW to bin.

Seriously... how many GB in a "gob"? :p ;) :LOL:

How badly would they be hurt by offloading some of that to system RAM?
 
Again, one thing that Intel omitted in that paper was RAM usage for binning. FEAR and HL2 are extremely low poly, and even Gears doesn't have that many polygons when looking at a 2010 timeframe. You can see how the BW usage is way higher for Gears than FEAR with binned rendering but not with immediate mode. A game like Crysis would take gobs of RAM and BW to bin.
An opportunity for a technology shift, from color+z compression to geometry compression.
 
Again, one thing that Intel omitted in that paper was RAM usage for binning. FEAR and HL2 are extremely low poly, and even Gears doesn't have that many polygons when looking at a 2010 timeframe. You can see how the BW usage is way higher for Gears than FEAR with binned rendering but not with immediate mode. A game like Crysis would take gobs of RAM and BW to bin.

While I agree with this, triangle setup numbers aren't really accelerating on GPUs that fast now (are they?), and even if they did, current 2x2 quad based GPU pipelines lose too much pixel shader efficiency to make use of much smaller triangles. If anything I would guess that higher ALU rates will enable larger triangles + localized tracing (like relaxed cone step mapping) to get better detail.
 
I'm not into 3D graphics but into video processing, so I don't have the same expectations from this chip. 32 x86 look-alike cores with 16-wide registers per core + a complete SIMD API? I just wish I had some Larrabee hardware to play with, it looks like FUN :)
 
Titanio: For your projected FPS, remember that you're supposed to average render times, not FPS. Equivalently, you can average the number of cores needed and do your scaling from there.

Thanks. If we take the average number of cores needed to do 60fps for each of the frames in the sample set, then the projection for Gears of War works out to ~86fps and for F.E.A.R. to ~108fps.

(Again, projections assume a 24-core part at 1GHz.)
 
While I agree with this, triangle setup numbers aren't really accelerating on GPUs that fast now (are they?),
I can't really make sense of this statement, but setup is a substantial part of the workload. In another thread I was looking at some hardware.fr summary numbers, and only about 60% of the workload was per-pixel. The rest was per-frame, and with the speed of VS processing in unified architectures and Z-fillrates for shadow maps, that generally means setup limited.

and even if they did, current 2x2 quad based GPU pipelines lose too much pixel shader efficiency to make use of much smaller triangles.
Not really. Yes, you lose efficiency, but if you can draw 8 quads per clock (GT200), you could process 8 tiny triangles per clock if setup/rasterization were faster.
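
A toy calculation of quad-shading efficiency (the pixels-per-triangle and quads-touched numbers are just illustrative assumptions):

[code]
// Toy estimate of 2x2-quad shading efficiency as triangles shrink.
// A quad always shades 4 pixels even if the triangle covers fewer of them.
#include <cstdio>

int main() {
    // covered pixels per triangle -> quads touched (illustrative assumptions)
    struct Case { const char* name; int pixels; int quads; } cases[] = {
        {"large (64 px)", 64, 20},   // mostly full quads, some edge waste
        {"small (8 px)",   8,  5},
        {"tiny (1 px)",    1,  1},
    };
    for (const Case& c : cases) {
        double efficiency = 100.0 * c.pixels / (c.quads * 4);
        std::printf("%-15s %.0f%% of shaded pixels are useful\n", c.name, efficiency);
    }
    // A 1-pixel triangle still burns a full quad: 25% efficiency.
    // But at 8 quads/clock (GT200), that is still 8 tiny triangles/clock
    // if setup and rasterization could keep up.
}
[/code]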

If anything I would guess that higher ALU rates will enable larger triangles + localized tracing (like relaxed cone step mapping) to get better detail.
There are edge problems with this, though. I know Policarpo put out a paper on doing relief mapping for curved surfaces and getting silhouettes, but he overlooked a major flaw.

I independently went down the same path, and unfortunately discovered that it's possible to have holes appear in models when using local curvature. I actually used paraboloids instead of cones, because I had a second order polynomial for my local curvature model anyway and I also got faster convergence when using it with a normal vector (due to bigger volumes in each step). I gave up when I found this flaw, though, because I knew you could never really use it without it being a big headache.
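
For anyone not familiar with the technique being discussed, here's a minimal CPU-side sketch of the basic cone step march over a heightfield. It uses a single conservative cone ratio instead of a precomputed per-texel cone map, and the heightfield is a made-up toy; it illustrates the marching itself, not the curved-silhouette extension where the holes show up:

[code]
// Minimal CPU-side sketch of cone step mapping over a toy heightfield.
// A real implementation runs in a pixel shader with a per-texel cone map;
// here a single conservative cone ratio is assumed for brevity.
#include <cmath>
#include <cstdio>

// Toy heightfield in [0,1]: a bump in the middle of the unit square.
float height_at(float x, float y) {
    float dx = x - 0.5f, dy = y - 0.5f;
    return std::fmax(0.0f, 0.5f - 4.0f * (dx * dx + dy * dy));
}

// March a ray (origin o, direction d, d.z < 0) until it reaches the surface.
// cone_ratio bounds how much horizontal distance is safe per unit of height
// above the surface; the safe step solves t*|d.xy| = cone_ratio*(z - h + t*d.z).
bool cone_step_trace(float ox, float oy, float oz,
                     float dx, float dy, float dz,
                     float cone_ratio, float* hit_x, float* hit_y) {
    float x = ox, y = oy, z = oz;
    float dxy = std::sqrt(dx * dx + dy * dy);
    for (int i = 0; i < 64; ++i) {                           // bounded iteration count
        float h = height_at(x, y);
        if (z <= h) { *hit_x = x; *hit_y = y; return true; }
        float t = cone_ratio * (z - h) / (dxy - cone_ratio * dz);
        x += dx * t; y += dy * t; z += dz * t;
        if (x < 0 || x > 1 || y < 0 || y > 1) return false;  // left the tile
    }
    *hit_x = x; *hit_y = y;                                  // converged close enough
    return true;
}

int main() {
    float hx, hy;
    if (cone_step_trace(0.1f, 0.5f, 1.0f, 0.6f, 0.0f, -0.8f, 0.3f, &hx, &hy))
        std::printf("hit near (%.2f, %.2f), height %.2f\n", hx, hy, height_at(hx, hy));
}
[/code]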
 
An opportunity for a technology shift, from color+z compression to geometry compression.
Well, color+z compression isn't about saving space, though (it's about saving bandwidth), and I don't know how much geometry compression can help for poly-level binning.

There are avenues for improvement for Larrabee, though. I don't think storing the coverage mask is very space efficient. It may be possible for Intel to just store the original indices and vertices instead of the transformed ones, and separate the position portion of the vertex shader for binning. Then each bin would only need an index list that can be compressed into start-end pairs (with renderstate delineation, of course).

I still think polygon-level binning is an inefficient way to do graphics for a given amount of RAM, though, and that metric is critical for mass-market video cards and consoles.
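
As a sketch of what that might look like (the struct layout here is my own illustration, not anything from Intel):

[code]
// Sketch of a bin that stores compressed index ranges into the original
// index buffer instead of transformed vertices. Layout is illustrative only.
#include <cstdint>
#include <cstdio>
#include <vector>

struct IndexRange {
    uint32_t first_triangle;  // first triangle (in submission order) falling in this bin
    uint32_t last_triangle;   // last triangle of the contiguous run
};

struct BinRecord {
    uint32_t render_state_id;         // delineates render-state changes within the bin
    std::vector<IndexRange> ranges;   // contiguous runs compress to start/end pairs
};

// Append a triangle to a bin, extending the last range when the run is contiguous.
void add_triangle(BinRecord& bin, uint32_t tri) {
    if (!bin.ranges.empty() && bin.ranges.back().last_triangle + 1 == tri)
        bin.ranges.back().last_triangle = tri;   // extend the run: no extra storage
    else
        bin.ranges.push_back({tri, tri});        // start a new run
}

int main() {
    BinRecord bin{0, {}};
    add_triangle(bin, 4); add_triangle(bin, 5); add_triangle(bin, 6); add_triangle(bin, 9);
    std::printf("%zu ranges cover 4 binned triangles\n", bin.ranges.size());  // 2 ranges
    // At raster time the bin re-runs the position-only part of the vertex shader
    // for these triangles, trading bin memory for some repeated transform work.
}
[/code]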
 
I really hope Larrabee is awesome, but I have my doubts about a lot of what Intel does that doesn't come out of Israel (where Core 2 Duo was born).

The latest being Atom, which, according to the new reviews, the VIA Nano completely trounces.

But anyway Larrabee seems very exciting.
 
I'm not into 3D graphics but into video processing, so I don't have the same expectations from this chip. 32 x86 look-alike cores with 16-wide registers per core + a complete SIMD API? I just wish I had some Larrabee hardware to play with, it looks like FUN :)
Unfortunately it seems there's no support for vectors of small integers (everything is widened to 32 bits), which hurts for motion estimation and codecs.
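
A scalar stand-in for why that hurts: a 16-pixel SAD (the core of motion estimation) only fills a quarter of each 32-bit lane, whereas hypothetical 8-bit lanes on the same 16-wide registers could have handled 64 pixels per instruction:

[code]
// Why 32-bit-only vector lanes hurt motion estimation: a SAD over 8-bit
// pixels only uses a quarter of each lane's width. Scalar stand-in below;
// on a 16-wide unit this loop maps to one gather + |a-b| + add per 16 pixels,
// where 8-bit lanes (if they existed) could have covered 64 pixels instead.
#include <cstdint>
#include <cstdio>
#include <cstdlib>

uint32_t sad_16(const uint8_t* a, const uint8_t* b) {
    uint32_t sum = 0;
    for (int i = 0; i < 16; ++i) {
        // each 8-bit pixel gets widened to a 32-bit lane before the subtract
        int32_t wa = a[i], wb = b[i];
        sum += static_cast<uint32_t>(std::abs(wa - wb));
    }
    return sum;
}

int main() {
    uint8_t block[16] = {10, 20, 30, 40}, ref[16] = {12, 18, 33, 40};
    std::printf("SAD = %u\n", sad_16(block, ref));  // 2 + 2 + 3 + 0 = 7
}
[/code]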
 
I still think polygon-level binning is an inefficient way to do graphics for a given amount of RAM, though, and that metric is critical for mass-market video cards and consoles.
On a console you aren't going to be throwing a random polygon soup at the thing, unless you are dead set on wasting that memory. On a PC you can store the bins in host memory if necessary; it's a lot of data but negligible bandwidth.
 
I really hope Larrabee is awesome, but I have my doubts about a lot of what Intel does that doesn't come out of Israel (where Core 2 Duo was born).

The latest being Atom, which, according to the new reviews, the VIA Nano completely trounces.

But anyway Larrabee seems very exciting.

Nano and Atom target different power, thermal and performance envelopes, so that's not surprising. Atom is destined to go into smartphones; Nano won't go any further than Eee-like netbooks. The only thing letting Atom down now is the rest of the platform, which is aging and on old process tech as well.
 
Not much on the actual hardware...

As was perhaps expected, the SIGGRAPH paper doesn't have a lot about the actual hardware, but more about the software aspect. I guess we'll need to wait for Hot Chips for many more hardware details.

Some of the things they did say (some of which was mentioned earlier in the thread):

- Fixed function texture unit (per core?), but all rasterization is done in software. They say that rasterization is more efficient in fixed-function hardware, but that doing it in software gives them flexibility as to where the rasterization is done in the graphics pipeline.

- All task scheduling is implemented in software using a task-scheduling runtime.

- The vector unit supports full scatter/gather and masking. When combined, this allows "16 shader instances to be run in parallel, each of which appears to run serially, even when performing array access with computed indices". As predicted, this is how Larrabee gets around the problem of hard-to-vectorize code. This is true SIMD (more like a GPU) rather than the SSE-like vectors on current x86 processors. (See the sketch after this list.)

- Scatter/gather can access only one cache block per cycle, so it is variable latency based on the number of cache blocks touched.

- "Register data can be sizzled in a variety of ways, e.g., to support matrix multiplication"

- New instructions to bring data into the cache, but in a way that keeps larger data streams from sweeping the cache. Probably brings the cache block in marked as "least recently used" so that it will be replaced next. If used well, this can help locality a lot.

- The software renderer is some sort of tile-based rendering engine using a frame buffer implemented entirely in software.

- Perhaps the most interesting part of the paper is where they describe what it takes to add various advanced rendering techniques to their standard software renderer. They also show results on game physics and other more GPGPU-like tasks.
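
Here's a scalar emulation of what the masking + gather combination buys. This mirrors the execution model described above, not Larrabee's actual instruction set or intrinsics:

[code]
// Scalar emulation of per-lane masking + gather: 16 shader instances each do
// an array lookup with its own computed index, under divergent control flow.
// Illustration of the execution model only, not Larrabee's ISA.
#include <cstdio>

const int WIDTH = 16;

void shade_16(const float* lut, const int* material_id, float* out) {
    bool  mask[WIDTH];
    float gathered[WIDTH];

    // "Compare" step: build a predicate mask, one bit per shader instance.
    for (int lane = 0; lane < WIDTH; ++lane)
        mask[lane] = (material_id[lane] >= 0);        // e.g. skip invalid pixels

    // "Gather" step: each active lane loads from its own computed address.
    for (int lane = 0; lane < WIDTH; ++lane)
        if (mask[lane]) gathered[lane] = lut[material_id[lane]];

    // Masked arithmetic: inactive lanes keep their previous results.
    for (int lane = 0; lane < WIDTH; ++lane)
        if (mask[lane]) out[lane] = gathered[lane] * 2.0f;
}
// Each of the three loops corresponds to roughly one vector instruction, so the
// 16 instances appear to run serially while actually executing in lockstep.

int main() {
    float lut[4] = {0.1f, 0.2f, 0.3f, 0.4f};
    int ids[WIDTH];
    float out[WIDTH] = {};
    for (int i = 0; i < WIDTH; ++i) ids[i] = (i % 5 == 0) ? -1 : i % 4;
    shade_16(lut, ids, out);
    for (int i = 0; i < WIDTH; ++i) std::printf("%.1f ", out[i]);
    std::printf("\n");
}
[/code]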
 
On a console you aren't going to be throwing a random polygon soup at the thing, unless you are dead set on wasting that memory.
It doesn't matter. All else being equal, a binned renderer will have lower poly models and/or lower resolution textures than an immediate mode renderer. The question is whether the performance savings (if there are any) will make up for that in one way or another.
On a PC you can store the bins in host memory if necessary; it's a lot of data but negligible bandwidth.
Negligible? With the binning technique in the paper, I think Crysis would need over 500MB of bin space. At 30 fps, that's 30 GB/s of bin BW alone. In the future, games could easily need this much or more.

Larrabee is going to be on a separate card, right? For 60 fps, PCI-E 2.0 will limit you to 136 MB per frame of bin space if it's kept on the host while also using most of the system BW. Those are ugly constraints.
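
Spelling out the arithmetic behind those numbers (taking the 500MB estimate above at face value, with bins written once and read once per frame):

[code]
// The arithmetic behind the numbers above: bins are written once and read
// once per frame, and a PCIe 2.0 x16 link tops out around 8 GB/s each way.
#include <cstdio>

int main() {
    double bin_mb_per_frame = 500.0;                           // estimate quoted above
    double bin_bw_gbs = bin_mb_per_frame * 2.0 * 30.0 / 1000;  // write + read at 30 fps
    std::printf("bin traffic: ~%.0f GB/s\n", bin_bw_gbs);      // ~30 GB/s

    double pcie2_mb_per_s = 8192.0;                            // x16 link, ~8 GB/s one way
    double budget_mb = pcie2_mb_per_s / 60.0;                  // per frame at 60 fps
    std::printf("host-side bin budget: ~%.1f MB/frame\n", budget_mb);  // ~136.5 MB
}
[/code]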
 
I still think polygon-level binning is an inefficient way to do graphics for a given amount of RAM, though, and that metric is critical for mass-market video cards and consoles.
Considering the number of dirt cheap graphics cards with way too much low-bandwidth memory on them, I don't think this is a big deal.

Jawed
 