Binning Renderer?
Quote (page 2):
The PowerVR* MBX and SGX series [PowerVR 2008], the Intel® Graphics Media Accelerator 900 Series [Lake 2005], the ARM Mali* [Stevens 2006], and Talisman [Torborg & Kajiya 1996] have been generally classified as sort-middle architectures.
Quote (page 5):
This section describes a sort-middle software renderer designed for the Larrabee architecture that uses binning for load balancing. Section 5 provides performance studies for this software renderer.
RV770 is 10*(16*(1+1+1+1+1)) = 800 SPs, right?
A 48-core Larrabee would be 48*(16*1) = 768 SPs.
So one ATI 16*(1+1+1+1+1) VLIW SIMD engine = 5 Larrabee cores.
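For what it's worth, here is that lane-count arithmetic spelled out as a tiny C++ sketch. It only uses the counts from the posts above; no clock speeds or FLOPS figures are assumed.

Code:
#include <cstdio>

int main() {
    // RV770: 10 SIMD engines x 16 VLIW units x (4 + 1) ALU slots = 800 "SPs"
    const int rv770_lanes = 10 * 16 * (1 + 1 + 1 + 1 + 1);

    // Hypothetical 48-core Larrabee: one 16-wide vector unit per core
    const int lrb_lanes = 48 * 16;

    std::printf("RV770 ALU lanes:    %d\n", rv770_lanes);   // 800
    std::printf("Larrabee ALU lanes: %d\n", lrb_lanes);     // 768

    // One 16-wide VLIW5 SIMD engine has 16 * 5 = 80 lanes,
    // i.e. the lane count of five 16-wide Larrabee cores.
    std::printf("Lanes per RV770 SIMD engine: %d\n", 16 * 5);
    return 0;
}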
Quote:
On real shaders, the scalar unit is usually 50% utilized or less. That's factoring in that the shader compiler does attempt to move work from the vector unit to the scalar unit (replacing a single vector instruction with 3 or 4 scalar instructions).

I'd be curious to know what this is based on.
Quote:
The other issue is that a majority of the operations that go through the 4-wide vector unit only use 3-element vectors. You're only using the fourth element maybe 25% of the time.

It's not a 4-wide vector, they are 4 scalars; other ops can be scheduled in there to "fill that hole". The shader compiler does this type of stuff automatically.
Quote:
The 5 ALU ops on ATI are split into 4-wide vector and 1-wide scalar operations.

Please tell me how you found out...
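To make the slot-packing point from the replies above concrete, here is a toy C++ sketch. It is not the real R6xx/R7xx ISA or shader compiler; it just shows why five independent scalar slots can co-issue a 3-component operation plus unrelated scalar work in one bundle.

Code:
#include <cstdio>
#include <string>
#include <vector>

// One VLIW bundle: up to 5 independent scalar ops issued per clock (illustrative).
struct Bundle {
    std::vector<std::string> slots;
};

int main() {
    // Scalar ops that are independent and ready this cycle: a vec3 multiply
    // (3 slots) plus two unrelated scalar ops pulled in to "fill the hole".
    const std::vector<std::string> ready = {
        "mul r0.x, r1.x, r2.x",
        "mul r0.y, r1.y, r2.y",
        "mul r0.z, r1.z, r2.z",
        "rcp r3.x, r4.x",
        "add r5.x, r5.x, r6.x",
    };

    Bundle b;
    for (const std::string& op : ready)
        if (b.slots.size() < 5)          // greedy packing into the 5 slots
            b.slots.push_back(op);

    std::printf("issued %zu ops in one bundle\n", b.slots.size());
    for (const std::string& op : b.slots)
        std::printf("  %s\n", op.c_str());
    return 0;
}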
Quote:
The only way to save space is to have tiles so large that object-level binning is doable without obscene overlap costs.

Yes, so? You can use large tiles inside your rendering engine and small tiles within the rasterizer.
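A minimal sketch of that two-level arrangement, assuming illustrative sizes (128-pixel bins for the binning pass, 16-pixel tiles inside the rasterizer) that are not taken from the paper:

Code:
#include <algorithm>
#include <cstdio>

constexpr int BIN  = 128;  // assumed coarse bin size used when binning primitives
constexpr int TILE = 16;   // assumed fine tile size walked by the rasterizer

struct Box { int x0, y0, x1, y1; };  // screen-space bounding box of a primitive

int main() {
    Box tri = { 70, 50, 300, 210 };

    // Binning pass: which coarse bins does the primitive touch?
    for (int by = tri.y0 / BIN; by <= tri.y1 / BIN; ++by)
        for (int bx = tri.x0 / BIN; bx <= tri.x1 / BIN; ++bx) {
            std::printf("bin (%d,%d)\n", bx, by);

            // Rasterization pass (per bin): walk the finer tiles inside the bin
            // that the bounding box overlaps; real code would test triangle edges.
            int x0 = std::max(tri.x0, bx * BIN), x1 = std::min(tri.x1, (bx + 1) * BIN - 1);
            int y0 = std::max(tri.y0, by * BIN), y1 = std::min(tri.y1, (by + 1) * BIN - 1);
            int tiles = ((x1 / TILE) - (x0 / TILE) + 1) * ((y1 / TILE) - (y0 / TILE) + 1);
            std::printf("  ~%d fine tiles touched inside this bin\n", tiles);
        }
    return 0;
}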
Quote:
Yes, starting with the rendering of bins before the entire scene has been binned will introduce an extra super-linear time/complexity factor ... but that in and of itself is not important, what's important is whether at those higher complexities it actually becomes slower than direct rasterization.

If tiling is a win.

If what is a win? Seems like a disadvantage to me.
He didn't. It's an assumption carried over from previous ATi architectures. This assumption is incorrect.
Quote:
Just process those actually needed. Set a flag in the output buffer to indicate which vertices have already been processed. Like an infinite cache (from a software view, in hardware the actual cache keeps frequently used vertices close).

Good, so you agree with me that it makes sense to 'cache' them. The paper doesn't mention this at all, but so many things are not mentioned that I'm willing to bet that loads of interesting details have been omitted.
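Here is roughly what that "flag in the output buffer" scheme looks like in code. This is a sketch under my own naming, not the paper's actual implementation: each indexed vertex is shaded at most once per batch, so the flag behaves like an unbounded post-transform cache.

Code:
#include <cstdint>
#include <vector>

struct VertexIn  { float pos[3]; };
struct VertexOut { float clip[4]; bool processed; };

// Trivial stand-in for a real vertex shader.
static VertexOut run_vertex_shader(const VertexIn& v) {
    VertexOut o;
    o.clip[0] = v.pos[0]; o.clip[1] = v.pos[1];
    o.clip[2] = v.pos[2]; o.clip[3] = 1.0f;
    o.processed = true;
    return o;
}

// Shade only the vertices the index list actually references, and never more
// than once: the flag in the output buffer marks already-processed vertices.
static void shade_referenced_vertices(const std::vector<VertexIn>& in,
                                      const std::vector<std::uint32_t>& indices,
                                      std::vector<VertexOut>& out) {
    out.assign(in.size(), VertexOut{});          // all flags start false
    for (std::uint32_t idx : indices)
        if (!out[idx].processed)
            out[idx] = run_vertex_shader(in[idx]);
}

int main() {
    std::vector<VertexIn> verts = { {{0,0,0}}, {{1,0,0}}, {{0,1,0}}, {{1,1,0}} };
    std::vector<std::uint32_t> indices = { 0, 1, 2,  2, 1, 3 };  // shared vertices
    std::vector<VertexOut> out;
    shade_referenced_vertices(verts, indices, out);  // vertices 1 and 2 shaded once
    return 0;
}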
Quote:
That's what they're telling you today...

Well, the beauty of this rendering architecture is that it's fluid and it will keep evolving/improving.
This is going to be true for pretty much any hardware that implements scatter/gather: the 4870, G100, or any vector supercomputer throughout the ages that doesn't have word-width SRAM main memory.
I think you're confusing thread-level parallelism and data-level parallelism (SIMD). They are two completely different forms of parallelism.
You don't need multi-core for SIMD, and you don't need SIMD for multi-core, and SIMD was available about a decade before multi-core was (at least in terms of consumer hardware).
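To illustrate the distinction, here is a small C++ sketch of the same loop written both ways. The 4-wide "lane" grouping and the thread count are arbitrary choices for the example; the point is that the two forms are orthogonal and compose freely.

Code:
#include <cstdio>
#include <thread>
#include <vector>

void add_simd_style(const float* a, const float* b, float* c, int n) {
    // Data-level parallelism: one operation applied across a group of lanes.
    // Written as a plain 4-element inner loop to stay portable; a real SIMD
    // path would use intrinsics or rely on the vectorizer.
    for (int i = 0; i < n; i += 4)
        for (int lane = 0; lane < 4 && i + lane < n; ++lane)
            c[i + lane] = a[i + lane] + b[i + lane];
}

void add_threaded(const float* a, const float* b, float* c, int n, int nthreads) {
    // Thread-level parallelism: independent instruction streams on disjoint ranges.
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t) {
        int begin = n * t / nthreads, end = n * (t + 1) / nthreads;
        pool.emplace_back([=] {
            for (int i = begin; i < end; ++i) c[i] = a[i] + b[i];
        });
    }
    for (auto& th : pool) th.join();
}

int main() {
    const int n = 1024;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);
    add_simd_style(a.data(), b.data(), c.data(), n);
    add_threaded(a.data(), b.data(), c.data(), n, 4);   // the two compose freely
    std::printf("c[0] = %.1f\n", c[0]);
    return 0;
}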
Quote:
Yes, so? You can use large tiles inside your rendering engine and small tiles within the rasterizer.

Whoops. I should have seen this obvious solution. You're absolutely right.
Quote:
but that in and of itself is not important, what's important is whether at those higher complexities it actually becomes slower than direct rasterization.

Sure, and it very well could be.
Quote:
I didn't know that it had changed.

Actually that particular configuration (Vec4+Scalar) has never appeared on the PC; it was XBOX 360 / Xenos specific. DX9 non-unified architectures featured a Vec3+Scalar, dual-issue MAD+ADD ALU configuration, while DX10 unified architectures have been 5D scalar MADs.
Quote:
Do the newer ATI VLIW cores support issuing multiple scalar or vector instructions in one clock?

Yes, the ALUs can process this and the shader compiler is designed to do this. Vectors can even be broken up and issued over multiple clocks in order to achieve better packing as well.
Quote:
I didn't know that it had changed.

Get GPUSA:
maven said:
2) I would not want to have to maintain the Larrabee driver / software stack. Ever! Under no circumstances. And in particular knowing Intel's GMA driver reputation, I'm not sure Intel can do it.
Quote:
If Larrabee takes over, perhaps Intel doesn't have to maintain a "driver" in the traditional sense. Intel could give the big game design companies access to their baseline software render engine and then say "have fun", allowing these companies to totally bypass DX and do any sort of customization they want to the rendering engine, physics computations, etc.

The render engine would have to sit on top of some kind of driver.
Quote:
In the long run, there is no need for the hardware to hide behind some thick Microsoft-controlled API and driver layer.

If and only if there is only Intel, and even then it is only guaranteed if there is a single Larrabee revision to program for.