22 nm Larrabee

Knights Ferry at 32 nm is just 32 cores @ 1.2 GHz with only 500 GFLOPS DP, less than today's Tesla C2070 and a 4-chip POWER7 MCM with 1 TFLOPS DP.
So how much do you believe 22 nm can give, beyond saying that LRBni is easier to program?
 
Knights Ferry at 32 nm is just 32 cores @ 1.2 GHz with only 500 GFLOPS DP, less than today's Tesla C2070 and a 4-chip POWER7 MCM with 1 TFLOPS DP.
So how much do you believe 22 nm can give, beyond saying that LRBni is easier to program?

That's not correct. Knights Ferry is 45nm.
 
There are quite a few papers out there on software micropolygon rendering, and Intel's parallel setup was pretty fast, so I don't think software rasterization is the problem.

Instead, I think it's that tile-based rendering with tessellation needs either gobs of bandwidth (binning all the polygons and the dynamically generated vertex data) or lots of duplicated geometry work (tessellating each patch for the initial pass and again for every tile it gets binned into).

Tessellation is a great way to amplify data while avoiding bandwidth consumption (assuming it's done right; Cayman's spilling into memory isn't needed with a proper architecture). However, that's only the case if you render the triangles immediately rather than deferring them.

http://forum.beyond3d.com/showpost.php?p=1531013&postcount=42
 
True, there are a few micro-polygon setup papers out there, but they're not for Larrabee, and IIRC none of them mention how to use the method efficiently in a tile-based renderer. The closest thing I could find is Reyes-like: sorting patches into tiles before tessellation. But how to compute a tight yet conservative bound for a patch in a graphics API, without a hint from the user, is an open question. DX11 is just designed for immediate mode GPUs.
Hardly.

http://portal.acm.org/citation.cfm?id=1516530
 
I believe software is always smarter than hardware. If the hardware can have a correct implementation, so can software.

The problem is that software is rarely optimized for bandwidth alone. In the CPU world, the bandwidth/compute ratio is almost always higher.

Let's say tessellating/displacing a patch much more than twice is a dumb idea. Then an implementation must always spill the vertices out to memory for neighboring tiles, just like Reyes. That sounds dumb too.

Or let's say spilling into memory is a dumb idea. Then even if no single patch overlaps a tile boundary, because of the sorting every patch has to be tessellated/displaced exactly twice; otherwise it's unclear which tiles it touches.

TBDR relies on one assumption: pixel R/W bandwidth is larger than primitive parameter W/R bandwidth. As we all know, a pixel is much smaller than a vertex, so if vertices approach pixel size I can see no efficiency gain from a TBDR. Maybe it's me that's dumb.
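To put rough numbers on that assumption (illustrative figures, not measurements): a binned post-tessellation vertex with position, normal and UVs is on the order of 32–48 bytes, and writing it into a bin and reading it back costs roughly double that per vertex. The framebuffer traffic a deferred tiler saves is on the order of 8–16 bytes per pixel (colour plus depth) times the overdraw factor. So once tessellation pushes vertex density toward one vertex per pixel, the parameter traffic alone can exceed what the tiler saves on pixels, unless what gets binned is the un-tessellated patch rather than the expanded vertices.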
 
I believe software is always smarter than hardware. If the hardware can have a correct implementation, so can software.

The problem is that software is rarely optimized for bandwidth alone. In the CPU world, the bandwidth/compute ratio is almost always higher.

Let's say tessellating/displacing a patch much more than twice is a dumb idea. Then an implementation must always spill the vertices out to memory for neighboring tiles, just like Reyes. That sounds dumb too.

Or let's say spilling into memory is a dumb idea. Then even if no single patch overlaps a tile boundary, because of the sorting every patch has to be tessellated/displaced exactly twice; otherwise it's unclear which tiles it touches.

TBDR relies on one assumption: pixel R/W bandwidth is larger than primitive parameter W/R bandwidth. As we all know, a pixel is much smaller than a vertex, so if vertices approach pixel size I can see no efficiency gain from a TBDR. Maybe it's me that's dumb.

Well, for a TBDR to win with highly tessellated geometry, it has to dump the compressed version of the geometry (i.e. the raw patches) to memory after spatial binning and tessellate them on chip. Offhand, I can't see any other way a TBDR will win.

Assuming it can be done efficiently by a clever hw/sw combination, a TBDR can certainly win, and win big, with ginormous tessellation.
 
Well, for a TBDR to win with highly tessellated geometry, it has to dump the compressed version of the geometry (i.e. the raw patches) to memory after spatial binning and tessellate them on chip. Offhand, I can't see any other way a TBDR will win.
That's already common practice for software parallel renderers, IMO.

Assuming it can be done efficiently by a clever hw/sw combination, a TBDR can certainly win, and win big, with ginormous tessellation.

Did you count the cost of re-tessellation and vertex shading just because one patch overlaps multiple tiles?


Thanks for the great paper!

But I'm highly suspicious about whether the paper describes a proven technique. Especially: can it handle arbitrary real-world vertex shaders with arbitrary displacement map functions? The paper seems uncertain about that, too. Plus, the cost of analyzing/converting a displacement map into an interval texture in real time, within at most a fraction of a millisecond in the driver, seems highly implausible to me. Don't ignore the case where the displacement map is procedurally generated each frame, as in water rendering and simulations.

Anyway, the method described is quite novel, and IMO it would be quite good for an offline renderer rather than for a program that has to be very responsive: a graphics driver. DX11 provides no way for the user to hint the graphics driver about this, which is why I consider it an immediate mode API. The users (us graphics programmers) know best about our data and shaders, not the graphics driver. Maybe OpenGL could be extended to do better, but in the case of Larrabee it has to work well in DX11.
 
But I'm highly suspicious about whether the paper describes a proven technique. Especially: can it handle arbitrary real-world vertex shaders with arbitrary displacement map functions? The paper seems uncertain about that, too. Plus, the cost of analyzing/converting a displacement map into an interval texture in real time, within at most a fraction of a millisecond in the driver, seems highly implausible to me. Don't ignore the case where the displacement map is procedurally generated each frame, as in water rendering and simulations.
They say pretty clearly in the abstract itself that they can handle arbitrary shaders and arbitrary displacement maps.
Anyway, the method described is quite novel, and IMO it would be quite good for an offline renderer rather than for a program that has to be very responsive: a graphics driver. DX11 provides no way for the user to hint the graphics driver about this, which is why I consider it an immediate mode API. The users (us graphics programmers) know best about our data and shaders, not the graphics driver. Maybe OpenGL could be extended to do better, but in the case of Larrabee it has to work well in DX11.
Water might be an exception, but caching shaders should work fine in a large number of cases.
 
They say pretty clearly in the abstract itself that they can handle arbitrary shaders and arbitrary displacement maps.

I'll be very curious how they arrived at that conclusion.

For one thing, they assume "differentiable functions". What if a function is not differentiable? Something as simple as step(x,y) in HLSL is not differentiable. And to take it to an extreme, what about an integer vertex shader with bitwise logical instructions?

For another, they didn't show the cost of the analysis in terms of time, memory, etc. In a real-time context this is even more important than the math itself. Just saying "on the fly" is not enough. The fly might be from China to the US.

Lastly, they seem to make some things overly complex while other parts are too brief. As a programmer, implementing that paper sounds like hell to me.

In my opinion, culling can be much simpler if done by USERS than in the API. Many people are already doing culling in their hull shaders; it's far simpler because they know their data and code.
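For illustration, here is a minimal sketch of that kind of user-side cull in a hull-shader patch-constant function, assuming a 16-control-point quad patch and a known per-patch displacement bound (gMaxDisplacement); the names are made up for the sketch, and the matching hull shader that would reference it via [patchconstantfunc] is omitted:

// Sketch only: cull a patch by returning zero tessellation factors when its
// conservative bound is guaranteed off screen. The 16-control-point quad
// patch, cbuffer names and gMaxDisplacement are assumptions for illustration.
struct ControlPoint { float3 posW : POSITION; float2 uv : TEXCOORD0; };

struct PatchConstants
{
    float edges[4]  : SV_TessFactor;
    float inside[2] : SV_InsideTessFactor;
};

cbuffer CullConstants
{
    float4x4 gViewProj;         // world -> clip transform (row-vector convention)
    float    gMaxDisplacement;  // conservative bound on |displacement| over the patch
    float    gTessFactor;       // factor to use when the patch is kept
};

PatchConstants CullingConstantsHS(InputPatch<ControlPoint, 16> cp)
{
    PatchConstants o;

    // Conservative world-space box: control-point AABB grown by the largest
    // displacement the domain shader could possibly apply.
    float3 bmin = cp[0].posW, bmax = cp[0].posW;
    [unroll] for (int i = 1; i < 16; ++i)
    {
        bmin = min(bmin, cp[i].posW);
        bmax = max(bmax, cp[i].posW);
    }
    bmin -= gMaxDisplacement;
    bmax += gMaxDisplacement;

    // If all eight box corners are outside the same clip plane, the displaced
    // patch cannot be visible, so it can be discarded.
    bool outL = true, outR = true, outB = true, outT = true, outN = true, outF = true;
    [unroll] for (int c = 0; c < 8; ++c)
    {
        float3 corner = float3((c & 1) ? bmax.x : bmin.x,
                               (c & 2) ? bmax.y : bmin.y,
                               (c & 4) ? bmax.z : bmin.z);
        float4 p = mul(float4(corner, 1.0f), gViewProj);
        outL = outL && (p.x < -p.w);  outR = outR && (p.x > p.w);
        outB = outB && (p.y < -p.w);  outT = outT && (p.y > p.w);
        outN = outN && (p.z < 0.0f);  outF = outF && (p.z > p.w);
    }
    bool culled = outL || outR || outB || outT || outN || outF;

    // A tessellation factor of 0 makes the tessellator discard the whole patch.
    float f = culled ? 0.0f : gTessFactor;
    o.edges[0] = o.edges[1] = o.edges[2] = o.edges[3] = f;
    o.inside[0] = o.inside[1] = f;
    return o;
}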
 
I'll be very curious how they arrived at that conclusion.

For one thing, they assume "differentiable functions". What if a function is not differentiable? Something as simple as step(x,y) in HLSL is not differentiable. And to take it to an extreme, what about an integer vertex shader with bitwise logical instructions?
And just how realistic is that scenario?
 
And just how realistic is that scenario?

Sure, it's not realistic, but it's possible. As a graphics driver, you should support all possibilities within the specification. You can't fail to compile a program, for example, just because it's not realistic or doesn't make sense.

No offense, but what about the Taylor series of this function:
float InvSqrt(float x) {
    float xhalf = 0.5f * x;
    int i = *(int*)&x;               // reinterpret the float's bits as an integer
    i = 0x5f3759df - (i >> 1);       // magic-constant initial guess for 1/sqrt(x)
    x = *(float*)&i;                 // reinterpret the bits back into a float
    x = x * (1.5f - xhalf * x * x);  // one Newton-Raphson refinement step
    return x;
}
It's creativity.

On the other hand, culling by developers themselves is much easier: everything is under their control. If someone doesn't make sense, you can just fire the guy.
 
That doesn't take away from my point. Tessellation hurts a TBDR more than an IMR. Even accounting for all the advantages of TBDR and the inefficiencies of Larrabee's generalized architecture, if it was barely competitive in DX9, it would be much less so in DX11.

Assuming it can be done efficiently by a clever hw/sw combination, a TBDR can certainly win, and win big, with ginormous tessellation.
Geometry will always cost more on a TBDR than an IMR. There's no way around that. So no, it can't win big with high tessellation unless it's the pixel workload that is giving it the advantage (or, naturally, if you only give the TBDR an optimization that is equally valid for the IMR).

They say pretty clearly in the abstract itself that they can handle arbitrary shaders and arbitrary displacement maps.
And just how realistic is that scenario?
True, but interval textures aren't free. And how about random value functions using bit masks? Procedural noise? It's an interesting paper, though.
 
Sure, it's not realistic, but it's possible. As a graphics driver, you should support all possibilities within the specification. You can't fail to compile a program, for example, just because it's not realistic or doesn't make sense.

No offense, but what about the Taylor series of this function:
float InvSqrt(float x) {
    float xhalf = 0.5f * x;
    int i = *(int*)&x;               // reinterpret the float's bits as an integer
    i = 0x5f3759df - (i >> 1);       // magic-constant initial guess for 1/sqrt(x)
    x = *(float*)&i;                 // reinterpret the bits back into a float
    x = x * (1.5f - xhalf * x * x);  // one Newton-Raphson refinement step
    return x;
}
It's creativity.

On the other hand, culling by developers themselves is much easier: everything is under their control. If someone doesn't make sense, you can just fire the guy.

Culling by devs helps a TBDR and an IMR equally.
 
Culling by devs helps a TBDR and an IMR equally.

True. But my point is: DX11 doesn't give you tile information when you are trying to do culling, so there's no way for a developer to cull a patch against a tile rect.

Think about a crazy idea: by offering a new SV_TileRect to the hull shader, a user would be able to cull the patch when the test against that rect fails. For a TBDR, this value is the dimensions of the current tile; for an IMR, it's the viewport size.

But no, there's no such chance for Larrabee and PowerVR. Larrabee now has to rely on itself to cull the patches, and do some hero programming to achieve that!
 
Think about a crazy idea: by offering a new SV_TileRect to the hull shader, a user would be able to cull the patch when the test against that rect fails. For a TBDR, this value is the dimensions of the current tile; for an IMR, it's the viewport size.
You can almost cull patches quite easily in the hull shader: just give them low tessellation factors, so they're cheap to discard.
 
You can almost cull patches quite easily in the hull shader: just give them low tessellation factors, so they're cheap to discard.

True, just use 0 and the GPU will discard the entire patch.

I said DX11 is an IMR API because, when a patch is replayed by a TBDR GPU, there's no way to know which tile we are currently in. Because of that, the developer can only cull the patch against the entire viewport, not the current tile.

That's why I suggest SV_TileRect as a system value: it gives developers a consistent way to help both IMRs and TBDRs cull a patch.
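To make the suggestion concrete, here is a hypothetical version of the same cull written against the proposed semantic, reusing the structs from the earlier sketch. SV_TileRect does not exist in DX11, and ConservativeScreenBound is an assumed helper that returns the displaced patch's bound as (xmin, ymin, xmax, ymax) in pixels:

// Hypothetical only: SV_TileRect is not part of DX11. Same hull-shader cull
// as before, but tested against the current tile instead of the viewport.
PatchConstants TileCullConstantsHS(InputPatch<ControlPoint, 16> cp,
                                   float4 tileRect : SV_TileRect) // (xmin, ymin, xmax, ymax) in pixels
{
    PatchConstants o;

    // Assumed helper: conservative screen-space bound of the displaced patch.
    float4 b = ConservativeScreenBound(cp);

    // On an IMR the driver would feed the whole viewport here, so one shader
    // works for both; on a TBDR each replay of the patch only keeps the work
    // for tiles it can actually touch.
    bool culled = (b.z < tileRect.x) || (b.x > tileRect.z) ||
                  (b.w < tileRect.y) || (b.y > tileRect.w);

    float f = culled ? 0.0f : gTessFactor;
    o.edges[0] = o.edges[1] = o.edges[2] = o.edges[3] = f;
    o.inside[0] = o.inside[1] = f;
    return o;
}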
 
Nick, while augmenting the throughput of the CPU cores (simple or complex ones), don't you think it could be interesting to keep a "tiny/lesser GPU" around?
I mean, you pointed out earlier in the thread that not so long ago vertices were still being handled by the CPU on Intel platforms. How about moving the pixel shading too (like DICE is doing on the PS3)?
Basically you put in a tiny GPU with, by today's standards, a "fucked up" ALU/TEX ratio (i.e. plenty of texturing power versus compute power). Assuming most 3D engines are moving to more and more deferred techniques, the GPU (a modern one) would act as a "deferred renderer accelerator"/"render target filler". Could that make sense?
As an intermediate step I believe it can indeed make sense to keep the IGP around for a while. Software is only slowly becoming more generic, and extending the vector processing capabilities of the CPU while retaining a minimal-cost but adequate IGP would be a really low-risk way to prepare for the future without compromising legacy graphics.

Note that AVX appears to be part of a bigger plan. Extending the CPU's SIMD capabilities to 256-bit must have been expensive both in terms of transistors and in terms of design time. But they've already specified FMA and the ability to extend it up to 1024-bit. So they don't appear to think the IGP is suitable for anything beyond legacy graphics. They must realize that running GPGPU applications on the IGP is an absolute joke. And making the IGP more flexible and powerful only converges it closer to the CPU architecture, while merely benefiting a handful of applications. Gradually making the CPU more powerful instead and reducing power consumption makes more sense in my opinion.

The beauty of the upgrade path for AVX is that there's not much need for Intel to rush things. They can observe how the software evolves, and carefully implement the features that make the most sense going forward.
 