Larrabee at GDC 09

However, I think it won't mean much to us consumers and gamers... It would have a huge impact on the professional market.

I don't think so; considering the abysmal quality of Intel's OpenGL drivers, I don't see them making inroads in the professional market soon, irrespective of how good the hardware is.
 
I don't think so; considering the abysmal quality of Intel's OpenGL drivers, I don't see them making inroads in the professional market soon, irrespective of how good the hardware is.

I'm under the impression that the LRB team has nothing to do with the chipset team.
 
They can claim that the majority of the fab investments were already written off by producing CPUs, a luxury you don't have with an external fab.

If you only take into account the cost of a raw silicon wafer and the cost of operating a fab (electricity, man-hours, maintenance, etc.), your chips will be very cheap indeed.

Reality is probably somewhere in the middle, I suppose, and impossible for outsiders to estimate.

The cost estimates for Itaniums at that die size already excluded the amortized fab overhead. They covered wafer, processing, packaging, and validation costs only.
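
For what it's worth, the usual back-of-the-envelope for the marginal number is just wafer processing cost over good dies. Every number below is an illustrative assumption, not an Intel figure:

```cpp
// Purely illustrative marginal-cost arithmetic -- all numbers here are assumptions:
// cost per good die = wafer processing cost / (gross dies per wafer * yield).
#include <cstdio>

int main() {
    const double wafer_cost = 5000.0;  // assumed processing cost per 300 mm wafer, USD
    const double gross_dies = 85.0;    // assumed die candidates per wafer for a large die
    const double yield      = 0.5;     // assumed yield
    std::printf("marginal cost per good die: ~$%.0f\n",
                wafer_cost / (gross_dies * yield));   // ~$118 under these guesses
}
```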
 


Any clue how strands and fibers are actually represented? Are they just concepts to guide developers or are they actually part of the programming model? Can I take a single hardware thread and do anything I want or does everything have to be broken down into fibers? What are we calling a "group of 16-strands" on Larrabee anyway? Is there an equivalent to Nvidia's warp?

I'm also curious as to how hardware and software switching are going to work together. Software switching is basically knowing you just asked for something that's gonna take a long time, so you switch to some other data item. But is it up to the developer to keep track of which fibers are waiting on what? And how does the hardware know that it's encountered an unpredictable short latency and that it should invoke a context switch instead of letting the thread switch to another fiber? Sorry for the barrage of questions, but it all seems like magic at the moment...
 
Fibers and strands are handy abstractions for gathering work into collections large enough to allow for efficient SIMD processing.

If the programmer does not rely on the abstractions provided by the software stack that uses fibers and strands as its base units, then tracking execution and stalls is up to the programmer.

The hardware is an x86 processor with 64-byte vector registers. What a programmer wants to do within that context doesn't need to be defined in terms of strands or fibers.
Software switching is something either the framework provides or a programmer working directly on the hardware can decide on.

Unpredictable short-latency events are one reason why there are multiple threads per core.
It's too much work for not enough gain to try and predict small stalls like that. Context switches only happen on more obvious long-latency events.
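
As a rough sketch of what the software side of that cooperative switching could look like (invented names and a simulated fetch; nothing here is Intel's actual stack): one hardware thread just round-robins over its fibers and skips any fiber whose outstanding fetch hasn't come back yet.

```cpp
// A minimal sketch of cooperative (software) switching with invented names -- not
// Intel's API. One hardware thread sweeps over its fibers; a fiber still waiting on a
// simulated texture fetch is skipped, so its latency hides behind other fibers' work.
#include <vector>

struct Fiber {
    int stage   = 0;   // how far this fiber's shader has progressed
    int pending = 0;   // simulated cycles left on its outstanding texture fetch
};

void run_thread(std::vector<Fiber>& fibers, int total_stages, int tex_latency) {
    for (bool busy = true; busy; ) {
        busy = false;
        for (Fiber& f : fibers) {
            if (f.stage == total_stages) continue;               // this fiber is finished
            busy = true;
            if (f.pending > 0) { --f.pending; continue; }        // fetch in flight: yield to next fiber
            // ...the 16 strands' vector ALU work for this stage would run here, in lockstep...
            ++f.stage;
            if (f.stage < total_stages) f.pending = tex_latency; // issue the next fetch, then yield
        }
    }
}
```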
 
First, I don't see any way to fill missing pages in real-time from disk to service one frame (the latency in the draw call would be horrid and stall future dependent calls).

That seems to be the basis for this Nvidia patent. Page misses on texture requests are serviced for subsequent frames but the current frame tries to find the next best texel that's available in local memory.
 
If the programmer does not rely on the abstractions provided by the software stack that uses fibers and strands as its base units, then tracking execution and stalls is up to the programmer.

Is there any example code out there as yet that uses those abstractions? Still can't wrap my head around the concepts :oops:

Unpredictable short-latency events are one reason why there are multiple threads per core.
It's too much work for not enough gain to try and predict small stalls like that. Context switches only happen on more obvious long-latency events.

So there are two measures of "long-latency" hiding? Switching between fibers in a thread and then switching out hardware threads? Guess I don't understand how the hardware knows when to do the latter.
 
Low-level detail on a lot of this hasn't been disclosed.

Software can try to hide latency as long as it remains on-chip. The static allocation of qquads to a strand (correction: fiber) to hide best-case texture latency is an example.

Anything that goes to memory is probably going to be too long for most types of software-based latency hiding.
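
To put rough numbers on that static allocation (purely assumed figures, since the real latencies haven't been disclosed):

```cpp
// Back-of-the-envelope sizing with assumed numbers (the real figures aren't public):
// give each fiber enough qquads that, by the time the last one has issued its texture
// fetch, the first fetch has already come back.
constexpr int kBestCaseTexLatency = 160;  // assumed cycles for a best-case texture lookup
constexpr int kCyclesPerQquad     = 20;   // assumed vector work issued per qquad before its fetch
constexpr int kQquadsPerFiber =
    (kBestCaseTexLatency + kCyclesPerQquad - 1) / kCyclesPerQquad;  // = 8 under these guesses
```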
 
That seems to be the basis for this Nvidia patent. Page misses on texture requests are serviced for subsequent frames but the current frame tries to find the next best texel that's available in local memory.

That is exactly what I personally would want from the hardware (from a developer perspective). I don't want to be checking if TEX instructions fail in the shader (that is insane IMO), and I also don't want my shader to have to be restarted with different MIP clamps per "subtile" to ensure efficient computation after a page fault (to ensure that either I don't have to check, or I don't get repeated page faults on future shader accesses).

Found an older thread on LRB and a final post from TomF on the Molly Rocket Forums:

"-In the second pass (which is now the only pass), you don't need 14(?) shader instructions and 2 texture reads, you just read the texture with a perfectly normal texture instruction. If it works, it gives the shader back filtered RGBA data just like a non-SVT system. If it doesn't (which we hope is rare), it gives the shader back the list of faulting addresses, and the shader puts them in a queue for later and does whatever backup plan it wants (usually resampling at a lower mipmap level)."

Note this says "texture instruction" (talking about doing mega textures on Larrabee compared to current GPUs). The question is: how exactly is the texture unit giving the shader a list of faulting addresses?

And later he writes,

"... My understanding is that Rage's lighting model is severely constrained because every texture sample costs them so much performance, they can only read a diffuse and a normal map - they just can't afford anything fancier..."

I just don't buy this latter comment. I'd bet, if they are limited to diffuse and normal, it's probably because of a lack of ability to store all the data (DVDs for the 360), or to decompress and recompress enough of it to service the real-time streaming requirements. Or a lack of ability to do high enough quality re-compression to pack more into two DXT5s. They should be able to get diffuse + monospec into one, and a two-channel normal with two channels for something else in the other (the Insomniac trick for detail maps)...

To be more clear on my original point, going back to the texture size limitation: DX11's 16384x16384 max isn't enough, even with virtual texturing, to do mega textures with a single texture. And with just two DXT5s that's 4GB of data, i.e. the full 32-bit address space. So likely you'd still need a level of indirection in the shader to get around this problem (beyond optionally dealing with software page faults). And this is exactly why I'm not sold on virtual paging for mega texturing, unless the card supports 64-bit addressing for texture memory, in which case I could split my megatexture up into many tiny mega textures and more draw calls.
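
For reference, that extra level of indirection is the usual software virtual texturing lookup; here's a sketch with invented names and sizes (nothing hardware-specific):

```cpp
// The usual software virtual texturing indirection, with invented names and sizes --
// just to show the extra lookup a shader needs when the virtual texture is bigger than
// anything the hardware (or a 32-bit address space) can map directly.
struct PageEntry { int tile_x, tile_y; };        // where this virtual page sits in the physical cache

struct VirtualTexture {
    int page_size = 128;                         // texels per page side (assumed)
    int pages_x = 2048, pages_y = 2048;          // 256K x 256K texel virtual size (assumed)
    const PageEntry* page_table;                 // one entry per virtual page, kept resident
};

// Translate virtual texel coordinates into coordinates inside the physical cache texture.
inline void virtual_to_physical(const VirtualTexture& vt, int vx, int vy, int& px, int& py) {
    const PageEntry& e = vt.page_table[(vy / vt.page_size) * vt.pages_x + (vx / vt.page_size)];
    px = e.tile_x * vt.page_size + (vx % vt.page_size);  // offset within the cached tile
    py = e.tile_y * vt.page_size + (vy % vt.page_size);
    // a real implementation would also track the resident mip level per page and clamp to it
}
```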

In light of the above problem, I think LRB's virtual texture paging would make more sense with a more classical engine "material system", like say Unreal's, where you still use tiled textures + a lightmap, or Uncharted's, with its use of dual tiled textures + a blend mask. But in this case, if my shader on LRB has to deal with page faults, I'd likely want to factor all that work out into a texture streamer and manually stream textures so I don't eat any unnecessary costs (i.e. page faults) during shading... even if just for the reason that I want a consistent cost to render when textures applied to surfaces get un-occluded.

But who knows, I might be singing another song if I was actually playing with the real hardware ;)
 
Note this says "texture instruction" (talking about doing mega textures on Larrabee compared to current GPUs). The question is: how exactly is the texture unit giving the shader a list of faulting addresses?
If it really gives all the addresses of individual texels, I'd assume it pushes them on the stack, letting you add a conditional branch for the exception (not a big deal if it's a rare occurrence) and some scalar code to deal with the faults.
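
Something along these lines, purely speculative and with invented types (the real mechanism hasn't been disclosed):

```cpp
// Speculative sketch of that pattern. The vector fetch reports a lane mask of faulting
// strands plus their addresses; the common case is one predictable branch, and a rare
// scalar loop queues the misses and falls back to a coarser, guaranteed-resident mip.
#include <cstdint>
#include <vector>

struct TexResult {
    float    rgba[16][4];      // filtered results for the strands that hit
    uint16_t fault_mask;       // bit i set => strand i's page wasn't resident
    uint64_t fault_addr[16];   // faulting addresses (valid where fault_mask is set)
};

std::vector<uint64_t> g_miss_queue;   // pages to stream in for a later frame

void resolve(const TexResult& r, float out[16][4]) {
    for (int s = 0; s < 16; ++s)
        for (int c = 0; c < 4; ++c)
            out[s][c] = r.rgba[s][c];

    if (r.fault_mask == 0) return;                 // fast path: everything was resident

    for (int s = 0; s < 16; ++s) {                 // rare scalar path over faulting strands
        if (!(r.fault_mask & (1u << s))) continue;
        g_miss_queue.push_back(r.fault_addr[s]);   // service it for a subsequent frame
        // ...resample strand s here at a lower, resident mip level as the backup plan...
    }
}
```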
 
Any clue how strands and fibers are actually represented? Are they just concepts to guide developers or are they actually part of the programming model?
It's a purely software concept to indicate a bit of parallel code working on unrelated data. As such it has no hardware equivalent, and there is no such thing as 'switching between strands' in hardware. Think of each strand as an iteration of a loop working on data which is already known to be in the L1, for example, or doing purely computational work and thus having a predictable execution latency (minus the variability introduced by hardware thread switching).
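
A toy way to picture that (nothing Larrabee-specific):

```cpp
// Toy illustration of the "strand = loop iteration" view: each of the 16 iterations is
// one strand, and the whole body maps onto a single 16-wide vector operation, so the
// strands advance together and there is nothing for the hardware to switch between.
void scale16(float* out, const float* in, float k) {
    for (int strand = 0; strand < 16; ++strand)   // 16 strands = 16 lanes of a 64-byte register
        out[strand] = in[strand] * k;
}
```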
 
Well, that photo is big enough to count the complete dies exactly, at least, but there's not enough surface detail to be sure it's LRB and not something else... anyway, there are 85 complete dies on that wafer, which works out to 625 mm² per die.
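
As a sanity check with the common gross-dies-per-wafer approximation (assuming a standard 300 mm wafer; only the 625 mm² figure comes from the count above):

```cpp
// Sanity check using the usual gross-dies-per-wafer approximation (wafer area over die
// area, minus an edge-loss term); the 300 mm wafer diameter is an assumption.
#include <cmath>
#include <cstdio>

int main() {
    const double pi = 3.14159265358979;
    const double d  = 300.0;   // wafer diameter, mm (assumed)
    const double S  = 625.0;   // die area, mm^2 (25 mm x 25 mm)
    const double gross = pi * d * d / (4.0 * S) - pi * d / std::sqrt(2.0 * S);
    std::printf("gross die candidates: %.0f\n", gross);   // ~86, in line with the 85 counted
}
```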
 
Well, that photo is big enough to count the complete dies exactly, at least, but there's not enough surface detail to be sure it's LRB and not something else... anyway, there are 85 complete dies on that wafer, which works out to 625 mm² per die.

He only held up two wafers during the keynote, so it's either one or the other. The one on the bench behind him does look like Jasper Forest, so it's pretty much settled in my book.
 