Larrabee at GDC 09

This was supposed to be a major new feature of DX10 (the most important one, according to Tim Sweeney, for example). More than two years have passed, and paged texture memory is still a promise for the future?

As a side note, only just now does the CUDA 2.2 beta provide direct host memory access: GT200 through the PCIe bus, and MCP79 via direct access. Limited by ~6 GB/s of PCIe bandwidth (low) and latency (high). Not sure if newer NV hardware has device/host page granularity (i.e. a texture could have parts on device and parts on host). Regardless, I think the usefulness of paged texture memory is marginalized by the need for software to page to and from disk, high latency, and limited PCIe bandwidth. Perhaps there is a good reason WDDM v2 has been quiet.
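
For concreteness, a minimal host-side sketch of that CUDA 2.2 mapped ("zero-copy") host memory path, assuming a device that reports canMapHostMemory; the size and the omitted kernel are placeholders:

Code:
// Minimal sketch of CUDA 2.2 mapped ("zero-copy") host memory, host side only.
// Error checking trimmed; the kernel that would consume devPtr is omitted.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    if (!prop.canMapHostMemory) { std::printf("no mapped host memory\n"); return 1; }

    cudaSetDeviceFlags(cudaDeviceMapHost);     // must precede context creation

    void* hostPtr = 0;
    cudaHostAlloc(&hostPtr, 16 << 20, cudaHostAllocMapped);  // page-locked, mapped

    void* devPtr = 0;
    cudaHostGetDevicePointer(&devPtr, hostPtr, 0);
    // Kernels dereference devPtr directly, paying PCIe latency/bandwidth
    // per access (the ~6 GB/s ceiling mentioned above).

    cudaFreeHost(hostPtr);
    return 0;
}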
 
Loading on page misses is not how you would use virtual memory on the GPU in games; that's more something for time-multiplexed independent applications trying to share the GPU. In games/rendering you would use virtual memory to efficiently store sparsely populated textures. The obvious example is megatextures. An extra layer of indirection in the shader is never going to be as efficient as a TLB.
 

As for megatextures, the idea is one huge virtual texture, which isn't doable with virtual paging alone because of texture size limitations (filtering); in that case the level of indirection would still be necessary. I guess one might be able to get around this with texture arrays.
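
To make the indirection cost concrete, here's roughly what that extra layer looks like per fetch, written as plain C++ rather than shader code; the page size and table dimensions are invented for illustration:

Code:
// The software translation a megatexture shader does before every real
// texture fetch, i.e. the work a hardware TLB would do for free.
#include <cstdint>

struct PageEntry {
    uint16_t physX, physY;   // where this page lives in the physical pool
};

const int PAGE_SIZE = 128;            // texels per page side (illustrative)
PageEntry pageTable[1024][1024];      // one entry per virtual page

// One extra dependent read (pageTable) before the actual texel fetch.
// A real implementation also tracks the finest resident mip per page and
// scales the coordinates accordingly.
void translate(int vx, int vy, int& px, int& py) {
    const PageEntry& e = pageTable[vy / PAGE_SIZE][vx / PAGE_SIZE];
    px = e.physX * PAGE_SIZE + vx % PAGE_SIZE;
    py = e.physY * PAGE_SIZE + vy % PAGE_SIZE;
}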

Not sure exactly how LRB is doing texture page misses, but aren't you still going to need a level of indirection to choose to sample lower mips on a page fail? If a texture page miss is an interrupt, that is going to get awfully scary for real-time...
 
A page miss on the x86 cores will behave like any other memory access.

The texture units have TLBs, but they do not handle page misses. They have to refer back to the core to get the page loaded.
 
So for LRB then, the texture fetch instruction would throw an exception, which would fault the shader, and end up in an interrupt handler?

If so I'd speculate that in the case of virtual texturing, the interrupt handler would toss the page onto a page fault list (to be handled later). Then for the rest of the "subtile" it would force a mip level cap for the faulted texture to ensure the page fault didn't happen again. This would have to be at some "subtile" granularity to ensure future fetches didn't kill performance with interrupts...
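
Something like this, purely as a sketch of that speculation (every structure and granularity here is invented):

Code:
// Hypothetical fault handler: queue the missing page for background loading
// and clamp the mip for that subtile so the same fault can't re-fire on
// every subsequent fetch.
#include <vector>
#include <cstdint>

struct PageId { uint32_t texture, mip, x, y; };

std::vector<PageId> faultList;   // drained later by a streaming thread
uint8_t mipClamp[64][64];        // per-subtile mip cap (illustrative size)

void onTexturePageFault(const PageId& page, int tileX, int tileY) {
    faultList.push_back(page);   // defer the actual load
    // Force later fetches in this subtile down to a mip known to be resident.
    uint8_t cap = uint8_t(page.mip + 1);
    if (mipClamp[tileY][tileX] < cap)
        mipClamp[tileY][tileX] = cap;
}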
 

From Larrabee: A Many-Core x86 Architecture for Visual Computing
Cores pass commands to the texture units through the L2
cache and receive results the same way. The texture units perform
virtual to physical page translation and report any page misses to
the core, which retries the texture filter command after the page is
in memory.
 
So for LRB then, the texture fetch instruction would throw an exception, which would fault the shader, and end up in an interrupt handler?
I haven't seen an outline of how the cores control the texturing units. I didn't see any instruction on the Larrabee vector ISA list that would match such an event.

The exact control method is not mentioned, but perhaps a texturing section is a software loop that sends commands to the texture unit and checks a structure for access problems.
If there is an unsuccessful access, the code then tells the core itself to perform a memory read that would lead to a page miss, which would then set off a standard x86 handler.
After that is done, the texture loop would resend to the texture unit.
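
In code, that guess might look something like this (nothing below is from Intel documentation; the command/result structures and submitToTextureUnit are invented):

Code:
// Speculative core-side loop: submit a filter command through the shared
// L2 structure; on a reported page miss, touch the faulting address so the
// standard x86 page fault path maps the page, then resend, matching the
// "retries the texture filter command" wording in the paper.
struct TexCommand { /* texture id, coords, lod, ... */ };
struct TexResult  { bool pageMiss; void* missAddress; /* filtered texels */ };

extern TexResult submitToTextureUnit(const TexCommand& cmd);  // hypothetical

TexResult fetchTexels(const TexCommand& cmd) {
    for (;;) {
        TexResult r = submitToTextureUnit(cmd);
        if (!r.pageMiss)
            return r;
        // An ordinary read on the core: faults, and the normal x86
        // handler brings the page into memory.
        volatile char touch = *static_cast<char*>(r.missAddress);
        (void)touch;
        // Loop around and resend the command to the texture unit.
    }
}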
 
Not sure exactly how LRB is doing texture page misses, but aren't you still going to need a level of indirection to choose to sample lower mips on a page fail? If a texture page miss is an interrupt, that is going to get awfully scary for real-time...
Prefetching has to take care of it 99% of the time; load-on-page-miss has to be a rare corner case or it would just get too slow.
 
The hardware thread on the core that generated the texture instruction is, generally speaking, running a loop over multiple fibres (qquads). So when one fibre generates a texture request it's likely that the next fibre in the loop will generate a texture request within a very short time ... and so on for all the fibres in that hardware thread.

Once the hardware thread has issued all its texture requests and exhausted all "shader ALU instructions" that could run in the shadow of those requests, it has to relinquish control of the core to the other hardware threads, i.e. go to sleep.

When the texture unit reports that it's encountering misses, I guess the relevant core marks some kind of "waiting for slow texture results" status on the thread that generated the requests. This prevents the core from returning the dozing thread to context until there are results it can use.

If the texture unit had texels already in cache and was able to return results to the originating thread without the originating thread having to go to sleep, I guess it raises a simple "texture results ready" flag.

Alternatively, I suppose, the originating thread could request that a simple watchdog with sleep intervals is set up, to poll the status of texturing. The core could then adjust its polling interval when told that page misses have occurred. Polling would run on the "control thread" that runs on each core, independent of the shader threads. This control thread is the same thread that generates qquads, performs interpolation, etc.
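
A minimal sketch of the fibre loop being described, with every name invented here (issueTextureRequest etc. stand in for whatever the real mechanism is):

Code:
// Each hardware thread walks its fibres, issues one texture request per
// fibre, runs whatever ALU work is independent of the results, then yields
// the core ("goes to sleep") until results arrive.
struct Fibre { bool resultReady; /* qquad state */ };

extern void issueTextureRequest(Fibre& f);                // hypothetical, via L2
extern void runIndependentAluWork(Fibre* fibres, int n);  // hypothetical
extern bool allResultsReady(const Fibre* fibres, int n);  // hypothetical
extern void yieldToOtherHardwareThreads();                // hypothetical

void shaderThreadStep(Fibre* fibres, int n) {
    for (int i = 0; i < n; ++i)
        issueTextureRequest(fibres[i]);    // back-to-back requests
    runIndependentAluWork(fibres, n);      // work in the fetches' shadow
    while (!allResultsReady(fibres, n))
        yieldToOtherHardwareThreads();     // relinquish the core
    // ... consume texture results and continue the shader ...
}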

Jawed
 
First, I don't see any way to fill missing pages in real-time from disk to service one frame (the latency in the draw call would be horrid and would stall future dependent calls). I'm speaking from experience with the texture streamer I work on. So either we are talking about decompressing/recompressing on the fly, from a more highly compressed format in memory to a lower compression usable by the texture units, or about procedurally generated textures. In any case the idea of pure virtual textures requires the assumption that some lower mip level is resident, and switching to that lower mip level to continue the draw call.

One wouldn't want to actually load the missed page in that draw call. IMO this would have to be a background process, and it would even be a good idea with procedurally generated content, to amortize that generation cost over a few frames.
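
As a sketch of that background process (the per-frame budget and helper functions are invented for illustration):

Code:
// Faults recorded during rendering are serviced a few per frame, so no
// draw call ever blocks on a page load; generation/transcode cost is
// amortized across frames.
#include <deque>
#include <cstdint>

struct PageRequest { uint32_t texture, mip, x, y; };

std::deque<PageRequest> pending;   // filled by the fault handler

extern void loadOrGeneratePage(const PageRequest& r);  // hypothetical: disk, transcode, or procedural
extern void publishPage(const PageRequest& r);         // hypothetical: update page table / mip clamp

const int PAGES_PER_FRAME = 4;     // illustrative budget

void streamTick() {                // called once per frame
    for (int i = 0; i < PAGES_PER_FRAME && !pending.empty(); ++i) {
        PageRequest r = pending.front();
        pending.pop_front();
        loadOrGeneratePage(r);
        publishPage(r);            // from now on fetches hit the new page
    }
}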

Having to produce the page on a texture fetch fault just seems like a bad idea to me. If it is the case that a page fault has to be a rare case, then from the developer's perspective pure paged virtual texturing wouldn't seem like a good idea.

So if texture requests go through the L2, then we are talking locked L2 cache lines and effectively memory-mapped I/O to the texture unit?

Wouldn't seem like a good idea to have to check for texture fetch fails by looking at L2 results in a shader. So perhaps the texture unit itself throws the exception, which in turn faults one of the cores?
 
If it is the case that a page fault has to be a rare case, then from the developer's perspective pure paged virtual texturing wouldn't seem like a good idea.
With or without hardware support megatextures have pages ... with hardware support they are accessed efficiently, without hardware support less efficiently. It's hardly the only application of sparse textures; for instance, you don't want to store a photon map in a 3D texture at a uniform resolution in memory ... you want to store it sparsely.

Being able to store sparse arrays without having to pull tricks with extra layers of indirection in the shader is just a nice ability to have.
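
For the photon map example, the usual software stand-in for a sparse 3D texture is a spatial hash that stores only occupied cells; a minimal sketch (the hashing constants are the well-known spatial-hashing primes, everything else is illustrative):

Code:
// Sparse storage: only cells that actually contain photons exist in the
// map, instead of a dense uniform-resolution volume.
#include <unordered_map>
#include <vector>
#include <cstdint>

struct Photon { float power[3]; /* position, direction, ... */ };

static uint64_t cellKey(int x, int y, int z) {
    // Classic spatial-hashing mix (Teschner et al. primes).
    return uint64_t(x) * 73856093ull ^ uint64_t(y) * 19349663ull
         ^ uint64_t(z) * 83492791ull;
}

std::unordered_map<uint64_t, std::vector<Photon>> grid;  // occupied cells only

void insertPhoton(float px, float py, float pz, const Photon& ph, float cell) {
    grid[cellKey(int(px / cell), int(py / cell), int(pz / cell))].push_back(ph);
}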
 
http://74.125.77.132/translate_c?hl...wafer/&usg=ALkJrhg3XB_p0k-EbK3dfqBm1W0uZUqXhQ

'First Larrabee wafer' - don't know if something got lost in translation or what. Not sure if I have seen those slides before either; they're from IDF, might be new.

This is actually Jasper Forest; Computerbase got the wrong picture hehe
I was at IDF and I can say, however, that Gelsinger also briefly showed a Larrabee wafer. I'll post the pic I got (bad quality, sadly) when I'm done with my report about the event. Larrabee is much bigger than that ;)
 

Yeah, I thought there was something off - even if it's at 45 nm you could fit three to four of those on a GT200. :LOL:

I can't believe there's an IDF going on and NONE of the big sites are covering it. Not even posting third party info. It's like they've all signed contracts to pretend it doesn't exist. :???:
 

Cost reduction: Intel didn't invite the international press to this IDF.
 
This is actually Jasper Forest; Computerbase got the wrong picture hehe
I was at IDF and I can say, however, that Gelsinger also briefly showed a Larrabee wafer. I'll post the pic I got (bad quality, sadly) when I'm done with my report about the event. Larrabee is much bigger than that ;)

Any quick estimates on the number of dies per wafer or a ballpark figure on the die area?
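
For a back-of-envelope answer, the standard gross dies-per-wafer estimate is wafer area over die area minus an edge-loss term; the numbers plugged in below are placeholders, not real Larrabee figures:

Code:
// Gross dies per wafer (before yield): pi*r^2/A - pi*d/sqrt(2*A),
// where d is the wafer diameter and A is the die area.
#include <cmath>
#include <cstdio>

const double PI = 3.14159265358979;

int grossDiesPerWafer(double waferDiameterMm, double dieAreaMm2) {
    double r = waferDiameterMm / 2.0;
    return int(PI * r * r / dieAreaMm2
             - PI * waferDiameterMm / std::sqrt(2.0 * dieAreaMm2));
}

int main() {
    // e.g. a 300 mm wafer and a hypothetical 600 mm^2 die -> ~90 gross dies.
    std::printf("%d\n", grossDiesPerWafer(300.0, 600.0));
    return 0;
}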
 
LRB may be a wonderful DX10 GPU, but do you think it will work well as a DX11 GPU? I am referring to tessellation specifically. I can't see how they will be able to run the tessellation efficiently in software. (Though many said the same about rasterization :) ) I guess they'll add some new instructions to the ISA.
 
I still can't believe they censored the die with a PowerPoint slide in the webcast. Black Project!

It better be amazing with this amount of skulking.
 
This is when you wished you'd brought a 50MP digicam with serious zoom, eh?

If that's really Larrabee then, ahem, 32nm can't come soon enough, eh? Maybe that's a prototype built on 90nm :p

Jawed
 