For 32 cores on the same die to all cooperate in a single virtual address space, there does need to be some kind of TLB fabric though, doesn't there? I suppose they're all caching the same page table, which implies changes to the page table have to be atomic, but that's a whole other kettle of fish. I don't know how the "Pentium core" that Larrabee's based on does this, or how scalable it is.
The page table is just data in memory, granted it has system-level significance.
The various cores and tex units might cache parts of it in their TLBs at any given time. There's no need in x86 for a TLB fabric, as TLBs are just caches that can be kept consistent much like their neighboring L1s and L2s.
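To make the "just data in memory" point concrete, here's a minimal sketch of a conventional 4-level x86-64 page walk. It ignores large pages and the permission bits, and phys_to_virt() is a made-up helper standing in for however the walker reads physical memory; nothing here is specific to Larrabee, it's just the generic structure the TLBs end up caching.

```c
#include <stdint.h>

#define PTE_PRESENT  0x1ULL
#define ADDR_MASK    0x000FFFFFFFFFF000ULL

/* Assumed helper: map a physical table address to something we can read. */
extern uint64_t *phys_to_virt(uint64_t phys);

/* Returns the physical address for 'vaddr', or -1 on a not-present entry.
 * 4KiB pages only; large pages and access checks are left out. */
uint64_t walk(uint64_t cr3, uint64_t vaddr)
{
    uint64_t table = cr3 & ADDR_MASK;
    /* Shifts for the PML4, PDPT, PD and PT index fields. */
    static const int shift[4] = { 39, 30, 21, 12 };

    for (int level = 0; level < 4; level++) {
        uint64_t idx = (vaddr >> shift[level]) & 0x1FF;
        uint64_t pte = phys_to_virt(table)[idx];
        if (!(pte & PTE_PRESENT))
            return (uint64_t)-1;            /* a real walker would fault here */
        table = pte & ADDR_MASK;            /* next table, or final page frame */
    }
    return table | (vaddr & 0xFFF);         /* page frame base + page offset */
}
```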
Since it is system-critical, changing the page table requires additional work that ends up serializing. As AMD's Barcelona chip showed, there are a number of TLB-related actions that are assumed to be atomic operations on page table structures, and failing to handle them atomically is basically game over.
It might be fun to see how much work Intel has done in verifying that Larrabee can handle TLB updates across so many cores, given what AMD experienced with just four. Documented TLB-related errata concerning non-atomic updates are still present in both Intel and AMD designs, but they are typically worked around by microcode patches.
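The classic example of an update that has to be atomic is setting the accessed/dirty bookkeeping bits in a PTE. Here's a rough sketch of how an agent might do that with a compare-and-swap; the bit positions are the standard x86 ones, but mark_referenced() itself is hypothetical, not anything documented for Larrabee or its texture units.

```c
#include <stdint.h>
#include <stdatomic.h>

#define PTE_ACCESSED 0x20ULL
#define PTE_DIRTY    0x40ULL

/* Mark a page table entry accessed (and dirty on a write) as an atomic
 * read-modify-write, so a concurrent walker or OS update isn't lost.
 * 'pte' points at the entry in the in-memory page table. */
void mark_referenced(_Atomic uint64_t *pte, int is_write)
{
    uint64_t old  = atomic_load(pte);
    uint64_t bits = PTE_ACCESSED | (is_write ? PTE_DIRTY : 0);

    while ((old & bits) != bits) {
        /* CAS retries if another agent changed the entry in the meantime. */
        if (atomic_compare_exchange_weak(pte, &old, old | bits))
            break;
    }
}
```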
It would be interesting to see just how much the texture units can modify the page tables. Maybe they can update some of the bookkeeping bits per entry. If they write results into a buffer in the L2, the entry might have to be marked dirty, or maybe it has to be initialized to a fixed status by the core beforehand.
That the texture unit gives up on a fault makes sense as modifying the actual page table is an OS-level operation.
I would assume Larrabee's software sets up as much as it can ahead of time and tries to keep it as unchanged as possible, given the overhead.
Slides 12 and 33 here:
http://s08.idav.ucdavis.edu/forsyth-larrabee-graphics-architecture.pdf
indicate that there are TU TLBs, which allow the texture unit to work independently of the core, fetching page table entries if need be and managing its own page load requests. The owning thread is oblivious to all of this (though the core may receive mirrored TLB entries if they're changed?) and just ends up stalling when the texture results don't appear.
I wouldn't expect the core to receive mirrored TLB entries if the texture unit somehow modifies them.
An alteration would invalidate all cached copies of the page table entry. The CPU, if it were to require that entry, would miss and have to fill the TLB before trying to complete the memory access.
The entry should still be on-chip, as the texture unit would have a copy. It might be that the texture units mostly leave entries unmodified, precisely to avoid this kind of TLB thrashing.
Alternatively, the programmer can elect to have hard faults activated, which seems to mean that the TU-TLB is a mirror of the core-TLB.
My take was that texture units have more specialized TLB behavior that allows them to behave in a rather non-x86 manner. The texture unit can, at programmer discretion, give up when a full core would be required to service a miss or fault.
This might make good performance sense, as fiddling with the TLB can inject unpredictable latencies.
I would think that the texture unit with hard faults enabled would still defer to the core if it encounters a fault that invokes an OS routine.
One of the interesting things here is that, when hard-faulting is active, a TU-TLB is logically a mirror of eight cores' TLBs. And each core's TLB is logically 4-way threaded too (though I'm not sure if Larrabee actually threads the core-TLB). Whether hard- or soft-faulting, there are 32 cores and 8 TUs all cooperating in page table maintenance.
I didn't get the impression from the PDF that there was any mirroring of TLBs. What leads to that conclusion? TLB entries can be shared, but an update to one TLB's entry will typically only lead to the invalidation of the old copies cached elsewhere; broadcasting the update wouldn't normally be done. The other cores would just have to service a miss if they happen to need the address.
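As a rough sketch of what that invalidate-rather-than-broadcast flow usually looks like (a TLB shootdown), assuming made-up helpers invlpg_local() and send_invalidate_ipi() standing in for the real instruction and interrupt machinery, and with fencing/ordering details left out:

```c
#include <stdint.h>

extern void invlpg_local(uint64_t vaddr);                 /* drop our own TLB entry */
extern void send_invalidate_ipi(int cpu, uint64_t vaddr); /* ask 'cpu' to do the same */
extern int  num_cpus(void);
extern int  this_cpu(void);

/* Update a mapping: write the new entry to memory, then invalidate stale
 * cached copies. Nobody pushes the new value anywhere; other cores refill
 * from the page table only if and when they touch 'vaddr' again. */
void update_pte(volatile uint64_t *pte, uint64_t new_val, uint64_t vaddr)
{
    *pte = new_val;              /* new mapping is now in the in-memory page table */
    invlpg_local(vaddr);         /* our TLB may hold the old translation */

    for (int cpu = 0; cpu < num_cpus(); cpu++) {
        if (cpu != this_cpu())
            send_invalidate_ipi(cpu, vaddr);   /* others just invalidate */
    }
}
```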
TLBs are typically a per-core resource shared by that core's threads. Given how intimately they are tied to the memory pipeline, and how many there would be if they were per-thread, I'd bet this hasn't changed.