I was just thinking of mirroring, or marking entries as dirty, as traffic crossing the ring bus.
The Pentium caches wouldn't push data updates, as far as I know.
Modifying a page table entry would follow an invalidation of all other caches' copies, along with a write of the new data to main memory.
Without some other modification to the "it's x86" mantra, the traffic over the bus would be "pull" traffic where other cores or texture units generate TLB misses and then need to pull the latest version in.
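To make that concrete, here's roughly how I picture the invalidate-then-pull pattern as a C sketch. All the function names (write_pte_to_memory, broadcast_tlb_invalidate, and so on) are made up for illustration; they aren't real x86 or Larrabee interfaces.

```c
/* Sketch of "invalidate, then let everyone pull" for page table updates.
 * Every helper declared extern here is hypothetical. */

#include <stdint.h>

typedef uint64_t pte_t;

extern void  write_pte_to_memory(uintptr_t vaddr, pte_t new_pte); /* write-back to RAM */
extern void  broadcast_tlb_invalidate(uintptr_t vaddr);           /* other cores/TUs drop their copies */
extern int   tlb_lookup(uintptr_t vaddr, pte_t *out);             /* local TLB probe */
extern pte_t walk_page_table(uintptr_t vaddr);                     /* slow path to memory */

/* Writer side: the owning core pushes nothing but the invalidation;
 * the new entry only lives in memory until somebody asks for it. */
void update_page_table_entry(uintptr_t vaddr, pte_t new_pte)
{
    write_pte_to_memory(vaddr, new_pte);   /* make the new entry globally visible */
    broadcast_tlb_invalidate(vaddr);       /* everyone else's cached copy is now gone */
}

/* Reader side (core or texture unit): the "pull" traffic happens here,
 * only when a TLB miss forces a refill. */
pte_t translate(uintptr_t vaddr)
{
    pte_t entry;
    if (tlb_lookup(vaddr, &entry))
        return entry;                      /* hit: no bus traffic */
    entry = walk_page_table(vaddr);        /* miss: pull the latest version in */
    /* (insert the entry into the local TLB here) */
    return entry;
}
```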
Also I was thinking of:
Virtual memory fragment aware cache
specifically figure 8 (paragraphs 109 onwards), where the TLBs are communicating with each other via L2. Also note this talks about 8 page table contexts, although distribution of contexts by type of client may not marry well with my thoughts on dividing contexts across tasks/cores.
It's an interesting way of compressing page table data to prevent excessively redundant page table entries from overwhelming the small TLBs.
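Roughly, as I read it, one entry can cover a whole power-of-two run of contiguously mapped pages. A sketch of what such an entry might look like follows; the field names and the 4 KiB base page size are my own assumptions, not the patent's exact layout.

```c
/* Fragment-aware TLB entry: one entry answers translations for
 * 2^log2_pages contiguous pages, so a small TLB isn't flooded with
 * redundant per-page entries.  Illustrative only. */

#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT 12  /* 4 KiB base pages assumed */

typedef struct {
    uint64_t virt_base;   /* start of the fragment, fragment-aligned */
    uint64_t phys_base;   /* physically contiguous backing */
    unsigned log2_pages;  /* fragment covers 2^log2_pages pages */
    bool     valid;
} frag_tlb_entry_t;

static bool frag_translate(const frag_tlb_entry_t *e, uint64_t vaddr, uint64_t *paddr)
{
    uint64_t span = (uint64_t)1 << (PAGE_SHIFT + e->log2_pages);
    if (!e->valid || vaddr < e->virt_base || vaddr >= e->virt_base + span)
        return false;                       /* miss: outside this fragment */
    *paddr = e->phys_base + (vaddr - e->virt_base);
    return true;                            /* hit anywhere inside the fragment */
}
```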
The communication through L2 is something similar to the way Larrabee has local L2 subsets per core, with a core treating every other core's subset as an L3.
It's not a proactive form of communication, just that every L2 is allowed to fall back to the shared superset of all other L2s.
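Something like this is how I picture that fallback path, purely as an illustration; the probe function and the linear scan just stand in for whatever the ring and tag lookup actually do.

```c
/* Sketch of "my L2 subset first, everyone else's subsets as an L3".
 * NUM_CORES and both helpers are assumptions, not real interfaces. */

#include <stdint.h>
#include <stdbool.h>

#define NUM_CORES 32

extern bool probe_l2_subset(int core, uint64_t addr, void *line_out); /* hypothetical */
extern void fetch_from_memory(uint64_t addr, void *line_out);          /* hypothetical */

void load_line(int my_core, uint64_t addr, void *line_out)
{
    /* 1. Local subset: the only place data is kept proactively. */
    if (probe_l2_subset(my_core, addr, line_out))
        return;

    /* 2. Fall back to the shared superset of all other subsets,
     *    treating them collectively as an L3. */
    for (int c = 0; c < NUM_CORES; c++) {
        if (c != my_core && probe_l2_subset(c, addr, line_out))
            return;
    }

    /* 3. Nothing anywhere on the ring: go to memory. */
    fetch_from_memory(addr, line_out);
}
```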
Well, in theory Intel knows how to do that.
The concepts are well-understood. Validating them over coherent caches, many memory clients, and other system details does get more expensive.
So far, no implementation on the most modern cores has shipped without bugs, though they can usually be caught by microcode patches.
Larrabee's core is not the most modern core, and its baseline infrastructure is an FSB and a single core per socket. There would be extra effort to make sure there wasn't some nasty surprise hiding in the melding of a new interconnect with the older core, particularly since incorrect TLB behavior that can't be patched would be a very bad thing.
One thing occurred to me: AMD and NVidia are effectively going to spend billions on GPU design over the next 5 years, with iterations of refinement towards cGPU. How much Larrabee investment over the same time is required?
That would depend significantly on how we account for the research Intel has put into its many other manycore initiatives.
Nobody has a good answer right now, so they're all spending massive amounts to find an acceptable one.
Also, when a TU faults and the owning thread stalls/sleeps, this will presumably start to cause the L2 lines preferred by that thread to leak away in favour of threads that are active. On the other hand, maybe the entire core quickly falls inactive, since textures are usually pretty coherent and these higher mip levels will affect a lot of pixels in the core's screen-space tile (probably all of them?).
There's probably a pool of threads pinned to each core that is >4 for this reason.
The slides indicate Larrabee uses a scheme similar to Niagara, with modified round-robin threading.
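Something along these lines, I imagine, where a stalled thread is simply skipped each issue slot. The thread count and the ready flags are assumptions for the sketch.

```c
/* Sketch of Niagara-style "modified round-robin" threading: cycle
 * through the core's hardware threads, skipping any that are stalled
 * (e.g. asleep on a texture/TLB fault).  Illustrative only. */

#include <stdbool.h>

#define HW_THREADS 4  /* assumed threads per core */

typedef struct {
    bool ready[HW_THREADS];  /* false while a thread sleeps on a long-latency event */
    int  last_issued;
} core_sched_t;

/* Returns the next thread to issue from, or -1 if the whole core is
 * idle (the "entire core quickly falls inactive" case above). */
int pick_next_thread(core_sched_t *s)
{
    for (int i = 1; i <= HW_THREADS; i++) {
        int t = (s->last_issued + i) % HW_THREADS;
        if (s->ready[t]) {
            s->last_issued = t;
            return t;
        }
    }
    return -1;  /* every thread is stalled */
}
```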
At the very least the cores' TLB updates need to be mirrored into TUs though, otherwise they'll be spending a lot of time stale (I suppose it depends on how quickly the TUs exhaust the content of higher MIP level pages). One-way mirroring, from cores to TUs?
The architecturally simplest answer would be to do what x86 does already and invalidate the TU's PTEs.
I think having stale copies anywhere treated as if they were valid is a potential system crash.
However, since the texture units have their own TLB hardware and are allowed to perform fills on their own, mirroring would be more complicated.
Did you mean soft faults? Hard faults make the TU always defer, don't they?
My interpretation is that setting a texture unit for soft faults means that it will send a message back to the core that something went wrong with paging, but would not generate an actual fault, leaving it up to the shader code to decide what to do next.
With hard faulting, the texture unit would defer also, but this time to the core's fault handler.
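From the shader's point of view, I imagine the two modes looking roughly like this; tex_sample(), its status code, and fallback_color() are invented for the sketch, the point is just who decides what happens next.

```c
/* Sketch of soft vs. hard texture faulting as seen by shader code.
 * All names and signatures here are hypothetical. */

typedef enum { TEX_OK, TEX_SOFT_FAULT } tex_status_t;

typedef struct { float r, g, b, a; } color_t;

extern tex_status_t tex_sample(int texture, float u, float v, color_t *out); /* hypothetical */
extern color_t      fallback_color(int texture);                             /* hypothetical */

/* Soft-fault mode: the TU just reports that paging went wrong and the
 * shader decides, e.g. substitute a resident lower mip or a flat color. */
color_t shade_soft(int texture, float u, float v)
{
    color_t c;
    if (tex_sample(texture, u, v, &c) == TEX_SOFT_FAULT)
        c = fallback_color(texture);   /* shader's choice of recovery */
    return c;
}

/* Hard-fault mode: the TU defers to the core's fault handler instead;
 * by the time the sample returns, the page is resident and the shader
 * never sees the fault. */
color_t shade_hard(int texture, float u, float v)
{
    color_t c;
    (void)tex_sample(texture, u, v, &c);  /* any fault was serviced by the core */
    return c;
}
```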
My earlier statement was colored by my impression that it might also be possible to disable the texture unit's ability to automatically fill its TLB, and just require it to defer back to a shader.
It might keep the texture unit's behavior more deterministic, since invoking a TLB fill is a hardware exception that blocks all use of the memory pipeline while in progress, but on reflection the complexity might not be worth it.