Larrabee delayed to 2011?

Because in certain cases it is fantastically useful, think big shared data structures where reads/queries vastly outnumber updates.
Snooping will always fuck things up ... software managed directories are the way to go, not hardware managed ones.

Roll it into the page handling perhaps?
 
Snooping will always fuck things up ... software managed directories are the way to go, not hardware managed ones.

Right, because a hw cache controller doesn't know (cannot know without sw help?) that a particular cacheline will be used mostly for reads, with relatively infrequent writes, and hence that it could keep the same cacheline in many cores.
 
Right, because a hw cache controller doesn't know (cannot know without sw help?) that a particular cacheline will be used mostly for reads, with relatively infrequent writes, and hence that it could keep the same cacheline in many cores.

This is the scenario where caches shine. Frequent reads and infrequent writes.

The problem is when cache coherency is used for IPC. Producers writing to queues cause invalidate traffic, and the subsequent consumer requests then cause queries for the line. That's a lot of wasted traffic to just get a few bytes from core 0 to core 1.

The solution is either explicit message queues or virtual channels in the memory system. Both require more hardware support.
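
Rough C sketch of the queue case I mean (names and layout are hypothetical, just to show where the coherence traffic comes from):

Code:
#include <stdatomic.h>
#include <stdint.h>

#define QSIZE 256

/* Hypothetical single-producer/single-consumer queue in shared memory. */
struct spsc_queue {
    _Atomic uint32_t head;          /* written by producer, read by consumer */
    _Atomic uint32_t tail;          /* written by consumer, read by producer */
    uint32_t         slots[QSIZE];  /* payload lines bounce the same way     */
};

/* Every store to q->head invalidates the consumer's cached copy of that
   line; the consumer's next load then has to re-fetch it across the
   interconnect -- two coherence transactions to move a few bytes. */
int enqueue(struct spsc_queue *q, uint32_t v)
{
    uint32_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
    uint32_t t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (h - t == QSIZE)
        return 0;                   /* full */
    q->slots[h % QSIZE] = v;
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return 1;
}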

Cheers
 
This is the scenario where caches shine. Frequent reads and infrequent writes.

The problem is when cache coherency is used for IPC. Producers writing to queues cause invalidate traffic, and the subsequent consumer requests then cause queries for the line. That's a lot of wasted traffic to just get a few bytes from core 0 to core 1.

The solution is either explicit message queues or virtual channels in the memory system. Both require more hardware support.

Cheers
My understanding is that a cacheline in a cpu (since they are r/w there) can only be owned by one core. If many cores read the same data, then there will be a lot of traffic just to push stuff around.

As for amount of hw support, it seems from that ppt that sw managed cache coherency costs less hw than present day caches.
 
This is the scenario where caches shine. Frequent reads and infrequent writes.

In this scenario caches shine, hw cache-coherency doesn't.

In the sw managed world, that page could be marked as readonly and then could be shared safely across many cores without the coherency traffic.
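
Something like this at the user level, as a very rough sketch (the real mechanism would live in the coherence/page-handling layer, this only illustrates the read-only marking):

Code:
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

/* Hypothetical: publish a page-aligned region of read-mostly shared data.
   Once it is read-only, every core can keep its own cached copy; a stray
   write traps (SIGSEGV) and the sw coherence layer can deal with it, rather
   than the hw broadcasting an invalidate on every store. */
void publish_readonly(void *page, size_t len)
{
    if (mprotect(page, len, PROT_READ) != 0)
        perror("mprotect");
}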
 
So anyone know how fast Larrabee was in the sparse matrix multiplication from that video?

Also, do you think Rattner was a bit embarrassed to be hyping 1TFLOPS achieved SGEMM when ASCI Red was DGEMM?

Jawed
 
It's already embarrassing when a CEO of any firm doesn't have a clue about the technology of the firm he's managing.
 
So anyone know how fast Larrabee was in the sparse matrix multiplication from that video?

IIRC it's something like 8GFLOPS.

Also, do you think Rattner was a bit embarrassed to be hyping 1TFLOPS achieved SGEMM when ASCI Red was DGEMM?

Actually it's even worse. ASCI Red achieved more than 1 TFLOPS in LINPACK (solving a dense linear system in DP). It would be even faster doing DGEMM :)
 
My understanding is that a cacheline in a cpu (since they are r/w there) can only be owned by one core. If many cores read the same data, then there will be a lot of traffic just to push stuff around.

Of course there won't be any pushing around of data. Each core would simply have its own copy of the cacheline, held in the "Shared" state (read up on cache coherence protocols).

If there is a write to the cacheline, a cacheline invalidate is broadcast.
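
For reference, the usual MESI states, roughly (plain illustration, not anything specific to Larrabee):

Code:
/* The usual MESI states -- readers can all hold the line at once. */
enum mesi_state {
    MESI_MODIFIED,   /* one core owns a dirty copy                        */
    MESI_EXCLUSIVE,  /* one core holds a clean copy, can write silently   */
    MESI_SHARED,     /* any number of cores hold clean, read-only copies  */
    MESI_INVALID     /* not present, or invalidated by another writer     */
};
/* Reads from many cores: each miss fetches the line, state -> SHARED everywhere.
   Write from one core:   invalidate broadcast, other copies -> INVALID,
                          the writer's copy -> MODIFIED. */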

Cheers
 
Snooping will always fuck things up ... software managed directories are the way to go, not hardware managed ones.

I don't know, I quite like automatic directories (like AMD's snoop filter). You can build hierarchies of these things. The coherency traffic reduction is dramatic.


Roll it into the page handling perhaps?

Would you then trap on a write to a page?


Cheers
 
IIRC it's something like 8GFLOPS.
Really :?: that sounds way too low to even be worth demonstrating :???:

Looking at the video again I have a feeling the scale goes up to 100 GFLOPS, so 80 sounds reasonable. There are two things being shown, QCD and FEM_CANT. If they're running concurrently, each using half of Larrabee, then ~160 GFLOPS might be it.

Jawed
 
Would you then trap on a write to a page?
Not exactly what I meant. I meant that you could maintain a per-page subscriber list; it might trap if the list isn't cached in the hardware TLB at the moment, but it need not trap on every write.
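
Roughly this kind of structure, as a made-up sketch (names invented, nothing from the slides):

Code:
#include <stdint.h>

/* Hypothetical per-page coherence metadata kept alongside the page table
   entry. The subscriber mask says which cores currently hold cached copies
   of (parts of) the page, so a write only needs to notify those cores.
   If the mask isn't resident next to the hw TLB entry, the write traps and
   software refills it -- but it need not trap on every write. */
struct page_coherence_entry {
    uint64_t phys_frame;    /* which page this entry describes            */
    uint64_t subscribers;   /* bit i set => core i may hold cached lines  */
    uint8_t  mask_in_tlb;   /* 1 if the mask is cached with the TLB entry */
};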
 
Not exactly what I meant. I meant that you could maintain a per-page subscriber list; it might trap if the list isn't cached in the hardware TLB at the moment, but it need not trap on every write.

I was thinking along the same lines, but fully automatic.

To support single producer/multiple consumer semantics in the cache apparatus, I'd add a new cache line state in which a write doesn't immediately cause an invalidate broadcast. Instead, communication is deferred until the programmer chooses to "push" the line (or regular LRU replacement ejects the line).

Instead of an invalidate broadcast, the cacheline itself is broadcast. Use automatic directories to only push the line to the actual consumers. All the consumers would hold the cache line in the Shared state.

The automatic directory/snoop filter is used to implicitly implement the subscriber list.
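
In rough C-ish pseudocode (the deferred state and the push primitive are inventions for illustration, not a documented Larrabee feature):

Code:
#include <stdint.h>

/* Sketch of the deferred-update idea: writes to a line in the invented
   DEFERRED state stay local, nothing is broadcast until the push. */
enum line_state { INVALID, SHARED, EXCLUSIVE, MODIFIED, DEFERRED };

struct cache_line {
    enum line_state state;
    uint64_t        tag;
    uint8_t         data[64];
};

/* Stand-in for the actual interconnect transfer. */
static void send_line_to_core(int core, const uint8_t *data)
{
    (void)core; (void)data;
}

/* push_line: instead of broadcasting an invalidate, broadcast the line's
   data, and only to the cores the directory/snoop filter lists as
   subscribers. Afterwards everyone holds the line in the Shared state. */
void push_line(struct cache_line *l, uint64_t subscriber_mask)
{
    for (int core = 0; core < 64; core++)
        if (subscriber_mask & (1ull << core))
            send_line_to_core(core, l->data);
    l->state = SHARED;
}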

Cheers
 
Yes well I have my doubts about the complexity necessary to scale it regardless of hierarchy ... and apparently Intel does too to a degree :)
 
So I guess nVidia was right to call it a bunch of powerpoint slides :cry: Maybe they're going to focus on the new 48 core design they've been showing off lately?
 
So I guess nVidia was right to call it a bunch of powerpoint slides :cry: Maybe they're going to focus on the new 48 core design they've been showing off lately?


Fry: "Words. Nothing but sweet, sweet words that turn into bitter orange wax in my ears."

Sorry couldn't help myself. :)
 