Larrabee delayed to 2011?

Because in certain cases it is fantastically useful, think big shared data structures where reads/queries vastly outnumber updates.
Snooping will always fuck things up ... software managed directories are the way to go, not hardware managed ones.

Roll it into the page handling perhaps?
 
Snooping will always fuck things up ... software managed directories are the way to go, not hardware managed ones.

Right, because a hw cache controller doesn't know (cannot know without sw help?) that a particular cacheline will be used mostly for reads, with relatively infrequent writes, and hence that it could keep the same cacheline in many cores.
 
Right, because a hw cache controller doesn't know (cannot know without sw help?) that a particular cacheline will be used mostly for reads, with relatively infrequent writes, and hence that it could keep the same cacheline in many cores.

This is the scenario where caches shine. Frequent reads and infrequent writes.

The problem is when cache coherency is used for IPC. Producers writing to queues cause invalidate traffic, and the subsequent consumer requests then cause queries for the line. That's a lot of wasted traffic to just get a few bytes from core 0 to core 1.

The solution is either explicit message queues or virtual channels in the memory system. Both require more hardware support.
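
Rough C sketch of the queue case I mean (names and layout are hypothetical, just to show where the coherence traffic comes from):

Code:
#include <stdatomic.h>
#include <stdint.h>

#define QSIZE 256

/* Hypothetical single-producer/single-consumer queue in shared memory. */
struct spsc_queue {
    _Atomic uint32_t head;          /* written by producer, read by consumer */
    _Atomic uint32_t tail;          /* written by consumer, read by producer */
    uint32_t         slots[QSIZE];  /* payload lines bounce the same way     */
};

/* Every store to q->head invalidates the consumer's cached copy of that
   line; the consumer's next load then has to re-fetch it across the
   interconnect -- two coherence transactions to move a few bytes. */
int enqueue(struct spsc_queue *q, uint32_t v)
{
    uint32_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
    uint32_t t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (h - t == QSIZE)
        return 0;                   /* full */
    q->slots[h % QSIZE] = v;
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return 1;
}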

Cheers
 
This is the scenario where caches shine. Frequent reads and infrequent writes.

The problem is when cache coherency is used for IPC. Producers writing to queues cause invalidate traffic, and the subsequent consumer requests then cause queries for the line. That's a lot of wasted traffic to just get a few bytes from core 0 to core 1.

The solution is either explicit message queues or virtual channels in the memory system. Both require more hardware support.

Cheers
My understanding is that a cacheline in a cpu (since they are r/w there) can only be owned by one core. If many cores read the same data, then there will be a lot of traffic just to push stuff around.

As for amount of hw support, it seems from that ppt that sw managed cache coherency costs less hw than present day caches.
 
This is the scenario where caches shine. Frequent reads and infrequent writes.

In this scenario caches shine, hw cache-coherency doesn't.

In the sw managed world, that page could be marked as readonly and then could be shared safely across many cores without the coherency traffic.
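
Something like this at the user level, as a very rough sketch (the real mechanism would live in the coherence/page-handling layer, this only illustrates the read-only marking):

Code:
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

/* Hypothetical: publish a page-aligned region of read-mostly shared data.
   Once it is read-only, every core can keep its own cached copy; a stray
   write traps (SIGSEGV) and the sw coherence layer can deal with it, rather
   than the hw broadcasting an invalidate on every store. */
void publish_readonly(void *page, size_t len)
{
    if (mprotect(page, len, PROT_READ) != 0)
        perror("mprotect");
}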
 
So anyone know how fast Larrabee was in the sparse matrix multiplication from that video?

Also, do you think Rattner was a bit embarrassed to be hyping 1TFLOPS achieved SGEMM when ASCI Red was DGEMM?

Jawed
 
It's already embarrassing when a CEO of any firm doesn't have a clue about the technology of the firm he's managing.
 
So anyone know how fast Larrabee was in the sparse matrix multiplication from that video?

IIRC it's something like 8GFLOPS.

Also, do you think Rattner was a bit embarrassed to be hyping 1TFLOPS achieved SGEMM when ASCI Red was DGEMM?

Actually it's even worse. ASCI Red achieved more than 1 TFLOPS in LINPACK (solving a dense linear system in DP). It would be even faster doing DGEMM :)
 
My understanding is that a cacheline in a cpu (since they are r/w there) can only be owned by one core. If many cores read the same data, then there will be a lot of traffic just to push stuff around.

Of course there won't be any pushing around of data. Each core would simply have its own copy of the cacheline, held in the "Shared" state (read up on cache coherence protocols).

If there is a write to the cacheline, a cacheline invalidate is broadcast.
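
For reference, the usual MESI states, roughly (plain illustration, not anything specific to Larrabee):

Code:
/* The usual MESI states -- readers can all hold the line at once. */
enum mesi_state {
    MESI_MODIFIED,   /* one core owns a dirty copy                        */
    MESI_EXCLUSIVE,  /* one core holds a clean copy, can write silently   */
    MESI_SHARED,     /* any number of cores hold clean, read-only copies  */
    MESI_INVALID     /* not present, or invalidated by another writer     */
};
/* Reads from many cores: each miss fetches the line, state -> SHARED everywhere.
   Write from one core:   invalidate broadcast, other copies -> INVALID,
                          the writer's copy -> MODIFIED. */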

Cheers
 
Snooping will always fuck things up ... software managed directories are the way to go, not hardware managed ones.

I don't know, I quite like automatic directories (like AMD's snoop filter). You can build hierarchies of these things. The coherency traffic reduction is dramatic.


Roll it into the page handling perhaps?

Would you then trap on a write to a page?


Cheers
 
IIRC it's something like 8GFLOPS.
Really :?: that sounds way too low to even be worth demonstrating :???:

Looking at the video again I have a feeling the scale goes up to 100 GFLOPS, so 80 sounds reasonable. There are two things being shown, QCD and FEM_CANT. If they're running concurrently, each using half of Larrabee, then ~160 GFLOPS might be it.

Jawed
 
Would you then trap on a write to a page?
Not exactly what I meant. I meant that you could maintain a per-page subscriber list; it might trap if the list isn't cached in the hardware TLB at the moment, but it need not trap on every write.
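
Roughly this kind of structure, as a made-up sketch (names invented, nothing from the slides):

Code:
#include <stdint.h>

/* Hypothetical per-page coherence metadata kept alongside the page table
   entry. The subscriber mask says which cores currently hold cached copies
   of (parts of) the page, so a write only needs to notify those cores.
   If the mask isn't resident next to the hw TLB entry, the write traps and
   software refills it -- but it need not trap on every write. */
struct page_coherence_entry {
    uint64_t phys_frame;    /* which page this entry describes            */
    uint64_t subscribers;   /* bit i set => core i may hold cached lines  */
    uint8_t  mask_in_tlb;   /* 1 if the mask is cached with the TLB entry */
};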
 
Not exactly what I meant. I meant that you could maintain a per-page subscriber list; it might trap if the list isn't cached in the hardware TLB at the moment, but it need not trap on every write.

I was thinking along the same lines, but fully automatic.

To support single producer/multiple consumer semantics in the cache apparatus, I'd add a new cache line state in which a write doesn't immediately cause an invalidate broadcast. Instead, communication is deferred until the programmer chooses to "push" the line (or regular LRU replacement ejects the line).

Instead of an invalidate broadcast, the cacheline itself is broadcast. Use automatic directories to only push the line to the actual consumers. All the consumers would hold the cache line in the Shared state.

The automatic directory/snoop filter is used to implicitly implement the subscriber list.
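
In rough C-ish pseudocode (the deferred state and the push primitive are inventions for illustration, not a documented Larrabee feature):

Code:
#include <stdint.h>

/* Sketch of the deferred-update idea: writes to a line in the invented
   DEFERRED state stay local, nothing is broadcast until the push. */
enum line_state { INVALID, SHARED, EXCLUSIVE, MODIFIED, DEFERRED };

struct cache_line {
    enum line_state state;
    uint64_t        tag;
    uint8_t         data[64];
};

/* Stand-in for the actual interconnect transfer. */
static void send_line_to_core(int core, const uint8_t *data)
{
    (void)core; (void)data;
}

/* push_line: instead of broadcasting an invalidate, broadcast the line's
   data, and only to the cores the directory/snoop filter lists as
   subscribers. Afterwards everyone holds the line in the Shared state. */
void push_line(struct cache_line *l, uint64_t subscriber_mask)
{
    for (int core = 0; core < 64; core++)
        if (subscriber_mask & (1ull << core))
            send_line_to_core(core, l->data);
    l->state = SHARED;
}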

Cheers
 
Yes well I have my doubts about the complexity necessary to scale it regardless of hierarchy ... and apparently Intel does too to a degree :)
 
So I guess nVidia was right to call it a bunch of powerpoint slides :cry: Maybe they're going to focus on the new 48 core design they've been showing off lately?
 
So I guess nVidia was right to call it a bunch of powerpoint slides :cry: Maybe they're going to focus on the new 48 core design they've been showing off lately?


Fry: "Words. Nothing but sweet, sweet words that turn into bitter orange wax in my ears."

Sorry couldn't help myself. :)
 