22 nm Larrabee

Sure you can. Whether it is a wise thing to do is another matter, but you can certainly make a multicore processor without L3.

Of course you can make a multicore processor without L3. But you can't make an Ivy Bridge without L3. The point is that its L3 performs a vital function, and if you remove it you have to replace it with something else. Who knows how much that would change the design; it makes area estimates really hard.

I included the controllers in my estimation.

Can you break that down for me?
 
Well, you could fit ~10 Ivy Bridge cores in 160mm^2 without the L3 and GPU. So without the L3 and GPU it could probably be manufactured. Whether it should be is another matter...


so what you are saying is that you want 10 cores that have no viable way of talking to the outside world...
 
Even the Celeron includes L3: 2MB for two cores. (Tip: the current Celeron is incredibly fast, roughly on par with a Core 2 Duo E8500.)

An updated Bulldozer will work without L3, as in Trinity.
But it's a fat design anyway, and you end up with a big L2, which *bridge and similar designs don't have :p
These days an Ivy Bridge core has half the L2 of an Atom core.
 
Is that dual-core celeron using a regular dual-core die with one slice of L3 disabled, or half of two slices disabled? (No way of knowing for sure perhaps...)
 
Is that dual-core celeron using a regular dual-core die with one slice of L3 disabled, or half of two slices disabled? (No way of knowing for sure perhaps...)

Intel's Optimization Reference Manual actually addresses this:

The LLC consists of multiple cache slices. The number of slices is equal to the number
of IA cores. Each slice has logic portion and data array portion. The logic portion
handles data coherency, memory ordering, access to the data array portion, LLC
misses and writeback to memory, and more. The data array portion stores cache
lines. Each slice contains a full cache port that can supply 32 bytes/cycle.

The physical addresses of data kept in the LLC data arrays are distributed among the
cache slices by a hash function, such that addresses are uniformly distributed. The
data array in a cache block may have 4/8/12/16 ways corresponding to
0.5M/1M/1.5M/2M block size. However, due to the address distribution among the
cache blocks from the software point of view, this does not appear as a normal N-way
cache.

So every core must have a slice, but it can go all the way down to 512KB, and the size is cut by reducing set associativity (disabling ways). It's safe to assume they didn't change this for Ivy Bridge, so a 10-core part would need at least 5MB of L3 cache.
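The address-to-slice distribution the manual describes can be illustrated with a toy model. Intel has never documented the actual hash function, so the XOR-fold below (and the names `slice_for_address`, `NUM_SLICES`) is purely illustrative; the point is just that a decent hash spreads consecutive cache lines evenly across slices, which is why software doesn't see a normal N-way cache:

```python
# Toy model: distributing cache-line addresses across LLC slices.
# Intel's real hash is undocumented; this XOR-fold is illustrative only.

NUM_SLICES = 4   # one slice per core (hypothetical 4-core part)
LINE_SIZE = 64   # bytes per cache line

def slice_for_address(phys_addr: int) -> int:
    """Map a physical address to a slice by XOR-folding its line number."""
    line = phys_addr // LINE_SIZE
    h = 0
    while line:
        h ^= line & (NUM_SLICES - 1)  # fold 2-bit chunks together
        line >>= 2
    return h

# Consecutive cache lines spread evenly across all slices.
counts = [0] * NUM_SLICES
for addr in range(0, 1 << 20, LINE_SIZE):
    counts[slice_for_address(addr)] += 1
```

With this kind of hash every slice ends up holding an equal share of the lines, so a core's accesses hit all slices over the ring rather than just its local one.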
 
So how much area would 24 cores + 24 MB L3 take?

According to that 170mm^2 i7-3770K die shot the breakdown is something like this:

Cores: 48.3mm^2
L3: 24.7mm^2
System Agent on bottom: 12.9mm^2
Stuff around the edges which includes I/O: 34.1mm^2
GPU: 53.9mm^2

That's very rough (it adds up to about 174mm^2 rather than 170) but good enough for estimation. Scaling linearly, 24 cores and 24MB of L3 alone would take 289.8 + 74.1 = 363.9mm^2. The problem is I don't know how the system agent and I/O stuff scales, or how much of the edge area is really "empty space" (what looks light on logic could still matter for the metal layers, e.g. for the ring bus). No idea how much space a hypothetical ring bus with 25+ hops would take, assuming it can be done at all. And I don't know what kind of scaling you'd need for the memory and I/O logic.
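As a sanity check, here's the linear-scaling arithmetic for a hypothetical 24-core / 24MB part, using the per-unit figures from the die-shot breakdown above (scaling is assumed linear since both the i7-3770K and the hypothetical part are 22 nm; these are rough estimates, not measurements):

```python
# Linear scaling of the i7-3770K breakdown to a hypothetical
# 24-core / 24 MB part. Same 22 nm process, so area scales with count.

CORES_MM2_4C = 48.3   # four Ivy Bridge cores (from the die shot)
L3_MM2_8MB = 24.7     # 8 MB of L3, i.e. four 2 MB slices

cores_24 = CORES_MM2_4C / 4 * 24   # area of 24 cores
l3_24mb = L3_MM2_8MB / 8 * 24      # area of 24 MB of L3

total = cores_24 + l3_24mb
print(f"cores: {cores_24:.1f} mm^2, L3: {l3_24mb:.1f} mm^2, "
      f"total: {total:.1f} mm^2")
```

That total covers only cores and L3; the system agent, I/O, and the longer ring bus would all add on top of it.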
 
Scaling linearly, 24 cores and 24MB of L3 alone would take 289.8 + 74.1 = 363.9mm^2.
That is probably too much. Perhaps if they reduced the L3 to 512KB/core and removed the L2 from each core, making the "L3" effectively the L2...
 
Too much latency
Compared to what? (Ivy Bridge's L3 is only 3 cycles slower than Bulldozer's L2.)

Too small a size
That is the same amount of L2 they already have.


BTW, I didn't promise the same performance, just 24 cores on a die... I also never said it was a good idea... but I digress...
 
Knights Corner's gather instruction specification is quite interesting:
Note the special mask behavior as only a subset of the active elements of write mask k1 are actually operated on. There are only two guarantees about the function: (a) the destination mask is a subset of the source mask (identity is included), and (b) on a given invocation of the instruction, at least one element (the least significant enabled mask bit) will be selected from the source mask.

Programmers should always enforce the execution of a gather/scatter instruction to be re-executed (via a loop) until the full completion of the sequence (i.e. all elements of the gather/scatter sequence have been loaded/stored and hence, the write-mask bits all are zero).
In other words it fetches one cache line, gathers all the elements located within that single cache line, and that's it. A "jknzd" instruction (jump if mask is not zero) is required to ensure it fetches every other cache line that's touched by the addressed elements.
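That one-cache-line-at-a-time behavior, plus the jknzd retry loop, can be modeled in a few lines. This is a toy Python model of the semantics described above, not real KNC code; the names `gather_step` and `LINE_SIZE` are made up for illustration:

```python
# Toy model of Knights Corner's gather semantics: each invocation
# services only the enabled elements that fall in ONE cache line (the
# line holding the least significant enabled element, which is
# guaranteed to be serviced), clears those mask bits, and software
# loops until the mask is zero -- the "jknzd" pattern.

LINE_SIZE = 64  # bytes

def gather_step(memory, base, indices, scale, mask, dest):
    """One gather invocation; returns the updated write mask."""
    # The least significant enabled element is always serviced.
    first = next(i for i in range(len(indices)) if mask & (1 << i))
    line = (base + indices[first] * scale) // LINE_SIZE
    for i in range(len(indices)):
        addr = base + indices[i] * scale
        if mask & (1 << i) and addr // LINE_SIZE == line:
            dest[i] = memory[addr]
            mask &= ~(1 << i)  # element done: clear its mask bit
    return mask

def gather(memory, base, indices, scale, mask):
    """The software loop: re-execute until the write mask is zero."""
    dest = [0] * len(indices)
    while mask:  # jknzd: jump (loop) while mask is not zero
        mask = gather_step(memory, base, indices, scale, mask, dest)
    return dest
```

If all enabled elements happen to land in one cache line the loop finishes in a single iteration; scattered elements cost one iteration per touched line, which is exactly why the latency per invocation stays fixed.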

Haswell's AVX2 gather specification doesn't return partial results. But it does use a full 256-bit register as the mask, and it can be interrupted in the middle of its execution. Also Haswell's version doesn't appear to have alignment restrictions.

I guess it makes sense for an in-order architecture like Knights Corner to ensure it has a fixed L1 cache hit latency? It also seems to imply that Knights Corner uses a permute unit that's part of the vector ALU. I don't think such an approach makes sense for Haswell. None of the ALUs can write two 256-bit registers, and it would have to handle complicated unaligned cases.

As suggested before, Haswell could have a 256-bit gather load port and a 256-bit regular load port. The regular load port can then be used to return the updated mask register. Does that make sense or can anyone think of an even more probable implementation?
 
According to that 170mm^2 i7-3770K die shot the breakdown is something like this:

Cores: 48.3mm^2
L3: 24.7mm^2
System Agent on bottom: 12.9mm^2
Stuff around the edges which includes I/O: 34.1mm^2
GPU: 53.9mm^2


The die size of the i7-3770K is 160 mm², not 170 mm².
 