22 nm Larrabee

170 mm² is wrong; he used the Haswell die for his calculation. The official datasheet for IVB HE-4 says 8.141 mm × 19.361 mm = 159.8 mm².

http://www.anandtech.com/show/5876/the-rest-of-the-ivy-bridge-die-sizes

It is very unlikely that that is a Haswell die, and why do you think UBM TechInsights would have had access to one? I don't know where this alternate die came from, but it really doesn't matter: it's obviously the same as a standard IB die except with a bigger GPU, and therefore fine for what I used it for. Do you somehow still have a problem with this?

Interesting, though, that Intel did four different dies instead of just two. That's pretty unexpected.
 
You can tell from the number: it's the Haswell photo from IDF last year in your link.

http://pc.watch.impress.co.jp/img/pcw/docs/480/539/html/4.jpg.html


We don't need any amateurish, imprecise measurements; Intel stated 160 mm² (exactly 159.8 mm²) in the Ivy Bridge datasheet and press PDF. What more do you need?

Although that chip (with the UBM TechInsights die shot superimposed on top of it) was assumed to be Haswell, I'm having a hard time finding anything definitive confirming it (the Chip-Architect site itself says the chip was presumed to be Haswell). Obviously, unless I'm missing something or the shot is fabricated, UBM would have needed that Haswell sample to get die shots. Why would they have it? And would Haswell really have CPU cores the same size as IB's? Which die model would this correspond to, the 16 EU Haswell? The 40 EU one? It doesn't really seem to line up with anything. My money is on this being a scrapped IB with 24 EUs or some other GPU difference.

Anyway, I've said it several times and you don't seem to get it: it really doesn't matter whether you use the 160 mm² shot or the 170 mm² one from the Chip-Architect page, because the sizes of the outlined cores and L3 cache are the same. Or at least the rectangles they lined up are, and those are what I used. I'm not arguing against real retail IBs being 160 mm² (I never have been; I simply didn't know at first which one was right). So why are you going off on this tangent exactly? The point was never to estimate IB's die size but to estimate, at a minimum, what some hypothetical chip with more cores might be.
 
It looks like gather/scatter cannot be done with a single instruction; instead you are required to code a loop like:

while ( mask = gather(values, address, mask));

Bits of the mask are cleared to indicate which SIMD data lanes could be read in a single instruction.
In case all lanes come from different cache lines, this looks like it could require 16 iterations!?

I wonder if Haswell will also require explicitly coded loops like this for gather.
 
I understand that as pseudocode describing the internal workflow for completing a gather instruction. You don't have to write that loop in your code (and the compiler isn't required to generate it either).
 
A little searching later, I found another link to the Xeon Phi ISA.

Notice on page 299:
Programmers should always enforce the execution of a gather/scatter instruction to be
re-executed (via a loop) until the full completion of the sequence (i.e. all elements of the
gather/scatter sequence have been loaded/stored and hence, the write-mask bits all are
zero).
 
Okay. So the only advantage of gather on Larrabee is that one may get an early exit from that loop. In the worst case it behaves the same as individually loading all elements. :cry:
 
Okay. So the only advantage of gather on Larrabee is that one may get an early exit from that loop. In the worst case it behaves the same as individually loading all elements. :cry:

I wouldn't say that. Without gather you'd have to individually move each element from a vector register to a scalar one, perform the load, then move the result back to the vector element. It'd require at least 2, maybe 3 cycles and would need a lot of code or a loop itself (which would increase the cycle count of course).

In-order dual issue pipelines these days are often staggered with conditional branches, so it's possible that on Larrabee the branch-if-mask-not-zero instruction can issue in parallel with the scatter/gather. You'd expect 0-1 mispredicts for the loop, depending on whether or not the predictor recognized a repeated loop length. So the question is, how good is the predictor and what's the mispredict penalty. But in the hypothetical worst case you'd still only get one mispredict after 16 loads.

In a lot of workloads it's really common for the gather/scatter indexes to be near each other, as in their original use case of texturing. Often you'll even have the same element accessed repeatedly. But good luck benefiting from this while emulating the instruction.
 
The difference in how Haswell and Phi handle gather instructions could be due to limits to how much the designers were willing to push the simple in-order pipeline.
The aggressive OoO cores already have a lot of internal replay and cross-domain coordination that can be extended or replicated for a gather unit that doesn't need as much monitoring by the code.
 
I wouldn't say that. Without gather you'd have to individually move each element from a vector register to a scalar one, perform the load, then move the result back to the vector element. It'd require at least 2, maybe 3 cycles and would need a lot of code or a loop itself (which would increase the cycle count of course).
I was thinking of the alternative of loading one element after another directly into the right slot of the vector GPR. You are right that this is not possible in an efficient way on Larrabee. But you could almost do it with the KnightsCorner instruction set: VBROADCASTSS loads a single float and broadcasts it to all vector slots; one just has to combine it with a write mask that allows it to write only to the right slot.

So one could write out the vector with the indices before the loop, and the loop body would consist of a scalar load (the index), an SHR (the write mask doubles as the loop "counter", so no explicit test is needed if one loops with jnz), and the VBROADCASTSS instruction. It could be better than the solution you proposed (furthermore, one could completely unroll it, eliminating the bitshift and the loop overhead). Unfortunately, KnightsCorner doesn't support using a scalar register (or, for the unrolled version, an immediate would be even better) directly as a vector mask; one has to move the mask from the scalar register to a mask register (and immediates are not supported for this either). Otherwise, if immediates were supported as vector masks, it could be a nice 1 scalar + 1 vector instruction pair for each loaded value (the instructions can run in parallel).
 
I'm not totally clear from the instruction set reference manual, which I find pretty poorly written/incomplete, but doesn't a memory address come from the standard x86 address modes (using scalar registers), except for scatter/gather? If so, you'd need to move the vector fields into those registers first, even if you could do the loads or stores straight from parts of the vector registers. And I don't see any instructions for moving between scalar and vector registers; if they're not there, you'd have to go through memory, which would be even worse.
 
I'm not totally clear from the instruction set reference manual, which I find pretty poorly written/incomplete, but doesn't a memory address come from the standard x86 address modes (using scalar registers), except for scatter/gather? If so, you'd need to move the vector fields into those registers first, even if you could do the loads or stores straight from parts of the vector registers. And I don't see any instructions for moving between scalar and vector registers; if they're not there, you'd have to go through memory, which would be even worse.
That's a single vector move of the offset indices to memory (the L1 D$, which is pretty fast and probably doesn't add any latency with 4-threaded SMT, save for the absolute worst case). These offsets have to be in a single vector register for gather anyway, so one starts from the same situation. The base address needs to be supplied by a scalar register for gather/scatter too.

So what one needs to do is address the memory as [base+offset]. The broadcast instructions use the "traditional" addressing, not the gather-like vector addressing; just the meaning of the imm8 offsets is slightly modified (they are automatically scaled with the actual access width). Fundamentally, the gather instructions do nothing else; they only automate the write-mask handling. And they can read more than one item if it is in the same cache line, setting the write mask accordingly (one does not have to manage it manually), which has to be tested in a loop to be sure everything got read.
 
Okay, I had no idea you could use a vector field as a memory offset for a contiguous load. Could you show where in the reference manual this is specified? Is there a way to pick which field, or is it the first one or something like that?
 