22 nm Larrabee

This description has me thinking about Cell. That's probably a gross simplification as a comparison, but Cell took data locality to the extreme by having the SPEs only able to access their local store (I picture it as 7 or 8 computers with 256 KB of memory each, with the ring bus as the "LAN"). Xeon Phi would be less braindead than that.
My feeling, just a feeling, is that data locality and data flow are very important: does the data fit in a fast, close enough cache, or are you going to wait 1000 cycles to get it? Vectors are almost irrelevant if the data isn't there.
That is why I wonder if the statement Intel made about "removing bandwidth constraints" only has to do with CW.
If I get what LiXiangyang said, each core only has access to its local slice of the L2.
So I think that means that if some data sits in another slice of the L2, it has to be moved to the local slice before it is accessible?
I guess there is a penalty in latency, and maybe in bandwidth, if a core has to access data that is in the L2 but not in its local slice of the L2.
IIRC from the old Larrabee talks, there are bus segments that each connect 8 cores and are then linked together.
Now that we are speaking of ~64 cores, it seems to me that data and coherency traffic may have to jump through many hoops before reaching its target; I wonder about the impact on latency and achieved bandwidth.

That is why I think an organization more akin to AMD's Jaguar "compute cluster" could allow for a leaner organization of the data traffic on the chip.
I was hesitating between 4 or 8 cores per cluster, but 4 seems better to me as it should allow for finer-grained scaling of the core count.
Now say you have clusters of 4 Xeon Phi cores connected to an L2 interface akin to the one in an AMD Jaguar-based compute cluster: you have 4 cores with access to 2 MB of shared L2.

Say there is an L1 cache miss. Case 1 is what I think Xeon Phi does; case 2 is what I think could be better (quite a statement for a forum warrior).

Case 1: the core checks its local L2 and then the other L2 slices to see if the data is available; if it is there, it is moved to the core's L1/L2 (inclusive caches?). I think doing so requires jumping through many hoops (the data/signal may have to go through multiple sections of the bus).

Case 2:
The core checks the 2 MB of shared local L2. If the data is not there, it checks the L3 tags (kept on chip); everything contained in the L2 is duplicated in the L3 (Crystal Well).
Such a compute cluster would have no means to access another compute cluster's L2; everything would be checked against the L3.
Whereas accessing the off-chip L3 could take some time, I wonder what the cost is of going through a lot of hoops as in today's Xeon Phi. I also wonder about the power cost of that traffic.
Now say you have 64 cores, that is 16 compute clusters: you "only" need to connect 16 L2 interfaces to a system agent (including the L3/Crystal Well tags). It seems to me that the data traffic on chip would be leaner and that it would be easier to provide a high-bandwidth data path between the system agent and the compute clusters. I also wonder how such an organization of the data traffic would compare to the actual setup, where data seems to travel through lots of hoops.
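To make the two cases a bit more concrete, here is a toy model of both miss paths. Every latency number in it is my own guess rather than an Intel figure; it is only meant to show how case 1 scales with the number of hops while case 2 pays a fixed (but larger) off-chip cost:

[code]
/* Toy model of the two L1-miss paths described above.
 * All latencies are made-up, illustrative values (in core cycles). */
#include <stdio.h>

enum {
    L2_SLICE_LOOKUP = 15,  /* checking one L2 slice (local or remote)    */
    RING_HOP        = 5,   /* one hop on the on-die interconnect         */
    L3_TAG_LOOKUP   = 10,  /* on-die tag check for the off-chip L3       */
    EDRAM_ACCESS    = 80   /* off-chip eDRAM (Crystal Well style) access */
};

/* Case 1: miss locally, walk the ring to the slice that owns the line,
 * then bring the data back over the same number of hops. */
static int case1_latency(int hops_to_owner)
{
    return L2_SLICE_LOOKUP                 /* local slice miss        */
         + hops_to_owner * RING_HOP        /* request travels out     */
         + L2_SLICE_LOOKUP                 /* hit in the owning slice */
         + hops_to_owner * RING_HOP;       /* data travels back       */
}

/* Case 2: miss in the cluster's shared L2, check the on-die L3 tags,
 * then fetch from the inclusive off-chip L3. */
static int case2_latency(void)
{
    return L2_SLICE_LOOKUP + L3_TAG_LOOKUP + EDRAM_ACCESS;
}

int main(void)
{
    for (int hops = 2; hops <= 16; hops *= 2)
        printf("case 1, %2d hops to the owning slice: %3d cycles\n",
               hops, case1_latency(hops));
    printf("case 2, via L3 tags + eDRAM:          %3d cycles\n",
           case2_latency());
    return 0;
}
[/code]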

Intel could also copy AMD with regard to the L2 clock speed. As in Jaguar, the L2 could run at half the core speed; I think 4 threads should be enough to hide the extra latency. Xeon Phi burns a lot of power, so that could be welcome. Maybe Intel could make a further trade-off (or simply do better) and increase the bandwidth between the cores and their local slice of the L2.

About Crystal Well, I read some people on RWT wondering about a hypothetical 2-CW setup; for something like Xeon Phi it could be great, especially if the L3 has to include all the data in the L2: for 64 cores that is 32 MB already.
I could see Intel trading its 512-bit bus and GDDR5 for two of those fast links to CW plus a 256-bit bus to fast DDR3/4. That would be another way to save power, as it seems that CW plus its interface consumes way less power than GDDR5 and its memory controller.
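Putting rough numbers on that trade-off (the bus widths and transfer rates below are typical values I'm assuming, and the per-link CW bandwidth is a guess based on the Haswell eDRAM figures floating around):

[code]
/* Rough, illustrative bandwidth comparison for the trade-off above. */
#include <stdio.h>

/* peak bandwidth of a DRAM bus in GB/s: bus width in bits x GT/s / 8 */
static double gb_per_s(int bus_bits, double gtransfers_per_s)
{
    return bus_bits / 8.0 * gtransfers_per_s;
}

int main(void)
{
    printf("512-bit GDDR5 @ 5.0 GT/s : %6.1f GB/s\n", gb_per_s(512, 5.0));
    printf("256-bit DDR4  @ 2.4 GT/s : %6.1f GB/s\n", gb_per_s(256, 2.4));
    /* assuming ~50 GB/s per direction per Crystal Well style link */
    printf("2x CW links              : ~200.0 GB/s aggregate (read+write)\n");
    return 0;
}
[/code]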

Another thing is that I don't expect Intel to scale the number of cores, or actually the theoretical throughput of the chip. I actually wonder if they could focus on the contrary, i.e. lowering power consumption. Down-clocking the chip would de facto lower the ratio between the bandwidth needed and the bandwidth available, which would have a nice effect on power efficiency. Along with possible changes to the L2 (my own speculation), CW's nice power characteristics, the abandonment of GDDR5, and the jump to 14 nm lithography, those Xeon Phi could have pretty neat power characteristics.
If theoretical throughput doesn't increase, it is easier to rack together a bunch of cooler chips.
 
Or maybe they'll try to solve it the same way they try to solve memory BW issues on Haswell GPUs: stack some insane amount of eDRAM.

I'm rather looking forward to what they do with the instruction set; the next desktop AVX version will grow to 512-bit SIMD size, and hopefully they'll add scatter (beyond the current gather on Haswell).
I also wonder if the iGPUs actually use some kind of AVX ISA. It would make sense in the long run to have all chip product lines share the same ISA; at some point it gets quite inefficient to have e.g. the GPU and CPU with 1 TFLOPS each while one part sits idle, waiting for the other. Both consuming the same food would be a plus.

Regarding the caches, it probably depends on what Intel aims to achieve. There are quite a few use cases where they can beat the s**t out of Kepler, but when it comes to streaming-like benchmarks (e.g. DGEMM), as long as you hide the latency of main memory, bigger caches don't really give you anything. So I expect they'll keep the same caches and integrate more math power.
Is KNC actually capable of requesting data from other caches, bypassing DRAM transfers? (I think that was the point of the ring bus, but I can't recall any recent paper saying so.)
 
Another thing is that I don't expect Intel to scale the number of cores, or actually the theoretical throughput of the chip.

A recent server roadmap by Intel says 3+ TFLOPS for Knights Landing, so they will definitely increase the number of cores, FLOPS/core, or clock speed, or maybe a little bit of all of them.
 
See here:



http://www.icsr.agh.edu.pl/~kito/Arch/arch1-1-4B-x86.pdf
 
I saw that on RWT but did not notice the increase in throughput; it is massive.
The ISA should also converge toward some form of AVX, AVX3.1.

I wonder if that is simply a radical departure from the original Larrabee. The improvements they are claiming are quite gigantic.

Edit
Indeed gigantic. I wonder what the competition is thinking about that? They aim for Blue Gene/Q (or higher) type power efficiency with way higher raw throughput :8O:
EDIT
Release is also pushed back to 2015, contrary to what some links stated earlier.
 
I saw that on RWT but did not notice the increase in throughput; it is massive.
The ISA should also converge toward some form of AVX, AVX3.1.

I wonder if that is simply a radical departure from the original Larrabee. The improvements they are claiming are quite gigantic.
No, why should it? AVX seems to be designed to converge toward the LRB instructions.
We have:
- 3-operand instructions (AVX1)
- FMA (AVX2)
- gather (AVX2)
What we miss right now:
- 512-bit (AVX3.x for float and AVX4 for int?)
- scatter (AVX4?)
- lane masks

Those 3.1 and 3.2 might not indicate an ordering, but an incompatibility (like calling them 3.1 and 3.a). But at some point, the desktop and LRB lines will just support the same instruction set.
If AVX goes up to 1024-bit, then there might be an AVX6 in some backlog.
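For what it's worth, the "we have" part of that list is already usable on Haswell today. A minimal sketch (assuming gcc/clang with -mavx2 -mfma) showing the non-destructive 3-operand form, gather and FMA; scatter and 512-bit vectors are exactly the missing pieces:

[code]
/* Minimal AVX2/FMA example of the "we have" items from the list above.
 * Build with e.g.: gcc -O2 -mavx2 -mfma avx_demo.c */
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    float table[16];
    for (int i = 0; i < 16; ++i)
        table[i] = (float)i;

    /* AVX2 gather: load 8 floats from non-contiguous indices */
    __m256i idx = _mm256_setr_epi32(0, 2, 4, 6, 8, 10, 12, 14);
    __m256  a   = _mm256_i32gather_ps(table, idx, 4);

    /* AVX1 non-destructive 3-operand form: a and b are left intact */
    __m256 b      = _mm256_set1_ps(2.0f);
    __m256 scaled = _mm256_mul_ps(a, b);

    /* FMA (Haswell): r = scaled * b + a in a single instruction */
    __m256 r = _mm256_fmadd_ps(scaled, b, a);

    /* no scatter instruction here: results go back with a contiguous store */
    float out[8];
    _mm256_storeu_ps(out, r);
    for (int i = 0; i < 8; ++i)
        printf("%.1f ", out[i]);
    printf("\n");
    return 0;
}
[/code]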
 
I wonder if that is simply a radical departure from the original Larrabee. The improvements they are claiming are quite gigantic.

When he was in the Data Center Group, Kirk Skaugen stated that future MIC will use Atom cores for better single-thread performance. Now, whether we'll see it in the same generation (as in Silvermont vs Silvermont) or a laggard one (Silverthorne vs Silvermont) remains to be seen.

Indeed gigantic. I wonder what the competition is thinking about that? They aim for Blue Gene/Q (or higher) type power efficiency with way higher raw throughput :8O:

Nvidia is actually claiming similar FLOPS/watt figures for Maxwell.

Also, Intel described the first MIC as an "Inflection Point" and the second one as another "Inflection Point", hinting at a massive increase.

That's how they plan on a 40 MW, 1 exaflop system in roughly the 2018 timeframe.

I also wonder if Intel is planning on meeting the 6 GFLOPS/watt figure for Knights Corner. Maybe a refresh next year with a 225 W, 1.3 TFLOPS part?
 
No, why should it? AVX seems to be designed to converge toward the LRB instructions.
We have:
- 3-operand instructions (AVX1)
- FMA (AVX2)
- gather (AVX2)
What we miss right now:
- 512-bit (AVX3.x for float and AVX4 for int?)
- scatter (AVX4?)
- lane masks

Those 3.1 and 3.2 might not indicate an ordering, but an incompatibility (like calling them 3.1 and 3.a). But at some point, the desktop and LRB lines will just support the same instruction set.
If AVX goes up to 1024-bit, then there might be an AVX6 in some backlog.
I know about Intel's plan with AVX, as Nick speaks a lot about it (usually about what he calls a unified architecture).
Though I'm not sure about your answer, the "no, why should it" part?
You speak of the core, right? Because I was thinking of something along the lines of DavidC's post, like a significant departure from the low single-thread performance, 4-way SMT cores at work since Larrabee.
They "claim" (well, not yet, it is just a roadmap) almost a three-fold increase in performance per watt, and 3 times the throughput.
I'm not sure they can get there without changing the core radically.
 
See here:
Thanks (the PDF doesn't seem to exist anymore when I checked a few minutes ago, but there's still a cached copy of it). 14-16 DP GFLOPS/W indeed.

3x the FP rate could be attained by doubling the per-core FP throughput plus a modest increase in core count and/or clock speed (I don't know how that would work out for power consumption). Perhaps 2x 512-bit FP units per Knights Landing core?
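A quick back-of-the-envelope check of that idea, using roughly Knights Corner-class numbers as the baseline and a purely speculative doubled-FP configuration for comparison:

[code]
/* Back-of-the-envelope check of "3x via doubled per-core FP".
 * The baseline is roughly Knights Corner class; the second line is a
 * made-up configuration, not a known Knights Landing spec. */
#include <stdio.h>

static double dp_tflops(int cores, int dp_flops_per_cycle, double ghz)
{
    return cores * dp_flops_per_cycle * ghz / 1000.0;
}

int main(void)
{
    /* ~60 cores, one 512-bit FMA unit = 16 DP flops/cycle, ~1.1 GHz */
    printf("KNC-ish baseline              : %.2f DP TFLOPS\n",
           dp_tflops(60, 16, 1.1));
    /* two 512-bit FMA units = 32 DP flops/cycle, a few more cores,
     * a modest clock bump */
    printf("2x FP/core, 72 cores, 1.3 GHz : %.2f DP TFLOPS\n",
           dp_tflops(72, 32, 1.3));
    return 0;
}
[/code]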

I also wonder if Intel is planning on meeting the 6 GFLOPS/watt figure for Knights Corner. Maybe a refresh next year with a 225 W, 1.3 TFLOPS part?
I wouldn't be surprised to see something like that, but if the refresh consists of around the specs you mentioned, then I would expect higher-performing 300 W parts as well, since 1.3 TF is already above any existing Knights Corner part. I would also guess that the core counts go up by one or so. Maybe they plan on some update or new chip each year.

If AVX goes up to 1024-bit, then there might be an AVX6 in some backlog.
AVX does support 1024-bit FP, although integer is limited to 512-bit (according to a 2008 paper, maybe things have changed since then). So there's still possible headroom.
 
I know about Intel's plan with AVX, as Nick speaks a lot about it (usually about what he calls a unified architecture).
Though I'm not sure about your answer, the "no, why should it" part?
You speak of the core, right?
*nods*

Because I was thinking of something along the lines of DavidC's post, like a significant departure from the low single-thread performance, 4-way SMT cores at work since Larrabee.
They "claim" (well, not yet, it is just a roadmap) almost a three-fold increase in performance per watt, and 3 times the throughput.
I'm not sure they can get there without changing the core radically.
One way or another, they need to hide latency somehow (memory- and ALU-wise).
The other HT implementations (I think the Atom one as well) are rather meant to hide cache misses; if you ran something purely register-based, there should be no benefit from HT.
LRB's in-order cores with 4-way SMT, on the other hand, need it to hide ALU latency (at least some time ago some Intel guy said you need to use at least 2 threads per core to get full math performance).
I agree with you that it would need a radical change to have the new OoO Atom cores running without SMT. I'm not sure it would make any sense to change it that much. At the moment GPUs and LRB are made for throughput; latency seems to be quite high, but SMT hides it. They'd either have to expose the real latency, which would put a lot of pressure on those 32 vector registers you have, or they'd have to make the ALUs lower latency, which would cost a lot of transistors.
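A toy way to look at that "at least 2 threads per core for full math performance" statement, under the assumption (mine, based on the old Larrabee descriptions) that a single thread can only issue a vector instruction every other cycle:

[code]
/* Toy model: if one thread can issue a vector op at most every other
 * cycle, the core needs at least 2 threads to saturate the vector unit.
 * The issue restriction is an assumption for illustration. */
#include <stdio.h>

int main(void)
{
    const int min_cycles_between_issues = 2;  /* per-thread issue restriction */
    for (int threads = 1; threads <= 4; ++threads) {
        double util = (double)threads / min_cycles_between_issues;
        if (util > 1.0)
            util = 1.0;
        printf("%d thread(s): ~%3.0f%% of the peak vector issue rate\n",
               threads, util * 100.0);
    }
    return 0;
}
[/code]

The threads beyond the second would then mostly be there to cover cache misses, as discussed above.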

Let's assume Intel continues the way they do on the desktop side. In that case we can also assume they will try their best to speed up existing software; they don't want all software to be re-optimized for a totally different architecture, as that would make their 'usual x86 CPU' argument complete nonsense. From that point of view, 4x SMT is likely to stay.

Another plus for the architecture staying as it is: LRB is based on a Pentium, yet it got 64-bit and 4x SMT, so why shouldn't that be possible on an Atom core? Yes, I know, they argued they didn't add HT to keep it more efficient with OoO, but they have HT on the Haswell cores, so it must make some sense.

I would actually think they need more than 4x SMT if they raise the ALU:memory performance ratio. But 8x SMT seems silly (SPARC also departed from it again).

BUT IF THEY go radically different...

How about each of the 4 'threads' being replaced by an Atom OoO core (Bulldozer-like)? Or maybe, as they said, 2x is needed to keep the ALU fully loaded, so maybe 2 Atom OoO cores. Those might use somewhat more space than SMT, but they would not need 128 vector registers etc., just 64. And if one Atom OoO core is more efficient than the old Atom with 2x HT, it would end up with the same ratio on the LRB side. (Actually, they might still have 128 registers due to renaming, but that detail would be hidden from code.)
[edit] If 2 cores shared one vector unit, you'd need to double the unit count to still support 128 threads, of course. That would 2x the math power. [/edit]

so many possibilities :D
 
AVX does support 1024-bit FP, although integer is limited to 512-bit (according to a 2008 paper, maybe things have changed since then). So there's still possible headroom.
interesting, and sad at the same time :(
I guess the 1024/512-bit split is due to opcode encoding? In that case, those 1024 bits for FP might be accidental (you'd still need 2 bits to get 3 VEX encoding sizes: 128, 256, 512, plus a spare).
 
At the moment GPUs and LRB are made for throughput; latency seems to be quite high, but SMT hides it.
SMT isn't about hiding latency, it's about getting better resource utilization without going OoO. You can get vertical multithreading as a side effect of SMT, which Larrabee uses to hide ALU/L1 latency (for anything beyond that it needs to use software-based vertical multithreading, aka fibers), but it's still a bit misleading to simply say SMT hides it.

As for GPUs, that's vertical multithreading all the way.
 
*nods*


One way or another, they need to hide latency somehow (memory- and ALU-wise).
The other HT implementations (I think the Atom one as well) are rather meant to hide cache misses; if you ran something purely register-based, there should be no benefit from HT.
LRB's in-order cores with 4-way SMT, on the other hand, need it to hide ALU latency (at least some time ago some Intel guy said you need to use at least 2 threads per core to get full math performance).
I agree with you that it would need a radical change to have the new OoO Atom cores running without SMT. I'm not sure it would make any sense to change it that much. At the moment GPUs and LRB are made for throughput; latency seems to be quite high, but SMT hides it. They'd either have to expose the real latency, which would put a lot of pressure on those 32 vector registers you have, or they'd have to make the ALUs lower latency, which would cost a lot of transistors.

Let's assume Intel continues the way they do on the desktop side. In that case we can also assume they will try their best to speed up existing software; they don't want all software to be re-optimized for a totally different architecture, as that would make their 'usual x86 CPU' argument complete nonsense. From that point of view, 4x SMT is likely to stay.

Another plus for the architecture staying as it is: LRB is based on a Pentium, yet it got 64-bit and 4x SMT, so why shouldn't that be possible on an Atom core? Yes, I know, they argued they didn't add HT to keep it more efficient with OoO, but they have HT on the Haswell cores, so it must make some sense.

I would actually think they need more than 4x SMT if they raise the ALU:memory performance ratio. But 8x SMT seems silly (SPARC also departed from it again).

BUT IF THEY go radically different...

How about each of the 4 'threads' being replaced by an Atom OoO core (Bulldozer-like)? Or maybe, as they said, 2x is needed to keep the ALU fully loaded, so maybe 2 Atom OoO cores. Those might use somewhat more space than SMT, but they would not need 128 vector registers etc., just 64. And if one Atom OoO core is more efficient than the old Atom with 2x HT, it would end up with the same ratio on the LRB side. (Actually, they might still have 128 registers due to renaming, but that detail would be hidden from code.)
[edit] If 2 cores shared one vector unit, you'd need to double the unit count to still support 128 threads, of course. That would 2x the math power. [/edit]

so many possibilities :D
For the record, I'm just a forum warrior and IIRC you're not, so if I say nonsensical things don't be too shocked ;)

I think the 4-way SMT in Xeon Phi is actually a "relic" of Larrabee, i.e. a system designed to render graphics. If I understand real-time rendering well enough, the issue is that the data, foremost textures, offers too poor data locality for caches to be exploited; more often than not the "device" has to hide the pretty massive latencies associated with accessing RAM/VRAM or an off-chip pool of memory. I hope I got that right. Actually, on top of 4-way SMT, Intel had to rely on a software scheme to hide even more latency.
That is what a GPU does in hardware, and I hope I got that right, or close.

Now I think the workloads Intel is aiming at may on average make better use of data locality, which makes caches more useful. I don't know if it is related, but Intel moved, for example, from 256 KB of L2 per core to 512 KB (maybe they could just afford it).

On top of that, if I compare the old Atom with 2-way SMT and the new one, it seems that OoOE has a greater impact on performance than 2-way SMT. With rendering out of the picture, my bet is that out-of-order execution as implemented in the new Atom is all they need now.
Intel claims that the overhead for OoO is no greater than that of 2-way SMT, so I would assume it is less than that of 4-way SMT.

I think those new Atom cores could be a good building block for Intel. I've just answered Nick in another thread, and I think maybe Intel could sort of blend Haswell's dual FMA units into those Atom cores.
They would widen the data path on the Atom core to 256-bit to match the SIMD width (8-wide), and have 2 FMA units. If I get it right, that is 16 DP FLOPS per cycle. They would introduce the same support for the gather instruction as in the LRBni ISA (so better than what Haswell does).

If they reuse existing parts (the FMA units, the Atom core), as they did in Atom with I don't remember which ALU (D. Kanter goes into the details in the relevant article on RWT), they should end up with parts that clock well.
The SIMD units in Haswell obviously clock really well, up to 4 GHz, and those new Atoms are rumored to reach up to 2.4 GHz.

Now I expect Intel to introduce some turbo so they can modulate the speed of the chip to fit a given TDP figure, though when it comes to quoting peak figures, especially for a pretty far-off roadmap, I think they could use the peak clock speed.
Say at 2.4 GHz and 16 DP FLOPS per cycle, to reach ~3 DP TFLOPS I get around 80 cores. I won't make a bet on power, but I bet that the figures they are presenting could be estimates based on the chip running real software with the turbo/energy-saving features on. So PR for the win: you could quote the max throughput on one side, disregarding TDP, and on the other side perf-per-watt figures that have nothing to do with the power cost of actually reaching 3 TFLOPS. Though it's not really a scheme, as neither figure is a lie, just short on further details; and of the two, the more valuable data is perf per watt on real workloads (even if you don't get close to the max throughput of the chip... if I'm making myself clear enough...).
{edit} I think so because with their 14 GFLOPS per watt and 3 TFLOPS I get ~220 W for the chip, and that sounds unrealistic for 80 cores running on "all cylinders" @ 2.4 GHz when 60+ cores at 1 GHz already burn 300 W, even taking into account the share of GDDR5 if they look at the whole platform. {/edit}
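For reference, the arithmetic behind the two paragraphs above, with my own guesses plugged in (2.4 GHz, 2x 256-bit FMA = 16 DP FLOPS/cycle, and the 14 GFLOPS/W roadmap figure):

[code]
/* The arithmetic used above; the inputs are guesses/roadmap figures,
 * not confirmed specs. */
#include <stdio.h>

int main(void)
{
    const double ghz             = 2.4;
    const int    dp_per_cycle    = 16;    /* 2 FMA units x 4 DP lanes x 2 */
    const double target_tflops   = 3.0;
    const double gflops_per_watt = 14.0;

    double gflops_per_core = ghz * dp_per_cycle;
    double cores_needed    = target_tflops * 1000.0 / gflops_per_core;
    double implied_watts   = target_tflops * 1000.0 / gflops_per_watt;

    printf("per core       : %.1f DP GFLOPS\n", gflops_per_core);
    printf("cores for 3 TF : ~%.0f\n", cores_needed);
    printf("implied power  : ~%.0f W at 14 GFLOPS/W\n", implied_watts);
    return 0;
}
[/code]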

I'm not sure how those Atom cores plus dual 8-wide FMA units would compare in die size with the Xeon Phi cores, but taking the process into account, it doesn't sound too crazy to me.

I expect Intel to level the playing field between its different products before possibly going wider than 256-bit.

For the "layout" of the chip, the L2 fabric for those Atom cores seems to scale up to 8. I wonder if instead of linking all those "cluster/grappes" together if they could link a system agent as in Haswell, it would include the cache tag for one or more Crystal Well, and have the coherency and data exchange between those "cluster of 8", though the off chip L3 which could be inclusive. 40MB of data of of the 256MB offered by 2 CW doesn't sound to crazy.
It is may be an incorrect assumption but I keep thinking that linking those cluster to such a set-up should be easier than linking all those cluster together+the off chip L3, etc. It might be a wasteful approach but my gut feeling says that it should allow for higher bandwidth connection (between the clusters and system agent in turn connected to CW and main memory) and possibly saving in power.
 
SMT isn't about hiding latency, it's about getting better resource utilization without going OoO.
better resource utilization by hiding latency.

You can get vertical multithreading as a side effect of SMT, which Larrabee uses to hide ALU/L1 latency (for anything beyond that it needs to use software-based vertical multithreading, aka fibers), but it's still a bit misleading to simply say SMT hides it.
How do you know it's not by design? If you get 0-30% like on an i7, then it's exploiting existing resources, but in-order cores that can only reach 50% with one thread -> by design, IMO.
 
I think the 4-way SMT in Xeon Phi is actually a "relic" of Larrabee, i.e. a system designed to render graphics. If I understand real-time rendering well enough, the issue is that the data, foremost textures, offers too poor data locality for caches to be exploited; more often than not the "device" has to hide the pretty massive latencies associated with accessing RAM/VRAM or an off-chip pool of memory.
AFAIK, for that purpose, TMUs were built in HW. The latency of texture fetches doesn't come solely from the memory access, but also from format conversion and filtering; not even Intel dared to do that in software.
There is some latency if you use 'gather', but I think that wasn't hidden by SMT, but by the:
... software scheme to hide even more latency.

Now I think the workloads Intel is aiming at may on average make better use of data locality, which makes caches more useful. I don't know if it is related, but Intel moved, for example, from 256 KB of L2 per core to 512 KB (maybe they could just afford it).
That's really a good point: to predict the next Xeon Phi more accurately, we'd need to know what workload Intel is aiming at.
If they aim at maximum throughput, they'd probably even need to increase the ALU latency.

On top of that, if I compare the old Atom with 2-way SMT and the new one, it seems that OoOE has a greater impact on performance than 2-way SMT. With rendering out of the picture, my bet is that out-of-order execution as implemented in the new Atom is all they need now.
Intel claims that the overhead for OoO is no greater than that of 2-way SMT, so I would assume it is less than that of 4-way SMT.
That's true; that's why my last lines considered two OoO Atoms sharing one vector unit, AMD Bulldozer-like. The problem is that ALUs cost exponentially more transistors to reduce latency linearly (e.g. a radix-16 instead of a radix-4 divider); it's always throughput vs. latency. I don't think Intel will redo their long-polished LRB vector units, but you also can't really stretch the pipeline of the CPUs, or you'll end up with a P4-like pipe.
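To illustrate the latency half of that divider example (an SRT-style divider retires log2(radix) quotient bits per cycle; the transistor cost of the quotient-selection logic, which is the part that blows up, is not modelled here):

[code]
/* Schematic only: cycles for a double-precision divide as the divider
 * radix grows. Latency shrinks slowly while the selection logic (not
 * modelled) gets much more expensive, which is the trade-off above. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const int mantissa_bits = 53;  /* double-precision significand */
    for (int radix = 2; radix <= 16; radix *= 2) {
        int bits_per_cycle = (int)log2((double)radix);
        int cycles = (mantissa_bits + bits_per_cycle - 1) / bits_per_cycle;
        printf("radix-%-2d divider: %d bit(s)/cycle -> ~%2d cycles per DP divide\n",
               radix, bits_per_cycle, cycles);
    }
    return 0;
}
[/code]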

I think those new Atom cores could be a good building block for Intel. I've just answered Nick in another thread, and I think maybe Intel could sort of blend Haswell's dual FMA units into those Atom cores.
They would widen the data path on the Atom core to 256-bit to match the SIMD width (8-wide), and have 2 FMA units. If I get it right, that is 16 DP FLOPS per cycle. They would introduce the same support for the gather instruction as in the LRBni ISA (so better than what Haswell does).
Hm, I wonder how hard it might be to mix a low-power 2 GHz core with Haswell vector units. And it might sound freaky, but wouldn't the iGPU vector units suit the GPGPU area better, if they want to mate the Atom with some vector units?
 
If they aim at maximum throughput, they'd probably even need to increase the ALU latency.
It is a complete bet, but I think they are not aiming at max throughput but at way higher sustained performance and way better perf per watt (on real workloads).
I think the "theoretical max throughput" might not even be reachable within a reasonable TDP, a bit like how some Radeons can't achieve their theoretical max texel rate because power management features prevent them from doing so. Really a complete bet.
That's true; that's why my last lines considered two OoO Atoms sharing one vector unit, AMD Bulldozer-like. The problem is that ALUs cost exponentially more transistors to reduce latency linearly (e.g. a radix-16 instead of a radix-4 divider); it's always throughput vs. latency. I don't think Intel will redo their long-polished LRB vector units, but you also can't really stretch the pipeline of the CPUs, or you'll end up with a P4-like pipe.
It is possible, but looking at AMD's results I think Intel could be wary of pursuing that road. I think it could make the design more complex than it needs to be.
WRT their LRB vector units, well, I think Intel can afford it, especially if they can reuse existing work done for other architectures. I don't code, but as they aim at DP, and an 8-wide SP SIMD roughly translates into a 4-wide DP SIMD, do you really need all the extras from LRB when you only deal with four elements (vs. a design that dealt with up to 16 SP elements, where SP was relevant to the design)?
Hm, I wonder how hard it might be to mix a low-power 2 GHz core with Haswell vector units. And it might sound freaky, but wouldn't the iGPU vector units suit the GPGPU area better, if they want to mate the Atom with some vector units?
I don't know if it is doable; when I look here and here... well, it is confusing :LOL:
Though I would think the Intel guys would find a way, and a cleverer one than I can imagine, to fit those 2 units onto their 4 execution ports.
The great thing about Intel is that it has the manpower to build new FMA units (they still have to fit them into the Atom's four execution ports).

Their iGPUs are great; for now they don't support DP calculation. Intel seems to keep DP as a reserved playing field for its expensive CPUs, but they may change their policy.
It seems to me that with its iGPUs, its big cores backed by ever-larger SIMD, and those Xeon Phi, Intel is firing on all cylinders with regard to "how to exploit parallelism"; they have pretty much everything covered, and they have lots of options at this point.
 