That is why I wonder if the statement Intel made about "removing bandwidth constraints" only has to do with CW. This description has me thinking about the Cell. That's probably a gross simplification or comparison, but Cell took data locality to the extreme by having the SPEs only able to access their local storage (I'm imagining it as 7 or 8 computers with 256 KB of memory each, with the ring bus as the "LAN"). Xeon Phi would be less braindead than that.
My feeling, just a feeling, is that data locality and data flow are very important: does that thing fit in a fast and close enough cache, or are you going to wait 1000 cycles to get it? Vectors are almost irrelevant if the data isn't there.
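To put rough numbers on that intuition, here is a back-of-envelope sketch; every figure in it (miss latency, vector width, ops per fetched line) is an assumption I picked for illustration, not a measured Xeon Phi number:

```python
# Back-of-envelope: how much a distant memory access hurts a wide vector unit.
# All constants are assumptions for illustration, not Xeon Phi specs.

VECTOR_WIDTH = 16    # elements per vector op (assumed)
ISSUE_RATE = 1       # vector ops per cycle when data is resident (assumed)
MISS_LATENCY = 300   # cycles to fetch a line from far away (assumed)
OPS_PER_LINE = 4     # vector ops of work per 64-byte line fetched (assumed)

peak = VECTOR_WIDTH * ISSUE_RATE  # elements/cycle with data in cache
miss_bound = VECTOR_WIDTH * OPS_PER_LINE / (MISS_LATENCY + OPS_PER_LINE)

print(f"peak throughput: {peak:.1f} elements/cycle")
print(f"miss-bound:      {miss_bound:.2f} elements/cycle")
print(f"slowdown:        {peak / miss_bound:.0f}x")
```

With those made-up numbers the wide vector unit delivers well under 2% of its peak, which is all I mean by "vectors are almost irrelevant if the data isn't there".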
If I get what LiXiangyang said correctly, each core only has direct access to its local slice of the L2.
So I think that means that if some data is in another slice of the L2, it has to be moved to the local slice before it is accessible?
I guess there is a penalty in latency, and maybe in bandwidth, if a core has to access data that is in the L2 but not in its local slice.
IIRC from the old Larrabee talks, there are buses that connect groups of 8 cores, and those groups are then connected together.
Now that we speak of ~64 cores, it seems to me that data and coherency traffic may have to jump through many hoops before it reaches its target, and I wonder about the impact on latency and achieved bandwidth.
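A crude model of how that traffic scales: on a bidirectional ring a message takes the shorter way around, so the average trip is roughly a quarter of the stops. The cycles-per-hop figure below is an assumption, just to get a feel for the trend:

```python
# Rough model of average trip length on a bidirectional ring as the number
# of stops grows. Hop cost is an assumption, not an Intel figure.

CYCLES_PER_HOP = 2  # assumed cycles to cross one ring stop

def avg_hops(stops):
    # Shorter way around a bidirectional ring: average distance ~ stops / 4.
    return stops / 4

for stops in (8, 32, 64):
    hops = avg_hops(stops)
    print(f"{stops:2d} stops: ~{hops:4.1f} hops, ~{hops * CYCLES_PER_HOP:5.1f} cycles one way")
```

Whatever the real per-hop cost, the average one-way trip grows linearly with the number of stops, which is what worries me at ~64 cores.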
That's why I think an organization more akin to AMD Jaguar's "compute cluster" could allow for a "leaner" organization of the data traffic on the chip.
I was hesitating between 4 and 8 cores, but 4 seems better to me as it should allow for more fine-grained scaling of the core count.
Now say you have clusters of 4 Xeon Phi cores connected to an L2 interface akin to the one in an AMD Jaguar-based compute cluster: you have 4 cores that have access to 2 MB of shared L2.
Say there is an L1 cache miss; case 1 is what I think Xeon Phi does, case 2 is what I think could be better (quite a statement for a forum warrior).
Case 1: the core checks its local L2 slice and then the other slices of the L2 to see if the data is available; if the data is there, it is moved to its L1/L2 (inclusive caches?). I think doing so requires jumping through many hoops (the data/signal may have to go through multiple sections of the bus).
Case 2: the core checks the 2 MB of shared local L2. If the data is not there, it goes on to check the L3 tags (kept on chip); everything contained in the L2 is duplicated in the L3 (Crystal Web).
Such a compute cluster would have no means to access another compute cluster's L2; everything would be checked against the L3.
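Here is a sketch of the two miss paths as I imagine them, with made-up latencies just to compare orders of magnitude; none of these figures come from Intel documentation:

```python
# Sketch of the two L1-miss paths described above. Every latency constant is
# an assumption chosen for illustration, not a documented figure.

RING_HOP = 2          # cycles per ring stop (assumed)
L2_SLICE_LOOKUP = 15  # cycles to probe one L2 slice (assumed)
N_SLICES = 64         # one slice per core in the case-1 layout (assumed)
L3_TAG_LOOKUP = 25    # cycles to check the on-chip L3/CW tags (assumed)
CW_ACCESS = 80        # cycles to pull a line from off-chip Crystal Web (assumed)

def case1_remote_hit(avg_hops=N_SLICES / 4):
    # Case 1: miss locally, request travels the ring to the slice that owns
    # the line, probes it, then the data travels back the same distance.
    return L2_SLICE_LOOKUP + 2 * avg_hops * RING_HOP + L2_SLICE_LOOKUP

def case2_l3_hit():
    # Case 2: miss in the cluster's shared 2 MB L2, go straight to the on-chip
    # L3 tags; since L2 contents are duplicated in the L3, a "remote" line is
    # fetched from CW, never from another cluster's L2.
    return L2_SLICE_LOOKUP + L3_TAG_LOOKUP + CW_ACCESS

print(f"case 1, line in a remote slice: ~{case1_remote_hit():.0f} cycles")
print(f"case 2, line served from CW:    ~{case2_l3_hit():.0f} cycles")
```

Under these assumptions case 2 is not automatically faster per miss; its appeal is that the path is short, predictable, and keeps the cross-chip snoop traffic off the interconnect entirely.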
Whereas accessing the off-chip L3 could take some time, I wonder what the cost of going through a lot of hoops like in today's Xeon Phi is. I also wonder what the power cost of that traffic is.
Now say you have 64 cores; that is 16 compute clusters, so you "only" need to connect 16 L2 interfaces to a system agent (which would include the L3/Crystal Web tags). To me it seems the data traffic on chip would be leaner, and it would be easier to provide a high-bandwidth data path between the system agent and the compute clusters. I also wonder how such an organization of the data traffic would compare to the actual set-up, where data seems to travel through lots of hoops.
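The bookkeeping for that layout is simple enough to spell out (all of it follows from the 4-core/2 MB cluster I assumed above):

```python
# Bookkeeping for the clustered layout: how many interconnect clients the
# system agent sees, and how much L2 an inclusive L3 must cover.

CORES = 64
CORES_PER_CLUSTER = 4   # my assumed cluster size
L2_PER_CLUSTER_MB = 2   # Jaguar-style shared L2 per cluster

clusters = CORES // CORES_PER_CLUSTER
total_l2_mb = clusters * L2_PER_CLUSTER_MB

print(f"interconnect clients: {clusters} (vs {CORES} with per-core slices)")
print(f"aggregate L2 the L3 must duplicate: {total_l2_mb} MB")
```

So the system agent talks to 16 clients instead of 64, and the inclusive L3 must hold at least 32 MB of L2 contents, a number that comes back below when I talk about CW capacity.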
Intel could also copy AMD with regard to the L2 clock speed. Like in Jaguar, the L2 could run at half the core speed; I think 4 threads should be enough to hide the extra latency. Xeon Phi burns a lot of power, so that could be welcome. Maybe Intel could make further trade-offs (or simply do better) and increase the bandwidth between the cores and their local slice of the L2.
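A crude round-robin SMT model suggests why 4 threads might be enough; the work and latency figures are assumptions, just to show the shape of the trade-off:

```python
# Crude SMT model: can 4 threads keep the core busy with a slower L2?
# Work/latency figures are assumptions chosen for illustration.

def utilization(threads, work_cycles, l2_latency):
    # Each thread issues for `work_cycles`, then waits `l2_latency` on the L2.
    # With enough threads the waits overlap and the core stays busy.
    return min(1.0, threads * work_cycles / (work_cycles + l2_latency))

WORK = 10  # cycles of useful issue between L2 accesses (assumed)

for latency in (15, 30):  # full-speed vs half-speed L2 latencies (assumed)
    for threads in (1, 2, 4):
        u = utilization(threads, WORK, latency)
        print(f"L2 latency {latency:2d}, {threads} thread(s): {u:4.0%} busy")
```

In this toy model a single thread suffers badly from the slower L2, but 4 threads saturate the core at either latency, which is the intuition behind my "4 threads should be enough".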
About Crystal Web, I read some people on RWT wondering about a hypothetical 2-CW set-up. For a chip like this it could be great, especially if the L3 includes all the data in the L2: for 64 cores that is 32 MB already.
I could see Intel trading its 512-bit bus and GDDR5 for two of those fast links to CW plus a 256-bit bus to fast DDR3/4. That is another way to save power, as it seems that CW plus its interface's power consumption is way lower than GDDR5 and its memory controller's.
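Back-of-envelope bandwidth for that swap; the transfer rates and the ~50 GB/s per CW link are ballpark assumptions (the latter is the figure floated around for Haswell's eDRAM link), not product specs:

```python
# Rough bus bandwidths for the memory-system swap suggested above.
# All rates are ballpark assumptions, not product specifications.

def bus_gbs(width_bits, gt_per_s):
    # bytes per transfer * transfers per second -> GB/s
    return width_bits / 8 * gt_per_s

gddr5 = bus_gbs(512, 5.0)   # 512-bit GDDR5 at an assumed 5 GT/s
ddr4 = bus_gbs(256, 2.4)    # 256-bit DDR3/4 at an assumed 2.4 GT/s
cw_links = 2 * 50.0         # two CW links at ~50 GB/s each (assumed)

print(f"512-bit GDDR5:        {gddr5:6.1f} GB/s")
print(f"256-bit DDR4 + 2x CW: {ddr4 + cw_links:6.1f} GB/s ({ddr4:.1f} + {cw_links:.1f})")
```

Raw DRAM bandwidth drops, but the bet is that a 32 MB+ CW catches most of the traffic, so the GDDR5-class bandwidth (and its power bill) is no longer needed at the DRAM pins.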
Another thing is that I don't expect Intel to scale the number of cores, or rather the theoretical throughput of the chip. Actually I wonder if they could focus on the "contrary", i.e. lowering power consumption. Downclocking the chip would de facto lower the ratio between the bandwidth needed and the bandwidth available, which would have a nice effect on power efficiency. Along with possible changes to the L2 (my own speculation), CW's nice power characteristics, dropping GDDR5, and the jump to 14 nm lithography, those Xeon Phi chips could have pretty neat power characteristics.
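The reason downclocking pays off twice: dynamic power goes roughly as V²·f, so if the voltage can drop along with the frequency the power saving is close to cubic, while the memory bandwidth on offer stays the same. The scaling factors below are assumptions, only there to show the shape:

```python
# Why downclocking helps power efficiency twice over. Scaling factors are
# assumptions for illustration; P ~ C * V^2 * f with C unchanged.

def relative_power(freq_scale, volt_scale):
    return volt_scale ** 2 * freq_scale

FREQ = 0.8  # clock reduced to 80% of baseline (assumed)
VOLT = 0.9  # allowing voltage to drop to 90% (assumed)

power = relative_power(FREQ, VOLT)
print(f"power:  {power:.0%} of baseline, peak FLOPS: {FREQ:.0%}")
print(f"bandwidth per FLOP: {1 / FREQ:.2f}x the baseline ratio")
```

Under those assumptions you give up 20% of peak throughput but save ~35% of dynamic power, and every remaining FLOP gets 25% more bandwidth.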
If theoretical throughput doesn't increase, it is easier to rack together a bunch of cooler chips.