22 nm Larrabee

Discussion in 'Architecture and Products' started by Nick, May 6, 2011.

  1. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
  2. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
Maybe it's caused by a strange mapping of OpenCL work items to the vALU. CLInfo returns a preferred vector width of 1 float but a native vector width of 16 floats, for instance (which seems contradictory to me; the preferred width should always be equal to or larger than the native one, not smaller). Maybe each work item gets the full vALU to itself, and the code would have to use wide vector data types (float16) to put the vALU to good use (in which case the reported preferred widths are simply wrong)? They should do it like all GPUs and run one work item per lane of the vALU (wasn't that a "strand" in Intel's original Larrabee terminology?). Or are there restrictions in Larrabee/KC I'm not aware of that would make this impossible for OpenCL?
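A toy model may make the difference between the two mappings concrete. The lane count is KC's 16-wide single-precision vALU; the utilization model itself is a simplifying assumption for illustration:

```python
# Toy utilization model: a 16-lane vALU either runs one work item per lane
# (GPU-style) or hands the whole unit to a single work item, in which case
# scalar float code only occupies one lane per issued instruction.

VALU_LANES = 16  # KC vector unit: 16 floats wide

def lane_utilization(work_items_per_issue, vector_width):
    """Fraction of vALU lanes doing useful work per issued instruction."""
    lanes_used = min(work_items_per_issue * vector_width, VALU_LANES)
    return lanes_used / VALU_LANES

# GPU-style mapping: 16 scalar work items, one per lane
print(lane_utilization(16, 1))   # 1.0
# Suspected KC mapping: one work item owns the vALU, kernel uses scalar float
print(lane_utilization(1, 1))    # 0.0625
# Same mapping, but the kernel uses float16 explicitly
print(lane_utilization(1, 16))   # 1.0
```

Under this model, scalar OpenCL kernels on the suspected mapping would leave 15 of 16 lanes idle, which is consistent with the order-of-magnitude gaps discussed below.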

I was also surprised at first that the number of hardware threads (236, equalling 59 active cores × 4 threads) is returned as the number of CUs, rather than the number of cores. But I think the reason is simply that data sharing between hardware threads on a core may not be as straightforward as within a single thread (there are no local memory structures directly shared by all threads on a CU, with local memory accesses bound in hardware to the area allocated to each work group). I think there was a discussion with Andy Lauritzen about whether that design choice in GPUs makes sense or not. Here we may be seeing a side effect of not having such a mechanism in Larrabee. It means one has to run at least 236 work groups (4 per core) to load all threads of the chip. Apparently it can't distribute a large work group over several hardware threads (as GPUs can).
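The occupancy arithmetic above can be sketched directly (59 active cores is inferred from the reported 236 CUs; 4 hardware threads per core is KC's design):

```python
# Minimum number of work groups needed to load every hardware thread on KC,
# under the assumption stated above that a large work group cannot be split
# across hardware threads.

ACTIVE_CORES = 59      # implied by the 236 CUs reported by CLInfo
THREADS_PER_CORE = 4   # KC's round-robin hardware threads

def min_work_groups():
    # One work group per hardware thread: fewer groups than this leaves
    # some threads of the chip idle.
    return ACTIVE_CORES * THREADS_PER_CORE

print(min_work_groups())  # 236
```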
     
  3. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    499
    Likes Received:
    177
    I believe it can. I'm generally skeptical of OpenCL benchmarks, since OpenCL is not performance portable without careful effort, and these benchmarks tend to be fairly naively written. OpenCL exposes the architecture directly, which allows users to get good performance, but does tend to cause the code to be very specialized to the particular platform the developer targeted when writing the application. It is possible to write more flexible OpenCL code, but it is much more work, and most developers don't do so. In this case, I expect they developed the code for the benchmark on a GPU (probably an AMD GPU), and so I wouldn't expect the results to be very good on a machine like Xeon Phi.

    This situation is pretty bad for OpenCL benchmarks, since it undermines my trust in their results. I don't believe you can make sweeping statements from these benchmark suites like "Xeon Phi is bad at compute". The best you can say is "This benchmark suite ran slowly on Xeon Phi". I hope over time the OpenCL benchmark world gets more sophisticated, but at the moment it seems fairly naive, modulo a few exceptions.
     
    #1123 RecessionCone, Jun 5, 2013
    Last edited by a moderator: Jun 5, 2013
  4. entity279

    Veteran Regular Subscriber

    Joined:
    May 12, 2008
    Messages:
    1,229
    Likes Received:
    422
    Location:
    Romania
    All compute benchmarks share these flaws actually.
     
  5. keldor314

    Newcomer

    Joined:
    Feb 23, 2010
    Messages:
    132
    Likes Received:
    13
Just to illustrate how atrocious these results are:

    http://clbenchmark.com/compare.jsp?config_0=15887974&config_1=11905561

    A couple of those tests have HD 7970 close to 20x faster than Xeon Phi.

    It also loses badly to Nvidia, which has lately been the poster child for bad OpenCL support.

    http://clbenchmark.com/compare.jsp?config_0=15887974&config_1=14470292

But yeah, OpenCL is rather backward in terms of both features and vendor support, not to mention that different devices need very different optimization strategies. Still, though....

The difference between Phi and HD 7970 is on the same order of magnitude as the difference between Python (pure Python, no numpy or such) and C. So unless they're doing something completely ridiculous, like using a Python-style interpreter to run their OpenCL code, there's a really big problem here.
     
    #1125 keldor314, Jun 6, 2013
    Last edited by a moderator: Jun 6, 2013
  6. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
As I said above, it looks a bit like each work item has the full SIMD unit at its disposal; it doesn't run 16 work items at once, each in one lane of the vALU. To put the resources to good use, one would have to use the float16/int16/double8 data types. Or Intel needs to change the mapping. At the moment, it looks like they are running only a slightly modified CPU driver on it.
     
  7. keldor314

    Newcomer

    Joined:
    Feb 23, 2010
    Messages:
    132
    Likes Received:
    13
Yeah, it looks like they're running on the x86 portion of the processors. They really need to change it to a proper SIMT model, since using float16 explicitly is just ugly (and 99% of the time, you'd just end up writing SIMT-style code anyway, but with intrinsics at just above assembly level. Ugh.).
     
  8. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
That is so stupid that I have a hard time believing they would do that.

K20 and KC are about equal in flops. KC has an 8-wide DP vALU. If that were true, we would see a >8x slowdown vs K20. I don't think that is quite the case here.
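That argument can be put in rough numbers; the peak figures below are approximate public values, used here purely as assumptions:

```python
# Back-of-envelope check of the ">8x slowdown" argument. Peak figures are
# rough public numbers, not measurements.

K20_DP_PEAK_TFLOPS = 1.17  # Tesla K20, double precision
KC_DP_PEAK_TFLOPS = 1.0    # Knights Corner, same order of magnitude

DP_VALU_WIDTH = 8          # KC vector unit holds 8 doubles

# If the OpenCL stack emitted scalar code (one lane of the 8-wide DP vALU),
# KC's effective peak would drop by the vector width:
kc_scalar_peak = KC_DP_PEAK_TFLOPS / DP_VALU_WIDTH
expected_slowdown = K20_DP_PEAK_TFLOPS / kc_scalar_peak
print(f"expected slowdown vs K20: {expected_slowdown:.1f}x")  # just over 9x
```

So a purely scalar mapping would predict roughly a 9x gap on flop-bound double-precision code; anything much larger would need an additional explanation.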
     
  9. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
If you look at the (very low) OpenCL benchmark numbers for KC, it appears quite possible. If they ran one work item in each lane of the vector unit (basically the equivalent of 16-item wavefronts/warps), the extremely low OpenCL performance would be hard to explain.
[strike]Someone[/strike]Keldor linked a comparison to an HD 7970 before. Only the tree search is quite fast on KC (when the work items diverge in that search, the mapping isn't an issue anymore), but otherwise an HD 7970 is on average a factor of 15 or so faster, with some subtests approaching a factor of 50. Even against the arguably not-so-good OpenCL implementation from nV, KC still loses badly to a GK110; it is about one order of magnitude on average (the worst case for KC is the bitonic merge sort, where a Titan beats KC by a factor of 82).
     
    #1129 Gipsel, Jun 7, 2013
    Last edited by a moderator: Jun 7, 2013
  10. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,723
    Likes Received:
    193
    Location:
    Stateless
Is ISPC available on Xeon Phi? I would think it would do better than OpenCL, as it is designed to run on CPUs only.
     
  11. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,743
    Likes Received:
    106
    Location:
    Taiwan
    I guess Xeon Phi is not intended to be used with OpenCL like a GPU. It's much easier to simply treat it as a multi-node computer, rather than a vector computer.

    There's some data from a new Chinese supercomputer using Xeon Phi, achieving 30 PFLOPS in Linpack. That could be the fastest supercomputer in the next Top 500 list.
     
  12. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
I am not sure about the reason behind those OpenCL numbers, but I wouldn't read too much into them. I can hardly believe that's the best a Phi part can do; I guess whatever the problem is, it will be taken care of.
     
  13. Blazkowicz

    Legend Veteran

    Joined:
    Dec 24, 2004
    Messages:
    5,607
    Likes Received:
    256
Maybe this says something about OpenCL instead. The code is compatible across devices, but you still have to rewrite it or the performance is useless. Except on DX11 GPUs, where it plays reasonably nicely even across different hardware. See, Intel made a GPU to run pixel shaders and DirectX compute shaders, and it is quite fast in OpenCL.

OpenCL does not fare too badly on GPUs; you get situations where one GPU is 2x or 3x slower than the competing GPU from the other vendor. I used to think it would be much worse. Now we have a real pathological result.

pcchen's suggestion is interesting; would you run Open MPI code and the like?
And is there a killer app for this architecture?
     
  14. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
The KC ISA is supposed to be vector-complete.

OpenCL is implicitly vectorized.

This thing should be drop-dead simple in the grand scheme of things.

Intel is supposed to have the best vectorizers in the business, although they are hardly needed here.
     
  15. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Sure, but if you put out crap tools....

    Looks like Intel really wants people to port their C/C++ codes instead of anything that might be more portable. :evil:
     
  16. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,743
    Likes Received:
    106
    Location:
    Taiwan
The latest Top500 list is out now. As expected, the new Chinese supercomputer "Tianhe-2" is No. 1. It has 16,000 nodes, each with 2 Ivy Bridge Xeons and 3 Xeon Phi processors, for a combined total of 3,120,000 cores. Its peak performance is 54.9 PFLOPS, and its LINPACK performance 33.86 PFLOPS. It's also one of the most power-hungry supercomputers, requiring a total of 17.8 MW, with similar power efficiency to No. 2 (the previous No. 1), 'Titan', which is a GPGPU-based supercomputer. The BlueGene/Q-based No. 3, 'Sequoia', is still more power efficient.

With a ~60% Rmax-to-Rpeak ratio, its computation efficiency is not very good, a little worse than the GPGPU-based No. 2.

Another Xeon Phi based supercomputer, 'Stampede', is at No. 6, with similar computation efficiency (but much worse power efficiency). In total, there are 11 systems on the list using Xeon Phi.
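The efficiency figures can be checked directly from the list data (Rpeak is 54.9 PFLOPS in the Top500 entry; the post's "54" is rounded):

```python
# Tianhe-2 Top500 entry: Rmax/Rpeak efficiency and delivered GFLOPS per watt.

RPEAK_PFLOPS = 54.9   # theoretical peak
RMAX_PFLOPS = 33.86   # measured LINPACK
POWER_MW = 17.8       # total power draw

efficiency = RMAX_PFLOPS / RPEAK_PFLOPS
gflops_per_watt = (RMAX_PFLOPS * 1e6) / (POWER_MW * 1e6)

print(f"{efficiency:.0%}")                # 62%, i.e. the ~60% in the post
print(f"{gflops_per_watt:.2f} GFLOPS/W")  # 1.90 GFLOPS/W
```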
     
  17. moozoo

    Newcomer

    Joined:
    Jul 23, 2010
    Messages:
    109
    Likes Received:
    1
I got a reply in the Intel support forum:
http://software.intel.com/en-us/forums/topic/393480

What I plan to do is have a separate .cl file optimised for CPUs (I probably should do this anyway) and use it, with work-group size tweaks, when a Xeon Phi is detected.
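A minimal sketch of that plan; the device-name substrings, file names, and work-group sizes below are all hypothetical placeholders, not values from any real driver:

```python
# Pick a kernel source file and work-group size from the OpenCL device name
# (as returned by a CL_DEVICE_NAME query). All strings and sizes here are
# illustrative assumptions.

def pick_kernel_config(device_name: str):
    name = device_name.lower()
    if "xeon phi" in name or "many integrated core" in name:
        # CPU-optimised kernels, work-group size tweaked for Phi
        return ("kernels_cpu.cl", 16)
    if "cpu" in name or "xeon" in name:
        return ("kernels_cpu.cl", 1)
    return ("kernels_gpu.cl", 64)  # default to the GPU-optimised path

print(pick_kernel_config("Intel(R) Xeon Phi(TM) coprocessor"))
```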

     
  18. traptd

    Newcomer

    Joined:
    Jun 6, 2013
    Messages:
    5
    Likes Received:
    0
  19. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,723
    Likes Received:
    193
    Location:
    Stateless
Nobody is willing to hazard a guess about how Intel could evolve this architecture?

I'm willing to speculate, though I'm among the least qualified to do so :lol: Beware, that is an ultimatum: if there is no reaction to that announcement, I will do it.

Damn me, I can't hold it... :lol: Clearly I'm saturated with the renovation work at my new place :lol:

OK, reading between the lines of that PR announcement, I wonder if Intel could do something like this:
They won't beef up the core, and may even tone it down slightly.
They'll completely rework the memory subsystem and cache hierarchy. In more detail (which I know I should avoid... God forgive me :) ):
It seems that there are a lot of buses in Xeon Phi running across all the cores.
The L2 set-up doesn't look that flexible. The cores have 4 threads to hide latency, and that should be enough to hide L2 latency. The chip is quite power hungry, so...
I could see Intel moving toward "clusters" of Xeon Phi cores, a la Jaguar. So 4-8 cores would share a common L2 (sorry for the lame wording).
The L2 interface would run at the CPU speed, but the L2 itself would run at half that clock speed to save power, a la Jaguar. That L2 would sacrifice latency for bandwidth, which should be fine given the 4 hardware threads per core.
Then Intel would save on the core-to-core, or more precisely cluster-to-cluster, communication.
Every cluster would be tied to a sort of system agent connected to the memory and to 2 Crystalwell dies.
Coherency would be handled through checking against the L3/Crystalwell.
So there would be a lot less wiring: 8-16 L2 interfaces linked to that agent versus the ring bus(es) running from core to core (or group of 8 cores), one way or another (most likely the latter) forming a high-bandwidth interface.

The chip would include 4 MB of cache holding the 4 MB of tags for the 2 Crystalwell dies (tied to the system agent).
The chip would revert to a quad-channel set-up using DDR4.
I was thinking of maybe integrating QPI links for multi-processor set-ups, but I wonder if the chip could get too tiny to fit all the IO. After all, IBM passed on that with their late Blue Gene/Q.
If Intel aims at supercomputing and tries to compete with IBM, maybe they could simply put such a chip on a board with 16 GB of RAM and their fabric interconnect, in a neat form factor easily integrated into racks (like Blue Gene/Q).

Overall the chip would be smaller and run significantly cooler. So I think the statement about removing bandwidth limitations could be linked not only to the inclusion of Crystalwell in the design (I would think 2 dies) but also to a significant rework of the cache hierarchy/memory subsystem.
     
    #1139 liolio, Jun 20, 2013
    Last edited by a moderator: Jun 20, 2013
  20. Blazkowicz

    Legend Veteran

    Joined:
    Dec 24, 2004
    Messages:
    5,607
    Likes Received:
    256
This description has me thinking about Cell. That's probably a gross simplification of a comparison; Cell took data locality to the extreme by having the SPEs only able to access their local store (I imagine it as 7 or 8 computers with 256K of memory each, with the ring bus as the "LAN"). Xeon Phi would be less braindead than that.
My feeling, and it's just a feeling, is that data locality and data flow are very important: does the data fit in a fast and close-enough cache, or are you going to wait 1000 cycles to get it? Vectors are almost irrelevant if the data isn't there.
     