http://clbenchmark.com/device-info.jsp?config=15887974
lol, I hope it can do a lot better than this.
I actually thought this card might be worth looking into.
I am having second and third thoughts.
Maybe it's caused by a strange mapping of OpenCL work-items to the vALU. CLInfo returns a preferred vector width of 1 float but a native vector width of 16 floats, for instance (which is contradictory in my opinion; the preferred width should always be equal to or larger than the native one, not smaller). Maybe each work-item has the full vALU to itself, and the code would have to use large vector data types (float16) to put the vALU to good use (in which case the returned preferred widths are simply wrong)? They should do it like all GPUs and run one work-item per lane of the vALU (wasn't that a "strand" in Intel's original Larrabee terminology?). Or are there restrictions in Larrabee/KC I'm not aware of, so this wouldn't work for OpenCL?
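For reference (not part of the original post): the two numbers mentioned come from standard OpenCL device queries, and a CLInfo-style tool reads them roughly like this. A minimal C sketch, assuming the first platform/device and omitting error checking:

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void)
    {
        cl_platform_id platform;
        cl_device_id device;
        cl_uint preferred = 0, native = 0;

        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);

        /* "preferred": the vector width the runtime suggests using in kernels */
        clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT,
                        sizeof(preferred), &preferred, NULL);
        /* "native": the width of the underlying vector hardware */
        clGetDeviceInfo(device, CL_DEVICE_NATIVE_VECTOR_WIDTH_FLOAT,
                        sizeof(native), &native, NULL);

        printf("preferred float width: %u, native float width: %u\n",
               (unsigned)preferred, (unsigned)native);
        return 0;
    }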
As I said above, it looks a bit like each work-item has the full SIMD unit at its disposal; it doesn't run 16 work-items at once, each in one lane of the vALU. To put the resources to good use, one should use the float16/int16/double8 data types. Or Intel needs to change the mapping. At the moment, it looks like they are running only a slightly changed CPU driver on it.

The difference between Phi and HD 7970 is on the same order of magnitude as the difference between Python (pure Python, no numpy or such) and C. So unless they're doing something completely ridiculous, like using a Python-style interpreter to run their OpenCL code, there's a really big problem here.
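A sketch (mine, not from the thread) of what that float16 suggestion would mean in OpenCL C. If a single work-item really owns the whole 16-wide vALU, the scalar kernel below exercises only one lane, while the float16 variant (launched with 1/16th as many work-items) fills all of them; under a GPU-style one-lane-per-work-item mapping the two should behave about the same:

    /* scalar: one float per work-item */
    __kernel void saxpy_scalar(__global const float *x,
                               __global float *y,
                               const float a)
    {
        size_t i = get_global_id(0);
        y[i] = a * x[i] + y[i];
    }

    /* vectorized: one float16 (16 consecutive floats) per work-item */
    __kernel void saxpy_float16(__global const float16 *x,
                                __global float16 *y,
                                const float a)
    {
        size_t i = get_global_id(0);
        y[i] = a * x[i] + y[i];
    }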
If you look at the (very low) OpenCL benchmark numbers for KC, it appears quite possible. If they were running one work-item in each lane of the vector unit (basically the equivalent of 16-item wavefronts/warps), you would have a hard time explaining the extremely low OpenCL performance.

That is so stupid that I have a hard time believing they would do that. K20 and KC are about equal in flops. KC has an 8-wide DP vALU. If that were true, we would see a >8x slowdown vs K20, and I don't think that is quite the case here.
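Back-of-envelope peak DP figures, assuming a 5110P-class Phi and a plain K20 (neither SKU is named in the thread):

    Xeon Phi 5110P: 60 cores x 8 DP lanes x 2 (FMA) x 1.053 GHz ≈ 1.01 TFLOPS
    Tesla K20:      13 SMX x 64 DP units x 2 (FMA) x 0.706 GHz  ≈ 1.17 TFLOPS

So the two parts are indeed within roughly 15% of each other at peak, which is why a much larger gap in the benchmark points at the software mapping rather than the raw hardware.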
Maybe this says something about OpenCL instead. Stuff is compatible, but you still have to rewrite it or the performance is useless. Except on DX11 GPUs, where it plays nice even across different hardware. See, Intel made a GPU to run pixel shaders and DirectX compute shaders, and it is quite fast in OpenCL.
OpenCL does not fare too badly on GPUs; you get situations where one GPU is 2x or 3x slower than the competing GPU from the other vendor. I used to think it would be much worse. Now we have a real pathological result.
pcchen's suggestion is interesting. Would you run Open MPI code and similar on it?
And is there a killer app for this architecture?
I am not sure about the reason behind those OpenCL numbers, but I wouldn't read too much into them. I can hardly believe that's the best a Phi part can do; I guess whatever the problem is, it will be taken care of.
Just returned from the IDC; according to the Intel guys:
1) The LLC arrangement of Phi is not like the one found in Intel CPUs. The LLC (which is the L2 for Phi) is local to each core, so each core has only 512 kB of L2 cache rather than the 31 MB aggregate figure Intel promoted. Any cached data a core needs that is not available in its local L2 has to be transferred into the local L2 before it can be accessed.
For comparison, GK110 has 1.5 MB of L2 cache, but it is a global cache like Intel's LLC on Ivy Bridge/Sandy Bridge CPUs, so its data is accessible to all GPU cores.
...
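One practical consequence of the per-core 512 kB L2 described above (an illustrative sketch, not something from the talk): per-thread working sets on Phi want to be blocked so they fit inside a single core's L2. The tile size and the blocked matrix multiply below are my own illustrative choices:

    #include <stddef.h>

    #define LOCAL_L2_BYTES (512 * 1024)  /* per-core L2 size from the post above */
    /* three double tiles (blocks of A, B and C) should fit in one core's L2:
       3 * 128 * 128 * 8 bytes = 384 kB < 512 kB */
    #define TILE 128

    /* blocked matrix multiply, C += A * B, all matrices n x n, n a multiple of TILE */
    void matmul_blocked(const double *A, const double *B, double *C, size_t n)
    {
        for (size_t ii = 0; ii < n; ii += TILE)
            for (size_t kk = 0; kk < n; kk += TILE)
                for (size_t jj = 0; jj < n; jj += TILE)
                    for (size_t i = ii; i < ii + TILE; ++i)
                        for (size_t k = kk; k < kk + TILE; ++k) {
                            double a = A[i * n + k];
                            for (size_t j = jj; j < jj + TILE; ++j)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }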