22 nm Larrabee

Discussion in 'Architecture and Products' started by Nick, May 6, 2011.

  1. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
  2. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
Maybe it's caused by a strange mapping of OpenCL work items to the vALU. CLInfo returns a preferred vector width of 1 float but a native vector width of 16 floats, for instance (which seems contradictory to me; the preferred width should always be equal to or larger than the native one, not smaller). Maybe each work item gets the full vALU to itself, and the code would have to use wide vector data types (float16) to put the vALU to good use (in which case the reported preferred widths are simply wrong)? They should do it like all GPUs and run one work item per lane of the vALU (wasn't that a "strand" in Intel's original Larrabee terminology?). Or are there restrictions in Larrabee/KC I'm not aware of that would make this impossible for OpenCL?
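A toy model may make the difference between the two mappings concrete. The lane count is KC's 16-wide single-precision vALU; the utilization model itself is a simplifying assumption for illustration:

```python
# Toy utilization model: a 16-lane vALU either runs one work item per lane
# (GPU-style) or hands the whole unit to a single work item, in which case
# scalar float code only occupies one lane per issued instruction.

VALU_LANES = 16  # KC vector unit: 16 floats wide

def lane_utilization(work_items_per_issue, vector_width):
    """Fraction of vALU lanes doing useful work per issued instruction."""
    lanes_used = min(work_items_per_issue * vector_width, VALU_LANES)
    return lanes_used / VALU_LANES

# GPU-style mapping: 16 scalar work items, one per lane
print(lane_utilization(16, 1))   # 1.0
# Suspected KC mapping: one work item owns the vALU, kernel uses scalar float
print(lane_utilization(1, 1))    # 0.0625
# Same mapping, but the kernel uses float16 explicitly
print(lane_utilization(1, 16))   # 1.0
```

Under this model, scalar OpenCL kernels on the suspected mapping would leave 15 of 16 lanes idle, which is consistent with the order-of-magnitude gaps discussed below.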

I was also surprised at first that the number of hardware threads (236, equalling 59 active cores × 4 threads) is returned as the number of CUs, rather than the number of cores. But I think the reason is simply that data sharing between hardware threads on a core may not be as straightforward as within a single thread (there are no local memory structures directly shared by all threads on a CU, with local memory accesses bound in hardware to the area allocated to each work group). I think there was a discussion with Andy Lauritzen about whether that design choice in GPUs makes sense or not. Here we may be seeing a side effect of not having such a mechanism in Larrabee. It means one has to run at least 236 work groups (4 per core) to load all threads of the chip. Apparently it can't distribute a large work group over several hardware threads (as GPUs can).
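The occupancy arithmetic above can be sketched directly (59 active cores is inferred from the reported 236 CUs; 4 hardware threads per core is KC's design):

```python
# Minimum number of work groups needed to load every hardware thread on KC,
# under the assumption stated above that a large work group cannot be split
# across hardware threads.

ACTIVE_CORES = 59      # implied by the 236 CUs reported by CLInfo
THREADS_PER_CORE = 4   # KC's round-robin hardware threads

def min_work_groups():
    # One work group per hardware thread: fewer groups than this leaves
    # some threads of the chip idle.
    return ACTIVE_CORES * THREADS_PER_CORE

print(min_work_groups())  # 236
```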
     
  3. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    499
    Likes Received:
    177
    I believe it can. I'm generally skeptical of OpenCL benchmarks, since OpenCL is not performance portable without careful effort, and these benchmarks tend to be fairly naively written. OpenCL exposes the architecture directly, which allows users to get good performance, but does tend to cause the code to be very specialized to the particular platform the developer targeted when writing the application. It is possible to write more flexible OpenCL code, but it is much more work, and most developers don't do so. In this case, I expect they developed the code for the benchmark on a GPU (probably an AMD GPU), and so I wouldn't expect the results to be very good on a machine like Xeon Phi.

    This situation is pretty bad for OpenCL benchmarks, since it undermines my trust in their results. I don't believe you can make sweeping statements from these benchmark suites like "Xeon Phi is bad at compute". The best you can say is "This benchmark suite ran slowly on Xeon Phi". I hope over time the OpenCL benchmark world gets more sophisticated, but at the moment it seems fairly naive, modulo a few exceptions.
     
    #1123 RecessionCone, Jun 5, 2013
    Last edited by a moderator: Jun 5, 2013
  4. entity279

    Veteran Regular Subscriber

    Joined:
    May 12, 2008
    Messages:
    1,229
    Likes Received:
    422
    Location:
    Romania
    All compute benchmarks share these flaws actually.
     
  5. keldor314

    Newcomer

    Joined:
    Feb 23, 2010
    Messages:
    132
    Likes Received:
    13
Just to illustrate how atrocious these results are:

    http://clbenchmark.com/compare.jsp?config_0=15887974&config_1=11905561

    A couple of those tests have HD 7970 close to 20x faster than Xeon Phi.

    It also loses badly to Nvidia, which has lately been the poster child for bad OpenCL support.

    http://clbenchmark.com/compare.jsp?config_0=15887974&config_1=14470292

But yeah, OpenCL is rather backward in terms of both features and vendor support, not to mention that different devices need very different optimization strategies. Still, though....

The difference between Phi and HD 7970 is on the same order of magnitude as the difference between Python (pure Python, no numpy or such) and C. So unless they're doing something completely ridiculous, like using a Python-style interpreter to run their OpenCL code, there's a really big problem here.
     
    #1125 keldor314, Jun 6, 2013
    Last edited by a moderator: Jun 6, 2013
  6. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
As I said above, it looks a bit like each work item has the full SIMD unit at its disposal; it doesn't run 16 work items at once, each in one lane of the vALU. To put the resources to good use, one would have to use the float16/int16/double8 data types. Or Intel needs to change the mapping. At the moment, it looks like they are running only a slightly modified CPU driver on it.
     
  7. keldor314

    Newcomer

    Joined:
    Feb 23, 2010
    Messages:
    132
    Likes Received:
    13
Yeah, it looks like they're running on the x86 portion of the processors. They really need to change it to a proper SIMT model, since using float16 explicitly is just ugly (and 99% of the time, you'd just end up writing SIMT-style code anyway, but with intrinsics at just above assembly level. Ugh.).
     
  8. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
That is so stupid that I have a hard time believing they would do that.

K20 and KC are about equal in flops. KC has an 8-wide DP vALU. If that were true, we would see a >8x slowdown vs K20. I don't think that is quite the case here.
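That argument can be put in rough numbers; the peak figures below are approximate public values, used here purely as assumptions:

```python
# Back-of-envelope check of the ">8x slowdown" argument. Peak figures are
# rough public numbers, not measurements.

K20_DP_PEAK_TFLOPS = 1.17  # Tesla K20, double precision
KC_DP_PEAK_TFLOPS = 1.0    # Knights Corner, same order of magnitude

DP_VALU_WIDTH = 8          # KC vector unit holds 8 doubles

# If the OpenCL stack emitted scalar code (one lane of the 8-wide DP vALU),
# KC's effective peak would drop by the vector width:
kc_scalar_peak = KC_DP_PEAK_TFLOPS / DP_VALU_WIDTH
expected_slowdown = K20_DP_PEAK_TFLOPS / kc_scalar_peak
print(f"expected slowdown vs K20: {expected_slowdown:.1f}x")  # just over 9x
```

So a purely scalar mapping would predict roughly a 9x gap on flop-bound double-precision code; anything much larger would need an additional explanation.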
     
  9. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
If you look at the (very low) OpenCL benchmark numbers for KC, it appears quite possible. If they ran one work item in each lane of the vector unit (basically the equivalent of 16-item wavefronts/warps), the extremely low OpenCL performance would be hard to explain.
[strike]Someone[/strike]Keldor linked a comparison to an HD 7970 before. Only the tree search is quite fast on KC (when the work items diverge in that search, the mapping isn't an issue anymore), but otherwise an HD 7970 is on average a factor of 15 or so faster, with some subtests approaching a factor of 50. Even against the arguably not-so-good OpenCL implementation from nV, KC still loses badly to a GK110; it is about one order of magnitude on average (the worst case for KC is the bitonic merge sort, where a Titan beats KC by a factor of 82).
     
    #1129 Gipsel, Jun 7, 2013
    Last edited by a moderator: Jun 7, 2013
  10. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,723
    Likes Received:
    193
    Location:
    Stateless
Is ISPC available on Xeon Phi? I would think it would do better than OpenCL, as it is designed to run on CPUs only.
     
  11. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,743
    Likes Received:
    106
    Location:
    Taiwan
    I guess Xeon Phi is not intended to be used with OpenCL like a GPU. It's much easier to simply treat it as a multi-node computer, rather than a vector computer.

    There's some data from a new Chinese supercomputer using Xeon Phi, achieving 30 PFLOPS in Linpack. That could be the fastest supercomputer in the next Top 500 list.
     
  12. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
I am not sure about the reason behind those OpenCL numbers, but I wouldn't read too much into them. I can hardly believe that's the best a Phi part can do; I guess whatever the problem is, it will be taken care of.
     
  13. Blazkowicz

    Legend Veteran

    Joined:
    Dec 24, 2004
    Messages:
    5,607
    Likes Received:
    256
Maybe this says something about OpenCL instead. The code is compatible across devices, but you still have to rewrite it or the performance is useless. Except on DX11 GPUs, where it plays reasonably nicely even across different hardware. See, Intel made a GPU to run pixel shaders and DirectX compute shaders, and it is quite fast in OpenCL.

OpenCL does not fare too badly on GPUs; you get situations where one GPU is 2x or 3x slower than the competing GPU from the other vendor. I used to think it would be much worse. Now we have a real pathological result.

pcchen's suggestion is interesting; would you run Open MPI code and the like?
And is there a killer app for this architecture?
     
  14. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
The KC ISA is supposed to be vector-complete.

OpenCL is implicitly vectorized.

This thing should be drop-dead simple in the grand scheme of things.

Intel is supposed to have the best vectorizers in the business, although they are hardly needed here.
     
  15. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Sure, but if you put out crap tools....

    Looks like Intel really wants people to port their C/C++ codes instead of anything that might be more portable. :evil:
     
  16. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,743
    Likes Received:
    106
    Location:
    Taiwan
The latest Top500 list is out now. As expected, the new Chinese supercomputer "Tianhe-2" is No. 1. It has 16,000 nodes, each with 2 Ivy Bridge Xeons and 3 Xeon Phi processors, for a combined total of 3,120,000 cores. Its peak performance is 54.9 PFLOPS, and its LINPACK performance 33.86 PFLOPS. It's also one of the most power-hungry supercomputers, requiring a total of 17.8 MW, with similar power efficiency to No. 2 (the previous No. 1), 'Titan', which is a GPGPU-based supercomputer. The BlueGene/Q-based No. 3, 'Sequoia', is still more power efficient.

With a ~60% Rmax-to-Rpeak ratio, its computation efficiency is not very good, a little worse than the GPGPU-based No. 2.

Another Xeon Phi based supercomputer, 'Stampede', is at No. 6, with similar computation efficiency (but much worse power efficiency). In total, there are 11 systems on the list using Xeon Phi.
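The efficiency figures can be checked directly from the list data (Rpeak is 54.9 PFLOPS in the Top500 entry; the post's "54" is rounded):

```python
# Tianhe-2 Top500 entry: Rmax/Rpeak efficiency and delivered GFLOPS per watt.

RPEAK_PFLOPS = 54.9   # theoretical peak
RMAX_PFLOPS = 33.86   # measured LINPACK
POWER_MW = 17.8       # total power draw

efficiency = RMAX_PFLOPS / RPEAK_PFLOPS
gflops_per_watt = (RMAX_PFLOPS * 1e6) / (POWER_MW * 1e6)

print(f"{efficiency:.0%}")                # 62%, i.e. the ~60% in the post
print(f"{gflops_per_watt:.2f} GFLOPS/W")  # 1.90 GFLOPS/W
```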
     
  17. moozoo

    Newcomer

    Joined:
    Jul 23, 2010
    Messages:
    109
    Likes Received:
    1
I got a reply in the Intel support forum:
http://software.intel.com/en-us/forums/topic/393480

What I plan to do is have a separate .cl file optimised for CPUs (I probably should do this anyway) and use it, with work-group size tweaks, when a Xeon Phi is detected.
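A minimal sketch of that plan; the device-name substrings, file names, and work-group sizes below are all hypothetical placeholders, not values from any real driver:

```python
# Pick a kernel source file and work-group size from the OpenCL device name
# (as returned by a CL_DEVICE_NAME query). All strings and sizes here are
# illustrative assumptions.

def pick_kernel_config(device_name: str):
    name = device_name.lower()
    if "xeon phi" in name or "many integrated core" in name:
        # CPU-optimised kernels, work-group size tweaked for Phi
        return ("kernels_cpu.cl", 16)
    if "cpu" in name or "xeon" in name:
        return ("kernels_cpu.cl", 1)
    return ("kernels_gpu.cl", 64)  # default to the GPU-optimised path

print(pick_kernel_config("Intel(R) Xeon Phi(TM) coprocessor"))
```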

     
  18. traptd

    Newcomer

    Joined:
    Jun 6, 2013
    Messages:
    5
    Likes Received:
    0
  19. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,723
    Likes Received:
    193
    Location:
    Stateless
Nobody is willing to hazard a guess about how Intel could evolve this architecture?

I'm willing to speculate, though I'm among the least qualified to do so :lol: Beware, that is an ultimatum: if there is no reaction to that announcement, I will do it.

Damn me, I can't hold it... :lol: Clearly I'm saturated with the renovation work at my new place :lol:

OK, reading between the lines of that PR announcement, I wonder if Intel could do something like this:
They won't beef up the core, and may even tone it down slightly.
They'll completely rework the memory subsystem and cache hierarchy. In more detail (which I know I should avoid... God forgive me :) ):
It seems that there are a lot of buses in Xeon Phi running across all the cores.
The L2 set-up doesn't look that flexible. The cores have 4 threads to hide latency, and that should be enough to hide L2 latency. The chip is quite power hungry, so...
I could see Intel moving toward "clusters" of Xeon Phi cores, a la Jaguar. So 4-8 cores would share a common L2 (sorry for the lame wording).
The L2 interface would run at the CPU speed, but the L2 itself would run at half that clock speed to save power, a la Jaguar. That L2 would sacrifice latency for bandwidth, which should be fine given the 4 hardware threads per core.
Then Intel would save on the core-to-core, or more precisely cluster-to-cluster, communication.
Every cluster would be tied to a sort of system agent connected to the memory and to 2 Crystalwell dies.
Coherency would be handled through checking against the L3/Crystalwell.
So there would be a lot less wiring: 8-16 L2 interfaces linked to that agent versus the ring bus(es) running from core to core (or group of 8 cores), one way or another (most likely the latter) forming a high-bandwidth interface.

The chip would include 4 MB of cache holding the 4 MB of tags for the 2 Crystalwell dies (tied to the system agent).
The chip would revert to a quad-channel set-up using DDR4.
I was thinking of maybe integrating QPI links for multi-processor set-ups, but I wonder if the chip could get too tiny to fit all the IO. After all, IBM passed on that with their late Blue Gene/Q.
If Intel aims at supercomputing and tries to compete with IBM, maybe they could simply put such a chip on a board with 16 GB of RAM and their fabric interconnect, in a neat form factor easily integrated into racks (like Blue Gene/Q).

Overall the chip would be smaller and run significantly cooler. So I think the statement about removing bandwidth limitations could be linked not only to the inclusion of Crystalwell in the design (I would think 2 dies) but also to a significant rework of the cache hierarchy/memory subsystem.
     
    #1139 liolio, Jun 20, 2013
    Last edited by a moderator: Jun 20, 2013
  20. Blazkowicz

    Legend Veteran

    Joined:
    Dec 24, 2004
    Messages:
    5,607
    Likes Received:
    256
This description has me thinking about Cell. That's probably a gross simplification of a comparison; Cell took data locality to the extreme by having the SPEs only able to access their local store (I imagine it as 7 or 8 computers with 256K of memory each, with the ring bus as the "LAN"). Xeon Phi would be less braindead than that.
My feeling, and it's just a feeling, is that data locality and data flow are very important: does the data fit in a fast and close-enough cache, or are you going to wait 1000 cycles to get it? Vectors are almost irrelevant if the data isn't there.
     