Knights Landing details at RWT

Discussion in 'Architecture and Products' started by dkanter, Jan 13, 2014.

  1. dkanter

    Regular

    Joined:
    Jan 19, 2008
    Messages:
    360
    Likes Received:
    20
  2. A1xLLcqAgt0qc2RyMz0y

    Regular

    Joined:
    Feb 6, 2010
    Messages:
    988
    Likes Received:
    280
    A little off topic, but David, I have a question about your article coverage.

    In the past your articles covered more vendors (Apple, ARM, AMD, Intel, Nvidia), but over the last 18 months you have had 7 Intel articles, 2 ARM articles and one Apple article.

    Is there a reason for the very heavy coverage of Intel and the now non-existent coverage of those other vendors?
     
  3. BRiT

    BRiT (╯°□°)╯
    Moderator Legend Alpha Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    12,514
    Likes Received:
    8,723
    Location:
    Cleveland
    Have the other vendors done anything worthy of being covered from an architectural level? Have other vendors provided any sort of base material?
     
  4. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    Being open about architectural detail isn't particularly common.
    A number of those other vendors are known to be tight-lipped.

    We've got Anandtech trying to wrangle something as basic as the issue width of Apple's Cyclone processor with their own code loops, as one example.
    I've seen complaints about the lack of feedback from ARM as well.

    x86, and for a time POWER, had a much broader level of exposure.
    Between paywalling and an increasing level of custom platform design, IBM and AMD have backtracked significantly.
    The other question mark is that, with generations of rebranding and Orwellian PR-speak under their belts, some vendors might worry that disclosing much at this point would just show how much of the external activity is decoupled from the physical reality.

    That leaves Intel as a company whose broad range of products has the unusual property of frequently being new, and which derives commercial utility from educating its customers. Granted, its architectural details for general consumption are not as forthcoming as they once were.

    I'm not sure if it's just the optics, but it looks like the increased exposure to the mobile market and its lack of transparency and/or honesty has led to a regression towards a much less open mean.
     
  5. A1xLLcqAgt0qc2RyMz0y

    Regular

    Joined:
    Feb 6, 2010
    Messages:
    988
    Likes Received:
    280
    Are you somehow connected to David since you seem to be speaking for him?
     
  6. BRiT

    BRiT (╯°□°)╯
    Moderator Legend Alpha Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    12,514
    Likes Received:
    8,723
    Location:
    Cleveland
    My words are my own. I speak for no one else. Who do you speak for?

    If you don't want others participating then don't ask questions in an open discussion forum.
     
  7. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    9,999
    Likes Received:
    4,571
    18 months?
    nVidia's GK110, Tegra K1; some of AMD's GCN family and FM2 APUs, most recently Hawaii, Bonaire and Kaveri. Both consoles sporting HSA APUs.

    Just off the top of my head.

    Disclaimer: I'm not saying what should or should not be in RWT. I'm just answering your questions.
    If dkanter has any reasons not to cover other vendors, I'm pretty sure it's not for the lack of base material.
    He's one of the few writing about Intel's non-exclusively-CPU solutions, which tend to go unnoticed in the media, and that's enough of a reason for me.
     
  8. A1xLLcqAgt0qc2RyMz0y

    Regular

    Joined:
    Feb 6, 2010
    Messages:
    988
    Likes Received:
    280
    I speak for myself, and my question was directed at David. You, however, seem to want to speak for others.
     
  9. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    I'm curious about the prediction of a 144MB L3 on-die, and the additional prediction of QPI.

    How would this L3 be distributed on-die?
    Intel went through a fair amount of work creating the distributed ring bus protocol and hashed distribution of addresses across L3 slices to minimize traffic hot spots in a comparatively constrained topology. For smaller core counts, there is a unidirectional ring bus, with clients able to add traffic at known cycles. Higher core counts can up the number of directions to two.

    How does the fabric play into this, with the additional dimension for routing and an unknown sequencing of request cycles? If the L3 is still sliced, at what granularity, and from how many directions across how many cycles can requests contend for any particular slice?

    If QPI is now the interconnect, earlier RWT articles tied its function to the introduction of MESIF. Does this mean KNL is using MESIF? How is that going to play with the new fabric?
    I've seen discussions about earlier MIC architectures leveraging some kind of L2 directory or filter structure, but the KNL article is indicating the L3 is the new filter.
    Does a line in the F state need to wander around a 2D fabric with an unknown number of requesters and intermediaries when there is an L3/directory close at hand? Perhaps the tiles are MESI (shared L2 means S can still forward between tile cores), with a forwarding state at the L3 level?
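    To make the slice question concrete, here is a minimal sketch of the hashed address-to-slice idea. The hash and the slice count below are purely illustrative assumptions; Intel's real slice hash is undocumented and nothing here is a disclosed KNL figure.

    Code:
        /* Illustrative hashed L3-slice selection (made-up hash, assumed 8 slices). */
        #include <stdint.h>
        #include <stdio.h>

        #define NUM_SLICES 8        /* assumption for the example, not a KNL number */
        #define LINE_BITS  6        /* 64-byte cache lines */

        static unsigned slice_for_addr(uint64_t paddr)
        {
            uint64_t line = paddr >> LINE_BITS;            /* drop the line offset */
            /* Fold higher line-address bits into the slice-select bits so that
             * power-of-two strides don't all land on one slice. */
            uint64_t h = line ^ (line >> 3) ^ (line >> 6) ^ (line >> 9);
            return (unsigned)(h & (NUM_SLICES - 1));
        }

        int main(void)
        {
            /* A 4 KiB stride maps entirely to slice 0 under a plain modulo of the
             * line address, but spreads across all slices once it is hashed. */
            for (uint64_t a = 0; a < 8 * 4096; a += 4096) {
                uint64_t line = a >> LINE_BITS;
                printf("addr %#7llx  plain slice %llu  hashed slice %u\n",
                       (unsigned long long)a,
                       (unsigned long long)(line & (NUM_SLICES - 1)),
                       slice_for_addr(a));
            }
            return 0;
        }

    Whatever the real function looks like, the goal is the same one the ring-bus design was chasing: keep any single slice from becoming the hot spot for a regular access pattern.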
     
  10. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    I would guess other vendors have launched products but haven't made disclosures commensurate with RWT's depth.
     
  11. A1xLLcqAgt0qc2RyMz0y

    Regular

    Joined:
    Feb 6, 2010
    Messages:
    988
    Likes Received:
    280
    I for one hope to see a detailed report on Nvidia's upcoming Denver CPU.

    Between this patent information:

    Instruction-optimizing processor with branch-count table in hardware
    https://www.google.com/patents/US20130311752

    and announced information: 7-way superscalar, 128K + 64K L1 caches

    Tegra K1 64-bit Denver core analysis: Are Nvidia’s x86 efforts hidden within?
    http://www.extremetech.com/computin...nalysis-are-nvidias-x86-efforts-hidden-within

    and a 3-year development timeframe, it will be interesting to see what Nvidia's custom 64-bit ARM CPU is.
     
  12. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    The wide issue could point to a LIW or VLIW processor core for Nvidia, if the patents turn out to be implemented.
    The patent makes note that there is a hardware decoder for non-native instructions and that native instructions bypass it. There's some ambiguity here as to whether that means there's a decode engine and an implied native decoder, or if Nvidia is going for a really old school VLIW internal representation. I'd feel more comfortable with the idea that there's at least some level of decode still present, even if it's not mentioned in the patent.
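    Purely as a thought experiment, a code-morphing front end along the patent's lines might dispatch roughly like the sketch below. Every name, structure and threshold here is hypothetical; it only shows the two paths (translation cache vs. hardware decoder), not anything Nvidia has described.

    Code:
        /* Hypothetical dual-path dispatch: translated hot regions run natively,
         * everything else goes through the hardware ARM decoder. */
        #include <stdint.h>
        #include <stdio.h>

        typedef uint64_t (*native_block_fn)(void);   /* returns the next guest PC */

        struct tcache_entry {
            uint64_t        guest_pc;      /* ARM address this entry covers           */
            native_block_fn native_code;   /* NULL until the optimizer produces it    */
            uint32_t        hotness;       /* crude stand-in for a branch-count table */
        };

        #define TCACHE_SIZE   1024
        #define HOT_THRESHOLD 50

        static struct tcache_entry tcache[TCACHE_SIZE];

        /* Trivial stubs standing in for hardware/firmware mechanisms. */
        static uint64_t hw_decode_and_execute(uint64_t pc) { return pc + 4; }
        static native_block_fn software_optimize(uint64_t pc) { (void)pc; return NULL; }

        static uint64_t dispatch_once(uint64_t guest_pc)
        {
            struct tcache_entry *e = &tcache[(guest_pc >> 2) % TCACHE_SIZE];

            if (e->guest_pc == guest_pc && e->native_code)
                return e->native_code();               /* native path: decoder bypassed */

            if (e->guest_pc != guest_pc) {             /* new region: claim the entry */
                e->guest_pc    = guest_pc;
                e->hotness     = 0;
                e->native_code = NULL;
            }
            if (++e->hotness >= HOT_THRESHOLD)         /* hot: hand it to the optimizer */
                e->native_code = software_optimize(guest_pc);

            return hw_decode_and_execute(guest_pc);    /* slow path through the decoder */
        }

        int main(void)
        {
            uint64_t pc = 0x1000;
            for (int i = 0; i < 8; i++)
                pc = dispatch_once(pc);
            printf("final guest pc: %#llx\n", (unsigned long long)pc);
            return 0;
        }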

    Since Nvidia wants HPC, giving details should be encouraged.
    On the other hand, if it is a code-morphing processor, it's a mark against transparency if Transmeta's behavior is a guide. RWT had an article about reverse-engineering it.

    One handy thing that might come of this isn't so much ARM or x86 compatibility, but that compatibility with the GPU ISA could be added as well, for example if a kernel showed bad divergence or small granularity.
    Kepler's inclusion of dependence data in its ISA would make creating instruction packets easier, but that's more a what-if and not a KNL speculation.


    As far as KNL goes, if it uses Silvermont as a basis, how it works in relation to a 4-way SMT setup with vector requirements will be interesting.
    That core's rename resources are not big enough for 4-way threading to avoid significant resource pressure; the load and store queues are going to fill up with outstanding accesses with that many threads in a stream-compute context; and the FP rename capability is simply not big enough for an AVX3 32-register context.
    The design choices for Silvermont would point to a multithreading scenario where register files get duplicated, instead of pointer-following in a physical register file as in Haswell and the like.
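    For a sense of scale, here is the back-of-the-envelope arithmetic for the architectural vector state alone, assuming the 32-register, 512-bit context (mask registers and any rename headroom not counted):

    Code:
        /* Architectural zmm state for 4-way SMT with a 32 x 512-bit register context. */
        #include <stdio.h>

        int main(void)
        {
            const int regs_per_thread = 32;       /* zmm0..zmm31           */
            const int bytes_per_reg   = 512 / 8;  /* 512-bit vector = 64 B */
            const int threads         = 4;        /* 4-way SMT per core    */

            int per_thread = regs_per_thread * bytes_per_reg;   /* 2048 B */
            int per_core   = per_thread * threads;              /* 8192 B */

            printf("%d B per thread, %d B per core just to hold zmm state\n",
                   per_thread, per_core);
            return 0;
        }

    A shared physical register file would have to be comfortably larger than that 8 KB before renaming buys any headroom, which is why duplication versus pointer-chasing becomes the interesting knob.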

    I'm curious if the explicit breakout of the VPU means a new domain or possibly a coprocessor model like AMD's FPU, or something even more separate.
    The VPUs are going to like using gather, have many outstanding accesses, tolerate latency, and like a lot of bandwidth. Itanium routed FP accesses straight to the L2.
    What if they combined all of this and gave a separate VPU ALU and memory domain that hooks into that new and custom L2?
    (edit: not sure how they'd approach the reorder buffer; sharing one at the current size risks it becoming a hazard, but its size and its sharing/duplication are knobs that can be turned)
     
    #12 3dilettante, Jan 24, 2014
    Last edited by a moderator: Jan 24, 2014
  13. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    If you look at Silvermont, it actually looks almost like AMD's K7 through K10, with separate reservation stations for each unit instead of a unified scheduler for everything as in Intel's high-performance CPUs (or two schedulers, one for int and one for FP, in the case of AMD's BD/PD/SR).
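    A toy illustration of the structural difference; the entry counts are arbitrary placeholders, not Silvermont's or anyone else's real numbers:

    Code:
        /* Toy contrast of the two scheduling styles; sizes are purely illustrative. */
        #include <stdint.h>

        struct uop { uint8_t port; uint8_t ready; /* ...payload... */ };

        /* Distributed style (Silvermont / K7-K10): each execution unit owns a small
         * private reservation station; a uop is steered to one queue at dispatch and
         * can only issue from there. */
        struct distributed_sched {
            struct uop alu0_rs[8];
            struct uop alu1_rs[8];
            struct uop mem_rs[10];
            struct uop fp_rs[12];
        };

        /* Unified style (big-core Intel): one shared pool; any ready uop can issue to
         * whichever port it targets, so entries are not stranded in a per-unit queue. */
        struct unified_sched {
            struct uop pool[60];
        };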
     
  14. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Grall likes this.
  15. nutball

    Veteran Subscriber

    Joined:
    Jan 10, 2003
    Messages:
    2,154
    Likes Received:
    483
    Location:
    en.gb.uk
    Very interesting. Can't come soon enough as far as I'm concerned.
     
  16. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    499
    Likes Received:
    177
    KNL does indeed look interesting, although Intel needs to show they can use all those peak numbers to do something useful. KNC was abysmal in that regard - it was tough to use more than 50% of off-chip memory bandwidth, for example, meaning that AMD and NVIDIA GPUs had much higher sustained throughputs than KNC, even though their peak bandwidths were lower.

    KNL's peak numbers are actually a little less than I was expecting, especially since GM200 already beats it in SP Flops and Fiji will likely beat it in memory bandwidth, and both of them are shipping earlier than KNL.

    However, GP100, which will be KNL's main competitor, will likely dominate KNL in SP & DP flops, as well as off chip bandwidth.

    So I think it's demonstrably false that CUDA's irrelevant for other HPC, even if KNL does look pretty interesting.
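    For anyone who hasn't measured this, sustained-bandwidth claims of that sort usually come from a STREAM-style kernel. A minimal triad sketch looks like the code below; the array size is a placeholder and this is a generic example, not the benchmark behind the 50% figure (build with -fopenmp):

    Code:
        /* Minimal STREAM-triad-style sustained-bandwidth measurement. */
        #include <stdio.h>
        #include <stdlib.h>
        #include <omp.h>

        #define N (1 << 26)               /* 64M doubles per array = 512 MiB each */

        int main(void)
        {
            double *a = malloc(N * sizeof *a);
            double *b = malloc(N * sizeof *b);
            double *c = malloc(N * sizeof *c);
            if (!a || !b || !c) return 1;

            #pragma omp parallel for
            for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; a[i] = 0.0; }

            double t0 = omp_get_wtime();
            #pragma omp parallel for
            for (long i = 0; i < N; i++)
                a[i] = b[i] + 3.0 * c[i];          /* triad: 2 loads + 1 store */
            double t1 = omp_get_wtime();

            double bytes = 3.0 * N * sizeof(double);
            printf("sustained: %.1f GB/s (compare against the datasheet peak)\n",
                   bytes / (t1 - t0) / 1e9);

            free(a); free(b); free(c);
            return 0;
        }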
     
    nnunn and spworley like this.
  17. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    http://www.theplatform.net/2015/03/25/more-knights-landing-xeon-phi-secrets-unveiled/

    Mmm, 16GB "L3 cache".

     
  18. spworley

    Newcomer

    Joined:
    Apr 19, 2013
    Messages:
    146
    Likes Received:
    190
    KNL's use of 90GB/sec DDR4 seems laughable at first compared to even today's 300GB/sec GPUs, soon to be 1TB/sec. But the real use of this is for a large data store, not for performance. When you have 16GB of fast MCDRAM on the package, the off-chip DDR4 RAM is essentially used like host system memory is now... a slow but large main data staging and storage region. And for HPC, having that memory as dense DIMMs gives you flexibility and a very large maximum size (384 GB). So I almost think of it as replacing a 6GB/sec PCIE bus to system memory with a 90GB/sec connection instead. KNL is the system now.
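    The arithmetic behind that, using only the figures quoted above (so nothing measured, just ratios):

    Code:
        /* Rough staging-time arithmetic from the quoted bandwidth/capacity figures. */
        #include <stdio.h>

        int main(void)
        {
            const double mcdram_gb   = 16.0;   /* on-package MCDRAM capacity     */
            const double ddr4_gbps   = 90.0;   /* quoted DDR4 bandwidth          */
            const double pcie_gbps   = 6.0;    /* quoted PCIe-attached bandwidth */
            const double ddr4_cap_gb = 384.0;  /* quoted maximum DIMM capacity   */

            printf("refill 16 GB of MCDRAM from DDR4: %.2f s\n", mcdram_gb / ddr4_gbps);
            printf("same refill over a PCIe link:     %.2f s\n", mcdram_gb / pcie_gbps);
            printf("stream the full %g GB once:       %.1f s vs %.1f s\n",
                   ddr4_cap_gb, ddr4_cap_gb / ddr4_gbps, ddr4_cap_gb / pcie_gbps);
            return 0;
        }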
     
    nnunn and Grall like this.
  19. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    This chip is also capable of being a host processor, so the DDR4 bus can be host memory in a standalone system.
     
    nnunn likes this.
  20. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    The reason GPUs became interesting is that they offered 1-2 orders of magnitude greater compute density or performance per watt or per $ (and combinations thereof).

    None of those things apply with Knights Landing in the wild. And it will run the code that everyone has been running, and happily support compute paradigms beyond crappy old MPI and OpenMP.

    You can freely choose your desired mix of task- and data-parallel kernels within a single architecture, memory hierarchy, execution model, instruction set and clustering topology. It's a flat, sane landscape.
     
    Grall likes this.