I Can Hazwell?

Discussion in 'PC Industry' started by Grall, Nov 9, 2011.

  1. I.S.T.

    Veteran

    Joined:
    Feb 21, 2004
    Messages:
    3,174
    Likes Received:
    389
    Indeed. That is what I was hinting at, but I didn't really get it across very well. Thank you for restating it in actual words and not the scribblings of a moron.
     
  2. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,555
    Likes Received:
    4,725
    Location:
    Well within 3d
    Was AMD claiming the only additional area related to x86 was specifically in the decoders?

    Some elements aren't as relevant to Bobcat, such as the A64 devoting more storage in the L1 due to the predecode bits.
    Then there's the longer pipeline, with 2-3 stages for picking and lane selection.
    Then there's the internal cracking into micro ops that requires full tracking for each op and then folding back into the ROB.
    There's the need to track flags throughout the engine, and stuff like the support of the x87 FP pipeline.

    There is a lot of impressive work over decades to make x86 look like it doesn't have disadvantages.
The whole uop cache in Intel's cores is an impressive bit of work, hitting in a single cycle in parallel with the L1, all to avoid using that front end.

    At any rate, should we believe AMD when they said that then, or after they promised an 8-16 core ARM processor with embedded microserver fabric with higher-clocking A57 cores with 2-4x the performance and higher perf/watt than a 4-core Jaguar-based Opteron-X?
     
  3. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,624
    Likes Received:
    1,050
    That's what the slide said: The overhead of x86 was 4%.

The instruction border bits were stored in the ECC bits of the I$; A64 only supported parity checking on the I$ (damage to read-only instructions = no harm done: reload from memory). You can argue they could have saved a few bits in the instruction cache without this; however, it allowed reuse of the same SRAM macro as the one used for the D$.

    I'll grant you an extra pipe stage for picking instructions after scan.

    The instruction grouping/lanes was a consequence of the ROB structure used. By grouping instructions in threes AMD used less hardware to track instructions ending up with a more compact ROB (thus faster) which had larger capacity than Intel's counterpart. Power 4&5 and the PPC970 derivative also had instruction groups, and much more restrictive ones at that.

Pipeline length is a function of the operating frequency and power consumption design point. Athlon 64 had a 12-stage pipeline, POWER 4&5 had 14 (and twice the schedule-to-execute latency to boot!). Cortex A15 has a 15-stage pipeline, as does AMD's Bobcat.

Most common instructions map one-to-one to internal ops. Some map to multiple ops but can still be decoded in a single cycle; some instructions require microcode fallback. Microcode execution is rare because it is slow, and it is slow in part because little hardware is devoted to it.

    Micro op fusion is not specific to x86. RISCs would benefit too (eg. Power's compute predicate+branch). Packing multiple ops into a single ROB entry increases the virtual size of the ROB and improves execution efficiency (and power).

Flags are renamed with every instruction that modifies them in the register rename stage. They add 6 bits to the result buses throughout the chip (newer CPUs split them up to avoid false dependencies on unused flags). x87 is indeed a headache and an abomination that won't die.

    Agree entirely.

Many techniques were developed to work around shortcomings in the ISA: SRAM-based ROBs were pioneered in the PPRO to support a large ROB (this design is still amazing to me; the register file only has three ports, and almost all register values live in the ROB). Unaligned memory accesses (supported since forever) seem like a small deal, but aren't. Large store-to-load forwarding queues exist because frequent register spilling effectively extends the size of the register file. Speculative loads, again, work around all the false RAW hazards introduced by register spills, incidentally giving a massive jump to wide superscalar implementations.

    Today you wouldn't think of building a high end CPU design without these features (well Intel are the only ones with truly speculative loads, the new IBM POWER8 might feature it).

    The success of x86 is thanks to Intel and AMD engineers overcoming the challenges posed by the ISA, - and by pure luck, because the ISA is actually fairly efficient.

    The two-operand instruction scheme was a performance bottleneck until OOO execution became a reality, then it became a boon because you save encoding bits for one operand per instruction on average. The addressing modes are *very* useful and simple compared to VAX and M68K (in particular M68020 and onwards). The instruction format doesn't have the long sequential dependency chains found in M68020. I'm not saying the prefix system is elegant, but it is easier to make fast than other CISCY schemes.

    Cheers
     
    #523 Gubbi, Oct 1, 2013
    Last edited by a moderator: Oct 1, 2013
  4. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,555
    Likes Received:
    4,725
    Location:
    Well within 3d
    Overhead means a lot of things, or nothing if they didn't want to say that much.


    For cores that had the same targets, the pipeline was longer. AMD also hid a chunk of delay in the predecoder, so it doesn't appear in the primary mispredict pipeline.
    Intel's desktop cores hide a number of stages that show up in the uop cache miss case, which isn't discussed widely.


    For AMD, they mapped to a macro op, which itself will split if it's reg-mem.
    Exception handling for an in-flight instruction is more complex if it can be in-flight in the execution pipeline while also potentially faulting in memory.

It's redundant work that expands the amount of storage past the front end, and it's a trend Silvermont bucks by not doing it.


    Which ISA shortcoming is having a (fat) ROB meant to address?
    It is also not something current architectures do, for power reasons.

    While the pressure is more acute with heavier reliance on memory, it wouldn't be unique to the ISA.
    Dynamic memory disambiguation may not be as necessary from an ISA standpoint for weaker memory models, although if you mean speculative loads other architectures have done their own forms.

    Disambiguation is something the power-conscious Silvermont doesn't do.
    It may also simplify some of the work of natively supporting x86 through the pipeline, I'm not sure.
     
    #524 3dilettante, Oct 1, 2013
    Last edited by a moderator: Oct 1, 2013
  5. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,624
    Likes Received:
    1,050
A64 executes reg-mem ops directly. All integer exec units had an AGU for this purpose (and an LS port). The cache itself was only dual ported. Using macro-ops like this was a balancing act: carrying macro-ops all the way to the execution units allowed for denser instructions in the ROB, with a corresponding increase in performance, at the cost of not being able to start the AGU op until all operands of the macro-op were ready. Intel's scheme clearly won out, especially in combination with speculative loads.

    Why? You roll back execution until before the faulting instruction (note, not op).

Silvermont is more akin to A64 in this regard. A reg-mem instruction is inherently sequential in nature anyway; the ALU part can't take place before the operands are ready, and with an in-order memory pipeline there probably isn't much to gain from trying to execute the load early.

    You have to remember what came before SRAM based data capture ROBs. It was basically CAM based reservation stations which had very limited benefit.

Data capture schedulers/ROBs made perfect sense when they had 32 entries, each 32 bits wide, and wire delay/power was modest. Then the number of entries grew, register width doubled, and the power-delay product of wires increased as geometries shrank. So now we have PRF-based ROBs instead - everywhere.

    IMO, the high register pressure (and thus, more spills) and strong memory model compound the problem.

Depends on where your performance-power design point lies, no? If you want low power, then no, don't speculate. If you want high performance, you want it.

IPF has a weak memory model and they still opted to implement the ALAT to promote loads before the addresses of pending stores are known. Every load in Intel Core 2 and onwards is basically an ALAT load.

You're right. It doesn't make sense for a low power / low performance microarchitecture. The LS units in Intel's desktop CPUs from Core 2 onwards are massive. In general, the bigger your OOOe resources are, the more you want it. A single scattering store, where the address for the store comes from a load from the LLC or, worse, main memory, will stall a CPU core dead without speculative loads. With a 192-entry ROB like in Haswell, that is a lot of lost performance.
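To make the scattering-store scenario concrete, here is a minimal C sketch of the dependency pattern (the `scatter` name and the data layout are invented for illustration, not taken from any real workload): the store address `table[idx[i]]` is itself the product of a load, so until `idx[i]` arrives from the cache hierarchy the core doesn't know which younger loads the store might collide with.

```c
#include <stddef.h>

/* Sketch of a "scattering store": the store address depends on a load
 * (idx[i]) that may miss all the way to main memory. Without load
 * speculation, every younger load in the OOO window must wait behind
 * this unknown-address store; with speculation, the core keeps issuing
 * loads and replays only on a detected collision. */
void scatter(long *table, const size_t *idx, const long *vals, size_t n)
{
    for (size_t i = 0; i < n; i++)
        table[idx[i]] = vals[i];  /* store address = just-loaded value */
}
```

The code itself is trivial; the point is that nothing in the instruction stream bounds where `table[idx[i]]` lands, which is exactly why a non-speculating LS unit has to serialize behind it.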

    Cheers
     
  6. Malo

    Malo Yak Mechanicum
    Legend Veteran Subscriber

    Joined:
    Feb 9, 2002
    Messages:
    8,297
    Likes Received:
    4,735
    Location:
    Pennsylvania
I believe my Helix comes close to all those requirements, except the price of course. It's the only thing I found that does so; however, it set me (well, the company) back $1700. It's slightly thicker and heavier (in clamshell mode) than what you specified, but the fact that it's a hybrid as well is a huge feature (convertible and detachable).
     
  7. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,555
    Likes Received:
    4,725
    Location:
    Well within 3d
Precise exceptions and correct fault handling are more difficult to maintain in a pipelined processor, especially when atomicity is required and a memory access can produce exceptions and faults an unknown number of cycles later, potentially multiple exceptions deep.
    Splitting the ops internally was what made the problem tractable and scalable performance-wise.

    Is there anything specific to x86 or any other ISA that makes it stand out as an ISA-specific feature?

    It also depends on how strongly ordered the memory model is. For good or ill, if the architecture doesn't define a stronger ordering, adding hardware to enforce it doesn't stand out as a first-choice. That being said, I think some architectures may be partly walking things back a bit and potentially having a more strongly ordered mode.

Almost no architecture moved dependent loads around; Alpha was the exception, at any rate.
     
  8. Laurent06

    Regular

    Joined:
    Dec 14, 2007
    Messages:
    847
    Likes Received:
    172
AArch64 still has post increment and ALU ops with shifted operands. Move multiple is indeed a pain to implement, in particular for register renaming, and so was removed.

You either don't know AArch64 or MIPS64. Given how wrong you were about what has been removed, I guess you didn't read much about AArch64.

If you're trying to say x86 is better suited to high performance or anything, then I refer you back to the article you cited and its findings. As I already said, for higher performance the ISA mostly doesn't matter as long as it's not brain dead (such as the first version of Alpha lacking byte ops).
     
  9. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,624
    Likes Received:
    1,050
My mistake wrt. post increment. I guess I expected it to be removed because it really doesn't play nice with OOOe implementations.

Wrt. shifts, I went by David Kanter's article on AArch64 which, looking at the instruction set, seems to be wrong (implicit shifts exist for both arithmetic and logical instructions).

    My mistake.

    IMHO, not ditching post increment and inline shifts is a mistake if ARM wants to push implementations into higher performance markets.

    I'm not saying it is better, I'm saying it's a wash, like you and the article I linked to. x86 will never be competitive with Cortex M class cpu cores in the low low power regime, but as soon as you add OOOe there basically is no handicap.

    Cheers
     
  10. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,624
    Likes Received:
    1,050
    No, it isn't ISA specific, since it is all in the execution backend of the CPUs.

    However, a largish (>30 entries) ROB is essential for x86 performance. The need for a large ROB was what drove the invention of the SRAM based data capture scheduler in PPRO (by Andy Glew, then of Intel, now MIPS).

    It is interesting to look at the academic ground work for IPF from that era. The primary motivation for the explicitly parallel instruction bundling with issue boundaries was that ROBs wouldn't scale to large sizes.

Well, a weak memory model is fine if your code is Fortran or the compiler can resolve any aliasing. Once you move into C spaghetti/object-oriented soup, the compiler really struggles and has to pepper the code with memory fences. I'm guessing the latter was the motivation for the IPF designers to add the ALAT.

If you implement memory disambiguation the way Intel does, architectures with weak memory models can ignore some of the memory fences (like SFENCE). It would speed up memory accesses considerably if there are many of those. Of course, going from zero dependency tracking (as in Alpha) to an Intel-style behemoth LS unit is quite a jump.

    Cheers
     
  11. Laurent06

    Regular

    Joined:
    Dec 14, 2007
    Messages:
    847
    Likes Received:
    172
We all make mistakes :)

    If you feel like reading docs, the ARMv8 architecture manual is available, though you need to register.

    http://infocenter.arm.com/help/topic/com.arm.doc.ddi0487a/index.html

    This was carefully studied during the investigation phase of ARMv8. These instructions bring code density and performance benefits, while the issues implementing them are not too large:
    • post increment adds one register to rename; along with pair register loads this means 3 registers to rename in a cycle, which is not a big deal (and anyway much less than the trouble of renaming ldm...); note that the post increment now always is an immediate
    • inline shifts can be implemented either by adding a pipe stage or by micro-oping; note that the shift amount is an immediate which eases the pain (and suddenly I am wondering if your mistake about shift having disappeared from ALU instructions isn't coming from the fact shift by reg operands were indeed removed...).

    We definitely agree then!
     
  12. Grall

    Grall Invisible Member
    Legend

    Joined:
    Apr 14, 2002
    Messages:
    10,801
    Likes Received:
    2,175
    Location:
    La-la land
    That's not a "serious" ultrabook... That's pretty much a bleeding edge ultrabook; except for screen resolution, is there a single notebook anywhere buyable for money that meets all of those demands of yours?

    Some of them are subjective of course and are not comparable from one person to another.
     
  13. DavidC

    Regular

    Joined:
    Sep 26, 2006
    Messages:
    348
    Likes Received:
    27
One of the biggest reasons, yes. Why do you think battery life doubled from Oak Trail to Clover Trail, with the same TDP? Why do you think we had mere 5% advances in battery life per year and then jumped to 50% with Haswell?

Yes, which was one of the reasons Windows tablets were never successful. No one really cares whether they were considering it or not; they should have developed it earlier.

Highly doubt it. You can see that the benefits of a new process far outweigh architectural differences, especially in GPUs.

    HD Graphics: 2-2.5x - http://www.anandtech.com/show/2901/4
    Desktop Sandy Bridge: 2x - http://www.anandtech.com/show/4083/...core-i7-2600k-i5-2500k-core-i3-2100-tested/11
    Mobile Sandy Bridge: 2-2.5x - http://www.anandtech.com/bench/product/348?vs=232

    http://forum.beyond3d.com/showpost.php?p=1563037&postcount=343

    Ultrabook definition does not take out that possibility as there's no requirement for certain weight.

    HP Spectre XT TouchSmart Ultrabook 15-4010nr: 4.9 lb

    <$1000: 768 display with sub-par quality, SSD caching, ~4-5 lb, single channel memory, usually last generation Core. And both high end and low end have crappy touchpads.

And promising ALL THAT but not being able to meet it is probably one of the reasons that sales of the Ultrabook category are dismal, and you don't need a genius to see it.
     
  14. Raqia

    Regular

    Joined:
    Oct 31, 2003
    Messages:
    508
    Likes Received:
    18
    Gubbi, do you know any details of how the reordering engine dispatches uops in the PPro's ROB to the various ports? In particular what new ordering does it assign? I recall I read a document about the PPro which said it more or less used a FIFO ordering but preferred to group "back-to-back uops" first. Does that mean single uops that correspond to adjacent x86 instructions in program flow or cracked uops that come from a single complex op?

    Also, I think you've said once (and Linus Torvalds as well) that Core 2's ability to speculatively load ahead of stores in most cases was its biggest performance advantage at that time, more so than its added width. How do you guys figure? The Athlon had no out of order loading capabilities in its pipeline but beat out the PPro which could move loads before stores in some cases. I take it increasing width was just giving diminishing returns by the time the Core 2 rolled around?
     
    #534 Raqia, Oct 22, 2013
    Last edited by a moderator: Oct 22, 2013
  15. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,624
    Likes Received:
    1,050
I'm guessing the FIFO ROB instruction picker has a limited scan range, so back-to-back uops simply increase the chances of finding instructions for all ports.

    I have no idea how instructions are picked in newer implementations (info is scarce, super secret sauce).

    x86 has a strong memory ordering model. That means the load store unit has to be certain a load isn't colliding with a pending store. If the address of the store is unknown no aliasing can be detected and the load waits. Core 2 allows the load to proceed and clean up afterwards if the load did alias with a store. Even when a load fails you get a substantial performance benefit because the CPU has executed further along, started more loads, some of which might miss the cache. When these loads are replayed, they see lower apparent latency than if there had been no speculation going on.

    Basically Core 2 has a strong memory ordering model running faster than a weak memory ordering model because you/the compiler don't have to be conservative with memory fences. It is equivalent to the ALAT in IPF, but without the manual setup/clean up mess.

    The consequence is fewer stalls on memory RAW hazards and it is what made a 4-wide implementation make any sense (earlier studies showed moving beyond 3-wide would give less than 4% additional performance).
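As a concrete illustration of why that disambiguation has to happen at runtime, here is a made-up C sketch (the `store_then_load` name and data are mine, purely for illustration): whether the load of `src[i]` collides with the pending store to `dst[i]` depends entirely on the pointer values passed in, which nothing in the instruction encoding reveals.

```c
#include <stddef.h>

/* Hypothetical sketch: the store to dst[i] and the load of src[i] may
 * or may not alias, depending only on the runtime values of dst and
 * src. A strongly ordered core without speculation must hold the load
 * until the store address is resolved; a Core 2-style core lets the
 * load run ahead and replays it if a collision is detected. */
long store_then_load(long *dst, const long *src, size_t n)
{
    long sum = 0;
    for (size_t i = 1; i < n; i++) {
        dst[i] = (long)(2 * i);  /* pending store */
        sum += src[i];           /* load: aliases the store iff dst == src */
    }
    return sum;
}
```

Called with distinct buffers, the loads are independent of the stores; called with `dst == src`, each load must observe the store issued just before it and the result changes. That runtime distinction is exactly what the LS unit has to resolve.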

    The Athlon beat PPRO/P2/P3 because it was wider almost everywhere and clocked as high. The Athlon 64/X2 beat P4 because it was wide and had an integrated memory controller halving memory latency. Core 2 beat the Athlon 64/Opteron despite having the memory controller on the northbridge, in part because of the advanced LS unit (and in part of smart prefetchers).

    When Nehalem came along, with its integrated memory controller, it was game over for Athlon/Opteron.

    Cheers
     
  16. Raqia

    Regular

    Joined:
    Oct 31, 2003
    Messages:
    508
    Likes Received:
    18
Let me see if I've got it right: in weak memory models (without hardware out-of-order loading), you're counting on the compiler to insert memory fences into machine code to make sure that what the processor executes matches what the programmer's high-level code means, but because a compiler might have no real understanding of how your code behaves at run-time, it might insert more memory fences than are necessary to ensure correctness over aggressive reordering. Fences also take up an instruction even if they're ignored.

    A strong memory model has the correct ordering pre-baked into the semantics of the machine instructions so you can use well defined hardware rules to reorder them and still maintain correctness.

    This general technique improves performance because loads can be very long latency instructions and running ahead to execute the next load in parallel instead of waiting for them end to end yields more gains than reordering a bunch of lower latency instructions to run in parallel.
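A hedged sketch of what that fence-peppering looks like in C11 atomics (the `publish`/`consume` names are invented): on a weakly ordered machine the release fence below is what keeps the payload visible before the flag; on x86's strong model the plain store order already provides that, so the fence compiles to little more than a compiler barrier, yet the compiler must still emit it conservatively wherever it can't prove it unnecessary.

```c
#include <stdatomic.h>

static int payload;           /* ordinary data */
static atomic_int ready = 0;  /* publication flag */

/* Producer: on a weak memory model an explicit release fence is needed
 * so the payload write is ordered before the flag write. On x86 the
 * strong model already orders the two stores. */
void publish(int value)
{
    payload = value;
    atomic_thread_fence(memory_order_release);  /* the "peppered" fence */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

/* Consumer: the matching acquire fence orders the flag read before the
 * payload read. */
int consume(void)
{
    while (!atomic_load_explicit(&ready, memory_order_relaxed))
        ;  /* spin until published */
    atomic_thread_fence(memory_order_acquire);
    return payload;
}
```

Run single-threaded this is trivially correct; the fences only earn their keep (and their cost) on weakly ordered multiprocessors, which is the overhead being weighed in the posts above.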
     
  17. Blazkowicz

    Legend Veteran

    Joined:
    Dec 24, 2004
    Messages:
    5,607
    Likes Received:
    256
I always wanted an Olivetti Quaderno. It's kind of a netbook but bad-ass looking, with a 640x400 1-bit monochrome display, a NEC V30 (80186 clone), a 20MB hard drive, audio support, a PCMCIA slot and other connectors. The claimed battery life is 8 hours!

    I've never seen one though.
    Has DOS + PDA tools in ROM and the screen is unlit.
Why can't we just have shit like that with today's technology? The mobile x86 CPU was solved over 20 years ago, and it's still going today (many options actually: Atom, Quark, Haswell, Broadwell, Temash and before that Geode, Xcore86/Vortex86).

    I would be pretty happy with a 2W x86 SoC, on-board flash + internal micro SD slots, a small and thin 2560x1600 monochrome display (that I can go using outside and in the sunlight), a good speaker, ruggedized plastic case and thick keyboard keys, crazy light - it could use a phablet battery, why not.
    Battery life would be drained when you turn on the wifi or 4G.
     
    #537 Blazkowicz, Oct 22, 2013
    Last edited by a moderator: Oct 22, 2013
  18. Grall

    Grall Invisible Member
    Legend

    Joined:
    Apr 14, 2002
    Messages:
    10,801
    Likes Received:
    2,175
    Location:
    La-la land
    We do have shit like that... They're called tablets, and they don't have shitty hardware like 186 CPUs and 1-bit low-res no-lit displays, and actually get up to 10 hours battery life (or maybe even more) rather than 8.
     
  19. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,624
    Likes Received:
    1,050
Correct. A strong memory ordering model implies every load and store is done in order as viewed by the program counter. There are various weak memory ordering models, where the weakest has zero concern for ordering and a load or store executes when it is dispatched.

Since only the apparent order has to be correct (at least for non-IO memory-mapped addresses) there is a lot of room for optimization. The first is to let loads execute out of order with respect to other loads. The next is to execute load/store operations out of order where the addresses are known (and aliasing can be detected). The final one is to speculatively execute loads even though the address of a pending store isn't known.

All of this costs silicon real estate, so an LS unit implementing a weak memory ordering model takes up a lot less area than a fast, strongly ordered one does. For example, Silvermont's LS unit is in-order to save transistors and power.

    A weak memory model is perfectly fine if you can guarantee there is no aliasing between your data structures. Fortran and C++ arrays are guaranteed not to overlap so resolving aliasing is easy (there is none), exchange your C++ arrays (foobar[]) with pointers (*foobar) and the compiler is up shit creek without a paddle.
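C99's `restrict` is roughly the analogue of the Fortran no-overlap guarantee described above; here is a small sketch (function names invented) of how the same loop changes meaning once the compiler can't rule out overlap:

```c
#include <stddef.h>

/* With plain pointers the compiler must assume out[] may overlap in[]:
 * every store to out[i] can feed a later load of in[j], so loads cannot
 * be freely hoisted above the stores. */
void scale(float *out, const float *in, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * 2.0f;
}

/* restrict asserts non-overlap, much like Fortran array arguments:
 * the compiler may now reorder and vectorize the loop freely. */
void scale_restrict(float *restrict out, const float *restrict in, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * 2.0f;
}
```

Calling `scale(a + 1, a, n)` makes each iteration's load consume the previous iteration's store, which is perfectly legal for the plain-pointer version but undefined behavior for the `restrict` one; that ambiguity is what the hardware (or fences) must otherwise police.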

    Correct.

    Exactly, since your CPU doesn't stall on RAW hazards it can speculate further and start more loads earlier.

    Cheers
     
  20. Raqia

    Regular

    Joined:
    Oct 31, 2003
    Messages:
    508
    Likes Received:
    18
Thanks Gubbi. I've taken another look at the specs of the Athlon vs. the P6 derivatives, and the Athlon had 4x the L1 cache of its contemporary P6-based rivals. This alone must have offset most of the latency penalties, relative to the P6, from its inability to reorder loads.
     
