I Can Hazwell?

Discussion in 'PC Industry' started by Grall, Nov 9, 2011.

  1. I.S.T.

    Veteran

    Joined:
    Feb 21, 2004
    Messages:
    3,174
    Likes Received:
    389
    Indeed. That is what I was hinting at, but I didn't really get it across very well. Thank you for restating it in actual words and not the scribblings of a moron.
     
  2. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,555
    Likes Received:
    4,725
    Location:
    Well within 3d
    Was AMD claiming the only additional area related to x86 was specifically in the decoders?

    Some elements aren't as relevant to Bobcat, such as the A64 devoting more storage in the L1 due to the predecode bits.
    Then there's the longer pipeline, with 2-3 stages for picking and lane selection.
    Then there's the internal cracking into micro ops that requires full tracking for each op and then folding back into the ROB.
    There's the need to track flags throughout the engine, and stuff like the support of the x87 FP pipeline.

    There is a lot of impressive work over decades to make x86 look like it doesn't have disadvantages.
The whole uop cache in Intel's cores is an impressive bit of work, hitting in a single cycle in parallel with the L1, all to avoid using that front end.

    At any rate, should we believe AMD when they said that then, or after they promised an 8-16 core ARM processor with embedded microserver fabric with higher-clocking A57 cores with 2-4x the performance and higher perf/watt than a 4-core Jaguar-based Opteron-X?
     
  3. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,624
    Likes Received:
    1,050
    That's what the slide said: The overhead of x86 was 4%.

The instruction border bits were stored in the ECC bits of the I$; A64 only supported parity checking on the I$ (damage to read-only instructions = no harm done: reload from memory). You can argue they could have saved a few bits in the instruction cache without this; however, it allowed reuse of the same SRAM macro as the one used for the D$.

    I'll grant you an extra pipe stage for picking instructions after scan.

    The instruction grouping/lanes was a consequence of the ROB structure used. By grouping instructions in threes AMD used less hardware to track instructions ending up with a more compact ROB (thus faster) which had larger capacity than Intel's counterpart. Power 4&5 and the PPC970 derivative also had instruction groups, and much more restrictive ones at that.

Pipeline length is a function of the operating frequency and power consumption design point. Athlon 64 had a 12-stage pipeline, POWER 4&5 had 14 (and twice the schedule-to-execute latency to boot!). Cortex A15 has a 15-stage pipeline, as does AMD's Bobcat.

Most common instructions map one-to-one to internal ops. Some map to multiple ops but can still be decoded in a single cycle; some instructions require microcode fallback. Microcode execution is rare because it is slow, and it is slow in part because little hardware is devoted to it.

    Micro op fusion is not specific to x86. RISCs would benefit too (eg. Power's compute predicate+branch). Packing multiple ops into a single ROB entry increases the virtual size of the ROB and improves execution efficiency (and power).

Flags are renamed with every instruction that modifies them in the register rename stage. They add 6 bits to the result buses throughout the chip (newer CPUs split them up to avoid false dependencies on unused flags). x87 is indeed a headache and an abomination that won't die.

    Agree entirely.

Many techniques were developed to work around shortcomings in the ISA: SRAM-based ROBs were pioneered in the PPRO to support a large ROB (this design is still amazing to me; the register file only has three ports, and almost all register values live in the ROB). Unaligned memory accesses (supported since forever) seem like a small deal, but aren't. Large store-to-load forwarding queues exist because frequent register spilling effectively extends the size of the register file. Speculative loads, again, work around all the false RAW hazards introduced by register spills, incidentally giving a massive jump to wide superscalar implementations.

    Today you wouldn't think of building a high end CPU design without these features (well Intel are the only ones with truly speculative loads, the new IBM POWER8 might feature it).

    The success of x86 is thanks to Intel and AMD engineers overcoming the challenges posed by the ISA, - and by pure luck, because the ISA is actually fairly efficient.

    The two-operand instruction scheme was a performance bottleneck until OOO execution became a reality, then it became a boon because you save encoding bits for one operand per instruction on average. The addressing modes are *very* useful and simple compared to VAX and M68K (in particular M68020 and onwards). The instruction format doesn't have the long sequential dependency chains found in M68020. I'm not saying the prefix system is elegant, but it is easier to make fast than other CISCY schemes.

    Cheers
     
    #523 Gubbi, Oct 1, 2013
    Last edited by a moderator: Oct 1, 2013
  4. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,555
    Likes Received:
    4,725
    Location:
    Well within 3d
    Overhead means a lot of things, or nothing if they didn't want to say that much.


    For cores that had the same targets, the pipeline was longer. AMD also hid a chunk of delay in the predecoder, so it doesn't appear in the primary mispredict pipeline.
    Intel's desktop cores hide a number of stages that show up in the uop cache miss case, which isn't discussed widely.


    For AMD, they mapped to a macro op, which itself will split if it's reg-mem.
    Exception handling for an in-flight instruction is more complex if it can be in-flight in the execution pipeline while also potentially faulting in memory.

It's redundant work that expands the amount of storage past the front end, and it's a trend Silvermont bucks by not doing it.


    Which ISA shortcoming is having a (fat) ROB meant to address?
    It is also not something current architectures do, for power reasons.

    While the pressure is more acute with heavier reliance on memory, it wouldn't be unique to the ISA.
    Dynamic memory disambiguation may not be as necessary from an ISA standpoint for weaker memory models, although if you mean speculative loads other architectures have done their own forms.

    Disambiguation is something the power-conscious Silvermont doesn't do.
    It may also simplify some of the work of natively supporting x86 through the pipeline, I'm not sure.
     
    #524 3dilettante, Oct 1, 2013
    Last edited by a moderator: Oct 1, 2013
  5. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,624
    Likes Received:
    1,050
A64 executes reg-mem ops directly. All integer exec units had an AGU for this purpose (and an LS port). The cache itself was only dual ported. Using macro-ops like this was a balancing act: carrying macro-ops all the way to the execution units allowed for denser instructions in the ROB, with a corresponding increase in performance, at the cost of not being able to start the AGU op until all operands of the macro-op were ready. Intel's scheme clearly won out, especially in combination with speculative loads.

    Why? You roll back execution until before the faulting instruction (note, not op).

Silvermont is more akin to A64 in this regard. A reg-mem instruction is inherently sequential in nature anyway; the ALU part can't take place before the operands are ready, and with an in-order memory pipeline there probably isn't much to gain from trying to execute the load early.

    You have to remember what came before SRAM based data capture ROBs. It was basically CAM based reservation stations which had very limited benefit.

Data capture schedulers/ROBs made perfect sense when they had 32 entries, each 32 bits wide, and wire delay/power was modest. Then the number of entries grew, register width doubled, and the power-delay product of wires increased as geometries shrank. So now we have PRF-based ROBs instead - everywhere.

    IMO, the high register pressure (and thus, more spills) and strong memory model compound the problem.

Depends on where your performance-power design point lies, no? If you want low power, then no, don't speculate. If you want high performance, you want it.

IPF has a weak memory model and they still opted to implement the ALAT to promote loads before the addresses of pending stores are known. Every load in Intel Core 2 and onwards is basically an ALAT load.

You're right. It doesn't make sense for a low power / low performance microarchitecture. The LS units in Intel's desktop CPUs from Core 2 onwards are massive. In general, the bigger your OOOe resources are, the more you want it. A single scattering store, where the address for the store comes from a load from the LLC or, worse, main memory, will stall a CPU core dead without speculative loads. With a 192-entry ROB like in Haswell, that is a lot of lost performance.
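To make the scattering-store scenario concrete, here is a minimal C sketch of the dependency pattern (the `scatter` name and the data layout are invented for illustration, not taken from any real workload): the store address `table[idx[i]]` is itself the product of a load, so until `idx[i]` arrives from the cache hierarchy the core doesn't know which younger loads the store might collide with.

```c
#include <stddef.h>

/* Sketch of a "scattering store": the store address depends on a load
 * (idx[i]) that may miss all the way to main memory. Without load
 * speculation, every younger load in the OOO window must wait behind
 * this unknown-address store; with speculation, the core keeps issuing
 * loads and replays only on a detected collision. */
void scatter(long *table, const size_t *idx, const long *vals, size_t n)
{
    for (size_t i = 0; i < n; i++)
        table[idx[i]] = vals[i];  /* store address = just-loaded value */
}
```

The code itself is trivial; the point is that nothing in the instruction stream bounds where `table[idx[i]]` lands, which is exactly why a non-speculating LS unit has to serialize behind it.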

    Cheers
     
  6. Malo

    Malo Yak Mechanicum
    Legend Veteran Subscriber

    Joined:
    Feb 9, 2002
    Messages:
    8,297
    Likes Received:
    4,735
    Location:
    Pennsylvania
I believe my Helix comes close to all those requirements, except the price of course. It's the only thing I found that does so; however, it set me (well, the company) back $1700. It's slightly thicker and heavier (in clamshell mode) than what you specified, but the fact that it's a hybrid as well is a huge feature (convertible and detachable).
     
  7. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,555
    Likes Received:
    4,725
    Location:
    Well within 3d
Precise exceptions and correct fault handling are more difficult to maintain in a pipelined processor, especially when atomicity is required and a memory access can produce exceptions and faults an unknown number of cycles later, potentially multiple exceptions deep.
    Splitting the ops internally was what made the problem tractable and scalable performance-wise.

    Is there anything specific to x86 or any other ISA that makes it stand out as an ISA-specific feature?

    It also depends on how strongly ordered the memory model is. For good or ill, if the architecture doesn't define a stronger ordering, adding hardware to enforce it doesn't stand out as a first-choice. That being said, I think some architectures may be partly walking things back a bit and potentially having a more strongly ordered mode.

Almost no architecture moved dependent loads around; Alpha was the exception, at any rate.
     
  8. Laurent06

    Regular

    Joined:
    Dec 14, 2007
    Messages:
    847
    Likes Received:
    172
AArch64 still has post increment and ALU ops with shifted operands. Move multiple is indeed a pain to implement, in particular for register renaming, and so was removed.

You either don't know AArch64 or MIPS64. Given how wrong you were about what has been removed, I guess you didn't read much about AArch64.

If you're trying to say x86 is better suited to high performance or anything, then I refer you back to the article you cited and its findings. As I already said, for higher performance the ISA mostly doesn't matter as long as it's not brain dead (such as the first version of Alpha lacking byte ops).
     
  9. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,624
    Likes Received:
    1,050
My mistake wrt. post increment. I guess I expected it to be removed because it really doesn't play nice with OOOe implementations.

Wrt. shifts, I went by David Kanter's article on AArch64 which, looking at the instruction set, seems to be wrong (implicit shifts exist for both arithmetic and logical instructions).

    My mistake.

    IMHO, not ditching post increment and inline shifts is a mistake if ARM wants to push implementations into higher performance markets.

    I'm not saying it is better, I'm saying it's a wash, like you and the article I linked to. x86 will never be competitive with Cortex M class cpu cores in the low low power regime, but as soon as you add OOOe there basically is no handicap.

    Cheers
     
  10. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,624
    Likes Received:
    1,050
    No, it isn't ISA specific, since it is all in the execution backend of the CPUs.

    However, a largish (>30 entries) ROB is essential for x86 performance. The need for a large ROB was what drove the invention of the SRAM based data capture scheduler in PPRO (by Andy Glew, then of Intel, now MIPS).

    It is interesting to look at the academic ground work for IPF from that era. The primary motivation for the explicitly parallel instruction bundling with issue boundaries was that ROBs wouldn't scale to large sizes.

Well, a weak memory model is fine if your code is Fortran or the compiler can resolve any aliasing. Once you move into C spaghetti/object-oriented soup, the compiler really struggles and has to pepper the code with memory fences. I'm guessing the latter was the motivation for the IPF designers to add the ALAT.

If you implement memory disambiguation the way Intel does, architectures with weak memory models can ignore some of the memory fences (like SFENCE). It would speed up memory accesses considerably if there are many of those. Of course, going from zero dependency tracking (as in Alpha) to an Intel-style behemoth LS unit is quite a jump.

    Cheers
     
  11. Laurent06

    Regular

    Joined:
    Dec 14, 2007
    Messages:
    847
    Likes Received:
    172
We all make mistakes :)

    If you feel like reading docs, the ARMv8 architecture manual is available, though you need to register.

    http://infocenter.arm.com/help/topic/com.arm.doc.ddi0487a/index.html

    This was carefully studied during the investigation phase of ARMv8. These instructions bring code density and performance benefits, while the issues implementing them are not too large:
    • post increment adds one register to rename; along with pair register loads this means 3 registers to rename in a cycle, which is not a big deal (and anyway much less than the trouble of renaming ldm...); note that the post increment now always is an immediate
    • inline shifts can be implemented either by adding a pipe stage or by micro-oping; note that the shift amount is an immediate which eases the pain (and suddenly I am wondering if your mistake about shift having disappeared from ALU instructions isn't coming from the fact shift by reg operands were indeed removed...).

    We definitely agree then!
     
  12. Grall

    Grall Invisible Member
    Legend

    Joined:
    Apr 14, 2002
    Messages:
    10,801
    Likes Received:
    2,175
    Location:
    La-la land
    That's not a "serious" ultrabook... That's pretty much a bleeding edge ultrabook; except for screen resolution, is there a single notebook anywhere buyable for money that meets all of those demands of yours?

    Some of them are subjective of course and are not comparable from one person to another.
     
  13. DavidC

    Regular

    Joined:
    Sep 26, 2006
    Messages:
    348
    Likes Received:
    27
One of the biggest reasons, yes. Why do you think battery life doubled from Oak Trail to Clover Trail, with the same TDP? Why do you think we had mere 5% advances in battery life per year and then jumped to 50% with Haswell?

Yes, which was one of the reasons Windows tablets were never successful. No one really cares whether they were considering it or not; they should have developed it earlier.

Highly doubt it. You can see that the benefits of a new process far outweigh architectural differences, especially in GPUs.

    HD Graphics: 2-2.5x - http://www.anandtech.com/show/2901/4
    Desktop Sandy Bridge: 2x - http://www.anandtech.com/show/4083/...core-i7-2600k-i5-2500k-core-i3-2100-tested/11
    Mobile Sandy Bridge: 2-2.5x - http://www.anandtech.com/bench/product/348?vs=232

    http://forum.beyond3d.com/showpost.php?p=1563037&postcount=343

    Ultrabook definition does not take out that possibility as there's no requirement for certain weight.

    HP Spectre XT TouchSmart Ultrabook 15-4010nr: 4.9 lb

    <$1000: 768 display with sub-par quality, SSD caching, ~4-5 lb, single channel memory, usually last generation Core. And both high end and low end have crappy touchpads.

And promising ALL THAT but not being able to meet it is probably one of the reasons that sales of the Ultrabook category are dismal, and you don't need a genius to see it.
     
  14. Raqia

    Regular

    Joined:
    Oct 31, 2003
    Messages:
    508
    Likes Received:
    18
    Gubbi, do you know any details of how the reordering engine dispatches uops in the PPro's ROB to the various ports? In particular what new ordering does it assign? I recall I read a document about the PPro which said it more or less used a FIFO ordering but preferred to group "back-to-back uops" first. Does that mean single uops that correspond to adjacent x86 instructions in program flow or cracked uops that come from a single complex op?

    Also, I think you've said once (and Linus Torvalds as well) that Core 2's ability to speculatively load ahead of stores in most cases was its biggest performance advantage at that time, more so than its added width. How do you guys figure? The Athlon had no out of order loading capabilities in its pipeline but beat out the PPro which could move loads before stores in some cases. I take it increasing width was just giving diminishing returns by the time the Core 2 rolled around?
     
    #534 Raqia, Oct 22, 2013
    Last edited by a moderator: Oct 22, 2013
  15. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,624
    Likes Received:
    1,050
I'm guessing the FIFO ROB instruction picker has a limited scan range, so back-to-back uops simply increase the chances of finding instructions for all ports.

    I have no idea how instructions are picked in newer implementations (info is scarce, super secret sauce).

    x86 has a strong memory ordering model. That means the load store unit has to be certain a load isn't colliding with a pending store. If the address of the store is unknown no aliasing can be detected and the load waits. Core 2 allows the load to proceed and clean up afterwards if the load did alias with a store. Even when a load fails you get a substantial performance benefit because the CPU has executed further along, started more loads, some of which might miss the cache. When these loads are replayed, they see lower apparent latency than if there had been no speculation going on.

    Basically Core 2 has a strong memory ordering model running faster than a weak memory ordering model because you/the compiler don't have to be conservative with memory fences. It is equivalent to the ALAT in IPF, but without the manual setup/clean up mess.

    The consequence is fewer stalls on memory RAW hazards and it is what made a 4-wide implementation make any sense (earlier studies showed moving beyond 3-wide would give less than 4% additional performance).
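As a concrete illustration of why that disambiguation has to happen at runtime, here is a made-up C sketch (the `store_then_load` name and data are mine, purely for illustration): whether the load of `src[i]` collides with the pending store to `dst[i]` depends entirely on the pointer values passed in, which nothing in the instruction encoding reveals.

```c
#include <stddef.h>

/* Hypothetical sketch: the store to dst[i] and the load of src[i] may
 * or may not alias, depending only on the runtime values of dst and
 * src. A strongly ordered core without speculation must hold the load
 * until the store address is resolved; a Core 2-style core lets the
 * load run ahead and replays it if a collision is detected. */
long store_then_load(long *dst, const long *src, size_t n)
{
    long sum = 0;
    for (size_t i = 1; i < n; i++) {
        dst[i] = (long)(2 * i);  /* pending store */
        sum += src[i];           /* load: aliases the store iff dst == src */
    }
    return sum;
}
```

Called with distinct buffers, the loads are independent of the stores; called with `dst == src`, each load must observe the store issued just before it and the result changes. That runtime distinction is exactly what the LS unit has to resolve.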

    The Athlon beat PPRO/P2/P3 because it was wider almost everywhere and clocked as high. The Athlon 64/X2 beat P4 because it was wide and had an integrated memory controller halving memory latency. Core 2 beat the Athlon 64/Opteron despite having the memory controller on the northbridge, in part because of the advanced LS unit (and in part of smart prefetchers).

    When Nehalem came along, with its integrated memory controller, it was game over for Athlon/Opteron.

    Cheers
     
  16. Raqia

    Regular

    Joined:
    Oct 31, 2003
    Messages:
    508
    Likes Received:
    18
Let me see if I've got it right: in weak memory models (without hardware out-of-order loading), you're counting on the compiler to insert memory fences into machine code to make sure that what the processor executes matches what the programmer's high-level code means, but because a compiler might have no real understanding of how your code behaves at run-time, it might insert more memory fences than are necessary to ensure correctness over aggressive reordering. Fences also take up an instruction even if they're ignored.

    A strong memory model has the correct ordering pre-baked into the semantics of the machine instructions so you can use well defined hardware rules to reorder them and still maintain correctness.

    This general technique improves performance because loads can be very long latency instructions and running ahead to execute the next load in parallel instead of waiting for them end to end yields more gains than reordering a bunch of lower latency instructions to run in parallel.
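A hedged sketch of what that fence-peppering looks like in C11 atomics (the `publish`/`consume` names are invented): on a weakly ordered machine the release fence below is what keeps the payload visible before the flag; on x86's strong model the plain store order already provides that, so the fence compiles to little more than a compiler barrier, yet the compiler must still emit it conservatively wherever it can't prove it unnecessary.

```c
#include <stdatomic.h>

static int payload;           /* ordinary data */
static atomic_int ready = 0;  /* publication flag */

/* Producer: on a weak memory model an explicit release fence is needed
 * so the payload write is ordered before the flag write. On x86 the
 * strong model already orders the two stores. */
void publish(int value)
{
    payload = value;
    atomic_thread_fence(memory_order_release);  /* the "peppered" fence */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

/* Consumer: the matching acquire fence orders the flag read before the
 * payload read. */
int consume(void)
{
    while (!atomic_load_explicit(&ready, memory_order_relaxed))
        ;  /* spin until published */
    atomic_thread_fence(memory_order_acquire);
    return payload;
}
```

Run single-threaded this is trivially correct; the fences only earn their keep (and their cost) on weakly ordered multiprocessors, which is the overhead being weighed in the posts above.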
     
  17. Blazkowicz

    Legend Veteran

    Joined:
    Dec 24, 2004
    Messages:
    5,607
    Likes Received:
    256
I always wanted an Olivetti Quaderno. It's kind of a netbook but bad-ass looking, with a 640x400 1-bit monochrome display, a NEC V30 (80186 clone), a 20MB hard drive, audio support, a PCMCIA slot and other connectors. The claimed battery life is 8 hours!

    I've never seen one though.
    Has DOS + PDA tools in ROM and the screen is unlit.
Why can't we just have shit like that with today's technology? The mobile x86 CPU was solved over 20 years ago, and it's still going today (many options actually: Atom, Quark, Haswell, Broadwell, Temash and before that Geode, Xcore86/Vortex86).

    I would be pretty happy with a 2W x86 SoC, on-board flash + internal micro SD slots, a small and thin 2560x1600 monochrome display (that I can go using outside and in the sunlight), a good speaker, ruggedized plastic case and thick keyboard keys, crazy light - it could use a phablet battery, why not.
    Battery life would be drained when you turn on the wifi or 4G.
     
    #537 Blazkowicz, Oct 22, 2013
    Last edited by a moderator: Oct 22, 2013
  18. Grall

    Grall Invisible Member
    Legend

    Joined:
    Apr 14, 2002
    Messages:
    10,801
    Likes Received:
    2,175
    Location:
    La-la land
    We do have shit like that... They're called tablets, and they don't have shitty hardware like 186 CPUs and 1-bit low-res no-lit displays, and actually get up to 10 hours battery life (or maybe even more) rather than 8.
     
  19. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,624
    Likes Received:
    1,050
Correct. A strong memory ordering model implies every load and store is done in order as viewed by the program counter. There are various weak memory ordering models, where the weakest has zero concern for ordering and a load or store executes when it is dispatched.

Since only the apparent order has to be correct (at least for non-IO memory-mapped addresses) there is a lot of room for optimization. The first is to let loads execute out of order with respect to other loads. The next is to execute load/store operations out of order where the addresses are known (and aliasing can be detected). The final one is to speculatively execute loads even though the address of a pending store isn't known.

All of this costs silicon real estate, so an LS unit implementing a weak memory ordering model takes up a lot less area than a fast, strongly ordered one does. For example, Silvermont's LS unit is in-order to save transistors and power.

    A weak memory model is perfectly fine if you can guarantee there is no aliasing between your data structures. Fortran and C++ arrays are guaranteed not to overlap so resolving aliasing is easy (there is none), exchange your C++ arrays (foobar[]) with pointers (*foobar) and the compiler is up shit creek without a paddle.
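C99's `restrict` is roughly the analogue of the Fortran no-overlap guarantee described above; here is a small sketch (function names invented) of how the same loop changes meaning once the compiler can't rule out overlap:

```c
#include <stddef.h>

/* With plain pointers the compiler must assume out[] may overlap in[]:
 * every store to out[i] can feed a later load of in[j], so loads cannot
 * be freely hoisted above the stores. */
void scale(float *out, const float *in, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * 2.0f;
}

/* restrict asserts non-overlap, much like Fortran array arguments:
 * the compiler may now reorder and vectorize the loop freely. */
void scale_restrict(float *restrict out, const float *restrict in, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * 2.0f;
}
```

Calling `scale(a + 1, a, n)` makes each iteration's load consume the previous iteration's store, which is perfectly legal for the plain-pointer version but undefined behavior for the `restrict` one; that ambiguity is what the hardware (or fences) must otherwise police.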

    Correct.

    Exactly, since your CPU doesn't stall on RAW hazards it can speculate further and start more loads earlier.

    Cheers
     
  20. Raqia

    Regular

    Joined:
    Oct 31, 2003
    Messages:
    508
    Likes Received:
    18
Thanks Gubbi. I've taken another look at the specs of the Athlon vs. the P6 derivatives, and the Athlon had 4x the L1 cache of its contemporary P6-based rivals. This alone must have offset most of the latency penalties, relative to the P6, from its inability to reorder loads.
     
