I Can Hazwell?

It's not pretty, but the gather could be replayed in its entirety upon any fault whatsoever and, if you so desire, after an interrupt. All that's required is that the cache and TLB have at least as many ways as there are elements in the vector, because otherwise you could get an infinite loop where the later elements keep evicting the earlier ones.

Replaying the gather (re-issuing it from the scheduler) won't do. The ROB is of limited size, 168 entries in Haswell, equivalent to just 42 cycles at full tilt (4 uops per cycle). You'd need to remove the gather and all subsequent instructions from the ROB, which means a faulting gather acts like a mispredicted branch on top of all the other penalties.

The load/mask/loop-until-mask=0 approach guarantees forward progress, doesn't require any internal state to be saved on preemption, and doesn't throw away work already done by subsequent instructions thanks to the OOO machinery. The OOO machinery can also overlap multiple independent gather loops.
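
A minimal C sketch of that semantics (my own model, not Intel's microcode; the function name and layout are illustrative):

```c
#include <stdint.h>

/* Software model of a load->mask->fill->update gather: each element's
 * mask bit is cleared as that element completes. If a load faults or the
 * thread is preempted, no hidden state needs saving -- re-executing from
 * scratch skips the already-cleared elements, so every pass makes forward
 * progress and the "loop until mask == 0" must terminate. */
void gather_model(int32_t dst[8], const int32_t *base,
                  const int32_t idx[8], uint8_t *mask)
{
    for (int i = 0; i < 8; i++) {
        if (*mask & (1u << i)) {
            dst[i] = base[idx[i]];   /* may fault / miss the TLB here */
            *mask &= ~(1u << i);     /* element done, architecturally visible */
        }
    }
}
```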

And as Aaron points out, there is nothing preventing Intel from recognizing the load/mask/branch idiom in future implementations and speeding it up.

Cheers
 
Replaying the gather (re-issuing it from the scheduler) won't do. The ROB is of limited size, 168 entries in Haswell, equivalent to just 42 cycles at full tilt (4 uops per cycle). You'd need to remove the gather and all subsequent instructions from the ROB, which means a faulting gather acts like a mispredicted branch on top of all the other penalties.

Maybe we have different ideas in mind of what "replay" means; I'm merely saying to restart the gather instruction as if it had made no progress, not to flush and reissue from where the gather was as if it had caused a branch misprediction. The point is not having to save and restore any internal state in order to resume a partially completed gather. Is completely re-executing instructions on faults, rather than trying to complete them, really that unusual?

The load/mask/loop-until-mask=0 approach guarantees forward progress, doesn't require any internal state to be saved on preemption, and doesn't throw away work already done by subsequent instructions thanks to the OOO machinery. The OOO machinery can also overlap multiple independent gather loops.

And as Aaron points out, there is nothing preventing Intel from recognizing the load/mask/branch idiom in future implementations and speeding it up.

I don't think I understand something: are the two of you arguing in favor of the mask-loop approach, or are you saying you think that's what Haswell is actually using? Because that is definitely not how gather is specified in the AVX2 documentation. For better or worse, this isn't what Intel is doing.

And do you also not think adding a branch-on-vector-mask instruction is a pretty big shift?
 
A microcoded gather has a lot of rather thorny problems that aren't fun to implement, verify, or validate on a CPU. The coded-sequence solution provides a defined, bounded operation that can be implemented and verified much more easily.

The problem really is one of bounds. Consider that each entry in the gather could potentially point to a different PTE, and that PTE may or may not be loaded into the TLBs or even the cache. And I cannot recall off the top of my head whether the PTEs can themselves be virtually/indirectly allocated, etc. So for one gather you could be looking at potentially 16+ TLB fills + page faults + memory accesses, etc. We're talking upwards of thousands of cycles in what would, in the microcoded case, be an atomic operation with significant implications up and down the architecture and validation stack. If you look at the errata for various processors, you are likely to find many entries associated with long, complicated atomic memory operations.
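(To put rough, hedged numbers on that worst case: an 8-element ymm VPGATHERDD where every index hits a distinct, TLB-cold page means 8 page walks, and each x86-64 walk can touch up to 4 levels of paging structures, so on the order of 8 x 4 = 32 walk accesses plus the 8 data loads -- before any of those walks even raises an actual page fault.)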

Implementing it as a load->mask->fill->update instruction significantly restricts the side effects, and the performance difference should be minimal in a modern core.

But why is a masking-based approach incompatible with microcode? Let the microcode generate an unrolled loop. If any of these issues arises, then declare an exception and update the masks just like you would with a load->mask->fill->update instruction.

All the PTE/TLB-related issues could happen just as easily with a load->mask->fill->update instruction.
 
The problem is the requirement for a new vector branch instruction. Masks in AVX2 are full vector registers, and the ISA isn't designed to branch using a vector as the condition input. You'd probably have to output to the zero flag instead. Still, that's two vector registers the instruction has to write instead of one. I'm not aware of any AVX2 instructions with such a capability; surely that hurts the design somewhere. Larrabee gets around it by having a special register set for predicates.

Obviously the mask registers aren't the main register set; they are status registers.
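
For reference, this is how you branch on a vector mask with AVX2 as it stands: an explicit test moves the information into the scalar flags first (a sketch; the helper name is mine):

```c
#include <immintrin.h>

/* VPTEST (via _mm256_testz_si256) sets ZF when the AND of its operands
 * is all-zero; a scalar branch then consumes that flag. There is no
 * fused branch-on-vector-mask, so the loop condition costs separate
 * instructions. */
static int any_lane_set(__m256i mask)
{
    return !_mm256_testz_si256(mask, mask);  /* 1 while any lane remains */
}

/* usage: while (any_lane_set(mask)) { ... gather, update mask ... } */
```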
 
But AVX2 already specifies a gather with an arbitrary vector mask to merge the output. Are you now proposing a whole new set of instructions to manipulate and branch on these status registers? You want to turn it into LRBNI, full stop?
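
At the intrinsics level, the merging gather that AVX2 specifies looks like this (a sketch; the wrapper name is mine). Note that the updated, zeroed mask the VPGATHERDD instruction itself writes back is not exposed to software here, which is exactly why branching on it would need new plumbing:

```c
#include <immintrin.h>

/* Lanes whose mask element has its top bit set are loaded from
 * base[idx[lane]]; all other lanes are passed through from src. */
static __m256i masked_gather(__m256i src, const int *base,
                             __m256i idx, __m256i mask)
{
    return _mm256_mask_i32gather_epi32(src, base, idx, mask, 4 /* scale */);
}
```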
 
But AVX2 already specifies a gather with an arbitrary vector mask to merge the output. Are you now proposing a whole new set of instructions to manipulate and branch on these status registers? You want to turn it into LRBNI, full stop?

I think some kind of unification there is on the way.
 
[offtopic]
yep.

But this is a bad strategy.

A) What is usually needed is
1) one to a few very powerful CPU cores, for code which does not parallelize well
2) a large number of weak cores for massively parallel code
Very little code needs something like 6-12 relatively powerful cores; that is either too many cores or not enough.

Many powerful cores are useful in servers, that is, when you deal with many requests and want them to be low latency.
Bulldozer was often described as a server chip; that makes sense... only the power use is too high, leading to clocks that are too low, and on the desktop it faces Sandy Bridge, which humiliates it.

The logical conclusion of your post is that we need a chip with a few powerful cores and many weak ones. This is not easy; even Intel CPU + Knights/Xeon Phi isn't quite there, because they run two different x86 instruction sets.
The strong+weak mix now exists in the Tegra 3, and in the yet-to-launch Cortex-A15 + Cortex-A7 designs, but it's there to save milliwatts, not for performance. It also helps that such a device ships with a custom Linux kernel with an adequate scheduler.

Workstations happily use up your 6 or 12 or more cores or threads, because that's what is easily available now. In fact, if the Piledriver incremental improvement is good enough, I think AMD could pit dual-socket desktop boards against single-socket LGA2011. It's versatile: the same hardware can be used to run 40 Linux VMs, or to do video rendering, or something else. "Weak cores" solutions have to get better (the Xeon Phi is a significant milestone, and we'll see what Nvidia's Maxwell is up to, as well as AMD's Steamroller and post-Steamroller). But the software aspect is crucial: an 8-core, 12-core, or 16-core machine has the advantage of running regular software, not specially crafted software.

I wonder about Intel's next crazy CPU with many powerful cores, the EX variants. They now seem to skip the generations that bring a new architecture but not a new process. So we have Westmere-EX, Ivy Bridge-EX, and I suppose Broadwell-EX after that.
 
Impressive specs on show... a doubling of a lot of units, but why are sites saying the only performance doubling is in the iGPU, plus a marked increase in idle power savings, while the CPU side is expected to see no more than 10% gains over Ivy Bridge on average?? It is a 95W part vs 77W... have we come to a point where clockspeed is a limiting factor? AFAIK Intel has not gone over 4GHz on Turbo for their quad-core parts... for a long time...

From LGA1366 to LGA1155 we saw the average clockspeed go from 2.66/3.2GHz to 3.4/3.8GHz; could Haswell be limited by clockspeed? The 22nm Ivy Bridge refresh did not clock much higher (by which I mean extreme overclocking, ~5GHz) than Sandy, even after delidding and replacing the thermal paste...
 
Impressive specs on show... a doubling of a lot of units, but why are sites saying the only performance doubling is in the iGPU, plus a marked increase in idle power savings, while the CPU side is expected to see no more than 10% gains over Ivy Bridge on average?? It is a 95W part vs 77W... have we come to a point where clockspeed is a limiting factor? AFAIK Intel has not gone over 4GHz on Turbo for their quad-core parts... for a long time...

From LGA1366 to LGA1155 we saw the average clockspeed go from 2.66/3.2GHz to 3.4/3.8GHz; could Haswell be limited by clockspeed? The 22nm Ivy Bridge refresh did not clock much higher (by which I mean extreme overclocking, ~5GHz) than Sandy, even after delidding and replacing the thermal paste...

Most of the doubling happened in the SIMD units. That won't even be visible until software is recompiled -- and in some cases, rewritten. I expect Haswell to be one of the biggest advances in CPU speed in a long time, once the software catches up. In terms of "I have this old app; when I copy it over to my Haswell machine, how much faster will it go?", I don't expect all that much.

Also, many of the Haswell changes scream "reduction in possible clock speed" to me. Even if it won't clock lower than Ivy, it will certainly clock lower than it would have without those changes. Basically, Intel went: "Okay, let's make every pipeline stage 10% longer. What can you do with that?"
 
Why does Intel want to stop the clock?? Is this really the end of the MHz race? Everyone loved Sandy Bridge, and even more so after how weak the Ivy Bridge refresh was, but the LGA1366/1156 to LGA1155 story really comes down to how fast Sandy Bridge can clock up. If Haswell desktop quad-core parts don't follow that kind of clock-up, what is Intel's gameplan for desktop users? And how long will it take x86 software to catch up with Haswell's doubled SIMD units?
 
Why does Intel want to stop the clock?? Is this really the end of the MHz race? Everyone loved Sandy Bridge, and even more so after how weak the Ivy Bridge refresh was, but the LGA1366/1156 to LGA1155 story really comes down to how fast Sandy Bridge can clock up. If Haswell desktop quad-core parts don't follow that kind of clock-up, what is Intel's gameplan for desktop users? And how long will it take x86 software to catch up with Haswell's doubled SIMD units?

Intel is optimizing for perf/W, and the perf/W isn't that great at what the current CPUs can overclock to. So they aren't as aggressive as they could be with stock or even turbo clocks. If eating into that unused headroom means better perf/W, especially at the lower clock targets for ULV parts, then it's an obvious win for them.

When you refer to "everyone" and "desktop users" I get the distinct impression that you're actually referring to overclockers, who are still somewhat of a niche, especially when only the highest-end products in each line can really overclock to begin with. At stock speeds IB is an obvious win; nobody would say that it makes SB look better.

That all said, we really don't know if Haswell's changes are forcing cycle time to go up, especially vs SB, which was on an older process. It's a given that some stages will take longer in order to check dependencies for and dispatch to those additional ports, but the clock period is only as short as the slowest pipeline stage allows. And while CPU designers will do their best to make the stages run as close to the same speed as possible, I doubt they ever get it perfect, so who knows if there wasn't another, slower stage that gave them this headroom.

Furthermore, you'd expect IB to reduce cycle-time requirements vs SB, if only slightly, yet the overclocking potential was lower, most likely due to power-density issues. So they may have had headroom that wasn't even accessible, and at that point there's no reason not to trade it for perf/W. And Haswell may be better optimized for the process with regard to power distribution. So I wouldn't say it's a given that it'll hit peak clocks below what SB could.
 
Well, AMD had been the ones shouting about the increased use of CAD-created layouts. I think, anyway; with process tech advancing, it's inevitable that automation will be used more and more everywhere.

Maybe the somewhat less aggressive frequency increases are also an indication that Intel now uses a bit less custom logic?
 
Why does Intel want to stop the clock?? Is this really the end of the MHz race? Everyone loved Sandy Bridge, and even more so after how weak the Ivy Bridge refresh was, but the LGA1366/1156 to LGA1155 story really comes down to how fast Sandy Bridge can clock up. If Haswell desktop quad-core parts don't follow that kind of clock-up, what is Intel's gameplan for desktop users? And how long will it take x86 software to catch up with Haswell's doubled SIMD units?

I think some software can quickly take advantage of AVX2, no? Like video encoding stuff, 2D/3D rendering software, no?
 
I don't remember any benchmark using the special abilities of Bulldozer (FMA4).
Software may be slow-moving, but maybe the authors don't care unless the new stuff is on Intel CPUs.
 
Do you think Haswell's doubled SIMD units will rock with PC games? Well, I think an i5 4570K will clock between 3.6GHz and 4GHz and the i7 4770K between 3.8GHz and 4.2GHz, both capable of running stock 1866MHz RAM. These numbers I just pulled out of the thin air around me... but at those kinds of clocks, where will Haswell stand?? Guesses, gentlemen?

It is smaller than the jump from LGA1366/1156 to Sandy Bridge...
 
I don't remember any benchmark using the special abilities of Bulldozer (FMA4).
There was a slide back then from the official BD presentation showing rather impressive results from some OpenCL kernel with FMA4 support, but nothing more to date.
 
TDP for IVB is 77W and for Haswell 95W. Presumably they're talking about the top-end desktop parts, and those don't necessarily have the highest-powered GPUs attached, so part of that ~18W increase has to go into the CPU. The revisions to the architecture itself, though sizable, can't be consuming all that power, so I'm expecting top clocks to be higher as well.
 
Do you think Haswell's doubled SIMD units will rock with PC games? Well, I think an i5 4570K will clock between 3.6GHz and 4GHz and the i7 4770K between 3.8GHz and 4.2GHz, both capable of running stock 1866MHz RAM. These numbers I just pulled out of the thin air around me... but at those kinds of clocks, where will Haswell stand?? Guesses, gentlemen?

Actually, I think those clocks are totally unrealistic. It's running on the same process as Ivy, and every single major change they made makes the critical path longer. The typical clocks are going down, not up. Alternatively, since Intel chips now have huge clock headroom over stock, they might just spend some of that to maintain present clocks.

It is smaller than the jump from LGA1366/1156 to Sandy Bridge...

From Nehalem to Sandy Bridge, the CPU got a much better L3 and the uop cache. From Ivy to Haswell, it gets twice the bandwidth to L2, a bank-conflict-free L1, +1 ALU, 2 branches per clock, the ability to do 2 loads + 1 store per clock, and all the new instruction-set goodness.

For existing software, this will be the biggest gain in IPC since Core -> Core 2. However, I expect the IPC gain to be compensated by lower clocks.
 
Actually, I think those clocks are totally unrealistic. It's running on the same process as Ivy, and every single change they made screams to me "longer critical path". The typical clocks are going down, not up. Alternatively, since Intel chips now have huge clock headroom over stock, they might just spend some of that to maintain present clocks.


From Nehalem to Sandy Bridge, the CPU got a much better L3 and the uop cache. From Ivy to Haswell, it gets twice the bandwidth to L2, a bank-conflict-free L1, +1 ALU, 2 branches per clock, the ability to do 2 loads + 1 store per clock, and all the new instruction-set goodness.

For existing software, this will be the biggest gain in IPC since Core -> Core 2. However, I expect the IPC gain to be compensated by lower clocks.

Well, I meant clock speed... sorry if it was unclear.
The 77W to 95W TDP should account for some more clocks... maybe 3.6GHz to 4GHz for the 4770K, and the 4570K should take the speed of the present 3770K, 3.5GHz to 3.9GHz. Kind of sucky if that is all...

Could we be waiting for Haswell-E parts for the 8-core SKU? I don't understand why Intel doesn't want to go higher on the 95W Haswell quad desktop SKU. Why do we need double the iGPU performance on my gaming PC? I find it ironic that doing so will actually have a hand in killing off the desktop market... /cries that profits from the desktop are shrinking /designs the next CPU around perf/W and portability...
 
The VRMs are moved onto the CPU package; I believe this accounts for most of the TDP increase.

This also means a piece-of-crap motherboard with a non-K high-end CPU gets (even more) attractive. Expect creative chipset segmentation and annoying marketing of IGP tiers (well, we've had these things in place already).
 