I Can Hazwell?
Chinese site Chiphell has dredged up some Haswell slides (http://www.chiphell.com/thread-308643-1-1.html) by plumbing the depths of the interwebs - or something. Some interesting tidbits appear, such as:
* New socket for desktops - AGAIN; LGA 1150.
* Desktop CPUs rated up to 95W TDP.
* ULV CPU variant uses multi-chip module with platform hub and CPU on the same substrate; 1.5mm thickness; rated at 15W TDP.
* Thunderbolt support listed as a bullet point (unknown if integrated into chipset, and if so, into which variants).
* More reductions in power usage, and so on.
Ggggrrrrrrrrrrrr, Intel and sockets....
eDP is Embedded DisplayPort, used in laptops and similar devices with internal connections.
DDR3L is a special low-voltage DDR3 standard that uses 1.35V.
I would have asked you guys what those are but then google spoiled all the fun :(
Thankfully though, googling "Fully integrated VR" came up with few meaningful results, so I can at least have an excuse to post here. What does that mean? Virtual reality? Voltage regulators? Vaccine for rabies?
Man from Atlantis
09-Nov-2011, 17:33
and this means i skip 1155 and wait for 1150 :D
3dilettante
09-Nov-2011, 17:40
Thankfully though, googling "Fully integrated VR" came up with few meaningful results, so I can at least have an excuse to post here. What does that mean? Virtual reality? Voltage regulators? Vaccine for rabies?
Voltage regulators, I believe.
Looks like integrated graphics will get another healthy boost - there are now 3 (GT1 GT2 GT3) options. Interestingly, no plans to offer the fastest option on the desktop, even though the ULV version gets it?
Yes, I'd speculate and say "VR" probably refers to voltage regulators. It's been talked about in the past - in Anand presentations and stuff - that integrating those is the next step to increase power efficiency, since it would allow much faster low-high-low voltage transitions.
3dilettante
09-Nov-2011, 19:17
There should be cost benefits from further integration of board components on-die and better power consumption with more responsive scaling.
One point of curiosity with that is whether it will change the warmup period for AVX mode, where full 256-bit throughput is not achieved until after a warmup period of 70 or so cycles.
I've seen discussion that this could be due to an internal microcode rampup to keep voltage droop in check until the power delivery can catch up.
Maybe it will be less necessary when the voltage regulation is on-chip.
I have questions I hope will be answered in time.
The SB -> IB -> HW TDP divot is curious. What factors lead to IB having a lower TDP than either side? Does the higher level of integration mean somewhat higher chip consumption despite lower overall platform draw?
Does the integrated VR mean less ability by board partners to differentiate based on wacky custom VRM specs?
Will the on-die VR have a lower voltage ceiling than current enthusiast boards permit?
One point of curiosity with that is whether it will change the warmup period for AVX mode, where full 256-bit throughput is not achieved until after a warmup period of 70 or so cycles.
I've seen discussion that this could be due to an internal microcode rampup to keep voltage droop in check until the power delivery can catch up.
Maybe it will be less necessary when the voltage regulation is on-chip.
Isn't that why Intel moved to physical register file organisation, precisely to avoid the heavy power load by the widened internal data paths?
3dilettante
09-Nov-2011, 20:46
A PRF reduces general power consumption, but the AVX units do more work in 256 bit mode versus the regular 128.
If the core's voltage levels are currently adjusted to supply 128-bit mode adequately, the sudden addition of twice as many vector ALUs executing could pull them below safe levels.
A possible fix is to have the chip emit a startup sequence of ops so that the full load from the ALUs is delayed until the power delivery system can catch up.
This is why I wonder if moving the voltage regulators on-die could speed up the process because additional power is much closer and more responsive than before.
Lightman
10-Nov-2011, 12:39
I think IVB 77W TDP to HSW TDP of 95W is probably dictated by the number of transistors used in each design. Both of them are 22nm tri-gate monsters, but I expect HSW to have significantly more transistors deployed, as this is a new core with possibly even more cache + AVX2.
It can also indicate that Intel is leaving the door open for an eventual 6-core version without hitting a TDP wall imposed by too-weak motherboard VRMs.
DarthShader
10-Nov-2011, 15:52
Those TDP numbers surely contain the efficiency losses from the integrated VRs. No idea how efficient such a design can be, but at full load possibly ca. 10W could be generated by the VR.
denev2004
11-Nov-2011, 04:57
The link has been deleted.
denev2004
11-Nov-2011, 04:59
There should be cost benefits from further integration of board components on-die and better power consumption with more responsive scaling.
One point of curiosity with that is whether it will change the warmup period for AVX mode, where full 256-bit throughput is not achieved until after a warmup period of 70 or so cycles.
I've seen discussion that this could be due to an internal microcode rampup to keep voltage droop in check until the power delivery can catch up.
Maybe it will be less necessary when the voltage regulation is on-chip.
I have questions I hope will be answered in time.
The SB -> IB -> HW TDP divot is curious. What factors lead to IB having a lower TDP than either side? Does the higher level of integration mean somewhat higher chip consumption despite lower overall platform draw?
Does the integrated VR mean less ability by board partners to differentiate based on wacky custom VRM specs?
Will the on-die VR have a lower voltage ceiling than current enthusiast boards permit?
Well, I always believed the TDP of Ivy Bridge was due to the frequency.
Intel has kept the TDP of its Performance 2 level quad-core CPUs at 95W for a long time, ever since the 95W version of the Q6600.
metafor
11-Nov-2011, 22:49
Isn't that why Intel moved to physical register file organisation, precisely to avoid the heavy power load by the widened internal data paths?
You still have to support the datapath width through pipeline registers. The move to a PRF gets rid of various queues having to contain the operand data but not the arithmetic pipe.
And that's still a lot of power for a 256-bit datapath, especially for the multipliers.
The higher TDP may just be from the fact that they're moving to 6 or 8 cores; the fact that their mobile parts have a lower TDP probably means we're still getting better efficiency per core as usual.
3dilettante
30-Dec-2011, 05:20
The TDPs are for ULV, mobile, and desktop ranges, which top out at 4 cores. Perhaps something is waiting in the wings, but it seems 4 Haswell cores can manage to hit 95W if need be.
Transactional memory with Intel Haswell (http://software.intel.com/en-us/blogs/2012/02/07/transactional-synchronization-in-haswell/)
denev2004
11-Feb-2012, 07:12
Transactional memory with Intel Haswell (http://software.intel.com/en-us/blogs/2012/02/07/transactional-synchronization-in-haswell/)
I'd like to know more about it... I don't quite understand the way they express it.
Ars Technica has a nice article about it:
http://arstechnica.com/business/news/2012/02/transactional-memory-going-mainstream-with-intel-haswell.ars
I don't know, the transactional memory sounds like a solution which should be on software's side, not on hardware. On the other hand, maybe platforms like "Concurrent collections" (http://msdn.microsoft.com/en-us/library/dd997305.aspx) in .NET can be rewritten and optimized by using transactional memory.
I don't know, the transactional memory sounds like a solution which should be on software's side, not on hardware.
Indeed, there is quite a long history of "transactional memory" on the software side, with numerous implementations. I haven't tried any of them, but I'm sure they all come with a very long list of situations where transactions (as in "transparent, automatic and carefree safety net") break down: network access, filesystem access and whatnot.
CPUs already have the most rudimentary transactional support: "compare and swap" instructions. I remember these used to stall all other work on the CPU. If you think about transactions in Haswell as the ability to atomically execute a bit more than one CPU instruction, and to do so without stalling the whole 8-core CPU, then transactions are quite a reasonable feature to have.
Transactional support in CPUs must have some appeal, because many have tried it. I remember Sun (canceled) and IBM (delivered as a part of one supercomputer). If Intel delivers a mass-produced product that scratches an itch many programmers apparently have, then hats off for pushing the state of the art forward.
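To make the "compare and swap" point above a bit more concrete, here is a minimal sketch in Python. Python has no hardware CAS, so `AtomicCell` is a hypothetical stand-in that emulates the instruction with a lock; only the retry loop in `increment` reflects how CAS is typically used, and nothing here says anything about how Haswell implements its transactions.

```python
import threading

class AtomicCell:
    """Hypothetical stand-in for one memory word with a CAS instruction."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()  # emulates the hardware's atomicity

    def load(self):
        with self._guard:
            return self._value

    def compare_and_swap(self, expected, new):
        """Store `new` only if the cell still holds `expected`."""
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False

def increment(cell):
    """Read, compute, then retry until our CAS is the one that wins."""
    while True:
        seen = cell.load()
        if cell.compare_and_swap(seen, seen + 1):
            return seen + 1

def run_demo(threads=4, per_thread=1000):
    """Hammer one cell from several threads and return the final value."""
    cell = AtomicCell(0)

    def worker():
        for _ in range(per_thread):
            increment(cell)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return cell.load()
```

The limitation is visible in the sketch: a CAS covers exactly one word, so anything that has to update two words consistently needs either a lock or something broader, which is the gap hardware transactions aim at.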
Software based transactional memory is generally too slow to have any performance advantage, though.
Some CPUs, including many RISC designs, have "LL/SC" (load-link/store-conditional) instructions, which are in a way similar to transactional memory and can also be used to implement lock-free algorithms. But of course transactional memory is still much broader.
I don't know, the transactional memory sounds like a solution which should be on software's side, not on hardware.
Think of it like HW encryption or even just prefetching. It's a HW implementation of an operation, which should make it far more efficient than implementing it in software.
rpg.314
14-Feb-2012, 01:51
I don't know, the transactional memory sounds like a solution which should be on software's side, not on hardware. On the other hand, maybe platforms like "Concurrent collections" (http://msdn.microsoft.com/en-us/library/dd997305.aspx) in .NET can be rewritten and optimized by using transactional memory.
Usual languages make a mess of side effects and mutation, which is why it's hard for them to do it with good performance. Haskell is perhaps the only language which can do it without a significant performance hit.
3dilettante
14-Feb-2012, 15:12
This functionality allows the core to speculate past something that it normally couldn't, in the same vein as how current designs reorder memory operations and replay if it turns out some of them conflict.
The problem with synchronization was that it was something hardware couldn't safely make assumptions about without programmer intervention.
Intel hasn't disclosed any implementations yet, so what parts of the memory pipeline it changes and what limits it imposes are not yet known.
It could be using cache miss buffers, which was posited by another similar proposal. Writes and reads would be routed to the buffers, which would be also in contact with the cache controller.
Once the last instruction in the section ends the speculative state, it would be an all or nothing cache fill.
Superb: David Kanter: Analysis of Haswell's Transactional Memory (http://www.realworldtech.com/page.cfm?ArticleID=RWT021512050738)
3dilettante
15-Feb-2012, 15:17
I'd be curious about the proposed use of per-line bits on each line in the L1/L2.
The lines in the write set must become visible atomically, but as presented it sounded like Kanter felt like a significant portion of the thousands of lines in the L2 could be in play for a transaction.
There could be a way to flash clear the listings, perhaps storing the bits in a compact table or using some kind of nested list next to the cache controller and core instead of across the L2.
If there is a sequential update process, there might be a period of inconsistency, where some lines in the transaction become visible externally while others are still pending. Another thread's accesses might straddle that update line, which may require a check by the cache controller and possibly an additional stall in processing the snoop.
I'd be curious about the proposed use of per-line bits on each line in the L1/L2.
As I understand it, it is used for hardware lock elision. Every cache line accessed within an HLE transaction gets a bit set. The L3 holds a copy of all cache lines in L1/L2. If a different processor tries to access a cache line, the L2 bits are checked for that line; if a bit is set, the transaction is aborted and the data is reloaded from L3 into the L1/L2 and sent to the foreign processor. The critical section is then retried with strict synchronization.
This allows a processor to speculatively perform a critical section.
Cheers
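A toy model of the scheme described in this post, as a sketch only. The class name, the 64-byte line size and the dictionary standing in for memory are all assumptions for illustration. Following the description above, any foreign access to a tracked line aborts the section; actual hardware may well treat foreign reads and writes differently.

```python
LINE_BYTES = 64  # assumed cache line size

class SpeculativeSection:
    """Toy model of one core running a critical section speculatively."""

    def __init__(self):
        self.tracked_lines = set()   # lines whose per-line bit is set
        self.pending_writes = {}     # buffered stores, invisible to others
        self.aborted = False

    def read(self, memory, addr):
        self.tracked_lines.add(addr // LINE_BYTES)
        # Our own buffered store wins over what is in memory.
        return self.pending_writes.get(addr, memory.get(addr, 0))

    def write(self, addr, value):
        self.tracked_lines.add(addr // LINE_BYTES)
        self.pending_writes[addr] = value

    def snoop(self, addr):
        """Another core touched `addr`; abort if it hits a tracked line."""
        if addr // LINE_BYTES in self.tracked_lines:
            self.aborted = True
            self.pending_writes.clear()   # discard the speculative state
            self.tracked_lines.clear()

    def commit(self, memory):
        """Publish every buffered store at once, or nothing if aborted."""
        if self.aborted:
            return False
        memory.update(self.pending_writes)
        self.pending_writes.clear()
        self.tracked_lines.clear()        # all bits dropped together
        return True
```

A caller would run the critical section through `read`/`write`, call `commit`, and fall back to taking the real lock whenever `commit` returns `False`.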
3dilettante
15-Feb-2012, 19:44
As I understand it, it is used for hardware lock elision.
They're used for HTM and elision. Elision is overlaid on top of the HTM functionality.
My question concerns where he expects those bits to be located physically, and how they are updated atomically. The worst-case theoretical maximum is 5k+ lines being updated as being visible upon the completion of a speculative section. (It will probably give waaay before that.)
Per Intel's description, all modified lines in a transaction become visible to other readers atomically, and that can be violated if those bits were actually physically on every line and the core can't set them to be visible within a single cycle.
I would expect the bits to be located with the L2 cache tags. The L1s aren't store-through, but the core still needs to update usage heuristics for the LRU algorithm used in L2 replacement so that a hot L1 line isn't evicted. When the core stores usage data, it can piggyback the HLE bit.
And yeah, they would need some special hardware to clear all bits in one cycle.
Cheers
tunafish
24-Feb-2012, 13:44
My question concerns where he expects those bits to be located physically, and how they are updated atomically. The worst-case theoretical maximum is 5k+ lines being updated as being visible upon the completion of a speculative section. (It will probably give waaay before that.)
Per Intel's description, all modified lines in a transaction become visible to other readers atomically, and that can be violated if those bits were actually physically on every line and the core can't set them to be visible within a single cycle.
I would expect the bits to be located with the L2 cache tags. The L1s aren't store-through, but the core still needs to update usage heuristics for the LRU algorithm used in L2 replacement so that a hot L1 line isn't evicted. When the core stores usage data, it can piggyback the HLE bit.
Cache tags are the smart place to put them, because with transactions nearly every operation that touches the tags will also need to at least look at the transaction bits. Although I'd fully expect that there will be txn bits in both L1 and L2 tags, and they will just be transported intact to L2 whenever data is evicted there. (so L2 tags don't need any extra space and complexity to handle the ways of the L1 cache).
And yeah, they would need some special hardware to clear all bits in one cycle.
The big point is that this hardware really isn't that expensive. To turn an operation from line select to flash broadcast needs one extra signal line and one extra gate per signal junction. Instead of each junction choosing "if a, then propagate to A, if not a then propagate to B", it goes to "if a or t then propagate to A, if (not a) or t then propagate to B".
And the only broadcast operations needed are commit (clear txn bits), and rollback (mark line as invalid).
So yes, it's entirely feasible for the entire cache to be used in a single transaction.
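The junction rule quoted above can be written out as a small simulation. This is just a sketch: the tree shape, the function name and the idea of numbering lines by their address bits are my own assumptions, not anything Intel has described.

```python
def reached_lines(address_bits, broadcast):
    """Return the indices of the cache lines that the signal reaches.

    address_bits: one boolean per tree level, most significant bit first
                  (the `a` seen by the junctions at that level).
    broadcast:    the extra signal `t`; when set, every junction drives
                  both of its outputs regardless of `a`.
    """
    reached = [0]  # start at the root with an empty index prefix
    for a in address_bits:
        next_level = []
        for prefix in reached:
            if (not a) or broadcast:      # "propagate to B" (bit = 0 side)
                next_level.append(prefix * 2)
            if a or broadcast:            # "propagate to A" (bit = 1 side)
                next_level.append(prefix * 2 + 1)
        reached = next_level
    return reached

print(reached_lines([True, False, True], broadcast=False))  # a single line
print(reached_lines([True, False, True], broadcast=True))   # every line
```

With `broadcast` off the tree behaves as a plain line select; with it on, the same wiring reaches all lines in one pass, which is the property a commit or rollback would need.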
More talk now about Haswell being a multi-chip module in its higher incarnations, featuring an external L4 cache die entirely dedicated for graphics use. The schematic imagery shown over at VR-Zone (http://vr-zone.com/articles/mystery-solved--haswell-expected-to-up-the-graphics-ante-further-again/15272.html) suggests that the Haswell chip will feature a more square-ish overall shape rather than the increasingly rectangular appearance that has been the case since the Core i-series first launched.
Also, going by this image, the GPU portion now appears to occupy the majority of the die, and be more than twice the size of previous generations. Of course, this might not be an accurate representation of the chip and its layout, but speculation is fun, isn't it? After all, it's what drives every internet tech forum...! ;)
3dilettante
19-Mar-2012, 19:01
Are their schematics sourced from anywhere, or are those diagrams just a best guess?
No idea, if the article doesn't say. Presumably leaked stuff, considering the diagrams are fairly detailed, but it could of course be the child of an active imagination in some random Adobe Illustrator user on the internet. :)
More talk now about Haswell being a multi-chip module in its higher incarnations, featuring an external L4 cache die entirely dedicated for graphics use.
I wonder how crazy it would have been to implement it as an on-chip NUMA instead of a plain L4 cache just for the GPU. (decent :-P) OSes have had NUMA support for quite a while; I guess it would have kind of worked with that kind of setup as well.
Probably not enough storage to have been of any use as NUMA RAM. If that die is almost all DRAM, how much could it possibly store, surely not more than say, 128MB or so. That's a piss in the ocean today in terms of RAM.
Even if it's 512MB it's nothing major to speak of really.
Also, how would you get the OS to properly manage this low-latency RAM and not waste it on some useless random resident DLL crap or whatever that isn't used very much? There's no special support for such memory devices to my knowledge (because they don't actually exist yet... ;))
Also, how would you get the OS to properly manage this low-latency RAM and not waste it on some useless random resident DLL crap or whatever that isn't used very much? There's no special support for such memory devices to my knowledge (because they don't actually exist yet... ;))
How would it be any different from any random implementation of NUMA? I thought when the OS sees some app (or parts of RAM) getting used more, it simply pulls that closer to the CPU. Though I admit I know really little about how NUMA works in reality, so I may have unrealistic expectations :)
Well, how would it know this RAM is closer...? Besides, from what I understand no desktop OS can optimize RAM during runtime; once something's loaded somewhere it pretty much stays there.
IDF live blog from AnandTech (http://www.anandtech.com/show/6263/intel-haswell-architecture-disclosure-live-blog)
Good Lord!
Haswell is indeed taking full advantage of the 22nm tech.
Looks like the main pipeline stays the same and there are a lot of quantitative improvements to certain structures, as well as tons of un-core changes. Some of the more significant changes so far for the CPU side:
uArch:
- Extra integer & extra store port
- AVX2
- TSX
- L1 latency & conflict improvements
- Beefed up cache bandwidth L1 & L2
power:
- Active idle state
No news on clockspeeds yet.
Good Lord!
Haswell is indeed taking full advantage of the 22nm tech.
Indeed. After AMD's presentation of the Steamroller cores, some here hinted at mild improvements vs Ivy Bridge. I may not get everything, but the thing looks like a monster.
They already had a huge advantage in cache bandwidth, and they are doubling that advantage.
They are doubling the FP throughput vs Ivy Bridge: 32/16 SP/DP FLOPs per cycle per core.
They are improving single-thread performance: bigger reorder buffer, improved branch prediction, and better power management features that should allow the turbo to kick in more often.
They are widening the architecture; it can now execute 8 micro-ops on the fly, up from 6 in previous architectures.
They significantly improved pretty much everything. It looks to me like a freaking monster. If Intel doesn't cut features and artificially lower the performance of their "low end" parts, the next Core i3 might be all you need as a gamer, and more.
They made significant improvements to the GPU, the video decode hardware, and the video encode hardware.
All of this while lowering power consumption; it's nothing short of amazing.
Then there is compute. Their GPUs were doing well already; they added extra muscle for both the GPU and the CPU, and now the GPU in compute operation has 0.5TB/s of bandwidth to the last level of cache. The thing could literally fly.
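For anyone wondering where the 32/16 SP/DP figure in this post comes from, here is the arithmetic as a sketch. It assumes two 256-bit FMA-capable ports per core for Haswell (ports 0 and 1, as discussed further down the thread) and one 256-bit multiply unit plus one 256-bit add unit for Ivy Bridge. The function names are made up for the example.

```python
def peak_flops_with_fma(vector_bits, element_bits, fma_units=2):
    """Peak FLOPs per cycle when each unit retires one FMA per lane."""
    lanes = vector_bits // element_bits
    return lanes * 2 * fma_units      # an FMA counts as multiply + add

def peak_flops_without_fma(vector_bits, element_bits, units=2):
    """Peak FLOPs per cycle with one multiply unit and one add unit."""
    lanes = vector_bits // element_bits
    return lanes * units              # each unit does one operation per lane

# (single precision, double precision) per cycle per core
haswell = (peak_flops_with_fma(256, 32), peak_flops_with_fma(256, 64))
ivy_bridge = (peak_flops_without_fma(256, 32), peak_flops_without_fma(256, 64))
print(haswell, ivy_bridge)
```

These are peak numbers only; they say nothing about whether the caches and front end can actually keep both units fed.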
GPU in compute operation has 0.5TB/s of bandwidth to the last level of cache. The thing could literally fly.
Yeah, that's really good.
Total LLC in current high end i7 models is 8 MB (Haswell should be similar in this regard). That's quite a bit more shared cache for the GPU compared to GCN for example (768KB shared L2 in 7970, 512KB in models with 256 bit memory bus). This is excellent news for some GPU compute algorithms. I wonder if they can also use the 8 MB cache as a render target (like we Xbox 360 programmers use the EDRAM). 8 MB is not enough for a whole 1080p RT, but it would be really good for shadow map rendering for example (depth buffer only, and lots of overdraw). 512x512 shadow map is 1 MB, 1024x1024 is 4 MB (both would fit nicely to LLC).
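The surface sizes above work out as follows, assuming 4 bytes per pixel (a 32-bit depth or color format; that is my assumption, the post does not state the format). The 1080p case is shown both as color only and as color plus depth.

```python
MIB = 1024 * 1024

def surface_bytes(width, height, bytes_per_pixel=4):
    """Size of one uncompressed surface, ignoring padding and tiling."""
    return width * height * bytes_per_pixel

shadow_512 = surface_bytes(512, 512) / MIB          # 1.0
shadow_1024 = surface_bytes(1024, 1024) / MIB       # 4.0
color_1080p = surface_bytes(1920, 1080) / MIB       # roughly 7.9
full_1080p = 2 * surface_bytes(1920, 1080) / MIB    # color + depth, roughly 15.8

for name, size in [("512x512 shadow map", shadow_512),
                   ("1024x1024 shadow map", shadow_1024),
                   ("1080p color only", color_1080p),
                   ("1080p color + depth", full_1080p)]:
    print(f"{name}: {size:.2f} MiB")
```

So under these assumptions a 1080p color buffer alone would nearly fill an 8 MB LLC, and adding a depth buffer puts it well over, while the shadow map sizes leave room to spare.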
Transactional memory seems to be really interesting as well :)
Bouncing Zabaglione Bros.
11-Sep-2012, 21:07
Are Intel really telling us all this stuff eight months before launch? That seems a bit bonkers, or is there the possibility that they might surprise people with an early launch?
It's not like Intel are up against anything substantial from AMD.
3dilettante
11-Sep-2012, 21:14
Intel's improved fetch bandwidth and widened the back end to handle two branches per cycle.
It doesn't mention predicting two branches per cycle, though.
It's kind of interesting to see a core that could for some reason generate 3 store addresses a cycle. Perhaps the extra port is to keep the store address out of the way of the load calculations as much as possible.
The L1 is much better in terms of bandwidth, and it sounds like they truly dual-ported it given the claimed elimination of bank conflicts. This was a noted constraint with SB and its competition.
Transactional memory is handled by the L1, which sidesteps my concern about earlier speculation of enlisting the larger and more distant L2.
No mention of how gather is handled, though.
There's a lot of stuff done with an eye towards comprehensive power management and SOC-level control, including further empowering the system agent to trick the OS as to the true status of the cores and their thread activity.
Exophase
11-Sep-2012, 21:40
Are Intel really telling us all this stuff eight months before launch? That seems a bit bonkers, or is there the possibility that they might surprise people with an early launch?
It's not like Intel are up against anything substantial from AMD.
Par for the course for Intel to release new uarch information at IDF just like they always have. When else would they?
No mention of how gather is handled, though.
Hopefully they spill the beans soon. I just hope that you don't have to code a loop for it (like you must for Knights Corner). The worst case is that it's just a long microcoded sequence, but that wouldn't make much sense. I am keeping my pessimistic view until Intel proves me otherwise. Efficient gather is almost too good to be true :)
... Intel has added two extra ports, but none of them does load related things. And "no changes to key pipelines" either. No mention about other load related improvements either. So my conclusion is that gather likely takes several cycles to complete (even without cache misses).
Looks like Intel's Oregon team is responsible for building around the basic pipeline flow foundation that its Haifa team lays out. This has been true for several consecutive pairs of architectural "tocks" since they started following that execution plan: Conroe->Nehalem, and SandyBridge->Haswell. I'm expecting Skylake to be the next major retooling of the pipeline by the Israel team.
And "no changes to key pipelines" either.
I noticed that as well. What exactly does it mean?
It's kind of interesting to see a core that could for some reason generate 3 store addresses a cycle. Perhaps the extra port is to keep the store address out of the way of the load calculations as much as possible.
Maybe a copy&paste error from SNB/IVB? Since there's now a dedicated store AGU, and only one store data port, it seems like there would be no reason at all to use a shared load/store AGU for calculating the store address. In some way that would be more like Nehalem/Westmere, which also had separate load and store AGUs (but of course just one of each).
But OMG this thing is a beast. AMD thought there's not much point of having a 3rd INT ALU and intel now has 4...
I wonder about the front-end though, no mention of any improvements there. Is that really good enough to feed that monster back-end?
PDFs for presentations today:
https://intel.activeevents.com/sf12/scheduler/catalog.do
click the top link if you're like me and not registered. Do a search for "Haswell." The presentations ARCS001 and SPCS001 are worth a look.
ARCS001 on page 12 states that the extra AGU for store alleviates pressure on ports 2 & 3 for loads. I guess that's something they identified from their simulations as a bottleneck, maybe for hyperthreading? They've also directly addressed many bottlenecks that Agner Fog identified on page 174 of his guide (http://www.agner.org/optimize/microarchitecture.pdf).
Hmm interesting that both port 0 and port 1 can do FMA and port 1 now can do fp mul too but port 0 can't do fp add. Any ideas why that would be?
Maybe in common legacy workloads, FP Adds coincide w/ branches, shifts, and divides.
Edit: I guess more importantly FP Adds coincide w/ FP Mul. The whole point of FMAC is to increase efficiency on that particular combination of instructions, and you still need to take into account legacy code.
No word on the rumored on-package graphics S/DRAM module?
tunafish
12-Sep-2012, 02:33
Well, how would it know this RAM is closer...?
Presumably, the BIOS would give hints that good operating systems would record and make use of.
Besides, from what I understand no desktop OS can optimize RAM during runtime; once something's loaded somewhere it pretty much stays there.
This is incorrect. All software uses only virtual addresses; the OS is free to relocate the physical pages anywhere it wants. The oldest example is of course swapping to disk, but modern Linux can do things like migrating memory to a closer NUMA node, or migrating pages to merge many small pages into a few 2MB ones. AFAIK, Windows does nothing of the sort.
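A toy illustration of the point about virtual addresses, purely as a sketch: the class, the node/frame numbering and the single-level page table are all invented for the example and bear no relation to how any real kernel lays things out.

```python
PAGE_SIZE = 4096

class ToyAddressSpace:
    """Hypothetical model: one page table plus physical frames per NUMA node."""

    def __init__(self):
        self.page_table = {}   # virtual page number -> (node, frame)
        self.frames = {}       # (node, frame) -> page contents

    def map_page(self, vpn, node, frame):
        self.page_table[vpn] = (node, frame)
        self.frames[(node, frame)] = bytearray(PAGE_SIZE)

    def _locate(self, vaddr):
        # Translation happens on every access, not once at load time.
        return self.page_table[vaddr // PAGE_SIZE], vaddr % PAGE_SIZE

    def store(self, vaddr, byte):
        location, offset = self._locate(vaddr)
        self.frames[location][offset] = byte

    def load(self, vaddr):
        location, offset = self._locate(vaddr)
        return self.frames[location][offset]

    def migrate(self, vpn, new_node, new_frame):
        """Copy the page to a new frame, then repoint the mapping."""
        old = self.page_table[vpn]
        self.frames[(new_node, new_frame)] = bytearray(self.frames.pop(old))
        self.page_table[vpn] = (new_node, new_frame)
```

The program only ever holds the virtual address, so after `migrate` the same `load` call returns the same data even though it now lives in a different frame on a different node.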
rpg.314
12-Sep-2012, 03:31
Well, how would it know this RAM is closer...? Besides, from what I understand no desktop OS can optimize RAM during runtime; once something's loaded somewhere it pretty much stays there.
Just the way the driver knows GPU memory is closer to the GPU. The driver will allocate render targets in that memory and, when it's full, kick them back to CPU RAM.
3dilettante
12-Sep-2012, 04:12
Hopefully they spill the beans soon. I just hope that you don't have to code a loop for it (like you must for Knights Corner). The worst case is that it's just a long microcoded sequence, but that wouldn't make much sense. I am keeping my pessimistic view until Intel proves me otherwise. Efficient gather is almost too good to be true :)
... Intel has added two extra ports, but none of them does load related things. And "no changes to key pipelines" either. No mention about other load related improvements either. So my conclusion is that gather likely takes several cycles to complete (even without cache misses).
Port 6 and 7 provide integer capability and branching, while also keeping the vector pipes unencumbered.
An internal gather loop could utilize the extra integer operand access and branch capability of the extra ports without simultaneously blocking the vector pipes that would make use of a gather instruction. The store AGU all by itself seems unbalanced, unless it's sharing that port with something they've chosen not to discuss yet, maybe the specialized hardware that would scan a gather index register and detect how many belong to the same cache line.
Port 5 has vector shuffles, which might include that permute unit that both gather and vector work would like. It's not zero-sum because a gather would provide data from memory in the desired arrangement.
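What "scan a gather index register and detect how many belong to the same cache line" could look like in software terms, as a rough sketch. The 64-byte line size, the 4-byte elements and the function name are assumptions, and none of this is a claim about what the hardware actually does.

```python
from collections import defaultdict

LINE_BYTES = 64  # assumed cache line size

def lines_touched(base, indices, element_bytes=4):
    """Group gather lanes by the cache line their element falls in."""
    groups = defaultdict(list)
    for lane, index in enumerate(indices):
        address = base + index * element_bytes
        groups[address // LINE_BYTES].append(lane)
    return dict(groups)

# Eight lanes of a 256-bit gather with 32-bit elements.
groups = lines_touched(base=0, indices=[0, 1, 2, 3, 16, 17, 100, 101])
print(len(groups), "distinct cache lines for 8 lanes")
```

If the indices cluster like this, a gather needs only as many cache accesses as there are distinct lines, which is where any speedup over a plain per-element loop would come from.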
The data given so far makes Haswell sound much more interesting than Steamroller, although there's always the chance that more is to come on the latter's account. The breadth of the engineering effort for this architecture visually dwarfs the competition. The promotion of integer vector instructions to 256-bit is going to put some serious hurt on one of the few areas BD was not outclassed in.
rpg.314
12-Sep-2012, 04:16
Hopefully they spill the beans soon. I just hope that you don't have to code a loop for it (like you must for Knights Corner). The worst case is that it's just a long microcoded sequence, but that wouldn't make much sense. I am keeping my pessimistic view until Intel proves me otherwise. Efficient gather is almost too good to be true :)
... Intel has added two extra ports, but none of them does load related things. And "no changes to key pipelines" either. No mention about other load related improvements either. So my conclusion is that gather likely takes several cycles to complete (even without cache misses).
A microcoded sequence could still be faster.
This is incorrect. All software uses only virtual addresses; the OS is free to relocate the physical pages anywhere it wants.
It was my understanding that these virtual addresses were transformed into actual addresses once the program code was loaded somewhere into RAM by the operating system and would thus be unable to move, but if that's not true then it's pretty cool.
modern Linux can do things like migrating memory to a closer NUMA node, or migrating pages to merge many small pages into a few 2MB ones.
That sounds extremely useful actually. Considering the number of people experiencing crashes in certain misbehaved software like Bethesda's Skyrim due to memory fragmentation, it would have been nice if Windows had been able to do this as well. Any idea why Microsoft hasn't bothered to pursue this path? Perhaps they're too content with their current market dominance... *shrug*
But OMG this thing is a beast. AMD thought there's not much point of having a 3rd INT ALU and intel now has 4...
Intel has hyperthreading, so they can feed the ALUs from two instruction streams. Four ALUs is overkill for ILP alone, but add the TLP from two threads to the mix, and the situation becomes very different. As long as other parts of the chip are not a bottleneck, hyperthreading (two threads on a single core) should have performance closer to two separate (2 ALU) cores (in ALU tasks). It's not looking good for AMD.
A microcoded sequence could still be faster.
It would likely be at least slightly faster. If nothing else is improved, at least gather takes only one slot in (x86) L1 instruction cache (but several slots in uop cache), and they can choose the optimal uop sequence for the processor (x86 compilers are too general purpose for this task). But that's a bit pessimistic view, I must admit. Maybe I have spent too much time evading stuff like microcoded imul and sraw (variable shifts) in console programming :)
Port 6 and 7 provide integer capability and branching, while also keeping the vector pipes unencumbered.
An internal gather loop could utilize the extra integer operand access and branch capability of the extra ports without simultaneously blocking the vector pipes that would make use of a gather instruction. The store AGU all by itself seems unbalanced, unless it's sharing that port with something they've chosen not to discuss yet, maybe the specialized hardware that would scan a gather index register and detect how many belong to the same cache line.
Port 5 has vector shuffles, which might include that permute unit that both gather and vector work would like. It's not zero-sum because a gather would provide data from memory in the desired arrangement.
Good points. That would (also) be a good use for the extra ALU/branch ports. If your algorithm is vector math heavy (almost no ALU ops), and doesn't include too many gathers, the CPU should be able to mask out ("co-issue") all the microcoded ALU ops from the gather. But for algorithms that already have interleaved ALU and vector ops, this technique would make the ALU a bottleneck. It would also prevent one thread (HT) from running ALU-heavy code while the other runs vector-heavy code... but of course the current ways of doing gather manually are even worse. And the fourth ALU helps in both cases.
The breadth of the engineering effort for this architecture visually dwarfs the competition. The promotion of integer vector instructions to 256-bit is going to put some serious hurt on one of the few areas BD was not outclassed in.
The fourth ALU should improve integer performance as well (especially with hyperthreading). It seems that they have made some very good architectural choices that fit together very well. Previously (a year ago) I thought that gather would be one of the key new features of this architecture, but Haswell has so much more than that to offer. I can't wait to do some performance tests with transactional memory. Assuming it's fully L1 based, the transaction cannot access more than 32KB of memory (minus hyperthreading, minus cache aliasing = around 10KB to be sure). But that's more than enough for games, as game access patterns are usually cache line optimized, and limited in scope. Enterprise software however might need more than Haswell L1 has to offer for their transactions.
Hopefully they spill the beans soon. I just hope that you don't have to code a loop for it (like you must for Knights Corner). The worst case is that it's just a long microcoded sequence, but that wouldn't make much sense. I am keeping my pessimistic view until Intel proves me otherwise. Efficient gather is almost too good to be true :)
It adds a lot of complexity to support single instruction gather. A gather instruction could generate a multitude of addresses that all cause a MMU page-walk, the accesses themselves would require a multi-ported cache to be efficient, each access potentially causing a full cache miss.
Worst case, a single gather instruction could take several thousand cycles to complete. So you either make it interruptible or suffer intolerable interrupt latency. The former means you need to save partial state of registers (*ugh*); the latter is just not acceptable.
Mind you, a fair fraction of accesses are likely to miss caches and with 4 to 8 cores on a chip it'll be relatively easy to saturate the main memory interface anyway.
So you end up spending a lot of complexity and power on something that might not add a whole lot of performance in the end.
Cheers
Considering the number of people experiencing crashes in certain misbehaved software like Bethesda's Skyrim due to memory fragmentation, it would have been nice if Windows had been able to do this as well. Any idea why Microsoft hasn't bothered to pursue this path?
Skyrim is a 32 bit executable, and has 32 bit pointers. It runs out of virtual address space, so no reordering can help it. 64 bit pointers allow a 64 bit virtual address space. With 64 bit pointers you pretty much never run out of virtual address space.
It runs out of virtual address space, so no reordering can help it.
Yeah, I know it's 32-bit, but if windows supported reordering maybe a large enough chunk of continuous memory could be presented to the game.
Yeah, I know it's 32-bit, but if windows supported reordering maybe a large enough chunk of continuous memory could be presented to the game.
No operating system can reorder your software's own virtual address space. It can only reorder the physical data in memory, and update virtual address tables accordingly. If you have only 32 bit pointers in your game and you do a lot of dynamic memory allocation, you will eventually run out of continuous memory blocks (in the 32 bit virtual memory address space), and there's nothing an OS can do to help you.
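To illustrate the point with made-up numbers: the sketch below models a 32-bit process's virtual address space as a list of (start, size) allocations and compares the total free space with the largest contiguous free run. The 2 GiB user-space size and the allocation layout are assumptions chosen for illustration, not measurements from Skyrim or any real program.

```python
def free_space_summary(address_space_size, allocations):
    """Return (total_free, largest_contiguous_free) in bytes.

    allocations is a list of non-overlapping (start, size) pairs.
    """
    total_free = address_space_size
    largest = 0
    cursor = 0  # end of the previous allocation
    for start, size in sorted(allocations):
        largest = max(largest, start - cursor)  # gap before this allocation
        total_free -= size
        cursor = start + size
    largest = max(largest, address_space_size - cursor)  # gap after the last one
    return total_free, largest


MiB = 1 << 20
SPACE = 2048 * MiB  # assumed user-mode address space of a 32-bit process

# Hypothetical layout: one long-lived 64 MiB allocation every 256 MiB.
allocations = [(i * 256 * MiB, 64 * MiB) for i in range(8)]

total_free, largest = free_space_summary(SPACE, allocations)
request = 256 * MiB
print(f"free: {total_free // MiB} MiB, largest run: {largest // MiB} MiB")
print("256 MiB request fits:", largest >= request)
```

With this layout three quarters of the address space is free, yet a single 256 MiB request cannot be satisfied. Moving the physical pages around changes nothing here, because the gaps are in the virtual addresses the program itself holds.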
tunafish
12-Sep-2012, 11:51
It was my understanding that these virtual addresses were transformed into actual addresses once the program code was loaded somewhere into RAM by the operating system and would thus be unable to move, but if that's not true then it's pretty cool.
No, that's what page tables and TLBs are for. Basically, when you issue a load, the first thing that happens is that the CPU looks for the address you gave it from the TLB. If found, it takes the physical address stored in the TLB, and uses it instead. If not found, it fires up the page walker, and walks the page tables (an in-memory data structure) to find the correct physical address (and stores it in the TLB). If still not found, it interrupts into the OS and lets it handle it.
So the address translation is entirely dynamic and run-time. It's how processes are separated on multi-tasking operating systems -- your address 0x4000 can point to something completely different than my 0x4000, and the privileged operating system structures are not found in either of our address spaces.
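A toy model of that lookup order (TLB first, then a page walk, then a fault into the OS), assuming a single-level page table and 4k pages. The class and method names are made up for illustration; real x86 page tables are multi-level and the TLB is a hardware structure, so this is only a sketch of the idea.

```python
PAGE_SIZE = 4096  # 4k pages, the smallest x86 granularity


class PageFault(Exception):
    """No mapping exists; on real hardware the OS would take over here."""


class ToyMMU:
    def __init__(self, page_table):
        self.page_table = dict(page_table)  # virtual page number -> physical frame number
        self.tlb = {}                       # cache of recently used translations
        self.tlb_hits = 0
        self.page_walks = 0

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.tlb:
            self.tlb_hits += 1
        else:
            self.page_walks += 1            # TLB miss: consult the page table
            if vpn not in self.page_table:
                raise PageFault(hex(vaddr))
            self.tlb[vpn] = self.page_table[vpn]
        return self.tlb[vpn] * PAGE_SIZE + offset

    def migrate(self, vpn, new_frame):
        """The OS moves a physical page: update the table, drop the stale TLB entry."""
        self.page_table[vpn] = new_frame
        self.tlb.pop(vpn, None)


# Two processes with their own page tables: the same virtual address
# 0x4000 lands in different physical frames.
mine = ToyMMU({4: 100})
yours = ToyMMU({4: 200})
print(hex(mine.translate(0x4000)), hex(yours.translate(0x4000)))

# The OS relocates my page; my virtual address stays the same.
mine.migrate(4, 300)
print(hex(mine.translate(0x4000)))
```

The program only ever sees 0x4000; which frame backs it can change between any two accesses.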
You can actually do all kinds of neat things with page tables. For example, Azul systems uses it for unblocking GC. Basically, it gives you a cheap (ish) hook you can invoke on any memory access to a given page (on x86, that's 4k/2M/1G granularity).
That sounds extremely useful actually. Considering the number of people experiencing crashes in certain misbehaved software like Bethesda's Skyrim due to memory fragmentation
This wouldn't actually help. The things that get fragmented are the 32-bit virtual addresses -- the physical pieces of ram can be moved about at will, but the 32-bit addresses cannot change, simply because the OS would then have to fix up every address in the program, and it cannot know what is an address and what is an unfortunately chosen integer.
it would have been nice if Windows had been able to do this as well. Any idea why Microsoft hasn't bothered to pursue this path? Perhaps they're too content with their current market dominance... *shrug*
They just can't keep up. In the internals, modern Linux is now about a decade ahead of Win8, and the difference is growing, not decreasing.
Exophase
12-Sep-2012, 16:31
Intel's improved fetch bandwidth and widened the back end to handle two branches per cycle.
It doesn't mention predicting two branches per cycle, though.
Unless they decoupled the branch predictor from the rest of the frontend like AMD did I don't think they'd really even be able to predict multiple taken branches in one cycle. One block is loaded from fetch/uop cache and for that you can only make use of one BTB hit. No later instructions in the block would apply.
You could benefit from being able to predict multiple untaken branches in a block (up to the end or first taken branch). It may already do this. I know the BTB supports up to 4 branches per fetch block in SB; the prediction resolution before lookup may be capable of predicting all four in parallel.
Hmm interesting that both port 0 and port 1 can do FMA and port 1 now can do fp mul too but port 0 can't do fp add. Any ideas why that would be?
My guess would be this: on SB/IB, FADD and FMUL latency is only 3 cycles but on Haswell FMA latency is 5 cycles which is substantially higher. David Kanter has remarked that Intel engineers found Bulldozer's 5-6 cycle FMA latency to be a weakness, so I don't think they'd be happy with 5 cycles for FADD and FMUL. So I'm guessing they did what they could to bypass the FMA unit to reduce latency for FADD and FMUL: you can get a multiply result early and start an add late. And for the early multiply result the rest of the FMA is a don't care, if it runs at all, but for the early add you have to feed it a 0 to start with. So it may be that a fast FADD is more complex to support than a fast FMUL and therefore they only have one.
It would likely be at least slightly faster. If nothing else is improved, at least gather takes only one slot in (x86) L1 instruction cache (but several slots in uop cache), and they can choose the optimal uop sequence for the processor (x86 compilers are too general purpose for this task).
In SB/IB the uop cache doesn't store more than the first few uops from a microcode sequence. Were you thinking that Haswell would expand entire microcode routines into the uop cache? I'm not sure they'd do this because it'd complicate the mapping between uop cache and L1 instruction cache and it'd also open up the potential for uop cache thrashing with lots of microcode instructions which would all have to be inlined into the cache to get proper performance.
Without such a mechanism Haswell would need to have much faster microcode ROM throughput to maintain a fast microcoded gather. Historically it has only been one uop per cycle, during which the decoders can't provide anything. This might be enough for the gather itself (depending on what microcode is available), but it stalls everything else. It's hard to imagine Intel investing in either the ability to dispatch from both the microcode and uop cache/decoders simultaneously or a wide microcode ROM that can feed several uops per cycle, but I really wouldn't know what they do and don't find practical here.
Barring that I'd expect the gather to be done by an independent hardware state machine, regardless of whether or not it can service multiple loads per cycle. Even if it's stuck at one load per cycle it'll still be a lot better than the current alternative.
It adds a lot of complexity to support single instruction gather. A gather instruction could generate a multitude of addresses that all cause a MMU page-walk, the accesses themselves would require a multi-ported cache to be efficient, each access potentially causing a full cache miss.
Worst case, a single gather instruction could take several thousand cycles to complete. So you either make it interruptible or suffer intolerable interrupt latency. The former means you need to save partial state of registers (*ugh*); the latter is just not acceptable.
Mind you, a fair fraction of accesses are likely to miss caches and with 4 to 8 cores on a chip it'll be relatively easy to saturate the main memory interface anyway.
So you end up spending a lot of complexity and power on something that might not add a whole lot of performance in the end.
Cheers
It's not pretty, but the gather could be replayed in its entirety upon any fault whatsoever, and if you so desire, after an interrupt. All that's required is that the cache and TLB have at least as many ways as there are elements in the vector, because otherwise you could get an infinite loop where the later fields keep evicting the former ones. This shouldn't be a problem for Haswell. In the normal case, the cost of a cache miss is big compared to the cost of redoing the earlier loads which are now in cache. An interrupt can evict the stuff that was gathered from the cache, but that's not that much worse than it evicting any of the rest of the program's working set (not to mention, having to save/restore registers). And interrupts aren't really frequent enough for this to be a concern.
This may be why AVX2 has no scatter instruction. Replaying a half-done scatter has more consequences.
No one would expect a gather instruction that's single cycle if everything's in cache. The reasonable highest end expectation is a gather that can load multiple elements if they're all in the same cache line, like Larrabee can. Tons of code would benefit from this. A lot of useful gathers can even have multiple fields going to the exact same address.
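A rough way to see how much a "same cache line" optimization could save is to count the distinct lines a gather touches. The sketch below assumes 64-byte lines and naturally aligned elements (so no element straddles two lines); the function name and the example index vectors are made up for illustration.

```python
CACHE_LINE_BYTES = 64  # assumed line size


def cache_lines_touched(base, indices, scale):
    """Return the sorted cache-line numbers read by a gather of base + index * scale."""
    return sorted({(base + i * scale) // CACHE_LINE_BYTES for i in indices})


# Eight 4-byte elements per gather, three hypothetical access patterns.
neighbours = [0, 1, 2, 3, 4, 5, 6, 7]          # adjacent elements
strided = [0, 16, 32, 48, 64, 80, 96, 112]     # one element per line
broadcast = [5, 5, 5, 5, 5, 5, 5, 5]           # every lane reads the same address

for name, idx in [("neighbours", neighbours), ("strided", strided), ("broadcast", broadcast)]:
    print(name, len(cache_lines_touched(0, idx, 4)))
```

In the first and last patterns a gather that services a whole line per access would need one cache access instead of eight; in the strided pattern it gains nothing.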
Sebbbi, Tuna; thanks. Your posts are very educational, as always. :)
My guess would be this: on SB/IB, FADD and FMUL latency is only 3 cycles but on Haswell FMA latency is 5 cycles which is substantially higher. David Kanter has remarked that Intel engineers found Bulldozer's 5-6 cycle FMA latency to be a weakness, so I don't think they'd be happy with 5 cycles for FADD and FMUL. So I'm guessing they did what they could to bypass the FMA unit to reduce latency for FADD and FMUL: you can get a multiply result early and start an add late. And for the early multiply result the rest of the FMA is a don't care, if it runs at all, but for the early add you have to feed it a 0 to start with. So it may be that a fast FADD is more complex to support than a fast FMUL and therefore they only have one.
Hmm that makes sense, though you're wrong about the latencies. Only fadd is 3 cycles on snb/ivb/hsw, fmul is 5 cycles, same as fma. So maybe fadd indeed has some special path to get latency down to 3 whereas for the fmul it can just use ordinary fma path. This is indeed different to amd which had same latency for fmul and fadd (and now fma) for ages (K8/K10 had latency 4, BD latency 5-6).
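Plugging those latencies into a dependent chain shows why a short FADD path matters. This is only back-of-the-envelope arithmetic using the cycle counts quoted in this post (fadd 3, fmul 5, fma 5); it ignores throughput, port contention and everything else a real scheduler does.

```python
# Latencies in cycles as quoted above for snb/ivb/hsw.
LATENCY = {"fadd": 3, "fmul": 5, "fma": 5}


def chain_latency(ops):
    """Latency of a chain where each op waits for the previous result."""
    return sum(LATENCY[op] for op in ops)


# a * b + c: separate multiply and add vs. one fused op.
separate = chain_latency(["fmul", "fadd"])
fused = chain_latency(["fma"])

# A running sum of four dependent adds: dedicated adder vs. pushing
# every add through the FMA pipe (as x * 1.0 + y).
sum_on_fadd = chain_latency(["fadd"] * 4)
sum_on_fma = chain_latency(["fma"] * 4)

print(separate, fused, sum_on_fadd, sum_on_fma)
```

FMA helps the multiply-add case, but a dependent chain of plain adds would get noticeably slower if every add had to go through the 5-cycle FMA path, which fits the guess that FADD keeps a shorter path of its own.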
Exophase
12-Sep-2012, 17:04
You're right, my mistake. All the more reason why it only supports one FADD, though. It's possible that it's implemented with an entirely separate unit.
Having a big difference in latency between FADD and FMUL is actually kind of surprising, the significand multiplication itself must be eating a lot of that because you'd expect the normalization to be more expensive with the add.
tunafish
12-Sep-2012, 17:53
I'd give another vote to "FADD is probably its own dedicated unit". Both because FADD units are much cheaper than multiply ones, and because scheduling instructions gets a lot harder in cases where you can stuff things into the middle of a pipeline.
Also, the slides for ARCS004 and 005 are now up. TSX works on L1 (only), and gather zeroes out elements in the mask register when it successfully fetches them -- this way, after any fault the gather can just be restarted and it keeps all the work it has already done.
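A small simulation of that restart behaviour, as a sketch only: Python lists of booleans stand in for the mask register, a set of "unmapped" indices stands in for pages that would fault, and the function and variable names are made up. It shows the property described above, that completed lanes are not redone after a fault.

```python
class Fault(Exception):
    """Stand-in for a page fault raised in the middle of a gather."""


def gather(dest, mask, memory, indices, unmapped, loads):
    """Load memory[indices[lane]] into dest[lane] for every lane whose mask is set.

    Each lane's mask element is cleared as soon as its load succeeds, so if a
    Fault escapes, dest and mask together record the progress made so far.
    """
    for lane, idx in enumerate(indices):
        if not mask[lane]:
            continue                # already fetched before a restart
        if idx in unmapped:
            raise Fault(idx)
        dest[lane] = memory[idx]
        mask[lane] = False
        loads.append(lane)          # bookkeeping for the demonstration only


memory = list(range(100, 116))
indices = [3, 7, 1, 12]
unmapped = {1}                      # hypothetical: index 1 faults the first time
dest = [0] * 4
mask = [True] * 4
loads = []
faults = 0

while any(mask):
    try:
        gather(dest, mask, memory, indices, unmapped, loads)
    except Fault as fault:
        faults += 1
        unmapped.discard(fault.args[0])  # stand-in for the OS mapping the page

print(dest, loads, faults)
```

After the fault is handled, the same instruction is simply issued again with the same registers; lanes 0 and 1 are skipped because their mask elements are already clear.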
So sadly it's almost doubly confirmed that the next Core i3 4xxx will ridicule the upcoming quad core (/2 modules) Steamroller for all personal usage.
AMD sounds like it's really in a tough spot... especially as Haswell might be released before Steamroller.
The scary part is that whereas the Jaguar cores look nice, we have no release date and Intel has a counter... If Atom uses the same power saving techniques as Haswell, and taking into account how long Intel must have been working on that one, I'm scared that the "really nice Jaguar core" may look like a toy in front of Intel's offering.
So sadly it's almost doubly confirmed that the next Core i3 4xxx will ridicule the upcoming quad core (/2 modules) Steamroller for all personal usage.
AMD sounds like it's really in a tough spot... especially as Haswell might be released before Steamroller.
I think the aim of Bulldozer, and therefore Steamroller was never really to compete with Intel on a core-for-core basis, but to allow for more cores within the same transistor and power budget. This is precisely what AMD does with the FX lineup—well, technically they use more transistors and power, but still—and on the A lineup, that is for APUs, they choose to spend the extra transistors and power on the GPU.
Steamroller should continue that trend and, frankly, if Kaveri really does deliver a 30% performance improvement over Trinity, it should be more than enough for most people, so spending extra transistors and power on the GPU seems like the right thing to do.
The scary part is that whereas the Jaguar cores look nice, we have no release date and Intel has a counter... If Atom uses the same power saving techniques as Haswell, and taking into account how long Intel must have been working on that one, I'm scared that the "really nice Jaguar core" may look like a toy in front of Intel's offering.
We'll see. I think the new Atom is supposed to be OoO—which would make it an Atom only in name—but it's still targeted at phones while Jaguar isn't meant to go any lower than tablets. Different power targets usually mean different performance targets too. I wouldn't write AMD off just yet.
Now that the GPU in compute operation has 0.5TB/s of bandwidth to the last level of cache, the thing could literally fly.
Unless it was specified, I think the cache means the GPU dedicated L3 cache, rather than LLC. That makes a big difference.
hkultala
13-Sep-2012, 08:09
I think the aim of Bulldozer, and therefore Steamroller was never really to compete with Intel on a core-for-core basis, but to allow for more cores within the same transistor and power budget.
yep.
But this is a bad strategy.
A) What is usually needed is
1) 1 to few very powerful cpu cores, for code which does not parallelize well
2) Large number of weak cores for massively parallel code
Very little code needs something like 6-12 relatively powerful cores. This is either too few or not enough.
B) Transistors are getting "almost free", so sacrificing single-thread performance to save transistors is just a bad tradeoff. Especially when those saved transistors are only used to put more "semi-powerful" cores on the chip.
Sacrificing single-thread IPC however might be reasonable in cases where it allows higher total performance through higher clock speed, or considerable power savings.
Single-thread performance is still very important, and now that it's harder to get single-thread performance improvements, it just means CPU developers should concentrate more on it, not give up.
I think Intel has the better strategy here. They are concentrating more on single-thread performance, and using SMT to also get improvements with multiple threads.
AMD's strategy with fusion would also be very good, if they just had executed it correctly and designed a proper "high performance for single-thread" cpu core for it, instead of having to use either 4 outdated cores, or 4 new semi-powerful "mini-cores".
Intel is also bringing the "many weak cores" into play with the Larrabee/Knights line. I wonder when they will release a single chip with both high-end x86 cores and Larrabee cores.
aaronspink
13-Sep-2012, 16:18
A microcoded sequence could still be faster.
microcoded has a lot of rather thorny problems that aren't fun to implement, verify, or validate on a cpu. The coded sequence solution provides a defined bounded operation that can be implemented and verified much easier.
The problem really is one of bounds. Consider that each entry in the gather could potentially point to a different PTE and that PTE may or may not be loaded into TLBs or even cache. And I cannot recall off the top of my head whether the PTE can themselves be virtually/indirectly allocated, etc. So for 1 gather you could be looking at potentially 16+ tlb fills + page faults + memory accesses, etc. We're talking upwards of thousands of cycles in what would be in the microcoded case an atomic operation that has significant implications up and down the architecture and validation stack. If you look at errata for various processors, you are likely to find many entries associated with long complicated atomic memory operations.
By implementing it as a load->mask->fill->update instruction, the side effects are significantly restricted and the performance difference should be minimal in a modern core.
aaronspink
13-Sep-2012, 16:34
And as a follow up, doing it as a looped instruction sequence:
A: GATHER Y,X+([IMM]/W),Z
B: BNZ Z, A
has the benefit of allowing things like offloading to micro engines in the future if desired while not requiring it at the start. It is perfectly possible in the future to stick a microengine off an L1 or L2 in the future and pass the instruction through to the microengine to do the whole gather.
Given the common use cases for gather, having a small microengine with 16-64(1-4 element vector aka RGBA/XYZ/XY/etc) cachelines would enable very fast gather generation.
I think the aim of Bulldozer, and therefore Steamroller was never really to compete with Intel on a core-for-core basis, but to allow for more cores within the same transistor and power budget. This is precisely what AMD does with the FX lineup—well, technically they use more transistors and power, but still—and on the A lineup, that is for APUs, they choose to spend the extra transistors and power on the GPU.
Steamroller should continue that trend and, frankly, if Kaveri really does deliver a 30% performance improvement over Trinity, it should be more than enough for most people, so spending extra transistors and power on the GPU seems like the right thing to do.
Well, I've made up my mind, so to speak, no matter how many people tell me the situation is not that bad with regard to CMT.
CMT did not deliver on its premise of "more cores within the same silicon and power budget".
Trinity modules are smaller than 2 Star cores, and they have more features, but not enough to make a significant difference (I made rough measurements). If not for the better power management in power-constrained environments and the use of new instructions, the Star cores would still be better.
It's IMO the contrary: from an "industrial/production" POV the module approach actually offers less granularity than individual cores. Back in the day AMD could sell 1- and 3-core variations; they no longer can.
On the IGP integration side of the equation Intel is sadly ahead of AMD. AMD seems completely focused on fixing their modules; the "uncore" progresses at a really low speed, if at all.
And Anandtech seems to confirm that fast memory (http://www.anandtech.com/show/6277/haswell-up-to-128mb-onpackage-cache-ulv-gpu-performance-estimates) will make it into Haswell. AMD may lose here too. On the compute side of things Intel's IGP was already arguably better.
AMD is mostly putting together its CPU parts and GPU parts, whereas Intel develops its APU as a whole. Maybe if AMD were not putting all their efforts into fixing their modules' performance they could do better here too.
In the mean time a quick list of what they postpone to fix:
The L3 (won't be done before at least 2014)
Fp/SIMD performances (won't be done before at least 2014)
Support for AVX2 (won't be done before at least 2014)
Single-thread performance should catch up with the prior architecture, maybe in 2013 with Steamroller.
Overall I fail to see how AMD could be in a worse situation if they had passed on CMT.
They had proven solutions in front of them with SMT and the cache hierarchy of CPUs like Nehalem and Power7. They decided to come up with their own take, and for me it failed. They should have drawn the bitter and difficult conclusion as soon as BD launched (or not long after engineering samples were out) to push BD out (or scrap it) and start something new.
A 3-issue standard CPU core which would include all the refinements they included in BD and then PD. I fail to understand how such a CPU would not completely outperform their previous architectures, and as such it would be closer to Intel's offering.
Such a CPU might have ended up bigger than both a Star core and half a BD/PD module, but by how much? I suspect not that much, not even enough to significantly change their costs.
It may also be a bit more power hungry, but it might allow for better power management and turbo. You have more granularity: you could change clock speed and clock gate on a per-core basis vs a per-module basis (that's coarse grained).
If they didn't/couldn't copy IBM's or Intel's approaches for the cache hierarchy, they might have come up with something akin to Jaguar, which looks saner. I can't see (or understand) why AMD, which is still doing great things (maybe while beating a dead horse...), could not successfully engineer something like that.
At least they could fight Intel dual cores with tri-cores instead of quad cores (better usage of salvage parts) and have a chance to actually look good.
They are lagging Intel more and more
All this sounds a bit like angst, but I believe that AMD can do so much better. The sad thing is IMHO that by 2014, when or if most of the CMT approach's pitfalls have been fixed (while still not bridging the gap with Intel, more the contrary), and depending on the success of Windows8 RT, they might be threatened by ARM64 CPUs. ARM is already further along the APU road than AMD, with its Mali/A15 CPU. They are going to end up between a rock and a hard place :(
We'll see. I think the new Atom is supposed to be OoO (which would make it an Atom only in name) but it's still targeted at phones while Jaguar isn't meant to go any lower than tablets. Different power targets usually mean different performance targets too. I wouldn't write AMD off just yet.
I don't think they are meant for phones only. I'll try to be optimistic, but as with Steamroller, Jaguar has no release date, so AMD may have only a short head start.
EDIT
Oops, sorry, I just realized that we are indeed in the wrong thread to discuss that matter. Sorry for the OT.
Exophase
13-Sep-2012, 19:01
And as a follow up, doing it as a looped instruction sequence:
A: GATHER Y,X+([IMM]/W),Z
B: BNZ Z, A
has the benefit of allowing things like offloading to micro engines in the future if desired while not requiring it at the start. It is perfectly possible in the future to stick a microengine off an L1 or L2 in the future and pass the instruction through to the microengine to do the whole gather.
Given the common use cases for gather, having a small microengine with 16-64(1-4 element vector aka RGBA/XYZ/XY/etc) cachelines would enable very fast gather generation.
The problem is that requirement for a new vector branch instruction. Masks in AVX2 are full vector registers, and it isn't designed to branch using a vector as condition input. You'd probably have to output to the zero flag instead. Still, that's two vectors the instruction has to write instead of one. I'm not aware of any AVX2 instructions with such capability. Surely that hurts the design somewhere. Larrabee gets around it by having a special register set for predicates.
It's not pretty, but the gather could be replayed in its entirety upon any fault whatsoever, and if you so desire, after an interrupt. All that's required is that the cache and TLB have at least as many ways as there are elements in the vector, because otherwise you could get an infinite loop where the later fields keep evicting the former ones.
Replaying the gather (re-issuing it from the scheduler) won't do. The ROB is of limited size, 168 entries in Haswell, equivalent to just 42 cycles at full tilt. You'd need to remove the gather and all subsequent instructions from the ROB, which means a faulting gather acts like a mispredicted branch on top of all the other penalties.
The load, mask, loop until mask=0 guarantees forward progress, doesn't require any internal state saved on preemption, doesn't throw away work done by subsequent instructions already executed thanks to the OOO machinery. The OOO machinery can also overlap multiple independent gather-loops.
And as Aaron points out. There is nothing preventing Intel recognizing the load/mask/branch idiom in future implementations, speeding it up.
Cheers
Exophase
13-Sep-2012, 22:30
Replaying the gather (re-issuing it from the scheduler) won't do. The ROB is of limited size, 168 entries in Haswell, equivalent to just 42 cycles at full tilt. You'd need to remove the gather and all subsequent instructions from the ROB, which means a faulting gather acts like a mispredicted branch on top of all the other penalties.
Maybe we have a different idea in mind of what "replay" means. I'm merely saying to restart the gather instruction as if it had made no progress, not to flush and reissue from where the gather was as if it had caused a branch misprediction. The point is not having to save and restore any internal state in order to resume a partially completed gather. Is completely re-executing instructions on faults rather than trying to complete them really that unusual?
The load, mask, loop until mask=0 guarantees forward progress, doesn't require any internal state saved on preemption, doesn't throw away work done by subsequent instructions already executed thanks to the OOO machinery. The OOO machinery can also overlap multiple independent gather-loops.
And as Aaron points out. There is nothing preventing Intel recognizing the load/mask/branch idiom in future implementations, speeding it up.
I don't think I understand something: are the two of you arguing in favor of the mask loop approach or are you trying to say you think that's what Haswell is actually using? Because that is definitely not how gather is specified in the AVX2 documentation. For better or worse, this isn't what Intel is doing.
And do you also not think adding a branch on vector mask instruction is a pretty big shift?
rpg.314
14-Sep-2012, 04:02
microcoded has a lot of rather thorny problems that aren't fun to implement, verify, or validate on a cpu. The coded sequence solution provides a defined bounded operation that can be implemented and verified much easier.
The problem really is one of bounds. Consider that each entry in the gather could potentially point to a different PTE and that PTE may or may not be loaded into TLBs or even cache. And I cannot recall off the top of my head whether the PTE can themselves be virtually/indirectly allocated, etc. So for 1 gather you could be looking at potentially 16+ tlb fills + page faults + memory accesses, etc. We're talking upwards of thousands of cycles in what would be in the microcoded case an atomic operation that has significant implications up and down the architecture and validation stack. If you look at errata for various processors, you are likely to find many entries associated with long complicated atomic memory operations.
By implementing it as a load->mask->fill->update instruction, the side effects are significantly restricted and the performance difference should be minimal in a modern core.
But why is a masking based approach incompatible with microcode? Let the microcode generate an unrolled loop. If any of these issues arise, then declare exception and update masks just like you would with a load->mask->fill->update instruction.
All the PTE/TLB related issues could happen just as easily with a load->mask->fill->update instruction.
aaronspink
14-Sep-2012, 04:54
The problem is that requirement for a new vector branch instruction. Masks in AVX2 are full vector registers, and it isn't designed to branch using a vector as condition input. You'd probably have to output to the zero flag instead. Still, that's two vectors the instruction has to write instead of one. I'm not aware of any AVX2 instructions with such capability. Surely that hurts the design somewhere. Larrabee gets around it by having a special register set for predicates.
obviously the mask registers aren't the main register set, they are status registers.
Exophase
14-Sep-2012, 05:56
But AVX2 already specifies a gather with an arbitrary vector mask to merge the output. Are you now proposing a whole new set of instructions to manipulate and branch on these status registers? You want to turn it into LRBNI full stop?
rpg.314
14-Sep-2012, 06:45
But AVX2 already specifies a gather with an arbitrary vector mask to merge the output. Are you now proposing a whole new set of instructions to manipulate and branch on these status registers? You want to turn it into LRBNI full stop?
I think some kind of unification there is on the way.
Blazkowicz
15-Sep-2012, 16:00
[offtopic]
yep.
But this is a bad strategy.
A) What is usually needed is
1) 1 to few very powerful cpu cores, for code which does not parallelize well
2) Large number of weak cores for massively parallel code
Very little code needs something like 6-12 relatively powerful cores. That is either too many or not enough.
Many-powerful-cores is useful in servers, that is when you deal with many requests and want them to be low latency.
Bulldozer was often described as a server chip. It makes sense, only the power use is too high, leading to clocks that are too low, and on the desktop it faces Sandy Bridge, which humiliates it.
The logical conclusion of your post is that we need a chip with a few powerful cores and many weak ones. This is not easy; even Intel CPU + Knights/Xeon Phi isn't quite there because they run two different x86 instruction sets.
The strong+weak mix now exists in the Tegra3, and yet to launch Cortex A15 + Cortex A7 designs, but it's here to save milliwatts, not for performance. It also helps that the device is shipped with a custom linux kernel with adequate scheduler.
Workstations happily use up your 6 or 12 or more cores or threads, because that's what is easily available now. In fact, if the Piledriver incremental improvement is good enough I think AMD could pit dual socket desktop boards against single socket 2011. It's versatile. Same hardware can be used to run 40 linux VMs, or to do video rendering or something. "Weak cores" solutions have to get better (the Xeon Phi is a significant milestone and we'll see what nvidia Maxwell is up to, as well as AMD steamroller and post-steamroller). But the software aspect is crucial (an 8-core, 12-core or 16-core machine has the advantage of running regular software, not specially crafted one)
I wonder about Intel's next crazy CPU with many powerful cores, the EX variants. They now seem to skip the generations that bring a new architecture but not a new process. So we have Westmere-EX, Ivy Bridge-EX and I suppose Broadwell-EX after that.
impressive specs on show...doubling of a lot of units, but why are sites saying the only performance doubling is the iGPU and a marked increase in idle watt savings...the CPU side expects to see no more than 10% gains over Ivy Bridge on average?? It is a 95W part vs 77W...have we come to a point that clockspeed is a limiting factor...? AFAIK Intel has not gone over 4Ghz on Turbo for their quadcore parts..for a long time....
From LGA1366 to LGA1155...we saw the average clockspeed went from 2.66/3.2Ghz to 3.4/3.8Ghz, can Hazwell be limited by clockspeed? As the 22nm Ivy Bridge refresh did not clock much higher (by which i mean extreme overclocking - 5Ghz) than Sandy even after the delidding and replacement of heatpaste...
tunafish
15-Sep-2012, 18:09
impressive specs on show...doubling of a lot of units, but why are sites saying the only performance doubling is the iGPU and a marked increase in idle watt savings...the CPU side expects to see no more than 10% gains over Ivy Bridge on average?? It is a 95W part vs 77W...have we come to a point that clockspeed is a limiting factor...? AFAIK Intel has not gone over 4Ghz on Turbo for their quadcore parts..for a long time....
From LGA1366 to LGA1155...we saw the average clockspeed went from 2.66/3.2Ghz to 3.4/3.8Ghz, can Hazwell be limited by clockspeed? As the 22nm Ivy Bridge refresh did not clock much higher (by which i mean extreme overclocking - 5Ghz) than Sandy even after the delidding and replacement of heatpaste...
Most of the doublings happened for the SIMD units. That won't be even visible until software is recompiled -- and in some cases, rewritten. I expect Haswell to be one of the biggest advances in CPU speed in a long time, once the software catches up. In terms of "I have this old app, when I copy it over to my Haswell machine, how much faster will it go?", I don't expect all that much.
Also, many of the Haswell changes scream "reduction in possible clock speed" to me. Even if it won't clock lower than Ivy, it will certainly clock lower than it would without those changes. Basically, Intel went: "Okay, let's make every pipeline stage 10% longer. What can you do with that?"
Why does Intel want to stop the clock?? Is this really really the end of the mhz....? Everyone loved Sandy Bridge, and a lot more after how weak the Ivy Bridge refresh was but LGA1366/1156 to LGA1155 is really down to how fast Sandy Bridge can clock up..if Hazwell desktop quadcore parts dont follow that kind of clock up...what is Intel gameplan for the desktop users....how long will x86 software catch up with Hazwell doubled SIMD units?
Exophase
15-Sep-2012, 19:19
Why does Intel want to stop the clock?? Is this really really the end of the mhz....? Everyone loved Sandy Bridge, and a lot more after how weak the Ivy Bridge refresh was but LGA1366/1156 to LGA1155 is really down to how fast Sandy Bridge can clock up..if Hazwell desktop quadcore parts dont follow that kind of clock up...what is Intel gameplan for the desktop users....how long will x86 software catch up with Hazwell doubled SIMD units?
Intel is optimizing for perf/W, and the perf/W isn't that great at what the current CPUs can overclock to. So they aren't as aggressive as they could be with stock nor even turbo clocks. If eating out of that unused headroom means better perf/W, especially at the lower clock targets for ULV parts, then it's an obvious win to them.
When you refer to "everyone" and "desktop users" I get the distinct impression that you're actually referring to overclockers, which are still somewhat of a niche, especially when only the highest ends of the product lines can really overclock to begin with. At stock speeds IB is an obvious win, nobody would say that it makes SB look better.
That all said, we really don't know if Haswell's changes are forcing clock time to go up, especially vs SB which was on an older process. It's a given that some stages will take longer to check dependencies of and dispatch to those additional ports, but the clock time is only as fast as the slowest pipeline stage. And while CPU designers will do their best to make the stages run as close to the same speed as possible I doubt they really get it perfect, so who knows if there wasn't another slower stage that gave them this headroom.
Furthermore, you'd expect IB would reduce cycle time requirements vs SB if only slightly, yet the overclocking potential was less most likely due to power density issues. So they may have had headroom that wasn't even accessible and at that point there's no reason not to trade that for perf/W. And Haswell may be better optimized for the process with regards to power distribution. So I wouldn't say that it's a given that it'll hit peak clocks below what SB could.
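The "clock is only as fast as the slowest pipeline stage" argument can be put in numbers. A minimal sketch, with stage names and picosecond delays that are entirely invented for illustration (nothing here is real Haswell or Sandy Bridge data):

```python
def cycle_time_ps(stage_delays_ps):
    """The clock period is set by the slowest stage."""
    return max(stage_delays_ps)


def max_clock_ghz(stage_delays_ps):
    """Convert the cycle time in picoseconds to a peak frequency in GHz."""
    return 1000.0 / cycle_time_ps(stage_delays_ps)


# Hypothetical per-stage delays; "execute" is the critical stage.
baseline = {"fetch": 230, "decode": 240, "schedule": 225, "execute": 250}

# A wider scheduler that is slower but still within the existing slack.
wider = dict(baseline, schedule=245)

# A scheduler slow enough to become the new critical stage.
critical = dict(baseline, schedule=275)

baseline_ghz = max_clock_ghz(baseline.values())
wider_ghz = max_clock_ghz(wider.values())
critical_ghz = max_clock_ghz(critical.values())
```

With these made-up numbers the wider scheduler costs nothing in peak clock because another stage was already slower, while the third case drops the peak. Which of the two cases applies to Haswell is exactly what we don't know.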
entity279
16-Sep-2012, 12:24
Well AMD had been the ones shouting about the increased use of CAD-created layouts. I think, anyway, that with process tech advancing it's inevitable that automation will be used more and more everywhere.
Maybe the somewhat lack of aggressive frequency increases is also an indication that intel now uses a bit less custom logic?
Why does Intel want to stop the clock?? Is this really really the end of the mhz....? Everyone loved Sandy Bridge, and a lot more after how weak the Ivy Bridge refresh was but LGA1366/1156 to LGA1155 is really down to how fast Sandy Bridge can clock up..if Hazwell desktop quadcore parts dont follow that kind of clock up...what is Intel gameplan for the desktop users....how long will x86 software catch up with Hazwell doubled SIMD units?
I think some softwares can quickly take advantages of AVX2 no ? Like video encoding stuffs, 2d/3d rendering softwares, no ?
Blazkowicz
16-Sep-2012, 15:11
I don't remember any benchmark using the special abilities of Bulldozer (FMA4)
software may be slow moving but maybe the authors don't care unless the new stuff is on Intel CPUs.
Do you think Haswell doubled SIMD units will rock with PC games..? Well..i think a i5 4570K will clock between 3.6Ghz to 4Ghz and the i7 4770K will be between 3.8Ghz to 4.2Ghz...both capable of running stock 1866mhz ram...these numbers i just pulled out of the thin air around me...but at these kind of clocks, where will Haswell stand?? Guesses gentlemen?
It is smaller than the jump from LGA1366/1156 to Sandy Bridge...
I don't remember any benchmark using the special abilities of Bulldozer (FMA4)
There was a slide back then from the official BD presentation, showing rather impressive results from some OCL kernel with FMA4 support, but nothing more to date.
TDP for IVB is 77W and Haswell is 95W. Presumably they're talking about the top end desktop parts, and those don't necessarily have the highest powered GPUs attached so part of that ~18W increase has to go into the CPU. The revisions to the architecture itself, though sizable, can't be consuming all that power, so I'm expecting top clocks to be higher as well.
tunafish
16-Sep-2012, 16:08
Do you think Haswell doubled SIMD units will rock with PC games..? Well..i think a i5 4570K will clock between 3.6Ghz to 4Ghz and the i7 4770K will be between 3.8Ghz to 4.2Ghz...both capable of running stock 1866mhz ram...these numbers i just pulled out of the thin air around me...but at these kind of clocks, where will Haswell stand?? Guesses gentlemen?
Actually, I think those clocks are totally unrealistic. It's running on the same process as Ivy, and every single major change they made makes the critical path longer. The typical clocks are going down, not up. Alternatively, since Intel chips now have huge clock headroom over stock, they might just spend some of that to maintain present clocks.
It is smaller than the jump from LGA1366/1156 to Sandy Bridge...
From Nehalem to Sandy Bridge, the CPU got much better L3 and uop cache. From Ivy to Haswell, it gets twice the bandwidth to L2, bank-conflict-free L1, +1 ALU, 2 branches per clock, the ability to do 2L+1W per clock, and all the new instruction set goodness.
For existing software, this will be the biggest gain in IPC since Core -> Core 2. However, I expect the IPC gain to be compensated by lower clocks.
Actually, I think those clocks are totally unrealistic. It's running on the same process as Ivy, and every single change they made screams to me "longer critical path". The typical clocks are going down, not up. Alternatively, since Intel chips now have huge clock headroom over stock, they might just spend some of that to maintain present clocks.
From Nehalem to Sandy Bridge, the CPU got much better L3 and uop cache. From Ivy to Haswell, it gets twice the bandwidth to L2, bank-conflict-free L1, +1 ALU, 2 branches per clock, the ability to do 2L+1W per clock, and all the new instruction set goodness.
For existing software, this will be the biggest gain in IPC since Core -> Core 2. However, I expect the IPC gain to be compensated by lower clocks.
Well i meant clock speed..sorry if it was unclear.
77W to 95W tdp should account for some more clocks....maybe 3.6ghz to 4ghz for the 4770K....and the 4570K should take the speed of the present 3770K...3.5ghz to 3.9ghz..kinda of sucky if that is all...
Could we be waiting for Haswell-E parts with the 8 cores sku..i dont understand why Intel dont want to go higher on the 95W Haswell quad desktop sku..why do we need double the iGPU performance on my gaming PC? I find it irony doing so will actually have a hand in killing off the desktop market....it is like irony fox..../cries profits from desktop is shrinking.../design your next cpu around perf/watt and portability..
Blazkowicz
16-Sep-2012, 16:35
VRMs are moved onto the CPU package, I believe this accounts for most of the TDP increase.
This also means a piece of crap motherboard with a non-K high end CPU gets (even more) attractive. Expect creative chipset segmentation and annoying marketing of IGP tiers (well we've had these things in place already)
entity279
16-Sep-2012, 17:12
Could we be waiting for Haswell-E parts with the 8 cores sku..
Yes, we will be waiting for those.
Personally at least, I wouldn't want to waste my nice after-market cooler for a 77W processor. :razz:
Blazkowicz
16-Sep-2012, 17:42
you will have to wait a long time for Haswell-E.
Haswell will coexist with Ivybridge-E and we don't know anything about a 2011 socket successor or any future CPU on that socket. Haswell might be skipped, Ivybridge-E will be the high end till 2014 then maybe Intel moves to ddr4 for its high end and servers, where ddr4 may be useful in getting truckload amounts of memory.
I've read that memory chips use a ridiculously large amount of power in datacenters, by the way. imagine racks upon racks of PCs loaded with 256GB ram, 10Gb networking, loads of VMs, sprawling databases etc.
so, on the consumer side, Intel releases a stop-gap socket, the 1150, which still supports ddr3. but servers (and 2011 is a server socket, in addition to high end desktop) need ddr4 sooner and Intel might not bother with new ddr3 sockets.
green.pixel
16-Sep-2012, 21:07
Could we be waiting for Haswell-E parts with the 8 cores sku..i dont understand why Intel dont want to go higher on the 95W Haswell quad desktop sku..
It will take quite a while before you see any significant use of an 8C CPU for gaming.
Blazkowicz
16-Sep-2012, 21:28
Intel doesn't want to sell you an 8 core CPU with a H61 chipset or its next gen equivalent while they could milk you for a X79 instead.
shiznit
27-Sep-2012, 22:22
Found the links to the webcasts:
http://intelstudios.edgesuite.net/idf/2012/sf/ti/SPCS001/index.htm
http://intelstudios.edgesuite.net/idf/2012/sf/ti/ARCS001/index.htm
I don't think I understand something: are the two of you arguing in favor of the mask loop approach or are you trying to say you think that's what Haswell is actually using? Because that is definitely not how gather is specified in the AVX2 documentation. For better or worse, this isn't what Intel is doing.
And do you also not think adding a branch on vector mask instruction is a pretty big shift?
Mea culpa. Assumption is the mother of all fuck ups, - and all that. I assumed Haswell used an implementation similar to Knights Corner.
It does indeed look like Haswell can interrupt in the middle of its gather instruction with the result of a partially completed gather stored in registers (data+mask). Which then begs the question how they've done that.
Cheers
Nice exposé:
http://www.anandtech.com/show/6355/intels-haswell-architecture
Blazkowicz
05-Oct-2012, 17:50
Yes, Knights Corner is like Intel throwing the usual x86 SIMD extensions out the window, and branching off to do something different. I don't know enough to tell what was thrown out (e.g. does it support SSE2 or not, even x87 etc.; on non-FP stuff does it even meet i686?)
Nice exposé:
http://www.anandtech.com/show/6355/intels-haswell-architecture
Well they might have cut the bullshit about Apple and comparing the A6 to Intel products, especially in an article about Haswell, whose performance ARM-based CPUs may never ever touch.
As for mobile, let's wait for the pain next year with OoO Atom shipping on a 22nm process; I would bet it is (finally) going to hurt.
EDIT
+1 to my self the focus on Apple is really on the verge of F-Boyism on that one, crazy.
Comparing ultrabook and laptop to tablet... that is nonsensical and short-sighted.
Next year Intel will have awesome and I would bet unmatched products to power Windows 8 (which in turns will have matured) to power "serious" tablets ( I mean akin the high end MSFT so a viable substitute to laptops), thanks to the new atom dual and quad core configuration supporting 4GB of ram and more and so on.
Well they might have cut the bullshit about Apple and comparing the A6 to Intel products, especially in an article about Haswell, whose performance ARM-based CPUs may never ever touch.
A
EDIT
+1 to my self the focus on Apple is really on the verge of F-Boyism on that one, crazy.
Comparing ultrabook and laptop to tablet...
The A6 CPU is the first internally developed ARM micro architecture from Apple, and quite an ambitious one. It appears to be faster on a clock normalized basis than Cortex A-15 and it appears to have a much better memory subsystem than any other mobile SOC CPU.
Apple acquired significant CPU design know-how when they bought PA Semi and Intrinsity. Some of these guys know how to build high power processors.
Can A6 compete against Haswell in desktops and high power laptops ? Hell no. But in a Mac Book Air form factor it might just be competitive with Intel's offerings. Intel is threatened by this; Notice how much of the Haswell material is about power savings rather than outright compute performance.
I can't wait to see the next iPad with an A6 in a larger power envelope.
Cheers
Can A6 compete against Haswell in desktops and high power laptops ? Hell no. But in a Mac Book Air form factor it might just be competitive with Intel's offerings. Intel is threatened by this; Notice how much of the Haswell material is about power savings rather than outright compute performance.
Definitely. Intel actually decided to sacrifice some L3 access latency in Haswell, by separating its clock domain from the CPU cores with the single goal to shave off few watts of power.
ninelven
08-Oct-2012, 23:08
It appears to be faster on a clock normalized basis than Cortex A-15
Are there any A-15s out yet to compare it against?
tunafish
09-Oct-2012, 01:26
Apple acquired significant CPU design know-how when they bought PA Semi and Intrinsity. Some of these guys know how to build high power processors
It's interesting to note that there are more chip designers who worked on K8 working at Apple than there are working at AMD. PA Semi was one of the favourite destinations for chip designers when they fled AMD.
rpg.314
09-Oct-2012, 05:13
Are there any A-15s out yet to compare it against?
Krait seems to be a good proxy for A15.
Exophase
09-Oct-2012, 05:28
Krait seems to be a good proxy for A15.
Krait has nothing to do with Cortex-A15.
rpg.314
09-Oct-2012, 05:41
Krait has nothing to do with Cortex-A15.
Which is why I used the word proxy. I expect Krait and A15 will end up rather close.
Exophase
09-Oct-2012, 06:52
Which is why I used the word proxy. I expect Krait and A15 will end up rather close.
Based on what exactly? Why would you expect Krait to represent Cortex-A15 any better than A6? If anything the one released closer in time would be more likely to be representative wouldn't it?
tunafish
09-Oct-2012, 08:27
A15 and Krait both have 3-wide decode and 128-bit FPU, but that's pretty much where the similarities end. Krait has a shorter pipeline and a low-latency, really weird cache subsystem. In comparison, A15 will have higher latencies and higher clocks. Whether the power consumption will blow up when it's taken to those clocks is a whole other (and as of yet unknown) issue.
I don't think that Krait is in any way a good proxy for A15 performance. In fact, I simply think that there isn't enough published data on A15 to make any sort of informed judgement yet.
Are there any A-15s out yet to compare it against?
No exact science was used in my estimate, I was going by the claimed 40% IPC improvement of A15 vs A9.
The interesting point is of course how power consumption compares with Krait and A15 at a given performance level.
Cheers
I don't think that Krait is in any way a good proxy for A15 performance. In fact, I simply think that there isn't enough published data on A15 to make any sort of informed judgement yet.
We know A15 is a 3-wide superscalar OOO with a 40+ entry ROB, and we know NEON instructions are now tracked by the ROB (rename tables for both ARM and NEON registers), it also has a wider memory subsystem. It is going to be a fair bit faster on normal integer/fp code and a lot faster on NEON code.
Wiki says Krait is OOO, but I can't find that claim anywhere in any of the Qualcomm PR material.
The only immediate difference seems to be the length of the pipeline, 11 stages for Krait and 15 stages for A15. The long pipeline indicates a higher operating frequency target, and together with all the virtualization support ARM has added, it seems to me A15 is really targeted more at low power servers than mobile SOCs.
I expect A15 to be faster than Krait, but burn more power being so.
Cheers
Exophase
09-Oct-2012, 19:04
A15 and Krait both have 3-wide decode and 128-bit FPU, but that's pretty much where the similarities end. Krait has a shorter pipeline and a low-latency, really weird cache subsystem. In comparison, A15 will have higher latencies and higher clocks. Whether the power consumption will blow up when it's taken to those clocks is a whole other (and as of yet unknown) issue.
I don't think that Krait is in any way a good proxy for A15 performance. In fact, I simply think that there isn't enough published data on A15 to make any sort of informed judgement yet.
There's a ton of published information on Cortex-A15, it's Krait that we know close to nothing about. Your information, taken from AnandTech, pretty much sums it up, where "short pipeline" and "low latency" are incredibly vague descriptions. In actuality we don't know what the fetch bandwidth is, we don't know what its integer execution unit resources are, we don't know if it can support simultaneous loads and stores, we don't know how deep its reordering capabilities are, we don't know what its branch prediction is like.. these are all things we have pretty good descriptions of for Cortex-A15. Sure it may be established that they both have 128-bit NEON units, but what's the latency like - are you going to take at face value that its SIMD is lower latency just because Anand says it has a smaller pipeline? There's way too much missing information, and there's definitely a lot of room where Cortex-A15 could outperform Krait, and unless ARM's estimations of how it'll perform vs A9 are totally unrealistic it will outperform Krait.
Note that a lot of Cortex-A15's long pipeline is in a frontend that can be partially bypassed if code is running from the loop buffer.
No exact science was used in my estimate, I was going by the claimed 40% IPC improvement of A15 vs A9.
That 40% number only applies to Dhrystone. ARM gave numbers of 50% improvement on integer code and 100% improvement on FP (presumably SIMD, maybe also including integer SIMD?) and memory bound stuff. They also cited that they had an internal goal to improve typical IPC by 50% over Cortex-A9, which they feel they've met.
You have to consider that, aside from the benchmark being abject garbage, there's just less room to grow with Dhrystone. It all fits in L1 cache, uses pretty predictable branches, and spends a lot of time in library functions that can be hand optimized. So Cortex-A15's strengths aren't going to benefit it as much as it'll benefit real programs.
rpg.314
10-Oct-2012, 03:46
Based on what exactly? Why would you expect Krait to represent Cortex-A15 any better than A6? If anything the one released closer in time would be more likely to be representative wouldn't it?
Because Scorpion core came out about a year before A9 and it wasn't a bad proxy.
By proxy I mean <20% difference. YMMV with this metric.
Exophase
10-Oct-2012, 04:37
20% difference is good for a proxy? Seriously? Do you even have anything really showing Scorpion to A9 being a typical < 20% at same clock speed?
Scorpion is much closer to A8 than A9, making the latter comparison over the former seems totally disingenuous :/
rpg.314
10-Oct-2012, 04:54
20% difference is good for a proxy? Seriously?
Good enough for me. It's a pretty bad metric if you are doing an in depth comparison, no doubt about that. But in terms of the user experience with actual apps, I think this much difference is not perceptible.
That 40% number only applies to Dhrystone. ARM gave numbers of 50% improvement on integer code and 100% improvement on FP (presumably SIMD, maybe also including integer SIMD?)
IMO, the 40% IPC increase in Dhrystone is the upper limit we will see for IPC improvements. Memory latency doesn't magically go away, so a real workload that busts out of cache is going to see less IPC improvement.
The 50% performance improvement is with frequency improvements AFAICT (at a fixed power consumption level)
The 100% FP is only for SIMD code. The A9 doesn't track data dependencies on NEON registers. Using NEON instructions thus effectively turns the A9 into an in-order processor. The A15 has two remap tables, one for ARM registers and one for NEON registers. That and the wider datapaths are going to improve SIMD code immensely, but regular FP code much less.
Cheers
Exophase
10-Oct-2012, 16:15
IMO, the 40% IPC increase in Dhrystone is the upper limit we will see for IPC improvements. Memory latency doesn't magically go away, so a real workload that busts out of cache is going to see less IPC improvement.
That's like saying that the performance difference between Cortex-A8 and Ivy Bridge is purely down to their cache and main memory latencies. There's a big continuum of performance opportunities based on how well you can a) extract parallelism and b) schedule to hide latency. Cortex-A15 makes big advances on both fronts. Without knowing the weaknesses of what you're starting with that's a pretty blind statement. Given that Cortex-A15 doesn't actually add much to the execution resources on the integer side, over Cortex-A8 and A9, I'd say it really is all about better management of said resources.
Besides that, Cortex-A9 implementations do tend to have relatively high L2 latency and relatively high main memory latency, so there's plenty of room for improvement; the former can actually be delivered by ARM since the L2 is tightly coupled with the CPUs again.
What really confuses me is how you can make this statement while simultaneously saying A6's CPU is higher performing - does only it get to magically make latency go away?
The 50% performance improvement is with frequency improvements AFAICT (at a fixed power consumption level)
No it isn't. There's no ambiguity in what ARM said.
The 100% FP is only for SIMD code. The A9 doesn't track data dependencies on NEON registers. Using NEON instructions thus effectively turns the A9 into an in-order processor. The A15 has two remap tables, one for ARM registers and one for NEON registers. That and the wider datapaths are going to improve SIMD code immensely, but regular FP code much less.
The 100% number was NOT just given for SIMD.
Everything you said about OoO applies to scalar VFP in Cortex-A9 vs Cortex-A15 just as much as it applies to NEON. The word isn't back yet but it's also possible that there are two "real work" VFP pipes (ie, 2x scalar FMADDs)
That's like saying that the performance difference between Cortex-A8 and Ivy Bridge is purely down to their cache and main memory latencies. There's a big continuum of performance opportunities based on how well you can a) extract parallelism and b) schedule to hide latency. Cortex-A15 makes big advances on both fronts.
Without knowing the weaknesses of what you're starting with that's a pretty blind statement. Given that Cortex-A15 doesn't actually add much to the execution resources on the integer side, over Cortex-A8 and A9, I'd say it really is all about better management of said resources.
The A9 has 2-wide instruction decode and retirement, a 24 entry reorder window and 4 dispatch ports.
The A15 has 3-wide decode and retirement and a 40+ entry reorder window. None of the material I've seen is more detailed than "40+" ROB entries and none says how many dispatch ports or execution units it has. The A15 is designed for higher operating frequency. Higher operating frequency generally increases latency measured in cycles to the memory subsystem, if ARM combats this with a new cache architecture, fine, but it still means that the amount of ROB entries per peak instruction throughput per cycle roughly stays the same, so don't expect more than 50% IPC increase.
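The "ROB entries per peak instruction throughput" comparison is quick to check from the figures quoted above (24 entries at 2-wide for A9, "40+" at 3-wide for A15). A rough sketch; note that 40 is only the lower bound of "40+", so the A15 figure is a floor rather than a measurement:

```python
def rob_entries_per_issue_slot(rob_entries, width):
    """Reorder-window entries available per instruction of peak width."""
    return rob_entries / width


# Figures as quoted in the post; 40 is the lower bound of "40+".
a9_ratio = rob_entries_per_issue_slot(24, 2)
a15_ratio = rob_entries_per_issue_slot(40, 3)

# How much deeper the A15 window is relative to its width.
relative = a15_ratio / a9_ratio
```

That works out to 12 entries per slot for A9 against roughly 13.3 for A15, about an 11% difference, which is the sense in which the ratio "roughly stays the same". It says nothing about the other differences between the two cores.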
What really confuses me is how you can make this statement while simultaneously saying A6's CPU is higher performing - does only it get to magically make latency go away?
It's not my claim. The A6 enjoys a 60% IPC increase vs A9 as per Anandtech's tests. ARM claims a 40% (or 50%) IPC increase of A15 over A9. I merely tried to connect the dots.
We don't know anything about the A6 other than it has a kickass memory subsystem. Where does the performance come from ? Is it 4-wide? Does it have multi ported D$? How big is the reorder buffer? Does it have memory disambiguation ?
Cheers
Exophase
10-Oct-2012, 21:21
The A9 has 2-wide instruction decode and retirement, a 24 entry reorder window and 4 dispatch ports.
Where did you read that it has a 24 entry reorder window? Or 4 dispatch ports for that matter?
This is the best Cortex-A9 reference I've seen: http://www.docstoc.com/docs/73399229/Cortex-A9-Processor-Microarchitecture
When they say "3+1" dispatch all diagrams would suggest that's either referring to the third port being capable of going to LS vs NEON/VFP, or the separate branch resolution. It's not a real quad dispatch either way.
There's no official documentation on the issue queue, but the diagram draws 6 squares, so the best guess will be that it's 6 wide. Everything else about it suggests a unified scheduler. Given that ARM themselves says that 8 scheduler slots were pushing the upper limit of feasibility in their design constraints for Cortex-A15 it'd be awfully strange if Cortex-A9 had 24, although I suppose it's possible given that they were designed by two totally different teams.
The A15 has 3-wide decode and retirement and a 40+ entry reorder window. None of the material I've seen is more detailed than "40+" ROB entries and none says how many dispatch ports or execution units it has. The A15 is designed for higher operating frequency. Higher operating frequency generally increases latency measured in cycles to the memory subsystem, if ARM combats this with a new cache architecture, fine, but it still means that the amount of ROB entries per peak instruction throughput per cycle roughly stays the same, so don't expect more than 50% IPC increase.
You're not looking very hard for information. http://www.arm.com/files/pdf/AT-Exploring_the_Design_of_the_Cortex-A15.pdf
A15 has 8 issue queues (to each execution pipeline) in 5 clusters, each with 8 slots. That's 64 entries total. It can dispatch to each of the 8 pipelines each cycle. The pipelines are 2x simple ALU, 1x branch, 1x MUL, 1x load, 1x store, and 2x NEON/VFP. Note that the ALUs bring back parallel shift + op execution, which was moved to separate stages in A9.
But there's way more to the comparison than just execution window, execution width, and latency to the memory subsystem. I don't think I really need to start listing things.
It's not my claim. The A6 enjoys a 60% IPC increase vs A9 as per Anandtech's tests. ARM claims a 40% (or 50%) IPC increase of A15 over A9. I merely tried to connect the dots.
Are you aware that A6 runs at up to 1.3GHz and therefore was probably running at that clock speed during Anand's tests?
We don't know anything about the A6 other than it has a kickass memory subsystem. Where does the performance come from ? Is it 4-wide? Does it have multi ported D$? How big is the reorder buffer? Does it have memory disambiguation ?
True, we don't know those things, but it seems you don't know a lot about Cortex-A15 too.. 4-wide seems pretty outrageous for a phone chip.
Anyway, back to the original claim - regardless of what you think the maximum improvement Cortex-A15 can bring is, why would you think Dhrystone would be what's representative of the upper limit? Dhrystone is relatively static, predictable, small, and the test is designed so that you can spend a lot of the time in hand tuned ASM. In other words, an easy problem. A lot of the hardware in Cortex-A15, quite possibly the majority of it, is designed for problems harder than Dhrystone.
Where did you read that it has a 24 entry reorder window? Or 4 dispatch ports for that matter?
This is the best Cortex-A9 reference I've seen: http://www.docstoc.com/docs/73399229/Cortex-A9-Processor-Microarchitecture
Reorder window size inferred from 56 rename entries with 32 needed for architected state (int+fp).
Dispatch: Page 6 here (www.ruhr-uni-bochum.de/integriertesysteme/emuco/files/System_Level_Benchmarking_Analysis_of_the_Cortex_A 9_MPCore.pdf)
Although the diagram is confusing, it does say up to FOUR dispatches per cycle.
You're not looking very hard for information. http://www.arm.com/files/pdf/AT-Exploring_the_Design_of_the_Cortex-A15.pdf
A15 has 8 issue queues (to each execution pipeline) in 5 clusters, each with 8 slots. That's 64 entries total. It can dispatch to each of the 8 pipelines each cycle. The pipelines are 2x simple ALU, 1x branch, 1x MUL, 1x load, 1x store, and 2x NEON/VFP. Note that the ALUs bring back parallel shift + op execution, which was moved to separate stages in A9.
The amount of instructions in issue queues doesn't say anything about the rename capacity and hence the size of reorder window.
When an instruction is renamed, it is allocated an entry in the commit queue. The only time I've seen the size of the commit queue mentioned was in comp.arch on usenet two years ago, where the number 40 was mentioned.
Are you aware that A6 runs at up to 1.3GHz and therefore was probably running at that clock speed during Anand's tests?
No, I wasn't aware of that. I'm surprised Apple doesn't market it as a 1.3GHz processor then.
True, we don't know those things, but it seems you don't know a lot about Cortex-A15 too.. 4-wide seems pretty outrageous for a phone chip.
Nobody, outside of ARM, knows much about A15.
Anyway, back to the original claim - regardless of what you think the maximum improvement Cortex-A15 can bring is, why would you think Dhrystone would be what's representative of the upper limit? Dhrystone is relatively static, predictable, small, and the test is designed so that you can spend a lot of the time in hand tuned ASM. In other words, an easy problem. A lot of the hardware in Cortex-A15, quite possibly the majority of it, is designed for problems harder than Dhrystone.
Dhrystone runs close to the maximum of what the execution core of the CPU is capable of. A real workload is not fully contained in D$ and you then have to contend with memory latencies.
The A15 can execute 50% more instructions per cycle. That also implies that the latency of a memory operation grows by 50% measured in instructions even if the number of cycles stays the same. In order to get a perfect 50% speedup you'd need to reduce main memory latency to 66%.
Can the A15 do that? Possibly, the tests I've seen of A9 shows a 200ns main memory latency, so there is certainly room for improvement.
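The "66%" figure follows from a very simple model in which a miss costs latency times clock times IPC in lost instruction slots. A minimal sketch of that arithmetic, assuming a hypothetical 1 GHz clock and a baseline IPC of 1.0 (neither number comes from the thread; only the 200ns latency and the 50% gain do):

```python
# Simple model: the cost of a main memory miss expressed in instruction slots.
# CLOCK_GHZ and BASE_IPC are hypothetical placeholders chosen for illustration.
MAIN_MEMORY_LATENCY_NS = 200.0  # A9 main memory latency quoted in the thread
CLOCK_GHZ = 1.0                 # assumption: same clock for both cores
BASE_IPC = 1.0                  # assumption: arbitrary baseline for the A9
IPC_GAIN = 1.5                  # the claimed 50% IPC increase


def miss_cost_in_instructions(latency_ns, clock_ghz, ipc):
    """Instruction slots that pass while one miss is outstanding."""
    cycles = latency_ns * clock_ghz  # ns * cycles/ns
    return cycles * ipc


a9_cost = miss_cost_in_instructions(MAIN_MEMORY_LATENCY_NS, CLOCK_GHZ, BASE_IPC)
a15_cost = miss_cost_in_instructions(
    MAIN_MEMORY_LATENCY_NS, CLOCK_GHZ, BASE_IPC * IPC_GAIN
)

# Latency at which the faster core loses no more slots per miss than the A9.
required_latency_ns = MAIN_MEMORY_LATENCY_NS / IPC_GAIN

print(f"A9 miss cost:  {a9_cost:.0f} instruction slots")
print(f"A15 miss cost: {a15_cost:.0f} instruction slots at the same latency")
print(f"Latency needed to break even: {required_latency_ns:.1f} ns "
      f"({100 / IPC_GAIN:.1f}% of the original)")
```

This ignores caches, prefetching and overlapping misses entirely, so it only illustrates where the two-thirds ratio comes from, not how a real core behaves.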
Also, datapaths are twice as wide so that'll buy you a lot on throughput workloads (FP and media). The extra bandwidth can also be used for more aggressive prefetch where you effectively trade bandwidth for lower latency
Cheers
Blazkowicz
11-Oct-2012, 11:25
Dhrystone? Here's an extremely old benchmark, that not only can be abused with compilation optimisation (thanks wikipedia) but will also typically entirely fit in L1. Nowadays mobile CPUs have become like PCs of the past 15 years with a hierarchy of L1, L2 and memory with a huge relative latency, so you're not testing real performance and don't even have an excuse for it.
Dhrystone? Here's an extremely old benchmark, that not only can be abused with compilation optimisation (thanks wikipedia) but will also typically entirely fit in L1. Nowadays mobile CPUs have become like PCs of the past 15 years with a hierarchy of L1, L2 and memory with a huge relative latency, so you're not testing real performance and don't even have an excuse for it.
Part of my point. ARM claiming a 50% IPC increase in Dhrystone tells you nothing.
Cheers
Exophase
11-Oct-2012, 16:09
Reorder window size inferred from 56 rename entries with 32 needed for architected state (int+fp).
Dispatch: Page 6 here (http://www.ruhr-uni-bochum.de/integriertesysteme/emuco/files/System_Level_Benchmarking_Analysis_of_the_Cortex_A 9_MPCore.pdf)
Although the diagram is confusing, it does say up to FOUR dispatches per cycle.
Size of physical register file/rename capability is not the same as reordering capability. Sandy Bridge, for instance, has an instruction window based on the size of its ROB (168 entries), not its integer PRF (144 entries) or floating point PRF (160 entries). You could have zero register renaming whatsoever and still provide reordering.
The document I linked is much more detailed than yours, and makes it pretty clear it doesn't have any true capability to dispatch four things in one cycle. The comment is probably counting folded branch resolution as dispatch, which is fair in the sense that it correlates to an instruction that was decoded and issued, but still not what most would consider true dispatch. But this is really nit-picking over details.
The amount of instructions in issue queues doesn't say anything about the rename capacity and hence the size of reorder window.
Sure it does. It's the issue queues that are scanned for instructions to dispatch each cycle. It is literally the pool from which eligible instructions are chosen and when it's full you can't add to the reordering capacity. Maybe you're confused by it being called a "queue." These queues are analogous to ROBs in other processors. ARM makes it very clear in the article I linked that the instruction window is dictated by the size and quantity of these queues.
Of course, since they don't have a unified scheduler, you generally won't come that close to actually utilizing the full reordering capacity, in general it'll probably be < 40 instructions.
When an instruction is renamed, it is allocated an entry in the commit queue. The only time I've seen the size of the commit queue mentioned was in comp.arch on usenet two years ago, where the number 40 was mentioned.
You will see that the instructions are issued to the issue queues after renaming. The number 40 probably came from someone multiplying the 5 clusters by 8 instead of the 8 pipelines (the document I linked indicates that this is the partitioning of the queues)
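The two competing totals are easy to reproduce. A quick sketch using only the figures quoted in this exchange (8 slots per queue, 8 pipelines, 5 clusters); the variable names are made up for illustration:

```python
# Issue queue capacity for Cortex-A15 as described in the posts above.
SLOTS_PER_QUEUE = 8
CLUSTERS = 5

# Pipelines as listed in the thread; one issue queue feeds each pipeline.
PIPELINES = {
    "simple ALU": 2,
    "branch": 1,
    "MUL": 1,
    "load": 1,
    "store": 1,
    "NEON/VFP": 2,
}

issue_queues = sum(PIPELINES.values())

# Counting one 8-slot queue per pipeline gives the larger figure.
entries_per_pipeline_count = issue_queues * SLOTS_PER_QUEUE

# Counting one 8-slot queue per cluster gives the smaller figure.
entries_per_cluster_count = CLUSTERS * SLOTS_PER_QUEUE

print(f"{issue_queues} queues x {SLOTS_PER_QUEUE} slots = {entries_per_pipeline_count}")
print(f"{CLUSTERS} clusters x {SLOTS_PER_QUEUE} slots = {entries_per_cluster_count}")
```

Whether the real limit is the issue queues or a separate commit queue is exactly what is being argued here; the sketch only shows how both 64 and 40 can fall out of the same published numbers.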
No, I wasn't aware of that. I'm surprised Apple doesn't market it as a 1.3GHz processor then.
Since when has Apple ever marketed the MHz of anything?
Nobody, outside of ARM, knows much about A15.
Did you even read the document I linked? It's far more detailed than any Cortex-A9 document out there! It's also more detailed than most descriptions Intel or AMD has given for their CPUs. You can find some more information in the publicly visible TRM (like various buffer/cache sizes/associativities).
Dhrystone runs close to the maximum of what the execution core of the CPU is capable of. A real workload is not fully contained in D$ and you then have to contend with memory latencies.
Yes, we both agree on this.
The A15 can execute 50% more instructions per cycle. That also implies that the latency of a memory operation grows by 50% measured in instructions even if the number of cycles stays the same. In order to get a perfect 50% speedup you'd need to reduce main memory latency to 66%.
It can decode 50% more instructions per cycle. It can fetch 100% more instructions per cycle. It can dispatch at least 100% more instructions per cycle. Its general branch misprediction penalty is larger but its mispredict rate is better. Its loop buffer lets it bypass fetch and most of decode stages, and is probably more capable than Cortex-A9's (larger, can handle two forward branches with unknown predict capability). It can execute loads and stores in parallel. It has wider reordering capability. It has better prefetchers. It can predict indirect branches better than by just using the last thing in the BTB. It can perform shifts and ALU operations in parallel. If I'm reading things right, the load-use latency is generally one cycle where Cortex-A9 is often two. Its L2 is more tightly coupled meaning lower latency in addition to twice the interface width. It has a bigger TLB hierarchy and new partitioning to include both load and store DTLBs.
Taking all that and putting it into a simplistic equation saying that it must need 66% lower MAIN memory latency to achieve 50% better perf/clock on average is a total farce. I don't know what you're doing here. You find out the performance by benchmarking it, but right now the best thing to go on is ARM's claim that it'll get 50% better performance.
Can the A15 do that? Possibly, the tests I've seen of A9 shows a 200ns main memory latency, so there is certainly room for improvement.
You will find that the numbers vary a lot based on which SoC we're talking about, which makes sense since the processor isn't responsible for the rest of the memory interface.
You'll also find that despite some SoCs having main memory latencies over 50% better than others they don't usually get a huge boost in performance. Cortex-A15 is less sensitive to main memory latency than Cortex-A9 (I'm not claiming how much, but it's definitely less). Do I have to explain why?
Part of my point. ARM claiming a 50% IPC increase in Dhrystone tells you nothing.
I feel like you're not listening to me. ARM claimed 40% higher Dhrystone scores at the same MHz. They claimed 50% higher integer performance in general, again at the same MHz. The latter was not about Dhrystone. They haven't explained it further but some other charts imply this number is from SPEC.
ninelven
11-Oct-2012, 20:31
Eh, actual A-15 hardware will be out soon enough... I am content to wait for real world results.
Size of physical register file/rename capability is not the same as reordering capability. Sandy Bridge, for instance, has an instruction window based on the size of its ROB (168 entries), not its integer PRF (144 entries) or floating point PRF (160 entries). You could have zero register renaming whatsoever and still provide reordering.
The number of rename entries determines how many results you can rename and thus how many instructions you can have in flight. I never claimed the physical register file size had anything to do with it other than that you need rename entries to map to non-speculated state.
The document I linked is much more detailed than yours, and makes it pretty clear it doesn't have any true capability to dispatch four things in one cycle. The comment is probably counting folded branch resolution as dispatch, which is fair in the sense that it correlates to an instruction that was decoded and issued, but still not what most would consider true dispatch. But this is really nit-picking over details.
The document you linked clearly states, on page 7, four instructions can be dispatched per cycle, the diagram clearly shows 4 arrows to exec pipes: Two integer, one LS and one FP/NEON. On page 14 the diagram shows three arrows and one to the branch unit, so you may very well be right. To me, it isn't clear at all.
Sure it does. It's the issue queues that are scanned for instructions to dispatch each cycle. It is literally the pool from which eligible instructions are chosen and when it's full you can't add to the reordering capacity. Maybe you're confused by it being called a "queue." These queues are analogous to ROBs in other processors. ARM makes it very clear in the article I linked that the instruction window is dictated by the size and quantity of these queues.
That would make the issue queues equivalent to reservation stations/local ROBs like we know from OOO x86 CPUs.
Without a global scheduler the OOO capabilities are much more limited than an equivalent x86 implementation. A simple integer rich workload with a few loads missing D$ sprinkled in could effectively limit the number of instructions in flight to the size of the int issue queues: 16 entries.
AFAICT, if you're right, the only way to get anywhere near the maximum number of instructions in flight is FP/NEON code. There is always a surprising amount of integer chores in FP codes and that way most of the issue queues could be filled (or at least see any action).
You will see that the instructions are issued to the issue queues after renaming. The number 40 probably came from someone multiplying the 5 clusters by 8 instead of the 8 pipelines (the document I linked indicates that this is the partitioning of the queues)
Since all instructions except branches and nops produce a result (branches do too in ARM, since the PC is a general register, but I expect it to be special cased), the number of instructions in flight is limited by the number of entries in the commit queue, where results sit until speculated state is resolved (branches). That queue has 40 entries (read in an ARM document, linked to in a usenet post in November 2010; the ARM document is now nowhere to be found.)
Did you even read the document I linked? It's far more detailed than any Cortex-A9 document out there! It's also more detailed than most descriptions Intel or AMD has given for their CPUs. You can find some more information in the publicly visible TRM (like various buffer/cache sizes/associativities).
I did. It is not only far more detailed than any Cortex-A9 document, it is also much more confusing than any document detailing micro architecture I've ever seen from AMD or Intel.
The commit queue looks like a data-full ROB, but it claims to be a PRF OOO implementation. The OOO capabilities look to be ample except they are limited by the issue queue sizes.
BTW. This is off topic for this thread, move it ?
Cheers
UniversalTruth
10-Nov-2012, 20:33
Intel to Merge Xeon and Itanium in 2015-2017 (http://vr-zone.com/articles/intel-to-merge-xeon-and-itanium-in-2015-2017/17781.html)
Ivy Bridge (Core i3/i5/i7) debuted in 2012
Haswell (Core i3/i5/i7) will debut in early 2013
Ivy Bridge-EP (Xeon E3/E5) should arrive in mid-2013
Ivy Bridge-E (Core i7) debuts in late 2013
Ivy Bridge-EX for critical servers (Xeon E7) debuts in late 2013
Broadwell (Core i3/i5/i7) should ship in early 2014
Haswell-EP (Xeon E3, E5) should ship by mid 2014
Haswell-E (Core i7) debuts in late 2014
Haswell-EX (Xeon E7) is planned for late 2014
Broadwell-EP (Xeon E3 / E5) is planned for mid 2015
Broadwell-E (Core i7) arrives in late 2015
Broadwell-EX (Xeon E7) is planned for late 2016
The new socket could be the one you already know - according to some sources, Intel plans to re-wire the LGA-2011 for Haswell/Broadwell, making it incompatible with Sandy Bridge/Ivy Bridge-based products. The rewiring isn't being done to support new architectures, but rather to provide more power - according to documents we saw, Intel plans to introduce 150W and up to 180W parts when Haswell and Broadwell architectures enter the cut throat server business.
Hmm, sounds very nice. :razz: 180 W CPU, I want for my desktop machine. :mrgreen:
Merging them is highly inaccurate. Merging the support system(Socket, perhaps chipset, etc) is accurate. We won't be seeing itaniums on our PCs, and for good reason.
Blazkowicz
11-Nov-2012, 03:14
It could allow an x86 in one socket and an itanium in another, assuming you would want to do that.
Merging them is highly inaccurate. Merging the support system(Socket, perhaps chipset, etc) is accurate. We won't be seeing itaniums on our PCs, and for good reason.
I haven't read the linked article (yet), but I assume this would be preparation for a move to (relatively) painlessly kill off itanium, since that product is dead anyway.
So, the day intel finally pulls the plug on itanium, customers could drop in x86 chips there instead.
Frankly I don't know why it took intel so long. Back in 2006 roadmaps suggested that Xeons and Itanics will use the same chipsets in the future and ultimately boards could support both chips (I dunno what happened with the "same chipsets" but up to now at least the sockets obviously ended up different). Remember QuickPath was initially known as CSI ("Common System Interface").
Blazkowicz
12-Nov-2012, 04:29
Intel has always done minimum service regarding socket compatibility, they had three generations of socket 370 and four of socket 775, each time the motherboards were backwards compatible but never forward compatible (millions of computers are stuck with a pentium 4 and can't get a Core 2 Celeron).
Or there's Socket 1156 and 1155, where everyone has forgotten what the new socket brought to the table already.
Intel is opportunist, they won't care about breaking compatibility if that means the CPU will use 1% less power or something. They are also good at pushing a new platform in the distribution channels. They care more about deadlines and such.
Intel is opportunist, they won't care about breaking compatibility
They only don't care because they don't have to. If they hadn't had a virtual monopoly on PC processors there's no way they so casually could shut out their entire existing market with repeated new platforms/sockets that bring only minimal (or even no) improvements.
This may change in the future as stationary computers are being increasingly encroached upon by mobile platforms. CPU sockets may in fact not even survive the end of this decade.
http://techreport.com/news/23885/leak-slides-spill-haswell-chipset-details
Interesting...
http://techreport.com/news/23885/leak-slides-spill-haswell-chipset-details
Interesting...
it's a shame that again, apart from the PEG the other PCIE ports are still 2.0...
but, correct me if I'm wrong, they are saying that what is today the "PCH" is going to be on the same package as the CPU?
tunafish
16-Nov-2012, 02:00
but, correct me if I'm wrong, they are saying that what is today the "PCH" is going to be on the same package as the CPU?
Yes. But so far, there is information only about the Lynx Point LP, or low power, model. I'd expect this to be in laptops.
"Intel’s Haswell CPU Microarchitecture" by David Kanter (http://www.realworldtech.com/haswell-cpu/)
Intel has indeed pushed the "mass-market state of the art" forward in many fronts at once. It would be truly sad if it turns out that the mass-market needs peak with dual-core consumption devices, which rules out future Haswell-like big jumps.
It would be truly sad if it turns out that the mass-market needs peak with dual-core consumption devices, which rules out future Haswell-like big jumps.
Let's be realistic - heavy computing capability in a CPU is only necessary for those who actually do heavy computing. It's like expecting everyone to buy cars that can pull off competitive times at a dragracing strip - unrealistic! Not all that sad, really. It's simply reality.
Let's be realistic - heavy computing capability in a CPU is only necessary for those who actually do heavy computing. It's like expecting everyone to buy cars that can pull off competitive times at a dragracing strip - unrealistic! Not all that sad, really. It's simply reality.
Starting from Pentium era, up until tablet/smartphone revolution, the CPUs for the regular mass market have also served the heavy computing market (from Intel/AMD standpoint). Most of the x86 users are currently indeed riding the "almost dragsters".
Yeah, because it made sense from many perspectives to have it work this way, but with Moore's law finally starting to hit the ceiling things are changing - and x86 CPUs are so much more powerful than what the average guy needs anyway it's silly.
When shrinking nodes don't bring any appreciable savings in cost per transistor anymore there's little room to improve performance anyway.
Blazkowicz
16-Nov-2012, 18:23
Starting from Pentium era, up until tablet/smartphone revolution, the CPUs for the regular mass market have also served the heavy computing market (from Intel/AMD standpoint). Most of the x86 users are currently indeed riding the "almost dragsters".
I'd put it at the Pentium Pro, which was indeed the fastest CPU on earth at its launch, on par with Alpha. Then the design scaled up to Pentium 3 1GHz and higher, and Athlon matched it.
Nowadays there's only Sparc supercomputers, POWER7 mini-computers and Z mainframes competing with the desktop PC :razz:
http://www.xbitlabs.com/news/cpu/display/20121226225930_Intel_s_Haswell_to_Feature_Secrete_ Weapon_Integrated_Voltage_Regulator.html
Intel Corp.’s next-generation code-named Haswell microprocessors will not only improve performance and feature some tricks to lower power consumption, but will also feature a secret weapon: integrated voltage regulator module (VRM). The latter will allow to improve granularity of power supply to central processing units and thus further cut power consumption without compromising performance.
3dilettante
27-Dec-2012, 17:09
To nitpick, I don't think that was really a secret.
Intel demonstrated an early concept of integrated CMOS VRM 7 years ago, so yeah, not exactly a secret. ;)
I'm not really a reddit fan, but this is amazing: http://www.reddit.com/r/IAmA/comments/15iaet/iama_cpu_architect_and_designer_at_intel_ama/
great thread, this has me intrigued
You'll see many things, especially with Broadwell, that you cannot do without owning a fab.
Squilliam
28-Dec-2012, 05:56
What will motherboard makers actually include on their boards after Intel is finished integrating everything?
entity279
28-Dec-2012, 17:04
This "secret" was well known for more than 6 months.
shiznit
29-Dec-2012, 01:06
great thread, this has me intrigued
You'll see many things, especially with Broadwell, that you cannot do without owning a fab.
HMC on interposer? It would explain the BGA only rumor...
rpg.314
29-Dec-2012, 03:04
That should be a packaging thing, hence fab agnostic.
3dilettante
29-Dec-2012, 04:08
Intel's been doing its own thing with packaging as well.
LGA is something they've been able to do, as well as some of the on-package memory for their mobile chips.
They've been able to push multi-chip packages to the consumer level in a way AMD has been unable to match.
There are a few other things that didn't make it to market, like bumpless build-up layer packaging, but Intel's quite formidable in that regard as well.
The increasing levels of integration and interposer tech might be blurring some of the lines as well.
As far as owning the fab goes, it sounds like Broadwell and friends are going to be tweaking the silicon even more extensively. This is at least partly out of necessity, since even the foundries are admitting they can't build anything decent without bringing the designers into the process.
Intel and AMD x86 has a longer history with that kind of integration, so it sounds like whenever the foundries make it to ten years ago, Intel's going even further.
rpg.314
30-Dec-2012, 03:57
Possible Broadwell fab-only exclusives
- http://translate.google.com/translate?hl=en&sl=&tl=en&u=http%3A%2F%2Fpc.watch.impress.co.jp%2Fdocs%2Fcol umn%2Fkaigai%2F20121226_580154.html
- http://translate.google.com/translate?depth=1&hl=en&rurl=translate.google.com&tl=en&twu=1&u=http://pc.watch.impress.co.jp/docs/column/kaigai/20121227_580258.html
Voltage regulators that are on package in Haswell could come on die in Broadwell. If this is true, I am guessing the power granularity will be reduced and the frequency of operation would be increased.
Silent_Buddha
08-Jan-2013, 02:07
Interesting, this...
http://www.dailytech.com/article.aspx?newsid=29584
Seems to imply that Intel is showing off a 13" reference design using Haswell that is getting 13 hours of battery life in laptop form and 10 hours in tablet mode (detached from keyboard).
That is pretty impressive for the Core architecture chips.
Granted, there's no info from there on the specs of the reference design (how large is the battery? what's the weight of the tablet portion? how fast is the CPU running? etc.)
Regards,
SB
eastmen
08-Jan-2013, 02:11
really wish this was ready for the surface pro. I might have to rebuy next year
Pressure
08-Jan-2013, 03:14
Yeah, Ivy Bridge down to 7W TDP, impressive.
entity279
08-Jan-2013, 20:58
It's 10W TDP, as per Hardware.fr.
10W is still hellishly high for a tablet, but still mighty impressive for a core-series CPU.
Silent_Buddha
09-Jan-2013, 06:09
I wouldn't say hellishly high. Acer with some help from Intel is using a regular Core i5 3317U (17W TDP) in their W700 and still manages to get ~7 hours of video out of that. Which is ~2-3 hours more than most ultrabooks. All in a tablet that weighs ~2.1 lbs (it's lighter than my 11.6" Win7 based Atom tablet, ~2.25 pounds, which gets less battery life).
At 10W TDP, that should allow them to increase battery life and reduce the weight.
Granted the benefits will be greatest when compared to Ultrabooks and other vendor tablets. I suspect Intel has had Acer use very specific components in the W700 in order to reduce power consumption. Things that other OEMs will eventually benefit from as well.
Regards,
SB
At 10W TDP, that should allow them to increase battery life and reduce the weight.
TDP won't affect battery life in the non demanding loads though. It's only if you are stressing it, which isn't what video playback does.
The W700 has a 54WHr battery, and 7 hours mean 7.7W average power use. That's not very far from idle.
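The average-power figure is just capacity divided by runtime. A small sketch using the 54 Wh and 7 hour numbers above; the helper names and the 1 W saving in the second half are hypothetical, added only to show how sensitive runtime is to average draw:

```python
# Average platform power from battery capacity and observed runtime.
BATTERY_WH = 54.0     # W700 battery capacity quoted above
RUNTIME_HOURS = 7.0   # video playback runtime quoted above


def average_power_w(battery_wh, hours):
    """Average draw in watts needed to empty the battery in the given time."""
    return battery_wh / hours


def runtime_hours(battery_wh, avg_power_w):
    """Runtime in hours for a given average draw."""
    return battery_wh / avg_power_w


avg_w = average_power_w(BATTERY_WH, RUNTIME_HOURS)
print(f"Average draw: {avg_w:.1f} W")

# Hypothetical: shave 1 W off the average draw and see what the runtime becomes.
HYPOTHETICAL_SAVING_W = 1.0
longer = runtime_hours(BATTERY_WH, avg_w - HYPOTHETICAL_SAVING_W)
print(f"With {HYPOTHETICAL_SAVING_W:.0f} W less on average: {longer:.1f} hours")
```

Since the whole platform averages under 8W here, a lower TDP ceiling on the CPU changes little; a lower floor at idle is what moves the number.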
The new Clover Trail(Z2760) Atoms double the battery life over previous Atoms(like the one you have on the Tablet), mostly because the idle power management is improved by leaps and bounds.
Albuquerque
09-Jan-2013, 19:42
In fairness, much of Haswell's additional component integration snippets surround major parts of power consumption -- a ton of it around overall chipset consumption and idle characteristics. So again, with TDP having little to no bearing on actual battery life, Haswell has a significant chance of giving us fantastic battery life enhancement.
eastmen
23-Jan-2013, 04:12
I would think Broadwell is where we will see major changes in power usage. Even Anand has mentioned that the increase in performance with Haswell comes with a higher TDP for those parts .
I would think Broadwell is where we will see major changes in power usage. Even Anand has mentioned that the increase in performance with Haswell comes with a higher TDP for those parts .
But as Albuquerque said, TDP doesn't matter much to battery life, it's all about idle power.
Albuquerque
24-Jan-2013, 15:23
Part of the reason that Haswell might have a higher TDP is their now fully integrated VRM circuitry as part of the package. At full load, the CPU is now going to put out quite a bit more heat simply due to the need for VRM cooling.
Now look at it from an idle perspective; with all that power control directly on-package, you've potentially one-upped the existing power gating capabilities. If your processor directly owns the VRM, imagine simply never even delivering the power across the substrate to an unneeded block, but instead turning it off at the source. You no longer have to suffer any leakage questions at the block that is gated; rather you can leave it at the VRM and call it done.
And for blocks that still need power? The capability to have as many as 320 VRM phases per CPU package means a ridiculous level of granularity and control for all the varying load levels that need to be supported at utmost efficiency. Only running at 5% load? Running at 105% load? The VRM is quite likely to be highly gated itself, so that only the necessary number of phases to deliver the required power need to be active at a given time. Potentially shutting off 19/20ths of your VRM because the CPU is twiddling thumbs makes for some righteous power savings at near-zero load conditions.
If I were a betting man, I would say the most significant portion of the touted "20x reduction in idle power draw over IVB**" claims from Intel can be directly laid at the feet of the integrated VRM.
** What the shit is a 20x reduction? Shouldn't this be 1/20th of the power draw? Or did they reduce it twenty times? I could be cynical and suggest that each time they reduced it, it was by 1/1000th of a watt. I hate stupid people who do this "5x slower" or "20x less" -- you don't multiply by a greater-than-one positive integer to achieve a result of less than you started. ARRRG
eastmen
24-Jan-2013, 17:29
Well the claims on the slide are against sandy bridge, not ivy bridge. The cynic in me would even say it could be tech older than sandy bridge since the slide only states core i5 2011 vs 2013 targets. Unless there is a new slide.
http://images.anandtech.com/doci/6290/Intel%204th%20Gen.JPG
Anyway like I said while a reduction in idle power is nice I would rather wait for the die drop + new gpu .
Think about it, the surface is out in two weeks. Haswells won't even ship till Q3 according to anand so the earliest I think we'd see a refresh is Q4. I'm sure the same time in 2014 broadwell will be ready. So you can wait almost a year now for a good x86 tablet and then have a much better chip come out a year after that. Or you can buy now and enjoy the tablet for a good 8-10 months before haswell comes and then continue using your tablet for 10-12 months till broadwell comes.
I'm choosing 2013 ivybridge then holiday 2014 / early 2015 broadwell. I can understand some people wanting to wait.
Remember I'm going from a 13.3 inch dual core neo at 1.6ghz with a 3200hd igp and a hd4330 dedicated. I get 3 hours of surfing the web if i'm lucky and maybe an hour and a half gaming. So the i5 will be a huge step up for me. Plus my gf gets my surface when I upgrade to the broadwell one so everyone will be happy.
Anyway intel is starting the 14nm transition and that's what broadwell will be on
http://www.xbitlabs.com/news/other/display/20130118154034_Intel_on_Track_to_Start_Making_14nm _Broadwell_Chips_in_2013.html
So I would expect early 2014 as the date they start rolling out for end users to buy.
Albuquerque
24-Jan-2013, 18:17
Ah, I see what you're saying. 2011 matches up with the volume production of Ivy Bridge, but actual shipments didn't start until 2012. The quotations floating around the 'net suggest that Intel called out Ivy Bridge in the presentation, I suppose in a verbal way. But the slide you have there is (purposefully?) non-specific in that regard.
There is also some conversation around the power reduction being linked only to standby C3 sleep state rather than a C0 (active) P-state (what we would consider power-on-idle). Your slide seems to indicate that we're comparing C3 sleep states, which is a different type of conversation entirely.
I guess the reality is "Wait and see!"
eastmen
25-Jan-2013, 02:44
There is also this one
http://images.anandtech.com/reviews/cpu/intel/Haswell/Architecture/poweroptimizer2sm.jpg
They clearly state that the 2nd gen (sandy bridge) cores are the baseline against which they are comparing the 4th gen parts. Ivybridge is 3rd gen.
I see what you're saying about the different states however.
Albuquerque
26-Jan-2013, 07:54
Yup, that slide obviously settles that question -- we're talking 32nm Sandy versus 22nm Haswell. In that regard, the ~20% power reduction claim seems paltry to me. I cannot specifically understand why Intel thinks we should be excited for that...
The S3 power savings would be very useful for devices spending large quantities of time in standby, especially combined with claims of significant latency reduction in transitions between S0 and S3. It strikes me as very "phone like", in that I would expect a tap of the power button to provide nearly instantaneous response. Nevertheless, I am skeptical of significance of that advantage on tablet battery life.
I had CDW send me a Samsung ATIV tablet (the version equipped with a core i5, 4GB ram, Win8 Pro 64, digitizer and capacitive 1080P display) for a 30 day eval. As a bit of a Microsoft apologist from time to time, I have defended the Surface Pro's quoted battery life to several other naysayers in my office and in my social network by pointing out that it's a 1080P ultrabook with the addition of a capacitive touchscreen AND digitizer. As such, any rational person should expect a four or five hour battery life.
True to my own expectations, Samsung's version of the Surface Pro did indeed repeatedly survive for four and five hours per charge while using it as my regular PC. Unfortunately as a tablet device, four or five hours really never satisfied me even though my normal Thinkpad laptop likely would never do any better either. Ultimately, the 30 day eval (I give it back next week) convinced me that Surface Pro may not have as much use in my environment as I had hoped.
In light of what you're showing me, I'm far less convinced that Haswell will "save" the battery life situation.
Silent_Buddha
27-Jan-2013, 08:10
In light of what you're showing me, I'm far less convinced that Haswell will "save" the battery life situation.
That would be true for the notebook targeted Mobile-M chips, but the Mobile-U chips are likely the ones targeted at tablets. In that case, even taking the 20x claims with a large helping of salt, that should increase things significantly.
Regards,
SB
eastmen
28-Jan-2013, 03:34
The battery life will get better, that's for sure, but we won't go from 5 hours to 10 hours. We might get another 1-2 hours out of it, hopefully.
Like I said, I hope that Broadwell brings a big reduction in power; we do get a process shrink with it, a new GPU design and hopefully some improvements to the core designs.
The 5 hour battery life isn't too bad. The charger should be fast at 48 W; it's not a trickle like an iPad.
The Surface RT has a 24 W power supply with a 32 Wh battery and takes 2 hours to charge. The Surface Pro has a 48 W supply and a 42 Wh battery, so it seems like charging times won't be bad at all. In comparison, the iPad 3 takes 5 hours to charge.
http://media.bestofmicro.com/V/A/360118/original/recharge90.png
I believe the surface pro can also charge through usb so perhaps some of those external battery packs can be used too.
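As a rough sanity check on the charging figures quoted in this post, here is a minimal sketch that divides battery capacity by charger power. It assumes the Surface Pro's "42" figure is a battery rating in Wh, and it ignores conversion losses and the slower top-off phase, so the results are lower bounds rather than predictions. The function name is made up for illustration.

```python
def ideal_charge_hours(battery_wh: float, charger_w: float) -> float:
    """Lower bound on charge time: capacity divided by charger power.

    Ignores conversion losses and the slow top-off phase near full charge.
    """
    return battery_wh / charger_w


# Figures as quoted in the post (the 42 is assumed to be Wh, not W).
surface_rt = ideal_charge_hours(32, 24)
surface_pro = ideal_charge_hours(42, 48)

# The post quotes 2 hours for the Surface RT; use that to estimate overhead
# and apply the same ratio to the Surface Pro (a guess, not a measurement).
overhead = 2.0 / surface_rt
surface_pro_estimate = surface_pro * overhead

print(f"Surface RT ideal:     {surface_rt:.2f} h")
print(f"Surface Pro ideal:    {surface_pro:.2f} h")
print(f"Surface Pro estimate: {surface_pro_estimate:.2f} h")
```

Under those assumptions the Surface Pro should charge faster than the Surface RT despite the larger battery, which is the point being made above.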
Albuquerque
28-Jan-2013, 16:25
That would be true for the notebook targeted Mobile-M chips, but the Mobile-U chips are likely the ones targeted at tablets. In that case, even taking the 20x claims with a large helping of salt, that should increase things significantly.
Regards,
SB
Yeah, but that "20x" claim was specifically linked to sleep state, not active-idle. That's the piece I was missing until eastmen posted up that slide earlier. I agree there will be power savings, but my estimate would be quite close to eastmen's hour or maybe two rather than a doubling. A 20% increase is "ok" for sure, but 20% on top of 4 or 5 hours is still a little short for a device that wants to be a tablet.
IMO, of course.
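The arithmetic behind "20% on top of 4 or 5 hours" can be spelled out in a few lines. This is only a sketch of the estimate in the post: the 20% gain is the figure under discussion, and the function name is hypothetical.

```python
def extended_runtime(base_hours: float, gain: float) -> float:
    """Battery life after a fractional improvement (0.20 means +20%)."""
    return base_hours * (1.0 + gain)


GAIN = 0.20  # the power-reduction figure discussed above, taken at face value

for base in (4.0, 5.0):
    new = extended_runtime(base, GAIN)
    print(f"{base:.0f} h -> {new:.1f} h (+{new - base:.1f} h)")
```

Doubling the runtime would need a gain of 1.0 in this model, which is why a 20% improvement lands nearer the "hour or maybe two" estimate.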
Yeah, but that "20x" claim was specifically linked to sleep state, not active-idle.
It won't be 20x reduction, due to the screen, but I know they got big potential there.
I have both a 5-inch Menlow device and a 10.1-inch Clover Trail device. The former uses a MINIMUM of 3.5W at idle with the screen on, while on the Clover Trail device I've seen it at 1.4W. Despite having similar TDP, the high idle power effectively limits how much battery life it can get. That was the issue in the pre-"new power management" days: even if the TDP was zero for the chip, the improvement in battery life would have been minimal.
The latter uses less power playing back videos on YouTube than the former does when mostly idle!
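To illustrate why the idle floor matters more than peak chip power, here is a small model using the 3.5 W and 1.4 W idle figures from this post. The battery capacity, the extra active power and the duty cycle are all hypothetical values picked for illustration, not measurements of either device.

```python
def runtime_hours(battery_wh: float, idle_w: float,
                  extra_active_w: float, duty: float) -> float:
    """Runtime when the platform idles at idle_w and spends a fraction
    `duty` of the time drawing an additional extra_active_w."""
    average_w = idle_w + duty * extra_active_w
    return battery_wh / average_w


BATTERY_WH = 30.0   # hypothetical battery capacity
EXTRA_W = 2.0       # hypothetical extra draw when the chip is busy
DUTY = 0.10         # hypothetical: busy 10% of the time

high_idle = runtime_hours(BATTERY_WH, 3.5, EXTRA_W, DUTY)
high_idle_free_chip = runtime_hours(BATTERY_WH, 3.5, 0.0, DUTY)
low_idle = runtime_hours(BATTERY_WH, 1.4, EXTRA_W, DUTY)

print(f"3.5 W idle:                  {high_idle:.1f} h")
print(f"3.5 W idle, zero-power chip: {high_idle_free_chip:.1f} h")
print(f"1.4 W idle:                  {low_idle:.1f} h")
```

With these made-up numbers, making the chip's active power free barely moves the runtime on the high-idle platform, while lowering the idle floor more than doubles it.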
Albuquerque
15-Feb-2013, 03:35
Well, Intel never made claims about the entire device using "20x less power" -- and again, how the hell do you multiply a value by a whole positive number and end up with a smaller number? I hate that shit. Anyway, you're of course exactly right -- the biggest power draw will be the screen, exacerbated by the fact that it's a touch screen.
Well, Intel never made claims about the entire device using "20x less power" -- and again, how the hell do you multiply a value by a whole positive number and end up with a smaller number? I hate that shit. Anyway, you're of course exactly right -- the biggest power draw will be the screen, exacerbated by the fact that it's a touch screen.
The Atom Z2760 Clover Trail Tablet I was talking about also uses touchscreen, and is a Windows 8 device.
1.4W with screen-on idle. There are no other x86 platforms that can match that level of power use. Platform power itself probably did decrease by an order of magnitude compared to previous Atom platforms.
Any news about Haswell and chipsets supporting it ? My Q6600 is getting old in games and video apps like Handbrake :(
UniversalTruth
27-Feb-2013, 11:00
Haswell is on a release schedule for June 2013. So, you have to wait till then to see what's going on...
But I am almost in the same situation as you with Q9450 overclocked a little bit, and capable to run Crysis 3 decently at everything Medium, and with some slight/ negligible issues when half of the settings are at High.
So, I would miss Haswell and go for upgrade anytime maybe in 2-3 years time, also waiting to replace my DDR3 1333 with something at least DDR4 3000. :grin:
*3770K supports officially only DDR3 1600, so I am still in the waiting for a decent upgrade. :grin:
Haswell is on a release schedule for June 2013. So, you have to wait till then to see what's going on...
But I am almost in the same situation as you with Q9450 overclocked a little bit, and capable to run Crysis 3 decently at everything Medium, and with some slight/ negligible issues when half of the settings are at High.
So, I would miss Haswell and go for upgrade anytime maybe in 2-3 years time, also waiting to replace my DDR3 1333 with something at least DDR4 3000. :grin:
*3770K supports officially only DDR3 1600, so I am still in the waiting for a decent upgrade. :grin:
I'm personally waiting for Hybrid Memory Cube.
http://www.cadence.com/Community/blogs/ii/archive/2012/09/19/memcon-keynote-why-hybrid-memory-cube-will-revolutionize-system-memory.aspx
Might be a high end option in 2-3 years.
eastmen
27-Feb-2013, 14:16
I thought broadwell was to be the first ddr 4 intel chip out. Not haswell.
I thought it would be (at least) the chip AFTER broadwell... DDR4 will come to intel server chips first, starting next year at the earliest IIRC.
Yeah, we're not getting DDR4 until 2014 or 15. Even then, it'll be incredibly expensive compared to its predecessor until probably 2017 if the pattern for DDR2 and DDR3 holds...
http://www.tomshardware.com/reviews/core-i7-4770k-haswell-performance,3461.html
http://i.imgur.com/nMUumxT.png
:runaway:
It does look really nice in that case. Optimized to the max for it no doubt.
Exophase
18-Mar-2013, 20:35
It does look really nice in that case. Optimized to the max for it no doubt.
Indeed, but Mandelbrot set is super synthetic, I don't think there are many real world full problems that are similar. Its kernel is mostly multiplications and since each pixel is independent it's easy to parallelize. Note that Haswell improves FP throughput not just by adding FMA but by being able to simultaneously execute FMUL + FMUL instead of just FMUL + FADD like its predecessors.
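For reference, this is what the inner loop of a Mandelbrot test typically looks like. It is a sketch of the standard escape-time formulation, not the code of the benchmark in the review; the function name and default iteration limit are arbitrary choices.

```python
def escape_iterations(cx: float, cy: float, max_iter: int = 256) -> int:
    """Iterate z = z*z + c for one pixel; return the step at which |z| > 2."""
    x = y = 0.0
    for i in range(max_iter):
        xx = x * x              # multiply
        yy = y * y              # multiply
        if xx + yy > 4.0:       # |z|^2 > 4 means the point has escaped
            return i
        xy = x * y              # multiply
        x = xx - yy + cx        # adds only
        y = 2.0 * xy + cy       # multiply feeding an add: an FMA candidate
    return max_iter


# Every pixel depends only on its own coordinates, so rows (or SIMD lanes)
# can be computed in any order or in parallel.
grid = [[escape_iterations(-2.0 + 0.5 * col, -1.0 + 0.5 * row, 64)
         for col in range(6)]
        for row in range(5)]
for row in grid:
    print(row)
```

The loop body is a handful of multiplies and adds with no memory traffic and no dependence between pixels, which is the property being described here.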
Yeah it's ultra synthetic. Reminds me of Core 2 Duo's blitzing of mandelbrot.
http://techreport.com/review/10351/intel-core-2-duo-and-extreme-processors/15
http://techreport.com/r.x/core2/sandra-mm-int.gif
Svensk Viking
18-Mar-2013, 22:14
Is it likely AVX2 will be used for games? Are today's games generally using the latest instruction sets?
trinibwoy
18-Mar-2013, 23:24
It does look really nice in that case. Optimized to the max for it no doubt.
Isn't Ivy optimized too? 78% faster at the same clocks is nothing to scoff at. That's pretty impressive.
Exophase
19-Mar-2013, 00:24
Isn't Ivy optimized too? 78% faster at the same clocks is nothing to scoff at. That's pretty impressive.
We've known for a long time that Haswell would have double the integer SIMD width and two 256-bit FMA units that are both capable of FMUL (and one of FADD), as opposed to one 256-bit FADD and one 256-bit FMUL unit. We should be very concerned if a simple test couldn't be derived to demonstrate the addition of these units.
Mandelbrot is extremely parallel (easy to populate all the SIMD lanes and easy to hide the latency of the operations) and has a fairly high ALU to memory ratio. It probably doesn't even need the improved L1 load/store bandwidth to take advantage of the increased ALU throughput.
That doesn't mean that plenty of useful software won't have a big benefit from the improvements but rarely anywhere close to this dramatic.
Is it likely AVX2 will be used for games? Are today's games generally using the latest instruction sets?
I'd be surprised if most games used anything but SSE2 or 3. Maybe if we're lucky, SSSE3.
Exophase
19-Mar-2013, 05:27
I'd be surprised if most games used anything but SSE2 or 3. Maybe if we're lucky, SSSE3.
Code using up to AVX (maybe just AVX128) is going to be useful and probably utilized on consoles. The question is how much work it is to dynamically dispatch between this and weaker fallbacks. Probably not that much. Code size isn't a huge deal.
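The dispatch idea can be sketched as a table of implementations ordered from best to weakest, with one selected at startup. This is a toy model: real code would query CPUID (or a compiler builtin) for the feature set, and the function names and feature strings below are hypothetical stand-ins for separately compiled SIMD code paths.

```python
from typing import Callable, List, Sequence, Set, Tuple

Vector = Sequence[float]
DotFn = Callable[[Vector, Vector], float]


def dot_sse2(a: Vector, b: Vector) -> float:
    """Stand-in for the baseline SSE2 code path."""
    return sum(x * y for x, y in zip(a, b))


def dot_avx(a: Vector, b: Vector) -> float:
    """Stand-in for the AVX code path (same result, faster in real code)."""
    return sum(x * y for x, y in zip(a, b))


# Ordered from most to least capable; the first supported entry wins.
IMPLEMENTATIONS: List[Tuple[str, DotFn]] = [
    ("avx", dot_avx),
    ("sse2", dot_sse2),
]


def select_dot(cpu_features: Set[str]) -> Tuple[str, DotFn]:
    """Pick the best implementation once, e.g. at program start."""
    for feature, impl in IMPLEMENTATIONS:
        if feature in cpu_features:
            return feature, impl
    raise RuntimeError("no supported implementation")


name, dot = select_dot({"sse2", "sse3", "avx"})
print(name, dot([1.0, 2.0], [3.0, 4.0]))
```

The cost is one indirect call per dispatched routine plus shipping both code paths, which fits the remark that code size is not a big concern.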
Actually a question here: how useful is AVX128 vs SSE 4.2?
Exophase
19-Mar-2013, 05:39
Actually a question here: how useful is AVX128 vs SSE 4.2?
The main feature is that it has three way addressing. I don't have a great idea how much it improves performance, but consider this: one of Ivy Bridge's most major changes is its ability to fold moves in the register renamer, a feature to get around the lack of three way addressing. But those moves still take up bandwidth on the front end (typically from the uop cache), so their impact is not totally removed. Sandy Bridge also had a fairly wide decode and lots of execution units lying around that could do moves. This isn't the case on Jaguar, where decode is one of the more narrow parts of the design and wasting a slot on a move will have a bigger impact. So I expect AVX128 on Jaguar to be more useful than move elimination was on IB. Yet the latter still had a measurable impact a lot of the time (I doubt the other change, some more dynamic buffers for single threaded scenarios, contributed as much).
So I think it'll probably at least be worth using, if compilers can take advantage of it well.
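To make the three way addressing point concrete, here is a toy lowering pass that turns three-operand instructions into destructive two-operand ones and counts the extra moves. The instruction tuples and register names are invented for illustration and do not correspond to any real assembler syntax.

```python
from typing import List, Tuple

ThreeOp = Tuple[str, str, str, str]   # (dst, op, src1, src2)
TwoOp = Tuple[str, str, str]          # (op, dst_and_src1, src2)


def lower_to_two_operand(program: List[ThreeOp]) -> List[TwoOp]:
    """Rewrite dst = src1 op src2 as an optional mov plus a destructive op."""
    out: List[TwoOp] = []
    for dst, op, src1, src2 in program:
        if dst == src2 and dst != src1:
            # Would need operand swapping or a scratch register.
            raise ValueError("sketch does not handle dst aliasing src2")
        if dst != src1:
            out.append(("mov", dst, src1))  # copy so src1 stays intact
        out.append((op, dst, src2))
    return out


# (a + b) * (a - b), keeping a and b live for later use.
program: List[ThreeOp] = [
    ("t0", "add", "a", "b"),
    ("t1", "sub", "a", "b"),
    ("t0", "mul", "t0", "t1"),
]

lowered = lower_to_two_operand(program)
moves = sum(1 for ins in lowered if ins[0] == "mov")
print(f"three-operand: {len(program)} instructions")
print(f"two-operand:   {len(lowered)} instructions ({moves} moves)")
```

Each of those moves occupies a decode slot even when the renamer eliminates it later, which is why a narrow front end would feel the difference more.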
Blazkowicz
19-Mar-2013, 06:38
I don't know how to take it, regarding relevance of AVX, but Intel disables it on all Pentium and Celeron processors, because, hum!, because they feel like it.
Hopefully they'll enable it on Haswell celeron and pentium and go on disabling the other features instead (clearvideo, quicksync, TxT etc.)
Game developers and middleware developers can do both binaries with SSE 2 or 3 and AVX as feature levels, probably (as says Exophase). Around 1997 we had Pod and Pod MMX (and worse, they'd be various hard-wired versions for a few proto-GPUs).
Since consoles CPUs have AVX it seems a given that most major game engines and gaming middleware will use it.
Intel's also disabling TSX on the 4770K model.
They can't even make the new instruction sets consistent on the top and middle end models anymore. What the fuck, man. Just what the fuck.
Blazkowicz
19-Mar-2013, 08:54
Ah yes, they do this for VT-d, ECC support and other fringe things (TxT, vPro). A non-K CPU enables back most of it, and for ECC you have to get down to core i3.
They do market segmentation at every step, on the high end they don't want you to get away with running a 4.5GHz CPU on the cheapest Z77 or Z87 motherboard and get all these "pro" features.
What's up with the L2 throughput measured in Sandra? Haswell is supposed to read a full cache-line in a single cycle from that region. Is there some special conditions, or it is something with the benchmark method?
Supposedly, it's issues with pre-launch software(motherboard drivers/UEFI/etc).
I don't know how to take it, regarding relevance of AVX, but Intel disables it on all Pentium and Celeron processors, because, hum!, because they feel like it.
Hopefully they'll enable it on Haswell celeron and pentium and go on disabling the other features instead (clearvideo, quicksync, TxT etc.)
Game developers and middleware developers can do both binaries with SSE 2 or 3 and AVX as feature levels, probably (as says Exophase). Around 1997 we had Pod and Pod MMX (and worse, they'd be various hard-wired versions for a few proto-GPUs).
Since consoles CPUs have AVX it seems a given that most major game engines and gaming middleware will use it.
That really is a daft decision. They don't even let you pay $50 bucks to enable it! (Forgot exactly what feature it was that Intel allowed you to do this for; it was either cache or virtualization or both.)
I don't know how to take it, regarding relevance of AVX, but Intel disables it on all Pentium and Celeron processors, because, hum!, because they feel like it.
Hopefully they'll enable it on Haswell celeron and pentium and go on disabling the other features instead (clearvideo, quicksync, TxT etc.)
I think there's some hope. For the last few generations, Intel just disabled the latest major SIMD version on Pentiums/Celerons - so from Core2 to Westmere, Celerons/Pentiums had SSSE3 (instead of 4.1/4.2), whereas SNB/IVB ones support SSE4.2 instead of AVX. Continuing that "tradition" would mean they'd have AVX but not AVX2 (and probably no FMA/F16C either).
Silent_Buddha
19-Mar-2013, 21:13
I'd be surprised if most games used anything but SSE2 or 3. Maybe if we're lucky, SSSE3.
AVX is likely to get used by games on the next gen consoles. So there's a pretty good chance that at least AVX 1 will get used. I'm not sure there'd be a good reason to then not use it on PC versions of the same games. This obviously only applies to AAA games. Games that need to address a wider audience (casual focused games) are unlikely to bother.
Regards,
SB
UniversalTruth
20-Mar-2013, 08:02
Why so early?
Intel Core i7-4770K "Haswell" Listed on Dutch Stores (http://www.techpowerup.com/181690/Intel-Core-i7-4770K-quot-Haswell-quot-Listed-on-Dutch-Stores.html)
Blazkowicz
20-Mar-2013, 08:10
So that you link to these stores, making them known :razz:
cal_guy
21-Mar-2013, 23:10
AVX is likely to get used by games on the next gen consoles. So there's a pretty good chance that at least AVX 1 will get used. I'm not sure there'd be a good reason to then not use it on PC versions of the same games. This obviously only applies to AAA games. Games that need to address a wider audience (casual focused games) are unlikely to bother.
Regards,
SB
Will they use the 256-bit AVX instructions that will actually increase the theoretical throughput of Sandy Bridge and up processors, or will they use 128-bit AVX instructions that are more of a match to Jaguar's SIMD units?
itsmydamnation
22-Mar-2013, 02:35
On Bulldozer there is about a 5% perf hit using 256-bit ops over the 128-bit ops, so my guess is that for consoles they will use the 128-bit ops.
Exophase
22-Mar-2013, 05:09
Whether or not the 256-bit AVX instructions make sense depends on whether or not Jaguar can decode them in one cycle using one decoder. If this is the case it can be a win to use them even if they require two uops to execute, since it'll free decode bandwidth which can be the tightest bottleneck in the Jaguar pipeline. I'm not sure if this will be the case or not.. it may depend on whether or not AVX instructions are split into two COPs up front or a single COP that gets split into two uops later in the pipeline.
On BD 256-bit AVX was worse than 128-bit because the decoder could only handle one per cycle, while it could handle four 128-bit ops (and therefore two 128-bit pairs).
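The decode argument can be put into numbers with a very small model. The Bulldozer rates are the ones stated in this post; the two-wide Jaguar decoder is taken from elsewhere in the thread, and whether a 256-bit op costs Jaguar one decode slot or two is exactly the open question, so both cases are shown. The workload size is arbitrary.

```python
import math


def decode_cycles(num_ops: int, ops_per_cycle: int) -> int:
    """Cycles needed to decode num_ops at a fixed per-cycle rate."""
    return math.ceil(num_ops / ops_per_cycle)


VECTORS_256 = 8  # hypothetical workload: eight 256-bit vector operations

# Bulldozer, per the post: one 256-bit op or four 128-bit ops per cycle.
bd_256 = decode_cycles(VECTORS_256, 1)
bd_128 = decode_cycles(VECTORS_256 * 2, 4)

# Jaguar, assuming a two-wide decoder.
jag_128 = decode_cycles(VECTORS_256 * 2, 2)
jag_256_one_slot = decode_cycles(VECTORS_256, 2)       # if 256-bit ops take one slot
jag_256_two_slots = decode_cycles(VECTORS_256 * 2, 2)  # if they are split up front

print(f"Bulldozer: 256-bit {bd_256} cycles, 128-bit {bd_128} cycles")
print(f"Jaguar:    128-bit {jag_128} cycles, "
      f"256-bit {jag_256_one_slot} (one slot) or {jag_256_two_slots} (two slots)")
```

In this model 256-bit ops only help Jaguar if they decode in a single slot, and they hurt Bulldozer either way; execution throughput is deliberately left out.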
UniversalTruth
23-Mar-2013, 19:48
What?! :shock: Windows Blue coming this summer?
Windows Blue is aimed at Intel 'Haswell' ultrabooks (http://news.cnet.com/8301-10805_3-57575826-75/windows-blue-is-aimed-at-intel-haswell-ultrabooks/)
MS want to roll one OS / year now I believe...
I won't buy another windows until they put the god damned start menu back in again, and stop that metro shit from popping up at login.
Silent_Buddha
23-Mar-2013, 20:50
MS want to roll one OS / year now I believe...
But with reduced prices as well. I expect that with a yearly model, they are aiming at either a 29.99 or 39.99 price point. Similar to the Digital Distribution special they had going for the first few months of Win8. Physical media distribution may be more.
Regards,
SB
Blazkowicz
23-Mar-2013, 21:24
What? Windows Blue? it's the first time ever I see it mentioned.
Apparently it can be thought of both like a service pack and an OSX .1 release.
So, there's a kernel update that covers Haswell's power management basically, and likely various ARM SoC, maybe Jaguar power management.
A future version will cover ARMv8, etc. - I assume they keep Windows 8 x86, RT and phone in sync that way. The platform has to be updated more frequently for new tablets etc. so why not push the minor revision to PC each time as well, through downloads.
If you can indeed get it at $30, without needing to previously own Windows then many people may upgrade from their pirated or legit XP. That would help mitigate the end of support of Windows XP - we all risk a worldwide disaster one year from now when all these XP boxes get owned. It can be a shitstorm with massive data and identity theft, sprawling botnets etc. and can make Microsoft look bad as well as people flee to other platforms (be it linux, chromebook, android, or OSX for the well off)
So a cheap or affordable Windows can allow millions of people to install it and show Microsoft is doing something about the issue.
UniversalTruth
23-Mar-2013, 21:36
I doubt Windows Blue will run as well on hardware as Windows XP which on the same system will most probably fly...
Blazkowicz
23-Mar-2013, 21:52
XP was a slow piece of crap when it came out, too, until it just flew on outdated PCs.
XP is on many millions of computers dating from 2005 to 2009, and btw Windows 8 or Blue requires the NX bit, which makes slow PCs incompatible with it.
Only the hard drive is the weak point with these computers (else they're mostly better than an ARM tablet)
XP and Vista both suffered from seriously underspec-ed budget shit PCs. Though I think it was quite hard to build a PC fast enough to make Vista feel fast in 2006. It needs a lot of HDD speed.
I won't buy another windows until they put the god damned start menu back in again, and stop that metro shit from popping up at login.
Let the hate flow through you. :)
Well Blue will bring "boot on desktop" no ? No more metro when you logon.
Let the hate flow through you. :)
It's flowing through me too, in the exact same way.
Is it likely AVX2 will be used for games? Are today's games generally using the latest instruction sets?
Actually a question here: how useful is AVX128 vs SSE 4.2?
SSE3 is supported by over 99% of current installed CPU base according to latest Steam Survey. It's a no brainer to support it at least. I don't think many current generation games are using AVX, since most games are developed first for current generation consoles, and those do not support wider than 128 bit vectors. PC CPUs are so much faster than current console CPUs that the extra work and version management for going beyond SSE3 doesn't pay off. It is much better to spend that time optimizing the GPU stuff (as the draw call / API overhead on PC is still an issue compared to consoles).
I think the biggest single improvement is that AVX is supported by both AMD and Intel. SSE 4.X(a) had many different versions that weren't fully compatible with each other. Jaguar also supports AVX, and is the CPU in PS4. This is good news for PC gaming, since games will be AVX optimized already for the console. No extra work is needed to support it.
Compared to SSE4.X, 128 bit AVX has some extra instructions such as broadcast and mask moves. These do save the count of memory instructions in some cases. Also the VEX prefix allows AVX to write result to a separate register (nondestructive operation), reducing the register pressure (and reducing the extra move operations). Both of these things are actually very good for Jaguar, since unlike Sandy/Ivy Bridge, AMD CPUs do not have "free" moves by register renaming. Also Jaguar can only sustain two uops per cycle. All the extra moves and extra shuffles take away slots that could be used for doing real work (adds and multiplys). AVX helps with that.
256 bit AVX on Jaguar: That's an interesting question that is not yet answered by AMD (as far as I know). Running 256 bit AVX on Bulldozer doesn't help at all. But Bulldozer has a separate shared vector pipeline, so that might yield slightly different results. Bobcat splits the 128 bit vector instructions to two 64 bit instructions in the decoder. 128 bit operations take two cycles to decode (according to Agner Fogs analysis) and are two separate instructions for the rest of the pipeline. So in case of Bobcat 128 bit (vs 64 bit) only helps by reducing the instruction cache usage. I don't see instruction cache being a bottleneck for Jaguar (it has very good L1 caches). Let's wait for the first Jaguar benchmarks (and Agner's analysis). They shouldn't be far away (since there's already some leaked Temash tablet benchmarks around the net).
No TSX for K models apparently. Seriously Intel? What the hell?
No TSX for K models apparently. Seriously Intel? What the hell?
Intel does have some very weird idea of market segmentation.
However, in the case of TSX, at least one of the functions (Hardware Lock Elision) is backward compatible (i.e. code with TSX support runs fine on older CPUs, just without the benefit).
Also the VEX prefix allows AVX to write result to a separate register (nondestructive operation), reducing the register pressure (and reducing the extra move operations). Both of these things are actually very good for Jaguar, since unlike Sandy/Ivy Bridge, AMD CPUs do not have "free" moves by register renaming. Also Jaguar can only sustain two uops per cycle. All the extra moves and extra shuffles take away slots that could be used for doing real work (adds and multiplys). AVX helps with that.
BD does have free moves so it's not just intel cpus - only for xmm regs though not ymm (I guess it's not only easier for the implementation, it actually probably makes sense since you rarely need moves with avx anyway). Well the moves aren't entirely free since you still got the uops moving around, but that's the same as Sandy Bridge (only Ivy Bridge can do better). Not sure what Jaguar will do there though, Bobcat certainly wasn't as advanced.
As for SSE3 I'm not convinced it's used. It may be supported on more than 99% of all cpus, but really the additional instructions are so minor (float horizontal add/sub and that's about it) you could as well take care of the remaining 1% of all cpus by just using SSE2 only. SSSE3 is way more interesting (byte shuffle for instance) as is sse4.1. But support for those is less wide-spread.
rpg.314
25-Mar-2013, 04:40
BD does have free moves so it's not just intel cpus - only for xmm regs though not ymm (I guess it's not only easier for the implementation, it actually probably makes sense since you rarely need moves with avx anyway). Well the moves aren't entirely free since you still got the uops moving around, but that's the same as Sandy Bridge (only Ivy Bridge can do better). Not sure what Jaguar will do there though, Bobcat certainly wasn't as advanced.
As for SSE3 I'm not convinced it's used. It may be supported on more than 99% of all cpus, but really the additional instructions are so minor (float horizontal add/sub and that's about it) you could as well take care of the remaining 1% of all cpus by just using SSE2 only. SSSE3 is way more interesting (byte shuffle for instance) as is sse4.1. But support for those is less wide-spread.
They are used for complex mul mostly. Hardly useful for games.
BD does have free moves so it's not just intel cpus - only for xmm regs though not ymm (I guess it's not only easier for the implementation, it actually probably makes sense since you rarely need moves with avx anyway). Well the moves aren't entirely free since you still got the uops moving around, but that's the same as Sandy Bridge (only Ivy Bridge can do better). Not sure what Jaguar will do there though, Bobcat certainly wasn't as advanced.
Moves take up decode bandwidth, but should be otherwise completely free on all physical register file OOOe implementations since the move is completely resolved in the renaming stage.
Bobcat and Jaguar are physical register file OOOe machines too, but given the narrow decoder, moves probably have an impact.
Cheers
John Reynolds
25-Mar-2013, 14:06
Well Blue will bring "boot on desktop" no ? No more metro when you logon.
Start8 from Stardock.com fixes that for $5. I'd still be using Win7 if it weren't for this little app.
Albuquerque
25-Mar-2013, 14:29
No TSX for K models apparently. Seriously Intel? What the hell?
The "K" series have always gone missing certain, specific features to keep them out of serious production systems. Have a look at the i7-2600, i7-2600k, i7-3770, and i7-3770k...
http://ark.intel.com/compare/52213,52214,65719,65523
The "k" series are both missing VT-d and TXT. This new "subtraction" of TSX doesn't seem much different to me.
Exophase
25-Mar-2013, 15:08
The "K" series have always gone missing certain, specific features to keep them out of serious production systems. Have a look at the i7-2600, i7-2600k, i7-3770, and i7-3770k...
http://ark.intel.com/compare/52213,52214,65719,65523
The "k" series are both missing VT-d and TXT. This new "subtraction" of TSX doesn't seem much different to me.
It probably doesn't seem different to the Intel management that made that decision either, but it's a lot different. TSX, unlike VT-d and TXT, is something that can apply to a wide variety of software but needs actual programming effort to utilize. But it's a lot harder to motivate software developers to do this when a lot of their userbase won't have access to it and can't test it.
Moves take up decode bandwidth, but should be otherwise completely free on all physical register file OOOe implementations since the move is completely resolved in the renaming stage.
Bobcat and Jaguar are physical register file OOOe machines too, but given the narrow decoder, moves probably have an impact.
Cheers
Only Ivy Bridge implemented this optimization, meaning on Sandy Bridge moves still flowed through the execution units (and of course Netburst uarchs did as well). Bulldozer only has it for SSE moves. I don't think it's necessarily completely free to allow multiple architectural registers to map to the same physical registers. Would not count it as a given on Bobcat and Jaguar.
Only Ivy Bridge implemented this optimization, meaning on Sandy Bridge moves still flowed through the execution units (and of course Netburst uarchs did as well). Bulldozer only has it for SSE moves. I don't think it's necessarily completely free to allow multiple architectural registers to map to the same physical registers. Would not count it as a given on Bobcat and Jaguar.
That's really odd, I'd consider this very low hanging fruit.
Cheers
Albuquerque
25-Mar-2013, 15:40
It probably doesn't seem different to the Intel management that made that decision either, but it's a lot different. TSX, unlike VT-d and TXT, is something that can apply to a wide variety of software but needs actual programming effort to utilize. But it's a lot harder to motivate software developers to do this when a lot of their userbase won't have access to it and can't test it.
I'm not disagreeing, just stating the obvious.
3dilettante
25-Mar-2013, 16:03
There are at least two non-K SKUs that appear to have TSX disabled as well, at least according to Tomshardware.
That's not including any omissions at i3 and below that might turn up eventually. I'm with most commentators in that I don't see the upside to fragmenting things like this, even though I'm not certain TSX will do much at the core counts and typical software consumer SKUs are concerned with.
rpg.314
26-Mar-2013, 01:13
It probably doesn't seem different to the Intel management that made that decision either, but it's a lot different. TSX, unlike VT-d and TXT, is something that can apply to a wide variety of software but needs actual programming effort to utilize. But it's a lot harder to motivate software developers to do this when a lot of their userbase won't have access to it and can't test it.
Only Ivy Bridge implemented this optimization, meaning on Sandy Bridge moves still flowed through the execution units (and of course Netburst uarchs did as well). Bulldozer only has it for SSE moves. I don't think it's necessarily completely free to allow multiple architectural registers to map to the same physical registers. Would not count it as a given on Bobcat and Jaguar.
One possibility: they are locking it out of chips that might be used as xeon replacements and will unlock it for Xeons?
BD does have free moves so it's not just intel cpus - only for xmm regs though not ymm (I guess it's not only easier for the implementation, it actually probably makes sense since you rarely need moves with avx anyway). Well the moves aren't entirely free since you still got the uops moving around, but that's the same as Sandy Bridge (only Ivy Bridge can do better). Not sure what Jaguar will do there though, Bobcat certainly wasn't as advanced.
Yes, however Sandy/Ivy core can decode four instructions per cycle. Bulldozer/Piledriver (and Bobcat/Jaguar) can only decode two instructions per cycle (per core). Sandy/Ivy thus have plenty of free decode slots available for decoding the extra moves that will be eliminated by the register renaming mechanism. The wider Intel cores should benefit more from this feature compared to narrow AMD cores.
As for SSE3 I'm not convinced it's used. It may be supported on more than 99% of all cpus, but really the additional instructions are so minor (float horizontal add/sub and that's about it) you could as well take care of the remaining 1% of all cpus by just using SSE2 only. SSSE3 is way more interesting (byte shuffle for instance) as is sse4.1. But support for those is less wide-spread.
I agree that both SSSE3 and SSE4.1 are more interesting than SSE3, but SSE4.1 has only 62% hardware coverage (source: Steam Survey). SSE4.1 is not a good baseline (if you are targeting only a single instruction set).
SSE3 horizontal operations are handy for example in dot product implementation (dot = mul + 2 x horizontal add, or 2 x dot = 2 x mul + 2 x horizontal add). In SSE2 a single dot product costs you six instructions (mul + 2 x add + 3 x shuffle). Games ported from Xbox 360 tend to use (AoS) vector dot products, because dot products are very fast on Xbox 360 CPU (single cycle throughput rate).
According to Steam Survey SSE3 has a 99.4% coverage, while SSE2 has 99.8% (0.4% difference). 0.4% is not a valid reason to choose SSE2. Unless you want to have dot products that require 2x-3x more instructions... or are calculating everything using SoA layout... but that seems to be something that gameplay programmers are not willing to do. You give them a good optimized vector class and that's the lowest level abstraction they are going to use. SoA vector batch processing is only used by low level engine programmers (as far as my experience goes).
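The instruction counts above can be checked with a small lane-level model. The `hadd` helper mirrors the documented lane behaviour of the SSE3 horizontal add (HADDPS); everything else is plain Python lists standing in for 4-wide registers, and the function names are made up for this sketch.

```python
from typing import List, Tuple

Vec4 = List[float]


def mul(a: Vec4, b: Vec4) -> Vec4:
    """Lane-wise multiply, like a packed single-precision multiply."""
    return [x * y for x, y in zip(a, b)]


def hadd(a: Vec4, b: Vec4) -> Vec4:
    """Horizontal add with HADDPS lane order: pairs of a, then pairs of b."""
    return [a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]]


def dot_hadd(a: Vec4, b: Vec4) -> float:
    """One dot product: mul + 2 x horizontal add."""
    m = mul(a, b)
    h = hadd(m, m)
    return hadd(h, h)[0]


def two_dots_hadd(a: Vec4, b: Vec4, c: Vec4, d: Vec4) -> Tuple[float, float]:
    """Two dot products: 2 x mul + 2 x horizontal add."""
    h = hadd(mul(a, b), mul(c, d))
    r = hadd(h, h)
    return r[0], r[1]


a, b = [1, 2, 3, 4], [5, 6, 7, 8]
c, d = [1, 0, 0, 0], [9, 9, 9, 9]
print(dot_hadd(a, b))
print(two_dots_hadd(a, b, c, d))
```

This only counts instructions; it says nothing about how many uops or cycles a horizontal add costs on a given core.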
Exophase
26-Mar-2013, 14:48
One possibility: they are locking it out of chips that might be used as xeon replacements and will unlock it for Xeons?
That's probably the rationale, and at least fits for other feature segmentation like with VT-d, TXT, AESNI, ECC, etc.. but if they're looking at TSX as an enterprise-class only feature then they're not positioning it well, IMO.
SSE3 horizontal operations are handy for example in dot product implementation (dot = mul + 2 x horizontal add, or 2 x dot = 2 x mul + 2 x horizontal add). In SSE2 a single dot product costs you six instructions (mul + 2 x add + 3 x shuffle). Games ported from Xbox 360 tend to use (AoS) vector dot products, because dot products are very fast on Xbox 360 CPU (single cycle throughput rate).
According to Steam Survey SSE3 has a 99.4% coverage, while SSE2 has 99.8% (0.4% difference). 0.4% is not a valid reason to choose SSE2. Unless you want to have dot products that require 2x-3x more instructions... or are calculating everything using SoA layout... but that seems to be something that gameplay programmers are not willing to do. You give them a good optimized vector class and that's the lowest level abstraction they are going to use. SoA vector batch processing is only used by low level engine programmers (as far as my experience goes).
Well, I agree those instructions could be handy. But I still have some doubts that they are really all that useful - personally I've been able to avoid them whenever I first thought they'd be useful (actually that's not quite true, but almost). And even if they are a perfect fit for your code (such as that AoS dot product), the performance benefit is most likely close to nonexistent, because the internal implementation is apparently exactly that: ordinary add + shuffle. They generate tons of uops and have high latency and crappy throughput, e.g. Wolfdale lists this as 3 uops, latency 7, throughput one every two clocks. And it's typically worse on AMD, where it looks like you could actually do better by doing the add + shuffle manually, at least with Bulldozer (though it's not the same order of fail as SSE4.1 DPPS on BD, which generates 16! uops and definitely looks like you could always do much better manually).
So those instructions probably don't help as much as you'd think from just looking at the instruction count - they look much better on paper than they are. And the workarounds (using shuffles) really are quite trivial, in contrast to the instructions you get with SSSE3/SSE4.1 (emulating byte shuffles by hand is hilarious, for instance, and emulating rounding correctly is tricky at best, etc.).
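The uop argument can be tallied roughly as follows. This is only a sketch: the 3-uop figure for the horizontal add is the Wolfdale number quoted above, while the single-uop costs for the multiply, add and shuffle are assumptions made for illustration, and register copies are ignored.

```python
# Per-instruction uop counts. HADDPS = 3 is the Wolfdale figure quoted in the post;
# the single-uop entries for MULPS, ADDPS and SHUFPS are assumptions.
UOPS = {"mulps": 1, "addps": 1, "shufps": 1, "haddps": 3}

def total_uops(sequence):
    """Sum the uop cost of a list of instruction mnemonics."""
    return sum(UOPS[op] for op in sequence)

# Dot product with SSE3 horizontal adds: 3 instructions.
sse3_dot = ["mulps", "haddps", "haddps"]
# Dot product with manual shuffles and adds: 5 instructions.
manual_dot = ["mulps", "shufps", "addps", "shufps", "addps"]

print("SSE3 path uops:  ", total_uops(sse3_dot))
print("manual path uops:", total_uops(manual_dot))
```

Under these assumptions the shorter instruction sequence is actually the one that issues more uops, which is the point being made about instruction counts looking better on paper.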
Oh, and while here I'd like to bring up the other rare SSE3 instruction, lddqu, a band-aid specifically invented to help the P4 because its movdqu implementation was simply unbearable, and completely useless on any other CPU...
max-pain
28-Mar-2013, 20:56
I agree that both SSSE3 and SSE4.1 are more interesting than SSE3, but SSE4.1 has only 62% hardware coverage (source: Steam Survey). SSE4.1 is not a good baseline (if you are targeting only a single instruction set).
Crysis 3 only supports DX11 GPUs and only 58% of Steam users have that. That is lower than SSE4.1's share. Most next-gen games/ports will require a decent DX11 GPU (and a decent CPU). Every Intel CPU (that isn't low-end) from the past 5 years supports SSE4.1. But (not so) old AMD CPUs could be a problem...
What we know (Steam):
May 2012: SSE4.1 - 52.66%, SSE4.2 - 38.56%, DX11 GPUs - 45.83%
November 2012: SSE4.1 - 59.94%, SSE4.2 - 46.70%, DX11 GPUs - 55.50%
February 2013: SSE4.1 - 62.06%, SSE4.2 - 50.08%, DX11 GPUs - 58.32%
Prediction:
November 2013: SSE4.1 - ~70%, SSE4.2 - ~60%, DX11 GPUs - ~70% (when next-gen consoles launch)
May 2014: SSE4.1 - ~75%, SSE4.2 - ~70%, DX11 GPUs - ~75%
Next-gen games/ports could require SSE4.x, and I think they should.
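The post does not say how the predictions were derived. As a rough cross-check, a plain linear extrapolation from the May 2012 and February 2013 survey figures (an assumed method, not necessarily the poster's) lands within a few points of the guesses above:

```python
def extrapolate(first, last, months_between, months_ahead):
    """Linear extrapolation: continue the average monthly change."""
    slope = (last - first) / months_between
    return last + slope * months_ahead

# (May 2012, February 2013) shares from the survey figures above, 9 months apart.
SURVEY = {
    "SSE4.1": (52.66, 62.06),
    "SSE4.2": (38.56, 50.08),
    "DX11 GPUs": (45.83, 58.32),
}

for name, (may_2012, feb_2013) in SURVEY.items():
    nov_2013 = extrapolate(may_2012, feb_2013, 9, 9)   # 9 months after Feb 2013
    may_2014 = extrapolate(may_2012, feb_2013, 9, 15)  # 15 months after Feb 2013
    print(f"{name}: Nov 2013 ~{nov_2013:.1f}%, May 2014 ~{may_2014:.1f}%")
```

A straight line ignores saturation as shares approach 100%, so it will tend to overshoot further out.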
Silent_Buddha
28-Mar-2013, 21:35
Crysis 3 only supports DX11 GPUs and only 58% of Steam users have that. That is lower than SSE4.1's share. Most next-gen games/ports will require a decent DX11 GPU (and a decent CPU). Every Intel CPU (that isn't low-end) from the past 5 years supports SSE4.1. But (not so) old AMD CPUs could be a problem...
What we know (Steam):
May 2012: SSE4.1 - 52.66%, SSE4.2 - 38.56%, DX11 GPUs - 45.83%
November 2012: SSE4.1 - 59.94%, SSE4.2 - 46.70%, DX11 GPUs - 55.50%
February 2013: SSE4.1 - 62.06%, SSE4.2 - 50.08%, DX11 GPUs - 58.32%
Prediction:
November 2013: SSE4.1 - ~70%, SSE4.2 - ~60%, DX11 GPUs - ~70% (when next-gen consoles launch)
May 2014: SSE4.1 - ~75%, SSE4.2 - ~70%, DX11 GPUs - ~75%
Next-gen games/ports could require SSE4.x, and I think they should.
Crysis 3 is an outlier in that they aren't interested in selling a game to as many people as possible. They are more interested in pushing the tech as far as possible.
Most companies take the Blizzard approach of trying to sell to the largest audience possible.
But yes, the new consoles will change that dynamic, as the largest pool of people will then consist of PS4/Xbox next/PC, which means DX11-class features. But don't be surprised if you still see DX9-class games with DX11 added if the developer/publisher wants to target PS3/X360/older PCs in addition to PS4/Xbox next/DX11 PCs.
Regards,
SB
Intel Haswell chip pic from Vr-zone via Anand
That...is a sexy piece of hardware pr0n. Wow. Haswell is the first non-rectangular (4-core) Core i-series CPU ever. That GPU looks to be absolutely massive, unless Intel re-jigged the whole layout of the chip.
In previous i-series chips, CPU cores were lined up in a row with L3 beneath them and the GPU tacked on to the side. Now I would assume that the GPU sits on the opposite side of the L3 compared to the cores, thus filling the chip out into a square-ish shape. That'd mean 50+ percent of the die is GPU... Ugh! :D
Off-chip die is fairly large. Wonder what geometry it is manufactured with, I'd assume something coarser than 22nm, probably, seeing as DRAM is quite frugal with power, and older fabs are cheaper to run... Anyhow, damn nice piece of kit. I'm all hot and bothered now! :razz:
Off-chip die is fairly large. Wonder what geometry it is manufactured with, I'd assume something coarser than 22nm, probably, seeing as DRAM is quite frugal with power, and older fabs are cheaper to run... Anyhow, damn nice piece of kit. I'm all hot and bothered now! :razz:
It definitely won't be 22nm, because DRAM does not need the characteristics that a logic-focused process like Intel's provides.
They probably took an existing memory chip and integrated it into the package.
Considering an 8Gb DDR3 DRAM chip can be as small as roughly 10mm x 12mm (including package), I think it's possible that the off-die DRAM is 512MB ~ 1GB or larger. That'd be considerable if they are able to make the bus wide enough for sufficient bandwidth.
UniversalTruth
11-Apr-2013, 15:59
That'd mean 50+ percent of the die is GPU... Ugh! :D
So boring.... complete waste of precious space :cry:
Guys, why don't you tell them to stop screwing us and get their act together, as expected by normal sane people... :grin:
Some interesting comments from the Xtremes which I do like:
Seems like I will grow old with my 2600K. Does Intel really not want people with a Sandy or Ivy Bridge desktop chip to upgrade? Give us at least a 6 core as a minimum already!!! And I don't need silicon spent on an iGPU.
this. It's been over 6 years since Q6600 was introduced. It's 2013 and still no mainstream 6 cores.. AMD's HSA seems more interesting than this :/
Even Haswell mobos seem more of the same old: no NGFF slot, no SATA Express, no integrated WiDi / Miracast.. 99% of boards don't come with Bluetooth 4.0 or Wi-Fi, and there's no decent onboard sound.
and people wonder why PC growth is stagnant. :/
http://www.xtremesystems.org/forums/showthread.php?285451-Intel-haswell-i7-4770K-preview-article-TMSHW
Time for everyone to stop buying Intel to get their attention. Tell everyone you know that builds, etc, then send Intel a nice letter to go f*&k themselves. Maybe get their attention.
http://www.xtremesystems.org/forums/showthread.php?285764-Intel-Xeon-2013-2014-processors-detailed
DRAM does not need the characteristics that a logic-focused process like Intel's provides.
Presumably Intel is capable of tuning their process to manufacture different kinds of devices (they make flash, for example, in their own fabs, together with Micron.) ...But presumably, it would be cheaper and more efficient to use spare fab capacity of a prior node, fabs that have most likely already had all construction costs written off years ago.
They probably took an existing memory chip and integrated it into the package.
Unpossible. That'd give the memory package far too little bandwidth. There's been talk that there's a 512-bit bus between the dies, indicating something custom-engineered. Also, seeing Intel's big focus on power usage and saving these days, you'd think they'd want to cook up their own solution throughout, tailoring both devices on that substrate completely to their own requirements.
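A back-of-the-envelope sketch of the bandwidth argument: peak bandwidth scales with bus width times transfer rate. The 512-bit width is the rumoured figure mentioned above; the 1600 MT/s rate and the x16 chip width are assumptions chosen for illustration (a standard DDR3-1600 part), not known properties of the actual device.

```python
def peak_bandwidth_gb_s(bus_width_bits, megatransfers_per_s):
    """Peak bandwidth in GB/s: bytes per transfer times transfers per second."""
    return bus_width_bits / 8 * megatransfers_per_s * 1e6 / 1e9

# 1600 MT/s is a hypothetical rate chosen for illustration (DDR3-1600 speed).
RATE = 1600

single_x16_chip = peak_bandwidth_gb_s(16, RATE)  # one off-the-shelf x16 DRAM chip
ddr3_channel = peak_bandwidth_gb_s(64, RATE)     # one ordinary 64-bit DDR3 channel
rumoured_bus = peak_bandwidth_gb_s(512, RATE)    # the rumoured 512-bit link

print(f"x16 chip:     {single_x16_chip:.1f} GB/s")
print(f"64-bit chan:  {ddr3_channel:.1f} GB/s")
print(f"512-bit link: {rumoured_bus:.1f} GB/s")
```

At the same transfer rate, a 512-bit link carries 32 times what a single x16 chip can, which is why a stock memory chip looks like a poor fit.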
So boring.... complete waste of precious space :cry:
Guys, why don't you tell them to stop screwing us and get their act together, as expected by normal sane people... :grin:
Actually, a decent majority of the "normal people" do browsing and video playback as a big part of their computing.
Catering to the minority would mean an extremely small gain in revenue for a lot more cost and work.
Exophase
11-Apr-2013, 18:36
Presumably Intel is capable of tuning their process to manufacture different kinds of devices (they make flash, for example, in their own fabs, together with Micron.) ...But presumably, it would be cheaper and more efficient to use spare fab capacity of a prior node, fabs that have most likely already had all construction costs written off years ago.
Intel actually had separate flash fabs jointly operated with Micron. Micron has since bought out Intel's share: http://www.eetimes.com/electronics-news/4237169/Micron-buying-Intel-s-stake-in-two-IM-Flash-fabs
You can see at least one of these fabs also produced DRAM. AFAIK most DRAM manufacturers make NAND flash too so there's probably overlap in those processes but I don't think there is with the standard logic processes.
I'm not aware of Intel ever having eDRAM on a product made in one of their CPU fabs.. they could have had the capability, but it strikes me as a wasted investment for them since they're putting the DRAM off die. They very well could have had the die made by a collaborating party (Micron being a real possibility); of course that doesn't mean they're using off-the-shelf parts.
Unpossible. That'd give the memory package far too little bandwidth. There's been talk that there's a 512-bit bus between the dies, indicating something custom-engineered. Also, seeing Intel's big focus on power usage and saving these days, you'd think they'd want to cook up their own solution throughout, tailoring both devices on that substrate completely to their own requirements.
There is the wide IO standard but that probably offers too little bandwidth per pin to be useful in this case. I don't really know if there's an intrinsic limitation in the technology that prevents companies like Samsung from making much higher clocked versions (that are not nearly as low power).
UniversalTruth
11-Apr-2013, 18:57
Actually, a decent majority of the "normal people" do browsing and video playback as a big part of their computing
That is because they do nothing to promote heavy computing, interesting games, etc.; they do not innovate. They have just left the desktop to run on inertia. Nothing new, nothing to get people's attention and money. One big nothing.
If you think that it's normal for the most stupid tablet to have more extras than premium motherboards, or for the most stupid smartphone to have a retina display matching the resolution of big 20-30 inch monitors...
pjbliverpool
11-Apr-2013, 19:25
That is because they do nothing to promote heavy computing, interesting games, etc.; they do not innovate. They have just left the desktop to run on inertia. Nothing new, nothing to get people's attention and money. One big nothing.
If you think that it's normal for the most stupid tablet to have more extras than premium motherboards, or for the most stupid smartphone to have a retina display matching the resolution of big 20-30 inch monitors...
It depends how much the IGPs can be used for GPGPU. There's 400 GFLOPS in that GT2 IGP, which could do wonders for compute tasks.
Having said that, another 4 Haswell cores would offer even more performance and be a lot more versatile, so given the choice between 8 cores or 4 cores + GT2 at the same price point I'd certainly take the 8 cores any day.
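For reference, peak single-precision throughput is just units x FLOPs per clock per unit x clock. The sketch below uses figures that are not from this thread and should be treated as assumptions: 20 execution units at 16 FLOPs per clock each and a 1.25 GHz GPU clock for GT2, and 32 FLOPs per clock per core (two 8-wide FMA units) at a hypothetical 3.5 GHz for the CPU cores.

```python
def peak_gflops(units, flops_per_clock_per_unit, clock_ghz):
    """Theoretical peak single-precision GFLOPS."""
    return units * flops_per_clock_per_unit * clock_ghz

# Assumed GT2 configuration: 20 EUs, 16 FLOPs/clock/EU, 1.25 GHz.
gt2 = peak_gflops(20, 16, 1.25)

# Assumed CPU configuration: 4 extra cores, 32 FLOPs/clock/core, 3.5 GHz.
four_cores = peak_gflops(4, 32, 3.5)

print(f"GT2 IGP:       {gt2:.0f} GFLOPS")
print(f"4 extra cores: {four_cores:.0f} GFLOPS")
```

Under these assumptions four additional cores would edge out the IGP on paper, while real-world results depend heavily on how well the workload vectorizes.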
RudeCurve
11-Apr-2013, 22:09
Welcome to last decade...
http://www.llamma.com/xbox360/news/images/xbox-360-elite/360elite%20048.jpg
pjbliverpool
11-Apr-2013, 22:15
Because they're the same shape? Because that's pretty much where the similarities end!