I Can Hazwell?

Looks like the main pipeline stays the same, and there are a lot of quantitative improvements to certain structures, as well as tons of uncore changes. Some of the more significant changes so far on the CPU side:

uArch:
- Extra integer & extra store port
- AVX2
- TSX
- L1 latency & conflict improvements
- Beefed up cache bandwidth L1 & L2

Power:
- Active idle state

No news on clockspeeds yet.
 
Good Lord!

Haswell is indeed taking full advantage of the 22nm tech.
Indeed, after AMD's presentation of the Steamroller cores, some here hinted at only mild improvements vs. Ivy Bridge. I may not grasp everything, but the thing looks like a monster.
They already had a huge advantage in cache bandwidth, and they are doubling that advantage.
They are doubling FP throughput vs. Ivy Bridge: 32/16 SP/DP FLOPs per cycle per core.
They are improving single-thread performance: a bigger reorder buffer, improved branch prediction, and better power management features that should allow the turbo to kick in more often.
They are widening the architecture: it can now execute up to 8 micro-ops per cycle, up from 6 in previous architectures.

They improved significantly pretty much everything. It looks to me like a freaking monster. If Intel doesn't cut features and artificially lower the performance of their "low end" parts, the next Core i3 might be all you need as a gamer, and more.

They made significant improvements to the GPU, the video decode hardware, and the video encode hardware.

All of this while lowering power consumption; it's nothing short of amazing.

Then there is compute. Their GPUs were already doing well, and they added extra muscle to both the GPU and the CPU. Now that the GPU in compute operations has 0.5 TB/s of bandwidth to the last level cache, the thing could literally fly.
 
the GPU in compute operations has 0.5 TB/s of bandwidth to the last level cache; the thing could literally fly.
Yeah, that's really good.

Total LLC in current high end i7 models is 8 MB (Haswell should be similar in this regard). That's quite a bit more shared cache for the GPU compared to GCN for example (768KB shared L2 in 7970, 512KB in models with 256 bit memory bus). This is excellent news for some GPU compute algorithms. I wonder if they can also use the 8 MB cache as a render target (like we Xbox 360 programmers use the EDRAM). 8 MB is not enough for a whole 1080p RT, but it would be really good for shadow map rendering for example (depth buffer only, and lots of overdraw). 512x512 shadow map is 1 MB, 1024x1024 is 4 MB (both would fit nicely to LLC).

Transactional memory seems to be really interesting as well :)
 
Are Intel really telling us all this stuff eight months before launch? That seems a bit bonkers. Or is there the possibility that they might surprise people with an early launch?

It's not like Intel are up against anything substantial from AMD.
 
Intel's improved fetch bandwidth and widened the back end to handle two branches per cycle.
It doesn't mention predicting two branches per cycle, though.

It's kind of interesting to see a core that could for some reason generate 3 store addresses a cycle. Perhaps the extra port is to keep the store address out of the way of the load calculations as much as possible.

The L1 is much better in terms of bandwidth, and it sounds like they truly dual-ported it given the claimed elimination of bank conflicts. This was a noted constraint with SB and its competition.
Transactional memory is handled by the L1, which sidesteps my concern about earlier speculation of enlisting the larger and more distant L2.
No mention of how gather is handled, though.

There's a lot of stuff done with an eye towards comprehensive power management and SOC-level control, including further empowering the system agent to trick the OS as to the true status of the cores and their thread activity.
 
Are Intel really telling us all this stuff eight months before launch? That seems a bit bonkers. Or is there the possibility that they might surprise people with an early launch?

It's not like Intel are up against anything substantial from AMD.

Par for the course for Intel to release new uarch information at IDF, just like they always have. When else would they?
 
No mention of how gather is handled, though.
Hopefully they spill the beans soon. I just hope that you don't have to code a loop for it (like you must for Knights Corner). The worst case is that it's just a long microcoded sequence, but that wouldn't make much sense. I am keeping my pessimistic view until Intel proves me otherwise. Efficient gather is almost too good to be true :)

... Intel has added two extra ports, but neither of them does load-related things. And there are "no changes to key pipelines" either. No mention of other load-related improvements either. So my conclusion is that gather likely takes several cycles to complete (even without cache misses).
 
Looks like Intel's Oregon team is responsible for building around the basic pipeline-flow foundation that its Haifa team lays out. This has been true for several consecutive pairs of architectural "tocks" since they started following that execution plan: Conroe->Nehalem, and Sandy Bridge->Haswell. I'm expecting Skylake to be the next major retooling of the pipeline by the Israeli team.
 
It's kind of interesting to see a core that could for some reason generate 3 store addresses a cycle. Perhaps the extra port is to keep the store address out of the way of the load calculations as much as possible.
Maybe a copy&paste error from SNB/IVB? Since there's now a dedicated store AGU, and only one store data port, it seems like there would be no reason at all to use a shared load/store AGU for calculating the store address. In some way that would be more like Nehalem/Westmere, which also had separate load and store AGUs (but of course just one of each).
But OMG, this thing is a beast. AMD thought there wasn't much point in having a 3rd INT ALU, and Intel now has 4...
I wonder about the front end, though; no mention of any improvements there. Is it really good enough to feed that monster back end?
 
ARCS001 on page 12 states that the extra AGU for store alleviates pressure on ports 2 & 3 for loads. I guess that's something they identified from their simulations as a bottleneck, maybe for hyperthreading? They've also directly addressed many bottlenecks that Agner Fog identified on page 174 of his guide.
 
Hmm, interesting that both port 0 and port 1 can do FMA, and port 1 can now do FP mul too, but port 0 can't do FP add. Any ideas why that would be?
 
Maybe in common legacy workloads, FP adds coincide with branches, shifts, and divides.

Edit: I guess more importantly, FP adds coincide with FP muls. The whole point of FMAC is to increase efficiency on that particular combination of instructions, and you still need to take legacy code into account.
 
Well, how would it know this RAM is closer...?

Presumably, the BIOS would give hints that good operating systems would record and make use of.

Besides, from what I understand no desktop OS can optimize RAM during runtime; once something's loaded somewhere it pretty much stays there.

This is incorrect. All software uses only virtual addresses; the OS is free to relocate the physical pages anywhere it wants. The oldest example is of course swapping to disk, but modern Linux can do things like migrating memory to a closer NUMA node, or merging many small pages into a few 2 MB ones. AFAIK, Windows does nothing of the sort.
 
Well, how would it know this RAM is closer...? Besides, from what I understand no desktop OS can optimize RAM during runtime; once something's loaded somewhere it pretty much stays there.

Just the way the driver knows GPU memory is closer to the GPU. The driver will allocate render targets in that memory and, when it's full, kick them back to CPU RAM.
 
Hopefully they spill the beans soon. I just hope that you don't have to code a loop for it (like you must for Knights Corner). The worst case is that it's just a long microcoded sequence, but that wouldn't make much sense. I am keeping my pessimistic view until Intel proves me otherwise. Efficient gather is almost too good to be true :)

... Intel has added two extra ports, but neither of them does load-related things. And there are "no changes to key pipelines" either. No mention of other load-related improvements either. So my conclusion is that gather likely takes several cycles to complete (even without cache misses).

Ports 6 and 7 provide integer capability and branching, while also keeping the vector pipes unencumbered.
An internal gather loop could utilize the extra integer operand access and branch capability of the extra ports without simultaneously blocking the vector pipes that would make use of a gather instruction. The store AGU all by itself seems unbalanced, unless it's sharing that port with something they've chosen not to discuss yet, maybe the specialized hardware that would scan a gather index register and detect how many belong to the same cache line.
Port 5 has vector shuffles, which might include that permute unit that both gather and vector work would like. It's not zero-sum because a gather would provide data from memory in the desired arrangement.

The data given so far makes Haswell sound much more interesting than Steamroller, although there's always the chance that more is to come on the latter's account. The breadth of the engineering effort for this architecture visually dwarfs the competition. The promotion of integer vector instructions to 256-bit is going to put some serious hurt on one of the few areas BD was not outclassed in.
 
Hopefully they spill the beans soon. I just hope that you don't have to code a loop for it (like you must for Knights Corner). The worst case is that it's just a long microcoded sequence, but that wouldn't make much sense. I am keeping my pessimistic view until Intel proves me otherwise. Efficient gather is almost too good to be true :)

... Intel has added two extra ports, but neither of them does load-related things. And there are "no changes to key pipelines" either. No mention of other load-related improvements either. So my conclusion is that gather likely takes several cycles to complete (even without cache misses).
A microcoded sequence could still be faster.
 