Haswell should bring an L1 that can support AVX without the bandwidth constraint experienced with SB and IB. That's on top of promoting integer SIMD to 256 bits, adding gather, and adding FMA.
Steamroller promises to make its FPU a little smaller.
For concurrent workloads, the lock elision and transactional memory support provide an opportunity for Haswell to scale its performance in heavily multithreaded integer workloads.
AMD thus far has promised that it is not going to improve its cache or memory architecture much.
The transactional memory instructions are a big change, but they could probably be added to cores without affecting the portions of the pipeline aimed at single-threaded performance improvements. They could affect how Hyperthreading is implemented, and I eagerly await more details.
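As a rough illustration of why that separation is plausible: the speculative state of an elided or transactional region can live entirely in buffered writes that are discarded on abort, off to the side of the normal execution pipeline. Here's a toy Python sketch of those semantics (my own illustration only; real hardware lock elision would track this in the L1 cache and store buffers, not in software):

```python
class Transaction:
    """Toy model of transactional-region semantics: writes are buffered
    and become globally visible only on commit; a conflict aborts the
    region and discards them. Purely illustrative."""

    def __init__(self, memory):
        self.memory = memory          # shared dict: addr -> value
        self.write_set = {}           # speculative writes, not yet visible
        self.aborted = False

    def read(self, addr):
        # a transaction sees its own speculative writes first
        return self.write_set.get(addr, self.memory.get(addr))

    def write(self, addr, value):
        self.write_set[addr] = value  # stays local until commit

    def conflict(self):
        # another thread touched our data: discard all speculative state
        self.aborted = True
        self.write_set.clear()

    def commit(self):
        if self.aborted:
            return False              # caller retries, e.g. under the real lock
        self.memory.update(self.write_set)
        return True
```

The point is that abort/commit only needs buffering and a conflict check; the single-threaded execution path through the core can stay untouched.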
For Haswell, the new FMAC and the associated framework to keep it fed could be a huge boost. AMD does seem to be rebalancing its CPU cores toward relatively more integer than floating-point performance, but it's not fabbing a massive GPU onto the same die for no reason. I think they want people to make use of the GPU when really heavy streams of floating-point calculations arise; the existing FPUs are sufficient to address legacy code and any sporadic floating-point math that might come up.
The increase in L1 Icache size is a change whose magnitude has not yet been disclosed. I'm not sure how many workloads were complaining about the 64KB on Bulldozer, however.
Doubling the decoders suggests that an increase in the width of the instruction prefetch, or in the number of prefetches in flight, is in order.
The problem with expanding the L1 is that aliasing would worsen: with a virtually indexed cache and 4KB pages, every index bit above the page offset is a bit on which synonyms can land in different sets.
Idle thought:
They could increase the associativity and cache block size to chip away at the index bits.
Perhaps split the L1 internally with a pseudo-associative cache? That would be two 32-64KB halves/banks, each 4-way with 128-byte blocks. The associativity and block length would take down 2 bits of aliasing, then a rule that synonyms on the last bit are split between the halves.
It sounds complicated, but I'm not certain a 128KB 64-byte line 2-way cache is going to do much for this architecture.
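To put numbers on the idle thought above: with 4KB pages, any index-plus-offset bits beyond bit 11 of a virtually indexed cache are potential synonym bits. A quick back-of-the-envelope calculator (plain Python; the configurations are just the ones discussed in this thread):

```python
import math

PAGE_OFFSET_BITS = 12  # 4 KiB pages: bits [11:0] match in virtual and physical addresses

def aliasing_bits(size_bytes, ways, line_bytes):
    """Virtual index bits above the page offset in a virtually indexed cache."""
    sets = size_bytes // (ways * line_bytes)
    index_plus_offset = int(math.log2(line_bytes)) + int(math.log2(sets))
    return max(0, index_plus_offset - PAGE_OFFSET_BITS)

configs = [
    ("Bulldozer L1I: 64 KiB, 2-way, 64 B lines",   64 * 1024, 2, 64),
    ("Doubled: 128 KiB, 2-way, 64 B lines",       128 * 1024, 2, 64),
    ("One bank: 64 KiB, 4-way, 128 B lines",       64 * 1024, 4, 128),
]
for name, size, ways, line in configs:
    print(f"{name}: {aliasing_bits(size, ways, line)} aliasing bit(s)")
```

By this arithmetic, naively doubling to 128KB at 2-way/64B goes from 3 to 4 aliasing bits, while a 64KB 4-way 128B-line bank gets back down to 2, which is where the "take down 2 bits" figure comes from.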
The decoupled prediction and fetch logic would help buffer the hiccups inherent to having larger lines and the variability of having a pseudo-associative cache.
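For concreteness, the pseudo-associative lookup I'm imagining behaves something like this toy model, where one address bit picks a preferred bank and the other bank is probed at extra latency on a miss in the first (entirely my own sketch, not anything AMD has described; banks are direct-mapped here to keep the toy short):

```python
class PseudoAssocL1:
    """Toy two-bank pseudo-associative cache: one address bit selects a
    preferred bank; the other bank is checked at extra latency."""

    def __init__(self, sets_per_bank=128, line_bytes=128):
        self.sets_per_bank = sets_per_bank
        self.line_bytes = line_bytes
        self.banks = [dict(), dict()]  # each bank: set index -> resident tag

    def _decompose(self, addr):
        line = addr // self.line_bytes
        index = line % self.sets_per_bank
        bank = (line // self.sets_per_bank) & 1  # one extra bit picks the bank
        tag = line // self.sets_per_bank         # tag includes the bank bit
        return bank, index, tag

    def access(self, addr):
        bank, index, tag = self._decompose(addr)
        if self.banks[bank].get(index) == tag:
            return "fast hit"
        if self.banks[bank ^ 1].get(index) == tag:
            return "slow hit"                    # found in the other bank
        # fill: preferred bank first, spill to the other bank if occupied
        if self.banks[bank].get(index) is None:
            self.banks[bank][index] = tag
        elif self.banks[bank ^ 1].get(index) is None:
            self.banks[bank ^ 1][index] = tag
        else:
            self.banks[bank][index] = tag        # both full: evict in preferred
        return "miss"
```

The fast-hit/slow-hit split is exactly the variable latency that the decoupled fetch logic would have to buffer.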
The plus side to having two decoders, if they are fed, is that Steamroller might be able to brute force through one known soft spot for Sandy Bridge, where its fetch bandwidth becomes constraining if the uop cache misses. The core that's consuming this instruction bandwidth really isn't that big a bruiser, though.
The loop buffers and expanded predictor for Steamroller would take on significance because it sounds like, no matter what, the front end of the pipeline is going to be longer, and anything that keeps the rest of the core from feeling it is an improvement. Sandy Bridge probably has a decently long pipeline, part of which is mitigated by the uop cache. The uop cache was hard to engineer, so that probably means AMD can't manage it yet. Loop buffers in the tradition of Nehalem would be consistent with AMD's multiyear lag behind the leader.
It does sound like AMD is steering away slightly from its cluster-based approach, mainly to give each core more single-threaded performance. Giving a decoded-uop cache to each core could take a significant chunk of silicon per core to implement, and sounds like it would naturally lead to splitting the decoder in two to better service each core's uop cache individually.
Separating the usual ICache into 2 higher-associativity but smaller pieces seems like it might entail more complication in the Ifetcher, or even a split, which might be going far enough to defeat the point of the cluster-based approach. It also probably does less for single-threaded performance and power savings than a uop cache, since the latter's contents are much "closer" pipeline-wise to the execution units than the Icache. From Agner Fog's tests, having only 2-way associativity in its ICache hurts BD, especially since it's servicing two threads; it sounds like addressing the poor associativity would be a better first step than splitting it up. Whatever AMD did, reducing L1 misses by 30% is a lot...