CPU cache?

Haswell is strongly assumed to support two 256-bit loads per cycle. They could use eight 16-byte cache banks, or sixteen 8-byte banks, or stick with eight 8-byte banks. Note that the first two options likely require doubling the cache line length. So I wouldn't be surprised if Haswell did have 128-byte cache lines.
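Just to make the line-length arithmetic explicit, here's a minimal sketch of that banking trade-off (my own toy model, assuming a line is striped exactly once across all banks - not a claim about Haswell's actual design):

#include <stdio.h>
#include <stdint.h>

/* Toy model: if a cache line is striped once across all banks, the implied
 * line length is banks * bank_width, and the bank an access hits comes from
 * a few low address bits. */
static unsigned bank_of(uint64_t addr, unsigned bank_bytes, unsigned banks)
{
    return (unsigned)((addr / bank_bytes) % banks);
}

int main(void)
{
    struct { unsigned banks, bank_bytes; } opt[] = {
        { 8, 16 },   /* eight 16-byte banks   -> 128-byte lines */
        { 16, 8 },   /* sixteen 8-byte banks  -> 128-byte lines */
        { 8, 8 },    /* eight 8-byte banks    ->  64-byte lines */
    };
    for (unsigned i = 0; i < 3; i++)
        printf("%2u x %2u-byte banks: %3u-byte line, address 0x48 -> bank %u\n",
               opt[i].banks, opt[i].bank_bytes,
               opt[i].banks * opt[i].bank_bytes,
               bank_of(0x48, opt[i].bank_bytes, opt[i].banks));
    return 0;
}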

We'd have to evaluate the advantages/disadvantages of each option to see which one is most likely...
Well, I hadn't really been paying attention to the rest of the thread. What I meant was that Nehalem and Sandy Bridge seem to have nailed the cache configuration: 32 KB L1 data and instruction caches, 256 KB L2, 2 MB/core L3. This seems to be the optimal setup. If they do change it though, I have no doubt it will be for the best.
 
I disagree with your notion that Haswell is "strongly assumed" to support two 256-bit loads per cycle. Nothing really hints that it will have improved load/store (aside from what's really necessary for gather) - it could, but I wouldn't be surprised if it didn't.
Haswell doubles both the integer and floating-point SIMD throughput, and further increases the processing rate with gather. Sandy/Ivy Bridge only supports one 256-bit read and 128-bit write per cycle. So it would be nothing short of insane to provide that much more processing power and still leave it starved for data.
 
Haswell doubles both the integer and floating-point SIMD throughput, and further increases the processing rate with gather. Sandy/Ivy Bridge only supports one 256-bit read and 128-bit write per cycle. So it would be nothing short of insane to provide that much more processing power and still leave it starved for data.

Doubled throughput is due to FMA3, right? That doesn't necessarily require more data bandwidth for the most common cases where it helps (like matrix multiplies or dot products).
 
It would still be an improvement without using FMA. The gains from using AVX-256 on Sandy Bridge were reduced in kernels that bottlenecked on the load/store units.
There were benchmarks showing improvements of 30% or so, when they really could have been higher if the memory ports had supported it.

Wider accesses would hit more banks and increase the chance of a conflict if facing unaligned or irregular access.

Doubling the line length would also have effects further down the pipeline on everything that works at line granularity - the prefetchers, coherency, the stages and latencies of cache fills, and Intel's TSX functionality. Research and profiling would be needed to determine where the costs and benefits cross over.
 
Doubled throughput is due to FMA3, right? That doesn't necessarily require more data bandwidth for the most common cases where it helps (like matrix multiplies or dot products).
Sure it does. Why would processing code with many multiplications and additions almost twice as fast not require twice the bandwidth? Note again that Sandy/Ivy Bridge are already running into severe bandwidth bottlenecks, before FMA or 256-bit integer operations. And even twice the bandwidth isn't excessive. Haswell should have three 256-bit vector ALUs with three input operands. That's a peak of nine input and three output operands per cycle, so two 256-bit memory read ports and one 256-bit write port is a very safe assumption.
 
Sure it does. Why would processing code with many multiplications and additions almost twice as fast not require twice the bandwidth?

Because when calculating a (long) dot product, the data you add is already in a register, as it is the result of a previous FMA. I.e. you only need ~2 loads per instruction, just like for an (independent) add or mul instruction.
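A minimal sketch of that case, using standard <immintrin.h> AVX/FMA3 intrinsics (the dot product and the function name are just my example): each iteration does two 256-bit loads and one FMA, and the accumulator never touches memory.

#include <immintrin.h>
#include <stddef.h>

/* AVX/FMA dot product sketch (assumes n is a multiple of 8 and FMA3 support).
 * Per iteration: two 256-bit loads and one FMA; the value being added (acc)
 * is the result of the previous FMA and stays in a register, so FMA adds no
 * extra memory operand compared to an independent add or mul. */
float dot_fma(const float *a, const float *b, size_t n)
{
    __m256 acc = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);   /* 256-bit load #1 */
        __m256 vb = _mm256_loadu_ps(b + i);   /* 256-bit load #2 */
        acc = _mm256_fmadd_ps(va, vb, acc);   /* acc += va * vb   */
    }
    /* horizontal sum of the eight partial sums */
    __m128 lo = _mm_add_ps(_mm256_castps256_ps128(acc),
                           _mm256_extractf128_ps(acc, 1));
    lo = _mm_hadd_ps(lo, lo);
    lo = _mm_hadd_ps(lo, lo);
    return _mm_cvtss_f32(lo);
}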

Note again that Sandy/Ivy Bridge are already running into severe bandwidth bottlenecks, before FMA or 256-bit integer operations. And even twice the bandwidth isn't excessive. Haswell should have three 256-bit vector ALUs with three input operands. That's a peak of nine input and three output operands per cycle, so two 256-bit memory read ports and one 256-bit write port is a very safe assumption.

I'm not disputing *Bridge bottlenecks, since I haven't profiled those. I'm just saying the 2x throughput from FMA isn't like normal 2x throughput from additional or wider units, in that it doesn't shift the bottleneck to data paths to the same degree.
 
Because when calculating a (long) dot product, the data you add is already in a register, as it is the result of a previous FMA. I.e. you only need ~2 loads per instruction, just like for an (independent) add or mul instruction.
You can only compare equivalent code; each dependent mul and add instruction gets replaced by a single FMA instruction. The memory bandwidth will have to double to sustain the peak throughput.
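To spell out the "equivalent code" point with a scalar sketch (fmaf is the standard C99 fused multiply-add from <math.h>; the surrounding function names are just mine for illustration):

#include <math.h>

/* Without FMA: a multiply followed by a dependent add - two instructions. */
float step_mul_add(float a, float b, float acc)
{
    float t = a * b;
    return acc + t;
}

/* With FMA: the same work in one fused instruction. The two values read
 * from memory (a and b) are unchanged, but per instruction retired the
 * machine now consumes twice the input data. */
float step_fma(float a, float b, float acc)
{
    return fmaf(a, b, acc);
}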
I'm not disputing *Bridge bottlenecks, since I haven't profiled those. I'm just saying the 2x throughput from FMA isn't like normal 2x throughput from additional or wider units, in that it doesn't shift the bottleneck to data paths to the same degree.
I agree that, in practice, FMA doesn't typically double the throughput, since only a portion of arithmetic code consists of pairs of dependent mul and add instructions. But still, for peak performance a doubling of the bandwidth is a necessity. And as explained, increasing the bandwidth was already overdue, so doubling it won't be excessive in practice.

I was arguing that Haswell is extremely likely to get twice the load/store width, for a combination of reasons, not that FMA would be the only reason.
 
I was arguing that Haswell is extremely likely to get twice the load/store width, for a combination of reasons, not that FMA would be the only reason.
I guess that makes sense, even if I'm not quite sure how severe the bandwidth bottleneck really is with Sandy Bridge already. Though if the chip were to sustain two 256-bit loads and one 256-bit store per clock it would also need a third AGU (if it still resembles Core 2, that might mean another execution port). I guess another possibility to get more load bandwidth would be to keep the two AGUs but beef the transfers up to 256-bit, so either two 256-bit loads or one 256-bit load and one 256-bit store per cycle.
I guess you suggest more banks (or larger banks) and hence larger cache lines (and probably larger cache size) because sticking to 8 eight-byte banks would probably mean it can't really reach the max throughput due to likely bank conflicts?
 
I guess that makes sense, even if I'm not quite sure how severe the bandwidth bottleneck really is with Sandy Bridge already. Though if the chip were to sustain two 256-bit loads and one 256-bit store per clock it would also need a third AGU (if it still resembles Core 2, that might mean another execution port). I guess another possibility to get more load bandwidth would be to keep the two AGUs but beef the transfers up to 256-bit, so either two 256-bit loads or one 256-bit load and one 256-bit store per cycle.
Sandy Bridge already has two 128-bit read ports, one 128-bit write port, and two AGUs, each of which can be used for either loads or stores. So yes, the only thing required is to double the port widths.
I guess you suggest more banks (or larger banks) and hence larger cache lines (and probably larger cache size) because sticking to 8 eight-byte banks would probably mean it can't really reach the max throughput due to likely bank conflicts?
I wasn't suggesting anything to be honest. I just considered 128-byte cache lines a possibility, that's all. But indeed, avoiding bank conflicts seems like a good reason not to stick with 64-byte cache lines.
 
Sandy Bridge already has two 128-bit read ports, one 128-bit write port, and two AGUs, each of which can be used for either loads or stores. So yes, the only thing required is to double the port widths.
OK, but this will not give you two 256-bit loads and one 256-bit store per cycle, only two 256-bit load/store operations in total (which is still an improvement, albeit not a very big one), since the "cheap trick" of Sandy Bridge (only one AGU needed for a 256-bit load, which requires a 2x128-bit transfer) isn't going to work.

I wasn't suggesting anything to be honest. I just considered 128-byte cache lines a possibility, that's all. But indeed, avoiding bank conflicts seems like a good reason not to stick with 64-byte cache lines.
I have to agree it makes sense. But who knows what crazy ideas Intel has :). And that Haswell screenshot looked OK to me, though it doesn't mean much...
 
You can only compare equivalent code; each dependent mul and add instruction gets replaced by a single FMA instruction. The memory bandwidth will have to double to sustain the peak throughput.

I think I was still unclear. What I meant is: FMA for dependent ops requires no more memory bandwidth than independent add/mul, which must already be supported. If *Bridge has enough bandwidth for that case, FMA doesn't change things. If it doesn't, it probably should have more bandwidth regardless of FMA.

Now, could someone point me to a good source on *Bridge cache bottlenecks? I'm considering a new CPU and use it for HPC programming, so it's important.
 
Now, could someone point me to a good source on *Bridge cache bottlenecks? I'm considering a new CPU and use it for HPC programming, so it's important.
I can hardly think of any significant bottlenecks in SNB's memory subsystem. It's virtually the best you can buy now. All the load/store goodies from Nehalem's architecture have been widened and tuned, mostly to support AVX (at least in its current half-baked form). Haswell will bring something truly new.
 
OK, but this will not give you two 256-bit loads and one 256-bit store per cycle, only two 256-bit load/store operations in total (which is still an improvement, albeit not a very big one), since the "cheap trick" of Sandy Bridge (only one AGU needed for a 256-bit load, which requires a 2x128-bit transfer) isn't going to work.
Good point. It wouldn't double the peak bandwidth if they doubled the port widths but stuck with two AGUs. It might suffice though, if bank conflicts limit the number of dual loads anyhow. Also I do think two 256-bit loads or one 256-bit load + one 256-bit store is still way better than one 256-bit load + one 128-bit store. Note that the latter case requires a fixed best case load/store ratio to achieve peak bandwidth, while the former is flexible and thus adapts much better to immediate needs.
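To illustrate the ratio point with a couple of toy kernels (my own examples, plain AVX intrinsics, n assumed to be a multiple of 8): a straight copy wants a 1:1 load/store mix, a scaled add wants 2:1, so ports that can serve any two 256-bit accesses per cycle adapt to both, while a fixed 256-bit-load + 128-bit-store split only matches one particular mix.

#include <immintrin.h>
#include <stddef.h>

/* copy: one 256-bit load and one 256-bit store per iteration (1:1 mix) */
void copy256(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; i += 8)
        _mm256_storeu_ps(dst + i, _mm256_loadu_ps(src + i));
}

/* scaled add: two 256-bit loads and one 256-bit store per iteration (2:1 mix) */
void triad256(float *dst, const float *a, const float *b, float s, size_t n)
{
    __m256 vs = _mm256_set1_ps(s);
    for (size_t i = 0; i < n; i += 8) {
        __m256 r = _mm256_add_ps(_mm256_loadu_ps(a + i),
                                 _mm256_mul_ps(vs, _mm256_loadu_ps(b + i)));
        _mm256_storeu_ps(dst + i, r);
    }
}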

On the other hand three AGUs could also help scalar IPC, which is particularly interesting for the low power parts. It would also allow reducing the queue sizes, which helps recover the cost of a third AGU. So it will be interesting to see what Intel opts for.
 
I think I was still unclear. What I meant is: FMA for dependent ops requires no more memory bandwidth than independent add/mul, which must already be supported.
I'm not sure if you realize Haswell will have two FMA units. Indeed a dependent MUL/ADD doesn't use more bandwidth than a single FMA, but Haswell will be capable of two FMA per cycle. So for peak throughput it needs twice the bandwidth.
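A back-of-the-envelope way to see it (my own numbers, single precision, and the Haswell figures are of course assumptions): doubling the load width merely keeps the bytes-per-FLOP ratio where Sandy Bridge already has it.

#include <stdio.h>

int main(void)
{
    /* Sandy Bridge peak: 8-wide mul + 8-wide add = 16 SP FLOP/cycle,
     * fed by one 256-bit (32-byte) load per cycle. */
    double snb_flops = 16.0, snb_load_bytes = 32.0;

    /* Haswell, as assumed here: 2 x 8-wide FMA = 32 SP FLOP/cycle,
     * fed by two 256-bit loads (64 bytes) per cycle. */
    double hsw_flops = 32.0, hsw_load_bytes = 64.0;

    printf("SNB:     %.1f load bytes per FLOP\n", snb_load_bytes / snb_flops);
    printf("Haswell: %.1f load bytes per FLOP\n", hsw_load_bytes / hsw_flops);
    return 0;
}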
Now, could someone point me to a good source on *Bridge cache bottlenecks? I'm considering a new CPU and use it for HPC programming, so it's important.
What's your current CPU? You may just want to wait for Haswell.
 