Old 10-Jul-2012, 17:36   #26
CRoland
Member
 
Join Date: Jan 2010
Posts: 114

Quote:
Originally Posted by Nick View Post
Sure it does. Why would processing code with many multiplications and additions almost twice as fast not require twice the bandwidth?
Because when calculating a (long) dot product, the data you add is already in a register, as it is the result of a previous FMA. I.e. you only need ~2 loads per instruction, just like for an (independent) add or mul instruction.
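A rough sketch of the counting behind this argument (Python, purely illustrative; the function names are made up for this example). It assumes the accumulator never leaves a register and counts only loads, not stores:

```python
def dot_product_counts(n, fused):
    """Count loads and arithmetic instructions for an n-element dot product.

    Assumption: the running sum stays in a register, so each element costs
    two loads (a[i] and b[i]) whether or not the multiply and add are fused.
    """
    loads = 2 * n
    arith = n if fused else 2 * n  # one FMA, or one mul plus one dependent add
    return loads, arith


def elementwise_counts(n):
    """Count loads and arithmetic instructions for c[i] = a[i] + b[i]."""
    return 2 * n, n  # two loads and one independent add per element


def loads_per_instruction(counts):
    loads, arith = counts
    return loads / arith


n = 1024
print("independent add :", loads_per_instruction(elementwise_counts(n)))
print("dot, mul + add  :", loads_per_instruction(dot_product_counts(n, fused=False)))
print("dot, FMA        :", loads_per_instruction(dot_product_counts(n, fused=True)))
```

Under these assumptions the FMA dot product needs two loads per arithmetic instruction, the same as an independent add, while the unfused mul + add version needs one.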

Quote:
Originally Posted by Nick View Post
Note again that Sandy/Ivy Bridge are already running into severe bandwidth bottlenecks, before FMA or 256-bit integer operations. And even twice the bandwidth isn't excessive. Haswell should have three 256-bit vector ALUs with three input operands. That's a peak of nine input and three output operands per cycle, so two 256-bit memory read ports and one 256-bit write port is a very safe assumption.
I'm not disputing *Bridge bottlenecks, since I haven't profiled those. I'm just saying the 2x throughput from FMA isn't like normal 2x throughput from additional or wider units, in that it doesn't shift the bottleneck to data paths to the same degree.
Old 10-Jul-2012, 20:27   #27
Nick
Senior Member
 
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783

Quote:
Originally Posted by CRoland View Post
Because when calculating a (long) dot product, the data you add is already in a register, as it is the result of a previous FMA. I.e. you only need ~2 loads per instruction, just like for an (independent) add or mul instruction.
You can only compare equivalent code; each dependent mul and add instruction gets replaced by a single FMA instruction. The memory bandwidth will have to double to sustain the peak throughput.
Quote:
I'm not disputing *Bridge bottlenecks, since I haven't profiled those. I'm just saying the 2x throughput from FMA isn't like normal 2x throughput from additional or wider units, in that it doesn't shift the bottleneck to data paths to the same degree.
I agree that, in practice, FMA doesn't typically double the throughput since only a portion of arithmetic code consists of pairs of dependent mul and add instructions. But still, for peak performance a doubling of the bandwidth is a necessity. And as explained, increasing the bandwidth was already overdue so that doubling it won't be excessive in practice.

I was arguing that Haswell is extremely likely to get twice the load/store width, for a combination of reasons, not that FMA would be the only reason.
Old 10-Jul-2012, 21:53   #28
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,437

Quote:
Originally Posted by Nick View Post
I was arguing that Haswell is extremely likely to get twice the load/store width, for a combination of reasons, not that FMA would be the only reason.
I guess that makes sense, even if I'm not quite sure how severe the bandwidth bottleneck really is with Sandy Bridge already. Though if the chip were to sustain two 256-bit loads and one 256-bit store per clock, it would also need a third AGU (if it still resembles Core 2, that might mean another execution port). I guess another possibility to get more load bandwidth would be to keep the two AGUs but beef the transfers up to 256 bits, allowing either two 256-bit loads or one 256-bit load and one 256-bit store per cycle.
I guess you suggest more banks (or larger banks), and hence larger cache lines (and probably a larger cache size), because sticking with eight 8-byte banks would probably mean it can't really reach the max throughput due to likely bank conflicts?
Old 11-Jul-2012, 03:47   #29
Nick
Senior Member
 
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783

Quote:
Originally Posted by mczak View Post
I guess that makes sense, even if I'm not quite sure how severe the bandwidth bottleneck really is with Sandy Bridge already. Though if the chip were to sustain two 256-bit loads and one 256-bit store per clock, it would also need a third AGU (if it still resembles Core 2, that might mean another execution port). I guess another possibility to get more load bandwidth would be to keep the two AGUs but beef the transfers up to 256 bits, allowing either two 256-bit loads or one 256-bit load and one 256-bit store per cycle.
Sandy Bridge already has two 128-bit read ports, one 128-bit write port, and two AGUs which can be used either for load or store. So yes the only thing required is to double the port widths.
Quote:
I guess you suggest more banks (or larger banks), and hence larger cache lines (and probably a larger cache size), because sticking with eight 8-byte banks would probably mean it can't really reach the max throughput due to likely bank conflicts?
I wasn't suggesting anything to be honest. I just considered 128-byte cache lines a possibility, that's all. But indeed, avoiding bank conflicts seems like a good reason not to stick with 64-byte cache lines.
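For a feel of the bank-conflict concern, here is a simplified model in Python. Everything in it is an assumption for illustration: banks are 8 bytes wide and interleaved across the line, two same-cycle loads conflict when they need the same bank for different cache lines, and loads are aligned so they never cross a line. The 16-bank case stands in for a hypothetical 128-byte line.

```python
from itertools import product

BANK_BYTES = 8  # assumed bank width


def banks_touched(addr, size, num_banks=8, bank_bytes=BANK_BYTES):
    """Return the set of bank indices a load of `size` bytes at `addr` reads."""
    first = addr // bank_bytes
    last = (addr + size - 1) // bank_bytes
    return {chunk % num_banks for chunk in range(first, last + 1)}


def conflict_rate(size, num_banks, bank_bytes=BANK_BYTES):
    """Fraction of aligned load pairs to two different lines that share a bank."""
    line_bytes = num_banks * bank_bytes
    offsets = range(0, line_bytes, size)
    pairs = list(product(offsets, repeat=2))
    hits = 0
    for off_a, off_b in pairs:
        a = banks_touched(off_a, size, num_banks, bank_bytes)
        # The second load targets the next line, so a shared bank is a conflict.
        b = banks_touched(line_bytes + off_b, size, num_banks, bank_bytes)
        if a & b:
            hits += 1
    return hits / len(pairs)


print("8 banks, 256-bit loads :", conflict_rate(32, 8))
print("16 banks, 256-bit loads:", conflict_rate(32, 16))
```

In this toy model a 256-bit load covers half of the eight banks, so two such loads to different lines collide half of the time; doubling the bank count halves that rate. Real access patterns are not uniformly random, so treat the numbers as a sketch only.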
Old 11-Jul-2012, 11:27   #30
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,437

Quote:
Originally Posted by Nick View Post
Sandy Bridge already has two 128-bit read ports, one 128-bit write port, and two AGUs which can be used either for load or store. So yes the only thing required is to double the port widths.
OK, but this will not give you two 256-bit loads and one 256-bit store per cycle, only two 256-bit load/store operations in total (which is still an improvement, albeit not a very big one), since the "cheap trick" of Sandy (only one AGU needed for a 256-bit load, which requires 2x128-bit transfers) isn't going to work.

Quote:
I wasn't suggesting anything to be honest. I just considered 128-byte cache lines a possibility, that's all. But indeed, avoiding bank conflicts seems like a good reason not to stick with 64-byte cache lines.
I have to agree it makes sense. But who knows what crazy ideas Intel has. And that Haswell screenshot looked OK to me, though it doesn't mean much...
Old 11-Jul-2012, 13:20   #31
CRoland
Member
 
Join Date: Jan 2010
Posts: 114

Quote:
Originally Posted by Nick View Post
You can only compare equivalent code; each dependent mul and add instruction gets replaced by a single FMA instruction. The memory bandwidth will have to double to sustain the peak throughput.
I think I was still unclear. What I meant is: FMA for dependent ops requires no more memory bandwidth than independent add/mul, which must already be supported. If *Bridge has enough bandwidth for that case, FMA doesn't change things. If it doesn't, it probably should have more bandwidth regardless of FMA.

Now, could someone point me to a good source on *Bridge cache bottlenecks? I'm considering a new CPU and will use it for HPC programming, so it's important.
Old 11-Jul-2012, 14:09   #32
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 2,819

Quote:
Originally Posted by CRoland View Post
Now, could someone point me to a good source on *Bridge cache bottlenecks? I'm considering a new CPU and will use it for HPC programming, so it's important.
I can hardly think of any significant bottlenecks in SNB's memory subsystem. It's virtually the best you can buy now. All the load/store goodies from Nehalem's architecture have been widened and tuned, mostly to support AVX (at least in its current half-baked form). Haswell will bring something truly new.
Old 11-Jul-2012, 15:22   #33
Nick
Senior Member
 
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783

Quote:
Originally Posted by mczak View Post
OK, but this will not give you two 256-bit loads and one 256-bit store per cycle, only two 256-bit load/store operations in total (which is still an improvement, albeit not a very big one), since the "cheap trick" of Sandy (only one AGU needed for a 256-bit load, which requires 2x128-bit transfers) isn't going to work.
Good point. It wouldn't double the peak bandwidth if they doubled the port widths but stuck with two AGUs. It might suffice though, if bank conflicts limit the number of dual loads anyhow. Also I do think two 256-bit loads or one 256-bit load + one 256-bit store is still way better than one 256-bit load + one 128-bit store. Note that the latter case requires a fixed best case load/store ratio to achieve peak bandwidth, while the former is flexible and thus adapts much better to immediate needs.

On the other hand three AGUs could also help scalar IPC, which is particularly interesting for the low power parts. It would also allow reducing the queue sizes, which helps recover the cost of a third AGU. So it will be interesting to see what Intel opts for.
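The three options discussed in this thread can be compared with a back-of-the-envelope model. This is a sketch in Python under loud assumptions: port widths and AGU counts are the ones quoted in the thread, every memory operation is 256 bits wide, each AGU issues one address per cycle, and peak bandwidth is simply the smaller of what the ports can move and what the AGUs can address. The function names are made up for this example.

```python
def peak_bytes_per_cycle(read_ports, write_ports, port_bits, agus, op_bits=256):
    """Peak load + store bytes per cycle in the simplified min() model."""
    data_limit = (read_ports + write_ports) * port_bits // 8
    address_limit = agus * op_bits // 8  # one address per AGU per cycle
    return min(data_limit, address_limit)


def load_only_bytes_per_cycle(read_ports, port_bits, agus, op_bits=256):
    """Same model for a stream with no stores at all."""
    return min(read_ports * port_bits // 8, agus * op_bits // 8)


configs = {
    "Sandy Bridge, 128-bit ports, 2 AGUs": (2, 1, 128, 2),
    "256-bit ports, 2 AGUs": (2, 1, 256, 2),
    "256-bit ports, 3 AGUs": (2, 1, 256, 3),
}

for name, (reads, writes, bits, agus) in configs.items():
    peak = peak_bytes_per_cycle(reads, writes, bits, agus)
    loads = load_only_bytes_per_cycle(reads, bits, agus)
    print(f"{name}: peak {peak} B/cycle, load-only {loads} B/cycle")
```

In this model the first configuration only reaches its 48-byte peak at a 2:1 load/store mix and drops to 32 bytes for pure loads, whereas wider ports with two AGUs give 64 bytes for any mix between all loads and one load plus one store. A third AGU lifts the peak to 96 bytes.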
Old 11-Jul-2012, 15:35   #34
Nick
Senior Member
 
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783

Quote:
Originally Posted by CRoland View Post
I think I was still unclear. What I meant is: FMA for dependent ops requires no more memory bandwidth than independent add/mul, which must already be supported.
I'm not sure if you realize Haswell will have two FMA units. Indeed, a dependent MUL/ADD doesn't use more bandwidth than a single FMA, but Haswell will be capable of two FMAs per cycle. So for peak throughput it needs twice the bandwidth.
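As a per-cycle version of that arithmetic, here is a small Python sketch. It assumes a streaming dot product where each multiply-add result needs two 256-bit loads, a baseline that retires one dependent mul + add pair per cycle, and two FMAs per cycle for Haswell. The baseline rate and the function name are assumptions for illustration, not measured figures.

```python
VECTOR_BYTES = 32  # one 256-bit operand


def load_demand_bytes_per_cycle(results_per_cycle, loads_per_result=2,
                                vector_bytes=VECTOR_BYTES):
    """Load bandwidth needed to keep the arithmetic units fed."""
    return results_per_cycle * loads_per_result * vector_bytes


baseline = load_demand_bytes_per_cycle(1)  # one mul + add pair per cycle
two_fma = load_demand_bytes_per_cycle(2)   # two FMAs per cycle
print(baseline, two_fma, two_fma / baseline)
```

With these inputs the demand doubles, and the doubling comes from the number of multiply-add results per cycle rather than from fusing the two operations.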
Quote:
Now, could someone point me to a good source on *Bridge cache bottlenecks? I'm considering a new CPU and will use it for HPC programming, so it's important.
What's your current CPU? You may just want to wait for Haswell.
Old 11-Jul-2012, 20:30   #35
sebbbi
Member
 
Join Date: Nov 2007
Posts: 943

Quote:
Originally Posted by CRoland View Post
Now, could someone point me to a good source on *Bridge cache bottlenecks? I'm considering a new CPU and will use it for HPC programming, so it's important.
www.agner.org/optimize/microarchitecture.pdf (page 102 has some analysis, and you should check Nehalem section for more analysis, as Sandy is similar in many ways)
