Intel Haswell-E Enthusiast Halo Platform

I just Googled some info about the #1 computer on the November 1996 Top500 list, CP-PACS/2048, to see how it compares to current desktop computers.

CP-PACS/2048 is a custom-designed extension of Hitachi's SR2201 system. It has 2048 CPUs, running at 150 MHz each, for a peak performance of 614.4 GFLOPS; its LINPACK performance is 368.2 GFLOPS. According to a test (here), Haswell can do 177.1 GFLOPS with 4 cores running @ 3.4 GHz, so Haswell-E should have no problem exceeding 368.2 GFLOPS @ 4 GHz.
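Just to sanity-check those numbers, here's a back-of-the-envelope sketch; the 2 FLOPs/cycle for a CP-PACS CPU and the 16 double-precision FLOPs/cycle for a Haswell core (two 4-wide FMA units via AVX2) are my assumptions:

```python
# Peak GFLOPS = cores x clock (GHz) x FLOPs per cycle per core
def peak_gflops(cores, clock_ghz, flops_per_cycle):
    return cores * clock_ghz * flops_per_cycle

# CP-PACS/2048: 2048 CPUs at 150 MHz, assuming 2 FLOPs/cycle each
print(peak_gflops(2048, 0.150, 2))  # 614.4 -- matches the quoted peak

# Hypothetical 8-core Haswell-E at 4 GHz, assuming 16 DP FLOPs/cycle
print(peak_gflops(8, 4.0, 16))      # 512.0 peak, so >368.2 sustained looks plausible
```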

This supercomputer has 128 GB of RAM in total, much more than current desktop computers. Its storage system is much more modest, though: only 529 GB in a RAID-5. Most desktop computers have larger HDDs today (though probably not in RAID-5).

This computer remained on the Top500 list until 06/2003, when it ranked number 302, and that's only 10 years ago.
 

How much memory does Haswell-E support? A quick look at a price comparison engine says I can buy 8×8 GB = 64 GB of DDR3 for less than €400. Less than 128 GB, but only by a factor of 2.

I suspect the supercomputer had higher aggregate bandwidth, though.
 

I think DDR4 is one DIMM per channel, and right now the largest sample we have is 16 GB per DIMM (expected in 2014). So for a single-CPU, quad-channel system like Haswell-E, the largest memory capacity is likely to be "only" 64 GB at first. :)
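If that holds, the ceiling is easy to work out; the quad-channel controller and one-DIMM-per-channel limit are the assumptions here:

```python
# Max capacity = channels x DIMMs per channel x GB per DIMM (all assumed figures)
channels, dimms_per_channel, gb_per_dimm = 4, 1, 16
print(channels * dimms_per_channel * gb_per_dimm, "GB")  # 64 GB
```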

The aggregate interconnect bandwidth for the supercomputer is something like 600 GB/s bi-directional (each node has 300 MB/s bi-directional; 2048 × 300 MB/s ≈ 614 GB/s). However, this is not comparable to a single-CPU system like Haswell-E: with only 8 cores, it requires much less inter-core traffic.
 

And much of said traffic will go through the very fast ring bus in the L3 anyway.

It's also amusing to note that the relatively low-end Radeon HD 7770 pulls 1.28 TFLOPS single-precision (specs): 640 shaders × 1 GHz × 2 FLOPs per clock.

I wonder how single-threaded performance would compare. I'm sure GCN's IPC would be much lower, but then again it runs at over 6 times the CP-PACS's clock speed.
 
It has the same number of PCIe lanes, so no actual change on that front...

"Up to" is weird, but even if it is 40 lanes on all CPUs some of the lanes may be wired to SATA 6Gb controllers, USB3, even Thunderbolt if they find a way to include it.
 
Peripheral controllers should be hooked up to southbridge lanes; besides, how damn many SATA connectors do people really need...? There are 8 ports standard on high-end mobos, and even my uATX board has six connectors. It's hard to see a legitimate need for much more than that.
 
> I wonder how single-threaded performance would compare. I'm sure GCN's IPC would be much lower, but then again it runs at over 6 times the CP-PACS's clock speed.

This is an interesting question. However, I'd guess that for "branchy" code the GPU will lose badly even if its clock speed is much higher.
 
> Peripheral controllers should be hooked up to southbridge lanes; besides, how damn many SATA connectors do people really need...? There are 8 ports standard on high-end mobos, and even my uATX board has six connectors. It's hard to see a legitimate need for much more than that.

The link between the CPU and the southbridge is slow: it's 2 GB/s, at least on sockets 1155, 2011, and 1150. You can almost realistically saturate this with storage, networking, and peripherals now (SSD arrays or a couple of 500 MB/s drives, USB 3 video adapters, etc.).

Now Intel is willing to dedicate one or two PCIe lanes just to one SSD; that's the SATA Express standard. 10 Gb Ethernet will eventually find its way in too.
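To put that 2 GB/s in perspective, here's a rough tally for a loaded system; the per-device rates are ballpark assumptions, not measurements:

```python
# Rough southbridge traffic vs. the ~2 GB/s DMI budget (all figures in MB/s)
dmi_budget = 2000
devices = {
    "4x SATA SSDs (~500 MB/s each)": 4 * 500,
    "Gigabit Ethernet": 125,
    "USB 3.0 device (~400 MB/s)": 400,
}
total = sum(devices.values())
print(f"{total} MB/s demanded vs {dmi_budget} MB/s available")  # 2525 vs 2000
```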
 
> I wonder how single-threaded performance would compare. I'm sure GCN's IPC would be much lower, but then again it runs at over 6 times the CP-PACS's clock speed.

GCN would totally lose in single-threaded performance. It's not just IPC: the SIMD system takes a bundle (wavefront) of 64 work items and runs it pipelined over 4 cycles on 16 units. This means that in a 1 GHz GCN chip, a single lane acts as if it ran at 250 MHz. And on top of that, that 250 MHz "CPU" runs 10 separate threads and needs to switch often to hide memory latency. In practice, the single-threaded speed of a modern GCN chip is somewhere near a 50-100 MHz P5.

The speed comes from running 12800 work items at a time (Tahiti).
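A quick sketch of that effective-clock argument, using the figures above (the 10-way sharing is GCN's per-SIMD wavefront maximum, which a loaded chip won't always hit):

```python
# A 64-wide wavefront issues over 4 cycles on a 16-lane SIMD,
# so a single lane effectively runs at 1/4 of the core clock.
gpu_clock_mhz = 1000
effective_lane_mhz = gpu_clock_mhz / 4    # 250 MHz per lane
# If all 10 wavefront slots on the SIMD are busy, one thread sees ~1/10 of that.
shared_mhz = effective_lane_mhz / 10      # ~25 MHz
print(effective_lane_mhz, shared_mhz)     # 250.0 25.0
```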
 
P5 would definitely beat a single GCN thread (at the same clock). But GCN can execute some pretty complex instructions, like a multiply-add, in four cycles. If I remember correctly, P5 IMUL alone was 10+ cycles; back in those days only the simplest instructions (adds, shifts, etc.) were single-cycle. If you are only running a single thread on GCN (others masked out), you could also use the scalar unit to boost the execution rate to one instruction per 2 cycles. One FP32 multiply-add per two cycles is not that bad compared against the oldies.

GCN has proper L1 and L2 caches as well, so it doesn't require as much latency hiding as the last generation of GPUs. And P5 was an in-order architecture too, so it couldn't hide memory latency very well either.
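Taking those cycle counts at face value (they're from memory, so treat them as rough), the raw issue rates compare like this:

```python
# Issue rate in millions of instructions/s = clock (MHz) / cycles per instruction
def m_instr_per_sec(clock_mhz, cycles_per_instr):
    return clock_mhz / cycles_per_instr

print(m_instr_per_sec(1000, 2))  # GCN at 1 GHz, one instruction per 2 cycles: 500M/s
print(m_instr_per_sec(100, 10))  # ~100 MHz P5, 10+-cycle IMUL: ~10M multiplies/s
```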
 