Intel Haswell-E Enthusiast Halo Platform

I just Googled some info about the #1 computer on the November 1996 Top500 list, CP-PACS/2048, to see how it compares to current desktop computers.

CP-PACS/2048 is a custom-designed extension of Hitachi's SR2201 system. It has 2048 CPUs, running at 150 MHz each, for a peak performance of 614.4 GFLOPS; its LINPACK performance is 368.2 GFLOPS. According to a test (here), Haswell can do 177.1 GFLOPS with 4 cores running @ 3.4 GHz, so Haswell-E should have no problem exceeding 368.2 GFLOPS @ 4 GHz.
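Just to sanity-check those numbers, here's a back-of-the-envelope sketch; the 2 FLOPs/cycle for a CP-PACS CPU and the 16 double-precision FLOPs/cycle for a Haswell core (two 4-wide FMA units via AVX2) are my assumptions:

```python
# Peak GFLOPS = cores x clock (GHz) x FLOPs per cycle per core
def peak_gflops(cores, clock_ghz, flops_per_cycle):
    return cores * clock_ghz * flops_per_cycle

# CP-PACS/2048: 2048 CPUs at 150 MHz, assuming 2 FLOPs/cycle each
print(peak_gflops(2048, 0.150, 2))  # 614.4 -- matches the quoted peak

# Hypothetical 8-core Haswell-E at 4 GHz, assuming 16 DP FLOPs/cycle
print(peak_gflops(8, 4.0, 16))      # 512.0 peak, so >368.2 sustained looks plausible
```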

This supercomputer has 128 GB of RAM in total, much more than current desktop computers. Its storage system is much more modest, though: only 529 GB in a RAID-5. Most desktop computers have larger HDDs today (though probably not in RAID-5).

This computer remained on the Top500 list until 06/2003, when it ranked number 302, and that's only 10 years ago.
 

How much memory does Haswell-E support? A quick look at a price comparison engine says I can buy 8×8 GB = 64 GB of DDR3 for less than €400. Less than 128 GB, but only by a factor of 2.

I suspect the supercomputer had higher aggregate bandwidth, though.
 

I think DDR4 is one DIMM per channel, and right now the largest sample we have is 16 GB per DIMM (expected in 2014). So for a single-CPU, quad-channel system like Haswell-E, the largest memory capacity is likely to be "only" 64 GB at first. :)
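If that holds, the ceiling is easy to work out; the quad-channel controller and one-DIMM-per-channel limit are the assumptions here:

```python
# Max capacity = channels x DIMMs per channel x GB per DIMM (all assumed figures)
channels, dimms_per_channel, gb_per_dimm = 4, 1, 16
print(channels * dimms_per_channel * gb_per_dimm, "GB")  # 64 GB
```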

The aggregate interconnect bandwidth for the supercomputer is something like 600 GB/s bi-directional (each node has 300 MB/s bi-directional; 2048 × 300 MB/s ≈ 614 GB/s). However, this is not comparable to a single-CPU system like Haswell-E: with only 8 cores, it requires much less inter-core traffic.
 

And much of said traffic will go through the very fast ring bus in the L3 anyway.

It's also amusing to note that the relatively low-end Radeon HD 7770 pulls 1.28 TFLOPS single-precision (specs): 640 shaders × 1 GHz × 2 FLOPs per clock.

I wonder how single-threaded performance would compare. I'm sure GCN's IPC would be much lower, but then again it runs at over 6 times the CP-PACS's clock speed.
 
It has the same number of PCIe lanes, so no actual change on that front...

"Up to" is weird, but even if it is 40 lanes on all CPUs some of the lanes may be wired to SATA 6Gb controllers, USB3, even Thunderbolt if they find a way to include it.
 
Peripheral controllers should be hooked up to southbridge lanes; besides, how damn many SATA connectors do people really need...? There are 8 ports standard on high-end mobos, and even my uATX board has six connectors. It's hard to see a legitimate need for much more than that.
 
> I wonder how single-threaded performance would compare. I'm sure GCN's IPC would be much lower, but then again it runs at over 6 times the CP-PACS's clock speed.

This is an interesting question. However, I'd guess that for "branchy" code the GPU will lose badly even if its clock speed is much higher.
 
> Peripheral controllers should be hooked up to southbridge lanes; besides, how damn many SATA connectors do people really need...? There are 8 ports standard on high-end mobos, and even my uATX board has six connectors. It's hard to see a legitimate need for much more than that.

The link between the CPU and the southbridge is slow: it's 2 GB/s, at least on sockets 1155, 2011, and 1150. You can almost realistically saturate this with storage, networking, and peripherals now (SSD arrays or a couple of 500 MB/s drives, USB 3 video adapters, etc.).

Now Intel is willing to dedicate one or two PCIe lanes just to one SSD; that's the SATA Express standard. 10 Gb Ethernet will eventually find its way in too.
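To put that 2 GB/s in perspective, here's a rough tally for a loaded system; the per-device rates are ballpark assumptions, not measurements:

```python
# Rough southbridge traffic vs. the ~2 GB/s DMI budget (all figures in MB/s)
dmi_budget = 2000
devices = {
    "4x SATA SSDs (~500 MB/s each)": 4 * 500,
    "Gigabit Ethernet": 125,
    "USB 3.0 device (~400 MB/s)": 400,
}
total = sum(devices.values())
print(f"{total} MB/s demanded vs {dmi_budget} MB/s available")  # 2525 vs 2000
```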
 
> I wonder how single-threaded performance would compare. I'm sure GCN's IPC would be much lower, but then again it runs at over 6 times the CP-PACS's clock speed.

GCN would totally lose in single-threaded performance. It's not just IPC: the SIMD system takes a bundle (wavefront) of 64 work items and runs it pipelined over 4 cycles on 16 units. This means that in a 1 GHz GCN chip, a single lane acts as if it ran at 250 MHz. And on top of that, that 250 MHz "CPU" runs 10 separate threads and needs to switch often to hide memory latency. In practice, the single-threaded speed of a modern GCN chip is somewhere near a 50-100 MHz P5.

The speed comes from running 12800 work items at a time (Tahiti).
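A quick sketch of that effective-clock argument, using the figures above (the 10-way sharing is GCN's per-SIMD wavefront maximum, which a loaded chip won't always hit):

```python
# A 64-wide wavefront issues over 4 cycles on a 16-lane SIMD,
# so a single lane effectively runs at 1/4 of the core clock.
gpu_clock_mhz = 1000
effective_lane_mhz = gpu_clock_mhz / 4    # 250 MHz per lane
# If all 10 wavefront slots on the SIMD are busy, one thread sees ~1/10 of that.
shared_mhz = effective_lane_mhz / 10      # ~25 MHz
print(effective_lane_mhz, shared_mhz)     # 250.0 25.0
```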
 
P5 would definitely beat a single GCN thread (at the same clock). But GCN can execute some pretty complex instructions, like a multiply-add, in four cycles. If I remember correctly, P5 IMUL alone was 10+ cycles; back in those days only the simplest instructions (adds, shifts, etc.) were single-cycle. If you are only running a single thread on GCN (others masked out), you could also use the scalar unit to boost the execution rate to one instruction per 2 cycles. One FP32 multiply-add per two cycles is not that bad compared against the oldies.

GCN has proper L1 and L2 caches as well, so it doesn't require as much latency hiding as the last generation of GPUs. And P5 was an in-order architecture too, so it couldn't hide memory latency very well either.
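Taking those cycle counts at face value (they're from memory, so treat them as rough), the raw issue rates compare like this:

```python
# Issue rate in millions of instructions/s = clock (MHz) / cycles per instruction
def m_instr_per_sec(clock_mhz, cycles_per_instr):
    return clock_mhz / cycles_per_instr

print(m_instr_per_sec(1000, 2))  # GCN at 1 GHz, one instruction per 2 cycles: 500M/s
print(m_instr_per_sec(100, 10))  # ~100 MHz P5, 10+-cycle IMUL: ~10M multiplies/s
```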
 