Intel has shown off the Xeon Phi chip pushing 1 teraflops at double-precision math on the DGEMM matrix math benchmark. That's only one card running one benchmark, and the real test is how a server cluster equipped with hundreds or thousands of Xeon Phi coprocessors will do on Linpack and other benchmarks, and then on real workloads. At ISC on Monday, Intel showed off a single Knights Corner coprocessor running Linpack at 1 teraflops peak (not sustained) performance.
To counter the MIC skeptics, Intel slapped together a cluster called "Discovery", which came in at number 150 on the latest Top 500 rankings. This machine uses eight-core Xeon E5-2670 processors running at 2.6GHz in its two-socket server nodes. The nodes are lashed together with 56Gb/sec FDR InfiniBand cards and switches, and have Knights Corner coprocessors dropped into the servers as well. The exact feeds and speeds of the Discovery cluster were not divulged, but the machine has a total of 9,800 processor cores and a source at Intel tells El Reg the node count is "significantly lower than 100 nodes."
If you play around with some numbers (and El Reg can't resist) and assume each server node has two Xeon E5-2670s plus two Knights Corner coprocessors with 54 cores activated apiece, you can get 9,796 cores across 79 server nodes. At 1 teraflops per card, that would be 158 teraflops of raw peak Linpack performance from the aggregate MIC cards, and another 26.3 teraflops peak from the 1,264 Xeon cores.
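For anyone who wants to check the napkin math, here is a minimal sketch of that guess in Python. The node count, cards per node, and 54 active cores are the guesses above, not anything Intel has confirmed. The 1 teraflops per card comes from Intel's single-card demo, and the eight double-precision flops per clock per Xeon core is an assumption based on the AVX units in the E5-2670.

```python
# Back-of-envelope guess at the Discovery cluster. None of these
# per-node figures were divulged by Intel; they are assumptions.
NODES = 79                   # guessed node count ("significantly lower than 100")
MIC_PER_NODE = 2             # assumed Knights Corner cards per node
MIC_CORES = 54               # assumed active cores per card
XEON_PER_NODE = 2            # two-socket nodes
XEON_CORES = 8               # Xeon E5-2670 is an eight-core part
XEON_GHZ = 2.6               # Xeon E5-2670 clock speed
XEON_FLOPS_PER_CYCLE = 8     # assumption: AVX double precision per core
MIC_TFLOPS_PER_CARD = 1.0    # assumption: peak per card, as in the demo


def estimate_cluster(nodes=NODES):
    """Return core counts and peak teraflops for the guessed configuration."""
    mic_cards = nodes * MIC_PER_NODE
    mic_cores = mic_cards * MIC_CORES
    xeon_cores = nodes * XEON_PER_NODE * XEON_CORES
    mic_tflops = mic_cards * MIC_TFLOPS_PER_CARD
    # cores * GHz * flops per cycle gives gigaflops; divide by 1,000 for teraflops
    xeon_tflops = xeon_cores * XEON_GHZ * XEON_FLOPS_PER_CYCLE / 1000.0
    return {
        "mic_cards": mic_cards,
        "xeon_cores": xeon_cores,
        "total_cores": mic_cores + xeon_cores,
        "mic_tflops": mic_tflops,
        "xeon_tflops": xeon_tflops,
        "total_tflops": mic_tflops + xeon_tflops,
    }


if __name__ == "__main__":
    for name, value in estimate_cluster().items():
        print(f"{name}: {value:,.1f}" if isinstance(value, float) else f"{name}: {value:,}")
```

Changing `NODES` or `MIC_CORES` shows how quickly the core count drifts away from the 9,800 on the Top 500 list, which is why 79 nodes looks like a plausible guess.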
That combined guess of roughly 184.3 teraflops jibes almost perfectly with the peak performance Intel reported for the Discovery cluster, which came in at 180.99 teraflops peak and 118.6 teraflops sustained on the Linpack test.
The important thing is that with what we presume were 158 MIC cards, Intel was able to get a computational efficiency of 65.5 per cent (meaning that was the share of cycles across the ceepie-phibie that could have done work and actually did) and only burned 100.8 kilowatts.
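The efficiency figure is simply sustained Linpack divided by peak. A short sketch using only the reported numbers above shows the sum, along with the performance per watt that falls out of the same figures. The function names are ours, for illustration only.

```python
# Reported Discovery figures from the Top 500 entry, as cited above.
RPEAK_TFLOPS = 180.99   # peak Linpack performance
RMAX_TFLOPS = 118.6     # sustained Linpack performance
POWER_KW = 100.8        # power burned during the run


def efficiency_pct(rmax_tflops, rpeak_tflops):
    """Share of peak flops actually delivered, as a percentage."""
    return 100.0 * rmax_tflops / rpeak_tflops


def mflops_per_watt(rmax_tflops, power_kw):
    """Sustained megaflops per watt (1 teraflops = 1e6 megaflops)."""
    return (rmax_tflops * 1e6) / (power_kw * 1e3)


if __name__ == "__main__":
    print(f"efficiency: {efficiency_pct(RMAX_TFLOPS, RPEAK_TFLOPS):.1f} per cent")
    print(f"megaflops per watt: {mflops_per_watt(RMAX_TFLOPS, POWER_KW):,.0f}")
```

Note that the efficiency is measured against Intel's reported 180.99 teraflops peak, not against our guessed configuration.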