As I've said: I am aware that the system build contributes a good deal to the efficiency. And of course the Top500 counts all the available computing resources. If a certain number of nodes exist only for synchronization, that's exactly what this is all about. For matrix operations there isn't if you compare it with Fermi; as I said, LOEWE-CSC is claiming 70% efficiency for that HPL benchmark on their website.
Edit:
I may have stumbled over at least one source of the discrepancy. In the Top500 list, the theoretical peak is for the full machine, which probably also includes the 40 quad-CPU/48-core nodes and the 24 dual-CPU/24-core nodes, both types without a GPU (reserved for jobs that have no use for GPUs). The actual benchmark run and the efficiency numbers I cited were done only on the part with GPUs.
You have to have them in order to reach the system-wide level of performance you want to achieve, even if they only organize the data stream or provide user interfaces. You have to pay for them, and you have to provide power for them (thus paying again). And if your system is particularly hard to optimize for, that's also a problem you'll have to deal with over and over again during its lifetime, because this optimization has to be done for each kind of workload, probably for each and every algorithm.
So they developed a method to hide DMA transfer time almost completely and ran HPL on a portion of LOEWE-CSC with several hundred nodes, using their specifically optimized code and achieving 70% of theoretical peak. They don't say, though, whether this is purely with the GPUs or with the full computing resources of the nodes. Roughly 39% of LOEWE-CSC's compute power comes from Opteron CPUs; only 61% is provided by Cypress GPUs.
LOEWE-CSC said (translated from German): To fully exploit the GPU performance, a special method was implemented that hides the transfer times almost completely. A parallel Linpack run on several hundred GPU nodes achieves [...] by far [...], which often reach only 50%.
So, if that 70% efficiency is system-wide, the efficiency of the GPU portion must be significantly lower, because the CPUs provide roughly 39% of the resources and those are able to work at mid-to-high eighties at least. If it's GPU-only, they are still a far cry from homogeneous HPC systems with over 90% - better than Fermi, yes, but I'd still consider the lower efficiency a common architectural trait.
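To put a rough number on the system-wide case, here is a minimal sketch of the arithmetic. It assumes the overall efficiency is simply the peak-weighted average of the CPU and GPU efficiencies, and it takes 85% and 88% as stand-ins for "mid-to-high eighties"; the function name and those two values are my own illustration, not anything LOEWE-CSC published.

```python
def gpu_efficiency(system_eff, cpu_share, cpu_eff):
    """Solve system_eff = cpu_share * cpu_eff + gpu_share * gpu_eff for gpu_eff.

    Assumes overall efficiency is the peak-weighted average of both parts.
    """
    gpu_share = 1.0 - cpu_share
    return (system_eff - cpu_share * cpu_eff) / gpu_share


if __name__ == "__main__":
    # 70% claimed overall, 39% of peak from the Opterons (numbers from the post).
    # The CPU efficiencies below are assumed values for "mid-to-high eighties".
    for cpu_eff in (0.85, 0.88):
        gpu_eff = gpu_efficiency(0.70, 0.39, cpu_eff)
        print(f"CPU at {cpu_eff:.0%} -> implied GPU efficiency {gpu_eff:.1%}")
```

Under those assumptions the GPU part would land at roughly 60% or a bit below, which is the "significantly lower" I mean.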