Debunking the 100X GPU vs. CPU myth:...


Jawed
25-Jun-2010, 13:58
an evaluation of throughput computing on CPU and GPU

http://portal.acm.org/citation.cfm?id=1816021&coll=GUIDE&dl=GUIDE&CFID=94608761&CFTOKEN=50783980&ret=1#Fulltext

Recent advances in computing have led to an explosion in the amount of data being generated. Processing the ever-growing data in a timely manner has made throughput computing an important aspect for emerging applications. Our analysis of a set of important throughput computing kernels shows that there is an ample amount of parallelism in these kernels which makes them suitable for today's multi-core CPUs and GPUs. In the past few years there have been many studies claiming GPUs deliver substantial speedups (between 10X and 1000X) over multi-core CPUs on these kernels. To understand where such large performance difference comes from, we perform a rigorous performance analysis and find that after applying optimizations appropriate for both CPUs and GPUs the performance gap between an Nvidia GTX280 processor and the Intel Core i7-960 processor narrows to only 2.5x on average. In this paper, we discuss optimization techniques for both CPU and GPU, analyze what architecture features contributed to performance differences between the two architectures, and recommend a set of architectural features which provide significant improvement in architectural efficiency for throughput kernels.
The PDF is currently available, dunno how long that will last.

Davros
25-Jun-2010, 16:39
the conclusion seems to be you're still better off with a gpu
strange thing for intel to put out

arjan de lumens
25-Jun-2010, 16:58
As it looks to me, several of the algorithms described suffer inefficiencies on the GPU side due to shortcomings in the GTX280 architecture used for comparison: lack of coherency, lack of large on-chip caches (of the algorithms in the paper, "Solv" and "Hist" seem to be particularly hard hit by these shortcomings). It would be interesting to see this exercise repeated on more modern DirectX11-class GPU architectures like Fermi or Evergreen, where these shortcomings are absent or at least much less prominent.

Also, as Ars Technica notes (http://arstechnica.com/business/news/2010/06/intel-scores-own-goal-against-core-i7-in-nvidia-spat.ars), even if the overall speedup when going from Core-i7 to GTX280 is only 2.5x instead of 100x, it's still a major win for the GPU side of the story - and even more so when considering the price of Core-i7 versus GTX280.

3dilettante
25-Jun-2010, 17:23
The conventional wisdom is that a non-standard or exotic architecture needs at least a 10x advantage to justify itself.
There are certain tests where there is at least a 10x advantage, something Fermi could improve on measurably. In those areas, the GPU looks more compelling than in areas where a lower multiplier means that Moore's law or the slow ratcheting down of prices as SKUs age could significantly reduce the gains before the needed software customization and tools are completed.

Andrew Lauritzen
25-Jun-2010, 17:49
strange thing for intel to put out
This isn't marketing... this is a peer-reviewed paper. The guys who did it are some of the best and they optimized both the CPU and GPU implementations very well and presented the results. This is how work *should* be done and results presented everywhere...

The question of which level of performance (CPU or GPU) is easier to come by for varying levels of programmer skill is not clear, but these results definitely represent the "expert" end on both platforms.

Jawed
25-Jun-2010, 18:14
The other factor is AVX is incoming and core counts per x86 socket are now rising healthily - at least for a while.

Essentially you need to be really sure that an application/library/component you move to GPU "now" is going to speed-up according to Moore's Law - as in general it will definitely do so on CPU - over the next few years (unless you'll never want more performance than you can get today :razz: ).

One could argue that CPUs are going to get a double-shot of scaling, AVX and 16-cores per socket, and then that will be it (wild-card being bandwidth per socket). After that the "why more x86 cores?" question gets tough to answer. What do consumer applications want to do that will chew through x86 style cores in preference to GPU cores?

Then, well, there's Fusion and the like - eventually there'll be no choice to make.

Jawed

rpg.314
25-Jun-2010, 18:24
a) AVX's lack of predication and scatter/gather gives it, ironically, a much more restricted programming model than modern GPU's. Frankly they should just ditch AVX for LRBni.

b) Last 4 years we have seen cores/socket growth slower than predicted by moore's law. I am not sure why it would change in the next 2 years.

3dilettante
25-Jun-2010, 18:44
16 per socket could happen with Bulldozer, and perhaps a Sandy Bridge EX variant. The Westmere EX was a notable outlier, with only 10 cores compared to Nehalem EX's 8 but that could be because the higher count is waiting on the new core design.

Unfortunately for Bulldozer, the FP resources at least for the first generation scale at 1/2 per core, and the 16-core variant would be an MCM, with those cores individually being somewhat diminished due to other shared resources.
On the other hand, it would get FMAC, which Sandy Bridge's AVX implementation would not.
On the other other hand, it's going to be an AMD extension, which is going to be ignored by most.


One could argue that these could still scale with Moore's law, just that at their respective nodes they are slightly too expensive for full implementations.
So now it's more of a "scale with Moore's law-n iterations" with n being some offset often ranging between 1 and 1.5, possibly 2 corresponding to some initial one-time implementation cost.

The interesting part is that the offset is much more crucial now than it was previously. At least for silicon, we might be running out of room for n...

Jawed
25-Jun-2010, 18:45
a) AVX's lack of predication and scatter/gather gives it, ironically, a much more restricted programming model than modern GPU's. Frankly they should just ditch AVX for LRBni.
Vectors are much bigger on the GPUs - i.e. they're much more dependent on those things to get anywhere. I'm certainly not saying GPUs don't have an advantage though. Merely that when you're divvying-up the space, AVX puts a sizeable dent into GPU advantages in certain places.

But yeah, LRBni for the win. Pity that Intel appears to intend to put it into the "add-in card ghetto" (not enough RAM, PCI Express barrier...) instead of mainstream x86. Oh well.

b) Last 4 years we have seen cores/socket growth slower than predicted by moore's law.
2 cores -> 12 cores.
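
As a rough check on the arithmetic behind this exchange, here is a minimal sketch. The 2 and 12 core counts and the 4-year span come from the posts above; the two doubling periods are assumptions, since the thread does not say which reading of Moore's law is meant, and nothing here is normalized for price or die area.

```python
def projected_cores(start_cores, years, doubling_months):
    """Core count if cores per socket doubled every `doubling_months`."""
    return start_cores * 2 ** (years * 12 / doubling_months)

# Assumed doubling periods; the thread does not specify one.
for months in (18, 24):
    print(months, round(projected_cores(2, 4, months), 1))
```

With a 24-month doubling, 2 cores would become 8 after four years, so 12 is ahead of the curve; with an 18-month doubling the projection is about 12.7, so 12 is slightly behind. Which side "wins" depends on the assumed period.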

I am not sure why it would change in the next 2 years.
NVidia's certainly struggling. Obviously ATI isn't - yet, anyway.

Jawed

RecessionCone
25-Jun-2010, 22:19
The paper is very short on detail with regards to a lot of the tests they ran. For example, what kind of SpMV did they benchmark - I'm assuming it was Double Precision, since their performance was so low, but what data structure did they use, on what type of Sparse Matrix?

It's also rather cute to compare a 65nm processor released in June 2008 with a 45nm processor released in October 2009. I'd like to see them rerun their tests with Fermi.

Jawed
25-Jun-2010, 22:31
The paper is very short on detail with regards to a lot of the tests they ran. For example, what kind of SpMV did they benchmark - I'm assuming it was Double Precision, since their performance was so low, but what data structure did they use, on what type of Sparse Matrix?

8. SpMV or sparse matrix vector multiplication is at the heart of many iterative solvers. There are several storage formats of sparse matrices, compressed row storage being the most common. Computation in this format is characterized by regular access patterns over non-zero elements and irregular access patterns over the vector, based on column index. When the matrix is large and does not fit into on-die storage, a well optimized kernel is usually bandwidth bound.
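
To make the access patterns in that passage concrete, here is a minimal CSR SpMV sketch in plain Python. It is not the paper's kernel; the array names (`row_ptr`, `col_idx`, `values`) and the 3x3 matrix are illustrative assumptions.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """y = A @ x for A stored in compressed sparse row (CSR) form."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        # Non-zeros of a row are contiguous: regular, streaming access.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # x is indexed through col_idx: irregular, data-dependent access.
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y

# Hypothetical 3x3 matrix:
# [[4, 0, 1],
#  [0, 2, 0],
#  [3, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
print(spmv_csr(row_ptr, col_idx, values, [1.0, 2.0, 3.0]))
```

This prints `[7.0, 4.0, 18.0]`. The inner loop does two flops per non-zero while reading a value, an index and one vector element, which is why the kernel tends to be bandwidth bound once the matrix no longer fits on-die.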

The referred papers contains the best previous reported performance numbers on CPU/GPU platforms. Our optimized performance numbers are at least on par or better than those numbers.

[8] N. Bell and M. Garland. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In SC ’09: Proceedings of the 2009 ACM/IEEE conference on Supercomputing, 2009.
[47] F. Vazquez, E. M. Garzon, J.A.Martinez, and J.J.Fernandez. The sparse matrix vector product on GPUs. Technical report, University of Almeria, June 2009.
[50] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. In SC ’07: Proceedings of the 2007 ACM/IEEE conference on Supercomputing, pages 1–12, New York, NY, USA, 2007. ACM.

It's also rather cute to compare a 65nm processor released in June 2008 with a 45nm processor released in October 2009. I'd like to see them rerun their tests with Fermi.
Fermi wasn't frying bacon in public in time for the paper's submission, back in November last year.

RecessionCone
25-Jun-2010, 23:16
The quote you included from the paper implied that they used CSR, but didn't state it explicitly. If they did use CSR, that's a little disappointing, since it's actually not a very good format for GPUs.

Additionally, they didn't state what the characteristics of the matrix they were testing were. SpMV performance varies tremendously based on the data structure and properties of the matrix. If they used a vector CSR format, there's a good chance that the hybrid ELL/COO format which Bell and Garland propose would have performed significantly (~50%) better.
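
For readers unfamiliar with the hybrid idea, a minimal sketch follows: keep a fixed number of entries per row in a padded ELL block and spill the rest of any long row into a COO list. This is an illustration of the general scheme only, not Bell and Garland's implementation; the function names and the choice `width=2` are assumptions.

```python
def split_hybrid(rows, width):
    """Split per-row (col, val) lists into an ELL part and a COO remainder.

    `width` is the number of entries kept per row in the ELL part; picking
    it is a tuning decision and the value used below is arbitrary.
    """
    ell_cols, ell_vals, coo = [], [], []
    for r, entries in enumerate(rows):
        head, tail = entries[:width], entries[width:]
        pad = width - len(head)
        # Pad short rows so every row has exactly `width` slots.
        ell_cols.append([c for c, _ in head] + [0] * pad)
        ell_vals.append([v for _, v in head] + [0.0] * pad)
        coo.extend((r, c, v) for c, v in tail)
    return ell_cols, ell_vals, coo

def spmv_hybrid(ell_cols, ell_vals, coo, x):
    # ELL part: same trip count for every row, which suits wide SIMD.
    y = [sum(v * x[c] for c, v in zip(cols, vals))
         for cols, vals in zip(ell_cols, ell_vals)]
    # COO part: the few leftover entries from unusually long rows.
    for r, c, v in coo:
        y[r] += v * x[c]
    return y

rows = [[(0, 4.0), (2, 1.0), (3, 2.0)],   # long row: one entry spills to COO
        [(1, 2.0)],
        [(0, 3.0), (2, 5.0)],
        [(3, 1.0)]]
ell_cols, ell_vals, coo = split_hybrid(rows, width=2)
print(coo)
print(spmv_hybrid(ell_cols, ell_vals, coo, [1.0, 2.0, 3.0, 4.0]))
```

The trade-off is padding waste in the ELL block versus irregular work in the COO tail, which is why the best split depends on the row-length distribution of the matrix.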

I'm assuming they used a Double Precision, Vector CSR SpMV for this example, but that's just guesswork. Their assertion that their results are better than the literature is useless without knowing exactly what they did, especially since the literature has such a wide performance range on this benchmark. Bell and Garland show performance ranging from 1 GFLOP/s to 40 GFLOP/s - it all depends on the data structure and properties of the matrix. Without that information, we're kind of left adrift.
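
A back-of-the-envelope bound shows why the datatype matters so much for a bandwidth-bound kernel. The sketch below assumes matrix streaming dominates, 4-byte column indices, and round bandwidth figures chosen for illustration; none of these numbers are taken from the paper.

```python
def spmv_gflops_bound(bandwidth_gb_s, value_bytes, index_bytes=4):
    """Upper bound on SpMV GFLOP/s when matrix streaming dominates.

    Each non-zero costs one value and one column index of memory traffic
    and yields 2 flops (multiply + add). Vector and row-pointer traffic
    are ignored, so real kernels should land below this.
    """
    bytes_per_nonzero = value_bytes + index_bytes
    return 2.0 * bandwidth_gb_s / bytes_per_nonzero

# Illustrative bandwidth figures, not measurements from the paper.
for name, bw in (("GPU-like, 140 GB/s", 140.0), ("CPU-like, 30 GB/s", 30.0)):
    sp = spmv_gflops_bound(bw, value_bytes=4)
    dp = spmv_gflops_bound(bw, value_bytes=8)
    print(f"{name}: SP <= {sp:.1f}, DP <= {dp:.1f} GFLOP/s")
```

Under these assumptions the single-precision ceiling is 1.5x the double-precision one on the same hardware, before any effect of format or matrix structure, so a reported GFLOP/s figure is hard to interpret without knowing which was used.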

I didn't say their paper was broken because they didn't have Fermi in November, I just pointed out that Intel is benefiting here from a 16 month technology development advantage. 16 months is a big deal - the difference between the 5870 and the 4870 was 14 months, which translated into about 60% in performance. My single precision DIA SpMV code runs about 50% faster on Fermi than on GTX280 - and that's without any tuning for Fermi's caches.

Taking a step back, I agree with the paper's conclusion that many GPU papers with 100x speedup results are misleading, and if we're trying to decide what architecture to use for a problem, it's important to normalize the comparisons. I'm just not 100% sure I believe Intel's normalization here, and it's hard to convince myself they did the right thing when the benchmarks were so shallowly described.

[1] http://graphics.cs.uiuc.edu/~wnbell/publications/2009-08-SC-SpMV/sc09-spmv-throughput.pdf

Jawed
26-Jun-2010, 01:21
The quote you included from the paper implied that they used CSR, but didn't state it explicitly. If they did use CSR, that's a little disappointing, since it's actually not a very good format for GPUs.
But apparently it's the most popular and general format: unavoidable. If the GPU wants another format then the cost of translation to/from that format needs to be included in overall performance. Does the hybrid format include that cost?

Additionally, they didn't state what the characteristics of the matrix they were testing were. SpMV performance varies tremendously based on the data structure and properties of the matrix. If they used a vector CSR format, there's a good chance that the hybrid ELL/COO format which Bell and Garland propose would have performed significantly (~50%) better.
Perhaps figure 16 is the locus.

I'm assuming they used a Double Precision, Vector CSR SpMV for this example, but that's just guesswork. Their assertion that their results are better than the literature is useless without knowing exactly what they did, especially since the literature has such a wide performance range on this benchmark. Bell and Garland show performance ranging from 1 GFLOP/s to 40 GFLOP/s - it all depends on the data structure and properties of the matrix. Without that information, we're kind of left adrift.
Resulting in 50 or 100 pages for the entire paper, I suppose.

I didn't say their paper was broken because they didn't have Fermi in November, I just pointed out that Intel is benefiting here from a 16 month technology development advantage. 16 months is a big deal - the difference between the 5870 and the 4870 was 14 months, which translated into about 60% in performance. My single precision DIA SpMV code runs about 50% faster on Fermi than on GTX280 - and that's without any tuning for Fermi's caches.
And Magny-Cours arrived around the same time as Fermi.

Taking a step back, I agree with the paper's conclusion that many GPU papers with 100x speedup results are misleading, and if we're trying to decide what architecture to use for a problem, it's important to normalize the comparisons. I'm just not 100% sure I believe Intel's normalization here, and it's hard to convince myself they did the right thing when the benchmarks were so shallowly described.

[1] http://graphics.cs.uiuc.edu/~wnbell/publications/2009-08-SC-SpMV/sc09-spmv-throughput.pdf
Here's another survey:

http://www.prace-project.eu/documents/christadler.pdf

Jawed

RecessionCone
26-Jun-2010, 05:21
But apparently it's the most popular and general format: unavoidable. If the GPU wants another format then the cost of translation to/from that format needs to be included in overall performance. Does the hybrid format include that cost?
SpMV applications don't often have matrix conversion on the critical path, for several reasons.
1. Most often, the application has to spend quite a bit of computation to derive the matrix. These computations arise from the particular problem, for example in the image processing application I've been working on, approximately 55% of the time is spent deriving a matrix from the image (the other 45% is dominated by SpMV). Once I've gone through all this work to compute the elements of the matrix, converting it to a different data structure is fairly trivial.
2. SpMV applications (linear and non-linear solvers, eigensolvers) require repeated SpMV computations, with the same matrix sequentially applied to 1000s or 100000s of vectors. Converting the matrix involves loading it and storing it from memory a few times - since each SpMV computation necessarily involves loading the entire matrix, the conversion cost is just the cost of a few SpMV computations. This is completely negligible in the overall solver.
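
The amortization argument in point 2 can be put into numbers with a small sketch. Costs are measured in units of one SpMV; the figure of 5 SpMV-equivalents for a conversion is an assumed stand-in for "a few", not a measurement.

```python
def conversion_overhead(num_spmv, conversion_cost_in_spmv):
    """Fraction of total solver time spent converting the matrix format.

    Costs are in units of one SpMV, following the argument that a
    conversion reads and writes the matrix only a few times.
    """
    return conversion_cost_in_spmv / (num_spmv + conversion_cost_in_spmv)

# Assumed: one conversion costs about as much as 5 SpMVs.
for n in (1_000, 100_000):
    print(n, f"{conversion_overhead(n, 5):.4%}")
```

At 1000 iterations the conversion is roughly half a percent of the run, and at 100000 iterations it is around 0.005%, which matches the claim that it disappears in the overall solver.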

Additionally, CSR is actually a non-trivial format to construct - if you have computed the data for each non-zero separately, you have to use scans in order to compute the indices for the CSR data structure. Some of the other SpMV data structures, like DIA and ELL, can be constructed just by scattering data to the appropriate place in memory.
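
The difference between the two construction styles can be sketched as follows. Building CSR needs a per-row count followed by a scan to find where each row starts, whereas a DIA slot follows directly from `(row, col)`. Function names and the tridiagonal test matrix are illustrative assumptions, and the DIA layout shown (one array per diagonal, indexed by row) is only one common convention.

```python
from itertools import accumulate

def coo_to_csr(num_rows, triples):
    """Build CSR arrays from unordered (row, col, val) triples."""
    counts = [0] * num_rows
    for r, _, _ in triples:
        counts[r] += 1
    # Scan of the per-row counts gives each row's start offset.
    row_ptr = [0] + list(accumulate(counts))
    col_idx = [0] * len(triples)
    values = [0.0] * len(triples)
    next_slot = row_ptr[:-1]          # copy: next free slot in each row
    for r, c, v in triples:
        k = next_slot[r]
        col_idx[k], values[k] = c, v
        next_slot[r] += 1
    return row_ptr, col_idx, values

def coo_to_dia(n, triples, offsets):
    """Scatter triples into DIA storage. No scan is needed because the
    destination of each entry is known from (row, col) alone."""
    slot = {d: i for i, d in enumerate(offsets)}
    data = [[0.0] * n for _ in offsets]
    for r, c, v in triples:
        data[slot[c - r]][r] = v
    return data

# Hypothetical 3x3 tridiagonal matrix, entries in arbitrary order.
triples = [(2, 2, 2.0), (0, 0, 2.0), (1, 0, -1.0), (0, 1, -1.0),
           (1, 1, 2.0), (2, 1, -1.0), (1, 2, -1.0)]
print(coo_to_csr(3, triples)[0])
print(coo_to_dia(3, triples, offsets=[-1, 0, 1]))
```

On a parallel machine the scan step is the part that needs a cooperative primitive, while the DIA scatter is independent per entry.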

Finally, unlike some algorithms, there's no benefit to using CSR because it's "the standard". There's nothing at all unavoidable or required about using CSR. Conversion routines between sparse matrix formats are freely available and easy to write [1]. People who know what they're doing with SpMV are accustomed to using the right data structure for the job at hand, since performance can vary by 10x simply by choosing the wrong one.

Regarding Magny-Cours - I'm not sure I get your point. Technology is constantly evolving. But if you say you're going to make fair comparisons, you can't give your own team a 16-month handicap.

And if they were constrained by paper length, it would have been helpful to publish a tech report with more data, so that other researchers could reproduce their results. In this case, I would have been much happier if they had shown: 1. datatype (SP or DP). 2. data structure used. It would have been even better if they had used the standard matrices from the Williams paper (which Bell also used), so that there was some continuity for comparison.



[1] http://math.nist.gov/MatrixMarket/formats.html

Ninjaprime
26-Jun-2010, 07:52
Regarding Magny-Cours - I'm not sure I get your point. Technology is constantly evolving. But if you say you're going to make fair comparisons, you can't give your own team a 16-month handicap.

I think the point was, Fermi was released at the same time as Magny-Cours, so if you want to compare products by release date, that would be the pairing. I don't think the results would be any different; in fact, more of the advantage would likely go to the CPU. Fermi slightly more than doubles performance (in DP, less in SP), versus going from 4 cores to 12 cores, which likely yields more than that.

CarstenS
26-Jun-2010, 10:03
What I don't get from the Intel analysis: why?
I mean, obviously, Nvidia et al. are comparing highly optimized, newly written algorithms to the previously used ones, thus generating the huge speedups. Intel is now debunking this two-orders-of-magnitude stuff, but at the same time they are saying that they need to carefully optimize for given problem sizes, allocate caches manually (at least it sounded like that) and so on.

The point is: what they have done is rewrite the algorithms in much the same way people would have to when going to GPU-space. The main problem being: if you need to rewrite your stuff, manually optimizing for parallelism, the main advantage of CPUs (drop-in replacements) is gone, and now you're rewriting everything and still are not at GPU perf levels by quite a margin.

aaronspink
26-Jun-2010, 11:25
What I don't get from the Intel analysis: why?
I mean, obviously, Nvidia et al. are comparing highly optimized, newly written algorithms to the previously used ones, thus generating the huge speedups. Intel is now debunking this two-orders-of-magnitude stuff, but at the same time they are saying that they need to carefully optimize for given problem sizes, allocate caches manually (at least it sounded like that) and so on.

The point is: what they have done is rewrite the algorithms in much the same way people would have to when going to GPU-space. The main problem being: if you need to rewrite your stuff, manually optimizing for parallelism, the main advantage of CPUs (drop-in replacements) is gone, and now you're rewriting everything and still are not at GPU perf levels by quite a margin.

I highly suggest you look at the paper that Jawed linked as well as http://www.prace-project.eu/documents/02_wp8prototypes_hh.pdf. Look at the time to solution numbers, delivered performance, and %peak efficiency. Quite striking really.

aaronspink
26-Jun-2010, 11:37
And if they were constrained by paper length, it would have been helpful to publish a tech report with more data, so that other researchers could reproduce their results. In this case, I would have been much happier if they had shown: 1. datatype (SP or DP). 2. data structure used. It would have been even better if they had used the standard matrices from the Williams paper (which Bell also used), so that there was some continuity for comparison.

Like most research papers, they embed by reference to other papers for algorithms and such. AFAIK, they document where the algorithms they use can be found. It's also not unusual for researchers to share information and results with other researchers who are doing follow-up or related research.

As far as datatype, I would assume most of the datatypes were DP as that is generally considered the standard in scientific applications.

CarstenS
26-Jun-2010, 13:14
Aaron,

I see what you mean. If I only look at those pages, I might get quite a negative view on using GPUs for HPC (btw: did they use dual-socket systems for Nehalem-EP or did they count Hyperthreading toward the cores?)

But there are a few things I'd consider too. First: GPU-computing infrastructure is just in its infancy. Second: all the benchmarks used DP - kind of a worst case for the C1060 (where were the FireStream processors touted at the beginning of the paper?), let alone Fermi, which gives an 8x boost to theoretical DP plus an unknown amount due to architectural improvements.

And what about GPU-HMMER? Its results looked promising, and what I found alarming, though I don't know what to make of it: why did it basically stop scaling beyond 16 CPU cores?

edit:
Just discovered the detailed results here: http://www.prace-project.eu/documents/d8-3-2.pdf
Now reading…

edit 2:
from the above, 3.1.1 Reference performance, p.20:
"the performance values obtained on a standard dual-socket Intel Nehalem-EP
2.53 GHz processor platform (E5540) was used as reference performance baseline"

Jawed
26-Jun-2010, 15:46
SpMV applications don't often have matrix conversion on the critical path, for several reasons.
1. Most often, the application has to spend quite a bit of computation to derive the matrix. These computations arise from the particular problem, for example in the image processing application I've been working on, approximately 55% of the time is spent deriving a matrix from the image (the other 45% is dominated by SpMV). Once I've gone through all this work to compute the elements of the matrix, converting it to a different data structure is fairly trivial.
2. SpMV applications (linear and non-linear solvers, eigensolvers) require repeated SpMV computations, with the same matrix sequentially applied to 1000s or 100000s of vectors. Converting the matrix involves loading it and storing it from memory a few times - since each SpMV computation necessarily involves loading the entire matrix, the conversion cost is just the cost of a few SpMV computations. This is completely negligible in the overall solver.
I'm a little unclear on what you're saying here. Is the set of matrix-vector multiplies purely serially-dependent, i.e. all but the first SpMV is dependent upon all the prior SpMVs?

If not, then this is clearly a sparse-matrix multiplied by a dense-matrix problem. SpMV would be redundant.
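
The distinction being drawn here can be sketched in code. In the serially dependent case each product feeds the next, so the matrix has to be streamed once per step; if all the vectors were known up front, they could be batched into a sparse-matrix times dense-matrix product that reads each non-zero once. The function names and the 2x2 test matrix are illustrative assumptions.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    return [sum(values[k] * x[col_idx[k]]
                for k in range(row_ptr[r], row_ptr[r + 1]))
            for r in range(len(row_ptr) - 1)]

def repeated_spmv(row_ptr, col_idx, values, x0, steps):
    """Serially dependent case: step i needs the result of step i-1,
    so the vectors cannot be batched."""
    x = x0
    for _ in range(steps):
        x = spmv_csr(row_ptr, col_idx, values, x)
    return x

def spmm_csr(row_ptr, col_idx, values, X):
    """Independent case. X is a list of rows, each holding one entry
    per right-hand-side vector."""
    ncols = len(X[0])
    Y = []
    for r in range(len(row_ptr) - 1):
        acc = [0.0] * ncols
        for k in range(row_ptr[r], row_ptr[r + 1]):
            a, xrow = values[k], X[col_idx[k]]
            # Each non-zero is read once and reused for every vector.
            for j in range(ncols):
                acc[j] += a * xrow[j]
        Y.append(acc)
    return Y

# Hypothetical matrix [[2, 0], [1, 3]].
row_ptr, col_idx, values = [0, 1, 3], [0, 0, 1], [2.0, 1.0, 3.0]
print(repeated_spmv(row_ptr, col_idx, values, [1.0, 1.0], steps=2))
print(spmm_csr(row_ptr, col_idx, values, [[1.0, 0.0], [1.0, 1.0]]))
```

Iterative solvers generally fall into the first case, since each iterate is computed from the previous one, which is why SpMV rather than the batched product is the kernel of interest there.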

Additionally, CSR is actually a non-trivial format to construct - if you have computed the data for each non-zero separately, you have to use scans in order to compute the indices for the CSR data structure. Some of the other SpMV data structures, like DIA and ELL, can be constructed just by scattering data to the appropriate place in memory.
I guess if the sparse matrix was generated on the GPU then it makes little sense to encode in CSR.

Finally, unlike some algorithms, there's no benefit to using CSR because it's "the standard". There's nothing at all unavoidable or required about using CSR. Conversion routines between sparse matrix formats are freely available and easy to write [1]. People who know what they're doing with SpMV are accustomed to using the right data structure for the job at hand, since performance can vary by 10x simply by choosing the wrong one.
So it being apparently the most popular is despite it being ill-suited, or some other factor?

Regarding Magny-Cours - I'm not sure I get your point. Technology is constantly evolving. But if you say you're going to make fair comparisons, you can't give your own team a 16-month handicap.
When one is building a system to do stuff, one can choose between one or more sockets, one or more boards and then all the combinations + some mixture of GPUs. For example, instead of buying a 2-socket/2-GPU board, one could choose a 4-socket board. The GPUs aren't notably cheap in this kind of setup.

And the programming-model fracture is tough.

One could argue that as x86 goes into dense multi-core and NUMA gets in the way more and more, it suffers a programming-model fracture too.

So then the question is whether you want to tackle that fracture with OpenMP/MPI/TBB/Cilk++/Ct blah blah blah or OpenCL (which encompasses CPU and GPU programming models) or whether you want to do both. I don't see CUDA as having any life expectancy except for those who are building a single application system.

I'm not arguing that GPUs are decisively a waste of time, by the way. If you're building a system to run a single application (or application type) and you have at least one order of magnitude better performance today the GPU is an easy buy I reckon.

By using a single 4-core processor Intel didn't give itself any notable advantage. In my view two sockets versus a single socket plus a GPU is an appropriate baseline - one can't run a GPU without a CPU, so adding a second socket in place of a GPU is entirely justified, as both systems are "dual-processor". (Obviously not for a consumer application.)

And if they were constrained by paper length, it would have been helpful to publish a tech report with more data, so that other researchers could reproduce their results. In this case, I would have been much happier if they had shown: 1. datatype (SP or DP). 2. data structure used. It would have been even better if they had used the standard matrices from the Williams paper (which Bell also used), so that there was some continuity for comparison.
CSR/DP seems a more than sensible assumption. The PRACE technical report also uses CSR/DP for SpMV (from EuroBen).

What's interesting is that the people who publish GPU-centric speed-ups almost never have optimal x86 implementations as a reference point. I've read dozens of these papers. You'll find some NVidia people who are embarrassed by this stuff.

Though there are genuinely huge speed-ups out there...

Jawed

Rolf N
27-Jun-2010, 16:57
The conventional wisdom is that a non-standard or exotic architecture needs at least a 10x advantage to justify itself.
There are certain tests where there is at least a 10x advantage, something Fermi could improve on measurably. In those areas, the GPU looks more compelling than in areas where a lower multiplier means that Moore's law or the slow ratcheting down of prices as SKUs age could significantly reduce the gains before the needed software customization and tools are completed.
This falsely assumes GPU performance doesn't scale with process advances.

3dilettante
28-Jun-2010, 13:40
Until continuity is stronger in GPGPU, the next scaling of hardware with Moore's law is just as likely to require junking established software and tools as not.
Given the overall state of GPU hardware and software for HPC (this includes AMD's non-efforts thus far, but Nvidia's products have some shortcomings as well), establishing continuity would be less than optimal using the current baseline.

rpg.314
28-Jun-2010, 18:03
Vectors are much bigger on the GPUs - i.e. they're much more dependent on those things to get anywhere. I'm certainly not saying GPUs don't have an advantage though. Merely that when you're divvying-up the space, AVX puts a sizeable dent into GPU advantages in certain places.

But yeah, LRBni for the win. Pity that Intel appears to intend to put it into the "add-in card ghetto" (not enough RAM, PCI Express barrier...) instead of mainstream x86. Oh well.


I'd say the niche where AVX makes a dent against gpu's is pretty small. Adding highly restrictive extensions to a quite general and malleable architecture like multi-cores is not a good idea, given the evolution of your competitors.


2 cores -> 12 cores.

Now you are going to normalize it for price (or area), or should I?

Further, I should add that this is a symptom of a larger problem. Vendors like to compare per core. Customers like to compare per node.

Jawed
28-Jun-2010, 22:51
I'd say the niche where AVX makes a dent against gpu's is pretty small. Adding highly restrictive extensions to a quite general and malleable architecture like multi-cores is not a good idea, given the evolution of your competitors.
x86 is incumbent, so the alternative has got to have a huge advantage.

Now you are going to normalize it for price (or area), or should I?
I'm raising the question of scaling. GPU scaling isn't a given. x86 scaling, beyond 16 cores isn't, either. Fusion musses things up. These are just factors to consider.

Further, I should add that this is a symptom of a larger problem. Vendors like to compare per core. Customers like to compare per node.
Then you have things like Clearspeed - had great potential, some sites invested pretty heavily, buying into the future scaling that was promised, etc. The road to hell is paved with good intentions.

Jawed

rpg.314
29-Jun-2010, 08:48
x86 is incumbent, so the alternative has got to have a huge advantage.
If you used intrinsics/hand made classes to wrap sse, then you are screwed - aka massive rewrites.

If you are relying on a vectorizing compiler, then without predication and scatter gather, the performance falloff will be pretty rapid for a large number of applications.
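
To show what kind of loop is at stake, here is a plain-Python stand-in, not real AVX or LRBni code. The scalar loop has a data-dependent branch and an indirect load; the second function processes it in fixed-width chunks the way hardware with per-lane masks (predication) and a gather instruction could. The lane width of 8 and all names are assumptions for illustration.

```python
def scalar_loop(a, idx, table):
    out = []
    for i in range(len(a)):
        if a[i] > 0:                          # data-dependent branch
            out.append(a[i] * table[idx[i]])  # indirect (gather) load
        else:
            out.append(0.0)
    return out

def vector_loop(a, idx, table, width=8):
    """Same loop, `width` lanes at a time, emulating mask registers
    and a masked gather."""
    out = []
    for base in range(0, len(a), width):
        lanes = range(base, min(base + width, len(a)))
        mask = [a[i] > 0 for i in lanes]               # predicate per lane
        gathered = [table[idx[i]] if m else 0.0        # masked gather
                    for i, m in zip(lanes, mask)]
        out.extend(a[i] * g if m else 0.0
                   for i, m, g in zip(lanes, mask, gathered))
    return out

a = [1.0, -2.0, 3.0]
idx = [2, 0, 1]
table = [10.0, 20.0, 30.0]
print(scalar_loop(a, idx, table), vector_loop(a, idx, table))
```

Without those two features a compiler has to fall back to per-element loads and branches for this loop, and that fallback gets relatively more expensive as the vector gets wider.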

Either way, IMHO, avx is much less useful as you increase the vector width.

If intel's long term plans are to add LRBni to cpu's, then we might see 2 different sets of vector ISAs, the SSE/AVX lineage and LRBni on the same core. :grin:

I'm raising the question of scaling. GPU scaling isn't a given. x86 scaling, beyond 16 cores isn't, either. Fusion musses things up. These are just factors to consider.
Workload may not scale, the hw certainly can. And if your workload isn't scaling, there's no point blaming the hw.

Jawed
29-Jun-2010, 09:49
If intel's long term plans are to add LRBni to cpu's, then we might see 2 different sets of vector ISAs, the SSE/AVX lineage and LRBni on the same core. :grin:
Yes, a long time ago I referred to this as the "CPU with drivers" problem.

In theory OpenCL levels the playing field substantially, i.e. its openness to the vagaries of hardware (rather than pretending they don't exist) and the extensions mechanism saves us from writing ASM/intrinsics. Provided the compiler works, of course.

Sadly, I suspect OpenCL won't be taken seriously. It seems to me people are distracted by the search for a holy grail, a language that makes parallelism easy, automatic and optimal regardless of hardware platform.

Anyway, you aren't going to beat things like MKL, which are coded to the metal, if writing direct equivalents.

Workload may not scale, the hw certainly can. And if your workload isn't scaling, there's no point blaming the hw.
Stuff that doesn't scale prolly doesn't see any benefit from GPGPU, then. Or it needs a new algorithm.

Jawed

Lux_
29-Jun-2010, 13:14
In theory OpenCL levels the playing field substantially, i.e. its openness to the vagaries of hardware (rather than pretending they don't exist) and the extensions mechanism saves us from writing ASM/intrinsics. Provided the compiler works, of course. Sadly, I suspect OpenCL won't be taken seriously.

A quote from a document (http://www.prace-project.eu/documents/public-deliverables/d6-6.pdf) from this bunch (http://www.prace-project.eu/documents/public-deliverables-1/) seems relevant (written at the end of 2009):
The language on which most people pinned their hopes is probably OpenCL. Testing the performance of the first available OpenCL compilers has revealed that it is currently insufficient. The language specification is not even a year old, so it might not be a surprise that the performance is very poor. More severe is the fact that the design of the language seems to be insufficient as well. The HPC community hoped that OpenCL could be the language that allows running code on different accelerators which would solve the problem of code maintainability and portability across accelerator devices, but it turned out that to program OpenCL many choices depend on the underlying hardware which prevents a seamless use of other devices. The language is very similar to CUDA and makes some implicit assumptions which are only true for graphic cards. So it seems that the language will stay a GPU language for another couple of years. An HPC compiler expert claimed that it is only feasible to use OpenCL as an intermediate language; higher level languages should be used on top of it and ensure at least portability across GPUs.
Which fits well with the fact that Apple had a specific goal (DirectCompute/CUDA) by a specific deadline (Snow Leopard) in a short timeframe (specced in 5 months).

It seems to me people are distracted by the search for a holy grail, a language that makes parallelism easy, automatic and optimal regardless of hardware platform.
I wouldn't paint it in negative colours. Although there probably is some research for research's sake, the underlying problems are very real: locality of data (at least 6 levels), turnover times on new algorithms or (partial) system upgrades, ability to use various parallel programming models etc.

Coding to bare metal by the respective experts (in both the computation and I/O domains) will always be the fastest way. But it shouldn't be a requirement; there is no law that constrains parallel compilers to a Visual-Basic-for-HPC corner.

Reading the above-linked documents (http://www.prace-project.eu/documents/public-deliverables-1/) gave me a glimpse of the diversity of the HPC zoo. The search for better ways to execute code on an increasing number of compute nodes (which are also heterogeneous in nature) is neither naive nor a pseudo-problem; it's needed.

rpg.314
29-Jun-2010, 14:09
Sadly, I suspect OpenCL won't be taken seriously.
I think it will be, but at any rate by the time people port production codes in any meaningful quantity, CPUs and GPUs will likely have fused. For real. :???:

It seems to me people are distracted by the search for a holy grail, a language that makes parallelism easy, automatic and optimal regardless of hardware platform.
I am pessimistic about whether it even exists. Haskell comes close, though. Something Haskell-ish for GPUs would be super cool.

aaronspink
29-Jun-2010, 15:12
If you used intrinsics/hand made classes to wrap sse, then you are screwed - aka massive rewrites.

If you are relying on a vectorizing compiler, then without predication and scatter gather, the performance falloff will be pretty rapid for a large number of applications.

Either way, IMHO, avx is much less useful as you increase the vector width.

It's likely that the programming will be the same as it is now. Currently, for most of the stuff that is computationally intensive, people use MKL. One would assume, though I have no knowledge, that it would be the same.

rpg.314
29-Jun-2010, 15:22
IMHO, MKL being essentially a dense linear algebra library, is useful only in a small niche. Or a part of a typical workload.

aaronspink
29-Jun-2010, 18:21
IMHO, MKL being essentially a dense linear algebra library, is useful only in a small niche. Or a part of a typical workload.

MKL supports dense and sparse linear algebra, probability/statistics, vector math, and FFTs.

larrabee
30-Jun-2010, 03:21
The Nvidia GTX280 is composed of an array of multiprocessors (a.k.a. SM). Each SM has 8 scalar processing units running in lockstep, each at 1.3 GHz.
last time i checked the gtx280's stock frequency was 1400MHz. they used the theoretical gflops of the 1.4GHz part while benchmarking the 1.3GHz part. even though it's only an 8% difference, that can make a significant change in measured efficiency.

also i found the average of the speedups to be ~3.6x, not 2.5x. i guess the order-of-magnitude speedups don't count?

either they are trying to make gpus look bad or the arithmetic skills of these people are at a 3rd grade level. i'll go with the latter.

Andrew Lauritzen
30-Jun-2010, 04:16
last time i checked the gtx280's stock frequency was 1400MHz
Check again (http://en.wikipedia.org/wiki/GeForce_200_Series): the gtx280 is 602/1296.
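
For a sense of scale, here is a rough sketch of how much a clock mix-up like that would shift a theoretical-peak figure. The 8 scalar units per SM come from the paper's text quoted above; the 30 SMs and 3 flops per clock per unit (MAD + MUL) are my assumptions about the GTX280, and the "measured" figure is made up purely for illustration.

```python
# Sketch only: sm_count and flops_per_clock are assumed values, not taken
# from the paper; measured_gflops below is a hypothetical number.

def peak_gflops(shader_mhz, sm_count=30, sp_per_sm=8, flops_per_clock=3):
    """Theoretical single-precision peak in GFLOPS for a given shader clock."""
    return sm_count * sp_per_sm * flops_per_clock * shader_mhz / 1000.0

def efficiency(measured_gflops, shader_mhz):
    """Fraction of theoretical peak actually achieved."""
    return measured_gflops / peak_gflops(shader_mhz)

stock_mhz = 1296      # shader clock from the 602/1296 figure
claimed_mhz = 1400    # the clock claimed in the earlier post

# Peak scales linearly with clock, so the ratio is just claimed/stock.
ratio = peak_gflops(claimed_mhz) / peak_gflops(stock_mhz)

measured_gflops = 300.0   # hypothetical benchmark result
eff_stock = efficiency(measured_gflops, stock_mhz)
eff_claimed = efficiency(measured_gflops, claimed_mhz)

print(f"peak at {stock_mhz} MHz: {peak_gflops(stock_mhz):.2f} GFLOPS")
print(f"peak ratio {claimed_mhz}/{stock_mhz}: {ratio:.4f}")
print(f"efficiency vs stock peak:   {eff_stock:.3f}")
print(f"efficiency vs claimed peak: {eff_claimed:.3f}")
```

Whatever per-clock throughput you assume, it cancels out of the ratio, so quoting a peak at the wrong clock only moves the efficiency figure by the clock ratio itself.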


also i found the average of the speedups to be ~3.6x, not 2.5x. i guess the order-of-magnitude speedups don't count?
The average given is a geomean (http://en.wikipedia.org/wiki/Geometric_mean) which is appropriate for speedup factors; read the paper again.
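
To illustrate why the two averages disagree, here is a minimal sketch using hypothetical speedup factors (these are NOT the paper's numbers). The arithmetic mean gets pulled up by a single large outlier, while the geometric mean treats a 2x win and a 2x loss symmetrically.

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    # exp of the mean of the logs; all values must be > 0
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Hypothetical GPU-over-CPU speedups for five kernels (illustrative only).
speedups = [0.5, 1.0, 2.0, 4.0, 16.0]

am = arithmetic_mean(speedups)
gm = geometric_mean(speedups)

# Flip the viewpoint: the same results expressed as CPU-over-GPU ratios.
inverted = [1.0 / s for s in speedups]
am_inv = arithmetic_mean(inverted)
gm_inv = geometric_mean(inverted)

print(f"arithmetic mean: {am:.3f}, geometric mean: {gm:.3f}")
print(f"inverted ratios: arithmetic {am_inv:.3f}, geometric {gm_inv:.3f}")
print(f"am * am_inv = {am * am_inv:.3f}, gm * gm_inv = {gm * gm_inv:.3f}")
```

With the geomean, the GPU-over-CPU and CPU-over-GPU summaries are exact reciprocals of each other; with the arithmetic mean they are not, which is one reason it is a poor summary for ratios.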


either they are trying to make gpus look bad or the arithmetic skills of these people are at a 3rd grade level. i'll go with the latter.
Option 3: you're wrong and didn't do your research... but I guess it's more fun to call all of the authors and peer reviewers stupid. Slower to insults, please.

Billy Idol
17-Jul-2010, 03:55
I am just sitting here after the last day of a conference where I took part in a minisymposium on GPU/CPU HPC stuff.

What is rather difficult for me to judge:

How can I really make a fair comparison between CPU and GPU?!


Here the philosophical part:

If I went the naive route (which in fact is probably even the right one?!), I would say that in the end, whatever platform gives me the shortest overall wallclock time is the best one for me personally, and that's it!

The problem is that I am 100% certain that the answer to this question will vary drastically between the different people doing HPC, depending for instance on their personal experience, capability and willingness (which is probably the most important part) to get deeply involved in the topic!


Here the scientific part:

why even compare performance numbers based on FLOPS, when typically the implementations, and thus the algorithms, differ between platforms?

why compare a single-GPU implementation to a parallel quad-core CPU implementation?

why compare a single-threaded CPU implementation to an inherently massively parallel GPU implementation?

why compare two identical solution methods for a given scientific problem, when there are probably other solution methods available which are better suited to certain platforms?
on the other hand: if you optimize the solution method and make it architecture aware...what about the results, i.e. the accuracy?

Hm, I think that it is a rather difficult topic ... at least it is for me, as I have to decide if it pays off to go the GPGPU route at all...or follow the massively parallel HPC route?!

Jawed
17-Jul-2010, 10:02
If it's any consolation, people with billions to spend are in the same quandary.

This is rather amusing:

http://crd.lbl.gov/~dhbailey/dhbpapers/twelve-ways.pdf
http://crd.lbl.gov/~dhbailey/dhbtalks/dhb-12ways.pdf

Billy Idol
22-Jul-2010, 00:46
If it's any consolation, people with billions to spend are in the same quandary.


Yeah, I know...but I have to spend my lifetime...which is worth way more than billions ;-)

Simon F
22-Jul-2010, 09:35
If it's any consolation, people with billions to spend are in the same quandary.

This is rather amusing:

http://crd.lbl.gov/~dhbailey/dhbpapers/twelve-ways.pdf

I started reading that and all I could think of was Simon & Garfunkel penning a new version of one of their old songs...
"Quote 32-bit, Mick,
Compare against a Cray, Ray..."

rpg.314
22-Jul-2010, 12:23
An excellent read. No matter how many times I read it, it is still funny.