#1 | |
|
Regular
|
an evaluation of throughput computing on CPU and GPU
http://portal.acm.org/citation.cfm?i...ret=1#Fulltext Quote:
__________________
Can it play WoW? |
|
|
|
|
|
|
#2 |
|
Darlek ******
Join Date: Jun 2004
Posts: 9,489
|
The conclusion seems to be that you're still better off with a GPU.
Strange thing for Intel to put out.
__________________
Guardian of the Most holy Two Terabytes of Gaming Goodness™ |
|
|
|
|
|
#3 |
|
AndyTX
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,840
|
This isn't marketing... this is a peer-reviewed paper. The guys who did it are some of the best and they optimized both the CPU and GPU implementations very well and presented the results. This is how work *should* be done and results presented everywhere...
The question of which level of performance (CPU or GPU) is easier to come by for varying levels of programmer skill is not clear, but these results definitely represent the "expert" end on both platforms.
__________________
The content of this message is my personal opinion only. |
|
|
|
|
|
#4 |
|
Senior Member
Join Date: Feb 2002
Location: gjethus, Norway
Posts: 1,256
|
As it looks to me, several of the algorithms described suffer inefficiencies on the GPU side due to shortcomings in the GTX280 architecture used for comparison: lack of coherency, lack of large on-chip caches (of the algorithms in the paper, "Solv" and "Hist" seem to be particularly hard hit by these shortcomings). It would be interesting to see this exercise repeated on more modern DirectX11-class GPU architectures like Fermi or Evergreen, where these shortcomings are absent or at least much less prominent.
Also, as Ars Technica notes, even if the overall speedup when going from Core-i7 to GTX280 is only 2.5x instead of 100x, it's still a major win for the GPU side of the story - and even more so when considering the price of Core-i7 versus GTX280. |
|
|
|
|
|
#5 |
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,071
|
The conventional wisdom is that a non-standard or exotic architecture needs at least a 10x advantage to justify itself.
There are certain tests where there is at least a 10x advantage, something Fermi could improve on measurably. In those areas, the GPU looks more compelling than in areas where a lower multiplier means that Moore's law or the slow ratcheting down of prices as SKUs age could significantly reduce the gains before the needed software customization and tools are completed.
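As a rough illustration of the point above: if the incumbent platform's performance doubles every fixed number of months, a speedup of S is matched after `doubling_months * log2(S)` months. The sketch below assumes a 24-month doubling period, which is a hypothetical figure and not one taken from the thread or the paper; the function name is also made up for illustration.

```python
import math


def months_until_eroded(speedup, doubling_months=24.0):
    """Months of steady doubling the baseline needs to match a fixed speedup.

    doubling_months is an assumed parameter, not a measured value.
    """
    if speedup <= 1.0:
        return 0.0  # no advantage to erode
    return doubling_months * math.log2(speedup)


if __name__ == "__main__":
    # 2.5x, 10x and 100x are the multipliers mentioned in the thread.
    for s in (2.5, 10.0, 100.0):
        print(f"{s:>6.1f}x advantage lasts about {months_until_eroded(s):.0f} months")
```

Under this simple model a small multiplier buys much less time for porting software and building tools than a 10x one, which seems to be the concern raised in the post.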
__________________
Dreaming of a .065 micron etch-a-sketch. |
|
|
|
|
|
#6 |
|
Regular
|
The other factor is AVX is incoming and core counts per x86 socket are now rising healthily - at least for a while.
Essentially you need to be really sure that an application/library/component you move to GPU "now" is going to speed up according to Moore's Law over the next few years, as in general it will definitely do so on CPU (unless you'll never want more performance than you can get today).
One could argue that CPUs are going to get a double shot of scaling, AVX and 16 cores per socket, and then that will be it (wild card being bandwidth per socket). After that the "why more x86 cores?" question gets tough to answer. What do consumer applications want to do that will chew through x86-style cores in preference to GPU cores?
Then, well, there's Fusion and the like - eventually there'll be no choice to make.
Jawed
__________________
Can it play WoW? |
|
|
|
|
|
#7 | |
|
Recurring Membmare
Join Date: Aug 2003
Location: yes
Posts: 2,494
|
Quote:
|
|
|
|
|
|
|
#8 |
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,071
|
Until continuity is stronger in GPGPU, the next scaling of hardware with Moore's law is just as likely to require junking established software and tools as not.
Given the overall state of GPU hardware and software for HPC (this includes AMD's non-efforts thus far, but Nvidia's products have some shortcomings as well), establishing continuity would be less than optimal using the current baseline.
__________________
Dreaming of a .065 micron etch-a-sketch. |
|
|
|
|
|
#9 |
|
Senior Member
|
a) AVX's lack of predication and scatter/gather gives it, ironically, a much more restricted programming model than modern GPUs. Frankly they should just ditch AVX for LRBni.
b) Over the last 4 years we have seen cores/socket grow more slowly than predicted by Moore's law. I am not sure why that would change in the next 2 years. |
|
|
|
|
|
#10 |
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,071
|
16 per socket could happen with Bulldozer, and perhaps a Sandy Bridge EX variant. The Westmere EX was a notable outlier, with only 10 cores compared to Nehalem EX's 8, but that could be because the higher count is waiting on the new core design.
Unfortunately for Bulldozer, the FP resources, at least for the first generation, scale at 1/2 per core, and the 16-core variant would be an MCM, with those cores individually being somewhat diminished due to other shared resources. On the other hand, it would get FMAC, which Sandy Bridge's AVX implementation would not. On the other other hand, it's going to be an AMD extension, which is going to be ignored by most.
One could argue that these could still scale with Moore's law, just that at their respective nodes they are slightly too expensive for full implementations. So now it's more of a "scale with Moore's law-n iterations", with n being some offset often ranging between 1 and 1.5, possibly 2, corresponding to some initial one-time implementation cost. The interesting part is that the offset is much more crucial now than it was previously. At least for silicon, we might be running out of room for n...
__________________
Dreaming of a .065 micron etch-a-sketch. |
|
|
|
|
|
#11 | |||
|
Regular
|
Quote:
But yeah, LRBni for the win. Pity that Intel appears to intend to put it into the "add-in card ghetto" (not enough RAM, PCI Express barrier...) instead of mainstream x86. Oh well. Quote:
Quote:
Jawed
__________________
Can it play WoW? |
|||
|
|
|
|
|
#12 |
|
Member
Join Date: Feb 2010
Posts: 170
|
The paper is very short on detail with regard to a lot of the tests they ran. For example, what kind of SpMV did they benchmark? I'm assuming it was double precision, since their performance was so low, but what data structure did they use, and on what type of sparse matrix?
It's also rather cute to compare a 65nm processor released in June 2008 with a 45nm processor released in October 2009. I'd like to see them rerun their tests with Fermi. |
|
|
|
|
|
#13 | ||||
|
Regular
|
Quote:
Quote:
Quote:
Quote:
__________________
Can it play WoW? |
||||
|
|
|
|
|
#14 |
|
Member
Join Date: Feb 2010
Posts: 170
|
The quote you included from the paper implied that they used CSR, but didn't state it explicitly. If they did use CSR, that's a little disappointing, since it's actually not a very good format for GPUs.
Additionally, they didn't state the characteristics of the matrix they tested. SpMV performance varies tremendously based on the data structure and properties of the matrix. If they used a vector CSR format, there's a good chance that the hybrid ELL/COO format which Bell and Garland propose [1] would have performed significantly (~50%) better. I'm assuming they used a double precision, vector CSR SpMV for this example, but that's just guesswork.
Their assertion that their results are better than the literature is useless without knowing exactly what they did, especially since the literature has such a wide performance range on this benchmark. Bell and Garland show performance ranging from 1 GFLOP/s to 40 GFLOP/s - it all depends on the data structure and properties of the matrix. Without that information, we're kind of left adrift.
I didn't say their paper was broken because they didn't have Fermi in November, I just pointed out that Intel is benefiting here from a 16 month technology development advantage. 16 months is a big deal - the difference between the 5870 and the 4870 was 14 months, which translated into about 60% in performance. My single precision DIA SpMV code runs about 50% faster on Fermi than on GTX280 - and that's without any tuning for Fermi's caches.
Taking a step back, I agree with the paper's conclusion that many GPU papers with 100x speedup results are misleading, and if we're trying to decide what architecture to use for a problem, it's important to normalize the comparisons. I'm just not 100% sure I believe Intel's normalization here, and it's hard to convince myself they did the right thing when the benchmarks were so shallowly described.
[1] http://graphics.cs.uiuc.edu/~wnbell/...throughput.pdf |
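For readers unfamiliar with the formats being argued about, here is a minimal sketch in plain Python of a CSR SpMV and of the padded ELL layout. It is only meant to show how the data structures differ, not how the paper or Bell and Garland implemented anything: the function names and the 3x3 example matrix are hypothetical, and the COO overflow part of the hybrid format is left out.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """y = A @ x with A in CSR: row r owns entries row_ptr[r]..row_ptr[r+1]-1."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


def csr_to_ell(row_ptr, col_idx, values):
    """Pad every row to the longest row's length (the ELL idea)."""
    n = len(row_ptr) - 1
    width = max((row_ptr[r + 1] - row_ptr[r] for r in range(n)), default=0)
    ell_cols = [[0] * width for _ in range(n)]    # padding points at column 0
    ell_vals = [[0.0] * width for _ in range(n)]  # padding value is 0.0
    for r in range(n):
        for j, k in enumerate(range(row_ptr[r], row_ptr[r + 1])):
            ell_cols[r][j] = col_idx[k]
            ell_vals[r][j] = values[k]
    return ell_cols, ell_vals


def spmv_ell(ell_cols, ell_vals, x):
    """y = A @ x with A in ELL: every row does the same amount of work."""
    return [sum(v * x[c] for c, v in zip(cols, vals))
            for cols, vals in zip(ell_cols, ell_vals)]


if __name__ == "__main__":
    # Hypothetical matrix [[4, 0, 1], [0, 3, 0], [2, 0, 5]] in CSR form.
    row_ptr = [0, 2, 3, 5]
    col_idx = [0, 2, 1, 0, 2]
    values = [4.0, 1.0, 3.0, 2.0, 5.0]
    x = [1.0, 2.0, 3.0]
    print(spmv_csr(row_ptr, col_idx, values, x))
    print(spmv_ell(*csr_to_ell(row_ptr, col_idx, values), x))
```

The trade-off is visible even at this size: CSR rows have irregular lengths, while ELL gives every row identical work at the cost of padding, so which one wins depends heavily on how uneven the row lengths of the matrix are.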
|
|
|
|
|
#15 | ||
|
Senior Member
|
Quote:
Quote:
Further, I should add that this is a symptom of a larger problem. Vendors like to compare per core. Customers like to compare per node. |
||
|
|
|
|
|
#16 | |||
|
Regular
|
Quote:
Quote:
Quote:
Jawed
__________________
Can it play WoW? |
|||
|
|
|
|
|
#17 | ||
|
Senior Member
|
Quote:
If you are relying on a vectorizing compiler, then without predication and scatter/gather, the performance falloff will be pretty rapid for a large number of applications. Either way, IMHO, AVX is much less useful as you increase the vector width. If Intel's long-term plans are to add LRBni to CPUs, then we might see 2 different sets of vector ISAs, the SSE/AVX lineage and LRBni, on the same core. Quote:
|
||
|
|
|
|
|
#18 |
|
Senior Member
|
What I don't get from the Intel analysis: why?
I mean, obviously, Nvidia et al. are comparing highly optimized, newly written algorithms to the previously used ones, thus generating the huge speedups. Intel is now debunking this two-orders-of-magnitude stuff, but at the same time they are saying that they need to carefully optimize for given problem sizes, allocate caches manually (at least it sounded like that) and so on. The point is: what they have done is rewrite the algorithms in much the same way people would have to when going to GPU space. The main problem being: if you need to rewrite your stuff, manually optimizing for parallelism, the main advantage of CPUs (drop-in replacements) is gone, and now you're rewriting everything and are still not at GPU performance levels by quite a margin.
__________________
English is not my native tongue. Before flaming please consider the possibility that I did not mean to say what you might have read from my posts. Work | Recreation. Warning! This posting may contain unhealthy doses of gross humor, sarcastic remarks and exaggeration! |
|
|
|
|
|
#19 | |
|
Senior Member
Join Date: Jun 2003
Posts: 2,570
|
Quote:
__________________
Aaron Spink speaking for myself inc. |
|
|
|
|
|
|
#20 |
|
Senior Member
|
IMHO, MKL, being essentially a dense linear algebra library, is useful only in a small niche, or as one part of a typical workload.
|
|
|
|
|
|
#21 |
|
Senior Member
Join Date: Jun 2003
Posts: 2,570
|
MKL supports Dense, sparse, prob/statistics, vectors, and FFT.
__________________
Aaron Spink speaking for myself inc. |
|
|
|
|
|
#22 | |
|
Junior Member
Join Date: Dec 2009
Posts: 29
|
Quote:
Also, I found the average of the speedups to be ~3.6x, not 2.5x. I guess the order-of-magnitude speedups don't count? Either they are trying to make GPUs look bad or the arithmetic skills of these people are at a 3rd grade level. I'll go with the latter. |
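The thread does not say how either figure was computed, but the kind of average matters a lot when summarising speedups. The sketch below uses made-up speedup values (not the paper's numbers) to show that an arithmetic mean is pulled up by a single large outlier while a geometric mean is not; both helper names are hypothetical.

```python
import math


def arithmetic_mean(xs):
    return sum(xs) / len(xs)


def geometric_mean(xs):
    """n-th root of the product, computed via logs for numerical stability."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))


if __name__ == "__main__":
    # Hypothetical per-benchmark speedups, NOT taken from the paper.
    speedups = [0.5, 1.5, 2.0, 3.0, 15.0]
    print(f"arithmetic mean: {arithmetic_mean(speedups):.2f}x")
    print(f"geometric mean:  {geometric_mean(speedups):.2f}x")
```

Two people averaging the same table of ratios can therefore honestly arrive at quite different headline numbers, so it is worth checking which mean a paper reports before comparing.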
|
|
|
|
|
|
#23 | |
|
AndyTX
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,840
|
Check again: the gtx280 is 602/1296.
Quote:
Option 3: you're wrong and didn't do your research... but I guess it's more fun to call all of the authors and peer reviewers stupid. Slower to insults, please.
__________________
The content of this message is my personal opinion only. Last edited by Andrew Lauritzen; 30-Jun-2010 at 04:32. |
|
|
|
|
|
|
#24 |
|
Senior Member
Join Date: Mar 2009
Location: Europe
Posts: 2,601
|
I am just sitting here, after the last day of a conference where I participated in a minisymposium on GPU/CPU HPC stuff.
What is rather difficult for me to judge: how can I really make a fair comparison between CPU and GPU?!
Here is the philosophical part: if I go the naive route (which in fact is probably even the right one?!), I would say that in the end, whatever platform gives me the shortest overall wallclock time is the best one for me personally, and that's it! The problem is that I am 100% certain the answer to this question will vary drastically between the different people doing the HPC work, for instance depending on their personal experience, capability and willingness (which is probably the most important part) to get deeply into the topic!
Here is the scientific part:
Why even compare performance numbers based on FLOPS, when typically the implementations, and thus the algorithms, differ between platforms?
Why compare one GPU to a parallel quad-core CPU implementation?
Why compare a single-threaded CPU implementation to an inherently massively parallel GPU implementation?
Why compare two identical solution methods for a given scientific problem, when there are probably other solution methods available which are better suited to certain platforms?
On the other hand: if you optimize the solution method and make it architecture-aware... what about the results, i.e. the accuracy?
Hm, I think that it is a rather difficult topic... at least it is for me, as I have to decide whether it pays off to go the GPGPU route at all... or to follow the massively parallel HPC route?!
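The "shortest overall wallclock time" criterion can be sketched as a tiny harness that treats each implementation as a black box and only compares the time to the same answer. Everything here is an assumption for illustration: the helper name, the best-of-N policy, and the two toy implementations, which stand in for "same problem, different solution method" and have nothing to do with CPUs or GPUs as such.

```python
import time


def best_wallclock(fn, *args, repeats=5):
    """Best-of-N wallclock seconds for one call of fn(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


def sum_squares_loop(n):
    """Sum of i*i for i in 0..n-1, brute force."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def sum_squares_closed_form(n):
    """Same result from the closed-form expression (n-1)n(2n-1)/6."""
    return (n - 1) * n * (2 * n - 1) // 6


if __name__ == "__main__":
    n = 200_000
    # Check the two methods agree before comparing their timings.
    assert sum_squares_loop(n) == sum_squares_closed_form(n)
    for fn in (sum_squares_loop, sum_squares_closed_form):
        print(f"{fn.__name__}: {best_wallclock(fn, n):.6f} s")
```

Checking that the results agree before timing anything is the part that speaks to the accuracy question: a faster method only counts if it still produces an acceptable answer.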
__________________
I bid farewell with a rebel yell... |
|
|
|
|
|
#25 |
|
Regular
|
If it's any consolation, people with billions to spend are in the same quandary.
This is rather amusing:
http://crd.lbl.gov/~dhbailey/dhbpapers/twelve-ways.pdf
http://crd.lbl.gov/~dhbailey/dhbtalks/dhb-12ways.pdf
__________________
Can it play WoW? |
|
|
|