Old 25-Jun-2010, 13:58   #1
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,863
Debunking the 100X GPU vs. CPU myth:...

an evaluation of throughput computing on CPU and GPU

http://portal.acm.org/citation.cfm?i...ret=1#Fulltext

Quote:
Recent advances in computing have led to an explosion in the amount of data being generated. Processing the ever-growing data in a timely manner has made throughput computing an important aspect for emerging applications. Our analysis of a set of important throughput computing kernels shows that there is an ample amount of parallelism in these kernels which makes them suitable for today's multi-core CPUs and GPUs. In the past few years there have been many studies claiming GPUs deliver substantial speedups (between 10X and 1000X) over multi-core CPUs on these kernels. To understand where such large performance difference comes from, we perform a rigorous performance analysis and find that after applying optimizations appropriate for both CPUs and GPUs the performance gap between an Nvidia GTX280 processor and the Intel Core i7-960 processor narrows to only 2.5x on average. In this paper, we discuss optimization techniques for both CPU and GPU, analyze what architecture features contributed to performance differences between the two architectures, and recommend a set of architectural features which provide significant improvement in architectural efficiency for throughput kernels.
The PDF is currently available, dunno how long that will last.
__________________
Can it play WoW?
Old 25-Jun-2010, 16:39   #2
Davros
Darlek ******
 
Join Date: Jun 2004
Posts: 9,489

the conclusion seems to be you're still better off with a gpu
strange thing for intel to put out
__________________
Guardian of the Most holy Two Terabytes of Gaming Goodness™
Old 25-Jun-2010, 17:49   #3
Andrew Lauritzen
AndyTX
 
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,840

Quote:
Originally Posted by Davros View Post
strange thing for intel to put out
This isn't marketing... this is a peer-reviewed paper. The guys who did it are some of the best and they optimized both the CPU and GPU implementations very well and presented the results. This is how work *should* be done and results presented everywhere...

The question of which level of performance (CPU or GPU) is easier to come by for varying levels of programmer skill is not clear, but these results definitely represent the "expert" end on both platforms.
__________________
The content of this message is my personal opinion only.
Old 25-Jun-2010, 16:58   #4
arjan de lumens
Senior Member
 
Join Date: Feb 2002
Location: gjethus, Norway
Posts: 1,256

As it looks to me, several of the algorithms described suffer inefficiencies on the GPU side due to shortcomings in the GTX280 architecture used for comparison: lack of coherency, lack of large on-chip caches (of the algorithms in the paper, "Solv" and "Hist" seem to be particularly hard hit by these shortcomings). It would be interesting to see this exercise repeated on more modern DirectX11-class GPU architectures like Fermi or Evergreen, where these shortcomings are absent or at least much less prominent.

Also, as Ars Technica notes, even if the overall speedup when going from Core-i7 to GTX280 is only 2.5x instead of 100x, it's still a major win for the GPU side of the story - and even more so when considering the price of Core-i7 versus GTX280.
Old 25-Jun-2010, 17:23   #5
3dilettante
Senior Member
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,071

The conventional wisdom is that a non-standard or exotic architecture needs at least a 10x advantage to justify itself.
There are certain tests where there is at least a 10x advantage, something Fermi could improve on measurably. In those areas, the GPU looks more compelling than in areas where a lower multiplier means that Moore's law or the slow ratcheting down of prices as SKUs age could significantly reduce the gains before the needed software customization and tools are completed.
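
A rough sketch of that argument, for illustration only. It assumes the incumbent platform doubles in performance at a fixed interval (two years here, an assumed figure) and that the exotic platform stands still in the meantime, which is the pessimistic case. The function name is made up for this example.

```python
import math


def years_until_parity(speedup, doubling_period_years=2.0):
    """Years for a baseline that doubles every `doubling_period_years`
    to close a gap of `speedup`, if the faster platform does not improve."""
    if speedup <= 1.0:
        return 0.0
    return math.log2(speedup) * doubling_period_years


# Compare a small lead, the conventional 10x threshold, and a 100x claim.
for s in (2.5, 10.0, 100.0):
    print(f"{s:>5}x lead lasts about {years_until_parity(s):.1f} years")
```

Under these assumptions a 2.5x lead is gone in well under three years, while a 10x lead survives for more than six, which is why the size of the multiplier matters so much when the porting effort itself takes years.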
__________________
Dreaming of a .065 micron etch-a-sketch.
Old 25-Jun-2010, 18:14   #6
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,863

The other factor is AVX is incoming and core counts per x86 socket are now rising healthily - at least for a while.

Essentially you need to be really sure that an application/library/component you move to GPU "now" is going to speed up according to Moore's Law - as in general it will definitely do so on CPU - over the next few years (unless you'll never want more performance than you can get today).

One could argue that CPUs are going to get a double-shot of scaling, AVX and 16-cores per socket, and then that will be it (wild-card being bandwidth per socket). After that the "why more x86 cores?" question gets tough to answer. What do consumer applications want to do that will chew through x86 style cores in preference to GPU cores?

Then, well, there's Fusion and the like - eventually there'll be no choice to make.

Jawed
Old 27-Jun-2010, 16:57   #7
Rolf N
Recurring Membmare
 
Join Date: Aug 2003
Location: yes
Posts: 2,494

Quote:
Originally Posted by 3dilettante View Post
The conventional wisdom is that a non-standard or exotic architecture needs at least a 10x advantage to justify itself.
There are certain tests where there is at least a 10x advantage, something Fermi could improve on measurably. In those areas, the GPU looks more compelling than in areas where a lower multiplier means that Moore's law or the slow ratcheting down of prices as SKUs age could significantly reduce the gains before the needed software customization and tools are completed.
This falsely assumes GPU performance doesn't scale with process advances.
Old 28-Jun-2010, 13:40   #8
3dilettante
Senior Member
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,071

Until continuity is stronger in GPGPU, the next scaling of hardware with Moore's law is just as likely to require junking established software and tools as not.
Given the overall state of GPU hardware and software for HPC (this includes AMD's non-efforts thus far, but Nvidia's products have some shortcomings as well), establishing continuity would be less than optimal using the current baseline.
Old 25-Jun-2010, 18:24   #9
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070

a) AVX's lack of predication and scatter/gather gives it, ironically, a much more restricted programming model than modern GPU's. Frankly they should just ditch AVX for LRBni.

b) Last 4 years we have seen cores/socket growth slower than predicted by moore's law. I am not sure why it would change in the next 2 years.
__________________
The views presented here are my own and not my employer's.
Quote:
Originally Posted by Alexko View Post
So in a nutshell, model [BLANK] will have [BLANK], up to [BLANK], and even [BLANK] for a power consumption of just [BLANK]. Impressive.
Old 25-Jun-2010, 18:44   #10
3dilettante
Senior Member
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,071

16 per socket could happen with Bulldozer, and perhaps a Sandy Bridge EX variant. The Westmere EX was a notable outlier, with only 10 cores compared to Nehalem EX's 8 but that could be because the higher count is waiting on the new core design.

Unfortunately for Bulldozer, the FP resources at least for the first generation scale at 1/2 per core, and the 16-core variant would be an MCM, with those cores individually being somewhat diminished due to other shared resources.
On the other hand, it would get FMAC, which Sandy Bridge's AVX implementation would not.
On the other other hand, it's going to be an AMD extension, which is going to be ignored by most.


One could argue that these could still scale with Moore's law, just that at their respective nodes they are slightly too expensive for full implementations.
So now it's more of a "scale with Moore's law minus n iterations", with n being some offset often ranging between 1 and 1.5, possibly 2, corresponding to some initial one-time implementation cost.

The interesting part is that the offset is much more crucial now than it was previously. At least for silicon, we might be running out of room for n...
Old 25-Jun-2010, 18:45   #11
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,863

Quote:
Originally Posted by rpg.314 View Post
a) AVX's lack of predication and scatter/gather gives it, ironically, a much more restricted programming model than modern GPU's. Frankly they should just ditch AVX for LRBni.
Vectors are much bigger on the GPUs - i.e. they're much more dependent on those things to get anywhere. I'm certainly not saying GPUs don't have an advantage though. Merely that when you're divvying-up the space, AVX puts a sizeable dent into GPU advantages in certain places.

But yeah, LRBni for the win. Pity that Intel appears to intend to put it into the "add-in card ghetto" (not enough RAM, PCI Express barrier...) instead of mainstream x86. Oh well.

Quote:
b) Last 4 years we have seen cores/socket growth slower than predicted by moore's law.
2 cores -> 12 cores.

Quote:
I am not sure why it would change in the next 2 years.
NVidia's certainly struggling. Obviously ATI isn't - yet, anyway.

Jawed
Old 25-Jun-2010, 22:19   #12
RecessionCone
Member
 
Join Date: Feb 2010
Posts: 170

The paper is very short on detail with regards to a lot of the tests they ran. For example, what kind of SpMV did they benchmark - I'm assuming it was Double Precision, since their performance was so low, but what data structure did they use, on what type of Sparse Matrix?

It's also rather cute to compare a 65nm processor released in June 2008 with a 45nm processor released in October 2009. I'd like to see them rerun their tests with Fermi.
Old 25-Jun-2010, 22:31   #13
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,863

Quote:
Originally Posted by RecessionCone View Post
The paper is very short on detail with regards to a lot of the tests they ran. For example, what kind of SpMV did they benchmark - I'm assuming it was Double Precision, since their performance was so low, but what data structure did they use, on what type of Sparse Matrix?
Quote:
8. SpMV or sparse matrix vector multiplication is at the heart of many iterative solvers. There are several storage formats of sparse matrices, compressed row storage being the most common. Computation in this format is characterized by regular access patterns over non-zero elements and irregular access patterns over the vector, based on column index. When the matrix is large and does not fit into on-die storage, a well optimized kernel is usually bandwidth bound.
Quote:
The referred papers contains the best previous reported performance numbers on CPU/GPU platforms. Our optimized performance numbers are at least on par or better than those numbers.

[8] N. Bell and M. Garland. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In SC ’09: Proceedings of the 2009 ACM/IEEE conference on Supercomputing, 2009.
[47] F. Vazquez, E. M. Garzon, J.A.Martinez, and J.J.Fernandez. The sparse matrix vector product on GPUs. Technical report, University of Almeria, June 2009.
[50] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. In SC ’07: Proceedings of the 2007 ACM/IEEE conference on Supercomputing, pages 1–12, New York, NY, USA, 2007. ACM.
Quote:
It's also rather cute to compare a 65nm processor released in June 2008 with a 45nm processor released in October 2009. I'd like to see them rerun their tests with Fermi.
Fermi wasn't frying bacon in public in time for the paper's submission, back in November last year.
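
For reference, the compressed row storage layout described in the quoted passage can be sketched as below. This is a plain Python illustration, not the paper's kernel; the names `csr_spmv`, `row_ptr`, `col_idx` and `values` are chosen for this example.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A held in compressed row storage."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        # The non-zeros of one row are contiguous: regular, streaming access.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # x is indexed by column: irregular, data-dependent access.
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Small example matrix:
# [[4, 0, 1],
#  [0, 2, 0],
#  [3, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
y = csr_spmv(row_ptr, col_idx, values, [1.0, 2.0, 3.0])
print(y)
```

Each non-zero costs one multiply-add but needs a value, a column index and a vector element from memory, which is why a well-tuned kernel on a large matrix ends up limited by bandwidth rather than arithmetic.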
Old 25-Jun-2010, 23:16   #14
RecessionCone
Member
 
Join Date: Feb 2010
Posts: 170

The quote you included from the paper implied that they used CSR, but didn't state it explicitly. If they did use CSR, that's a little disappointing, since it's actually not a very good format for GPUs.

Additionally, they didn't state what the characteristics of the matrix they were testing were. SpMV performance varies tremendously based on the data structure and properties of the matrix. If they used a vector CSR format, there's a good chance that the hybrid ELL/COO format which Bell and Garland propose would have performed significantly (~50%) better.
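
To make the format difference concrete, here is a minimal sketch of the hybrid idea: a fixed number of entries per row go into a padded ELL block, and whatever does not fit spills into a COO list. The function names, the row-major lists and the choice of `k` are assumptions for illustration; a real GPU implementation would lay the ELL arrays out so that neighbouring threads read neighbouring addresses.

```python
def to_hyb(rows, k):
    """Split a matrix given as per-row lists of (col, value) pairs into an
    ELL part (exactly k slots per row, zero-padded) and a COO overflow."""
    ell_cols = [[0] * k for _ in rows]
    ell_vals = [[0.0] * k for _ in rows]
    coo = []  # (row, col, value) triples that did not fit in the ELL part
    for r, entries in enumerate(rows):
        for j, (c, v) in enumerate(entries):
            if j < k:
                ell_cols[r][j] = c
                ell_vals[r][j] = v
            else:
                coo.append((r, c, v))
    return ell_cols, ell_vals, coo


def hyb_spmv(ell_cols, ell_vals, coo, x):
    """Compute y = A @ x from the ELL part, then add the COO overflow."""
    y = [0.0] * len(ell_cols)
    # ELL: every row does exactly k multiply-adds, so the work is uniform.
    for r, (cols, vals) in enumerate(zip(ell_cols, ell_vals)):
        y[r] = sum(v * x[c] for c, v in zip(cols, vals))
    # COO: covers the few long rows without padding all the others.
    for r, c, v in coo:
        y[r] += v * x[c]
    return y


rows = [
    [(0, 4.0), (2, 1.0)],
    [(1, 2.0)],
    [(0, 3.0), (1, 6.0), (2, 5.0)],
]
ell_cols, ell_vals, coo = to_hyb(rows, k=2)
y = hyb_spmv(ell_cols, ell_vals, coo, [1.0, 2.0, 3.0])
print(coo, y)
```

How much this helps depends entirely on the row-length distribution of the matrix, which is exactly the information missing from the paper.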

I'm assuming they used a Double Precision, Vector CSR SpMV for this example, but that's just guesswork. Their assertion that their results are better than the literature is useless without knowing exactly what they did, especially since the literature has such a wide performance range on this benchmark. Bell and Garland show performance ranging from 1 GFLOP/s to 40 GFLOP/s - it all depends on the data structure and properties of the matrix. Without that information, we're kind of left adrift.

I didn't say their paper was broken because they didn't have Fermi in November, I just pointed out that Intel is benefiting here from a 16 month technology development advantage. 16 months is a big deal - the difference between the 5870 and the 4870 was 14 months, which translated into about 60% in performance. My single precision DIA SpMV code runs about 50% faster on Fermi than on GTX280 - and that's without any tuning for Fermi's caches.

Taking a step back, I agree with the paper's conclusion that many GPU papers with 100x speedup results are misleading, and if we're trying to decide what architecture to use for a problem, it's important to normalize the comparisons. I'm just not 100% sure I believe Intel's normalization here, and it's hard to convince myself they did the right thing when the benchmarks were so shallowly described.

[1] http://graphics.cs.uiuc.edu/~wnbell/...throughput.pdf
Old 28-Jun-2010, 18:03   #15
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070

Quote:
Originally Posted by Jawed View Post
Vectors are much bigger on the GPUs - i.e. they're much more dependent on those things to get anywhere. I'm certainly not saying GPUs don't have an advantage though. Merely that when you're divvying-up the space, AVX puts a sizeable dent into GPU advantages in certain places.

But yeah, LRBni for the win. Pity that Intel appears to intend to put it into the "add-in card ghetto" (not enough RAM, PCI Express barrier...) instead of mainstream x86. Oh well.
I'd say the niche where AVX makes a dent against gpu's is pretty small. Adding highly restrictive extensions to a quite general and malleable architecture like multi-cores is not a good idea, given the evolution of your competitors.

Quote:
2 cores -> 12 cores.
Now you are going to normalize it for price (or area), or should I?

Further, I should add that this is a symptom of a larger problem. Vendors like to compare per core. Customers like to compare per node.
Old 28-Jun-2010, 22:51   #16
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,863

Quote:
Originally Posted by rpg.314 View Post
I'd say the niche where AVX makes a dent against gpu's is pretty small. Adding highly restrictive extensions to a quite general and malleable architecture like multi-cores is not a good idea, given the evolution of your competitors.
x86 is incumbent, so the alternative has got to have a huge advantage.

Quote:
Now you are going to normalize it for price (or area), or should I?
I'm raising the question of scaling. GPU scaling isn't a given. x86 scaling, beyond 16 cores isn't, either. Fusion musses things up. These are just factors to consider.

Quote:
Further, I should add that this is a symptom of a larger problem. Vendors like to compare per core. Customers like to compare per node.
Then you have things like Clearspeed - had great potential, some sites invested pretty heavily, buying into the future scaling that was promised, etc. The road to hell is paved with good intentions.

Jawed
Old 29-Jun-2010, 08:48   #17
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070

Quote:
Originally Posted by Jawed View Post
x86 is incumbent, so the alternative has got to have a huge advantage.
If you used intrinsics/hand made classes to wrap sse, then you are screwed - aka massive rewrites.

If you are relying on a vectorizing compiler, then without predication and scatter gather, the performance falloff will be pretty rapid for a large number of applications.

Either way, IMHO, avx is much less useful as you increase the vector width.

If intel's long term plans are to add LRBni to cpu's, then we might see 2 different sets of vector ISAs, the SSE/AVX lineage and LRBni on the same core.
Quote:
I'm raising the question of scaling. GPU scaling isn't a given. x86 scaling, beyond 16 cores isn't, either. Fusion musses things up. These are just factors to consider.
Workload may not scale, the hw certainly can. And if your workload isn't scaling, there's no point blaming the hw.
Old 26-Jun-2010, 10:03   #18
CarstenS
Senior Member
 
Join Date: May 2002
Location: Germany
Posts: 2,842

What I don't get from the Intel analysis: why?
I mean, obviously, Nvidia et al. are comparing highly optimized, newly written algorithms to the previously used ones, thus generating the huge speedups. Intel is now debunking this two-orders-of-magnitude stuff, but at the same time they are saying that they need to carefully optimize for given problem sizes, allocate caches manually (at least it sounded like that) and so on.

The point is: what they have done is rewrite the algorithms in much the same way people would have to when going to GPU-space. The main problem being: if you need to rewrite your stuff, manually optimizing for parallelism, the main advantage of CPUs (drop-in replacements) is gone, and now you're rewriting everything and are still short of GPU performance levels by quite a margin.
__________________
English is not my native tongue. Before flaming please consider the possibility that I did not mean to say what you might have read from my posts.
Work| Recreation
Warning! This posting may contain unhealthy doses of gross humor, sarcastic remarks and exaggeration!
Old 26-Jun-2010, 11:25   #19
aaronspink
Senior Member
 
Join Date: Jun 2003
Posts: 2,570

Quote:
Originally Posted by CarstenS View Post
What I don't get from the Intel analysis: why?
I mean, obviously, Nvidia et al. are comparing highly optimized, newly written algorithms to the previously used ones, thus generating the huge speedups. Intel is now debunking this two-orders-of-magnitude stuff, but at the same time they are saying that they need to carefully optimize for given problem sizes, allocate caches manually (at least it sounded like that) and so on.

The point is: what they have done is rewrite the algorithms in much the same way people would have to when going to GPU-space. The main problem being: if you need to rewrite your stuff, manually optimizing for parallelism, the main advantage of CPUs (drop-in replacements) is gone, and now you're rewriting everything and are still short of GPU performance levels by quite a margin.
I highly suggest you look at the paper that Jawed linked as well as http://www.prace-project.eu/document...totypes_hh.pdf. Look at the time to solution numbers, delivered performance, and %peak efficiency. Quite striking really.
__________________
Aaron Spink
speaking for myself inc.
Old 29-Jun-2010, 15:22   #20
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070

IMHO, MKL being essentially a dense linear algebra library, is useful only in a small niche. Or a part of a typical workload.
Old 29-Jun-2010, 18:21   #21
aaronspink
Senior Member
 
Join Date: Jun 2003
Posts: 2,570

Quote:
Originally Posted by rpg.314 View Post
IMHO, MKL being essentially a dense linear algebra library, is useful only in a small niche. Or a part of a typical workload.
MKL supports dense and sparse linear algebra, probability/statistics, vector math, and FFTs.
Old 30-Jun-2010, 03:21   #22
larrabee
Junior Member
 
Join Date: Dec 2009
Posts: 29

Quote:
The Nvidia GTX280 is composed of an array of multiprocessors (a.k.a. SM). Each SM has 8 scalar processing units running in lockstep, each at 1.3 GHz.
last time i checked the gtx280's stock frequency was 1400MHz. they used the theoretical gflops of the 1.4GHz part while benchmarking the 1.3GHz part. even though it's an 8% difference, that can make a significant change in measured efficiency.

also i found the average of the speed ups to be ~3.6x faster, not 2.5. i guess the order of magnitude speed ups dont count?

either they are trying to make gpu's look bad or the arithmetic skills of these people is at a 3rd grade level. i'll go with the latter.
Old 30-Jun-2010, 04:16   #23
Andrew Lauritzen
AndyTX
 
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,840

Quote:
Originally Posted by larrabee View Post
last time i checked the gtx280's stock frequency was 1400MHz
Check again: the gtx280 is 602/1296.
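
For scale, a quick sketch of how much the assumed shader clock moves the theoretical peak, and with it any efficiency figure. The unit count (30 SMs of 8 scalar units) and the 3 flops per unit per clock are assumptions about the GTX280 made for this example, and the measured throughput is invented.

```python
def peak_gflops(units, clock_ghz, flops_per_clock):
    """Theoretical peak in GFLOP/s for a given unit count and clock."""
    return units * clock_ghz * flops_per_clock


UNITS = 240          # assumed: 30 SMs x 8 scalar units
FLOPS_PER_CLOCK = 3  # assumed: multiply-add plus multiply per unit per clock

peak_1296 = peak_gflops(UNITS, 1.296, FLOPS_PER_CLOCK)
peak_1400 = peak_gflops(UNITS, 1.400, FLOPS_PER_CLOCK)

measured = 300.0  # hypothetical measured GFLOP/s, not from the paper

print(f"peak at 1296 MHz: {peak_1296:.1f} GFLOP/s")
print(f"peak at 1400 MHz: {peak_1400:.1f} GFLOP/s")
print(f"efficiency vs 1296 MHz peak: {measured / peak_1296:.1%}")
print(f"efficiency vs 1400 MHz peak: {measured / peak_1400:.1%}")
```

The two peaks differ by the clock ratio, about 8%, so the question only matters if the wrong clock was used for the peak, and at 1296 MHz it was not.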

Quote:
Originally Posted by larrabee View Post
also i found the average of the speed ups to be ~3.6x faster, not 2.5. i guess the order of magnitude speed ups dont count?
The average given is a geomean which is appropriate for speedup factors; read the paper again.
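
To see why the two averages come out so different, here is a small sketch with made-up speedup factors (these are not the paper's numbers):

```python
import math


def arithmetic_mean(xs):
    return sum(xs) / len(xs)


def geometric_mean(xs):
    # Exponential of the mean log: averages multiplicative factors as ratios,
    # so a 2x speedup and a 2x slowdown cancel out to 1x.
    return math.exp(sum(math.log(x) for x in xs) / len(xs))


speedups = [0.5, 1.0, 2.0, 4.0, 16.0]  # hypothetical per-kernel speedups

print(f"arithmetic mean: {arithmetic_mean(speedups):.2f}x")
print(f"geometric mean:  {geometric_mean(speedups):.2f}x")
```

A single large outlier drags the arithmetic mean up, while the geometric mean treats every kernel's ratio on an equal footing.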

Quote:
Originally Posted by larrabee View Post
either they are trying to make gpu's look bad or the arithmetic skills of these people is at a 3rd grade level. i'll go with the latter.
Option 3: you're wrong and didn't do your research... but I guess it's more fun to call all of the authors and peer reviewers stupid. Slower to insults, please.
Last edited by Andrew Lauritzen; 30-Jun-2010 at 04:32.
Old 17-Jul-2010, 03:55   #24
Billy Idol
Senior Member
 
Join Date: Mar 2009
Location: Europe
Posts: 2,601

I am just sitting here after the last day of a conference where I took part in a minisymposium on GPU/CPU HPC stuff.

What is rather difficult for me to judge:

How can I really make a fair comparison between CPU and GPU?!


Here the philosophical part:

If I went the naive route (which in fact is probably even the right one?!) I would say that in the end, whatever platform gives me the shortest overall wallclock time is the best one for me personally, and that's it!

The problem is that I am 100% certain that the answer to this question will vary drastically between the different people doing the HPC work, for instance depending on their personal experience, capability and willingness (which is probably the most important part) to get deeply involved in the topic!


Here the scientific part:

why even compare performance numbers based on FLOPS, when typically the implementations, and thus the algorithms, are different on different platforms?

why compare one GPU to a parallel quad-core CPU implementation?

why compare a single-threaded CPU implementation to an inherently massively parallel GPU implementation?

why compare two identical solution methods for a given scientific problem, when there are probably other solution methods available which are better suited to certain platforms?
on the other hand: if you optimize the solution method and make it architecture aware...what about the results, i.e. the accuracy?

Hm, I think that it is a rather difficult topic ... at least it is for me, as I have to decide if it pays off to go the GPGPU route at all...or follow the massively parallel HPC route?!
__________________
I bid farewell with a rebel yell...
Old 17-Jul-2010, 10:02   #25
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,863

If it's any consolation, people with billions to spend are in the same quandary.

This is rather amusing:

http://crd.lbl.gov/~dhbailey/dhbpapers/twelve-ways.pdf
http://crd.lbl.gov/~dhbailey/dhbtalks/dhb-12ways.pdf

All times are GMT +1.