Old 28-Oct-2008, 22:17   #1
swaaye
Resident Sasquatch
 
Join Date: Mar 2003
Location: WI, USA
Posts: 4,129
FaH and ATI: Curiousnesses

http://www.techpowerup.com/reviews/P...D_4830/20.html

Why is the GF9600 ahead of the 4870? This is odd, I'd say.
Old 28-Oct-2008, 22:33   #2
Scali
Naughty Boy!
 
Join Date: Nov 2003
Posts: 2,127

Well, I've always had this theory that ATi's instruction set wasn't very efficient, and that graphics was actually a pretty good case for it; many GPGPU tasks would be worse.
Perhaps that's what we're seeing now, nVidia's scalar threading approach paying off.
Old 28-Oct-2008, 22:48   #3
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 8,532

I think it's nothing more than the very low performance of Brook+ since the compiler isn't "optimised" apparently.

Also, because ATI hardware is the reference for ATI and NVidia performance, any deficit in ATI performance below what the hardware is theoretically capable of is a multiplier for the performance of NVidia.

In other words, if the ATI hardware when running optimally compiled F@H is taken as the reference, but Brook+ code can only achieve 50% of that, then NVidia hardware automatically gets a free 2x multiplier if comparing NVidia points against ATI points. It's then a matter of how efficiently coded the NVidia core is, in terms of what that hardware is theoretically capable of.
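
As a rough illustration of the multiplier argument above, here is a minimal Python sketch. It assumes (the thread does not confirm this) that points are calibrated purely against the speed the reference ATI client actually achieves, and that both cards have the same theoretical throughput. The name `relative_credit` and its parameters are hypothetical.

```python
def relative_credit(reference_efficiency: float, other_efficiency: float) -> float:
    """Points earned by another client relative to the reference client.

    Toy model: both efficiencies are fractions (0 < e <= 1) of what each
    card could theoretically achieve, and both cards are assumed to have
    equal theoretical throughput. Points are assumed to be calibrated
    against the reference client's actual (not theoretical) speed.
    """
    for name, value in (("reference_efficiency", reference_efficiency),
                        ("other_efficiency", other_efficiency)):
        if not 0 < value <= 1:
            raise ValueError(f"{name} must be in (0, 1], got {value}")
    return other_efficiency / reference_efficiency


# Reference client reaches only 50% of its hardware's potential and the
# other client is perfectly coded: the other client earns twice the points.
print(relative_credit(0.5, 1.0))  # 2.0
# If the other client is equally inefficient, the advantage disappears.
print(relative_credit(0.5, 0.5))  # 1.0
```

In this model the 2x figure comes entirely from the reference client's shortfall, which is the point being made: the other vendor's score then depends only on how well its own core is coded.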

Jawed
Old 28-Oct-2008, 23:07   #4
entity279
Member
 
Join Date: May 2008
Location: Romania
Posts: 117

I am under the impression that the ATI F@H client is just made to work with the 4800 series, and isn't optimized. Am I at least partially correct?
Old 28-Oct-2008, 23:09   #5
Dave Baumann
Gamerscore Wh...
 
Join Date: Jan 2002
Posts: 12,196

Mainly, things are single-thread CPU bound, so we're not using the graphics engine to the maximum. The latest core update (which would have been applicable in time for this review) is actually pushing through smaller proteins (hence smaller shaders) on our solution. The previous core was pushing through larger proteins (hence larger shaders), and these scored better for us: we were still primarily CPU bound, but we used more of the engine, so we were doing harder WUs, and hence scoring higher, in the same time. There's a CAL update presently being qualified that should partially reduce the overhead, but it still won't get us to the maximum the engine can do.
__________________
Expand. Accelerate. Dominate.
ATI Radeon HD 5800 Series Graphics Cards - Designed by the Community
Old 28-Oct-2008, 23:09   #6
Scali
Naughty Boy!
 
Join Date: Nov 2003
Posts: 2,127

Quote:
Originally Posted by Jawed View Post
I think it's nothing more than the very low performance of Brook+ since the compiler isn't "optimised" apparently.
Either that, or it's just too hard to get efficient code out of ATi's architecture. I mean, do they mean they didn't bother to try and make the compiler optimize the code it outputs, or do they mean that the optimized output isn't as good as they would like it to be?
Quite a difference there.
nVidia's approach is simpler, so you don't rely as much on compiler optimization in the first place.

But I don't really think that the compiler doesn't do ANY optimization, so I doubt that they could make up for the large performance deficiency. At any rate the ball is in ATi's court, because nVidia already delivered a good SDK and compiler for Cuda. ATi needs to do the same if they want to compete in GPGPU.
Old 28-Oct-2008, 23:13   #7
Scali
Naughty Boy!
 
Join Date: Nov 2003
Posts: 2,127

Quote:
Originally Posted by Dave Baumann View Post
Mainly things are single thread CPU bound so we're not using the graphics engine to the maximum.
How come nVidia doesn't suffer from that same problem then?
Old 28-Oct-2008, 23:14   #8
Dave Baumann
Gamerscore Wh...
 
Join Date: Jan 2002
Posts: 12,196

Quote:
Originally Posted by Scali View Post
How come nVidia doesn't suffer from that same problem then?
They did something different.
Old 28-Oct-2008, 23:34   #9
Tchock
Member
 
Join Date: Mar 2008
Location: Jurong West
Posts: 544

I'm still waiting for the updated Brook+ and F@HCore to see if it works as I expected. If it does, there's probably going to be an undervolted RV770 doing some churning.
__________________
As a kid I thought the Cray-1 was a futuristic piece of furniture with computers inside.
Old 28-Oct-2008, 23:47   #10
Scali
Naughty Boy!
 
Join Date: Nov 2003
Posts: 2,127

Quote:
Originally Posted by Dave Baumann View Post
They did something different.
Like what?
Old 28-Oct-2008, 23:56   #11
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 8,532

Quote:
Originally Posted by Scali View Post
Either that, or it's just too hard to get efficient code out of ATi's architecture. I mean, do they mean they didn't bother to try and make the compiler optimize the code it outputs, or do they mean that the optimized output isn't as good as they would like it to be?
Quite a difference there.
nVidia's approach is simpler, so you don't rely as much on compiler optimization in the first place.
I'm talking about the Brook+ compiler - nothing to do with converting IL into assembly.

It's my impression that Brook+ -> IL is not tuned for ATI's memory architecture.

Brook+, even for simple things like matrix multiply, is far from optimal. Of course, MM isn't as trivial as it first appears once you have to start programming for the cache architecture to get optimal performance.

http://ati.amd.com/technology/stream...08Tutorial.pdf

Since Brook+ doesn't expose any of the memory system (unless you consider explicit usage of up to 8 inputs and 8 outputs, "memimport" and "memexport" per kernel as explicit), it's all down to the compiler.

Quote:
But I don't really think that the compiler doesn't do ANY optimization, so I doubt that they could make up for the large performance deficiency. At any rate the ball is in ATi's court, because nVidia already delivered a good SDK and compiler for Cuda. ATi needs to do the same if they want to compete in GPGPU.
I suspect by abandoning Brook+ and using OpenCL they'll get there.

Jawed
Old 29-Oct-2008, 00:01   #12
Scali
Naughty Boy!
 
Join Date: Nov 2003
Posts: 2,127

Quote:
Originally Posted by Jawed View Post
I'm talking about the Brook+ compiler - nothing to do with converting IL into assembly.

It's my impression that Brook+ -> IL is not tuned for ATI's memory architecture.

Brook+, even for simple things like matrix multiply, is far from optimal. Of course, MM isn't as trivial as it first appears once you have to start programming for the cache architecture to get optimal performance.
I'm just saying that ATi needs to provide a decent solution, I don't care how they do it. It seems that they're currently trailing Cuda by quite a margin.
Old 29-Oct-2008, 00:07   #13
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 8,532

Did you read the PDF already?

Jawed
Old 29-Oct-2008, 00:08   #14
Arnold Beckenbauer
Member
 
Join Date: Oct 2006
Location: Goettingen, Germany
Posts: 697

No folders here?
Quote:
We used version 6.20r1 to download a work unit and fold it using whatever GPU acceleration is available. Due to the different GPU designs, different types of work unit were used. However, for most Folding users the PPD (Points Per Day) metric is the most important because that's what determines their ranking in the system
I'll try to explain these numbers:
If you have, for example, an HD3850 and get P4742 with 1254 atoms, you earn 548 points per WU and every frame takes ~4 minutes, so your PPD will be ~2,000. With the NV-GPU2-Client you will still get "test proteins" worth 480 points per WU (576 atoms), and an 8800GT needs 2 minutes or less per frame. So if you only want points, a GeForce 8/9 is the better choice.
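
The arithmetic behind these figures can be sketched in a few lines of Python. This is only a back-of-the-envelope check, and it assumes each work unit is split into 100 frames (1% of the WU per frame), which the post does not state; `points_per_day` and `FRAMES_PER_WU` are hypothetical names.

```python
MINUTES_PER_DAY = 24 * 60
FRAMES_PER_WU = 100  # assumption: one frame = 1% of a work unit


def points_per_day(points_per_wu: float, minutes_per_frame: float,
                   frames_per_wu: int = FRAMES_PER_WU) -> float:
    """Estimate PPD from the points per work unit and the time per frame."""
    if minutes_per_frame <= 0 or frames_per_wu <= 0:
        raise ValueError("frame time and frame count must be positive")
    minutes_per_wu = minutes_per_frame * frames_per_wu
    return points_per_wu * MINUTES_PER_DAY / minutes_per_wu


# HD3850: 548 points per WU at ~4 minutes per frame
print(round(points_per_day(548, 4)))  # 1973, i.e. roughly 2,000 PPD
# 8800GT: 480 points per WU at ~2 minutes per frame
print(round(points_per_day(480, 2)))  # 3456
```

Under that assumption the HD3850 estimate matches the ~2,000 PPD quoted above, and the 8800GT on the small test proteins lands well ahead of it.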

But one month ago NV folders got larger projects with 1254 atoms and their PPDs went strongly down. For every WU they got 430 points (and not 548 points like ATi folders): http://foldingforum.org/viewtopic.php?f=52&t=5452
__________________
Hail Brothers and Sisters! Coranon Silaria, Ozoo Mahoke
Eta Kooram Nah Smech!

Find Chuck Norris.

Last edited by Arnold Beckenbauer; 21-Dec-2008 at 15:19.
Old 29-Oct-2008, 00:54   #15
Dave Baumann
Gamerscore Wh...
 
Join Date: Jan 2002
Posts: 12,196

Quote:
Originally Posted by Scali View Post
Like what?
Personally I don't know the ins and outs of what NVIDIA have done with their client, and I'm sure they don't want to tell us. But what I'm saying is the simple fact of the matter, which is fairly well known to those who have followed GPU F@H.

Arnold has pointed out the trend of what happened with NVIDIA. From our side, when the GPU2 client was introduced the smaller test proteins were very, very CPU bound. Then a new core came through with larger proteins, and our score rose significantly because we were still executing them in a similar timeframe, just using more of the GPU's processing power; conversely, NVIDIA went down because they were already GPU bound even on the smaller proteins. The latest core that came through is looking at mid-sized proteins, so our scores have decreased a little again.
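
A toy model can make this CPU-bound versus GPU-bound effect concrete. The Python sketch below assumes the CPU-side and GPU-side work for a WU overlap fully, so the slower side sets the WU time. All the minute values are invented for illustration, as is the 480-point score for the CPU-bound client's small protein; only the 548 and 430 point figures for the larger projects come from the thread. `toy_ppd` is a hypothetical name.

```python
MINUTES_PER_DAY = 24 * 60


def toy_ppd(points_per_wu: float, cpu_minutes: float, gpu_minutes: float) -> float:
    """PPD when the slower of the CPU-side and GPU-side work sets the WU time."""
    wu_minutes = max(cpu_minutes, gpu_minutes)
    if wu_minutes <= 0:
        raise ValueError("at least one stage must take a positive time")
    return points_per_wu * MINUTES_PER_DAY / wu_minutes


# CPU-bound client (made-up times): the CPU side takes 400 minutes per WU
# regardless of protein size, so a larger protein costs no extra wall time.
cpu_bound_small = toy_ppd(480, cpu_minutes=400, gpu_minutes=150)
cpu_bound_large = toy_ppd(548, cpu_minutes=400, gpu_minutes=330)
print(cpu_bound_large > cpu_bound_small)  # True: more points in the same time

# GPU-bound client (made-up times): GPU time grows with protein size and
# the points per WU do not keep pace, so PPD falls.
gpu_bound_small = toy_ppd(480, cpu_minutes=100, gpu_minutes=200)
gpu_bound_large = toy_ppd(430, cpu_minutes=100, gpu_minutes=435)
print(gpu_bound_large < gpu_bound_small)  # True: longer WUs, fewer points
```

The direction of each change, not the specific numbers, is what the model is meant to show.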

Quote:
Originally Posted by Arnold Beckenbauer View Post
No folders here?
Yes, there are
Old 29-Oct-2008, 07:47   #16
Scali
Naughty Boy!
 
Join Date: Nov 2003
Posts: 2,127

Quote:
Originally Posted by Jawed View Post
Did you read the PDF already?

Jawed
You have to realize that I don't believe ANYTHING that AMD/ATi says anymore, given their past.
I'll just wait and see what happens, if anything happens at all.
Old 29-Oct-2008, 08:00   #17
hoom
Senior Member
 
Join Date: Sep 2003
Posts: 1,670

The PPD discrepancy is because they benchmark with a 3850.

It's very notable that the client stats page shows ATI consistently doing slightly more 'Actual Teraflops' per active client.

Divide the Actual Teraflops by the number of active processors and you get:
0.1100 for ATI
0.1099 for NV

If they are both doing nearly the same amount of actual work per processor (with ATI even slightly in the lead) the PPD difference is simply because the benchmark is on the ATI side.
The benchmark machine goes faster when any ATI side improvement happens so there is no/less PPD increase on the ATI side.

I feel the benchmark should really be a CPU running the same simulation model/work unit.
Then, as long as the end result is the same, the GPU that does more actual work (i.e. finishes the same work unit faster) will be the one that gets the most PPD.
__________________
However, the above is the heart of the foreskin capacitance
Old 29-Oct-2008, 10:34   #18
Florin
Member
 
Join Date: Aug 2003
Posts: 814

Quote:
Originally Posted by Arnold Beckenbauer View Post
But one month ago NV folders got larger projects with 1254 atoms and their PPDs went strongly down. For every WU they got 430 points (and not 548 points like ATi folders): http://foldingforum.org/viewtopic.php?f=52&t=5452
Still curious, though, how a G92 gets about the same or slightly more PPD than an RV770 even with those larger projects. And that with considerably less CPU use.

And how G200 gets more points than anything out there, even quad cores running A2s.
Old 29-Oct-2008, 12:00   #19
Scali
Naughty Boy!
 
Join Date: Nov 2003
Posts: 2,127

Quote:
Originally Posted by Arnold Beckenbauer View Post
But one month ago NV folders got larger projects with 1254 atoms and their PPDs went strongly down. For every WU they got 430 points (and not 548 points like ATi folders): http://foldingforum.org/viewtopic.php?f=52&t=5452
Why exactly would NV folders get less points for processing WUs with the same number of atoms as the WUs done by ATi folders?
Old 29-Oct-2008, 12:18   #20
Arnold Beckenbauer
Member
 
Join Date: Oct 2006
Location: Goettingen, Germany
Posts: 697

Quote:
Originally Posted by Scali View Post
Why exactly would NV folders get less points for processing WUs with the same number of atoms as the WUs done by ATi folders?
Because the NV-GPU2-Client was not as efficient as the reference machine.
Old 29-Oct-2008, 12:19   #21
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 8,532

Quote:
Originally Posted by Scali View Post
You have to realize that I don't believe ANYTHING that AMD/ATi says anymore, given their past.
Ah, that explains why you were quizzing Dave.

Jawed
Old 29-Oct-2008, 12:34   #22
Scali
Naughty Boy!
 
Join Date: Nov 2003
Posts: 2,127

Quote:
Originally Posted by Arnold Beckenbauer View Post
Because the NV-GPU2-Client was not as efficient as the reference machine.
I don't understand that?
People measure performance in Points-Per-Day (PPD), right?
So that would have to be a measure of the sum of workunits * weight per time unit.
So why would they vary the weight of workunits that are the same size, depending on what processor is being used? If the system is not as efficient, it will not be able to complete as many workunits per day anyway, lowering its score.
Old 29-Oct-2008, 12:39   #23
Scali
Naughty Boy!
 
Join Date: Nov 2003
Posts: 2,127

Quote:
Originally Posted by Jawed View Post
Ah, that explains why you were quizzing Dave.
Well, if you don't think most of Dave Baumann's statements are overly vague or easily misinterpreted, I suppose you're just gullible.
Old 29-Oct-2008, 12:46   #24
Dave Baumann
Gamerscore Wh...
 
Join Date: Jan 2002
Posts: 12,196

And yet here we are, just explaining the simple fact of the matter that anyone who has followed folding knows, Scali.
Old 29-Oct-2008, 13:09   #25
Scali
Naughty Boy!
 
Join Date: Nov 2003
Posts: 2,127

"They did something different" isn't exactly an explanation.
And what if you haven't followed Folding, can't you ask some questions then?
I couldn't care less about folding myself, to be perfectly honest with you.
So no, I don't run the client myself, never have, and haven't kept up-to-date with its development.
However, I am interested in the fact that it's one of the few applications where GPGPU is applied on both ATi and NV hardware, so we can more or less have some kind of comparison between the two approaches to GPGPU.

Apparently one is CPU-limited, the other not. And for some reason the scoring is 'adjusted' per processor.