The one and only Folding @ Home thread

First, the CUDA and Brook clients use different code bases and different algorithms, and they calculate different classes of proteins, so it's hard to make good comparisons on that basis.

This isn't true. The F@H GPU2 client distributes the same work units to NV and ATi hardware. The client works differently on Radeons and Geforces, but it is processing the same dataset.

Furthermore, the ATI version of Folding was developed with an ancient Brook release without support for local memory. AFAIK that is the reason it actually does twice the number of calculations: it is cheaper to redo them than to store them somewhere in memory and load them again later. Newer Brook releases support the local memory of RV770 GPUs, but Stanford never bothered to update their code. The scaling from RV670 -> RV770 -> RV870 is extremely bad (virtually non-existent), not exactly a sign of optimal and forward-looking coding.

That sounds like what I alluded to hearing previously.

For the record, I'm not here to evangelize NV hardware, I'd rather have a good option from both IHVs (or more, if Intel ever joins the playing field).

I guess this should not become a discussion of how AMD's devrel department works or should be working, so I won't say anything on those points.

I think it's a relevant observation to make though, given the state of the market. We've all heard the stories about NV devrel working closely with devs to make sure their software runs properly on Geforces, but we rarely hear about ATi doing the same.
 
I don't believe this is a fair characterization. I don't think Pande Group does less work on ATi GPUs because they want to.

It is a fair characterization. The AMD F@H client was capped to use at most ~320 shader processors, which is why the only performance increases come from clock increases.
 
Untrue. See Folding@Home, where GT200b stomps all over anything ATi has, including Hemlock. Math rate isn't the only factor in solving any problem, GPGPU or not.

In computational molecular dynamics, it is. The reasons for AMD's trailing F@H perf are something else, tool and code base maturity being one of them.
 
Is F@H some kind of Holy Grail of GPGPU?
Look at this:
Milkyway@home
Collatz@home
distributed.net
Perhaps so, but previous ATi hardware is deficient, as the F@H client often has to redo calculations just to ensure correctness, something that doesn't occur when running on NV hardware.
Is it so?
http://foldingforum.org/viewtopic.php?f=51&t=10442#p103025
mhouston said:
It's a difference in algorithm choice and scaling. The easiest way to think about things is to look at the physics. Look at two particles, A and B. To calculate the force on A, you add up all the partial forces on A from all of the particles, including B. To calculate the force on B, you add up all the partial forces on B from all of the particles, including A. Now, you have calculated the force between A and B twice.

There is a tradeoff between calculating the force twice and storing the calculation and reloading it, i.e. a tradeoff between ALU load and memory-system load. Now, that being said, not everything is calculated twice, since there is other math besides just the force pair (like the acceleration and velocity calculations done after the partial forces are calculated, as well as the update to the particle position). There is also a different constant factor for each of the algorithms. If you look at really massive proteins, the performance difference between ATI and Nvidia is small, ~18% comparing a GTX280 and a 4870, despite us "doing 2X the work".

You can read about the difference in implementations and the performance scaling on different proteins in a paper from Vijay's group that Scott LeGrand (Nvidia) and I worked on as well: "Accelerating molecular dynamic simulation on graphics processing units", Journal of Computational Chemistry, Volume 30, Issue 6, Pages 864-872.
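Purely to make that tradeoff concrete, here's a toy CPU-side sketch of my own (not anything from the actual F@H code; the softened inverse-square pair force is just a stand-in). The first routine accumulates forces per particle and evaluates every pair twice; the second evaluates each pair once and applies it to both particles via Newton's third law. On a GPU the second variant needs somewhere to hold the shared partial result, e.g. local memory/LDS, which is exactly what the old Brook path lacked.

import numpy as np

def forces_redundant(pos):
    # Per-particle accumulation: the force between each pair (i, j) is
    # evaluated twice, once while summing for i and once for j.
    # More ALU work, but no partial result ever has to be stored.
    n = len(pos)
    f = np.zeros_like(pos)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            r = pos[j] - pos[i]
            d2 = np.dot(r, r) + 1e-6      # softened to avoid blow-ups
            f[i] += r / d2**1.5           # toy inverse-square pair force
    return f

def forces_shared(pos):
    # Pair-list accumulation: each pair force is computed once and applied
    # to both particles via Newton's third law. Half the pair math, but the
    # partial result has to be written to two places (on a GPU: kept in
    # local memory / LDS or scattered out).
    n = len(pos)
    f = np.zeros_like(pos)
    for i in range(n):
        for j in range(i + 1, n):
            r = pos[j] - pos[i]
            d2 = np.dot(r, r) + 1e-6
            fij = r / d2**1.5
            f[i] += fij
            f[j] -= fij
    return f

pos = np.random.rand(64, 3)
assert np.allclose(forces_redundant(pos), forces_shared(pos))  # same physics, different amount of work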
 
I remember some things from the F@H whitepapers (you have to pay to view them now). The NVIDIA client was a port of the ATi client. The ATi client used an earlier version of Brook and older hardware. Random numbers had to be generated on the CPU due to no integer support in Brook, and I think there was some restriction with gathers/scatters from kernels that NVIDIA didn't have. NVIDIA helped add some optimizations like the use of LDS, which gives them a nice performance boost. The NVIDIA client is more accurate too, because ATi reuses random numbers to save time. They want to use OpenCL to have one codebase so they can use all hardware to do the same projects. Hopefully ATi can get a boost with Evergreen's LDS. Fermi's caches could help a lot though.
 
It is a fair characterization. The AMD F@H client was capped to use at most ~320 shader processors, which is why the only performance increases come from clock increases.

Right, but this decision by Pande Group/ATi seems rather short-sighted now. Due to ATi's lack of continued involvement in F@H, application performance doesn't scale with new hardware. NV doesn't have this problem; F@H scales nicely across generations of their hardware.

You're suggesting that Dirt2, Mass Effect 2, Dragon Age Origins, CoD: Modern Warfare 2, Dawn of War 2, Battlefield Bad Company 2, etc., etc. are not some of the "most popular" PC titles over the past few months?

I'm suggesting that NV has better software support and flexibility. To this day my roommate gets to "enjoy" Borderlands sans Vsync and AA, whereas I run Vsync and 16xQ CSAA on my GTX 285, with nary an FPS hiccup to boot.

In computational molecular dynamics, it is. The reasons for AMD's trailing F@H perf are something else, tool and code base maturity being one of them.

That's part of my argument. NV has a superior toolset and it shows.

Is F@H some kind of Holy Grail of GPGPU?
Look at this:
Milkyway@home
Collatz@home
distributed.net

It's not a holy grail by any means, but it is the most popular GPGPU application, and by far the most popular distributed computing app. The hundreds of thousands of active F@H users outweigh every other DC community, to my knowledge.


The performance just isn't there. ATi GPUs are easily outdistanced by NV GPUs in F@H, regardless of the work unit in question. For example, my roommate's 4870x2 has produced 3000-4000 points per day (PPD) at most, while my GT200b farm produces 8000-10000 PPD, per GPU, on average.
 
Sorry? It's their code, not ours! We don't re-write games for developers, do we!? We have dev rel, we assist and can help optimise, and there is no difference between game and stream apps there.
Obviously, I see your point. But please try to see the customer's point of view as well: from a purely customer's point of view, I don't care who's to blame. All I care about in this case is that I have an application that I'd like to run, and not only run, but run fast, really fast.

And I know from all the reviews across the web that the chip's got the raw power to really own that application - as we see in things Gipsel regularly mentions like Milkyway@Home or Collatz@Home. But 'my' app just won't run fast. That's disappointing for me, so I quit folding for the time being.
 
There are client-specific work units, but these don't differentiate between NV and ATi hardware; the same work unit runs on both. The algorithm may be different, but the workload is not, and neither is the output.
Buddy, if the algorithm is different, then the workload is different. If I have one program doing bubble sort and another doing quicksort, and both are fed the same random sequence of numbers to sort, then the hardware sees a different workload from each.
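A throwaway sketch of what I mean (nothing to do with the actual F@H kernels, just the sorting analogy): feed the same input to two different algorithms and you get the same output from a very different amount of work.

import random

def bubble_sort(a):
    # O(n^2): count comparisons so the amount of work is visible.
    a = list(a)
    comparisons = 0
    for i in range(len(a)):
        for j in range(len(a) - 1 - i):
            comparisons += 1
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
    return a, comparisons

def quick_sort(a):
    # O(n log n) expected: count elements partitioned against a pivot.
    comparisons = 0
    def qs(xs):
        nonlocal comparisons
        if len(xs) <= 1:
            return xs
        pivot, rest = xs[0], xs[1:]
        comparisons += len(rest)
        return (qs([x for x in rest if x < pivot]) + [pivot] +
                qs([x for x in rest if x >= pivot]))
    return qs(list(a)), comparisons

data = [random.random() for _ in range(500)]
b_out, b_work = bubble_sort(data)
q_out, q_work = quick_sort(data)
assert b_out == q_out        # identical result...
print(b_work, q_work)        # ...wildly different workload for the hardware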

GT200b is nearly twice as fast as G80, and GF100 is about 50% faster than GT200b, despite not having any optimizations in the client. ATi hardware only gains performance in F@H through clockspeed increases. Cypress is no faster than RV770 @ the same clockspeed, despite having twice the ALUs and better register/cache architecture for GPGPU.
That's because none of the features of Cypress are being used. Heck, none of the features of even RV770 are even being used, including basic scatter, according to their paper. The GPU2 client was developed for R600 hardware.
Whatever the reason, ATi hardware does not scale in F@H, and NV hardware does.
For some reason, it only uses a hardcoded number of stream processors on ATI. I think a change was eventually made to make the client use 800 SPs instead of 320 so that RV770 was notably faster than R600, but AFAIK the client is still only using half the SPs on Cypress.
 
but this decision by Pande Group seems rather short-sighted now

Fixed. Which brings us back to the original point, don't blame AMD's hardware (or in this case I would even argue software) for Pande's shortcomings.

Due to ATi's lack of continued involvement in F@H, application performance doesn't scale with new hardware.

But I don't think this is true. How does AMD's "lack of continued involvement" have anything to do with Pande not removing the cap in the F@H client? It's their code (not AMD's code)! Regardless of whether AMD is helping or not, there's no excuse for Pande not to update and maintain their code. The notion that the only way developers can improve their code base is with an IHV's help is ridiculous (and that goes for anything, not just F@H).
 
Buddy, if the algorithm is different, then the workload is different. If I have one program doing bubble sort and another doing quicksort, and both are fed the same random sequence of numbers to sort, then the hardware sees a different workload from each.

They're getting the same work units and outputting the same results. The only difference I'm aware of is the fact that ATi needs to redo some calculations.

That's because none of the features of Cypress are being used. Heck, none of the features of even RV770 are even being used, including basic scatter, according to their paper. The GPU2 client was developed for R600 hardware.
For some reason, it only uses a hardcoded number of stream processors on ATI. I think a change was eventually made to make the client use 800 SPs instead of 320 so that RV770 was notably faster than R600, but AFAIK the client is still only using half the SPs on Cypress.

Yes, that is my understanding of the current situation WRT the GPU client V2. So we're back at the beginning again. NV hardware is faster than ATi hardware for F@H.
 
Fixed. Which brings us back to the original point, don't blame AMD's hardware (or in this case I would even argue software) for Pande's shortcomings.

Why not? NV has no such limitations.

But I don't think this is true. How does AMD's "lack of continued involvement" have anything to do with Pande not removing the cap in the F@H client? It's their code (not AMD's code)! Regardless of whether AMD is helping or not, there's no excuse for Pande not to update and maintain their code. The notion that the only way developers can improve their code base is with an IHV's help is ridiculous (and that goes for anything, not just F@H).

NV worked closely with Pande Group to update the GPU V2 client to run well on Geforces; ATi did not. Until the next version of the client, based upon OpenCL, is released, there is no vendor-agnostic API to work through, so existing versions run much closer to the metal and require interaction from the IHV to achieve decent stability and performance.
 
Doing some F@H myself, that's not exactly what I'd have liked to hear as one of your customers.

Honestly? The fewer resources ATI wastes on F@H the better. They are better off working on infrastructure and tools aimed at a larger community. The reality is that the F@H group should port their codebase to OpenCL, and then it will just work with whatever.
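For what it's worth, here's roughly what "just work with whatever" means on the host side; a minimal sketch using the pyopencl bindings (assuming they're installed), which simply enumerates whatever OpenCL devices the installed drivers expose, NV, ATi or CPU alike:

import pyopencl as cl

# List every OpenCL platform/device the installed drivers expose;
# the same host code runs regardless of which vendor is present.
for platform in cl.get_platforms():
    for device in platform.get_devices():
        print(platform.name, "->", device.name)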
 
On the official copy I have right here, AA is running fine.

Using what graphics card? My roommate is running his copy through Steam. Running Vista 64.

In the meantime, along with the other work going on, how much DX11 content do you think there would have been in Dirt2, AvP, Battlefield, Battleforge, STALKER, and the other titles coming up - do you think there was no work put in there?

I don't know what ATi's involvement has been in the specific games you mentioned, but I understand it is greater than many titles in the past. It's nice to see this, but pardon me if I'm a bit hesitant to believe ATi's devrel will be on-par with NV's any time soon, especially WRT F@H.
 
Honestly? The fewer resources ATI wastes on F@H the better. They are better off working on infrastructure and tools aimed at a larger community. The reality is that the F@H group should port their codebase to OpenCL, and then it will just work with whatever.

I agree. I don't like the current performance level ATi GPUs provide for F@H, but I think this is the best solution in the long run.
 
Incorrect. Mike would be exceedingly upset to see that. Lots of effort was put into the GPU2 client for Brook+.

You're right, I shouldn't be stating ATi hasn't done any work WRT F@H, it just appears that way to me based on the performance with successive hardware generations. Do you know why the client appears to have a cap on how many SIMDs it runs upon on ATi hardware?
 
Well it is the most popular GPGPU application...

Unlikely. As much as it pains me to say it, I would suggest that video transcoding is likely used more than F@H, and even that isn't used that much.


GT200b is nearly twice as fast as G80, and GF100 is about 50% faster than GT200b, despite not having any optimizations in the client. ATi hardware only gains performance in F@H through clockspeed increases. Cypress is no faster than RV770 @ the same clockspeed, despite having twice the ALUs and better register/cache architecture for GPGPU.

Whatever the reason, ATi hardware does not scale in F@H, and NV hardware does.

Might have something to do with when the clients were written.
 
It's not a holy grail by any means, but it is the most popular GPGPU application, and by far the most popular distributed computing app. The hundreds of thousands of active F@H users outweigh every other DC community, to my knowledge.

So?


The performance just isn't there. ATi GPUs are easily outdistanced by NV GPUs in F@H, regardless of the work unit in question. For example, my roommate's 4870x2 has produced 3000-4000 points per day (PPD) at most, while my GT200b farm produces 8000-10000 PPD, per GPU, on average.

Maybe you should stop worrying about epeening with PPD and just actually do something useful with your computer.
 
Unlikely. As much as it pains me to say it, I would suggest that video transcoding is likely used more than F@H, and even that isn't used that much.

That would be the logical alternative, but I'm not sure how you could prove it. Active users of transcoding software aren't tracked like F@H.

Might have something to do with when the clients were written.

I imagine the GPU v2 client was written and optimized for both IHVs' hardware at approximately the same time. The installation executable for the GPU V2 client supports both ATi and NV hardware.
 