The one and only Folding @ Home thread

Indeed, and we're all behind their movement to OpenCL.

Which is nice and all, but Radeon users still don't have any OpenCL support, despite it being advertised on the box since release last September. The SDK doesn't count because users aren't supposed to be installing that; developers are. If it's not included in Catalyst, it may as well not exist from a user's perspective.

I actually don't understand what is taking so long for users to get OpenCL support. As has been mentioned in other threads, Nvidia has had OpenCL support in their drivers for months, including full ICD support fairly recently (a quick sketch of what ICD support gives applications follows this post).

The Catalyst 10.4 drivers are due in the next few weeks, and it's unlikely there will be OpenCL support in those. With ATI tending to work months ahead on new drivers and no mention of OpenCL support coming, it isn't looking likely that users will have access to OpenCL in the near future; I'd bet on it being at least another six months.
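As an aside on what "full ICD support" means in practice: with an ICD-enabled driver installed, an application links against the shared OpenCL library and discovers whichever vendors' platforms are present at runtime, rather than binding to one vendor's library directly. Here's a minimal C sketch using the standard OpenCL host API (nothing Folding@Home-specific, just platform enumeration):

[code]
/* Minimal sketch: what ICD support buys an application.
 * With an ICD-enabled driver, the shared OpenCL library forwards the
 * platform query to every registered vendor driver, so one binary can
 * see Nvidia's and ATI's OpenCL implementations side by side.
 * Build (Linux): gcc icd_list.c -lOpenCL
 */
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platforms[8];
    cl_uint count = 0;

    /* The ICD loader dispatches this call across all registered vendor drivers. */
    if (clGetPlatformIDs(8, platforms, &count) != CL_SUCCESS || count == 0) {
        printf("No OpenCL platforms found (no ICD-enabled driver installed?)\n");
        return 1;
    }
    if (count > 8)
        count = 8;

    for (cl_uint i = 0; i < count; ++i) {
        char name[256] = {0};
        clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME, sizeof(name), name, NULL);
        printf("Platform %u: %s\n", i, name);
    }
    return 0;
}
[/code]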
 
I disagree. While it'd be nicer to include the OpenCL ICD in the regular WHQL Catalysts, downloading the SDK is no longer a problem for end users, and installing it is no more complicated than installing normal drivers.
 

Maybe if the SDK was stripped down to install just the components required for OpenCL and was linked on the Catalyst driver page, I'd agree with you, but currently, if you don't know where to find the SDK, you don't get OpenCL support as an end user. Even if you extract the SDK and install only the OpenCL part, skipping the profiler and samples, it still takes up something like 80 MB, which seems like a lot for just the OpenCL DLLs, so I'm guessing there is still development stuff even in the bare OpenCL installer.
 
Are you looking at a 32-bit or 64-bit system? The OpenCL components are about 42 MB each for 32-bit and 64-bit, so if you're on a 64-bit system you will need about 84 MB of drivers to support both 32- and 64-bit apps. That's just the required runtime components, with no extra development stuff.
 

I'm on a 64-bit system. Good to know, though that's larger than I expected. So the ATIStreamSDK_dev.exe file that's inside the extracted SDK just installs the bare runtime without extra dev stuff, then?
 
mhouston said:
No, the client isn't artificially limited to a certain number of stream processors.

I of course didn't mean to imply there was a literal cap. :p My point was that the workload was (more or less) capped (for whatever reason), which in turn only allowed about 320 stream processors to be fully utilized.
 
I'm on a 64-bit system. Good to know, larger than I expected though. So the ATIStreamSDK_dev.exe file that's within the SDK when extracted just installs the bare runtime without extra dev stuff then?
I don't know exactly what's in there, but I'd expect the executables, libs and headers in a file called "dev". The headers don't take up much space (400k or so). You need most of the libs, even if you're not building OpenCL programs yourself, so ~84MB is about as small as you can get right now.
 
Well, that pretty much clears the air. Thanks, Mike. I was wondering if you could go into a bit more detail about what in the current GPU v2 client limits performance on Radeons to a seemingly fixed number of ALUs?
 
Shaidar, I saw this in his post:
mhouston said:
No, the client isn't artificially limited to a certain number of stream processors. However, smaller work units will not fully utilize newer chips.

Not exactly sure what that means, but it sounds like there is some sort of bottleneck that prevents the client from benefiting from more ALUs.

Overall, OpenCL needs to work out well and the old clients need to go away.
 
Different algorithms are being used on each GPU. Moreover, Nvidia is a narrower architecture with faster ALUs so they can do a little better with less parallelism. For example, I'll bet that the ultra small proteins won't scale all that great from a GT200 to a Fermi because Fermi is a wider chip.

And no, we are not talking about VLIW vs scalar. This is the vector width of each core and the number of cores, i.e. how many work-items you need to be able to put in flight.
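
To put rough numbers on that "work-items in flight" point, here's a back-of-the-envelope C sketch. The core counts, vector widths, and wavefronts-in-flight values below are illustrative assumptions only, not exact figures for any particular chip; the takeaway is just that a wider chip needs proportionally more work-items before every SIMD is busy, which is why small work units stop scaling:

[code]
/* Back-of-the-envelope: how many work-items a GPU needs in flight.
 * All figures below are illustrative assumptions, not real chip specs. */
#include <stdio.h>

static long items_to_fill(int simds, int vector_width, int waves_per_simd)
{
    /* one wavefront/warp = vector_width work-items executing in lockstep;
       keep several in flight per SIMD to hide memory latency */
    return (long)simds * vector_width * waves_per_simd;
}

int main(void)
{
    printf("narrower GPU (30 cores x 32-wide warps x 4 in flight): %ld work-items\n",
           items_to_fill(30, 32, 4));
    printf("wider GPU    (20 SIMDs x 64-wide wavefronts x 4 in flight): %ld work-items\n",
           items_to_fill(20, 64, 4));
    return 0;
}
[/code]

A small protein that only exposes a few thousand independent work-items can keep the narrower configuration fed but leaves part of the wider one idle, which matches the scaling behaviour described above.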
 
You're talking about the width of the SIMDs themselves, then? i.e. 80 ALUs per SIMD for Radeons and 16/24/32 for Geforces.
 
For example, I'll bet that the ultra small proteins won't scale all that great from a GT200 to a Fermi because Fermi is a wider chip.
The then-current preview version of F@H, which supports the GF100, seems to prove your point:
http://www.pcgameshardware.com/aid,...Fermi-performance-benchmarks/Reviews/?page=18

Roughly 55 to 69% more performance compared to a GTX 285 is less than one could have hoped for, given all the fancy GPU-computing stuff inside Fermi. :)
 
You're talking about the width of the SIMDs themselves, then? i.e. 80 ALUs per SIMD for Radeons and 16/24/32 for Geforces.

It's not the ALUs, but the width of the SIMDs and the number of wavefronts/warps in flight per SIMD, and the number of SIMDs. Moreover, Nvidia runs an N^2/2 algorithm and ATI is running an N^2 algorithm. The N^2 algorithm also aligned well with the Brook programming model (streaming). Part of the algorithm choice was to try to scale out better, but restrictions on earlier hardware and the programming model also drove that choice.
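
To make the N^2 versus N^2/2 distinction concrete, here is a minimal CPU-side sketch in C. It is not Folding@Home's actual GPU kernel, just a toy pairwise interaction showing the two loop structures: the full all-pairs sum, where every particle gathers from all the others and writes only its own output (a clean fit for a streaming model like Brook), and the half-triangle version, which uses the symmetry F_ji = -F_ij to halve the arithmetic at the cost of scattered updates:

[code]
/* Toy contrast of N^2 vs N^2/2 pairwise interaction loops.
 * Not the real F@H kernels; the force law is a made-up 1/r^2 term
 * just to make the loop structures concrete. */
#include <stdio.h>
#include <stddef.h>
#include <math.h>

typedef struct { double x, y, z; } vec3;

static vec3 pair_force(vec3 a, vec3 b)
{
    vec3 d = { b.x - a.x, b.y - a.y, b.z - a.z };
    double r2 = d.x * d.x + d.y * d.y + d.z * d.z + 1e-9; /* softened */
    double s = 1.0 / (r2 * sqrt(r2));
    vec3 f = { d.x * s, d.y * s, d.z * s };
    return f;
}

/* N^2: each particle sums over all others. Every interaction is computed
 * twice, but each output is independent (gather-only, streaming-friendly). */
static void forces_n2(const vec3 *pos, vec3 *force, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        vec3 acc = { 0, 0, 0 };
        for (size_t j = 0; j < n; ++j) {
            if (j == i) continue;
            vec3 f = pair_force(pos[i], pos[j]);
            acc.x += f.x; acc.y += f.y; acc.z += f.z;
        }
        force[i] = acc;
    }
}

/* N^2/2: walk only the upper triangle and apply Newton's third law.
 * Half the arithmetic, but both i and j are updated, so a parallel
 * version needs scattered or atomic accumulation. */
static void forces_n2_half(const vec3 *pos, vec3 *force, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        force[i] = (vec3){ 0, 0, 0 };
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = i + 1; j < n; ++j) {
            vec3 f = pair_force(pos[i], pos[j]);
            force[i].x += f.x; force[i].y += f.y; force[i].z += f.z;
            force[j].x -= f.x; force[j].y -= f.y; force[j].z -= f.z;
        }
    }
}

int main(void)
{
    vec3 pos[4] = { {0,0,0}, {1,0,0}, {0,1,0}, {0,0,1} };
    vec3 fa[4], fb[4];
    forces_n2(pos, fa, 4);
    forces_n2_half(pos, fb, 4);
    for (int i = 0; i < 4; ++i)   /* both strategies should agree */
        printf("particle %d: n2=(%.3f %.3f %.3f)  n2/2=(%.3f %.3f %.3f)\n",
               i, fa[i].x, fa[i].y, fa[i].z, fb[i].x, fb[i].y, fb[i].z);
    return 0;
}
[/code]

The scattered updates in the half version are presumably part of why the gather-only N^2 form was the natural fit for the Brook-based client, as noted above.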
 
Different algorithms are being used on each GPU. Moreover, Nvidia is a narrower architecture with faster ALUs so they can do a little better with less parallelism. For example, I'll bet that the ultra small proteins won't scale all that great from a GT200 to a Fermi because Fermi is a wider chip.

And no, we are not talking about VLIW vs scalar. This is the vector width of each core and the number of cores, i.e. how many work-items you need to be able to put in flight.

This is even true today. On some of the new smaller WUs you might see more PPD than on the average-size WUs with a G92-based card, but you lose PPD on GT200 cards; the inverse is true for large WUs. On the software side, running smaller WUs is important too: they recently proved that their algorithm can accurately simulate proteins on a millisecond timescale, a pretty big feat.

Here it is:
http://www.youtube.com/watch?v=gFcp2Xpd29I
 