The state of AMD's GPGPU implementation?


wingless
21-Aug-2008, 18:09
We all know CUDA seems to be more usable than AMD's GPGPU methods as of now. With that said, what do we know about AMD's chances of success in this arena? Fusion is due out next year, and GPGPU will probably be AMD's only saving grace in performance against Core i7, IF they can get some GPGPU-enabled software on the market. Besides Cyberlink's transcoding software, what other ATI-compatible GPGPU software do we know of that is coming out in the next year?

willardjuice
21-Aug-2008, 19:20
Pretty much nothing until OpenCL it seems.

Dave Baumann
21-Aug-2008, 20:20
First off, read this press release (http://www.amd.com/us-en/Corporate/VirtualPressRoom/0,,51_104_543~127451,00.html). As pointed out there we do have a C/C++ compiler environment in Brook+ and there is an ongoing development path for that.

With regard to consumer utilisation of GPGPU, bear in mind that we were the first to deliver Folding@Home on the GPU (two versions, in fact) and demonstrated transcoding with Adobe last year (http://www.hexus.net/content/item.php?item=11725). With regard to transcoding, we are not just saying to ISVs "here's a compiler, go nuts"; rather, we have "Cobra", which is a transcoding API that operates via CAL, and we'll make this API available to ISVs.

willardjuice
21-Aug-2008, 21:40
As pointed out there we do have a C/C++ compiler environment in Brook+ and there is an ongoing development path for that.

No Vista support though (from what I can tell). :sad:

wingless
21-Aug-2008, 21:41
First off, read this press release (http://www.amd.com/us-en/Corporate/VirtualPressRoom/0,,51_104_543~127451,00.html). As pointed out there we do have a C/C++ compiler environment in Brook+ and there is an ongoing development path for that.

With regard to consumer utilisation of GPGPU, bear in mind that we were the first to deliver Folding@Home on the GPU (two versions, in fact) and demonstrated transcoding with Adobe last year (http://www.hexus.net/content/item.php?item=11725). With regard to transcoding, we are not just saying to ISVs "here's a compiler, go nuts"; rather, we have "Cobra", which is a transcoding API that operates via CAL, and we'll make this API available to ISVs.

Wow...I got an answer from ATI's Technical Marketing Manager! I feel special (this is not sarcasm).

Thank you for your response. What efforts is AMD making to market OpenCL and Brook+ to software developers? It seems that news about CUDA is around every corner. I really want to see AMD's GPGPU software take off to aid Fusion CPUs in the performance wars with Intel. If Fusion's onboard graphics core can accelerate a number of software types, then it may outshine Core i7. Yes, I'm a fan of AMD. I also have a $299+tax HD 4870 that I got one week after it was released. I got it mostly for the GPGPU possibilities.

Is there a list of software devs that are onboard with AMD's GPGPU campaign?

Arnold Beckenbauer
21-Aug-2008, 23:55
First off, read this press release (http://www.amd.com/us-en/Corporate/VirtualPressRoom/0,,51_104_543~127451,00.html). As pointed out there we do have a C/C++ compiler environment in Brook+ and there is an ongoing development path for that.

This press release made Theo Valich think "AMD ditches Close-To-Metal, focuses on DX11 and OpenCL" (http://www.tgdaily.com/content/view/38764/140/)

With regard to consumer utilisation of GPGPU, bear in mind that we were the first to deliver Folding@Home on the GPU (two versions, in fact) and demonstrated transcoding with Adobe last year (http://www.hexus.net/content/item.php?item=11725). With regard to transcoding, we are not just saying to ISVs "here's a compiler, go nuts"; rather, we have "Cobra", which is a transcoding API that operates via CAL, and we'll make this API available to ISVs.

Will GPU-accelerated transcoding be limited to the HD 4000 series?

PS: Badaboom's Media Converter isn't good, http://forum.doom9.org/showthread.php?p=1169851#post1169851

Unknown Soldier
22-Aug-2008, 07:08
PS: Badaboom's Media Converter isn't good

From what I saw when I ripped using the Badaboom demo version, it's definitely much, much faster. I agree quality is an issue, but would it really look that bad on an Apple iPhone/iPod etc.?

Another thing to remember: while the quality wasn't good IMO, it's still a new application, and the CPU applications that are better at the moment were also crappy when they started.

Badaboom certainly has a bright future imo.

US

entity279
22-Aug-2008, 09:20
No Vista support though (from what I can tell). :sad:

Not really. I've written some Brook+ code on Vista (and in 1-2 days I will start something a bit more serious than examples), so it's OK. I know their site says it is not supported, but it just works.

pcchen
22-Aug-2008, 10:55
Is Vista 64 supported too? I'm considering upgrading to Vista 64 as I'm replacing my broken motherboard.

entity279
22-Aug-2008, 11:23
Is Vista 64 supported too? I'm considering upgrading to Vista 64 as I'm replacing my broken motherboard.

Yeah, I have Vista 64. There is a quite OK code optimiser in the FireStream package that I couldn't install, but CAL and Brook+ work fine. At least the CPU emulation does, because I just have an X800, so I can only run emulated code.

Dave Baumann
22-Aug-2008, 20:09
Is there a list of software devs that are onboard with AMD's GPGPU campaign?

I suggest you take a nose around the Stream Compute area of the website. Much of this is geared towards enterprise-type solutions, but you can see things like a list of our ecosystem partners there. For the consumer side, yes, we are working with devs, but right now we're not announcing anything other than the ISVs that we have already talked about.

This press release made Theo Valich think "AMD ditches Close-To-Metal, focuses on DX11 and OpenCL" (http://www.tgdaily.com/content/view/38764/140/)
I don't think it was that press release, rather a conference held before HD 4870 X2's launch - there was an element of sensationalism in the headline. In reality CTM evolved into CAL some time ago, and CAL will remain as the enabler for our Stream compute ecosystem by being the interface to the hardware - OpenCL, Cobra, Brook+, 3rd party toolsets will layer on top of this.

Will GPU-accelerated transcoding be limited to the HD 4000 series?
Right now HD 4000 is the primary target mainly because of the hardware changes. Remember that there is 5x the integer bitshift operation performance per shader array in RV770 compared to RV6xx and transcoding leverages this quite heavily.

willardjuice
22-Aug-2008, 20:12
At least the Cpu emulation

Yeah I wonder if I can natively run it though.

Arnold Beckenbauer
22-Aug-2008, 21:31
...
Right now HD 4000 is the primary target mainly because of the hardware changes. Remember that there is 5x the integer bitshift operation performance per shader array in RV770 compared to RV6xx and transcoding leverages this quite heavily.

Do I understand you right: no AVT for HD3800 users?



US, I converted a small video with Badaboom and then with Nero Record for my iPhone. And the Badaboom video wasn't worth transferring to my iPhone.

entity279
22-Aug-2008, 22:35
Yeah I wonder if I can natively run it though.

Well, if you can compile it (for the GPU), I guess you can also run it. But of course, I can't be absolutely sure about it, and the web doesn't provide answers to that AFAIK.

MfA
24-Aug-2008, 15:53
The example CAL apps from the XP64 install don't run under vista.

pcchen
24-Aug-2008, 20:26
The example CAL apps from the XP64 install don't run under vista.

That's too bad. Fortunately, the shop I went to does not have Vista x64 available for sale, so I went back to Windows XP for now, at a small price of about 600MB of memory. :)

Unknown Soldier
24-Aug-2008, 20:49
US, I converted a small video with Badaboom and then with Nero Record for my iPhone. And the Badaboom video wasn't worth transferring to my iPhone.

Really? Weird, since I had a look again yesterday at the Badaboom '300' version I ripped and thought it looked very good. Really clear, good sound quality, etc. I did rip it bigger than normal; I think normal is 320x480 or something, and I used the 720 size, the second option.

I can upload it to Rapidshare if you want, although it's 1.15 GB. Maybe I should rip a smaller version (10 min. or so into the movie) and the 320 (default setting) version.

US

Davros
25-Aug-2008, 12:34
any news for us gamers ?

wingless
26-Aug-2008, 00:08
any news for us gamers ?

Good question. Will we have Havok physics arriving sooner rather than later?

ahu
26-Aug-2008, 07:31
Good question. Will we have Havok physics arriving sooner rather than later?

I remember a hint a couple of weeks ago from an AMD representative at some forum about the AMD Havok implementation. The idea was to intercept Havok API calls to be run on the GPU only when it makes sense. So it would be a mix of CPU & GPU implementation.
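Purely as an illustration of the "only when it makes sense" idea, and not anything AMD or Havok has described: a dispatcher could fall back to the CPU for small workloads where transfer overhead would dominate. Everything in this Python sketch (the function name, the threshold, the body-count criterion) is hypothetical.

```python
def choose_backend(num_bodies, gpu_available, threshold=1000):
    """Pick where to run one physics step.

    `threshold` is an invented tuning knob: below it, the cost of shipping
    data to the GPU is assumed to outweigh the speed-up.
    """
    if gpu_available and num_bodies >= threshold:
        return "gpu"
    return "cpu"


print(choose_backend(50, gpu_available=True))     # → cpu
print(choose_backend(5000, gpu_available=True))   # → gpu
print(choose_backend(5000, gpu_available=False))  # → cpu
```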

But I got the impression that the strategy was just shaping up, so probably the answer to your question is later :wink:

wingless
27-Aug-2008, 23:42
I remember a hint a couple of weeks ago from an AMD representative at some forum about the AMD Havok implementation. The idea was to intercept Havok API calls to be run on the GPU only when it makes sense. So it would be a mix of CPU & GPU implementation.

But I got the impression that the strategy was just shaping up, so probably the answer to your question is later :wink:

Sounds good. We can still have performance with one GPU or keep near-Crossfire performance with 2+ GPUs. My friend and I loaded up the PhysX demos on his 790i+ 2x8800GT system and it ran like a dog. We made sure GPU Physics was selected in the PhysX driver too. I wasn't impressed by the 8800GT. I'd gamble that a 4850 or my 4870 would fare better.

rpg.314
30-Aug-2008, 11:51
The new version of their SDK is coming (http://forums.amd.com/forum/messageview.cfm?catid=328&threadid=97868&enterthread=y) in 2 weeks.

rpg.314
30-Aug-2008, 11:58
There's a hint here (http://forums.amd.com/forum/messageview.cfm?catid=328&threadid=98904&enterthread=y) from one of the AMD guys that CUDA-like shared memory is present on R7xx chips and will be exposed in this (upcoming) version of the SDK. I guess the implementation is going to look a lot like CUDA too.

Odin
30-Aug-2008, 19:30
There's a hint here (http://forums.amd.com/forum/messageview.cfm?catid=328&threadid=98904&enterthread=y) from one of the AMD guys that CUDA-like shared memory is present on R7xx chips and will be exposed in this (upcoming) version of the SDK. I guess the implementation is going to look a lot like CUDA too.
No, he's talking about being able to write in CAL (AMD's intermediate assembly format, which then compiles to RV770 bytecode) and use shared memory - a far cry from CUDA.

Tchock
31-Aug-2008, 15:43
That would definitely signify a big change for RV770's current performance in F@H if implemented properly. :grin:

MfA
31-Aug-2008, 22:33
a far cry from CUDA.
A fairly simple compilation step away from CUDA ... CUDA isn't exactly high level.

Unknown Soldier
01-Sep-2008, 09:33
US, I converted a small video with Badaboom and then with Nero Record for my iPhone. And the Badaboom video wasn't worth transferring to my iPhone.

Arnold, this is a small clip I converted using Badaboom.

http://rapidshare.com/files/141738589/300_-_HD-iPhone.mp4

Complete name : 300 - HD-iPhone.mp4
Format : MPEG-4
Format/Info : ISO 14496-1 Base Media
Format/Family : MPEG-4
File size : 67.6 MiB
PlayTime : 6mn 56s
Bit rate : 1361 Kbps
StreamSize : 316 KiB
Encoded date : UTC 2008-08-27 04:31:14
Tagged date : UTC 2008-08-27 04:31:14

Video #0
Codec : AVC
Codec/Family : AVC
Codec/Info : H.264 (3GPP)
PlayTime : 6mn 57s
Bit rate : 1292 Kbps
Width : 320 pixels
Height : 180 pixels
Display Aspect ratio : 16/9
Frame rate : 24.000 fps
Minimum frame rate : 12.500 fps
Maximum frame rate : 25.000 fps
Bits/(Pixel*Frame) : 0.935
StreamSize : 64.2 MiB
Encoded date : UTC 2008-08-27 04:31:14
Tagged date : UTC 2008-08-27 04:31:14

Audio #0
Codec : AAC LC
Codec/Family : AAC
Codec/Info : AAC Low Complexity
PlayTime : 6mn 56s
Bit rate mode : VBR
Bit rate : 62 Kbps
Channel(s) : 2 channels
Channel positions : L R
Sampling rate : 48 KHz
Resolution : 16 bits
StreamSize : 3.06 MiB
Encoded date : UTC 2008-08-27 04:31:14
Tagged date : UTC 2008-08-27 04:31:14

See, quality isn't bad imo. Remember, this is made to view on an iPhone and like I said before, it looks pretty good.

The original mp4 movie I made was at a higher resolution.

Note, this 7min. clip took me about 15 secs to make at over 200FPS according to Badaboom.

US

pcchen
01-Sep-2008, 12:45
1.3Mbps for a 320x180 video clip is a bit on the high side. A better comparison would be at around 500Kbps. Personally I use about 1Mbps for 640x360 video for iPod 3G (encoded with x264).
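To put a number on that: bits per pixel per frame is just the bit rate divided by width x height x frame rate. A minimal Python sketch, using the figures from the MediaInfo dump of the Badaboom clip; the function name is made up for illustration, and the 500 Kbps case simply assumes the same resolution and frame rate.

```python
def bits_per_pixel_frame(bitrate_kbps, width, height, fps):
    """Average number of encoded bits spent on each pixel of each frame."""
    return bitrate_kbps * 1000 / (width * height * fps)


# Badaboom clip as reported by MediaInfo: 1292 Kbps video, 320x180, 24 fps
badaboom = bits_per_pixel_frame(1292, 320, 180, 24)

# Assumed comparison point: the same clip re-encoded at around 500 Kbps
target = bits_per_pixel_frame(500, 320, 180, 24)

print(f"{badaboom:.3f}")  # → 0.935
print(f"{target:.3f}")    # → 0.362
```

The first figure matches the Bits/(Pixel*Frame) value of 0.935 that MediaInfo reports for the clip.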

MfA
01-Sep-2008, 15:39
That's too bad. Fortunately, the shop I went to does not have Vista x64 available for sale, so I went back to Windows XP for now, at a small price of about 600MB of memory. :)
Dunno if you read the AMD forum, but they say HD4000 and Vista support is in beta test now so it shouldn't take too long.

pcchen
01-Sep-2008, 22:42
Dunno if you read the AMD forum, but they say HD4000 and Vista support is in beta test now so it shouldn't take too long.

That'd be great. I'm also looking forward to the shared memory support in the new SDKs. Shared memory can save a lot of bandwidth in some situations, and the current solution in Brook+ (based on multiple writes) is not exactly satisfying.
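A rough way to see the bandwidth saving, as a plain Python counting model rather than Brook+ or CAL code: take a 1D stencil where each output needs its neighbours within some radius. The function names, the stencil itself, and the sizes are all assumptions for illustration, and edge clamping is ignored.

```python
def global_loads_naive(n, radius):
    # Without shared memory, every output element fetches its whole
    # neighbourhood (2 * radius + 1 values) from off-chip memory.
    return n * (2 * radius + 1)


def global_loads_tiled(n, radius, tile):
    # With shared memory, each tile stages its own elements plus a halo of
    # `radius` on both sides once, then reuses them on-chip.
    tiles = -(-n // tile)  # ceiling division
    return tiles * (tile + 2 * radius)


naive = global_loads_naive(4096, radius=4)
tiled = global_loads_tiled(4096, radius=4, tile=64)
print(naive, tiled, naive / tiled)  # → 36864 4608 8.0
```

In this toy case the tiled version reads off-chip memory 8x less often; real savings depend on the access pattern and on how the hardware handles contention.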

ahu
03-Sep-2008, 12:35
Could shared memory present a significant boost in GPGPU applications for ATI?

I'm just wondering what the major hindrance is that keeps ATI from turning its large GFLOPS advantage into results in applications like Folding@home. It's not VLIW utilization, according to Mike Houston.

Another puzzling example is the recent SiSoft Sandra 2009 GPGPU benchmark, where Nvidia 9600 GT beats ATI 4870 in single precision:
http://www.sisoftware.co.uk/index.html?dir=qa&location=cpu_vs_gpu_proc&langx=en&a=

willardjuice
03-Sep-2008, 16:29
Dunno if you read the AMD forum, but they say HD4000 and Vista support is in beta test now so it shouldn't take too long.

That's fantastic news. :grin:

Rufus
03-Sep-2008, 17:30
Could shared memory present a significant boost in GPGPU applications for ATI?

I'm just wondering what the major hindrance is that keeps ATI from turning its large GFLOPS advantage into results in applications like Folding@home. It's not VLIW utilization, according to Mike Houston.

Another puzzling example is the recent SiSoft Sandra 2009 GPGPU benchmark, where Nvidia 9600 GT beats ATI 4870 in single precision:
http://www.sisoftware.co.uk/index.html?dir=qa&location=cpu_vs_gpu_proc&langx=en&a=
They don't specify what SiSoft's benchmark is written in, but I have a feeling that if it runs on both NV and ATI it's written in GL or DX, in which case shared memory won't be exposed for either card.

For F@H adding shared memory should help a lot, assuming someone ports/rewrites the code to take advantage of it. It might be as easy as adding a few shared bits to the current ATI folding code, or it might mean completely rewriting a new algorithm to take advantage of shared mem (in which case you get 2 ATI code paths + 1 NV code path in F@H).

Arnold Beckenbauer
03-Sep-2008, 19:08
They don't specify what SiSoft's benchmark is written in, but I have a feeling that if it runs on both NV and ATI it's written in GL or DX, in which case shared memory won't be exposed for either card.

For F@H adding shared memory should help a lot, assuming someone ports/rewrites the code to take advantage of it. It might be as easy as adding a few shared bits to the current ATI folding code, or it might mean completely rewriting a new algorithm to take advantage of shared mem (in which case you get 2 ATI code paths + 1 NV code path in F@H).
Here is their press release:
http://www.sisoftware.co.uk/index.html?dir=news&location=gpgpu_release&langx=en&a=

Key features


4 architectures natively supported (x86, x64/AMD64/EM64T, IA64/Itanium2, ARM)
6 languages supported (English, French, German, Italian, Japanese, Russian)
AMD CTM (STREAM) GPGPU engine 1.1 and later
nVidia CUDA GPGPU engine 2.0 and later
Multi-GPGPUs supported, up to 8 in parallel.

The strange thing is: they say it's released, but it's not.
http://www.sisoftware.co.uk/index.html?dir=news&location=2009_release&langx=en&a=


US:
My video:

Complete name : C:\Dokumente und Einstellungen\Arnoldie\Eigene Dateien\Eigene Videos\VTS_01_1-iPhone.mp4
Format : MPEG-4
Format profile : Base Media
Codec ID : isom
File size : 21.1 MiB
Duration : 3mn 20s
Overal bit rate : 886 Kbps


Video
Format : AVC
Format/Info : Advanced Video Codec
Format profile : Baseline@L3.1
Format settings, CABAC : No
Format settings, ReFrames : 1 frame
Codec ID : avc1
Duration : 3mn 19s
Bit rate mode : Variable
Bit rate : 819 Kbps
Maximum bit rate : 1892 Kbps
Width : 640 pixels
Height : 360 pixels
Display aspect ratio : 16/9
Frame rate mode : Constant
Frame rate : 25.000 fps
Colorimetry : 4:2:0
Scan type : Progressive
Bits/(Pixel*Frame) : 0.142
Stream size : 19.5 MiB

Audio
Format : AAC
Format/Info : Advanced Audio Codec
Format version : Version 4
Format profile : LC
Format settings, SBR : No
Codec ID : 40
Duration : 3mn 20s
Bit rate mode : Variable
Bit rate : 61.5 Kbps

Unknown Soldier
03-Sep-2008, 20:09
Just remember that Badaboom is still a young application (ver. 0.9); Nero is up to version 8 or whatever.

And the version I used was a demo. It's expired now, so not sure if I can do more tests. Will check it out.

I think it's got a great future.

How long did it take to encode those 3min.?

US

Arnold Beckenbauer
03-Sep-2008, 20:25
35 fps (Badaboom&8600GT) vs. 28 fps (Nero& X2 3600+).

aaronspink
03-Sep-2008, 23:07
Just remember that Badaboom is still a young application (ver. 0.9); Nero is up to version 8 or whatever.

which means nothing. All the algorithms are publicly available for even the best encoders out there.


I think it's got a great future.

How long did it take to encode those 3min.?

US

It's got little to no future as an encoder. It will likely be slower for encoding within a year and still have significantly worse quality.

The only area where their flow makes sense is on the DECODE side where HW accelerated decode removes load from the CPU and offers significant speedups vs CPU decode. I wouldn't be surprised that x264 with hardware decode could beat hardware transcode in performance.

wingless
06-Sep-2008, 16:58
which means nothing. All the algorithms are publicly available for even the best encoders out there.



It's got little to no future as an encoder. It will likely be slower for encoding within a year and still have significantly worse quality.

The only area where their flow makes sense is on the DECODE side where HW accelerated decode removes load from the CPU and offers significant speedups vs CPU decode. I wouldn't be surprised that x264 with hardware decode could beat hardware transcode in performance.

You have to remember. This is just the first round of software coming out for home GPGPU use. Just like the first 3D games, software implementations were poor quality in the beginning but got better over the next half decade. If we have 1 TeraFLOPS+ processors in our rigs why not use them? I see GPGPU taking off after both AMD and Intel come out with CPU+GPU processors.

Also, let's keep this thread on topic. I'm asking about the state of AMD's GPGPU platform, not Nvidia's.

rpg.314
10-Sep-2008, 15:43
Also, let's keep this thread on topic. I'm asking about the state of AMD's GPGPU platform, not Nvidia's.

Disaster. :oops:

Just downloaded the sdk to go through the new version of docs. I cannot pretend that they are any better than their predecessors. To summarize,

1) The CAL and Brook guides have been merged into one. Big change :evil:. There is hardly anything new in the docs. They are just as confusing as the previous version.

2) Mention of support for the 4870 X2 is conspicuous by its absence (though it could be supported, I don't know).

3) In just a casual reading, I found 2 contradictory statements. I could be wrong on this, but the docs are incomplete/confusing in various places. I won't be surprised if a knowledgeable person goes through them and points out several ambiguities/inconsistencies.

4) The on-chip memories, local and global data share, are not mentioned at all, even though they are there. WHY???

5) Looks like either

a) AMD doesn't care

b) Lack of resources

I vote for (b)

I am extremely disappointed with their new sdk. I was particularly interested considering 4870x2 is much faster than gtx280 on raw compute basis. I guess, AMD has really ditched their current platform for DX11 compute shaders/openCL.

I hope not.

3dilettante
10-Sep-2008, 18:08
I am extremely disappointed with their new sdk. I was particularly interested considering 4870x2 is much faster than gtx280 on raw compute basis. I guess, AMD has really ditched their current platform for DX11 compute shaders/openCL.


Don't know if that's true, but the market would rather that they did. CUDA already has the vendor-specific niche thing going, and it's not that widely adopted.
Compute shaders, and to a greater extent OpenCL will provide the market with targets that are much more stable and more widely applicable.

Given AMD's spotty execution and uncertainties of its long-term existence, why would anyone risk much on a primarily AMD-driven initiative?

MfA
10-Sep-2008, 18:46
Hyperbole much?

The local data share is there, just not enough information given to understand how contention will affect performance.

randomhack
10-Sep-2008, 20:19
The documentation is still not good.
The new features seem to be:
a) Local data share is exposed through CAL. Some info is present in the IL reference, but that's it. There is NO mention of LDS anywhere else in the docs. Global data share is not exposed?
b) Vista is now supported.
c) DX 9/10 interop is now supported. OpenGL is not mentioned?
d) Some sync primitives.

rpg.314
11-Sep-2008, 05:27
Don't know if that's true, but the market would rather that they did. CUDA already has the vendor-specific niche thing going, and it's not that widely adopted.
Compute shaders, and to a greater extent OpenCL will provide the market with targets that are much more stable and more widely applicable.

Given AMD's spotty execution and uncertainties of its long-term existence, why would anyone risk much on a primarily AMD-driven initiative?

Fair point.

randomhack
12-Sep-2008, 06:09
I understand that RV730 is not officially supported by the SDK. But I am wondering if RV730 supports features like double precision, global buffer, the newfangled LDS, etc. Comments?

Dave Baumann
12-Sep-2008, 12:42
To the thread - the V1.2 SDK update brings official (albeit Beta, at this stage) Vista support, which is what was mentioned earlier in this thread.

I understand that RV730 is not officially supported by the SDK. But I am wondering if RV730 supports features like double precision, global buffer, the newfangled LDS, etc. Comments?
RV730 has all the same compute feature set as RV770 other than no Double Precision support.

randomhack
12-Sep-2008, 22:13
Ahh thanks for the info. As I am playing more with this new version of CAL, I am getting more excited. The "compute shaders" are very interesting. Lots of stuff to play with. LDS and shared registers are especially interesting.

It would be much nicer, though, if AMD could disclose some information about how thread groups are mapped to hardware, and some info about caches wouldn't hurt either :)

wingless
21-Sep-2008, 15:45
Ahh thanks for the info. As I am playing more with this new version of CAL, I am getting more excited. The "compute shaders" are very interesting. Lots of stuff to play with. LDS and shared registers are especially interesting.

It would be much nicer, though, if AMD could disclose some information about how thread groups are mapped to hardware, and some info about caches wouldn't hurt either :)

I hope AMD engineers are reading all of your thoughts on this subject. You said you're getting excited about AMD's features, but the documentation is lacking. AMD could have a real opportunity to please the developer community if they get some more info out there. I have faith in the hardware.

On a side note, I wonder if AMD is planning on tossing in some ECC capabilities to allow more efficiency in multi-gpu configurations. David Kanter says that the RV770 isn't as well designed for multi-gpu configurations as GT200 is. This is shameful given the immense processing potential of the 4870X2.

AlexV
21-Sep-2008, 16:08
David Kanter says that the RV770 isn't as well designed for multi-gpu configurations as GT200 is. This is shameful given the immense processing potential of the 4870X2.

Hmm? Is David really saying that? Would you mind pointing me to the exact quote?

wingless
21-Sep-2008, 19:33
Hmm? Is David really saying that? Would you mind pointing me to the exact quote?

http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242&p=5

The GT200 is designed as a monolithic GPU that starts at the extreme high end of the market and will eventually cascade down across all product lines with successive compactions, just like the G80. This is in direct contrast to ATI’s strategy, which is more focused on the volume performance segment. ATI’s RV770 addresses the performance market, but requires two dice in a card for the highest performance.

There are definite trade-offs to each approach. Using a single monolithic die should lead to a performance advantage for graphics, but with lower yields and higher unit costs. Additionally, a very large die area GPU (such as the GT200) cannot span as much of the market as a smaller die and a dual die card. Ultimately for graphics, the question of monolithic integration versus dual-die packaging is pretty ambiguous. There are advantages to each and it really depends on the implementation.

For general purpose computing, the answer is much more clear cut. A single monolithic GPU is much more useful than two smaller GPUs packaged together. In the world of CPUs, multi-processing is a relatively small change in the programming model. CPUs already have coherent caches (coherent with respect to I/O), so making CPUs cache coherent with each other is a small change. In the world of x86, multi-processors have been common since the P6. In contrast, GPUs eschew the overhead of coherency for caches in a single chip. Since there is no coherency even on a single GPU, there is certainly no coherency between multiple GPUs. This means that there is no way to use multiple GPUs for a general purpose application, unless the developer is willing to manually manage the sharing of data and communication; and while CUDA is an excellent programming model, it does little help developers here. Undoubtedly, NVIDIA's interest in GPUs as a computational device was a key motivator for NVIDIA to pursue a monolithic GPU and is a tangible demonstration of the importance of compute oriented GPUs for NVIDIA

Well, his criticism is about multi-GPU setups from both Nvidia and AMD. It is harder for the developer to deal with multiple GPUs to begin with. I misunderstood the first time I read this, so I apologize to Mr. Kanter and everybody else. I really would like to see AMD be the first to solve this cache coherency issue between multiple GPUs. It seems that would give them a leg up on CUDA and also make it A LOT easier to program for these multi-GPU configs.

Arun
23-Sep-2008, 01:33
I really would like to see AMD be the first to solve this cache coherency issue between multiple GPUs. It seems that would give them a leg up on CUDA and also make it A LOT easier to program for these multi-GPU configs.
I think pixie dust could do the job ;) More seriously, I doubt there can even theoretically be a very attractive answer to that without CMOS photonics.

randomhack
23-Sep-2008, 05:10
I think pixie dust could do the job ;) More seriously, I doubt there can even theoretically be a very attractive answer to that without CMOS photonics.

Pardon my noobness .. but why is it that hard to do?
Cache coherency on SMP systems for CPUs has been done for many years now.
Cache coherency b/w various cores on GPU is being done by Larrabee.
So why is there a huge jump from these 2 to cache coherency b/w multi-GPUs on the same PCB?

pcchen
23-Sep-2008, 11:21
Because the interconnection requirement is quite different. Normally a CPU has only a fraction of the bandwidth of a GPU. The interconnection between two (or more) CPUs does not really require much more bandwidth than their main memory bandwidth. Actually, latency is probably more important for them.

However, in the case of GPUs, you will want a lot of bandwidth between two "cache coherent" GPUs, roughly the same as their memory bandwidth. That means you are probably looking at a bandwidth requirement of about 50GB/s for an interconnection between two GPUs. This is probably doable if the two GPUs are on the same board, but it's still expensive. It will get much more expensive if you want an interconnection like this to work across multiple boards.

CarstenS
23-Sep-2008, 12:10
Plus, to fit in with AMD's strategy, it would have to reside in every (performance) ASIC, which costs die space even on those dies that are not used in mGPU scenarios.

3dilettante
23-Sep-2008, 16:11
It's not clear that even Intel will push inter-chip cache coherency strongly when it comes to graphics products.
I haven't seen any inter-chip bandwidth numbers, and it may be that Larrabee won't even try multichip early on.

The traffic problem might in the end vindicate Nvidia, when coherent caches are implemented.

MfA
23-Sep-2008, 16:17
Snooping cache coherency doesn't scale well, but directory based cache coherency does. IMO this just obfuscates the underlying architecture and promotes poor programming though, just use message passing.

3dilettante
23-Sep-2008, 16:33
The problem as described so far indicates that there is a genuine need for very high bandwidth between all the chips.
In what way can a directory minimize traffic that isn't related to coherence?

wingless
25-Sep-2008, 03:49
Would the Sideport interconnect on the HD 4870X2 qualify as the high-bandwidth interconnect that y'all are talking about? Of course this is only between two chipsets on one board, but it's 2x better than a single RV770. They gotta start somewhere...

Cache coherency seems to really be a big issue with these GPUs. I can't wait to see what creative and innovative solution these companies come up with to make multi GPU GPGPU farms work well together. MfA mentioned something about message passing to get around this issue in software ( I think that is what he meant, please elaborate, MfA). Do you all know of ways to get around this problem purely with creative coding?

Lux_
25-Sep-2008, 09:03
Would the Sideport interconnect on the HD 4870X2 qualify as the high-bandwidth interconnect that y'all are talking about?
As Eric Demers said in a recent Rage3D interview, Sideport is write-only. The interconnect offers the same features as the PCIe interconnect, plus it allows for a GPU to broadcast memory writes to its own memory and to the other GPU's. The data exchangeable is any data that the driver would desire.

[...] In our case with sideport, it's a write-only memory, with much less bandwidth than local GDDR memory. Consequently, some things can be accelerated (like generating the same render target on both, using the broadcast write), but, in general, you need to have copies of everything needed in each memory -- You need to be able to read it, and have enough BW reading it to not slow you down too much.

But to "truly" share memory, each ASIC would need full read and write access to the "other" memory, as well as having a significant amount of the bandwidth made available (say equal amounts remote as local). If they only had read/write, but much less bw, then it would be a numa type system, and it's not clear how useful that would be, performance-wise.

MfA
29-Sep-2008, 15:05
Cache coherency seems to really be a big issue with these GPUs. I can't wait to see what creative and innovative solution these companies come up with to make multi GPU GPGPU farms work well together. MfA mentioned something about message passing to get around this issue in software (I think that is what he meant, please elaborate, MfA). Do you all know of ways to get around this problem purely with creative coding?
This is not really for GPU processing, but for more generalized massively parallel processing. With message passing, the data being carted around has an explicit destination, so there is no need for cache coherency because there is no duplicate storage of the same address.

When you rely on cache coherency for communication, data has an implicit destination. With snooping the receiver picks data for addresses it has cached from broadcast traffic (obviously doesn't scale well) and with directory based caching you have special controllers which remember which processors have which data cached so you can do it without broadcasting (can scale well, with a lot of caveats).

For the moment though talking about cache coherency for GPGPU programming is missing the point somewhat.
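To put rough numbers on the snooping versus directory versus message passing distinction above, here is a toy traffic count in Python. It is only a sketch under simplifying assumptions: the function names are hypothetical, one write produces one message per recipient, and real protocols (acknowledgements, data transfers, directory lookups) are ignored.

```python
# Toy model: how many messages does a single write generate?
# All names here are illustrative and not taken from any real protocol.

def snoop_messages(num_processors: int) -> int:
    """Snooping: the write is broadcast, so every other processor must see it."""
    return num_processors - 1


def directory_messages(sharers: set, writer: int) -> int:
    """Directory: one request to the directory, plus one invalidation per
    processor the directory has recorded as caching that address."""
    return 1 + len(sharers - {writer})


def message_passing_messages() -> int:
    """Message passing: the data has one explicit destination and no
    duplicate copies exist, so a single message is sent."""
    return 1


if __name__ == "__main__":
    n = 64
    print("snooping:       ", snoop_messages(n))
    print("directory:      ", directory_messages({1, 2, 3}, writer=1))
    print("message passing:", message_passing_messages())
```

With 64 processors and only three sharers, the broadcast cost grows with the machine size while the directory cost grows with the number of sharers, which is the scaling argument made above.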

wingless
30-Sep-2008, 03:34
For the moment though talking about cache coherency for GPGPU programming is missing the point somewhat.

Gotcha. Back to the original topic of my post. All in all it seems that AMD's GPGPU software isn't as mature as CUDA, which is hindering its present-day market adoption. CUDA is getting into anything and everything now and it pains me to see AMD lagging behind when we know the hardware is up to the task.

rpg.314
03-Oct-2008, 09:25
The only competitive thing with AMD right now (features wise) is CAL, which is assembly. Hardly developer friendly in 2008.

Dave Baumann
03-Oct-2008, 12:55
The only competitive thing with AMD right now (features wise) is CAL, which is assembly. Hardly developer friendly in 2008.
That's hardly accurate given that we do have Brook+ for high level access. ISV's are taking this and getting good results.

rpg.314
05-Oct-2008, 11:05
Umm.....

I have gone through the documentation many times, but have found no mention whatsoever for data sharing/syncing in Brook+. LDS and GDS appear to be exclusive to CAL atm. I could be wrong but it seems that Brook+ doesn't expose LDS and GDS at all.

Constant and texture caches are exposed in CUDA but they are opaque to Brook+ programmer. We don't get to choose which data merits what kind of caching.

And Direct3D/OpenGL interoperability with Brook+ is definitely missing.

Jawed
05-Oct-2008, 15:55
[Guessing this is a leak from the other thread, where I have recently posted on the subject of LDS/GDS]
Umm.....

I have gone through the documentation many times, but have found no mention whatsoever for data sharing/syncing in Brook+. LDS and GDS appear to be exclusive to CAL atm. I could be wrong but it seems that Brook+ doesn't expose LDS and GDS at all.


The way I interpret this:
1. The programmer needs to create an explicit synchronisation point. So if a kernel has two phases, separated by a synchronisation point, then in Brook+ the programmer needs to make two kernels, with the output of kernel 1 feeding into kernel 2 as input.
2. The data that the programmer wants to share amongst "threads" needs to be in the output of kernel 1, e.g. as an auxiliary stream or packed into the vec4-format. Kernel 2 can use the "index" feature of stream addressing to read from one or more foreign threads as well as its own intrinsically indexed stream data elements.

So Brook+ doesn't provide an explicit concept of sharing. The whole problem of kernel chaining, bifurcation, serialisation and the rest is all stuff that I, as non-Brook+ programmer, can't say much about. I hope I'm not leading you astray (I just browse these topics to get an overview)...
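The two-kernel pattern described above can be sketched in plain Python. This is not Brook+ code: `run_kernel`, `kernel1` and `kernel2` are hypothetical stand-ins, and the list comprehension merely imitates launching one "thread" per stream element. The point is only that the end of the first pass acts as the synchronisation point, and the second pass gathers from neighbouring elements by index.

```python
# Sketch of "two kernels separated by a synchronisation point".
# Names are illustrative; nothing here is Brook+ or CAL API.

def run_kernel(kernel, domain_size, *streams):
    """Run `kernel` once per element index; returning the full output
    stream is the implicit synchronisation point between passes."""
    return [kernel(i, *streams) for i in range(domain_size)]


def kernel1(i, src):
    """Phase 1: each thread works only on its own element."""
    return src[i] * src[i]


def kernel2(i, squared):
    """Phase 2: each thread reads its own element plus its neighbours'
    results from phase 1 (a gather through explicit indexing)."""
    left = squared[i - 1] if i > 0 else 0
    right = squared[i + 1] if i + 1 < len(squared) else 0
    return left + squared[i] + right


data = [1, 2, 3, 4]
intermediate = run_kernel(kernel1, len(data), data)    # shared via memory
result = run_kernel(kernel2, len(data), intermediate)
print(intermediate, result)
```

The intermediate stream is what carries the "shared" data between threads, which is why this approach pays a round trip through memory rather than using on-chip storage.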

Constant and texture caches are exposed in CUDA but they are opaque to Brook+ programmer. We don't get to choose which data merits what kind of caching.
No. Brook+ is very much "un-optimised" right now as far as I can tell. Correctness appears to be a much higher priority than performance. The stream model is also sufficiently abstracted from data layout in memory that it's prolly very hard to expose memory programming to the Brook+ programmer in a meaningful way. e.g. how are streams interleaved in memory?

One could say that CUDA provides a "close to the metal" memory-hierarchy programming model. I think some would argue that it's too close.

I think D3D11-CS and OpenCL are going to provide some interesting alternatives to this question. I expect there'll be a lot of arguments over memory programming for GPGPU during the next few years.

Jawed

rpg.314
05-Oct-2008, 16:26
Jawed, while your post in the other thread was very helpful, my previous post here is not some kind of "info transfer". What you are suggesting is ping-ponging, which everybody did (prior to CUDA) via d3d/ogl shaders. Here you are sharing and/or syncing via global memory, which is very slow (latency wise). I am referring to sharing/syncing via the on chip cache (shared memory on nv gpu's, LDS/GDS on ati gpu's).

I think CUDA is ok (for now at least) as far as how "close to metal" the programming model should be.

I looked at existing docs for D3D11 CS and OpenCL shaders (whatever ppt's were available) and they look to be exactly CUDA with a different terminology and a different API on the cpu-side to access the gpu. But yes, they will evolve over time and we will see interesting debates on how to take it forward.

Jawed
05-Oct-2008, 19:39
Jawed, while your post in the other thread was very helpful, my previous post here is not some kind of "info transfer". What you are suggesting is ping-ponging, which everybody did (prior to CUDA) via d3d/ogl shaders. Here you are sharing and/or syncing via global memory, which is very slow (latency wise). I am referring to sharing/syncing via the on chip cache (shared memory on nv gpu's, LDS/GDS on ati gpu's).
I'm suggesting a stream of sub-blocks, if the kernel (two kernels separated by a synch) as a whole is operating on a block. The sub-blocks would need an overlap radius. I presume that by giving each sub-block a separate CAL context you can create this stream of sub-blocks and therefore hide the latency of memory operations and synchronisations.
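A minimal Python sketch of the "sub-blocks with an overlap radius" idea follows. It only shows the data decomposition: each sub-block is cut out together with a halo of `radius` elements so it can be processed without looking at any other sub-block. The function names are hypothetical, and nothing here models CAL contexts or the latency hiding itself.

```python
# Sketch: process a block as independent sub-blocks with an overlap radius.
# Illustrative only; names and the neighbourhood-sum operation are assumed.

def neighbour_sum(values, i, radius):
    """Sum of the elements within `radius` of index i, clipped at the edges."""
    lo = max(0, i - radius)
    hi = min(len(values), i + radius + 1)
    return sum(values[lo:hi])


def process_whole(block, radius):
    """Reference result: operate on the whole block at once."""
    return [neighbour_sum(block, i, radius) for i in range(len(block))]


def process_in_sub_blocks(block, sub_size, radius):
    """Same result, but each sub-block sees only itself plus its halo."""
    out = []
    for start in range(0, len(block), sub_size):
        end = min(len(block), start + sub_size)
        lo = max(0, start - radius)          # halo on the left
        hi = min(len(block), end + radius)   # halo on the right
        local = block[lo:hi]                 # all the data this sub-block needs
        for i in range(start, end):
            out.append(neighbour_sum(local, i - lo, radius))
    return out


block = list(range(10))
assert process_in_sub_blocks(block, sub_size=4, radius=1) == process_whole(block, radius=1)
```

Because each sub-block carries its own halo, the sub-blocks have no dependencies on one another within a pass, which is what would let them be issued as a stream and overlapped.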

Honestly, I don't know if Brook+ already supports this technique. I'm envisaging that LDS/GDS provides IL with a smoother, more finely-grained, way of using memory and it's just a matter of time for Brook+ to take advantage of this. Older GPUs have the "read/write memory" whose function is rather vague - I interpret it as a small predecessor of LDS/GDS. Who knows.

I think CUDA is ok (for now at least) as far as how "close to metal" the programming model should be.
This is interesting because it's a critique of kernel and stream unmanageability in CUDA:

http://www.kunzhou.net/2008/BSGP.pdf

I looked at existing docs for D3D11 CS and OpenCL shaders (whatever ppt's were available) and they look to be exactly CUDA with a different terminology and a different API on the cpu-side to access the gpu. But yes, they will evolve over time and we will see interesting debates on how to take it forward.
I'm pretty dubious about the "exactly", but we'll just have to wait and see.

Jawed

Dave Baumann
17-Nov-2009, 23:46
Another puzzling example is the recent SiSoft Sandra 2009 GPGPU benchmark, where Nvidia 9600 GT beats ATI 4870 in single precision:
http://www.sisoftware.co.uk/index.html?dir=qa&location=cpu_vs_gpu_proc&langx=en&a=

SiSoft have updated their benchmark and now include an OpenCL path.

http://www.sisoftware.net/index.html?dir=qa&location=gpu_opencl&langx=en&a=

Arnold Beckenbauer
17-Nov-2009, 23:57
SiSoft have updated their benchmark and now include an OpenCL path.

http://www.sisoftware.net/index.html?dir=qa&location=gpu_opencl&langx=en&a=

Mr. Baumann (:)), does Cat. 9.11 support OpenCL?

fellix
18-Nov-2009, 00:49
Works for me with Cat 9.11 WHQL:

http://img109.imageshack.us/img109/4898/sandraz.jpg

Arnold Beckenbauer
18-Nov-2009, 10:29
It used Stream, not OpenCL.
http://www.sisoftware.net/images/gpgpu_aa_opencl_9600.png

CarstenS
21-Nov-2009, 11:30
It seems like it doesn't use OpenCL - at least going by the comments on SiSoftware's website, which later refer to an OpenCL Beta4 driver:
"
ATI Radeon HD 4850
800 / 625MHz / 512MB
359.7 / 177.7 MPixel/s (CAL)
508.7 / 25.52 MPixel/s
--> OpenCL allows us to achieve even faster performance than CAL, 50% better which is just incredible!

ATI Radeon HD 5870
1600 / GHz / 1GB
912 / 459.478 MPixel/s (CAL)
1588 / 69.5092 MPixel/s
--> We see again ~50% gains in OpenCL versus CAL, the compiler doing better than us in optimising the code. Fantastic result!
"


But strangely, the website also mentions that there's only emulated results for doubles in OpenCL. Which results in very mediocre, NV-like performance.
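As a quick arithmetic check on the figures quoted above, here is a small Python calculation. It assumes (this is not stated in the quote) that the first number in each pair is the single-precision result and the second the double-precision result, with the second OpenCL number being the emulated-double score.

```python
# Ratios computed from the quoted SiSoft figures (MPixel/s).
# Assumption: each pair is (single precision, double precision).

results = {
    "HD 4850": {"cal": (359.7, 177.7), "opencl": (508.7, 25.52)},
    "HD 5870": {"cal": (912.0, 459.478), "opencl": (1588.0, 69.5092)},
}


def single_precision_gain_percent(card):
    """OpenCL gain over CAL in single precision, in percent."""
    cal_sp, _ = results[card]["cal"]
    ocl_sp, _ = results[card]["opencl"]
    return (ocl_sp / cal_sp - 1.0) * 100.0


def double_precision_slowdown(card):
    """How many times slower the OpenCL double figure is than CAL's."""
    _, cal_dp = results[card]["cal"]
    _, ocl_dp = results[card]["opencl"]
    return cal_dp / ocl_dp


for card in results:
    print(card,
          f"SP gain: {single_precision_gain_percent(card):.1f}%",
          f"DP slowdown: {double_precision_slowdown(card):.1f}x")
```

Under that assumption the single-precision gains come out at roughly 41% and 74% rather than a uniform 50%, while the emulated doubles are around 6.5 to 7 times slower than the CAL double figures.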

AlexV
21-Nov-2009, 12:34
But strangely, the website also mentions that there's only emulated results for doubles in OpenCL. Which results in very mediocre, NV-like performance.

ATI's OCL stack doesn't expose DP yet IIRC.

Dave Baumann
21-Nov-2009, 16:43
SiSoft also make note of that in the table below their results. Double support in OpenCL is not part of the core spec; it is an extension, and that extension isn't supported yet.

If you run the off the shelf drivers at the moment, it will default to their Stream implementation rather than OpenCL, as the CAL interface for OpenCL isn't in there at the moment, plus some of the other OpenCL requirements; if you download the Stream SDK and use the drivers that are supplied there then it will use CAL. We'll be integrating the correct version of CAL in upcoming drivers, but there may be an additional install required to install the necessary OpenCL libraries to enable OpenCL apps to run without requiring the full SDK on everyone's PC's.

CarstenS
21-Nov-2009, 17:26
Did you already decide in which of the many upcoming driver releases you'll integrate OpenCL? I mean, all your commitment to open standards should be backed up by making it usable as soon as possible for a wide audience.

Arnold Beckenbauer
21-Nov-2009, 17:27
The Stream SDK 2.0 beta 4 is installed on my PC, the OpenCL driver, too (9.11 supports OpenCL too).

But Sandra doesn't want to use OpenCL, or is it because it's Sandra Light?
The OpenCL samples, pcchen's OpenCL apps etc. work too.

Dave Baumann
21-Nov-2009, 18:36
Did you already decide in which of the many upcoming driver releases you'll integrate OpenCL?
Yes.

trinibwoy
21-Nov-2009, 19:37
If you run the off the shelf drivers at the moment, it will default to their Stream implementation rather than OpenCL as the CAL interface.

So is their terminology here wrong? - "We see again ~50% gains in OpenCL versus CAL, the compiler doing better than us in optimising the code. Fantastic result!"

It's supposed to be Stream vs OpenCL right? With both interfacing with CAL as the backend?

Also, why don't I have the option to run the bench through OpenCL? Running the latest FW 195.55 beta. This is really confusing, the press release refers to Sandra 2010 with OpenCL support but all download links are for the 2009 version with the old Stream/CUDA benchmarks.

Sigh....

http://support.sisoftware.co.uk/knowledgebase.php?article=2

You need Sandra 2009 SP5 or 2010 for Catalyst 9.11 (December 2009) or later for STREAM 2.0 and OpenCL 1.0.
You need Sandra 2009 SP5 or 2010 for CUDA 2.3 and OpenCL 1.0* thus ForceWare 190.89 conformant release.

Now that says 2009 SP5 is required but I can't find SP5 anywhere (or 2010 for that matter).

Arnold Beckenbauer
27-Nov-2009, 13:10
Sandra 2010 Lite is out: http://www.sisoftware.net/index.html?dir=dload&location=sware_dl_3264&langx=en&a=