AMD: R9xx Speculation

Btw., I mentioned this to you before and I will reiterate it: nvidia GPUs also often gain from explicit vectorization, as it reduces the granularity of memory accesses and increases burst lengths. It is simply more cache friendly, and with a lot of algorithms being bandwidth limited, it can be astonishingly efficient for some problems given the "scalar" nature of nvidia GPUs.
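A toy back-of-the-envelope model of that effect (the 16-thread half-warp granularity and the transaction sizes below are illustrative assumptions, not vendor specs):

```python
# Toy model: bytes requested per coalesced half-warp access when each
# thread loads a float vs. a float4 (illustrative numbers only).
HALF_WARP = 16
FLOAT_BYTES = 4

scalar_request = HALF_WARP * FLOAT_BYTES       # 64 B per access
float4_request = HALF_WARP * 4 * FLOAT_BYTES   # 256 B per access

# Moving the same 1 MiB of data:
total = 1 << 20
print(total // scalar_request)   # 16384 short transactions
print(total // float4_request)   # 4096 longer bursts
```

Same data volume, a quarter of the transactions: that is the "longer bursts, fewer requests" argument in miniature.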

Do you see these efficiency improvements on Fermi as well? Do you see the same (or similar) improvements on Fermi as on older nv GPUs (i.e., those without caches)?

I don't have Fermi hw, so I'm asking.
 
I don't own a GF100 either, but for anything memory intensive one should look into using textures, which are cached on every GPU. Somewhat related: in my opinion CUDA people often use local memory for no reason (i.e. just as a software-managed cache). Just using textures and exploiting the texture cache is easier and often even faster (if one can tolerate the additional latency). But maybe that was just bad luck with the stuff I've got to see.
 
My experience with vectorization on Nvidia GPUs has not been positive. The extra register pressure caused by vectorizing code often causes large occupancy losses and ends up significantly harming performance. That's one reason AMD requires larger register files than Nvidia.
Did this also apply to the GT200 line where nv increased the register files? But I agree that on G80/G92 register pressure is/was often a limitation.

Nevertheless, GF100 has about half the registers of Cypress, but also only half the peak performance. So something that performs close to peak on a Cypress (vectorized) shouldn't be that much more demanding (relatively speaking) on a GF100.
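A quick sanity check of the "about half the registers, half the peak" claim, using commonly cited specs for the shipping parts (treat these figures as assumptions):

```python
# Cypress (HD 5870): 20 SIMD engines with 256 KB of registers each,
# 1600 ALUs at 850 MHz (FMA counted as 2 flops)
cypress_regs_kb = 20 * 256                 # 5120 KB
cypress_gflops = 1600 * 2 * 0.850          # 2720 GFLOPS

# GF100 as shipped (GTX 480): 15 SMs with 32768 32-bit registers
# (128 KB) each, 480 CUDA cores at a 1401 MHz hot clock
gf100_regs_kb = 15 * 128                   # 1920 KB
gf100_gflops = 480 * 2 * 1.401             # ~1345 GFLOPS

print(gf100_regs_kb / cypress_regs_kb)     # ~0.38
print(gf100_gflops / cypress_gflops)       # ~0.49
```

So under these assumed specs the ratios are roughly 0.4x on registers and 0.5x on peak flops, which is in the ballpark of the "half and half" argument.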
 

My gut feeling is that vectorize-to-get-higher-bw is limited to GPUs without a real cache hierarchy.

IIRC, AMD's optimization guide (the one with SDK 2.2) says coalesced loads of floats vs. float4s (Evergreen family, presumably) make very little difference to bandwidth.
 

This applies to all Nvidia architectures:
To fill a G80 all the way, each thread must use 10 registers or fewer.
To fill a GT200 all the way, each thread must use 16 registers or fewer.
To fill a GF100 all the way, each thread must use 21 registers or fewer.
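Those per-thread limits fall straight out of dividing the register file by the maximum resident thread count; a quick check (the register and thread counts are taken as assumptions from public CUDA occupancy tables):

```python
# Registers per SM / max resident threads per SM (assumed public specs)
chips = {
    "G80":   (8192, 768),
    "GT200": (16384, 1024),
    "GF100": (32768, 1536),
}
for name, (regs, max_threads) in chips.items():
    # Integer division: largest per-thread register budget at full occupancy
    print(name, regs // max_threads)   # G80 10, GT200 16, GF100 21
```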

Filling the chip all the way is not necessarily the right thing to do. However, register pressure is a key performance barrier, and it's very common to optimize code to get below some register-pressure limit, since performance is very discontinuous. Scalar (non-vectorized) code I've written uses 15 or so 32-bit registers. Vectorizing with float4 data requires four 32-bit registers for every vectorized piece of data, which can easily push my code past the register limits and severely impact performance.

So, vectorization can be good for the reasons you suggested, but if you apply it blindly, you will have severe performance problems on Nvidia hardware. I've seen vectorized code perform 2-3x slower than non-vectorized code.
 
So you see that a GTX 470 is nearly 57% more efficient with 16:1 AF compared to an HD 5870.

That's not saying much when your starting point is a 100% theoretical advantage for the 5870. So even with lower efficiency, its absolute AF performance is still considerably higher according to those numbers: 0.381*850*80 >> 0.597*607*56!
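Redoing that arithmetic (efficiency x core clock x TMU count as a proxy for absolute AF throughput, all figures taken from the numbers quoted above):

```python
# efficiency * core MHz * TMUs, figures from the post above
hd5870 = 0.381 * 850 * 80    # = 25908.0
gtx470 = 0.597 * 607 * 56    # ~ 20293.2

print(hd5870 > gtx470)       # True: the 5870 still comes out ahead
print(hd5870 / gtx470)       # ~1.28, i.e. ~28% higher absolute AF rate
print(0.597 / 0.381)         # ~1.57, the "57% more efficient" figure
```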

Cypress' instantaneous shader throughput is much higher, yet does not result in an overall performance benefit. So either GF100 has a similar advantage in instantaneous performance on other workloads (no analysis available), or the shader portion of a typical gaming workload is so small as to limit the potential benefit, and Amdahl's law is smacking you in the face. My question is why AMD bothers, and I guess the most popular answer is that the shaders are cheap, so why not.

And a smaller chip.

My curiosity is limited to utilization and measured performance given the available horsepower. Die size doesn't affect that either way.

In case you haven't noticed, finding a decent architectural investigation of cards these days is damn near impossible. I'm certainly not suggesting it's easy - it's way more complex now than 5 years ago, e.g. as resource types have exploded and rendering techniques are more variegated.

Yep, I was looking forward to AlexV's article. Guess he missed his own deadline, start of this week wasn't it? :)
 
2 extra layers, for what? If they're both 1GB 256-bit boards, why are there two extra layers?
For better thermal characteristics? Or more complex power circuitry? Both the 5750 and 5770 are 128-bit, but the 5750 has 6 layers while the 5770 has 8 (acc. to VR-Zone). Custom OC models of the HD5870 also have 2 additional layers.
 
Exactly. AMD added 2 cores to RV770 because the die would have had a huge blank spot otherwise.

It's worth noting that if you add up all the blank areas of GT200 on 65nm, it's quite a surprise: 6.5% of the die, I reckon.
 
Is it power or is it memory clocks?

It occurred to me that GDDR5 at the craziest speeds might just be easier with more layers.

How many layers is HD5870 reference?
 
OK, so reference HD5770 and HD5870 both have 8 layers. We can say two things:
  1. Barts having 8 layers doesn't indicate it has a 256-bit bus, since HD5770 has 8 layers too
  2. layers don't seem to correlate with the power difference between HD5770 and HD5870 - 80W TDP difference for reference boards
Regardless, if Barts does have a 256-bit bus, it really should be substantially more than 30% faster than HD5770.
 
It seems right that layer count reflects clocks.

HD5770 and HD5870: 850 MHz core / 4800 MHz memory.

Barts with 8 layers will likely be clocked at this level.

The 10 layers of Cayman indicate higher clocks (at least for the GDDR5)...
 
I think only MSI's 5870 Lightning has a 10-layer PCB (just like their other Lightning cards); the other non-reference card, Gigabyte's 5870 SOC, has a modified 8-layer PCB.
 
It seems drivers are already present for R9xx in the newest Mac OS X 10.6.5 beta.

ATIRadeonX3000.kext (and the other relevant kexts) list the following devices:
224,CAYMAN GL XT (6701),NI CAYMAN
225,CAYMAN GL XT (6702),NI CAYMAN
226,CAYMAN GL XT (6703),NI CAYMAN
227,CAYMAN GL PRO (6704),NI CAYMAN
228,CAYMAN GL PRO (6705),NI CAYMAN
229,CAYMAN GL (6706),NI CAYMAN
230,CAYMAN GL LE (6707),NI CAYMAN
231,CAYMAN GL (6708),NI CAYMAN
232,CAYMAN GL (6709),NI CAYMAN
233,CAYMAN XT (6718),NI CAYMAN
234,CAYMAN PRO (6719),NI CAYMAN
235,ANTILLES PRO (671C),NI CAYMAN
236,ANTILLES XT (671D),NI CAYMAN
237,BLACKCOMB XT/PRO (6720),NI BLACKCOMB
238,BLACKCOMB LP (6721),NI BLACKCOMB
239,BLACKCOMB XT/PRO Gemini (6724),NI BLACKCOMB
240,BLACKCOMB LP Gemini (6725),NI BLACKCOMB
241,BARTS GL XT (6728),NI BARTS
242,BARTS GL PRO (6729),NI BARTS
243,BARTS XT (6738),NI BARTS
244,BARTS PRO (6739),NI BARTS
245,WHISTLER XT (6740),NI WHISTLER
246,WHISTLER PRO/LP (6741),NI WHISTLER
247,WHISTLER XT/PRO Gemini (6744),NI WHISTLER
248,WHISTLER LP Gemini (6745),NI WHISTLER
249,ONEGA (6750),NI TURKS
250,TURKS XT (6758),NI TURKS
251,TURKS PRO (6759),NI TURKS
252,SEYMOUR XT/PRO (6760),NI SEYMOUR
253,SEYMOUR LP (6761),NI SEYMOUR
254,SEYMOUR XT/PRO Gemini (6764),NI SEYMOUR
255,SEYMOUR LP Gemini (6765),NI SEYMOUR
256,CAICOS GL PRO (6768),NI CAICOS
257,CASPIAN PRO (6770),NI CAICOS
258,CAICOS PRO (6779),NI CAICOS

ATIRadeonX3000AMDCaymanHardware
ATIRadeonX3000AMDBartsHardware
ATIRadeonX3000AMDTurksHardware
ATIRadeonX3000AMDCaicosHardware

Radeon Northern Islands Unknown Prototype
Radeon Cayman Unknown Prototype
Radeon Cayman GL PRO Prototype
Radeon Cayman GL XT Prototype
Radeon Caicos Unknown Prototype
Radeon Seymour LP Prototype
Radeon Seymour PRO/XT Prototype
Radeon Caicos PRO Prototype
Radeon Turks Unknown Prototype
Radeon Whistler PRO/LP Prototype
Radeon Whistler XT Prototype
Radeon Turks PRO Prototype
Radeon Turks XT Prototype
Radeon Barts Unknown Prototype
Radeon Blackcomb LP Prototype
Radeon Blackcomb XT/PRO Prototype
Radeon Barts PRO Prototype
Radeon Barts XT Prototype
Radeon Park LP Prototype

Could it really be that Apple is about to get updates that line up with their PC brethren?

Interesting to say the least.
 
Partial truth, or complete speculation?

What about this?


[attached images: 61647355.png, 190931123104.jpg]



http://www.3dcenter.org/news/2010-08-26
 
Makes sense to me. The GTX 485 would be the fully unlocked Fermi; the GTX 490 would be a dual-GPU GTX 460 (single-card SLI).

Everything on the ATI side seems to rhyme with the codenames previously found.

Not sure why some cards have a more specific release schedule (Nov) than others (Q4).

I'm pretty sure Q4 for 69x0 will be later than the Q4 for 67x0 :)
 