AMD: R9xx Speculation

Uhmm, ever had divergence between vector lanes in a "true" SIMD unit?

Lane masks in LRB and bitwise masks in SSE. Similar mechanisms implemented in hw on ati and nv.
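In SSE terms the trick is a minimal branchless select (a sketch; the name select_lt is just for illustration): compute the compare mask, then blend both sides per lane with bitwise ops.

Code:
#include <xmmintrin.h>  // SSE

// Branchless per-lane select: r[i] = (a[i] < b[i]) ? x[i] : y[i]
// The compare yields an all-ones/all-zeros mask per lane, which
// stands in for the hardware lane masks of LRB/ati/nv.
__m128 select_lt(__m128 a, __m128 b, __m128 x, __m128 y)
{
    __m128 mask = _mm_cmplt_ps(a, b);          // 0xFFFFFFFF where a < b
    return _mm_or_ps(_mm_and_ps(mask, x),      // x where the mask is set
                     _mm_andnot_ps(mask, y));  // y everywhere else
}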

VLIW can do everything (SIMD) vector units could do and then some more

I'd like to see a moderately complex problem "vectorized" over ATI's vliw slots before I'd believe it. Divergence and all the warts included.
 
No, actually it's quite common and a recommended technique (it often even brings some benefit on nvidia GPUs, as you usually reduce the granularity of memory accesses). The only hindrance is usually the effort the developer has to put in, but that is a minor inconvenience in my book if you get twice the performance.

And it twists the programming model completely out of shape. :rolleyes:

Now there are multiple cores, one vector width (64) and another vector width (4) to tackle. All for at best a 4x perf gain. For one out of 3 vendors. Not worth the trouble in my book.

The following is from nv's marketing slides but IS absolutely true.

-- 2-3x is not worth spending much effort. Wait a little or spend some money.

-- 10-15x gains are worth doing big rewrites.
 
I think it's time to blow Rohit's mind:

Code:
             46  x: LDS_ADD     ____,  PV45.z,  (0x00000001, 1.401298464e-45f).x      
                 y: ADD_INT     T1.y,  (0x0000000C, 1.681558157e-44f).y,  T0.w      
                 z: MULADD_e    T2.z,  PV45.x, -0.5,  0.5      
                 w: MULADD_e    T1.w,  PV45.w,  0.5,  0.5      
                 t: MUL_e       ____,  R3.w,  KC0[6].w

What's your point? It is doing independent ops over 5 lanes. I know it can do that.

My doubts are over vectorizing code over VLIW lanes while managing branches and divergence efficiently.
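To make that concrete, here's roughly what vectorizing a branchy scalar loop over the 4 slots has to become: a hand-rolled C++ sketch (float4, slot and body are made-up stand-ins, not anyone's API) where both sides of the branch execute for every slot and a per-slot select keeps the right result.

Code:
struct float4 { float x, y, z, w; };  // one value per VLIW slot

// Scalar original:  if (v > 0.0f) v = 1.0f / v; else v = 0.0f;
// Packed over the slots there is no per-slot branch, so both sides
// run and a select picks the result for each slot.
static inline float slot(float v)
{
    float then_side = 1.0f / (v > 0.0f ? v : 1.0f);  // keep the divide safe
    float else_side = 0.0f;
    return v > 0.0f ? then_side : else_side;         // compiles to a select
}

float4 body(float4 v)
{
    return { slot(v.x), slot(v.y), slot(v.z), slot(v.w) };
}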
 
Something is wrong in that equation. If, as your analysis shows, AMD's cards are significantly faster when processing shaders, then where does Nvidia catch up? With GT200 you could sorta pin it on the texturing, but now Cypress has an advantage there too.

Maybe this helps you understand the "equation" better, and shows what seems to be missing from the picture:

Texel fillrates with 16:1 tri-AF relative to the theoretical peak texel fillrate:
HD 5870: 0.381
HD 5850: 0.397
HD 5830: 0.402
HD 5770: 0.491
HD 4870: 0.510
GTX 470: 0.597
GTX 465: 0.595
GTX 460: 0.511 (1024 MiB) or 0.497 (768 MiB)
GTX 260: 0.509

source: http://www.forum-3dcenter.org/vbulletin/showthread.php?t=487189&page=2

The table shows the efficiency of the different GPUs with 16:1 AF relative to their theoretical fillrate. So you see that a GTX 470 is nearly 57% more efficient under 16:1 AF than a HD 5870. According to Gipsel this is related to the difference in the bandwidth feeding the TMUs in the two architectures (as far as I understand it!).
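For anyone wanting to check the 57% figure, it falls straight out of the table (a quick sketch):

Code:
#include <cstdio>

int main()
{
    const double gtx470 = 0.597;  // measured / theoretical fillrate, 16:1 tri-AF
    const double hd5870 = 0.381;
    // 0.597 / 0.381 ~ 1.567, i.e. the GTX 470 keeps ~57% more of its
    // theoretical texel rate than the HD 5870 does under 16:1 AF.
    std::printf("relative efficiency: +%.1f%%\n", (gtx470 / hd5870 - 1.0) * 100.0);
}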
 
Radeon Mobility series:

Vancouver
Granville (N.I.)
Capilano (N.I.)
Robson CE (N.I.)

Ski Resorts:
Blackcomb (S.I.)
Whistler (S.I.)
Seymour (S.I.)
Robson LE (S.I.)

Lexington was Cypress Mobility; it got cancelled (why?)

Desktop parts taped out some months ago.
 
Cypress has far more fillrate than Fermi (theoretical and measured). Are you suggesting that Fermi's Z-rate makes up for a deficiency in flops, texturing and fillrate?
For backwards looking comparisons (e.g. RV770 v GT200) non-ALU factors are significant.

I'm not trying to be difficult but I just don't see how you can have lots of flops + high utilization + higher texturing rate = equal performance.
And a smaller chip.

Something is wrong in that equation. If, as your analysis shows, AMD's cards are significantly faster when processing shaders, then where does Nvidia catch up? With GT200 you could sorta pin it on the texturing, but now Cypress has an advantage there too.
GT200 had a huge fillrate advantage too.

As for Fermi I've already mentioned other factors, i.e. rasterisation efficiency and memory efficiency. And Fermi has higher Z rate.

In case you haven't noticed, finding a decent architectural investigation of cards these days is damn near impossible. I'm certainly not suggesting it's easy - it's way more complex now than 5 years ago, e.g. resource types have exploded and rendering techniques are more variegated.

When ATI gains decent tessellation performance (the die-size cost of which could be substantial) we'll have a better comparison of overall performance. Evergreen is clearly not the design AMD intended to build for D3D11 - I'm not suggesting the next chip will be better in that respect, either.

GF104 really should be capable of matching HD5870 in games, I don't understand why NVidia hasn't even tried. Is the cost of testing to find the chips that will do that really so high?
 
The table shows the efficiency of the different GPUs with 16:1 AF relative to their theoretical fillrate. So you see that a GTX 470 is nearly 57% more efficient under 16:1 AF than a HD 5870. According to Gipsel this is related to the difference in the bandwidth feeding the TMUs in the two architectures (as far as I understand it!).
Ooh, that looks like L2->L1 bandwidth and L2 size are key factors. The 20 cores of Cypress are each getting a meagre share of L2 bandwidth.
 
http://vr-zone.com/articles/alleged-ati-radeon-hd-6870-3dmark-vantage-benchmark-leaked/9692.html

It is the GPU-Z screenshot, however, which reveals more details. GPU-Z recognizes the sample as an "ATI Radeon HD 6800 Series". The GPU ID is 6718, which according to the Catalyst 10.8 codename list indicates a Cayman XT GPU - well in line with rumours thus far. The core clock is the same as the HD 5870's - 850 MHz. This suggests the performance boost comes from more functional units (perhaps 1920 SPs), from improved performance per clock (using some of Northern Islands' units), or from a combination of both. The GDDR5 speed is boosted by a whopping 33% to 1.6 GHz, or 6.4 GHz effective. The same 256-bit memory interface is retained, but the ultra-fast memory results in a massive memory bandwidth of 204.8 GB/s - well over the GTX 480's 177.4 GB/s. Of course, one explanation for such a high memory clock, as well as for the impressive benchmark result, could be that the card was overclocked for the benchmark.
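The 204.8 GB/s figure checks out from the quoted clocks (a quick sanity-check sketch; GDDR5 transfers 4 bits per pin per memory clock):

Code:
#include <cstdio>

int main()
{
    const double mem_clock_ghz = 1.6;                 // quoted GDDR5 clock
    const double effective_ghz = mem_clock_ghz * 4.0; // GDDR5 quad data rate -> 6.4
    const double bus_bytes     = 256.0 / 8.0;         // 256-bit bus -> 32 bytes/transfer
    std::printf("%.1f GB/s\n", effective_ghz * bus_bytes);  // 6.4 * 32 = 204.8
}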
 
My doubts are over vectorizing code over VLIW lanes while managing branches and divergence efficiently.
Well, I don't hold out any hopes for recursive algorithms. But apparently some do.

To those doubts you can add bandwidth to feed the VLIW monster.

The flatness of Larrabee is where it's at, in my view. But it's not a product.

It'll be really interesting to program that as a SIMD-16 (explicit parallelism, i.e. vectorisation) as well as scalar SIMD (implicit parallelism).
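Roughly the contrast, as a plain C++ sketch (vec16 and the saxpy names are made up for illustration): the explicit style owns the 16-wide vector and any masking itself, while the "scalar SIMD" style writes the per-element kernel and leaves the mapping onto the 16 lanes implicit, shader/CUDA fashion.

Code:
struct vec16 { float lane[16]; };  // one LRB-style 16-wide vector

// Explicit parallelism: the programmer vectorises by hand and the
// 16-wide op is spelled out (intrinsics would replace this loop).
vec16 saxpy_explicit(float a, const vec16& x, const vec16& y)
{
    vec16 r;
    for (int i = 0; i < 16; ++i)
        r.lane[i] = a * x.lane[i] + y.lane[i];
    return r;
}

// Implicit parallelism: write the scalar per-element kernel; the
// compiler/runtime spreads instances of it across the 16 lanes.
float saxpy_scalar(float a, float x, float y)
{
    return a * x + y;
}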
 
My doubts are over vectorizing code over VLIW lanes while managing branches and divergence efficiently.

It will be at least as difficult to vectorize such code using AVX on a CPU. My point was that the direction in CPUs seems to be towards more vectorization and parallelization.

I don't know, maybe CPUs and GPUs will converge, but it's probably not towards the point where CPUs are now.
 
GF104 really should be capable of matching HD5870 in games, I don't understand why NVidia hasn't even tried. Is the cost of testing to find the chips that will do that really so high?
Probably the same as with other chips like GF100, 106 and maybe even 108: to have something to add besides some 50-100 MHz of clock speed when AMD refreshes this fall.
 
The flatness of Larrabee is where it's at, in my view. But it's not a product.

It'll be really interesting to program that as a SIMD-16 (explicit parallelism, i.e. vectorisation) as well as scalar SIMD (implicit parallelism).

I have a program that would perform fantastically well on LRB (but not on nv or AMD :| ), and it uses both vectorization and VLIW. It's a shame those LRB boards aren't widely available.
 
It will be at least as difficult to vectorize such code using AVX on a CPU. My point was that the direction in CPUs seems to be towards more vectorization and parallelization.
You are missing the point. The discussion is about two levels of vectorization (the SIMD width of 64 and the VLIW width of 4) in the same program vs. one level (just the SIMD width), not about vectorization per se.
 
The discussion is about two levels of vectorization (the SIMD width of 64 and the VLIW width of 4) in the same program vs. one level (just the SIMD width), not about vectorization per se.

I understand that, but the point I tried to make stands. CPUs are moving to more parallelization on all the levels they use.
 