Perhaps the geomagnetic poles flipped, but SI are now NI
http://semiaccurate.com/2010/08/26/amd-spills-beans-northern-island-codenames/
Uhmm, ever had divergence between vector lanes in a "true" SIMD unit?
VLIW can do everything (SIMD) vector units could do and then some more
No, it's actually quite common and a recommended technique (it often even brings some benefit on nvidia GPUs, since you usually reduce the granularity of memory accesses). The hindrance is often only the effort the developer has to put in, but that's a minor inconvenience in my book if you get twice the performance.
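A minimal CUDA sketch of what that packing looks like in practice (the kernel names and the assumption that the element count is a multiple of four are mine, purely for illustration): processing four values per thread through float4 turns four 32-bit loads into a single 128-bit load per thread, which is the coarser memory-access granularity being described; on a VLIW4 part the four components should also map naturally onto the x/y/z/w slots.
Code:
// Scalar version: one element per thread, one 32-bit load and one 32-bit store each.
__global__ void scale_scalar(const float* in, float* out, int n, float k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = k * in[i];
}

// Packed version: four elements per thread via float4, so each thread
// issues one 128-bit load and one 128-bit store instead of four of each.
// n4 is the element count divided by 4; buffers must be 16-byte aligned.
__global__ void scale_vec4(const float4* in, float4* out, int n4, float k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 v = in[i];
        v.x *= k; v.y *= k; v.z *= k; v.w *= k;
        out[i] = v;
    }
}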
I think it's time to blow Rohit's mind:
Code:
46 x: LDS_ADD    ____, PV45.z, (0x00000001, 1.401298464e-45f).x
   y: ADD_INT    T1.y, (0x0000000C, 1.681558157e-44f).y, T0.w
   z: MULADD_e   T2.z, PV45.x, -0.5, 0.5
   w: MULADD_e   T1.w, PV45.w, 0.5, 0.5
   t: MUL_e      ____, R3.w, KC0[6].w
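For the record, that's a single VLIW instruction group (number 46) co-issuing five different operations across its x/y/z/w/t slots in one go: an LDS add, an integer add, two multiply-adds and a plain multiply. That kind of mixed issue per clock is exactly what a plain SIMD lane doesn't give you.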
He's basing that only on the "NI" in front of the codenames.
I'd put my money on "NI" just marking the architecture, but the family (as codenames suggest) is still SI
N.I. before S.I.
Something is wrong in that equation. If, as your analysis shows, AMD's cards are significantly faster when processing shaders then where does Nvidia catch up? With GT200 you could sorta pin it on the texturing but now Cypress has an advantage there too.
The texel fillrates with 16:1 tri-AF relative to the theoretical peak texel fillrate:
HD 5870: 0.381
HD 5850: 0.397
HD 5830: 0.402
HD 5770: 0.491
HD 4870: 0.510
GTX 470: 0.597
GTX 465: 0.595
GTX 460: 0.511 (1024 MiB) or 0.497 (768 MiB)
GTX 260: 0.509
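To put the top ratio into absolute numbers (assuming the usual HD 5870 figures of 80 TMUs at 850 MHz, i.e. a 68 GTex/s bilinear peak): 0.381 × 68 GTex/s ≈ 26 GTex/s actually sustained with 16:1 tri-AF.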
Radeon Mobility series:
Vancouver
Granville (N.I.)
Capilano (N.I.)
Robson CE (N.I.)
Ski Resorts:
Blackcomb (S.I.)
Whistler (S.I.)
Seymour (S.I.)
Robson LE (S.I.)
Lexington was Cypress Mobility, got cancelled (because?)
Desktop parts tape-out was some months ago.
For backwards-looking comparisons (e.g. RV770 v GT200) non-ALU factors are significant. Cypress has far more fillrate than Fermi (theoretical and measured). Are you suggesting that Fermi's Z-rate makes up for a deficiency in flops, texturing and fillrate?
And a smaller chip. I'm not trying to be difficult but I just don't see how you can have lots of flops + high utilization + higher texturing rate = equal performance.
GT200 had a huge fillrate advantage too.
Ooh, that looks like L2->L1 bandwidth and L2 size are key factors. The 20 cores of Cypress are each getting a meagre share of L2 bandwidth. The table shows the efficiency of the different GPUs with 16:1 AF compared to the theoretical fillrate. So you see that a GTX 470 is nearly 57% more efficient with 16:1 AF than a HD 5870. According to Gipsel this is related to the difference in bandwidth feeding the TMUs in the different architectures (as far as I understand it!).
It is the GPU-Z screenshot, however, which reveals more details. GPU-Z recognizes the sample as an "ATI Radeon HD 6800 Series". The GPU ID is 6718, which according to the Catalyst 10.8 codename list indicates a Cayman XT GPU - well in line with rumours thus far. The core clock is the same as the HD 5870's - 850 MHz. This suggests the performance boost comes from more functional units (perhaps 1920 SPs), improved performance per clock (using some of Northern Islands' units), or a combination of both. The GDDR5 speed is boosted by a whopping 33% to 1.6 GHz, or 6.4 GHz effective. The same 256-bit memory interface is retained, but the ultra-fast memory results in a massive 204.8 GB/s of memory bandwidth - well over the GTX 480's 177.4 GB/s. Of course, one possible explanation for such a high memory clock, as well as for the impressive benchmark result, is that the card was overclocked when benchmarked.
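For what it's worth, the bandwidth figure follows directly from those numbers: 6.4 Gbps effective per pin × 256 pins / 8 bits per byte = 204.8 GB/s.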
Well, I don't hold out any hopes for recursive algorithms. But apparently some do.
My doubts are over vectorizing code over VLIW lanes while managing branches and divergence efficiently.
Probably as with other chips like GF100, 106 and maybe even 108 as well: to have something to add besides some 50-100 MHz of clock speed when AMD refreshes this fall. GF104 really should be capable of matching HD5870 in games, I don't understand why NVidia hasn't even tried. Is the cost of testing to find the chips that will do that really so high?
The flatness of Larrabee is where it's at, in my view. But it's not a product.
It'll be really interesting to program that as a SIMD-16 (explicit parallelism, i.e. vectorisation) as well as scalar SIMD (implicit parallelism).
You are missing the point. The discussion is about two levels of vectorization (SIMD width - 64 and VLIW width - 4) in the same program vs one (just the SIMD width), and not about vectorization per se. It will be at least as difficult to vectorize such code using AVX on a CPU. My point was that the direction in CPUs seems to be towards more vectorization and parallelization.
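To make the two levels concrete, here is a rough CUDA-flavoured sketch (my own example, not from the thread; kernel name and the multiple-of-four assumption are illustrative). The outer level of vectorization is the hardware SIMD width (the threads of a warp or wavefront); the inner level is the four elements packed per thread, standing in for the VLIW lanes. Any data-dependent branch now has to be resolved per packed component by hand, which is the divergence-management burden being discussed.
Code:
// Outer level: one thread per group of four elements, so the hardware
// SIMD width (warp/wavefront) forms the first level of vectorization.
// Inner level: the four components of a float4, standing in for VLIW lanes.
// Assumes the element count is a multiple of 4 (n4 = count / 4).
__global__ void clamp_scale_vec4(const float4* in, float4* out, int n4, float k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n4) return;

    float4 v = in[i];

    // A data-dependent branch can no longer be written as one simple "if":
    // each packed component may take a different path, so it is handled
    // per component with select-style code.
    v.x = (v.x > 0.0f) ? v.x * k : 0.0f;
    v.y = (v.y > 0.0f) ? v.y * k : 0.0f;
    v.z = (v.z > 0.0f) ? v.z * k : 0.0f;
    v.w = (v.w > 0.0f) ? v.w * k : 0.0f;

    out[i] = v;
}
In the scalar one-element-per-thread version the same branch is just an if, and the hardware handles divergence across the SIMD width automatically; with the packed version that work lands on the programmer (or the compiler).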