It has 1x 8-pin power and 1x 6-pin power so it's Cayman or Fermi, no?
With the "AMD" logo on it, I wouldn't put my money on Fermi, but to each their own.
Cos that's merely catching up to where NVidia has been for a while, at best.
"Yes, the curious thing is that this description of off-die buffering was associated specifically with tessellation. It seems that off-die buffering of tessellation data is an 'improvement' on keeping that data in the SIMDs (seemingly in LDS?). Maybe that's because it's easier to share it across the chip?"

The B3D article on Fermi indicates that Cypress can be affected by the fatness of the control points and the math in the HS.
"Can't say I like the sound of this, though - it might be an improvement, but ugh, cached like in Fermi seems preferable. And 2x the geometry performance, if true, sounds like no real improvement."

The description doesn't seem specific enough to indicate there isn't on-die storage that may spill to memory.
"Yes, the curious thing is that this description of off-die buffering was associated specifically with tessellation. It seems that off-die buffering of tessellation data is an 'improvement' on keeping that data in the SIMDs (seemingly in LDS?). Maybe that's because it's easier to share it across the chip?"

Didn't the B3D article say that tessellation data is kept in the GDS? There could be some data bottlenecks with that.
R300/9700 was a groundbreaking card. For the first time, both AF and AA were usable with playable FPS. I doubt Cayman can deliver something as monumental, but one can hope.
"Didn't the B3D article say that tessellation data is kept in the GDS? There could be some data bottlenecks with that."

GDS is used to send parameters to TS, I believe.
"Does anyone know how fast RV770 and newer ATI GPUs process triangles that pass through a trivial geometry shader?"

No. A key characteristic of GS since R600 has been pushing all the data off-die through the ring buffer. This is how GS was originally able to support huge amounts of data per vertex, before D3D10 got cut back in favour of NVidia's architecture.
"Isn't that amply sufficient?"

Seems NVidia has quite a reserve, both in terms of unlocking throughput (Quadro is the real deal for throughput) and in terms of clocks. So, no, I don't think it's sufficient, bearing in mind Cayman looks like it's going to have to last for a year+ (emphasis on +). Also, can it scale further? Is it really scalable?
Holy Jesus! That chip is BIG!
Wasn't expecting something so big coming from AMD.
I don't believe in Antilles being 2x Cayman, with a big die like that...
"GDS is used to send parameters to TS, I believe."

TS data is very small, though. It's just 4 bytes per vertex if you use a triangle strip, and close to half that if you do caching. If you can stage just one kilobyte then you have several wavefronts of vertices buffered up.
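A back-of-envelope sketch of that staging arithmetic, using the figures from the post above (the 1 KB buffer is hypothetical, and 64-wide wavefronts per ATI hardware is my assumption):

```python
# Back-of-envelope check of the staging claim above.
# Assumed numbers: 4 bytes per tessellated vertex (a 16-bit u,v domain
# coordinate pair) and 64-wide wavefronts, as on ATI hardware.
BYTES_PER_VERTEX = 4
STAGING_BYTES = 1024          # the hypothetical 1 KB staging buffer
WAVEFRONT_SIZE = 64

vertices = STAGING_BYTES // BYTES_PER_VERTEX
print(f"{vertices} vertices = {vertices / WAVEFRONT_SIZE:.0f} wavefronts")
# -> 256 vertices = 4 wavefronts
```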
There's a separate data path from HS direct to DS. Additionally DS consumes the output of TS (obviously). So HS and TS data needs to be staged for consumption by DS - in theory covering quite a bit of lag between the two data streams. This appears to be the crux of the buffering issue. The B3D article, I believe, describes "locking" HS and DS together as a pair within a SIMD. This then limits the amount of data that can be staged, and presumably also affects the SIMDs' ability to sink the output of TS.
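For reference, a minimal sketch of the HS -> TS -> DS data flow being described, showing why the fat HS output has to be staged until DS consumes it alongside the TS stream. Everything here (structures, tess factor, triangle-patch interpolation) is illustrative, not from the B3D article:

```python
# Minimal sketch of the D3D11 HS -> TS -> DS data flow under discussion.
# All structures and numbers here are illustrative.

def hull_shader(control_points):
    # HS output (fat: control points + patch constants) bypasses TS and
    # must be staged somewhere until DS consumes it.
    return {"control_points": control_points, "tess_factor": 8}

def tessellator(tess_factor):
    # Fixed-function TS emits only tiny (u, v) domain coordinates --
    # the second, much smaller stream that DS consumes.
    n = tess_factor
    return [(i / n, j / n) for i in range(n + 1) for j in range(n + 1 - i)]

def domain_shader(hs_output, uv):
    # DS joins the two streams: one vertex per TS coordinate, interpolated
    # from the staged HS control points (barycentric, for a triangle patch).
    u, v = uv
    weights = (1 - u - v, u, v)
    return tuple(sum(c * w for c, w in zip(axis, weights))
                 for axis in zip(*hs_output["control_points"]))

hs_out = hull_shader([(0, 0, 0), (1, 0, 0), (0, 1, 0)])
verts = [domain_shader(hs_out, uv) for uv in tessellator(hs_out["tess_factor"])]
print(len(verts), "tessellated vertices from one patch")  # -> 45
```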
"My problem with this concept is that tessellation, generally, is supposed to reduce VRAM bandwidth (and space) usage by doing stuff on die instead of dealing with hugely-expanded vertex data streams. Shoving HS/TS data off die really works against that. Unless there's a healthy Fermi-style L2 cache, it seems like not much progress to me."

Like I said, TS data is 4 bytes per vertex, which means 2-4 bytes per triangle. Even Fermi's peak of 4 tris per clock would consume only about 11 GB/s using an off-die buffer for the TS output.
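Checking that number, a quick sketch; the 4 tris/clock peak is from the post, while the ~700 MHz setup clock (GTX 480 core clock) is an assumption:

```python
# Rough check of the bandwidth figure above.
TRIS_PER_CLOCK = 4            # Fermi's peak setup rate, per the post
CLOCK_HZ = 700e6              # assumed ~700 MHz (GTX 480 core clock)
BYTES_PER_TRI = 4             # worst case: ~1 new strip vertex per triangle

peak = TRIS_PER_CLOCK * CLOCK_HZ * BYTES_PER_TRI / 1e9
print(f"{peak:.1f} GB/s")     # -> 11.2 GB/s; vertex reuse roughly halves it
```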
"GeForces are only limited in geometric throughput when tessellation is disabled, right? That doesn't sound like a bottleneck you'd be likely to hit outside of pro rendering applications."

I got that the wrong way round, sigh.
"Besides, I know that tessellation is trendy (and rightly so, I suppose), but the main objective is to render games with max details and smooth framerates, right? As far as I can tell, even Cypress is capable of doing that,"

GTX460 is faster than HD5870 in Civ 5: