AMD: R9xx Speculation

Yes the curious thing is that this description of off-die buffering was associated specifically with tessellation. It seems that off-die buffering of tessellation data is an "improvement" on keeping that data in the SIMDs (seemingly in LDS?). Maybe that's because it's easier to share it across the chip?
The B3d article on Fermi indicates that Cypress can be affected by the fatness of the control points and math in the HS.
Having more capacity may help with the former problem.

Can't say I like the sound of this, though - it might be an improvement, but ugh, a cache like Fermi's seems preferable. And 2x the geometry performance, if true, sounds like no real improvement.
The description doesn't seem specific enough to indicate there isn't on-die storage that may spill to memory.
 
Yes the curious thing is that this description of off-die buffering was associated specifically with tessellation. It seems that off-die buffering of tessellation data is an "improvement" on keeping that data in the SIMDs (seemingly in LDS?). Maybe that's because it's easier to share it across the chip?
Didn't the B3D article say that tessellation data is kept in the GDS? There could be some data bottlenecks with that.

Does anyone know how fast RV770 and newer ATI GPUs process triangles that pass through a trivial geometry shader?
 
R300/9700 was a groundbreaking card. For the first time, both AF and AA were usable at playable FPS. I doubt Cayman can deliver something as monumental, but one can hope.

Crysis Warhead > 60FPS @ 25x16 4xAA Enthusiast? One can only dream... :p:LOL:
 
Didn't the B3D article say that tessellation data is kept in the GDS? There could be some data bottlenecks with that.
GDS is used to send parameters to TS, I believe.

There's a separate data path from HS direct to DS. Additionally DS consumes the output of TS (obviously). So HS and TS data needs to be staged for consumption by DS - in theory covering quite a bit of lag between the two data streams. This appears to be the crux of the buffering issue. The B3D article, I believe, describes "locking" HS and DS together as a pair within a SIMD. This then limits the amount of data that can be staged, and presumably also affects the SIMDs' ability to sink the output of TS.

Does anyone know how fast RV770 and newer ATI GPUs process triangles that pass through a trivial geometry shader?
No. A key characteristic of GS since R600 has been pushing all the data off-die through the ring buffer. This is how GS was originally able to support huge amounts of data per vertex, before D3D10 got cut back in favour of NVidia's architecture.

So it appears this hint is for some kind of ring buffer (or multiple ring buffers?) off die for DS to consume.

My problem with this concept is that tessellation, generally, is supposed to reduce VRAM bandwidth (and space) usage by doing stuff on die instead of dealing with hugely-expanded vertex data streams. Shoving HS/TS data off die really works against that. Unless there's a healthy Fermi style L2 cache, it seems like not much progress to me.
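The bandwidth argument here can be made concrete with a toy comparison. All figures below are illustrative assumptions, not numbers from the thread: a coarse patch mesh plus tessellation factors is far smaller than the fully expanded vertex stream it generates, which is exactly the saving that pushing HS/TS data off die risks giving back.

```python
# Illustrative only: compare a coarse patch mesh against the vertex
# stream it expands into after tessellation. Numbers are assumptions.
patch_verts = 1000
bytes_per_vert = 32        # assumed layout: position + normal + UV
amplification = 64         # assumed tessellated verts per patch vert

coarse_bytes = patch_verts * bytes_per_vert
expanded_bytes = patch_verts * amplification * bytes_per_vert
print(coarse_bytes, expanded_bytes)   # 32000 vs 2048000 bytes
```

With these (made-up) numbers the expanded stream is 64x the size of the source data, which is why doing the amplification on die and keeping the intermediate data there is attractive in the first place.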
 
Isn't that amply sufficient?
Seems NVidia has quite a reserve, both in terms of unlocking throughput (Quadro is the real deal for throughput) and in terms of clocks. So, no, I don't think it's sufficient. Bearing in mind Cayman looks like it's going to have to last for a year+ (emphasis on +). Also, can it scale further? Is it really scalable?
 
Seems NVidia has quite a reserve, both in terms of unlocking throughput (Quadro is the real deal for throughput) and in terms of clocks. So, no, I don't think it's sufficient. Bearing in mind Cayman looks like it's going to have to last for a year+ (emphasis on +). Also, can it scale further? Is it really scalable?

GeForces are only limited in geometric throughput when tessellation is disabled, right? That doesn't sound like a bottleneck you'd be likely to hit outside of pro rendering applications.

Besides, I know that tessellation is trendy —and rightly so, I suppose— but the main objective is to render games with max details and smooth framerates, right? As far as I can tell, even Cypress is capable of doing that, and it has yet to crumble under the weight of high geometric demands on any game. Barts is better than Cypress in terms of tessellation, and Cayman apparently offers further improvements over that.

So where's the problem? I'd understand your concern if Cypress were struggling with, say, Alien vs Predator, but I don't see any reason to be worried here.

Geometric throughput is just one metric. Should NVIDIA drastically increase FLOPS just because AMD leads in that area? Only actual performance matters.
 
I think the question is how effectively will Cayman use its geometry power. In theory it should be capable of about 1700 megatriangles per second.

Shouldn't that be enough to achieve more than 60 FPS on an Eyefinity-6 configuration of 2560*1600 LCDs displaying a scene made of single-pixel triangles?
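A quick sanity check on that, assuming exactly one triangle per pixel with no overdraw or culling:

```python
# Back-of-the-envelope: triangle rate needed for six 2560x1600 panels
# at 60 FPS with one visible triangle per pixel (idealized assumption).
panels = 6
pixels_per_panel = 2560 * 1600          # 4.096 Mpix per panel
fps = 60

tris_per_second = panels * pixels_per_panel * fps
print(tris_per_second / 1e6)            # 1474.56 Mtris/s
```

That lands just under the ~1700 Mtris/s theoretical figure, so on paper the answer is yes, with little margin to spare.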
 
GDS is used to send parameters to TS, I believe.

There's a separate data path from HS direct to DS. Additionally DS consumes the output of TS (obviously). So HS and TS data needs to be staged for consumption by DS - in theory covering quite a bit of lag between the two data streams. This appears to be the crux of the buffering issue. The B3D article, I believe, describes "locking" HS and DS together as a pair within a SIMD. This then limits the amount of data that can be staged, and presumably also affects the SIMDs' ability to sink the output of TS.
TS data is very small, though. It's just 4 bytes per vertex if you use a triangle strip, and close to half if you do caching. If you can stage just one kilobyte then you have several wavefronts of vertices buffered up.

My problem with this concept is that tessellation, generally, is supposed to reduce VRAM bandwidth (and space) usage by doing stuff on die instead of dealing with hugely-expanded vertex data streams. Shoving HS/TS data off die really works against that. Unless there's a healthy Fermi style L2 cache, it seems like not much progress to me.
Like I said, TS data is 4 bytes per vertex, which means 2-4 bytes per triangle. Even Fermi's peak of 4 tris per clock would consume under 11 GB/s using an off-die buffer for the TS output.
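The arithmetic behind that estimate, assuming a ~700 MHz setup clock for Fermi (an assumption, not a figure from the thread):

```python
# Rough bandwidth cost of buffering TS output off die.
# Assumed: ~700 MHz setup clock, peak 4 triangles per clock.
setup_clock_hz = 700e6
tris_per_clock = 4
bytes_per_tri = 4      # worst case; strips/caching roughly halve this

gb_per_s = setup_clock_hz * tris_per_clock * bytes_per_tri / 1e9
print(gb_per_s)        # ~11.2 GB/s worst case, ~5.6 GB/s with strips
```

Either way it is a small fraction of a 150+ GB/s memory system, which is the point being made.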
 
Seems NVidia has quite a reserve, both in terms of unlocking throughput (Quadro is the real deal for throughput) and in terms of clocks. So, no, I don't think it's sufficient. Bearing in mind Cayman looks like it's going to have to last for a year+ (emphasis on +). Also, can it scale further? Is it really scalable?

I'm wondering why you think it will have to last over a year. Surely when they move to 28nm sometime in 2011, they will replace Cayman.
 
Like I said, TS data is 4 bytes per vertex, which means 2-4 bytes per triangle. Even Fermi's peak of 4 tris per clock would consume under 11 GB/s using an off-die buffer for the TS output.
  • 32 ROPs @ 850MHz ~ 153GB/s -> 32 ROPs @ 900MHz ~ 162GB/s
  • 256-Bit @ 5.8Gbps ~ 185GB/s
  • 8 memory channels 23GB/s each.

So with tessellation Cayman switches to a 224-bit (162GB/s) mode with 1792MiB of general VRAM and an exclusive 23GB/s 256MiB tessellation buffer? :LOL:
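The figures in that list are consistent with a 256-bit bus split into eight 32-bit channels, which is presumably where the joke comes from:

```python
# Sanity check on the bandwidth figures above: a 256-bit bus at
# 5.8 Gbps per pin, organized as eight 32-bit channels (speculative).
bits_per_channel = 32
channels = 8
gbps_per_pin = 5.8

per_channel = bits_per_channel * gbps_per_pin / 8   # ~23.2 GB/s each
total = per_channel * channels                      # ~185.6 GB/s
seven_of_eight = per_channel * 7                    # ~162.4 GB/s
print(per_channel, total, seven_of_eight)
```

Seven of the eight channels give the quoted "224-bit / 162GB/s" mode, with the eighth channel's 23 GB/s left over as the hypothetical dedicated tessellation buffer.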
 
GeForces are only limited in geometric throughput when tessellation is disabled, right? That doesn't sound like a bottleneck you'd be likely to hit outside of pro rendering applications.
:oops: I got that the wrong way round, sigh.

Besides, I know that tessellation is trendy —and rightly so, I suppose— but the main objective is to render games with max details and smooth framerates, right? As far as I can tell, even Cypress is capable of doing that,
GTX460 is faster than HD5870 in Civ 5:

http://www.techspot.com/review/320-civilization-v-performance/page9.html

Even with tessellation off:

http://www.techspot.com/review/320-civilization-v-performance/page7.html

AMD had devrel involved in making that game.

I don't see Metro 2033 performance holding up, either (not even highest in-game settings):

http://www.techreport.com/articles.x/19844/12
 
I have another question about MLAA. If I enable MLAA in the driver and set the level to 8x samples, does that mean 8xMLAA or 8xMSAA+MLAA? I'm thinking the latter, since MLAA seems to be a fixed filter.
 