AMD: R9xx Speculation

The doubled geometry throughput doesn't seem to be working very well, either. I hope drivers can cure whatever ails Cayman in this respect.
According to Eric Demers, the real setup rate peaks at 1.8 triangles per clock in the best case. :rolleyes:

For comparison, the GF100's rate is benched at 3.51 tris/clock, with minimum overhead of course. Still miles ahead of Cayman, and that's without considering the much more robust tessellation throughput.
 
Cypress simply stalls its pipeline at higher tessellation factors. Cayman instead dumps all the data from the hull/domain shader into a buffer in video memory, but then the limiting factor becomes memory bandwidth -- not a much better situation than the previous gen.
That simply shouldn't be the case. Tessellation requires only 4 bytes per final vertex (an INT16 coordinate pair). Even if you double it for FP32 and ignore caching that takes advantage of the ~2:1 triangle:vertex ratio, then full throughput would need under 30 GB/s at a time when there is little shading going on (due to the small/invisible triangles).
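As a quick sanity check on that figure (the 2 tris/clock setup rate and 880 MHz clock are my assumptions, roughly Cayman's specs; the byte counts come straight from the argument above):

```python
# Rough bandwidth estimate for an off-chip tessellation buffer.
# Assumptions (illustrative, not a measured spec): peak setup of
# 2 triangles/clock at an 880 MHz core clock; FP32 coordinate pair
# (8 bytes/vertex); caching ignored, so ~1 vertex per triangle
# instead of exploiting the ~2:1 triangle:vertex ratio; each
# vertex is written to the buffer once and read back once.
tris_per_clock = 2
core_clock_hz = 880e6
bytes_per_vertex = 8          # FP32 x/y pair
vertices_per_tri = 1          # 2:1 tri:vertex ratio, caching ignored
transfers = 2                 # write to buffer + read back

tris_per_sec = tris_per_clock * core_clock_hz
bandwidth_gbs = (tris_per_sec * vertices_per_tri
                 * bytes_per_vertex * transfers / 1e9)
print(f"{bandwidth_gbs:.1f} GB/s")  # prints "28.2 GB/s"
```

Which lands just under the 30 GB/s figure above, and is small change next to Cayman's ~176 GB/s of memory bandwidth.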

ATI either screwed up in the hardware or has a lot of driver work to do in managing this buffer.
 
<snip>

ATI either screwed up in the hardware or has a lot of driver work to do in managing this buffer.
Maybe it's not just one trip to the off-chip buffer after the TS output. Is it possible that Cayman skips some culling before dumping the data into video memory, and then needs to read it back for some intermediate step before processing?
 
In Nvidia's Island11 tessellation demo, the relative performance difference between HS frustum culling on and off does not vary greatly between Cypress, Barts and Cayman, IIRC.
 
According to Eric Demers, the real setup rate peaks at 1.8 triangles per clock in the best case. :rolleyes:

For comparison, the GF100's rate is benched at 3.51 tris/clock, with minimum overhead of course. Still miles ahead of Cayman, and that's without considering the much more robust tessellation throughput.

In the rather old but nice Xvox demo, which I also used on GeForce cards, I am getting about 1.67 tris/clk for the HD 6970, whereas the GTX 460 is at only 1.5 (dual setup engines, a fairer comparison than GF100/b with its four engines; ~2.43 here).
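For anyone wanting to reproduce these per-clock numbers: they are just measured triangle throughput divided by the core clock. A minimal sketch, with the measured rates back-derived from the figures above rather than taken from fresh benchmark data:

```python
# Normalize a measured triangle rate to triangles/clock.
# Clocks are the stock ones (HD 6970 at 880 MHz, GTX 460 at
# 675 MHz); the "measured" rates below are back-derived from the
# tris/clk figures quoted above, purely for illustration.
def tris_per_clock(measured_tris_per_sec, core_clock_hz):
    return measured_tris_per_sec / core_clock_hz

print(tris_per_clock(1.47e9, 880e6))  # ~1.67, the HD 6970 figure
print(tris_per_clock(1.01e9, 675e6))  # ~1.5, the GTX 460 figure
```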
 
There's no reason for AMD to meddle with the SIMD width (and batch size, by the same token). It would be counterproductive to utilization efficiency.
Ok, I guess you're right.

Still, GF100/b (as well as Barts & GF104, for that matter) proves that 64 TMUs are enough even for the most recent games; 96 is already a bit overkill and doesn't seem to help performance that much.

My point is that it would make sense to go for an 8:1 4D-unit/TEX ratio with the 28nm successor, because in my eyes anything more than 128 TMUs would be a waste of mm²/transistor budget.
 
Agreed. They made many architectural changes that don't really pay off yet, but had to be done sooner or later.
I don't think this argument makes sense.
If the architectural changes don't pay off now, then it would make more sense to introduce them when they DO pay off, rather than giving your customers a worse deal and your competitor a leg up.

The HD 6970, judging from application performance averages, brings no improvement in either performance/Watt or performance/mm² compared to the HD 6870 (it's a bit worse, actually), and it's a definite step back in performance/$, although that doesn't count for much. The new cards are not bad or anything, quite the contrary, and the HD 6950 looks to be a sweet card overall, but comparisons with either the HD 58xx predecessors or the HD 68xx generational sibling give the impression that AMD dropped the ball a bit.
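To put rough numbers on that (the die sizes and board powers are the commonly cited specs; the relative performance factor is purely illustrative, not a measured average):

```python
# Compare efficiency metrics, HD 6970 (Cayman) vs HD 6870 (Barts).
# Die sizes (mm^2) and board powers (W) are the commonly cited
# specs; the relative performance number is an illustrative
# assumption, not a benchmark result.
cards = {
    # name: (relative_perf, board_power_w, die_mm2)
    "HD 6870": (1.00, 151, 255),
    "HD 6970": (1.20, 250, 389),  # "~20% faster" is an assumption
}
for name, (perf, watts, mm2) in cards.items():
    print(f"{name}: perf/W = {perf / watts:.4f}, "
          f"perf/mm2 = {perf / mm2:.4f}")
```

Even granting the 6970 a healthy performance lead, it comes out behind on both ratios, which is the point being made above.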
 
Ok, I guess you're right.

Still, GF100/b (as well as Barts & GF104, for that matter) proves that 64 TMUs are enough even for the most recent games; 96 is already a bit overkill and doesn't seem to help performance that much.

My point is that it would make sense to go for an 8:1 4D-unit/TEX ratio with the 28nm successor, because in my eyes anything more than 128 TMUs would be a waste of mm²/transistor budget.

Or they could go the other way: different frequencies for the shaders and the rest of the chip. At 700 MHz the 32 ROPs and 96 TMUs would be more than enough, and the SPs would run at 1.4 GHz. They already got rid of the T unit and simplified the ALUs.
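A rough sketch of what that split-clock configuration would yield, using Cayman's unit counts; the clocks are the hypothetical ones from the post above, not a real product spec:

```python
# Hypothetical split-clock Cayman: fixed-function units at a
# 700 MHz base clock, shaders on a 2x hot clock (1.4 GHz).
# Unit counts are Cayman's (1536 SPs, 96 TMUs, 32 ROPs); the
# clock scheme itself is pure speculation from the post above.
base_clock_hz = 700e6
hot_clock_hz = 2 * base_clock_hz          # 1.4 GHz shader domain
sps, tmus, rops = 1536, 96, 32

flops = sps * 2 * hot_clock_hz            # 2 FLOPs/SP/clock (MAD)
texels = tmus * base_clock_hz             # bilinear texels/s
pixels = rops * base_clock_hz             # pixels/s
print(f"{flops / 1e12:.2f} TFLOPS, "
      f"{texels / 1e9:.1f} GT/s, {pixels / 1e9:.1f} GP/s")
```

That would be ~4.3 TFLOPS versus the stock HD 6970's ~2.7 TFLOPS, while the texel and pixel rates drop only ~20% from the 880 MHz baseline.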
 
Why no ALU hot-clock below two times the base clock? S3 showed it with Chrome 400/500.
1.5 times the base clock would be a bit above the previous ALU:TEX ratio.

With "RV1070" we might see some serious changes, like real HQ filtering, a reorganized TMU architecture (like in the patents Jawed found) and maybe a more scalable MC with 64-bit steps.

What do you think about the possibilities of a Cayman 40nm refresh in mid 2011?
 
disappointing launch

It's a disappointing launch. AMD loses ground with the 69XX even as Nvidia is inching closer after the Fermi debacle. But IMO it's somewhat expected.

Every few years both Nvidia and AMD have to pay a price to the GPU gods. Nvidia did it with Fermi in one go: changed the architecture, swallowed the bitter pill, and will now gain strength. AMD held it back with the 58XX and is now trying to spread the architectural change over two generations, 69XX and 78XX (or whatever comes next). And it will take exceptional engineering prowess for AMD to match Kepler now.

Nothing new in the script... just that after the flawless 48XX and a very strong 58XX, we were expecting AMD to pull off a miracle.
 
I take back my praise from last night. Having read the reviews and Anand's final thoughts I'm going to hold out for GF114 at a reasonable price. It looks like GF114 will be competitive with 6950. I'm very surprised at these results, I really thought ATi had got this one, but it looks like the original prognosis of HD6970 = GTX570 = GTX480 has come to pass. Nvidia not only keep their single GPU crown, but they have extended their lead over ATi, and this is without any architecture improvements on the scale of 5870 -> 6970.

I think disappointing is the key word for 6970, and decent for 6950.

Also, I think this proves my point that PowerTune is just an anti-Furmark switch. If Nvidia had implemented a similar function, we would have had countless articles from Charlie about how Nvidia couldn't contain their TDP and how broken the architecture and chip were, and he would have bumped up a load of BS articles about bumpgate to hammer the point home.
 
Why no ALU hot-clock below two times the base clock? S3 showed it with Chrome 400/500.
1.5 times the base clock would be a bit above the previous ALU:TEX ratio.

Fillrate and texel rate are already high (and increasing them without more bandwidth doesn't make much sense).
So if they jump to 28nm, all the power they save by staying with the same 32 ROPs and 96 TMUs could be invested into doubled ALU clocks. ;) (And it would certainly be easier to go for 1:2 than 1:1.5.)
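To illustrate how the hot-clock multiplier shifts the effective ALU:TEX ratio (unit counts are Cayman's 384 VLIW4 units and 96 TMUs; "effective" here just means the physical ratio scaled by the shader clock multiplier):

```python
# How a shader hot clock shifts the effective ALU:TEX ratio.
# 384 VLIW4 units and 96 TMUs are Cayman's counts; with both
# domains on one clock the physical ratio is 4:1, and a hot
# clock multiplies the ALU side of that ratio directly.
vliw4_units, tmus = 384, 96
physical_ratio = vliw4_units / tmus       # 4:1 at a unified clock
for mult in (1.0, 1.5, 2.0):
    print(f"{mult}x hot clock -> effective ALU:TEX = "
          f"{physical_ratio * mult}:1")
```

So a 1.5x hot clock lands at an effective 6:1, and the full 2x at 8:1, which is the 8:1 ratio floated for the 28nm part earlier in the thread.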
 
Also, I think this proves my point that PowerTune is just an anti-Furmark switch.
Not sure why you say that, given the amount of space being dedicated to it in reviews (which reflects the amount of time we have put into conveying the message). For sure, though, the 6970 is the less interesting of the two to look at, as it has a lot more headroom; the 6950 has a much more stringent TDP budget to stick to, and without PowerTune it wouldn't have been close to the clocks it is at.
 
Sideport?

Regarding Sideport, after some digging, I found this:

A reader question: I had a user question asking, what happened to Sideport (XSP)? Sideport was intended to add more interconnect bandwidth. It has been disabled ever since the release of the RV770 (X2) from day 1. We heard that "that much bandwidth is not needed". IMO... you can never have enough bandwidth really. What was going on there?

This is simply a case of our software capabilities catching up to our hardware capabilities.

When the initial design of the RV770 was taking place and concepts such as Sideport were being kicked around, our ATI CrossFireX™ software wasn't in the place it is right now, so there was a much higher reliance on inter-chip communication.

While having lots of bandwidth is rarely a bad thing, the ATI CrossFireX communication bandwidth between two discrete cards is less than local bandwidth. Even though Sideport doubles the inter-GPU communication bandwidth on an X2-type solution, it's still not significant enough to really change the disparity between local frame buffer and inter-GPU bandwidths.

The software work that occurred in the space of time between the RV770 design and product saw significant improvements in inter-GPU communication. Internal to the driver we now have a number of "alternate frame rendering" (AFR) profiles, with many parameters that can be tweaked in order to control how the rendering behaves over multiple GPUs and reduce the inter-GPU communication as much as possible. By the time we put two RV770s on a board and started testing Sideport, the current ATI CrossFireX software capabilities delivered more than enough bandwidth, obviating the need for Sideport.
http://www.guru3d.com/article/interview-with-ati-dave-baumann/2
 