DavidGraham
Veteran
You silly, I meant next year of course, sorry for the typo!
Not much time left, is it?
According to Eric Demers, the real setup rate peaks at 1.8 triangles in the best case.
The doubled geometry throughput doesn't seem to be working very well, either. I hope drivers can cure whatever ails Cayman in this respect.
That simply shouldn't be the case. Tessellation requires only 4 bytes per final vertex (an INT16 coordinate pair). Even if you double it for FP32 and ignore caching that takes advantage of the ~2:1 triangle:vertex ratio, then full throughput would need under 30 GB/s at a time when there is little shading going on (due to the small/invisible triangles).
Cypress simply stalls its pipeline with higher tessellation factors. Cayman now simply dumps all the data from the hull/domain shader in the frame buffer (video memory), but now the limiting factor becomes the memory BW -- not a much better situation than the previous gen.
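For anyone who wants to sanity-check the "under 30 GB/s" figure, here is a rough back-of-the-envelope sketch. The inputs are assumptions, not numbers from the posts above: a peak of 2 triangles per clock at an 880 MHz core clock, one 8-byte (FP32 pair) vertex per triangle, and each vertex crossing the memory bus twice (one write, one read-back). The function name and parameters are made up for illustration.

```python
def tess_bandwidth_gb_s(tris_per_clock, clock_hz, bytes_per_vertex,
                        verts_per_tri=1.0, passes=2):
    """Rough off-chip traffic in GB/s for spilling tessellated vertices.

    All parameters are illustrative assumptions:
      verts_per_tri -- unique vertices emitted per triangle
      passes        -- times each vertex crosses the bus (write + read-back)
    """
    tris_per_s = tris_per_clock * clock_hz
    bytes_per_s = tris_per_s * verts_per_tri * bytes_per_vertex * passes
    return bytes_per_s / 1e9


# Assumed inputs: 2 tris/clock, 880 MHz, 8 bytes per vertex (FP32 pair)
estimate = tess_bandwidth_gb_s(2, 880e6, 8)
print(f"{estimate:.2f} GB/s")

# Same inputs, but with caching exploiting the ~2:1 triangle:vertex ratio
cached = tess_bandwidth_gb_s(2, 880e6, 8, verts_per_tri=0.5)
print(f"{cached:.2f} GB/s")
```

Under these assumptions the estimate stays below 30 GB/s, and vertex caching would roughly halve it. Different assumptions about vertices per triangle or the number of round trips to memory change the result proportionally.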
That's fine, but we're seeing minimal gains over Barts in many cases.
According to Eric Demers, the real setup rate peaks at 1.8 triangles in the best case.
Carsten's results, however, still show the horrible scaling from before.
Maybe it's not just one trip to the off-chip buffer after the TS output. Is it possible that Cayman skips some culling before dumping the data in the video memory, and then needs to read it back for some intermediate step before processing?!
<snip>
ATI either screwed up in the hardware or has a lot of driver work to do in managing this buffer.
According to Eric Demers, the real setup rate peaks at 1.8 triangles in the best case.
For comparison, the GF100's rate is benched at 3.51 tris, with minimum overhead of course. Still miles ahead of Cayman, and that's without considering the much more robust tessellation throughput.
Ok, I guess you're right.
There's no reason for AMD to meddle with the SIMD width (and batch size, in the same manner). It would be counterproductive to utilization efficiency.
I don't think this argument makes sense.
Agreed. They made many architectural changes that don't really pay off yet, but had to be done sooner or later.
Ok, I guess you're right.
Still, GF100/b (as well as Barts & GF104, for that matter) proves that 64 TMUs are enough even for the most recent games; 96 is already a bit overkill and doesn't seem to help performance that much.
My point is that it would make sense to go for an 8:1 4D-unit/TEX ratio with the 28nm successor, because in my eyes anything more than 128 TMUs would be a waste of mm²/transistor budget.
Yep!
Now the only thing missing in the picture is R/W L2 cache to complement that buffering, but that's for the next generation.
Why no ALU hot-clock below two times base clock? S3 showed it with Chrome 400/500.
1.5 times base clock would be a bit above the previous ALU:TEX ratio.
Not sure why you say that given the amount of space being dedicated to reviews on it (which is reflective of the amount of time we have put into conveying the message). For sure, though, 6970 is the less interesting one to look at of the two as it does have a lot more headroom; 6950 has a much more stringent TDP budget to stick to and without PowerTune it wouldn't have been close to the clock it is at.
Also I think this proves my point that PowerTune is just an anti-Furmark switch.
http://www.guru3d.com/article/interview-with-ati-dave-baumann/2
A reader question: I had a user question asking, what happened to Sideport (XSP)? Sideport was intended to add more interconnect bandwidth. It has been disabled ever since the release of the RV770 (X2) from day 1. We heard that "that much bandwidth is not needed". IMO... you can never have enough bandwidth really. What was going on there?
This is simply a case of our software capabilities catching up to our hardware capabilities.
When the initial design of the RV770 was taking place and concepts such as Sideport were being kicked around our ATI CrossFireX™ software wasn't in the place it is right now, so there was a much higher reliance on inter-chip communication.
While having lots of bandwidth is rarely a bad thing, the ATI CrossFireX communication bandwidth between two discrete cards is less than local bandwidth - even though Sideport doubles the inter-GPU communication bandwidth on an X2 type solution it's still not significant enough to really change the disparity in local frame buffer and inter-GPU bandwidths.
The software work that occurred in the space of time between the RV770 design and product saw significant improvements in inter-GPU communication. Internal to the driver we now have a number of "alternate frame rendering" (AFR) profiles, with many parameters that can be tweaked in order to control how the rendering behaves over multiple GPU's and reduce the inter-GPU communication as much as possible. By the time we put two RV770's on a board and started testing Sideport, the current ATI CrossFireX software capabilities delivered more than enough bandwidth, obviating the need for Sideport.