AMD: R7xx Speculation

Status
Not open for further replies.
True. I didn't mean to nitpick, just to point out that while GT200 has 2x the bandwidth of G92, it generally has less than 2x as much of everything else, so it may be more accurate to say that if (when) G92 is b/w limited, GT200 may also be, but to a lesser extent.

What I tried to say is that, if G92 is bandwidth limited, there's no chance of seeing, in normal conditions (i.e. where the framebuffer does not become the limit), "more than 2x" scaling with respect to G92, as some seem to suggest.

But I don't know how a GPU functions at a low level (specifically, texture caching), so I'm probably wrong or oversimplifying when thinking that if G92's TF units are bandwidth limited, and GT200 gains +25% TFs but +155% b/w, then its TF units aren't (as b/w limited). I'm probably forgetting/ignoring that ROPs are the main b/w consumers, and that GT200's increased ROP count tracks closer (than TF) to its increased b/w. So, if ROPs are the main b/w consumer, then GT200 is pretty close to ~2x G92, as you said.

GT200 is not +155% bandwidth, it's +112%. Yes, its TF units are not as bandwidth limited as G92's, but there are only +25% more of them (and a little less than that, taking the frequencies into account), so they cannot sustain double the throughput.
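For what it's worth, peak memory bandwidth is just bus width times effective data rate, so the disputed percentage depends entirely on which G92 board you baseline against. A quick sketch (the reference clocks and SKU choices below are my assumptions, not figures from the thread):

```python
def bandwidth_gbs(bus_width_bits, data_rate_mtps):
    """Peak memory bandwidth in GB/s: bytes per transfer times transfers per second."""
    return bus_width_bits / 8 * data_rate_mtps / 1000

# Assumed reference clocks: GTX 280 is 512-bit GDDR3 at 2214 MT/s,
# 9800 GTX (G92) is 256-bit at 2200 MT/s, 8800 GT (G92) is 256-bit at 1800 MT/s.
gtx280 = bandwidth_gbs(512, 2214)    # ~141.7 GB/s
gtx9800 = bandwidth_gbs(256, 2200)   # 70.4 GB/s
gt8800 = bandwidth_gbs(256, 1800)    # 57.6 GB/s

print(f"vs 9800 GTX: +{(gtx280 / gtx9800 - 1) * 100:.0f}%")  # +101%
print(f"vs 8800 GT:  +{(gtx280 / gt8800 - 1) * 100:.0f}%")   # +146%
```

Neither baseline gives exactly +112% or +155%, but either way the takeaway is the same: GT200's bandwidth is roughly 2-2.5x G92's, while its texture filtering grows only ~25%.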


For instance, G92's texture power is indeed "overwhelming," but does it overwhelm available bandwidth, or are you just saying it's wasted relative to the power/speed of the rest of the chip? :) I was guessing that ALUs don't require too much bandwidth, and are therefore not bandwidth "limited," b/c something like 3DMark's Perlin Noise test shows the 8800GT is faster than the 9600GT by the exact percentage of its theoretical FLOPS advantage: 62%. And if GT200 is scoring 300 while a G92 8800GT scores 155, then we're also seeing an improvement that tracks about 1:1 with the increase in ALUs (if not with FLOPS, though I don't know if GT200 is better at flogging that extra MUL, or if this specific test will even give it the chance).
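That 62% figure can be sanity-checked against theoretical peaks, assuming reference shader clocks and SP counts (my figures, not from the post) and 3 flops per SP per clock (MAD + MUL on G8x/G9x):

```python
def peak_gflops(sps, shader_clock_mhz, flops_per_clock=3):
    """Theoretical peak in GFLOPS: SPs x shader clock x flops per SP per clock."""
    return sps * shader_clock_mhz * flops_per_clock / 1000

gt8800 = peak_gflops(112, 1500)  # 8800 GT: 112 SPs at 1500 MHz -> 504 GFLOPS
gt9600 = peak_gflops(64, 1625)   # 9600 GT: 64 SPs at 1625 MHz -> 312 GFLOPS
print(f"+{(gt8800 / gt9600 - 1) * 100:.0f}%")  # +62%, matching the Perlin Noise gap
```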

Yeah, a GPU is the sum of its parts, but the word bottleneck exists for a reason, doesn't it? :)

I'm saying it's in a certain way wasted in G92 (and Nvidia agrees, it seems, because they designed a chip that should outperform G92-based cards by a factor of 2 with only +25% more TUs). Perlin Noise is a very ALU-intensive shader, but shaders in games are normally not so ALU intensive - otherwise the R600 architecture would have fared better in comparison to G8X/9X ;) . So Perlin Noise (not being BW limited) shows the improvement in shading power, whereas in the vast majority of real gaming cases the improvements are not 100% but much, much lower. Then I ask: why? And this comes back to the bottleneck argument, leading me (and not me alone) to believe that this is a bandwidth issue in these cases (except when it's more a framebuffer issue, as at high resolutions + AA).
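The bottleneck argument here is essentially a roofline model: a shader's throughput is capped by whichever of peak ALU rate or peak bandwidth it exhausts first. A minimal sketch, with all workload and chip numbers purely illustrative:

```python
def roofline_time(flops, bytes_moved, peak_gflops, peak_gbs):
    """Time (s) under a simple roofline: whichever of ALU or memory takes longer wins."""
    alu_time = flops / (peak_gflops * 1e9)
    mem_time = bytes_moved / (peak_gbs * 1e9)
    return max(alu_time, mem_time), ("ALU" if alu_time > mem_time else "bandwidth")

# Illustrative only: a Perlin-style shader (high flops/byte) vs a texture-heavy
# game shader (low flops/byte), on a G92-like chip (~500 GFLOPS, ~60 GB/s).
_, limit = roofline_time(flops=5e11, bytes_moved=1e9, peak_gflops=500, peak_gbs=60)
print(limit)  # ALU: here, doubling shader power roughly doubles speed

_, limit = roofline_time(flops=5e10, bytes_moved=3e10, peak_gflops=500, peak_gbs=60)
print(limit)  # bandwidth: here, extra ALUs barely help
```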
 
http://www.amd.com/us-en/Corporate/VirtualPressRoom/0,,51_104_543~126593,00.html
AMD Stream Processor First to Break 1 Teraflop Barrier

—Next-generation AMD FireStream™ 9250 processor accelerates scientific
and engineering calculations, efficiently delivering supercomputer performance at
up to eight gigaflops-per-watt —

The AMD FireStream 9250 stream processor includes a second-generation
double-precision floating point hardware implementation delivering
more than 200 gigaflops, building on the capabilities of the earlier
AMD FireStream™ 9170, the industry’s first GP-GPU with double-precision floating point support.
The AMD FireStream 9250’s compact size makes it ideal for small 1U servers
as well as most desktop systems, workstations, and larger servers and
it features 1GB of GDDR3 memory, enabling developers to handle large, complex problems.

AMD is also working closely with world class application and solution providers
to ensure customers can achieve optimum performance results.
Stream computing application and solution providers include CAPS entreprise,
Mercury Computer Systems, RapidMind, RogueWave and VizExperts.
Mercury Computer Systems provides high-performance computing systems
and software designed for complex image, sensor, and signal processing applications.
Its algorithm team reports that it has achieved 174 GFLOPS performance for
large 1D complex single-precision floating point FFTs on the AMD FireStream 9250.
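FFT GFLOPS figures like that 174 are conventionally derived from the 5N·log2(N) operation count for a complex FFT divided by the measured time; the timing below is a made-up value purely to show the arithmetic:

```python
import math

def fft_gflops(n, seconds):
    """Standard FFT benchmark metric: 5*N*log2(N) flops over measured time."""
    return 5 * n * math.log2(n) / seconds / 1e9

# Hypothetical example: a 2^20-point complex single-precision FFT in 1 ms
print(f"{fft_gflops(2**20, 1e-3):.0f} GFLOPS")  # -> 105 GFLOPS
```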
 
FUD-Z is totally wrong on this one.

If it's just starting to sample, why the hell is a Bulk Order at Chiphell promising 4870s this Friday?

With official pics to boot (I guess they booted the "car" styling, or Sapphire has different plans on stickers)

[attached image: hd4870jpgus2.jpg - HD 4870 card photo]
 
Bloody hell, 40 TUs. That diagram is hilarious.

A few things I noted that are different:
  • CrossFireX Sideport was never mentioned on prior diagrams
  • L2 cache is segmented into 4 parts (1 per memory controller) - though it was never clear whether it was segmented before
  • the RBEs have 4 depth/stencil units instead of 2
  • there's no alpha/fog unit in the RBEs
  • the blend unit in the RBEs is differently coloured (bright green) - dunno if that indicates anything
I guess it's 16 SIMDs. But, blimey, 4:1 ALU:TEX. Hell, HD4850 must be choking for bandwidth.

Jawed
 
The way the lines are drawn from the thread dispatch processor, I was counting 10 SIMDs.
I was just assuming the same "horizontal" arrangement we saw in R6xx :???:

If each SIMD gets a dedicated TU ("vertical" - though it's horizontal on this diagram if the SIMDs are now read as rotated 90 degrees) then that's a bit of a change - though in terms of cache I guess the effect of the change is very limited because L1 is basically only big enough for a small region of texels and any one texel will find itself in multiple L1s anyway, whether it's a horizontal or vertical mapping from SIMDs to TUs.

Jawed
 
I'm not sure what to make of that diagram. Is there any sourcing to it, or did somebody decide to mess with MS Paint?

10 SIMDs would keep things mathematically consistent with each TMU block feeding groupings of 4 SIMD elements.

16 SIMDs would mean each SIMD would be 10 in length, and unless we shave off part of a TMU block, it's not going to fit.
 
:oops: brain fade on my part - when I counted the columns on the diagram I wasn't paying attention to the fact that 16 single elements (pixels) are linked to a quad of TUs. I should have said 4 SIMDs. ARGH :oops:

In RV670, for example, the width of the TUs (16) agrees with the width of a SIMD, so the SIMDs are 16 wide.

If RV770 has 40-wide TUs then the naive interpretation is that the SIMDs are also 40-wide. 160 elements in a batch?
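The batch-size arithmetic here: an R6xx-style SIMD executes a batch over 4 clocks, so batch size is SIMD width times 4. A sketch of the two interpretations (these widths are the thread's speculation, not confirmed specs):

```python
def batch_size(simd_width, cycles_per_batch=4):
    """R6xx-style batch: each SIMD lane handles one element per clock, over 4 clocks."""
    return simd_width * cycles_per_batch

print(batch_size(16))  # RV670-style 16-wide SIMD -> 64-element batches
print(batch_size(40))  # speculated 40-wide SIMD -> the 160-element batch above
```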

---

I've just noticed that the diagram appears to have no vertex-fetch-specific units: there are no address and point-sampling units, just bilinear addressing, sampling and filtering. If true, that seems like quite a big change architecturally - and presumably saves a bit of space.

Jawed
 
Tweak Town

The other thing I have to say before I wrap this all up is that I’ve tested the HD 4850, and I’ve tested it in Crossfire. Now, if I hadn’t tested those cards I may have been more impressed with the GTX 280, but I have. I’ve seen the performance figures the cards put out. We also know the price on a pair of HD 4850s is going to be under $600 AUD, while the new GTX 280 in stock form seems to be launching at the absolute cheapest in Australia in the low $700 AUD area. Ouch.
 

Well, that's the problem with all these GTX reviews: when you read them, they are not set in the context of the upcoming competition from ATI. I think when the 4850/4870 arrive, it will change how anyone looks at those GTX reviews quite a lot.

I'm glad that for a change we're not having to wait months for ATI to respond to Nvidia's latest monster chip and hopefully we won't be disappointed by ATI like we have been so many times in recent years.
 