AMD: R7xx Speculation

Status
Not open for further replies.
You'll only be waiting for 2.5 hours.

Random question answering from the thread and notes on things, since I won't have anything for NDA expiry (it's already 2am here, I'm not done, and I have work in the morning sadly):

  • FP16 filtering is half speed, and the samplers are limited by available interpolators (only 32 texcoords/clk) when processing INT8 at full speed
  • 260mm2, 960M transistors or so
  • Huge focus on area efficiency and perf/watt
  • Chip was pad limited in the beginning***, so the last couple of SIMDs are value adds that weren't originally planned for. Explains the first point a little bit.
  • ROPs are mostly 2x everywhere measured with MSAA and really help the chip go fast. ROP MSAA downfilter this time.
  • Seems to be 64KiB L1 per sampler with huge L1 bandwidth, new design
  • Finding peak rates everywhere on the chip has been easy. I've seen 1Tflop FP32, full bilinear rates and peak INT8 blend and Z-only writes (64 Zs/clock, yay!)
  • GDDR5 is really what lets the chip kick some real ass
  • GS perf is up compared to RV670, maybe a new (bigger) coalescing stream out cache there, and more threads in flight
  • Colour cache is per ROP, same as R6
  • 16KiB per SIMD shared memory + 16KiB global (not the SMX)
  • All 800 SPs can do integer, fat one only for specials still. 1 FP32 MAD or 1 FP<->INT or 1 integer op/clock for all
  • New caching scheme for the RF I think
  • Orlando-led design but very distributed in terms of teams. Scott Hartog led, and he worked on i740 at Intel
  • Over 5MiB of SRAM on-chip if you count every pool
  • New UVD block with 7.1 over HDMI
  • No ring bus MC, new controller nice and efficient due to new ROP design
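As a rough sanity check on the 1Tflop figure, the peak FP32 rate follows from the SP count and the one-MAD-per-clock rate in the list above. This is a minimal sketch, not a measurement: the clock speeds are assumptions that do not come from this post, and the function names are made up for illustration.

```python
# Back-of-envelope peak FP32 throughput.
# From the post: 800 SPs, each doing 1 FP32 MAD per clock.
# Assumption (not from the post): a MAD counts as 2 flops, and the
# clock speed is whatever you plug in.

def peak_fp32_gflops(sp_count, clock_mhz, flops_per_op=2):
    """Peak GFLOPS if every SP issues one MAD every clock."""
    return sp_count * flops_per_op * clock_mhz / 1000.0


def clock_for_target(sp_count, target_gflops, flops_per_op=2):
    """Clock in MHz needed to reach a target peak GFLOPS figure."""
    return target_gflops * 1000.0 / (sp_count * flops_per_op)


if __name__ == "__main__":
    # 750 MHz is a hypothetical clock chosen for illustration.
    print(peak_fp32_gflops(800, 750))
    # Clock at which 800 SPs reach exactly 1 Tflop of MAD throughput.
    print(clock_for_target(800, 1000))
```

Under those assumptions, 800 SPs reach 1 Tflop of MAD throughput at 625 MHz, so any clock at or above that fits the figure quoted in the list.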

It's the single most impressive graphics processor (and pairing with a memory technology, nice one Joe!) I've ever seen, when looked at as a whole. I don't say that lightly either, there have been some winning chips over the years.

Deeply impressive and really deserves to get ATI back on the map when it comes to performance 3D graphics. Sorry I won't have anything more filling for arch NDA expiry, go read hardware.fr, Tech Report and Morgoth's pieces if you want more data.

*** :LOL:^infinity, that's honestly the best thing ever
 
All 800 SPs are fat. 1 FP32 MAD or 1 special or 1 FP<->INT or 1 integer op/clock each


Wow! So instead of 1 out of 5 being more capable, now they're all the same? That's amazing they added so many SPs and at the same time increased their functionality.

Is there anything about the 4870X2?
 
FP16 filtering is half speed, and the samplers are limited by available interpolators (only 32 texcoords/clk) when processing INT8 at full speed
Ahh that explains the int8 numbers... Test apps using the same coords for multiple textures shouldn't be showing this, though (are there any?)

Chip was pad limited in the beginning***, so the last couple of SIMDs are value adds that weren't originally planned for. Explains the first point a little bit.
Not sure I follow this. Why would it be pad limited if it's larger than rv670? Shouldn't it have only a slightly larger pad count?
ROPs are mostly 2x everywhere measured with MSAA and really help the chip go fast. ROP MSAA downfilter this time.
Really? Surprising to see AMD would get back to fixed function resolve (for basic modes) - with tons of shader alus what's the point?

All 800 SPs are fat. 1 FP32 MAD or 1 special or 1 FP<->INT or 1 integer op/clock each
Hmm I find that hard to believe. The pics don't indicate that.
 
All 800 SPs are fat. 1 FP32 MAD or 1 special or 1 FP<->INT or 1 integer op/clock each

Nice! It's hilarious that they had to add ALUs just to fill up space. That's impressive :LOL:

Deeply impressive and really deserves to get ATI back on the map when it comes to performance 3D graphics.

No doubt. It does look like one sleek mofo. Good job AMD/ATI.
 
FP16 filtering is half speed, and the samplers are limited by available interpolators (only 32 texcoords/clk) when processing INT8 at full speed
Where the heck are the texcoords interpolated? Odd that it doesn't scale with SIMD count.
Huge focus on area efficiency and perf/watt
Why this wasn't always the case is beyond me...
Finding peak rates everywhere on the chip has been easy. I've seen 1Tflop FP32, full bilinear rates and peak INT8 Z-only (256 Zs/clock, yay!)
INT8 Z-only? What do you mean by that?
All 800 SPs are fat. 1 FP32 MAD or 1 special or 1 FP<->INT or 1 integer op/clock each
Ridiculous. We gotta see some GPUBench numbers.
No ring bus MC, new controller nice and efficient due to new ROP design
Is this partly due to 1 ROP quad per memory channel, like NVidia's products since G80?

It's the single most impressive graphics processor (and pairing with a memory technology, nice one Joe!) I've ever seen, when looked at as a whole. I don't say that lightly either, there have been some winning chips over the years.
That's some serious praise! I gotta agree with you, though, particularly when looking at ATI's recent track record. Before RV770, I honestly thought NVidia was just more talented.
 
INT8 blend, sorry about that. My specials rate is wrong, only the fatter unit can do that (corrected that too). Integer for them all though.

The controller seems designed around that being the case, where each quad ROP connects to a 64-bit memory partition and its L2 pool (and other caches).
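The partition layout described here lends itself to some quick arithmetic. The sketch below assumes four quad ROPs and a per-pin GDDR5 data rate passed in by the caller. Neither number is stated in this thread, so treat both as placeholders. Only the 64-bit partition per quad ROP and the 64 Zs/clock figure come from the posts.

```python
# Rough arithmetic for "one quad ROP per 64-bit memory partition".
# From the thread: 64-bit partition per quad ROP, 64 Zs/clock peak.
# Assumptions (not from the thread): the number of quad ROPs and the
# GDDR5 per-pin data rate are hypothetical inputs.

def bus_width_bits(rop_quads, bits_per_partition=64):
    """Total memory bus width if each quad ROP owns one partition."""
    return rop_quads * bits_per_partition


def bandwidth_gb_per_s(bus_bits, data_rate_gbps_per_pin):
    """Peak bandwidth in GB/s for a bus width and per-pin data rate."""
    return bus_bits * data_rate_gbps_per_pin / 8


def z_per_rop_per_clock(total_z_per_clock, rop_quads, rops_per_quad=4):
    """Z samples each individual ROP must handle per clock."""
    return total_z_per_clock / (rop_quads * rops_per_quad)


if __name__ == "__main__":
    quads = 4  # hypothetical count of quad ROPs
    width = bus_width_bits(quads)
    print(width)
    print(bandwidth_gb_per_s(width, 3.6))  # 3.6 Gbps/pin is an assumed rate
    print(z_per_rop_per_clock(64, quads))
```

With four quads assumed, the bus comes out at 256 bits and each ROP handles 4 Zs per clock to reach the 64 Zs/clock peak.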
 
Ahh that explains the int8 numbers... Test apps using the same coords for multiple textures shouldn't be showing this, though (are there any?)
I think lots of apps do this, but I'm not sure about test apps.

I thought 3DMark06 was like this because it always showed higher texture rate than D3DRightMark for G84 onwards, but obviously I was wrong.
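To make the interpolator limit concrete, here is a toy model of the texture rate as the smaller of two limits: what the samplers can filter, and what the interpolators can feed them. The 32 texcoords/clk limit and the half-speed FP16 filtering come from the thread. The sampler count of 40 is an assumption for illustration, and the function is hypothetical, not anything from a real tool.

```python
# Toy model: texture fetches per clock are capped by whichever is lower,
# the sampler filtering rate or the supply of interpolated texcoords.
# From the thread: 32 texcoords/clk, FP16 filtering at half speed.
# Assumption (not from the thread): 40 samplers on the chip.

def texel_rate_per_clock(samplers, interp_per_clock, fmt="int8",
                         textures_per_coord=1):
    """Filtered fetches per clock under the two limits."""
    # FP16 filtering runs at half the INT8 rate.
    filter_rate = samplers * (0.5 if fmt == "fp16" else 1.0)
    # Reusing one interpolated coord for several textures multiplies
    # how many fetches each texcoord can feed.
    coord_limit = interp_per_clock * textures_per_coord
    return min(filter_rate, coord_limit)


if __name__ == "__main__":
    print(texel_rate_per_clock(40, 32))                        # INT8, unique coords
    print(texel_rate_per_clock(40, 32, fmt="fp16"))            # FP16, unique coords
    print(texel_rate_per_clock(40, 32, textures_per_coord=2))  # INT8, coords shared
```

In this model INT8 with unique coords is capped by the interpolators, FP16 is capped by the samplers, and sharing one set of coords across two textures lifts the INT8 cap. That matches the point above about apps that reuse coords.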

Not sure I follow this. Why would it be pad limited if it's larger than rv670? Shouldn't it have only a slightly larger pad count?
I guess it's the power/ground pins, and maybe some more for R700... ;)

Really? Surprising to see AMD would get back to fixed function resolve (for basic modes) - with tons of shader alus what's the point?
Maybe the rate at which samples can be fed back to the shader is limited. Another reason could be that the shaders can do something else in the meantime.
 
INT8 blend, sorry about that. My specials rate is wrong, only the fatter unit can do that (corrected that too). Integer for them all though.
Okay. So you still mean 256 Zs per clock? Are you talking about reads/tests (i.e. z-rejection rate) or writes?

Does that mean 16xAA with almost no perf hit? :LOL:
 
Okay. So you still mean 256 Zs per clock? Are you talking about reads/tests (i.e. z-rejection rate) or writes?

Does that mean 16xAA with almost no perf hit? :LOL:
Let's pretend I didn't brainfart and x4 shall we :LOL:
 
Thanks for the talking points, Rys, and looking forward to the article (duh). Was AMD (read: Fusion) at all responsible for some of the left-field area efficiency increases, or is it all "ATI"?
 
Was AMD (read: Fusion) at all responsible for some of the left-field area efficiency increases, or is it all "ATI"?
It's presumably all ATI, given the timescales for design. Nice question for us to ask Scott, though, or maybe Wavey knows and can spill the beans. 4am here and the sun is coming up, I'm out for a few hours.
 