AMD RV770 refresh -> RV790

2560*1600*4 bytes/pixel ≈ 16MB :oops:, that's way too much. Does Xenos render to its eDRAM in tiles, or does it operate at a lower resolution?

Perhaps 2 or 3 shrinks later. :)
 
AFAIK Xenos uses its eDRAM with 2x2 viewport tiling if either 4xMSAA is enabled or the resolution exceeds 720p.
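For anyone wanting to check the arithmetic: a quick Python sketch of framebuffer sizes against Xenos's 10MB of eDRAM. The 8 bytes per sample (4 colour + 4 depth/stencil) is an assumed figure for illustration, not a documented number for every format.

```python
# Rough framebuffer-size arithmetic for Xenos-style eDRAM tiling.
# Assumes 4 bytes colour + 4 bytes depth/stencil per sample (an
# assumption on my part, not a documented figure for every format).
import math

EDRAM_BYTES = 10 * 1024 * 1024  # Xenos has 10 MB of eDRAM

def tiles_needed(width, height, msaa=1, bytes_per_sample=8):
    """Return (framebuffer bytes, number of eDRAM tiles required)."""
    fb_bytes = width * height * msaa * bytes_per_sample
    return fb_bytes, math.ceil(fb_bytes / EDRAM_BYTES)

for w, h, aa in [(1280, 720, 1), (1280, 720, 4), (2560, 1600, 1)]:
    size, tiles = tiles_needed(w, h, aa)
    print(f"{w}x{h} {aa}xMSAA: {size / 2**20:.1f} MiB -> {tiles} tile(s)")
```

So 720p without MSAA fits in one pass, but 4xMSAA at 720p (or anything like 2560x1600) has to be split into multiple tiles.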
 
The three leadoff 40nm chips are RV740, GT216 and GT218.
Two AMD 40nm chips have been rumoured. Apart from RV740, one could be RV870 or "RV720". In the first half of this year, though, I think we can discount RV870. And then there's the scarcity of "RV720" rumours.

Not sure what else comes after; AMD also needs something to replace their IGP. Also, I swear I read last week somewhere that Jen-Hsun Huang let it slip that Ion would be DX11 before the end of the year (will spend the rest of today looking for this link ;)).
IGPs being what they are (glacial), it is definitely possible they are the other 40nm chip not to be seen for most of the year.

The rumour I heard was ~290mm2, and that was last year. Wasn't sure if the extra size was due to the change from 55GP to 55GT or whether they had added extra units.
Is there a precedent for GP->GT causing a chip to grow substantially?

Sorry, it's not 40nm; the VR-Zone link said 1.3V, above TSMC's 1.2V max for 40nm... exceeding that on the first chip doesn't seem very probable.
I see VR-Zone has an "update from CJ" attached, which says there'll be an OC version of RV790 bringing 25-30% performance gains. Assuming that's based on clocks and extra clusters, then I suppose 12 clusters would easily fit in 290mm2. 750MHz with 12 clusters is 20% faster; at 850MHz that's 36% faster.

It's interesting they're planning to try and put out a US$300 part; that seems way high without a lower-priced volume part. Maybe they did something to the MC/RBEs so it only really works well with GDDR5; Samsung expects GDDR5 to be 20% of the market this year.
$300 would be the higher-clocked (XT, or OC as VR-Zone is calling it) GPU, but if it's only 10-15% faster than the Pro GPU then pricing does seem screwy, relying solely upon the difference in memory between them.

Indeed, one way to fit the rumour is that Pro is 12 clusters at 750MHz with 512MB GDDR3 at 975MHz (HD4850 is 993MHz), producing "20% higher performance than HD4870-512MB" only if you count situations when bandwidth isn't a constraint. The XT would then be 825MHz with 900MHz GDDR5, for a supposed "25-30% gain over HD4870-512MB"...
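A quick sketch of the cluster/clock arithmetic above, using HD4870's 10 clusters at 750MHz as the baseline. This is pure throughput scaling; it deliberately ignores bandwidth and RBE limits.

```python
# Back-of-envelope throughput scaling vs HD4870 (10 clusters @ 750 MHz).
# Pure ALU/texture-rate scaling; ignores bandwidth and RBE limits.
BASE_CLUSTERS, BASE_CLOCK = 10, 750

def speedup(clusters, clock_mhz):
    """Relative peak throughput vs the HD4870 baseline."""
    return clusters / BASE_CLUSTERS * clock_mhz / BASE_CLOCK

print(f"12 clusters @ 750 MHz: +{(speedup(12, 750) - 1) * 100:.0f}%")  # +20%
print(f"12 clusters @ 825 MHz: +{(speedup(12, 825) - 1) * 100:.0f}%")  # +32%
print(f"12 clusters @ 850 MHz: +{(speedup(12, 850) - 1) * 100:.0f}%")  # +36%
```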

Jawed
 
Yes, it looks like 12 SIMD units.

Source: AMD RV790 and RV740 not before April

Hmm, ~30% increase in max throughput. Only a 20% increase in ALUs suggests that power was a constraint, so they decided to reduce the number of transistors and adjust by increasing clocks instead. Though I was expecting a larger increase in the ALU count, in the 40-60% range. :rolleyes:

In that case, however, the die size should definitely be smaller than 200mm2. I'd guess around 160-180mm2. If they are aiming for a $200/300 price point with it, then it should have great margins compared to RV770. But I doubt they will attempt a 4890 X2 or whatever.
 
Carsten,

I know you understand German, so you can read that the information is attributed to AMD and another, unpublished source.
 
According to my info: RV790 is mid to end of April. Both versions of RV740 (9600GT and 9800GT competitors, target prices ~$119 for 512MB GDDR5, ~$99 for 1GB DDR3, A11 currently clocked at a 700MHz engine) should be in May. But hey, it could be dated already with all the smoke and mirrors AMD have been pulling lately.
 
ALU count seems reasonable, but memory bandwidth seems to have very little growth compared to RV770, which was already texturing limited in certain games.
Texturing shouldn't be bandwidth constrained generally.

This makes me wonder: could AMD decide to implement some eDRAM (i.e. Xenos-like) on future parts (maybe not RV790/740, but RV8xx perhaps, or maybe even later)? That would relieve a lot of bandwidth requirements. It could be on-die if necessary, and since display resolutions don't grow as fast as Moore's law, the cost could be manageable/negligible in the not-too-distant future.
D3D11 arguably needs a major rejig in on-die memory architecture because pixel shading is allowed to read and write render targets.

R600 architecture already allows registers to be read from and written to memory locations. It supports this functionality through the memory read/write cache, which should optimise for locality (to a degree, anyway).


The Sequencer's main duties are:
  • scheduling ALU clauses
  • scheduling TU clauses
  • moving data into/out-of registers
  • fetching constant cache lines
  • controlling branching and manipulating/testing the stack
The ability to move data into and out of registers (MEM_SCRATCH instructions, according to the R600 ISA document) may well be the key to supporting pixel shader reading/writing of render targets. i.e. simply reserve a register per render target per pixel (or sample).

Or perhaps a new kind of fetch clause type will be defined in addition to vertex fetch and texture fetch clauses, i.e. pixel fetch. This boils down to how ordering of triangles is handled, because fetching a pixel/sample from a render target must be strictly ordered by triangle for each location.

If you look at the end of a pixel shader program you'll see a Sequencer export instruction. This specifies the registers that are written to the render targets, i.e. translating from a register location into a memory location - though of course in this situation the pixel has to pass through the RBE that handles that memory location (according to tiling of render targets in memory).

I'm wondering if all colour blend operations will be performed by adding instructions into the pixel shader to read then blend.

So, the overall effect of D3D11 pixel reading/writing on the ATI architecture could be fairly minimal. It may be that the register file has to increase in capacity simply to deal with the additional latency that this kind of manipulation generates.
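As a toy illustration of that read-then-blend idea, here's how a couple of blend modes look as plain shader-style math, written in Python. The function names and the tiny "render target" are purely illustrative, not any real ISA or API.

```python
# Toy model of "blend in the pixel shader": read the destination colour,
# combine with the source, write back. Names/ops are illustrative only.
# Colours are (r, g, b, a) tuples in [0, 1].

def blend_src_over(src, dst):
    """Classic SRC_ALPHA / ONE_MINUS_SRC_ALPHA blend as shader math."""
    sa = src[3]
    return tuple(s * sa + d * (1.0 - sa) for s, d in zip(src, dst))

def blend_max(src, dst):
    """Min/max modes are just per-channel compares, much like a Z test."""
    return tuple(max(s, d) for s, d in zip(src, dst))

render_target = {(0, 0): (0.0, 0.0, 1.0, 1.0)}  # one blue pixel

# The "pixel shader" fetches the RT value (ordering must be preserved
# per pixel location), blends, and exports the result.
dst = render_target[(0, 0)]
src = (1.0, 0.0, 0.0, 0.5)  # half-transparent red fragment
render_target[(0, 0)] = blend_src_over(src, dst)
print(render_target[(0, 0)])  # (0.5, 0.0, 0.5, 0.75)
```

The key hardware problem isn't the math (a couple of MADs); it's guaranteeing that the fetch of the destination colour is strictly ordered per pixel location, per triangle.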

Jawed
 
RV740 vs 9600GT and 9800GT? Wow, that should be a fun bloodbath.

I'm wondering if all colour blend operations will be performed by adding instructions into the pixel shader to read then blend.

What does fixed function blending cost relative to other ROP bits like AA and compression etc? Also would pixel shader blends imply that AA happens there too? And if it does how would AA sample compression work in the shaders? Seems like a lot of stuff would slow down and/or bandwidth efficiency would be lost.
 
Carsten,

I know you understand German, so you can read that the information is attributed to AMD and another, unpublished source.

I am referring purely to your posting here. Haven't read HW-Infos in a while.
 
Yes, it looks like 12 SIMD units.
Source: AMD RV790 and RV740 not before April
The 12 SIMDs seem like your own speculation based on "more SIMD" and "+30% overall performance"?
This would IMHO be too small a chip in 40nm for 256-bit GDDR5. Sounds more likely for the 55GT variant (although I don't believe in a redesign and a more expensive chip for +20% performance, especially not from AMD).
The 16 SIMDs seem more plausible regarding size; however, this would probably be quite RBE limited unless something is done about those. Maybe 16 SIMDs and nothing else would come close to +30% actual game performance at the same clock?
We still don't know a whole lot about the RV740 RBEs - could they use those in the '90? Do we even have a reliable size estimate on the '40?
 
Lots of good stuff :D
The 12 SIMDs seem like your own speculation based on "more SIMD" and "+30% overall performance"?
This would IMHO be too small a chip in 40nm for 256-bit GDDR5. Sounds more likely for the 55GT variant (although I don't believe in a redesign and a more expensive chip for +20% performance, especially not from AMD).
Yeah, agreed 12 clusters implies 55nm. But I still think 12 clusters is pure guesswork.

R580 was a "20%" refresh of R520, ~20% bigger die and about 20% better performance under some conditions around the time of launch.

But RV670 was merely a major cost-reduction of R600.

The 16 SIMDs seem more plausible regarding size; however, this would probably be quite RBE limited unless something is done about those. Maybe 16 SIMDs and nothing else would come close to +30% actual game performance at the same clock?
I suspect HD48xx GPUs are a bit short on texturing but with a bit more texturing they'd simply run into a wall with Z fillrate. I wish someone would make a concerted effort to test this stuff.

We still don't know a whole lot about the RV740 RBEs - could they use those in the '90? Do we even have a reliable size estimate on the '40?
RV740's RBEs are a big deal - I can't see how they can make use of ~60GB/s unless there are twice as many as in RV730 or the Z configuration is doubled.

I think I saw 100mm2 for RV740 rumoured, earlier in this thread? Seems way too low to me - somewhere in the region of 120-130mm2 with all the extra ALUs. I'm assuming that it'll be a 4:1 GPU with 8 clusters.

Totally wild speculation: the 290mm2 rumour for a 55nm RV790 could allow for 8x Z per colour in the RBEs, since 12 clusters wouldn't take it past 280mm2 :p
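For what it's worth, the ~60GB/s figure falls out of a 128-bit bus with GDDR5 at roughly 950MHz (the memory clock here is my assumption, purely to show the arithmetic):

```python
# Memory bandwidth = bus width (bytes) * clock * data rate.
# GDDR5 transfers 4 bits per pin per clock; GDDR3 transfers 2.
def bandwidth_gbs(bus_bits, mem_mhz, data_rate=4):
    """Peak bandwidth in GB/s for a given bus width and memory clock."""
    return bus_bits / 8 * mem_mhz * data_rate / 1000

print(f"128-bit GDDR5 @ 950 MHz:  {bandwidth_gbs(128, 950):.1f} GB/s")      # 60.8
print(f"128-bit GDDR3 @ 1000 MHz: {bandwidth_gbs(128, 1000, 2):.1f} GB/s")  # 32.0
```

Which also shows why the GDDR3 version of RV740 would be a very different beast.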

Jawed
 
Can somebody explain to me the meaning of being fillrate limited? Also, what is fillrate anyway? Is it the max number of pixels a GPU can push out while running trivial pixel shaders? If yes, then why is it quoted for fixed-function GPUs?
 
Depends on the type of fill-rate you're talking about. Vanilla pixel fill-rate is basically how fast the card can write dumb fragments to the frame-buffer. So you're basically flat-shading polygons - no texturing, no shader code. It's a measure of ROP throughput and in practice is significantly bandwidth bound. Why shouldn't it be quoted for FF stuff? Back then the only difference was that the ROP, texturing and shader pipeline were all combined.

Z-fillrate is an even "dumber" version where you're not writing color but only depth. Some architectures (especially Nvidia's) can accelerate this process by writing many more depth values per clock than they can write color. It's extremely useful for z-only passes in those algorithms that employ them.

And well, you know what texture fillrate is. Number of pixels * Number of textures per pixel.
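To put numbers on it, peak fill-rate is just units times clock. The figures below use HD4870-style numbers (16 ROPs, 40 texture units, 750MHz); the 4x Z-only multiplier is an illustrative value, not a quoted spec for any particular chip.

```python
# Peak fill-rate = number of units * samples per unit per clock * clock.
def fillrate_gsamples(units, clock_mhz, per_clock=1):
    """Peak rate in Gsamples/s (pixels, texels or Z samples)."""
    return units * per_clock * clock_mhz / 1000

print(f"Pixel fill:   {fillrate_gsamples(16, 750):.1f} Gpixels/s")     # 12.0
print(f"Texture fill: {fillrate_gsamples(40, 750):.1f} Gtexels/s")     # 30.0
# Z-only rate can be a multiple of colour rate, e.g. 4x:
print(f"Z fill (4x):  {fillrate_gsamples(16, 750, 4):.1f} Gsamples/s") # 48.0
```

In practice the quoted peaks are rarely reached, because the memory bus can't sustain them for uncompressed writes.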
 
What does fixed function blending cost relative to other ROP bits like AA and compression etc?
Blimey, there's a tricky question, dunno.

http://msdn.microsoft.com/en-gb/library/bb205120(VS.85).aspx
http://msdn.microsoft.com/en-gb/library/bb204894(VS.85).aspx
http://msdn.microsoft.com/en-gb/library/bb204892(VS.85).aspx

It's just a bit of math :LOL: I doubt the ALUs are particularly costly these days, though it's worth noting that full fp32 functionality didn't make it into D3D10, so back in 2005, say, it was a relatively costly bit of math.

Compression should be separately handled and is essentially a function of the "memory hierarchy" part of the render back end, rather than the graphics-math part.

Some of the blend modes are min or max - similar to Z testing, basically. Arguably they're similar enough in functionality that they would all move into pixel shading simultaneously.

Also would pixel shader blends imply that AA happens there too? And if it does how would AA sample compression work in the shaders? Seems like a lot of stuff would slow down and/or bandwidth efficiency would be lost.
Remember AA resolve works fine in R6xx's pixel shaders and custom resolve functionality is something that deferred rendering engines can do.

Apart from a question of routing (and bandwidth) there's also a question of latency. If pixel-shader colour blending is implemented it increases latency, which inflates the amount of storage space on die given over to holding colour data.

Arguably handling that state is something that the register file and the out-of-order thread scheduling are perfectly adapted to do - why build a second one in the RBEs?

Of course as far as Larrabee's concerned, this is all just stuff in L2 to be manipulated :p

(The thought has occurred to me that once Larrabee style GPUs take over, GPU architecture just won't be at all interesting :cry: )

Jawed
 
(The thought has occurred to me that once Larrabee style GPUs take over, GPU architecture just won't be at all interesting :cry: )

Yeah no doubt. And it'll be even less interesting than CPU architectures as all the fancy logic for extracting ILP and achieving high single-threaded performance won't be in the picture. We aren't quite there yet though, there's still a lot of work to be done to deal with SIMD divergence.

In terms of blending in the shaders, it's probably going to be the first thing to move. Nvidia already introduced global memory atomics with CUDA. And DX11 requires more general access to the framebuffer anyway so if you've gotta do all that anyway then why not? I've got no idea what AMD has in store for DX11 but I figure Nvidia is going to invest even more heavily in CUDA next generation than they have to date. How much of that investment serves to improve game performance will determine whether they find themselves in the same perf/mm hole they're in today.
 
I think I saw 100mm2 for RV740 rumoured, earlier in this thread? Seems way too low to me - somewhere in the region of 120-130mm2 with all the extra ALUs. I'm assuming that it'll be a 4:1 GPU with 8 clusters.
Jawed
That was a very basic estimate, which more than one person came up with back in December, before we even had any hard info. Still, 100mm2 isn't too far off when a linear shrink of RV770 comes in right at ~140mm2.
Remove the sideport and two clusters and it should be pretty close.
Only have another 1-1.5 months to wait, until we have better info.
 
Hmm, TSMC's documentation indicates that area scaling from 55nm to 45/40nm should be ~0.55x, so on that basis you'd be right. I was under the impression it was more like ~0.67x :???: Can't find the posting that led me astray...
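The ideal scaling is just the linear shrink squared, e.g. (taking RV770 at ~256mm2):

```python
# Ideal area scaling between process nodes: (new / old) squared.
def area_scale(old_nm, new_nm):
    return (new_nm / old_nm) ** 2

print(f"55nm -> 40nm: {area_scale(55, 40):.2f}x")                       # 0.53x
print(f"RV770 (256 mm^2) shrunk: {256 * area_scale(55, 40):.0f} mm^2")  # ~135
```

Real chips never hit the ideal, of course; analogue blocks, pads and the memory interface barely shrink at all, which is how you get back up towards ~140mm2.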

Jawed
 