AMD: R8xx Speculation

How soon will Nvidia respond with GT300 to ATI's upcoming RV870 lineup of GPUs?

  • Within 1 or 2 weeks: 1 vote (0.6%)
  • Within a month: 5 votes (3.2%)
  • Within a couple of months: 28 votes (18.1%)
  • Very late this year: 52 votes (33.5%)
  • Not until next year: 69 votes (44.5%)

  • Total voters: 155
  • Poll closed.
No, addressing doesn't work like this. There's usually a 256MB window of address space per GPU, regardless of video memory size. The rest of the address space is allocated to other peripherals (the likes of the system clock, PCI devices, DMA hardware...); a chunk is reserved by the OS anyway, which is why 3.25GB is the most common usable-memory figure under Windows.
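Roughly, for a 4GB machine (the exact reservation varies from board to board):

$$4096\,\text{MB} \;-\; \underbrace{\sim 768\,\text{MB}}_{\text{GPU window, other MMIO, firmware}} \;\approx\; 3328\,\text{MB} \;=\; 3.25\,\text{GB}$$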
What? Where'd you get that from? Video memory is mapped 1:1 into the 32-bit address space, plus the rest.
 
I think only RTs can support R/W and this "works" because the fragment can only access its pixel. Don't know where to look to find this out for sure, though.

Given that one can map that RT's texture as a UAV, then you have R/W anywhere, right?

UAVs are bindable only as read or write though.

Not from what I can tell. Think about it this way: SRVs are read-only, so a read-only UAV wouldn't make any sense, and neither would a write-only one. Besides, G80 has had R/W global access through CUDA for a long time. Also, global atomic operations can return the value read before the atomic op, so you've got R/W there.
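For what it's worth, here's how that would look in SM5 HLSL once hardware shows up - a speculative sketch (buffer and names invented), with the same UAV read, written, and atomically updated with the pre-op value returned:

```hlsl
// Speculative sketch (names invented): one UAV, read and written
// in the same compute shader, plus an atomic that returns the pre-op value.
RWStructuredBuffer<uint> gData : register(u0);

[numthreads(64, 1, 1)]
void main(uint3 dtid : SV_DispatchThreadID)
{
    uint i = dtid.x;

    // Plain read-modify-write through the same UAV.
    uint v = gData[i];
    gData[i] = v * 2u;

    // Atomic add that also hands back the value *before* the add:
    // a read and a write in one operation.
    uint before;
    InterlockedAdd(gData[i], 1u, before);
}
```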

Perhaps you cannot have the same resource bound to multiple stages at the same time?


CS as a "rasteriser" for post-processing of a render target is pretty much the simplest use case as far as I can tell. Are you thinking of a volumetric space-filled (Z) walk through a particle system?

There are a lot of possibilities here; I won't know for sure until I play with hardware. I think deferred lighting will be the way to go with DX11, though I'm still wrapping my mind around the best way to handle dynamic shadows with all the new functionality. Anyway, possibly binning small lights into tiles and then using a full-screen CS pass to accumulate light (see the sketch below), or maybe separate CS passes over block-aligned screen-space bounding rects for larger lights, etc. IMO, to know for sure you need to know what is going to work best for shadows and then work around those constraints. I've got other crazy ideas which wouldn't really apply to a standard graphics engine...
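A rough sketch of that tile-binning idea in SM5 HLSL - every name and both helper functions are placeholders, not working engine code:

```hlsl
// Speculative sketch of CS light binning (all names/helpers are placeholders).
#define TILE 16
#define MAX_LIGHTS_PER_TILE 64

struct Light { float3 pos; float radius; };

StructuredBuffer<Light> gLights : register(t0);
RWTexture2D<float4>     gOutput : register(u0);

groupshared uint gsCount;
groupshared uint gsList[MAX_LIGHTS_PER_TILE];

// Placeholder: real version tests the light sphere against the tile frustum.
bool LightHitsTile(Light l, uint2 tile) { return true; }
// Placeholder: real version evaluates the BRDF for this light and pixel.
float3 ShadeLight(uint lightIdx, uint2 pixel) { return float3(0.01, 0.01, 0.01); }

[numthreads(TILE, TILE, 1)]
void main(uint3 dtid : SV_DispatchThreadID,
          uint3 gtid : SV_GroupThreadID,
          uint3 gid  : SV_GroupID)
{
    if (gtid.x == 0 && gtid.y == 0)
        gsCount = 0;
    GroupMemoryBarrierWithGroupSync();

    // One light per thread for brevity; a real pass would loop over the list.
    uint idx = gtid.y * TILE + gtid.x;
    if (LightHitsTile(gLights[idx], gid.xy))
    {
        uint slot;
        InterlockedAdd(gsCount, 1u, slot);   // LDS atomic bins the light
        if (slot < MAX_LIGHTS_PER_TILE)
            gsList[slot] = idx;
    }
    GroupMemoryBarrierWithGroupSync();

    // Accumulate this pixel's lighting from the tile's binned list.
    float3 acc = float3(0, 0, 0);
    uint n = min(gsCount, MAX_LIGHTS_PER_TILE);
    for (uint i = 0; i < n; ++i)
        acc += ShadeLight(gsList[i], dtid.xy);
    gOutput[dtid.xy] = float4(acc, 1);
}
```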

Hmm, well, I guess this is like being told to write a program with only one malloc(). Then again, multiple buffers will start to fragment memory if they have varying lifetimes. Virtualised memory (everything paged) is a partial solution, as contiguousness is no longer a prerequisite for being able to create the buffer. But I'm well out of my depth on this whole subject, gladly so.

Argh, malloc()... that kind of dynamic memory allocation doesn't scale.
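The GPU-friendly alternative is usually a single atomic bump on a shared counter into one big pre-allocated buffer - a hedged sketch (all names invented):

```hlsl
// Hypothetical bump allocator: one big pre-allocated buffer plus a counter,
// instead of a general-purpose heap.
RWStructuredBuffer<uint>   gNextFree : register(u0); // gNextFree[0] = bump pointer
RWStructuredBuffer<float4> gHeap     : register(u1); // the one big "malloc"

[numthreads(64, 1, 1)]
void main(uint3 dtid : SV_DispatchThreadID)
{
    // Reserve 4 elements: one atomic, no locks, scales across all threads.
    uint base;
    InterlockedAdd(gNextFree[0], 4u, base);

    for (uint i = 0; i < 4; ++i)
        gHeap[base + i] = float4(dtid.x, i, 0, 0);
}
```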
 
I'm struggling to find specifics on this, without resorting to downloading/installing the SDK to see what those documents say - I dunno what side effects doing that would have on my PC. MSDN library seems bloody useless.

Not from what I can tell. Think about it this way: SRVs are read-only, so a read-only UAV wouldn't make any sense, and neither would a write-only one. Besides, G80 has had R/W global access through CUDA for a long time. Also, global atomic operations can return the value read before the atomic op, so you've got R/W there.
I wasn't aware of SRVs (or forgot). Anyway, I think you're right on all this.

Perhaps you cannot have the same resource bound to multiple stages at the same time?
Yeah.

A CS has a "draw call" all to itself, so there's no feeding it from/to the rest of the pipeline in parallel with other stages. Chas Boyd at the end of his Gamefest 2008 talk vaguely referred to the potential, in future, to bind multiple D3D devices concurrently as a way to do this, making the application responsible for ensuring things work. I guess in some horribly coarse-grained fashion.

Jawed
 
Evergreen is the entire DX11 family name according to my source at Computex.

It would be nice if it works out that way. The opposite, with "Cypress" as the family and "Evergreen" as a specific chip, would likely incite violence amongst plant people. The trees already named are natives of California and Texas; the remaining code names are likely to be related trees growing in that region as well.

Putting aside whether the wafer shown is a midrange or a performance chip: can we at least agree that AMD will need two chips below it to cover the market? I.e. a value/entry-level GPU and a larger one for higher-end notebooks. Especially now that Nvidia has two new products in these segments.

Finally, be careful doing analyses that compare the displayed DX11 chip with the RV740. The RV740 was the first chip on a new process and likely has additional redundancy and units using up area to make it more certain that it actually works (i.e. there is a high chance of "ghosts" in the chip). This might not happen so much in follow-on chips as the process becomes better understood.
 
Just a wild theory here. We know that AMD had fully functional samples of RV740 in February... By that time they probably also knew that TSMC botched the 40nm process. ... they learned that nVidia is several months behind them with their DX11 offering, so they decided to delay RV870 and use a different process node for it (....GlobalFoundries' 32nm) RV840 was sufficiently small to have relatively good yields on 40nm though, so they started manufacturing it.

I totally agree with you here that they won't push too hard, it seems, but I find it hard to believe that they'd name their 50%-better chip RV840 (sticking to today's naming scheme) rather than something like RV870, with an RV880 going onto GF's 32nm, as I mentioned earlier, as something like what RV870 was supposed to be. Because they won't be able to compete with the Larrabee monster even if it's not a native graphics processor.
And after all, it seems that German site from two months ago (that table with 1200SP @ 900MHz) actually had real inside info.


I don't know, but to me, showing off their DX11 silicon (albeit behind the curtain) certainly sounds like sending a clear message.
Of course there's still the option that the chip is actually bigger than 180 mm² and can support a 256-bit bus, but I'm not convinced that a GPU this size can replace RV770/RV790. GPUs are continually getting larger and nowadays, 200 mm² is almost the value segment.

Unfortunately we won't see RV870 being a mind-blowing chip, but it's a "decent successor" to RV790, just as R600 was a decent successor to RV580+. At least this time they'll have the time advantage they didn't have with the twice-rescheduled R600, which was late by almost a year. TSMC's 80nm was buggy and undeveloped, with a life of its own, arriving a year to a year and a half after TSMC's 90nm, while 45nm/40nm are pretty much the same generation, and even more shrunk compared to Intel's current 45nm; back in the 80nm days, Intel had already had 65nm for a whole year.

It's simple: an ultimately-shrunk, advanced node like 40nm gives AMD much more breathing space than buggy 80nm ever did. And they'll do it because they can: a proven SP pipeline, a DX11 implementation that should be only a small step beyond the real DX10 implementation R600 already had, and a node that's far past the prototype stage 80nm was at when the monstrous R600 was originally scheduled.
 
I'm probably reading too much into the past here, but I don't think we should count too much on die size to indicate what units are on the chip. For instance, RV670 at 192mm² and RV770 at 265mm² went from 320 to 800 SPs: 2.5x the shaders for 38% more die size.

So if we compare across the same process, RV740 to this new die, even if it's 137mm² -> 180-ish mm², fitting 2.5x more shaders would mean 1600 SPs.
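Spelling that loose proportionality out (RV740 is a 640-SP part):

$$\frac{265}{192} \approx 1.38 \;\Rightarrow\; 2.5\times\ \text{SPs}; \qquad \frac{180}{137} \approx 1.31, \quad 640 \times 2.5 = 1600\ \text{SPs}$$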

You can't just multiply like that on architectures that are intended for more tightly packed (advanced) computing APIs. We now need native tessellation shaders, which are still software but differently threaded, according to that threaded "gearbox" slide. So there's a lot more arbitration and scheduling inside the same SIMD core under DX11, no matter how similar it is to the DX10.1 design they already had.

I haven't seen this posted anywhere, though it is old, so here it is.

It's pretty much Theo chewing over the same story from that interview:
http://www.brightsideofnews.com/print/2009/5/26/the-future-of-globalfoundries-revealed.aspx
Anyway, it seems they're bragging that everything is finally going according to plan. With the recent cash inflow and all the effort they put into the last two node implementations, it's finally time for them to make some profit on their investment. Not that they didn't in the past three years, just not in the banking and oil way.
 
=>kerrito: After I wrote the post, I was doing a little deeper analysis, mainly about the growing die sizes, and I came to the conclusion that the card that VR-Zone photographed at Computex is probably a "RV870 Pro". Of course naming is irrelevant, what's important is whether there will be a bigger and faster chip than the one seen at Computex. Right now I don't think so. 32nm is a different story, but such chips will be available at least six months after first DX11 parts.
 
=>kerrito: After I wrote the post, I was doing a little deeper analysis, mainly about the growing die sizes, and I came to the conclusion that the card that VR-Zone photographed at Computex is probably a "RV870 Pro". Of course naming is irrelevant, what's important is whether there will be a bigger and faster chip than the one seen at Computex. Right now I don't think so. 32nm is a different story, but such chips will be available at least six months after first DX11 parts.

Yep. But that might be just enough to keep Larrabee from taking a huge chunk of the market, knowing how easily Intel penetrates markets, especially with old-school fanboyism. I think they're already developing libraries for GF's 32nm node, but as with all first attempts, that 32nm @ GF could have some side effects, if there really is a lot of difference between TSMC's long-improved process, with all their experience with huge chips, and GF, which has so far avoided producing anything bigger than 250mm². Maybe that's why AMD has stuck with their mainstream roadmap since R600, so that they could seamlessly move production into their own fabs.

Anyway, with the 32nm experience they now have in prototype production, they could easily produce a 200-220mm² 32nm chip with 4TFlops just a few months (6-9 months) after the original RV870 launch, depending on Larrabee's performance and how much it threatens the GPU market. Then again, ATI Stream is pretty much at the beta stage, so it might depend on that development too. It would be a big move, like the R520->R580 step, when the latter improved the poor 9.0c-compliant R423 into a chip with much better performance, and also on just a half-node, 110nm->90nm.
We could then see the whole RV8x0 series as small chips <220mm² that add extra value to AMD's profit margins.
 
I'm not all that sure that Larrabee will significantly change the battlefield. Intel wants to capture the high-margin markets, meaning primarily GPGPU. I'd say that this time fanboyism will play to AMD/ATI's and nVidia's advantage, since in GPUs they are the companies with tradition. Maybe a lot of people will associate Larrabee with Intel's integrated solutions, so when they see an Intel GPU their first thoughts will be "bad drivers, bad performance"...
 
=>kerrito: After I wrote the post, I was doing a little deeper analysis, mainly about the growing die sizes, and I came to the conclusion that the card that VR-Zone photographed at Computex is probably a "RV870 Pro". Of course naming is irrelevant, what's important is whether there will be a bigger and faster chip than the one seen at Computex. Right now I don't think so. 32nm is a different story, but such chips will be available at least six months after first DX11 parts.
That's pretty curious reasoning I reckon, so care to expound?

You appear to be suggesting that this is a 256-bit memory bus chip and will be the basis of the enthusiast level X2 part.

This chip is ~45mm² bigger than RV740. A 256-bit bus, alone, would consume at least 18mm² of that, maybe a bit more - that's just the physical connections, the MCs are extra. Call it 20mm² leaving about 25mm² for extra clusters, D3D11 architectural changes and performance increases.

An extra 2 clusters (to get back to the 10 clusters of RV770) would be about 12mm². (Each cluster is about 10mm² in RV770.)

And we don't know if this chip has a "cap ring". We don't know if RV740 has one, either. RV790's cap ring costs ~19mm². The perimeter of RV770 is 65mm, so if the cap ring is literally an addition round the entire chip's edge, that's about 0.29mm thick. Putting the same size ring on this chip, whose perimeter is about 54mm, costs ~16mm². Does it scale with process?
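As a sanity check on those ring numbers, treating the ring as a thin band around the edge, area ≈ perimeter × thickness:

$$65\,\text{mm} \times 0.29\,\text{mm} \approx 19\,\text{mm}^2, \qquad 54\,\text{mm} \times 0.29\,\text{mm} \approx 16\,\text{mm}^2$$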

So: two clusters 12mm² + the extra 128 bits of bus 20mm² + cap ring 16mm² = 48mm².

Alternatively, RV740 has a 2:1 RBE:MC ratio (16 pixels per clock fillrate with 128-bit bus). This might be some funkiness similar to how RV530 had better RBEs than R580 - or it could point towards all future ATI chips working with the same ratio. If the latter, then:

Extra 128 bits of bus 20mm² + 16 more RBEs + extra clusters? + D3D11 tweaks = ??mm². Sounds like a lot more than 45mm².

ATI chips need more fillrate I reckon (even if bandwidth on GDDR5 is not going to magically double this year). But there's a tricky factor here with the crossbar from L2s to TUs, where L2 count scales with fillrate. In RV770 there's 10 TUs for the 4 L2s. In RV740 it's 8:4 - what RV770 was supposed to have been. If ATI chips will have 32 RBEs, then connecting 8 L2s to 16 or more TUs looks troublesome.

Jawed
 
=>Jawed: I used a totally different approach, one totally ignorant of the underlying architecture… and somewhat easier to begin with. We saw a wafer with the chips and it's probably big enough to support a 256-bit bus.

Now, we know that ATI doesn't do big chips anymore (it would surprise me if they started again) and also doesn't do "odd" memory bus widths, such as 192 or 384 bits (yeah I know there's a first time for everything... but let's assume they'll keep the current system). That means a 384 or 512-bit chip is probably out of the question. And a bigger one with a 256-bit bus? Not likely - it would end up like G92 and G94, the former memory-starved and the latter replaceable by salvage parts (and it seems 40nm yields will favor salvage parts).

Then there's the argument of GPU die size constantly growing - but looking at historical data, some actually made a "step back"; it was those that used a new manufacturing process, such as the R520. 40nm is a big step from 55nm, so it seems logical that the step back in die size occurs with the 40nm parts.

By the way, will RV870 use the same transistor density as RV740? Because that seems to be the basis of your analysis. Waddyaknow, shit happens
 
=>Jawed: I used a totally different approach, one totally ignorant of the underlying architecture… and somewhat easier to begin with. We saw a wafer with the chips and it's probably big enough to support a 256-bit bus.
I think it would be the smallest GPU ever with a 256-bit bus, if it were. For what it's worth, I think it could be 256-bit.

But if this is like R600->RV670, then 128-bit would be fine. RV770/790 only really need their bandwidth for 8xMSAA or very high resolution/highest-quality settings. Mainstream GPUs tend not to be so luxuriously endowed in the following generation.

Now, we know that ATI doesn't do big chips anymore (it would surprise me if they started again)
Yet RV790 was larger - they didn't even bother ditching the apparently useless CrossFireX Sideport - seems like a low-investment refresh. Almost seems to indicate that in this price category the die size difference, 20mm² say, is no big deal.

and also doesn't do "odd" memory bus widths, such as 192 or 384 bits (yeah I know there's a first time for everything... but let's assume they'll keep the current system). That means a 384 or 512-bit chip is probably out of the question. And a bigger one with a 256-bit bus? Not likely - it would end up like G92 and G94, the former memory-starved and the latter replaceable by salvage parts (and it seems 40nm yields will favor salvage parts).
At 4xMSAA RV770/790 have bags of spare bandwidth - which means there's room for more RBEs. Which cost a lot of area.

Then there's the argument of GPU die size constantly growing - but looking at historical data, some actually made a "step back"; it was those that used a new manufacturing process, such as the R520. 40nm is a big step from 55nm, so it seems logical that the step back in die size occurs with the 40nm parts.
Which GPUs, specifically, are you referring to?

By the way, will RV870 use the same transistor density as RV740? Because that seems to be the basis of your analysis. Waddyaknow, shit happens
The best case scenario is that RV770 can be scaled down to ~181mm² using the density we see in RV740 (assuming CrossFireX Sideport is deleted), which gives 20mm² of extra bus + 12mm² of clusters, leaving 13mm² (~7%) for all other changes + performance. Another 2 clusters?

In other words the best case is that this chip could offer ~10% more performance than RV790. Doesn't sound to me like the best chip of AMD's introductory D3D11 generation.

As a 128-bit, 16 RBE chip with say 4 or 6 extra clusters, D3D11 changes and fast GDDR5, 1.2GHz? Making it a chip for 1680x1050x4xMSAA users, or 1920 on older games?

I dunno, but either way this doesn't look fast enough to be the basis of a $300 SKU or $550 X2 SKU.

Jawed
 
Jawed said:
Yet RV790 was larger - they didn't even bother ditching the apparently useless CrossFireX Sideport - seems like a low-investment refresh. Almost seems to indicate that in this price category the die size difference, 20mm² say, is no big deal.

That depends on time to market and the cost of a redesign to remove the Sideport (no point leaving the die area blank where the Sideport was, is there?).

Balance the two up and then there is your answer - it might be more expensive to remove a redundant feature. I am sure Dave knows better ;).
 
Jawed (and whoever else might want to comment), what do you think about Theo's comment "The alleged specifications of ATI Evergreen reveal that this chip is not exactly a new architecture, but rather a DirectX 11-specification tweak of the RV770 GPU architecture." This get-to-market-first approach seems rather ingenious to me; it doesn't seem like much in DX11 is radically different from what ATI had with DX10.1 (beyond the shared memory stuff). Bump up registers and local store to CS5 level, ensure 32-bit (u)int atomics work, bump the tessellation unit from the 16x to the 64x level required by DX11, and implement whatever is necessary to run VS+HS directly to global memory with a second pass for TS+DS+PS (GS as per R700). Append/consume is easy to do with atomics.

About the only possible wildcard I can see is DX11 R/W render targets. It seems (from what others have posted) as if DX11 has only 32-bit int and 32-bit unsigned int atomics, so using atomics on individual (8-bit or 16-bit) components of a 32-bit (or 64-bit) render target doesn't seem possible (you'd instead have to do a CAS on the entire pixel and a retry loop on CAS failure; 64-bit would be a mess without 64-bit atomics). That likely makes unordered R/W to render targets marginal in usefulness (using a full 32-bit value per component IMO is not an option unless working out of cached memory). So it seems as if the memory export ability of R700 would be fine for DX11 R/W RT access...
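For illustration, the CAS-and-retry fallback would look something like this in SM5 HLSL - a speculative sketch, helper and buffer names invented:

```hlsl
// Speculative sketch (names invented): atomically adding to one 8-bit
// channel of an RGBA8 pixel packed in a uint UAV, via CAS plus retry.
RWStructuredBuffer<uint> gPixels : register(u0);

void AtomicAddToChannel(uint pixelIdx, uint channel, uint amount)
{
    uint shift = channel * 8u;
    uint expected = gPixels[pixelIdx];

    [allow_uav_condition]
    for (;;)
    {
        // Unpack the channel, modify it, repack the whole 32-bit pixel.
        uint chan    = min(((expected >> shift) & 0xFFu) + amount, 0xFFu);
        uint desired = (expected & ~(0xFFu << shift)) | (chan << shift);

        // The swap only succeeds if nobody else wrote the pixel meanwhile.
        uint original;
        InterlockedCompareExchange(gPixels[pixelIdx], expected, desired, original);
        if (original == expected)
            break;           // CAS won
        expected = original; // lost the race: retry against the new value
    }
}

[numthreads(64, 1, 1)]
void main(uint3 dtid : SV_DispatchThreadID)
{
    AtomicAddToChannel(dtid.x, 1u, 16u); // e.g. bump the green channel
}
```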
 
That depends on time to market and the cost of a redesign to remove the Sideport (no point leaving the die area blank where the Sideport was, is there?).
Plus RV790 might only be in production until August. RV770/790 could well both disappear when D3D11 GPUs launch (assuming simultaneous launch of the chip we've seen plus the successor of RV790). RV710/740 would hold the value end of the market until their corresponding D3D11 parts appear, presumably within 1-2 quarters.

Jawed
 
We consider a 180mm² GPU a bit large for a 128-bit part... but what's the minimum size of a 128-bit GDDR5 part? Maybe ATi plans to shrink this GPU when a 32nm or 28nm manufacturing process becomes available, just like RV530->RV535, RV630->RV635 etc.

180mm² at 40nm means about 120-130mm² at 32nm or slightly over 100mm² at 28nm.
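Those figures are roughly ideal scaling, where area goes with the square of the linear feature size (real shrinks usually fall somewhat short of the ideal):

$$180\,\text{mm}^2 \times \left(\tfrac{32}{40}\right)^2 \approx 115\,\text{mm}^2, \qquad 180\,\text{mm}^2 \times \left(\tfrac{28}{40}\right)^2 \approx 88\,\text{mm}^2$$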
 
Jawed (and whoever else might want to comment), what do you think about Theo's comment "The alleged specifications of ATI Evergreen reveal that this chip is not exactly a new architecture, but rather a DirectX 11-specification tweak of the RV770 GPU architecture."
My original theory, going back to the arrival of R600, is that this is an architecture with a very long lifetime - and there are plenty of hints that ATI designed R600 with a lot more capability than D3D10 ended up with, as D3D10 was cut back. At the same time there are hints that 10.1 has capabilities beyond anything planned for 10. So it's all a bit murky.

It seems to me that GS/SO were particularly hard-hit, i.e. amplification was seriously curtailed. But if TS was in the future, what was anyone aiming for in GS/SO anyway?

At the same time I've been wondering whether the fine-grained and more pervasive nature of memory operations in D3D11 requires an overhaul - which is why I was asking about whether the pure latency-hiding architecture is enough on its own or whether more advanced caching is required (with a nod towards the pre-fetching that already exists in regular caching of texels for ordinary texture mapping).

And then we get into routing bottlenecks and crossbar madness.

This get-to-market-first approach seems rather ingenious to me; it doesn't seem like much in DX11 is radically different from what ATI had with DX10.1 (beyond the shared memory stuff). Bump up registers and local store to CS5 level,
LDS needs doubling, but I'm not aware of any effect on registers.

ensure 32-bit (u)int atomics work,
It occurred to me recently that global atomics aren't currently a part of CAL programming (or I haven't found them) - there is atomicity at LDS/shared-register (i.e. both at SIMD) level, and this atomicity is very much in the sense of any operation, rather than the D3D11-style integer atomics. This is because when a clause is underway on ATI it's uninterruptible, so as long as the entire atomic update is within a single clause it's atomic by default.

So it seems to me that atomics in the D3D11 sense are entirely new, even if D3D11's thread group shared memory atomic operations are trivial (i.e. they're just restricted in data type versions of what the clusters already achieve).

bump the tessellation unit from the 16x to the 64x level required by DX11
Seems non-trivial - though I've discovered that the amplification of TS is only on odd factors, i.e. from 15x to 64x is only 24x more vertices. The sheer quantity of data seems to imply that current ATI GPUs (including Xenos) use 15x for practical purposes - specifically because of on-die buffer capacity? Simply because of setup throughput?

64x in D3D11 could be just another bonkers limit, like 4096 vec4s per pixel of registers.

and implement whatever is necessary to run VS+HS directly to global memory with a second pass for TS+DS+PS (GS as per R700). Append/consume is easy to do with atomics.
SO is append, and currently supports 4 streams bound at one time. As far as I can tell vertex fetch is consume. So the atomicity required in coordinating all clusters already exists in ATI. It's just a question of whether it scales.
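In D3D11 terms that maps onto the append/consume structured-buffer views, which wrap exactly this hidden-counter coordination - a minimal sketch (types and names invented):

```hlsl
// Minimal sketch (types/names invented) of D3D11 append/consume views.
struct Particle { float3 pos; float age; };

AppendStructuredBuffer<Particle>  gOut : register(u0);
ConsumeStructuredBuffer<Particle> gIn  : register(u1);

[numthreads(64, 1, 1)]
void produce(uint3 dtid : SV_DispatchThreadID)
{
    Particle p;
    p.pos = float3(dtid.x, 0, 0);
    p.age = 0;
    gOut.Append(p);              // atomic bump on the hidden counter
}

[numthreads(64, 1, 1)]
void consume(uint3 dtid : SV_DispatchThreadID)
{
    Particle p = gIn.Consume();  // atomic decrement pops one element
    // ... process p ...
}
```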

About the only possible wildcard I can see is DX11 R/W render targets. It seems (from what others have posted) as if DX11 has only 32-bit int and 32-bit unsigned int atomics, so using atomics on individual (8-bit or 16-bit) components of a 32-bit (or 64-bit) render target doesn't seem possible (you'd instead have to do a CAS on the entire pixel and a retry loop on CAS failure; 64-bit would be a mess without 64-bit atomics). That likely makes unordered R/W to render targets marginal in usefulness (using a full 32-bit value per component IMO is not an option unless working out of cached memory). So it seems as if the memory export ability of R700 would be fine for DX11 R/W RT access...
I don't understand why D3D11 R/W would want to be per-component in a RT :???: RTs are always multiples of 32 bits as far as I can tell.

The way I see this is that the developer is on their own - if they want atomicity and a specific ordering on RT R/W they need to roll their own - will it be faster than multi-pass?

If anything, the disaffection for GS/SO may be repeated in much of the new stuff in D3D11 - e.g. if it turns out that bandwidth and latency are much too dominant - at least for the first generation of GPUs.

Jawed
 
But if this is like R600->RV670, then 128-bit would be fine. RV770/790 only really need their bandwidth for 8xMSAA or very high resolution/highest-quality settings. Mainstream GPUs tend not to be so luxuriously endowed in the following generation.
And if it's the biggest GPU of this family and an X2 part will be based on it? Wouldn't it be bandwidth starved then?
Yet RV790 was larger - they didn't even bother ditching the apparently useless CrossFireX Sideport - seems like a low-investment refresh. Almost seems to indicate that in this price category the die size difference, 20mm² say, is no big deal.
I don't exactly consider RV790 a "big" GPU, it has less than 300 mm² and uses a mature manufacturing process.
Which GPUs, specifically, are you referring to?
2002: R300 ~ 218 mm², high-end
1/2006: R580 ~ 315 mm²
Q4 2007: G92 ~ 324 mm², but not high-end anymore
But R520, for instance, used a new manufacturing process and its die size (264 mm²) was smaller than R420's (281 mm²).
In other words the best case is that this chip could offer ~10% more performance than RV790. Doesn't sound to me like the best chip of AMD's introductory D3D11 generation.

As a 128-bit, 16 RBE chip with say 4 or 6 extra clusters, D3D11 changes and fast GDDR5, 1.2GHz? Making it a chip for 1680x1050x4xMSAA users, or 1920 on older games?

I dunno, but either way this doesn't look fast enough to be the basis of a $300 SKU or $550 X2 SKU.
There's plenty of possibilities. Perhaps that wild theory of mine I mentioned earlier might still be in the game - if nVidia isn't going to have a DX11 part ready till Q2 2010, ATI can sell practically anything in the meantime, while preparing a more powerful chip to battle the GT300?
On the other hand, +10% compared to RV790 still gives you a pretty fast GPU.
no-X said:
We consider a 180mm² GPU a bit large for a 128-bit part... but what's the minimum size of a 128-bit GDDR5 part? Maybe ATi plans to shrink this GPU when a 32nm or 28nm manufacturing process becomes available, just like RV530->RV535, RV630->RV635 etc.

180mm² at 40nm means about 120-130mm² at 32nm or slightly over 100mm² at 28nm.
Now I can't remember, was it you who said that even for an "optical" die-shrink, there's still some redesigning necessary because of the analog parts of the chip? RV530 had one cluster, so there wasn't really a choice, but RV740 has eight. And as with RV770 during development, it's easy to add clusters to achieve the desired die size. So, if Evergreen (or is it Cypress?) is a 128-bit GPU, why wouldn't they "fine-tune" its die size to the smallest required (that could be around 120 mm² for GDDR5) and add more clusters later with the die-shrink?
 
And as with RV770 during development, it's easy to add clusters to achieve the desired die size. So, if Evergreen (or is it Cypress?) is a 128-bit GPU, why wouldn't they "fine-tune" its die size to the smallest required (that could be around 120 mm² for GDDR5) and add more clusters later with the die-shrink?
Because that would be RV740???

I am liking no-X's thought process though. Very interesting idea.
Save RV740's bus/MC/ROP setup and toss on an extra cluster or two over RV770. Eventually shrink it as a test for GF's 32/28nm.
I like it a lot.
 