AMD: R8xx Speculation

How soon will Nvidia respond with GT300 to the upcoming ATI RV870 lineup of GPUs?

  • Within 1 or 2 weeks: 1 vote (0.6%)
  • Within a month: 5 votes (3.2%)
  • Within a couple of months: 28 votes (18.1%)
  • Very late this year: 52 votes (33.5%)
  • Not until next year: 69 votes (44.5%)

  • Total voters: 155
  • Poll closed.
So R800 has 4 RBEs, sinking pixels from 20 clusters.
Or rather 8, going by how AMD insists on having 8 RBEs drawn.
Even when there is an RBE section next to each memory controller, it contains two RBEs.

This raises a further question: how do these contend for the same memory controller? I interpret this to mean that the RBEs are neighbors in screen space if they are also neighbors physically. It would seem best that they operate in concert, if they aren't going to cause the memory controller to turn the bus around unnecessarily. Though if that is the case, why have them drawn as separate RBEs at all?

The 2 rasterisers can each have distinct regions of screen space. This means that the two hierarchical-Z/stencil systems can work independently, yet still accelerate a common back buffer.

My problem with this is how the two rasterisers get their work. Until you've done a coarse rasterisation (identifying a screen-space tile, the first stage of hierarchical rasterisation), you don't know which rasteriser to use. But coarse rasterisation is itself the first step of rasterisation, and so on: a chicken-and-egg problem.
Is this absolutely necessary?
Is there something that results from the process of rasterizing two triangles with overlapping coverage that can't be corrected after the fact?

Perhaps it's another brute-force doubling of hardware.
Two rasterization blocks that act like they are alone, each scanning their own triangles and hoping the rest of the pipeline knows how to handle overlap.
 
I admit that I'm definitely not a financial expert, but please tell me: why did Jen-Hsun attribute part of their loss to a huge stock of unsold 65nm GT200 GPUs if those costs aren't accounted for?

If they aggressively write off inventory when ASPs fall, that will be a hit against income. In a situation where you're going to lose money anyway, it makes sense to maximize the accounting loss now.
 
If they aggressively write off inventory when ASPs fall, that will be a hit against income. In a situation where you're going to lose money anyway, it makes sense to maximize the accounting loss now.

The bottom line is NV did not break even with the GT200. Is that correct?
Would anybody know how much they lost?
 
There's no way to tell how much a company makes on any given GPU, as for that calculation you would need a breakdown of R&D and production costs. For all we know, R&D expense for GT200 was nothing compared to G80, as it was a much smaller evolution. Today they're incurring R&D expenses for things we won't see for months or years.
 
Jawed said:
So, either R900 is a total rethink in the Larrabee direction or AMD's fucked in 18 months.
You don't think they will just migrate to a 384-bit memory bus? 1.5 GHz GDDR5 should be available by then, which would provide ~288 GB/s.... how much bandwidth do you really need?
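
For what it's worth, the arithmetic behind that ~288 GB/s figure, as a quick sketch (assuming "1.5 GHz" refers to the GDDR5 command clock, i.e. 6 Gbps per pin):

```python
# Quick sanity check of the ~288 GB/s figure (assumes "1.5 GHz" means the
# GDDR5 command clock, i.e. a 6 Gbps per-pin data rate).
def gddr5_bandwidth_gbs(bus_width_bits, command_clock_ghz):
    data_rate_gbps = command_clock_ghz * 4        # GDDR5 transfers 4 bits per pin per CK cycle
    return bus_width_bits / 8 * data_rate_gbps    # bytes/s, expressed in GB/s

print(gddr5_bandwidth_gbs(384, 1.5))    # 288.0 GB/s
print(gddr5_bandwidth_gbs(256, 1.2))    # 153.6 GB/s, i.e. HD5870 as launched (4.8 Gbps)
```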
 
Or rather 8, going by how AMD insists on having 8 RBEs drawn.
Even when there is an RBE section next to each memory controller, it contains two RBEs.
Yes, 8, brain-fart.

This raises a further question: how do these contend for the same memory controller? I interpret this to mean that the RBEs are neighbors in screen space if they are also neighbors physically. It would seem best that they operate in concert, if they aren't going to cause the memory controller to turn the bus around unnecessarily. Though if that is the case, why have them drawn as separate RBEs at all?
Any single RBE (i.e. for four pixels per clock) owns multiple tiles in screen space. There isn't any meaning in the neighbourliness of RBEs, so when two share an MC it's no different from the MC supporting multiple tiles owned by a single RBE. Well, that's my opinion...

The whole point of the colour/Z/stencil buffer caches is to give the MCs sizable and coherent lumps of work to do, instead of piecemeal quad-sized work. After all an RBE will, over the course of four successive clocks, receive 64 fragments from the shader pipeline. That's a nice chunk of data.
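
To put a rough number on that chunk (assuming plain 32-bit colour per fragment; this is just back-of-envelope, not an RV870 spec):

```python
# Back-of-envelope size of the colour write burst described above.
fragments = 64                      # one shader-pipeline batch
bytes_per_fragment = 4              # assuming RGBA8 colour
burst_bytes = fragments * bytes_per_fragment
gddr5_access_bytes = 32             # one 32-bit GDDR5 channel x burst-length 8
print(burst_bytes, burst_bytes // gddr5_access_bytes)   # 256 bytes -> 8 full DRAM bursts
```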

Scatter type memory operations are where the performance cliffs lie, and RBE isn't scatter-like.

Now you could argue that the sheer quantity of work generated by 2 RBEs per MC is a problem. Well, that's just a bandwidth question; I don't think it's really a number of transactions per second question - most games have a fillrate/bandwidth issue, not a transactions per second issue. Well, that's how it seems...

Is this absolutely necessary?
Is there something that results from the process of rasterizing two triangles with overlapping coverage that can't be corrected after the fact?
No, I was referring to the concept of splitting the workload of rasterising each triangle (in the case of triangles spanning multiple tiles in screen space) or fully assigning a triangle to only one rasteriser (if the triangle only covers a single tile). Until you've rasterised a triangle you don't know which screen-space tiles it falls into. If you want to split up the effort of rasterising triangles, then you only want a single rasteriser working on a given tile.
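
As a toy illustration of that tile-ownership idea (the 32-pixel tile size and the checkerboard mapping are made up for the example, not RV870 specifics):

```python
# Toy sketch: coarse rasterisation finds the screen-space tiles a triangle's
# bounding box touches, then a fixed tile->rasteriser mapping routes the work,
# so no two rasterisers ever touch the same tile.
TILE = 32  # assumed tile edge in pixels

def tiles_covered(bbox):
    """Tiles touched by a triangle's screen-space bounding box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = bbox
    return [(tx, ty)
            for ty in range(y0 // TILE, y1 // TILE + 1)
            for tx in range(x0 // TILE, x1 // TILE + 1)]

def rasteriser_for_tile(tx, ty, num_rasterisers=2):
    """Checkerboard ownership: each tile belongs to exactly one rasteriser."""
    return (tx + ty) % num_rasterisers

# A big triangle gets split between both rasterisers; a small one goes to just one.
work = {0: [], 1: []}
for tri_id, bbox in enumerate([(10, 10, 200, 90), (40, 40, 50, 50)]):
    for tx, ty in tiles_covered(bbox):
        work[rasteriser_for_tile(tx, ty)].append((tri_id, tx, ty))
print(work)
```

The chicken-and-egg point above is visible here: the coarse bounding-box step has to run before you know which rasteriser(s) the triangle belongs to.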

Perhaps it's another brute-force doubling of hardware.
Two rasterization blocks that act like they are alone, each scanning their own triangles and hoping the rest of the pipeline knows how to handle overlap.
By definition the pipeline already handles overlap. The key question is minimising wasted effort on fragments that will be killed by the RBEs.

Now there may be (as you suggest) some other system being used here, that isn't screen space tile driven for workload assignment. e.g. round-robin assignment of triangles to rasterisers. This would require the two hierarchical-Z buffers to be duplicates of each other (i.e. requiring coherency traffic). Even if there's some latency there that allows two successive triangles to both be rasterised and shaded even though one entirely hides the other, the actual rendering will still work because the RBEs are the final arbiters of Z. Hierarchical-Z may even be able to send a kill signal after a batch of fragments has commenced shading (when it finally receives the message from the other hierarchical-Z unit that the triangle it allowed, which is sat in its cool-off queue, should have been killed). This would only apply if the entire triangle/tile needs killing (here I'm presuming that only one triangle occupies a pixel shading batch/tile).

The diagram seems to suggest that a single batch despatcher consumes the two rasterisation streams and can therefore issue to any of the 20 clusters.

Jawed
 
You don't think they will just migrate to a 384-bit memory bus? 1.5 GHz GDDR5 should be available by then, which would provide ~288 GB/s.... how much bandwidth do you really need?
If AMD wants to make GPUs with more than a 256-bit memory bus, then fine. HD5870 is seemingly big enough to support a 384-bit bus, going by conventional I/O sizing.

Maybe that patent application I linked has yet to see the light of day and a 512-bit interface is possible in a 220mm² die. Who knows, eh?

Or maybe Rambus will finally get its day.

I dunno, is HD5870 a sweet-spot GPU in terms of die-size? What about the 28/32nm variant? etc.

Meanwhile GDDR5 gives every indication of going nowhere in terms of speeds if that AMD slide is to be believed. 5Gbps is really nothing to boast about ("Enables Speeds Approaching 5 Gbps"). It's frankly bizarre (to the extent I'm dubious and am hoping I'm reading it badly wrong).

Jawed
 
Jawed said:
I dunno, is HD5870 a sweet-spot GPU in terms of die-size? What about the 28/32nm variant? etc.
Yeah, not really. I had really hoped to see at least a 320-bit memory bus; I can only guess they are planning ahead for a 28nm variant, and it simply won't have the die space for anything more than 256-bit. At least it should be incredibly profitable for them.
 
I dunno, is HD5870 a sweet-spot GPU in terms of die-size? What about the 28/32nm variant? etc.

Jawed

That would be the most logical reason: minimize the R&D and the cost of shrinking the GPU in roughly 6 months, when GF gets the node ready.

I believe someone else here already suggested that about Juniper quite a few months ago.
Edit- Found it on page 36, it was no-X.

We consider a 180mm² GPU a bit large for a 128-bit part... but what's the minimum size of a 128-bit GDDR5 part? Maybe ATi plans to shrink this GPU when a 32nm or 28nm manufacturing process becomes available, just like RV530->RV535, RV630->RV635, etc.

180mm² at 40nm means about 120-130mm² at 32nm or slightly over 100mm² at 28nm.
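
For comparison, ideal optical scaling comes out a bit below those numbers; real dies shrink less because pads and analogue don't scale, which is roughly where the estimates above land:

```python
# Ideal (optical) area scaling: area shrinks with the square of the node ratio.
def ideal_shrink(area_mm2, node_from_nm, node_to_nm):
    return area_mm2 * (node_to_nm / node_from_nm) ** 2

print(round(ideal_shrink(180, 40, 32), 1))  # 115.2 mm^2 vs the ~120-130 mm^2 estimate
print(round(ideal_shrink(180, 40, 28), 1))  # 88.2 mm^2 vs 'slightly over 100 mm^2'
```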
 
I see the same trend with the 285 vs. the 5850? What's its excuse?
It seems way less pronounced there, though. The 5850 already starts out faster and just increases its lead a bit over the GTX 285, but the 5870 is sometimes slower and then tramples all over the GTX 295 at the highest settings. Plus, I think there were some threads about Nvidia cards/drivers having a sharper performance drop than AMD's when running out of memory.
 
A guide to those who can't read thread titles:

This thread is about ATI's upcoming part.

Use this one for nVidia's upcoming part.
Use this one for how nVidia will counter ATI's launch.
Use this one for "nVidia is over".
Use this one for "AMD is over".
 
A Guide to those who can't be bothered to read an entire post before moving it:

I was commenting on how ATi's timeline for upcoming parts could change due to the delay in the GT300. Seems worthy to place in the R8xx speculation thread.

But to each their own.

-Plack
 
Any single RBE (i.e. for four pixels per clock) owns multiple tiles in screen space. There isn't any meaning in the neighbourliness of RBEs, so when two share an MC it's no different from the MC supporting multiple tiles owned by a single RBE. Well, that's my opinion...

The whole point of the colour/Z/stencil buffer caches is to give the MCs sizable and coherent lumps of work to do, instead of piecemeal quad-sized work. After all an RBE will, over the course of four successive clocks, receive 64 fragments from the shader pipeline. That's a nice chunk of data.
The RBE-specific caches are local to each RBE, so if there are two per memory controller, the controller sees two separate chunks of data being sent out.
Under load conditions with each RBE contending equally, a naive arrangement might interleave traffic from each RBE with the other, which would hurt utilization of the memory bus if the targets are far enough apart in memory.
I suppose a single RBE could for some reason interleave from multiple batches, though I'm not sure it would want to.

Ways to limit the abuse of the MC would be to ensure much greater locality between RBEs (that is, that they work on neighboring tiles at the same time), to give an RBE a monopoly on an MC for some number of bus transactions, or to expand the MC's ability to recombine traffic.

Now you could argue that the sheer quantity of work generated by 2 RBEs per MC is a problem. Well, that's just a bandwidth question; I don't think it's really a number of transactions per second question - most games have a fillrate/bandwidth issue, not a transactions per second issue. Well, that's how it seems...
The regularity of the transactions and their locality can influence the amount of utilized bandwidth.
Jumping around and closing/reopening DRAM pages or otherwise not providing the linear accesses DRAM really likes can cut down the amount of time available for actual transactions.
The factors for this would be in the opaque realm of AMD's mapping policies, memory controller parameters, and GDDR5's architectural restrictions.
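
A toy model of that effect (the burst and page-miss cycle counts are purely illustrative; only the 153.6 GB/s peak is HD5870's actual figure):

```python
# Every page miss burns precharge/activate cycles during which no data moves,
# so the achieved bandwidth falls as access locality (the page-hit rate) drops.
def effective_bandwidth(peak_gbs, burst_cycles=4, miss_penalty_cycles=20, hit_rate=0.9):
    cycles_per_burst = burst_cycles + (1 - hit_rate) * miss_penalty_cycles
    return peak_gbs * burst_cycles / cycles_per_burst

for hit in (0.95, 0.8, 0.5):
    print(hit, round(effective_bandwidth(153.6, hit_rate=hit), 1), "GB/s")
# Interleaving two RBEs' traffic to far-apart addresses pushes the hit rate
# down, and the usable bandwidth with it.
```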

By definition the pipeline already handles overlap. The key question is minimising wasted effort on fragments that will be killed by the RBEs.
This is an admirable goal. Given how much of the design appears to be "MOAR UNITZ", I am curious to see what they tried. The GPU peaks at 190W and it has a 60% increase in performance with a doubling of almost everything, so maybe they haven't tried too much.

Now there may be (as you suggest) some other system being used here, that isn't screen space tile driven for workload assignment. e.g. round-robin assignment of triangles to rasterisers. This would require the two hierarchical-Z buffers to be duplicates of each other (i.e. requiring coherency traffic). Even if there's some latency there that allows two successive triangles to both be rasterised and shaded even though one entirely hides the other, the actual rendering will still work because the RBEs are the final arbiters of Z. Hierarchical-Z may even be able to send a kill signal after a batch of fragments has commenced shading (when it finally receives the message from the other hierarchical-Z unit that the triangle it allowed, which is sat in its cool-off queue, should have been killed). This would only apply if the entire triangle/tile needs killing (here I'm presuming that only one triangle occupies a pixel shading batch/tile).
One "advantage" of this scheme would be that it requires minimal investment in changing the rasterizers.
Rather than sending rasterization data back and forth, the GPU can get lazy and just rely on broadcasting from the RBE-level Z buffer to both Hierarchical Z blocks, and rely on the RBEs to automatically discard whatever excess fragments make it past the even more conservative than usual early Hierarchical Z checks.

I'm not saying I'd find this to be the best solution, but it is a solution that involves a certain "economy of effort".
 
Clearly I'm out of touch. Bit embarrassing, really. I was actually planning on buying the 2GB HD5870 (even though I hate the cooler, damn it's yucky, and despite the fact that the performance is worryingly low), but that's because I always buy "excess" memory.
Jawed

Hi Jawed,

My post's only purpose was to present my point of view and to see whether you agree with me...
I have no technology background, and I don't think you are out of touch at all...

I just thought that, since memory capacity is doubling every 2 years and resolution is changing at a much slower tempo
(in the future we will see..., but I don't think bezeliousfinity lol will change anything for the vast majority of the market),
doesn't it make sense for game engines to use the extra memory capacity (that their customer base has) over each cycle, per resolution target?


The thing that gets me about the performance of HD5870 is that it appears AMD is basically saying "that's it, we're bandwidth limited and GDDR5 won't go much, if any, faster".

I think that AMD made a very good decision with the 256-bit controller (if you consider the positives and the potential negatives of a 512-bit controller for ATI, it was definitely a good business decision...).
I perceive the AMD slides regarding GDDR5 differently.

This is a dangerous point because I strongly believe Larrabee is considerably more bandwidth efficient. So, either R900 is a total rethink in the Larrabee direction or AMD's fucked in 18 months. I don't care how big Larrabee is (whatever version it'll be on), I want twice HD5870 performance by the end of 2010.

About Larrabee I know very little.

From what I understand, Intel wants to control the whole GPGPU direction and secure the future CPU/GPU status in their favor...

I really don't like Intel's practices on a lot of issues, but the guys are smart and this particular time they are very strong (you can see the experiments they are doing with their current CPU product line and with the upcoming 32nm one; they can certainly afford them right now, and it is the best time for them to test ideas/models/practices in the market...).

Larrabee, from what I can understand, is not going to be competitive from a performance standpoint with the future high-end GPUs that ATI & NV can make, but it doesn't have to be.
For Intel to control the GPU market, it has to stay in the market for at least two engine mini-cycles (2+2 years) and must use a pricing model, business tactics and a marketing strategy that entice partners and consumers to its solutions...

Despite that reasoning, I am not optimistic that Intel will achieve this; that's why my original thought when I heard about Larrabee's GPU was that Intel would implement a custom GPU socket solution lol and develop their strategy in that direction...


The dumb forward-rendering GPUs are on their last gasp if memory bandwidth is going nowhere.

Of course if AMD can make a non-AFR multi-chip card that doesn't use 2x the memory for 1x the effective memory, then I'm willing to be a bit more patient and optimistic.

Jawed

If you read the JPR report from 1-1.5 months back, JP was saying that nearly half of new PCs in 2012 will be sold with multi-AIB GPU solutions (with scaling done by Lucid Tech chip solutions, according to him) (lol).

I hope his prediction turns out to be wrong.

I don't see how AMD/NV will like this direction...

I think that ATI & NV have the technical capability to make homogeneous multi-core designs like SGX543MP (I am not talking about the tile-based rendering method...);
that's how I see the progression of ATI/NV's future shared-memory GPU designs,
so they will not need Lucid for performance scaling (why would NV/ATI lose all the money that customers are going to pay for Lucid-based solutions when it could go directly to NV/ATI?).
 
Hmm, are the texture units still similar in filtering capability?
I always thought that Nvidia could afford near-perfect AF filtering on their chips (since G80) because they comparatively had more texture units than AMD's R6xx and newer chips. After all, angle-independent AF increases the number of samples (on average) you need for filtering.
 
I was commenting on how ATi's timeline for upcoming parts could change due to the delay in the GT300. Seems worthy to place in the R8xx speculation thread.

Hi,

Since you haven't been here long enough to have PMs, I'm putting this here. Hopefully it will enlighten others who insist on turning these into VS threads. That's right, my post wasn't just for your edification:

That, and other articles like it, from "both sides", are textbook F.U.D. They provide no solid evidence, all the while increasing fear, uncertainty and doubt about what the future will bring. No worthwhile discussion will come out of this before the whole thing degenerates into ad hominem or disputing the validity of the premise.

While we're able to tolerate an increase in fanboy fervour over major releases such as these, this thread is about the R8xx. Adding a question about ATI doesn't make it right; adding a question about ATI whose answers have to accept a highly disputed premise is doubly wrong. That's why we have a "how nVidia will respond to ATI" thread. Make a "how will ATI respond to nVidia" one if you absolutely must.

That attitude towards a moderator will get you places too.
 
Continuing on the mainstream and entry level (and notebook) models...

cfcnc at chiphell has just posted this:
Redwood = RV830,128bit,接替RV730,性能大约是RV730的150%
(Redwood = RV830, 128-bit, to replace RV730, performance is approximately 150% of RV730)
Cedar = RV810,64bit,接替RV710,性能大约是RV710的150%
(Cedar = RV810, 64-bit, to replace RV710, performance is approximately 150% of RV710)
From previous leaks in the mobile segment, Redwood = Madison and Cedar = Park; they are apparently at the same power level and also pin-compatible with their predecessors (not sure exactly how this works, as RV730 and RV710 didn't support GDDR5).

To get 50% extra at the same power level, the shrink itself would likely only contribute 20%-30% at most. So the other gains must be coming from design improvements, better (lower-voltage or higher-clocked) memory, and perhaps, finally, some extra units.
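
Putting rough numbers on that split (the 25% shrink benefit is just the assumed midpoint of the 20%-30% range above):

```python
target_gain = 1.50     # claimed performance vs predecessor at the same power
shrink_gain = 1.25     # assumed midpoint of the 20%-30% process benefit
other_gain = target_gain / shrink_gain
print(round(other_gain, 2))   # ~1.2: another ~20% has to come from design,
                              # better memory, or extra units
```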

It will be interesting to see how the die sizes come out; ideally you want to run as close as possible to pad-limited to minimise costs. Their competition, the GT218 (60mm²) and GT216 (100mm²), are close to as small as they can be for 64-bit and 128-bit memory interfaces respectively.
 
Continuing on the mainstream and entry level (and notebook) models...

cfcnc at chiphell has just posted this:
From previous leaks in the mobile segment, Redwood = Madison and Cedar = Park; they are apparently at the same power level and also pin-compatible with their predecessors (not sure exactly how this works, as RV730 and RV710 didn't support GDDR5).

To get 50% extra at the same power level, the shrink itself would likely only contribute 20%-30% at most. So the other gains must be coming from design improvements, better (lower-voltage or higher-clocked) memory, and perhaps, finally, some extra units.

It will be interesting to see how the die sizes come out; ideally you want to run as close as possible to pad-limited to minimise costs. Their competition, the GT218 (60mm²) and GT216 (100mm²), are close to as small as they can be for 64-bit and 128-bit memory interfaces respectively.

Wouldn't 150% of RV710 be nipping at the toes of RV730's performance? ((Unless I've gotten confused by performance gains from driver updates since release.))
 
Can a 128-bit GDDR5 card be faster than a 256-bit GDDR5 card on average? I think it'll depend on what shader core frequencies Juniper XT will have. But in bandwidth-limited situations it'll probably be slower anyway.

Um, yes. Can a 512b card be faster than a 16384b card? Um, yes.

Bandwidth beyond that required is worthless!
 
Even the Radeon HD 4870X2 (TDP 286 Watt) does not consume twice that of the Radeon HD 4870 1GB (TDP 160 Watt).

Maybe, maybe not, but we know that it (the 4870X2) can consume more than 286 watts.

Given past performance and negligence in the graphics market, I wouldn't be surprised if either ATI or Nvidia sold a card with a nuclear generator and then claimed it required 0 watts!

In other words, take non-measured power requirements for graphics cards with an ocean's worth of salt. The card is likely able to exceed the vendor's "TDP" fairly easily without overclocking. One day, they might actually match the specs they advertise.
 