Let the beating begin!

RickCain said:
After reading that "article", it makes me wonder why someone would post this knowing they don't have a full understanding of what they are writing about. With such gross misrepresentation of yields, it makes me wonder what the motivation really was.

These are the type of articles that make a website lose credibility, in my opinion.


OR they could be an attempt by the writer to better understand a certain method/item/choice (whatever it may be) by inspiring discussion around that subject.

DUCK !!.. incoming black helicopters.. must put on tin foil hat..

JKing ;p
 
FrameBuffer said:
OR they could be an attempt by the writer to better understand a certain method/item/choice (whatever it may be) by inspiring discussion around that subject.

That would make sense to me if the article had not first been released on a website before any discussion of whether the yields and such had merit.
 
Jawed said:
"3:1" is still a funny ratio to me, and I honestly expect that each of the four shader units in R580 is constructed with 4x fragment quads, but with one of them sacrificed for redundancy (much like Cell in PS3 sacrifices one SPE for redundancy). This leaves each shader unit with 3x active fragment quads. These two concepts, in my opinion, mark a dramatic shift in yield-modelling for R5xx, as compared with previous GPUs.

Jawed
What do you mean by "sacrificed"? Like, if one quad only has 2 operative, it 'borrows' one from the next quad?

Sounds good to me, but doesn't that imply that the redundant unit could be switched on again through a BIOS/driver update if they all happened to be available on a well-baked part?
 
In that case, the whole quad would be scrapped and the chip would be binned as a lesser-quad unit, like a GTO (X1800GTO = X1800XL/XT minus one quad). The quads are probably isolated units, not able to borrow sugar (more like a room, in this case) from their neighbors.

It may be too complicated for ATI's driver writers to optimize for 2:1 alongside 3:1 variants. RV560 was speculated to be 2:1, but the latest rumors put it back in familiar 3:1 territory. For the same reason, they may not be bothered to work in 4:1 variants, if Jawed is right--heck, there may not be enough to warrant an extra SKU.

Jawed, did you reverse "shader unit" and "fragment quad" in that second point? The precise wording is confusing me, as I think of R580 as having 4 quads and 48 shader units (not the other way around), and so "3x active shader units per quad" rather than "each shader unit with 3x active fragment quads."
 
RickCain said:
After reading that "article", it makes me wonder why someone would post this knowing they don't have a full understanding of what they are writing about. With such gross misrepresentation of yields, it makes me wonder what the motivation really was.

These are the type of articles that make a website lose credibility, in my opinion.

Ah yes, after releasing my "article" I was hoping to manipulate the stock price on ATI, causing it to crash, all for my gleeful enjoyment of having such power...

Ok, here is the deal. Yields and actual costs for dies are well-kept secrets for these companies. It was a speculative article, and the point I was making was that there is a huge difference in die sizes between the two main competitors, and with a standard yield model I thought it was interesting to see what kinds of yields the manufacturers might actually get. My reasons for writing this article were purely because it was of interest to me. The discussion that has come afterwards has been very positive in my view. The idea that ATI may be implementing a new way to provide redundancy in their products is fascinating, and it brings up some other very interesting topics.

Another thing of interest for me, and I believe I have covered this in my article, was the difference in overall design philosophy for this generation of products. ATI is working on features and performance all at the cost of die size, but NVIDIA is concentrating on lowering transistor counts yet keeping performance per clock pretty consistent. I think both have achieved a lot in this past year, and the products we see now are simply fantastic from both camps.

There was no underlying reason I wrote this article, and I certainly wasn't pointed in this direction by one company or another. I am not an AEG member, and NVIDIA had nothing to do with this. I tried not to slant it one way or another, but at the same time you cannot get away from the fact that the G71 is 196 mm² while the competing R580 is 352 mm². The size alone probably makes a pretty significant difference when it comes to manufacturing on the same process with the same size of wafers.
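To put rough numbers on the die-size gap, here is a minimal sketch. It assumes 300 mm wafers, a common dies-per-wafer approximation, and a simple Poisson yield model, which may not be the model the article used. The defect density `ASSUMED_D0` is a made-up placeholder, not a foundry figure, and the function names are only illustrative. Only the 196 mm² and 352 mm² die areas come from the discussion.

```python
import math

def gross_dies_per_wafer(die_area_mm2, wafer_diameter_mm=300.0):
    """Approximate whole dies per wafer: wafer area over die area, minus an edge-loss term."""
    wafer_area = math.pi * (wafer_diameter_mm / 2.0) ** 2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2.0 * die_area_mm2)
    return int(wafer_area / die_area_mm2 - edge_loss)

def poisson_yield(die_area_mm2, defects_per_cm2):
    """Poisson yield model: Y = exp(-A * D0), with die area A converted to cm^2."""
    return math.exp(-(die_area_mm2 / 100.0) * defects_per_cm2)

ASSUMED_D0 = 0.5  # defects per cm^2; hypothetical placeholder, not a real process figure

for name, area in (("G71", 196.0), ("R580", 352.0)):
    gross = gross_dies_per_wafer(area)
    good = gross * poisson_yield(area, ASSUMED_D0)
    print(f"{name}: ~{gross} gross dies, ~{good:.0f} good dies per wafer")
```

This model treats any defect as fatal to the die, so it ignores in-die redundancy entirely and will understate yield for any chip that can tolerate a dead unit.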

Edited for clarity
 
RickCain said:
After reading that "article", it makes me wonder why someone would post this knowing they don't have a full understanding of what they are writing about. With such gross misrepresentation of yields, it makes me wonder what the motivation really was.

These are the type of articles that make a website lose credibility, in my opinion.

And how is he supposed to validate this article? Somehow I don't think ATI or Nvidia is going to share this information. He clearly states this is an educated guess and may not be 100% accurate. It is at least a start toward understanding how ATI's and Nvidia's yields and costs may be affected by die size and wafer costs.
 
Pete, I refuse to call a section of 12 fragment pipes and 4 TMU pipes a "quad" (oh, and 4 ROP pipes and not forgetting the register file). It's just too sloppy.

In my view that's a shader unit. Once fragments are assigned to a shader unit, which I theorise includes a local version of the ultra threaded despatch processor, fragments spend their whole life there.

So R580 has four shader units, each with 12 fragment pipes and 4 TMU pipes - or 3 fragment quads and 1 TMU quad.

It's my theory that the die is configured such that each shader unit consists of 4 fragment quads, with one of them sacrificed for redundancy. It may be more fine-grained than that (as per the recent pair of patents). I'm not sure how register file redundancy would work - whether the register file is organised in quads (would seem logical) or whether it's consolidated for the entire shader unit. But I expect there's very fine-grained redundancy there, as it's just memory.

One thing that worries me about die-level redundancy is the TMUs - as far as I can tell they take up a hell of a lot of die space, too, and so I presume there must be some redundancy mechanism there as well. I imagine redundancy here would be more along the lines of the patents. If so, that would make my "four fragment quads, one sacrificed" theory a bit excessive and redundancy would be at the pipe level instead.

Certainly pipe-level redundancy is easier to read across to R520 and RV515.

My main aim is to stop people thinking that a bigger die equals lower yields. It works for memory, it works for Cell, no reason why it shouldn't work for GPUs. We've seen ATI's (and NVidia's) patents on the subject...
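A quick way to see why redundancy weakens the link between die size and yield is to compare a block that needs all of its quads to work with one that carries a spare. The sketch below assumes each quad is good independently with the same probability. `P_QUAD_GOOD` is a hypothetical value chosen only for illustration, and `prob_at_least` is an illustrative helper, not anything taken from ATI's patents.

```python
from math import comb

def prob_at_least(k, n, p_good):
    """Probability that at least k of n independent units are defect-free."""
    return sum(
        comb(n, i) * p_good**i * (1.0 - p_good) ** (n - i)
        for i in range(k, n + 1)
    )

P_QUAD_GOOD = 0.9  # hypothetical per-quad probability of being defect-free

# Shader unit built with exactly 3 fragment quads: all 3 must work.
no_spare = prob_at_least(3, 3, P_QUAD_GOOD)

# Shader unit built with 4 fragment quads, any 3 of which must work.
one_spare = prob_at_least(3, 4, P_QUAD_GOOD)

print(f"3 of 3 quads good: {no_spare:.4f}")
print(f"3 of 4 quads good: {one_spare:.4f}")
```

The spare quad still costs die area, so fewer candidate dies fit on a wafer. Whether the trade pays off depends on the real defect density and on how much of the die is covered by redundancy, neither of which is known here.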

Jawed
 
Jawed said:
So R480 has four shader units, each with 12 fragment pipes and 4 TMU pipes - or 3 fragment quads and 1 TMU quad.

Totally mixed up IMHO. It's rather 4 quads, each with four shader "pipes" (in the old sense) containing 3 fragment units and 1 TMU each. Not that this is a valid description of the architecture, but the grouping of items in your post is totally wrong. We're still far from the flexibility of the unified architecture here, too much wishful thinking on your side ;)
 
_xxx_ said:
Totally mixed up IMHO. It's rather 4 quads, each with four shader "pipes" (in the old sense) containing 3 fragment units and 1 TMU each. Not that this is a valid description of the architecture, but the grouping of items in your post is totally wrong. We're still far from the flexibility of the unified architecture here, too much wishful thinking on your side ;)
I don't think Jawed's grouping is wrong at all. But you could argue that "arithmetic pipes" would be a better term than "fragment pipes".
 
Xmas said:
I don't think Jawed's grouping is wrong at all. But you could argue that "arithmetic pipes" would be a better term than "fragment pipes".

I think you're both referring to this, but that is the marketing-friendly representation IMHO. Think about the differences concerning batch size, rings a bell? Since you can't assign 12 batches in parallel to one "quad", I'd say it still has 16 "pipes" with three fragment units each. AFAIK the individual units in these triplets are not decoupled/independent of each other, right?
 
What would you view as the difference between each R580 "quad" and the entire G70 pipeline, then?
 
_xxx_ said:
I think you're both referring to this, but that is the marketing-friendly representation IMHO. Think about the differences concerning batch size, rings a bell? Since you can't assign 12 batches in parallel to one "quad", I'd say it still has 16 "pipes" with three fragment units each. AFAIK the individual units in these triplets are not decoupled/independent of each other, right?
I'm not referring to the representation in that image at all.
Thread size is 48 pixels/12 pixel quads IIRC, and these threads are distributed to 4 units, in a way that each unit is working on a different "tile" of the framebuffer (supertiling). Each of those units has 12 (or 3 quad) "arithmetic pipes" (ADD + MAD, vec3 + scalar) that execute instructions in lockstep, each processing four out of the 48 pixels (12 quads) per thread.
 
Dave Baumann said:
What would you view as the difference between each R580 "quad" and the entire G70 pipeline, then?

I'd say that whereas G70 has two shaders per pipe, R580 has three.

I'm rather talking about the difference to R520. IMO, it's like R520 with each "pipe" containing three of the R520 shader units, if you wish. So I think that these triplets are dependent "per pipe" and thus the batch size of 12 pixels is coming from 1 quad = 4 pipes with 3 shaders each. The difference being that these three shaders are not individually utilized, I think, or only with limitations.

I understand what Xmas says, I'm just still not convinced that's the case.
 
_xxx_ said:
I'd say that whereas G70 has two shaders per pipe, R580 has three.
G70 has two ALUs per pipeline that can dual-issue instructions for the same pixel, same as ATI's fragment pipelines can (NVIDIA has two MADD ALUs, ATI has a MADD and an ADD ALU), but they are working on the same "pixel" in both cases. R580's pipelines are also "wide" in that they are operating on separate, parallel pixels that happen to be of the same "batch", and this is pretty much the case for G7x across all its quads most of the time.

I'm rather talking about the difference to R520. IMO, it's like R520 with each "pipe" containing three of the R520 shader units, if you wish. So I think that these triplets are dependent "per pipe" and thus the batch size of 12 pixels is coming from 1 quad = 4 pipes with 3 shaders each. The difference being that these three shaders are not individually utilized, I think, or only with limitations.
Nope, because they are already two ALUs "deep" - it'd be next to impossible to find a shader where you can issue 6 - 12 instructions on a single pixel! This is why they are still "wide" and operating independently on separate pixels.

[Image: rv530shader.gif]

http://www.beyond3d.com/reviews/ati/rv5xx/
 
Ok, that's a better representation there. But I still don't get it, there must be some kind of dependency between those three. I'll be back if I can think of the right formulation for the question, still a bit confused right now.

EDIT:

This is a part of what I was referring to:

The net result here is that the block size of those threads is also increased from 16 to 48 pixels per thread, which lowers the efficiency of dynamic branching in relation to RV515 and R520
 
The only dependency is that they are working on the same "batch" of pixels. R520, for instance, breaks down its pixel workload into "batches" of 16 pixels per "command processor" (effectively there are 4 command processors in R520, each handling 128 threads, whereas in RV515 there is just a single command processor) - these pixels have the same state and shader program applied to them.

As there are only 4 pipelines (a single quad) per command processor, a single batch of pixels is operated on over 4 cycles (i.e. a single instruction for a single batch of 16 pixels will operate on a single quad in 4 cycles). Now, oddly (there are reasons), it's important that a single instruction per batch takes 4 cycles, so that is actually fairly fixed - the easiest way to increase the number of pixels being operated on in parallel is just to increase the batch size; you could add more command processors, but these are actually expensive in terms of transistors, so the easiest thing to do is just increase the batch size and have 12 pixels being operated on in parallel over 4 cycles (ergo, 48 pixels per batch).
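The batch-size arithmetic above can be written out as a small sketch. The pipeline counts and the fixed 4-cycle figure are taken from the post; the function and constant names are just illustrative.

```python
CYCLES_PER_INSTRUCTION = 4  # described above as "fairly fixed"

def batch_size(pipes_per_command_processor, cycles=CYCLES_PER_INSTRUCTION):
    """Pixels per batch = pixels processed in parallel x cycles per instruction."""
    return pipes_per_command_processor * cycles

r520_batch = batch_size(4)   # one quad of 4 pipelines per command processor
r580_batch = batch_size(12)  # 12 pixels in parallel per command processor

print(f"R520: {r520_batch} pixels per batch")
print(f"R580: {r580_batch} pixels per batch")
```

With the cycle count held constant, tripling the pipes behind each command processor triples the batch, which is the 16 to 48 jump quoted from the review.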
 
For I am become dumb, destroyer of words.

Jawed said:
Pete, I refuse to call a section of 12 fragment pipes and 4 TMU pipes a "quad" (oh, and 4 ROP pipes and not forgetting the register file). It's just too sloppy.
Right, understood. Each of those "fragment pipes" is indeed working on a separate fragment, so it's not fair to call 12 of them a quad. Bear with me, though, as I'm still having trouble with your terminology. This may be stubbornness talking (though I adapted to NV cutting ROPs out of the pipeline fast enough).

Massive edit: Let me cut my confusion/bloviation/confusing bloviation down to size. I've posted some dumb replies, but this was my dumbest yet, pre-edit. Unbelievable.

1)
It's my theory that the die is configured such that each shader unit consists of 4 fragment quads, with one of them sacrificed for redundancy.
I'm with Xmas in that "'arithmetic pipes' would be a better term than 'fragment pipes.'" It's less confusing to me, anyway (tho the aim is to better accommodate reality, not my potential misunderstanding of it).

So, each shader unit has four arithmetic quads, with one sacrificed for yields.

2)
In my view that's a shader unit. Once fragments are assigned to a shader unit, which I theorise includes a local version of the ultra threaded despatch processor, fragments spend their whole life there.

So R480[sic] has four shader units, each with 12 fragment pipes and 4 TMU pipes - or 3 fragment quads and 1 TMU quad.
OK, to summarize, I'd be more comfortable with R580 having four shader units, each with twelve arithmetic pipes and four texture pipes (or three math and one TMU quads).

3) OT, but I suppose the reason you don't call R5x0's shader units shader pipes (a la Xenos) is b/c R5x0 still assigns a ROP per pipe, whereas Xenos finally dissociates ROPs from the rest of the pipeline.
 
Thinking about this a little further, we'd also need to apply Josh's yield advantage numbers to market segment percentages. That might help quite a bit to bring them at the macro end within range of what we know the financial realities to be (at least more so than they currently are).

By volume, ATI estimates the market to be:

5-8% High-end
30-35% Mainstream
60% Value

http://www.beyond3d.com/forum/showpost.php?p=707854&postcount=20
 
Pete, Xmas, yes I agree arithmetic pipes is a much better name.

:rolleyes: at the R480 slipping in there. Sigh.

I tend to include the ROPs (and hierarchical Z/stencil) as part of a shader unit - ROPs are not decoupled in R580 as they are in Xenos or NVidia's NV4x. This makes more sense when you remember the tiling of the render target (64x64 pixels in size - though size is programmable in R4xx onwards).

As I said earlier, a shader unit is solely responsible for a fragment from the moment it's generated, by polygon rasterisation, to the moment it's written to memory as colour/z/stencil. The differences in R5xx versus Xenos despite the similarities of threading and 3:1 ALU:TEX make thinking about R600 tricky :???:

As to R5xx yield, it's a matter of discerning the kinds of redundancy (fine-grained, ALU-level, or more coarse-grained quad-level). To be honest the "single quad" nature of R520 and RV515 implies that fine-grained redundancy rules the roost - making it less likely that my "one arithmetic quad in four is sacrificed for yields" theory is correct. That kind of redundancy might only occur as an extra layer of redundancy.

I don't see how it's possible to use raw statistics to determine R5xx yield, when there is prolly in-die redundancy the degrees of which are unknown.

Jawed
 
I heard a rumor that 7900GTXs were limited to 5983 pcs. Maybe someone can explain to me why Nvidia decided to limit the volume of 7900GTXs?
 