New patents by ATI

I like the power management one, that should be better than just lowering the clock.

Is there anyone with the know-how who can place a few of those diagrams within a diagram of an entire GPU?
 
They're certainly meaty. The diagrams are identical for figures 1 to 11, then they diverge - I haven't read them, just skimmed through some of the diagrams. The claims seem quite different too, but they're buggers to comprehend at the best of times. The first patent seems to be about control methods and data structures for this redundancy while the second seems to be about implementing the redundancy itself.

The kind of redundancy here is at a finer level than I've been proposing for R580 (and other 3:1 R5xx GPUs). These patents seem to be talking about redundancy at the ALU unit level, not at the pipeline level which is what I've been suggesting.

Anyway, there's some groovy diagrams in there - though sadly obliterated by "fax quality" pixellation - how ironic.

Say R600 consists of shader arrays that each comprise 16 fragment processors. Each fragment processor consists of a Vec4 ALU and a scalar ALU. The redundancy might be provided by a 17th fragment processor, able to perform Vec4+scalar computations. Any one of the 16 fragment processors can fail. Say FP3 (counting from 0 at the left) is found to be non-functional during testing; a fuse corresponding to it is blown, which appears as a "0" when the GPU reads a special memory location. Then during processing:
  • FPs 0, 1, 2 receive data for processing as normal
  • FP3 is dark
  • FP4 takes the data originally intended for 3
  • FP5 takes the data originally intended for 4 etc.
  • FPR takes the data originally intended for 15
  • the results produced by FPs 0, 1, 2 are routed normally
  • FP4's result is routed to appear as though it was produced by FP3
  • FP5's result is routed to appear as though it was produced by FP4
  • FPR's result is routed to appear as though it was produced by FP15
There's also talk of powering-down parts of the GPU :LOL: Fun stuff - still loads more than I've got time to read right now.
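
To make the routing concrete, here's a minimal sketch of that logical-to-physical remapping - my own illustration in Python, not taken from the patent, and all the names are made up:

```python
# Hypothetical sketch of the fuse-based remapping described above:
# 16 logical fragment processors are mapped onto 17 physical units
# (FP0..FP15 plus the redundant FPR); a blown fuse marks one as dead.

NUM_LOGICAL = 16
PHYSICAL = list(range(NUM_LOGICAL)) + ["FPR"]   # the 17th unit is the spare

def build_mapping(bad_fp=None):
    """Return the logical->physical routing, skipping the defective unit."""
    mapping = {}
    phys = 0
    for logical in range(NUM_LOGICAL):
        if phys == bad_fp:      # the defective FP stays dark
            phys += 1
        mapping[logical] = PHYSICAL[phys]
        phys += 1
    return mapping

# Fuse for FP3 blown: FP4 takes FP3's work, FP5 takes FP4's, ... FPR takes FP15's.
# Results are routed back through the same mapping, so the unit doing logical
# FP3's work (physically FP4) appears to the rest of the GPU as FP3.
print(build_mapping(bad_fp=3))
```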

Jawed
 
That type of thing would be similar to what we have seen with NVIDIA's "Single Extra Pipeline" patent that's been around for a while.
 
Urian said:
Are you saying that this patent talks about a special configuration of an existing GPU?
Me? No. I just gave an example of fine-grained redundancy to contrast it with my theory of how relatively coarse-grained redundancy works in R580 and the seemingly very coarse redundancy in Xenos (also my theory).

In all these methods covered by the patents in this thread, what I think is paramount is the idea that redundancy should no longer produce two or more variants of a graphics card: one with all the quads functioning (e.g. X800XT), another with one dark quad (e.g. X800Pro) and another with two dark quads (e.g. X800SE).

I presume that we're moving into an era where redundancy is hidden from the SKU, where every die is fully functional or a complete dud (though there'll be clock speed variations amongst the fully functional dies).

Jawed
 
Jawed said:
Me? No. I just gave an example of fine-grained redundancy to contrast it with my theory of how relatively coarse-grained redundancy works in R580 and the seemingly very coarse redundancy in Xenos (also my theory).

In all these methods covered by the patents in this thread, what I think is paramount is the idea that redundancy should no longer produce two or more variants of a graphics card: one with all the quads functioning (e.g. X800XT), another with one dark quad (e.g. X800Pro) and another with two dark quads (e.g. X800SE).

I presume that we're moving into an era where redundancy is hidden from the SKU, where every die is fully functional or a complete dud (though there'll be clock speed variations amongst the fully functional dies).

Jawed

Actually, we're moving away from pipelines with the unification process to more of a pool of processors. So why wouldn't it be possible to still have various variants of the card with different numbers of processors active? IE: a die now has 52 processors in total, with the top-end configuration using 48 processors, the next rung down 44 processors, and the rung below that only 40 processors. Why must things still be arranged in terms of a PIPELINE? The pipeline would just be an automatic configuration or grouping of 4 processors. Why is this way off base?
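
Just to make that concrete, a hypothetical binning policy along those lines might look like the sketch below (tier names and thresholds are purely illustrative):

```python
# Hypothetical binning of a 52-processor die into SKUs by how many
# processors survive testing. The counts are the ones from the post
# above; the tier names are made up.

TIERS = [(48, "top-end SKU (48 active)"),
         (44, "mid SKU (44 active)"),
         (40, "low SKU (40 active)")]

def bin_die(functional_processors):
    """Assign a tested die to the highest tier its defect count allows."""
    for required, sku in TIERS:
        if functional_processors >= required:
            return sku
    return "scrap / salvage"

print(bin_die(51))   # 1 defect  -> top-end SKU
print(bin_die(43))   # 9 defects -> low SKU
```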
 
The unit of shading fragments is naturally a quad, so that's taken as the current minimum capability of a "pipeline". You can choose to dedicate a shader state to that single pipeline (e.g. R420), or you can gang a number of those pipelines together (NV40) so that they all have the same shader state (i.e. shader program, program counter and constants are all shared).

As you increase the number of quads sharing a shader state, you theoretically cut the total amount of instruction decode logic in the GPU, and you can multiplex the register fetch and store pathways for all the pipelines - saving transistors, though to be frank I don't know what percentage of a pipeline is consumed by decode, fetch and store, and I dunno how practical the multiplexing is given the fairly extensive register file sizes of GPUs.
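
A rough sketch of that trade-off - one decoder/sequencer shared by a gang of quad pipelines that all carry the same shader state - might look like this (my own illustration, nothing to do with the real hardware):

```python
# One shared shader state and one instruction decode per step, broadcast
# to every quad pipeline in the gang; only the register contents are
# per-quad. Purely illustrative.

class SharedShaderState:
    def __init__(self, program, constants):
        self.program = program      # shader instructions, shared by the gang
        self.constants = constants  # shared constant registers
        self.pc = 0                 # one program counter for the whole gang

def execute(instruction, registers, constants):
    pass  # placeholder for the per-quad ALU work

def step_gang(state, quad_registers):
    """Decode one instruction once, then issue it to every ganged pipeline."""
    instruction = state.program[state.pc]   # single decode serves all quads
    state.pc += 1
    for regs in quad_registers:             # each quad keeps its own registers
        execute(instruction, regs, state.constants)

# e.g. an R580-style gang of 3 quad pipelines sharing one shader state
state = SharedShaderState(program=["MAD", "MUL", "ADD"], constants=[1.0, 2.0])
step_gang(state, [dict() for _ in range(3)])
```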

If texturing consists of multiple pipelines ganged together, then you can theoretically gain coherency in memory accesses, as well as gain savings in common decode and control hardware.

In Xenos and R5xx we see very small threads consisting of 4, 12 or 16 quads of fragments being processed by a pipeline (all of them do so in four phases, rather than the 64 or 256 that seem typical of older GPUs). Effectively it's a "short" (in time, dozens of cycles) and "wide" (4 quads in Xenos and 3 in R580) pipeline architecture - the polar opposite of the old-fashioned fixed-function pipeline that was a single quad (or half a quad, or a single fragment) and spent hundreds of cycles processing each instruction.
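
As a quick back-of-envelope on those numbers (thread size in quads = array width in quads x number of phases; the 1-quad-wide entry for the 4-quad thread is my assumption):

```python
# Thread sizes implied by "array width x 4 phases". The 1-quad-wide
# entry is an assumption on my part for the 4-quad case.

PHASES = 4
configs = {
    "Xenos (4 quads wide)": 4,
    "R580  (3 quads wide)": 3,
    "1 quad wide (assumed)": 1,
}

for name, width in configs.items():
    quads = width * PHASES
    print(f"{name}: thread = {quads} quads = {quads * 4} fragments, "
          f"over {PHASES} phases")
```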

Xenos and R5xx can be built like this because of decoupled texturing and the ability to schedule very frequent shader state changes. And both arithmetic and texturing in these GPUs gain from lowered transistor overheads in terms of decode, fetch, store and control transistors (well, I presume they do!).

The scheduling complexity and its inherent transistor count overhead seems to be a given if you want to build an architecture like Xenos or R5xx that decouples arithmetic and texturing and also provides for small thread sizes. So their wide arrays, with the lowered overheads for decode etc. are a way to tackle the overhead incurred in having this crazy scheduler (oh and the increased register file that comes with it).

Well, that's the way I see it for ATI, at least. Whether R600 has such wide shader arrays is still something I can't decide on. Right now I'm leaning towards wide arrays for R600... Why not even wider?...

---

In your proposal for 48, 44, 40 etc. variants you do create a problem in that certain parts of the GPU end up "oversized", e.g. if you have 40 arithmetic pipelines and 16 texturing pipelines, the GPU is "out of proportion". Sure, no-one will complain at the performance, but parts of the GPU will be oversized. Beggars can't be choosers, you might say - at least the die is still useful.

Well, I don't know. As silent_guy remarked recently, DRAMs come out with a 98%+ yield rate, because the extraordinarily high parallelism of a DRAM enables "just the right amount" of redundancy to be hidden within each die. Sure, that makes each die bigger, but clearly there's a point on the yield curve that makes a lot of sense to aim for. These fine-grained redundancy patents seem to be saying the same thing.

You could have a pool of processors, where each and every one has a distinct shader state and where there's a 1:8 or 1:16 redundancy - that's entirely feasible.

But I think the overriding problem with the "pool of 128 processors" view is that if you want to give each processor a distinct shader state you have a massive explosion in shader program storage, shader state storage, decode logic and fetch/store pathway control logic. The overheads multiply really fast. Though I don't have a decent idea of the quantities of transistors we're talking about :oops:

Jawed
 
Reading the first one, it occurs to me that decoupling and redundancy go hand in hand, so far as the first makes the second easier/more effective to implement. . .
 
And reading the second, I would think it would be much more effective with an array rather than per-ps. 50 ALUs needing 48 good ones seems to make much more sense than requiring 64 ALUs to be sure of having 48 good ones. . .
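
To put some rough numbers on that, here's a toy yield model - assuming, purely for illustration, that each ALU fails independently with the same probability; it's not a real defect model:

```python
# With no redundancy all 48 ALUs must be good; with an array-level spare
# pool of 50 only 48-of-50 need to be good. Each ALU is assumed to fail
# independently with probability p_fail (a gross simplification).

from math import comb

def yield_at_least(total, needed, p_fail):
    """Probability that at least `needed` of `total` ALUs are defect-free."""
    p_ok = 1.0 - p_fail
    return sum(comb(total, k) * p_ok**k * p_fail**(total - k)
               for k in range(needed, total + 1))

for p in (0.005, 0.01, 0.02):
    no_spares  = yield_at_least(48, 48, p)   # all 48 must work
    two_spares = yield_at_least(50, 48, p)   # 48 good out of 50 will do
    print(f"p_fail={p}: 48-of-48 yield={no_spares:.3f}, "
          f"48-of-50 yield={two_spares:.3f}")
```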
 
Bear in mind that each array would need to be balanced, so there would be some ALU redundancy per array, regardless of whether it was required or not.
 
Dave Baumann said:
Bear in mind that each array would need to be balanced, so there would be some ALU redundancy per array, regardless of whether it was required or not.

That's a performance point? I'm thinking more. . umm. . .marketing and repeatability. You can't sell unit 1 of said gpu to consumer A with X number of ALU's working on his workload while you sell unit 2 of said gpu to consumer B with X-1 number of ALU's working on his workload. . . at least not under the same board name. :smile: That's what redundancy-for-yields is all about, right? But anyway, it seems to me that decoupling and arrays is what really leverages redundancy-for-yields into something that is much more cost effective to do (really, anything that increases granularity of addressing units would theoretically marginally increase your capability to engage in redundancy-for-yields, I'd think) . . .which is not to say that decoupling and arrays don't have other virtues as well.
 
My point was that if you had an organisation that used a single ALU in each array for redundancy, but had multiple arrays, you'd still be turning more off than is needed - i.e. if there are 4 arrays, 4 ALUs need to be disabled all the time. In this instance, even if there were no defects in a die, it's likely that those 4 ALUs would need to be turned off, with no possibility of a higher-performance SKU, as the arrangement within the arrays will still be working on quads and an even number of ALUs is required.

In fact, if we apply this to a Xenos-like arrangement and two ALUs in a single array are hit with defects, then the whole array is buggered, because its operation relies on having a set number of ALUs. In that case, the entire array and the extra redundant ALUs in the other arrays would need to be turned off and the die sold as a much lower performance chip (if at all).
 
Ah, okay. Sure, in any design there's going to be a point where enough faults in the wrong spots cause it to break down and you're buggered. The trick is maximizing the point on the curve where you avoid that for the maximum number of dies at the minimal silicon cost of those extra units, right? And arrays help you do that more so than, say, ps-level redundancy, or redundancy inside the ps. Even three arrays with two extra units each is significantly better leveraging than one extra unit each in 16 ps (6 vs 16), while providing more "protection" against faults (two bad ALUs in one ps and you're screwed, vs three bad ALUs in the array before you're screwed).
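
Using the same toy independent-failure assumption, the array-versus-per-ps comparison works out as below; the 3-ALUs-per-ps figure is my assumption, chosen so both layouts target 48 working ALUs (6 spares in the array case, 16 in the per-ps case):

```python
# Compare two hypothetical layouts for 48 working ALUs:
#  - 3 arrays of 16+2 ALUs: die survives if no array has 3+ bad ALUs
#  - 16 ps units of 3+1 ALUs: die survives if no unit has 2+ bad ALUs
# Each ALU fails independently with probability p_fail (toy model).

from math import comb

def block_survives(total, spares, p_fail):
    """Probability a block has at most `spares` defective ALUs."""
    return sum(comb(total, k) * p_fail**k * (1.0 - p_fail)**(total - k)
               for k in range(spares + 1))

for p in (0.01, 0.02, 0.05):
    arrays = block_survives(18, 2, p) ** 3    # 3 arrays, 2 spares each
    per_ps = block_survives(4, 1, p) ** 16    # 16 units, 1 spare each
    print(f"p_fail={p}: 3x(16+2) yield={arrays:.3f}, "
          f"16x(3+1) yield={per_ps:.3f}")
```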

I've seen, it seems to me, some numbers on faults per wafer. What I haven't seen is how. . .err, "big"?, a fault typically is compared to the average functional unit.
 
Yes, in this case there has to be a switch-over in costs between all operable cores carrying a certain level of redundant silicon, irrespective of whether a defect hit that die or not, and the "top end with no defects, lower end with some" models. In this case we are talking about something closer to PS3 / Cell, where they are building with 8 SPUs and only enabling 7 in all the PS3s.
 