How close is R5xx to R600?

Geo

How much do we think is re-used? I would have thought the scheduler and the ring bus, and maybe not much else. For instance, would those branch execution units work well in a unified arch? If so, how?

Yet we get this from Eric (and thanks to Hanners for pointing it out over at EB, though I'm surprised we all missed it for six weeks!):

Well, I won't comment on unannounced products, but there's a lot invested into a new generation. About 110 man years for the R5xx generation. So, trying to maximize the number of parts we can get from it is important, to justify all the investment. The R5xx series was designed to be more flexible than previous architectures, since the metrics of yesterday have become less meaningful. 2 years ago, it mattered more how many “pipelines” you have, perhaps with some notion of the number of Z's or textures per pipe, but the basic metric was that. Today, we have moved away from that paradigm. Today, applications don't use fix function pipelines anymore, but create powerful shader programs to execute on the HW. It's not “how many pixels can you pop out per second?”, it's “what is the throughput of your shader?”. Our R5xx architecture has moved away from simply scaling of pipelines, to now scaling in terms of ALU operations, texture operations, flow control, Z operations as well as more traditional raster operations (all of this bathing in a design that can maximize the work done by each part). So will there be a 1 GHz 32 pipeline R5xx part? Well, we've ceased to measure things that way, so it won't be so easy to describe. But, yes, we will have more parts from this generation :)

In context, that sounds to me like more than X1900. . . after all, 110 man years! Even more so, when you read the question itself, which was pointing out that R300 became R420 without fundamental changes.

http://prohardver.hu/c.php?mod=20&id=996
 
If it's unified then the scheduler will be different (it may be a case of how close is the R5xx scheduler to Xenos's?).

FYI - In contrast, Eric told me the development time for R420. Although I forget his actual wording, I believe it could be measured in man months rather than years.
 
My unkind, smart-ass internal comment was "Yeah, only 55 man-years if the blasted thing had worked in May!" :cool:

The more I thought about this (and with a little help in IRC), it's probably quite a bit more than I was thinking would translate -- though you've swept the scheduler off the table now (or at least it requires more work).

Assuming the existing PS is the starting block moving forward, there could be quite a lot of re-use. The branch execution unit is already there, the texturing is already hooked up there -- and this would help explain why they spent so much time/effort on the texturing this time. No reason to think the register array and Z-stuff wouldn't come across pretty well. Etc.

So if I started thinking of R600 as requiring adding geometry capability to the existing R580 PS units, would I be headed in a reasonable direction?
 
I think the scheduler concepts in R5xx and Xenos are extremely similar.

What's intriguing about scheduling in both architectures is that they each use a 4-phase pipeline. In Xenos a thread consists of 64 vertices or fragments - and it takes four phases to process one instruction across all 64.

R5xx is similar, though the size of a thread varies: RV515 and R520 both use a thread size of 16, with four phases of 4 fragments. RV530 and R580 use 48-fragment threads, in four phases of 12 each.

So it seems to me that there's a very strong similarity in pipeline architecture comparing the two.

What's not so clear is how R5xx schedules threads in the ALU pipeline. Xenos seemingly uses an AAAABBBB thread scheduling pattern, so that four phases from thread A are executed, then four from thread B.

It's not clear if this technique is used in R5xx or what the reason for using it in Xenos is. Arguably, with higher clocks, more elaborate thread scheduling is required because pipelines tend to lengthen when clocks are raised - e.g. AAAABBBBCCCCDDDD ...
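
For what it's worth, here's a toy Python sketch of that phase arithmetic -- the chip names and thread sizes are the figures quoted above, but the function and the notion of an "issue order" are purely my own illustration, not how the hardware actually sequences anything:

```python
# Toy model of the 4-phase thread issue discussed above. Thread sizes are the
# quoted ones (Xenos 64, RV515/R520 16, RV530/R580 48), always split over
# four phases; everything else is invented for illustration.

PHASES = 4
THREAD_SIZE = {"Xenos": 64, "RV515": 16, "R520": 16, "RV530": 48, "R580": 48}

def issue_order(chip, threads):
    """Each thread runs its four phases back to back, so ["A", "B"] gives the
    AAAABBBB pattern and ["A", "B", "C", "D"] gives AAAABBBBCCCCDDDD.
    Each phase covers thread_size / 4 fragments."""
    per_phase = THREAD_SIZE[chip] // PHASES
    order = []
    for t in threads:
        order.extend((t, phase, per_phase) for phase in range(PHASES))
    return order

# R580: 48-fragment threads, i.e. 12 fragments per phase
print("".join(t for t, _, _ in issue_order("R580", ["A", "B"])))  # AAAABBBB
```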

---

Texture-width is something that I've pointed out before.

In Xenos the texturing width is 4 quads - all effectively in unison. There is only one high-level scheduler controlling access to the texturing unit, so all texturing in Xenos is single-threaded.

In R5xx the texturing width is 1 quad. But scheduling also appears to be controlled on a per-shader-unit basis - so that each of the four texturing quads in R520 and R580 is operating independently.

So it begs the question of whether R600 is a narrow-texturing design like R5xx (which appears to be predicated on the use of screen-space tiling to create locality for each of R5xx's single-tier texture caches) or a wide design like Xenos (which prolly doesn't use screen-space tiling, and may well use an L1/L2 texture cache architecture).
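
To make the contrast concrete, here's a rough structural sketch in Python -- the TextureUnit class and the way I've assigned threads are my own framing of the two layouts, not anything documented:

```python
# Structural sketch only: one wide, single-threaded texture unit (Xenos-style)
# versus four narrow, independently scheduled units (R5xx-style).

class TextureUnit:
    def __init__(self, name, quads_wide):
        self.name = name
        self.quads_wide = quads_wide     # quads serviced together per request
        self.current_thread = None       # thread it is currently fetching for

# Xenos-style: one high-level scheduler, one 4-quad-wide unit, so texturing
# is effectively single-threaded -- everything queues behind the same unit.
xenos_tmus = [TextureUnit("wide", quads_wide=4)]
xenos_tmus[0].current_thread = "A"

# R5xx-style: four 1-quad units (one per shader quad, each with its own
# cache), scheduled per shader unit, so each can serve a different thread.
r5xx_tmus = [TextureUnit(f"quad{i}", quads_wide=1) for i in range(4)]
for unit, thread in zip(r5xx_tmus, ["A", "B", "C", "D"]):
    unit.current_thread = thread         # four unrelated threads in flight
```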

It's worth mentioning vertex fetch at this point, as vertex fetch can also be thought of as point-sampling of textures - though ATI's own documentation is at pains to suggest that vertex fetch is best-suited to 1D data structures, not 2D or 3D textures.

Anyway, Xenos's scheduling and granularity of vertex fetching are not clear to me, but I expect it's a 16-wide operation, too...

---

Xenos's ALU architecture, though, is quite different.

Instead of the 3+1 main, 3+1 mini ALU organisation of R5xx's fragment pipelines (and R4xx and R3xx) coupled with the 4+1 organisation of its vertex pipelines, Xenos uses a 4 main + 1 mini ALU organisation.

(R5xx's vertex pipelines may actually be 4 main + 1 mini - cloaked in mystery as far as I can tell.)

4 main + 1 mini seems easier to schedule - as the only type of dual-issue that might occur is one involving a scalar operation, e.g. RSQ (less choice of source operands to worry about the dependencies of!). In other words the 1 mini ALU might spend a lot of time idle, but fewer transistors will be wasted while it sits idle as compared to the mini ALU in R5xx.
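
As a hand-wavy illustration of why the 4+1 split is simpler to co-issue -- the instruction names and the packing rule are invented for the example, not ATI's actual issue logic:

```python
# Sketch of co-issue packing for a 4 main + 1 mini ALU: a vec4 op always fills
# the main slot, and the mini slot only ever takes an independent scalar.

def pack_4_plus_1(instructions):
    """instructions is a list of (op, width) with width 4 = vec4, 1 = scalar.
    Returns (main_slot, mini_slot) pairs; the mini slot is often empty."""
    slots = []
    pending_scalar = None
    for op, width in instructions:
        if width == 1 and pending_scalar is None:
            pending_scalar = op            # park the scalar for co-issue
        else:
            slots.append((op, pending_scalar))
            pending_scalar = None
    if pending_scalar:
        slots.append((None, pending_scalar))
    return slots

# A vec4 MAD co-issued with an RSQ, then a lone vec4 DP4 with the mini idle
print(pack_4_plus_1([("RSQ", 1), ("MAD", 4), ("DP4", 4)]))
# [('MAD', 'RSQ'), ('DP4', None)]
```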

So I expect R600 is more like Xenos than R5xx.

---

Back-end is a part of Xenos that gets ignored. It's responsible for soaking up the vertices/fragments generated by the shader units and buffering them up and deciding what next to do with them. It's a stage that's not obviously there in R5xx (since the ROPs handle fragments and the post-vertex cache handles vertices), but is implicit in a USA as far as I can tell. The back-end also carries a fair amount of weight in the load-balancing of a USA, due to its "unified buffering" responsibility.

So, there's a great lump of functionality in R600 that will be inherited from Xenos - it's a big chunk of die in Xenos.

Xenos's back-end is not necessarily a close match for R600's back-end, though, because Xenos has to prepare fragments to be shipped off-die to the EDRAM unit, and perform other interfacing operations with respect to the EDRAM and, for example, AA resolve - all of which means there's enough differences that it's more of a conceptual translation, I guess.

---

Overall I think Xenos has more to tell about R600.

The "width" of texture processing in R600 is the great unknown in my view - 1-quad as in R5xx or 4-quad as in Xenos - and I haven't really worked out whether screen-space tiling holds the key.

In theory it does (as it seems extremely unlikely that ATI would drop screen-space tiling, since it would unravel part of the CrossFire architecture), and in theory that implies 1-quad texturing - but I dunno...

Jawed
 
geo said:
Assuming the existing PS is the starting block moving forward...

As opposed to the units in Xenos? You'd think R600 might have something resembling the "shader processor" from the R400 development line (inc. sequencer/arbitrator) - and similar texturing functionality, should this be the case. Then ring bus/MC/ROP/integrated bits from R520.

Edit - Oop, Jawed there first. Albeit with more brains. :LOL:
 
MuFu said:
As opposed to the units in Xenos? You'd think R600 might have something resembling the "shader processor" from the R400 development line (inc. sequencer/arbitrator) - and similar texturing functionality, should this be the case. Then ring bus/MC/ROP/integrated bits from R520.

Edit - Oop, Jawed there first. Albeit with more brains. :LOL:

Many of us appreciate the "Executive summary" approach as well! :LOL:
 
I hadn't thought of the ring bus, actually. That's one feature that's here to stay.

And the important thing about it is that it supports lots of thinly spread clients quite happily. To me this would imply that R600 is a narrow, 1 quad, texturing design - with shader processing arranged around distinct texture and vertex fetch units (each with their own cache) and with screen-space tiling.

So, thanks for the prod Mufu.

Jawed
 
sireric said:
I know! I know!

:)

If I weren't laughing so hard I'd smack you upside the head with the Sacred Salmon of Correction. :LOL:

Obviously you can't talk about unannounced products. But feel free, oh say tomorrow afternoon after about 2pm GMT. . .to drop by and give an opinion on which elements of your current top dog (at the time of posting, 'natch) seem. . .errr, "built to last". ;)
 
From the sound of things, I wouldn't expect adding GS capabilities to a USC to be particularly complicated.

My question would rather be whether TMUs will end up with more or fewer capabilities. With a large pool of TMUs available, I wonder if they even need to be capable of bilinear in the end.
 
Better yet, ask Eric how many man months* from R500+R520 to R600. :smile:

* Warning: The exaggeration in my humor may be larger than it appears.
 
Pete said:
Better yet, ask Eric how many man months* from R400+R520 to R600. :smile:

* Warning: The exaggeration in my humor may be larger than it appears.

Humor aside, I'd first have to know whether, and how many, R400 elements survived into R600.
 
Actually, I always thought that TMUs need more fine-grained programmer control. Option 1 is to simplify them and make them point samplers which can then be filtered in the shader. However, I think having N uncoupled texture load instructions in the shader is going to be less efficient than if the HW is aware of an implicit grouping.

Thus, even if the filter kernel is moved to the PS pipeline, I still believe that the fetching of the samples will remain fixed-function.

Here's my half-serious proposal, or, Option 2 for FutureTMU. Evolve Fetch4 into a FetchNxM. Allow the programmer to send an array of constants specifying a sampling grid (possibly sparse and anisotropic) along with the texture coordinates. The hardware grabs the samples and places them into an array of temporary registers, and then the pixel shader applies a filter kernel.
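
As a purely software mock-up of that idea (nothing here corresponds to a real API -- fetch_nxm, the offset grid and the weights are all invented for illustration):

```python
# Mock-up of the hypothetical FetchNxM: fixed-function gather of a
# programmer-specified sample grid, then a programmable filter in the shader.

def fetch_nxm(texture, u, v, offsets):
    """Fixed-function part: point-sample (u,v) plus each offset in a
    programmer-supplied (possibly sparse/anisotropic) grid into 'registers'."""
    h, w = len(texture), len(texture[0])
    return [texture[min(max(int(v + dv), 0), h - 1)]
                   [min(max(int(u + du), 0), w - 1)]
            for du, dv in offsets]

def shader_filter(samples, kernel):
    """Programmable part: the pixel shader applies whatever kernel it likes."""
    return sum(s * k for s, k in zip(samples, kernel))

# 2x2 grid plus bilinear weights for fractional coords (0.3, 0.6): filtering
# done entirely in the 'shader' on top of fixed-function point fetches.
tex = [[0.0, 1.0], [0.5, 0.25]]
samples = fetch_nxm(tex, 0.3, 0.6, [(0, 0), (1, 0), (0, 1), (1, 1)])
print(shader_filter(samples, [0.28, 0.12, 0.42, 0.18]))  # 0.375
```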

DX10 has made output programmable (can programmable framebuffer blending now be achieved? Still no access to reading the pixel at the current framebuffer screen position?), and it has generalized input grouping somewhat, but perhaps the next step is programmable fetch and group?
 
DemoCoder said:
Actually, I always thought that TMUs need more fine-grained programmer control. Option 1 is to simplify them and make them point samplers which can then be filtered in the shader. However, I think having N uncoupled texture load instructions in the shader is going to be less efficient than if the HW is aware of an implicit grouping.

Thus, even if the filter kernel is moved to the PS pipeline, I still believe that the fetching of the samples will remain fixed-function.
I've been wondering about this over the past couple of days as well, but wondering whether there would be much difficulty in decoupling the filter from the sampler and having an array of each, instead of removing the filter entirely.

Alternatively, Xenos may have another answer. It has an array of "vertex fetch" units, which are really just point samplers, and I assume there shouldn't be an issue with using these for single-format textures alongside the standard bilinear units.
 