RXXX Series Roadmap from AnandTech

If the R5xx architecture is built upon a multiple-batch scheduler, like Xenos, then not only does it gain dynamic flow control, but it also gets the efficiency gains of Xenos, from the 50-70% utilisation that R420 can achieve to the 95% efficiency that Xenos can deliver.
How does reducing your batch size improve your utilization, for non-branching shaders? If anything, you make your memory requests less coherent, which would tend to diminish your efficiency.
 
Bob said:
How does reducing your batch size improve your utilization, for non-branching shaders? If anything, you make your memory requests less coherent, which would tend to diminish your efficiency.
Well, the question I would ask in response is: what constitutes a batch, and how does the hardware deal with situations where it can't fill a batch?

If your batches just aren't getting filled, then it would make sense to reduce the batch size within the GPU, and optimize for the smaller batch case. Or you could allow your batches to be a bit more different, increasing the batch size at the expense of something else (probably local storage, and therefore transistors, but also potentially memory accesses could suffer).

Edit: By the way, I think the best way to deal with branching and batches is not to change the size of the batches explicitly, but rather allow a few separate batches to run using the same space at once. If a branch separates one batch into two, then, it wouldn't halt the pipeline at all. It would potentially reduce the efficiency of texture reads, but shouldn't otherwise impact the pipeline.
 
Bob, I'm not suggesting that batch-size is the key-driver of utilisation, here (although it's relevant).

Instead what I'm saying is that utilisation will be enhanced, because a Xenos style multiple-batch shader array "can't be stalled" in the general case. If the shader array can run a different batch on every successive cycle, then an instruction that would cause a stall in a conventional architecture has no effect in a multiple-batch architecture, where the next batch's instruction is executed (it might even be the same "stall-inducing" instruction as the previous batch - but that won't have any effect because on the next cycle another batch will be executed).

An example might be a shader that uses a texture result in a calculation immediately after the texture operation - not as a dependent texture read, but in an ALU calculation, a MUL, say. In a conventional architecture, if the scheduler can't fill the latency with other instructions, then a stall will occur. Xenos isn't stalled by this, so is able to run at around 95% utilisation. The other 5% seems to come from pixels near the edges of triangles (part of a quad at the edge of the triangle, but not part of the triangle itself). Maybe some other things too. (Apologies, I'm sure this is like teaching grandma to suck eggs - I'm just writing this to set out my understanding and elicit corrections if I'm wrong.)
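
As a rough illustration of the stall argument above, here is a toy model, not a description of how Xenos or R420 actually schedule. It assumes a single ALU issue slot per cycle, strict round-robin between batches, and that every fourth instruction is a texture fetch whose result is needed by the very next instruction. The function name, the latency of 20 cycles and the batch counts are all made up for the sketch.

```python
def utilisation(num_batches, tex_latency, instrs_per_batch=100, tex_every=4):
    """Fraction of cycles in which the (hypothetical) ALU issues an instruction.

    Every `tex_every`-th instruction of a batch is a texture fetch; the batch
    cannot issue again until `tex_latency` cycles after that fetch.
    """
    ready_at = [0] * num_batches            # earliest cycle each batch may issue
    remaining = [instrs_per_batch] * num_batches
    issued = [0] * num_batches
    cycle = busy = nxt = 0
    while any(remaining):
        # Round-robin search for a batch that is not waiting on a texture.
        for i in range(num_batches):
            b = (nxt + i) % num_batches
            if remaining[b] and ready_at[b] <= cycle:
                remaining[b] -= 1
                issued[b] += 1
                busy += 1
                if issued[b] % tex_every == 0:
                    # Dependent instruction must wait for the texture result.
                    ready_at[b] = cycle + 1 + tex_latency
                nxt = (b + 1) % num_batches
                break
        cycle += 1                          # idle cycle if nothing was ready
    return busy / cycle


if __name__ == "__main__":
    for n in (1, 8, 32):
        print(n, "batches:", round(utilisation(n, tex_latency=20), 2))
```

With one batch the ALU spends most of its time waiting on the texture; with enough batches in flight to cover the latency, the same stall-inducing instruction costs nothing. Note that in this model a handful of batches is not enough, which fits the idea of needing many batches in flight.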

The Xenos architecture simply sets the minimum size of a batch much smaller than it is in conventional architectures. Well, that's my understanding.

Coherent memory accesses for textures are presumably driven by pre-fetching. In X800XT, for example, if one quad is issuing texture fetches, then it's only issuing four texture fetches per clock. The other three quads might be running different shaders.

Jawed
 
Well, you really don't want to be switching batches every other instruction for efficient memory access, though (both input and output). I would bet that with the Xenos, batches are executed sequentially, but the architecture just has a method of storing multiple batches, so that there is a minimal amount of time required to change between the batches.
 
Don't understand the point you're making about efficient memory access (input and output).

Texture accesses are the only operations that generate inputs (once the shader program and any constants have been loaded) and ROP operations are the only operations that generate outputs.

The former gains coherency with pre-fetching and the latter is stuck with whatever it can get. Hence Xenos's daughter die, which can re-order pixel colour/z/stencil data to make memory accesses coherent.

Jawed
 
Well, for texture accesses you wouldn't want to be continually switching between two batches because you might possibly be switching between two textures, which would require twice as much texture cache for the same efficiency. Much better to execute one batch at a time. Storage on-chip for multiple batches would basically just act to prevent stalls from batch switching, but you'd still want your batches to be nice and coherent.
 
PurplePigeon said:
So apparently ATI had investor meetings recently with brokerage(s) and one of the notes from an analyst states that:



Anybody know what that "soft ground" issue refers to?
Could mean various things, but a signal that's grounded at one voltage might not stay that way at a higher voltage if it's "soft".
 
Chalnoth, I don't think that's true at all - the total number of fragments in flight can be the same for an architecture that works on a few large batches as for one that works on many small batches (so the total number of texels required for a lookup per fragment in flight should be similar). Assuming a small batch still has some kind of screen-space coherency (for example it corresponds to a 4x4 or 8x8 pixel tile) the cache space requirements for a given hit-rate will go up somewhat (texturing a small tile shouldn't evict texels required for texturing the tile borders), but I would be amazed if it were by a factor of 2x...

edited for clarity
 
Well, what I'm saying is that you want to be doing it in the right order. If you're executing many different batches at once you're probably not going to be very good about memory accesses.
 
I expect that the memory requirements of a multiple-small-batch scheduling architecture are somewhat higher than a conventional one - apart from anything else, instead of having only a few batches in flight (hence program counters, constants and texels in cache) you've got dozens if not hundreds of batches in flight.

But the batches are issued in order and they will tend to execute in order, round-robin. The batch order will get broken up by dynamic flow control. Then it's a matter of if there are any texture fetches in the "else" clause, or whatever it is that is rarely executed as a result of the flow control. If the else clause only executes a few times across hundreds of batches, then each successive texture fetch will prolly find that the texture data pre-fetched by the previous instance has been flushed.

So that kind of texture fetch will definitely consume a disproportionately large amount of texture bandwidth.
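
To put a rough shape on that, here is a toy LRU cache model. It is only a sketch: the cache size, the line granularity, the LRU policy and the "one batch in a hundred takes the else clause" rate are all assumptions picked for illustration, not figures for any real GPU.

```python
from collections import OrderedDict


def simulate(num_batches=400, cache_lines=16, rare_every=100, batches_per_line=4):
    """Count texture-cache misses for common and rarely executed fetches.

    Hypothetical setup: consecutive batches share a cache line of the main
    texture (coherent access), while every `rare_every`-th batch also takes
    an "else" clause that fetches the same line of a second texture.
    """
    cache = OrderedDict()                          # keys ordered oldest to newest
    stats = {"common": [0, 0], "rare": [0, 0]}     # [fetches, misses]

    def fetch(line, kind):
        stats[kind][0] += 1
        if line in cache:
            cache.move_to_end(line)                # hit: mark as most recent
        else:
            stats[kind][1] += 1                    # miss: read from main memory
            cache[line] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)          # evict least recently used

    for b in range(num_batches):
        fetch(("main", b // batches_per_line), "common")
        if b % rare_every == 0:
            fetch(("else", 0), "rare")
    return stats


if __name__ == "__main__":
    for kind, (fetches, misses) in simulate().items():
        print(kind, "fetches:", fetches, "misses:", misses)
```

In this model the common fetches miss once per cache line and then hit, while the rare fetch misses every single time even though it always wants the same texels, because they have been flushed by the time the else clause runs again.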

X800XT has prolly got either 16KB or 32KB of texture cache in it, total - 4KB or 8KB per quad TMU. It's a very low base from which to start adding cache, if Xenos (for example) needs extra cache to support its, effectively, randomised texture fetching. It's worth remembering that the four quads of X800XT are 4-way MIMD (as a group), so taken as a group they issue randomised texture fetches, anyway. Not as random as the batches of Xenos, maybe, but still more random than in, say, NV40.

NV40 has a two-level cache architecture because all quads are sharing a texture, but the quads don't perform coherent fragment shading as they all share the workload for a triangle at a time (or multiple triangles in a batch), with pixel-quads allocated round-robin.

Obviously, being in the dark about batch sizes in these architectures really doesn't help.

In the end, pre-fetching is the main solution to the coherency problem. It's then a matter of sizing the texture cache to support the more randomised access patterns. In the end, the vast majority of texture fetches are going to be fairly coherent, and pre-fetched texels will stay in cache long enough that they won't need to be fetched multiple times.

It's only going to be the exceptional cases (texture fetches in rarely executed clauses) that cause multiple main memory reads.

(I should point out that my earlier post talking about batch switching upon a texture-fetch instruction isn't correct - realised this in bed just before I got up, sigh. Not a major thing, but I'll leave it as is because we're past that point now.)

Jawed
 
Jawed said:
It's a very low base from which to start adding cache, if Xenos (for example) needs extra cache to support its, effectively, randomised texture filtering.
Why would Xenos have a more 'randomised' texture filtering than any other programmable GPU out there?
 
Dave Baumann said:
Unfortunately, Rys, the concept falls down a little in saying that it's just one fragment quad, as that would indicate that the ALUs are deep, i.e. operating on a single pixel with multiple instructions, which would go against the idea of "12 pipelines". I know where you are coming from, but in this case it's more likely that the "pipelines" are no longer single quad pipelines, but pipelines that can handle multiple simultaneous quads.

Is this the new Memory Manager?

US
 
nAo said:
Why would Xenos have a more 'randomised' texture filtering than any other programmable GPU out there?
Whoops, that should be texture fetching, not filtering.

Sigh.

Jawed
 
Jawed said:
Roughly 50% extra utilisation means that R520's 16 pipelines could perform like 24 R420 pipelines (clock for clock).

That would be good .. especially if the R520 can do SM3.0 at those speeds. The R420 is no slouch.

Jawed said:
I wonder what this means for R580. It seems to me that a 48 pipe design (if that's really how things scale up) would be best implemented as 3 shader arrays. Which makes me think that perhaps RV530 is also 3 shader arrays.

Dave did say that the RV530 would give insight to the R520 .. so I can see that it has those 3 shader arrays you're talking about.

US
 
ERK said:
Or maybe the falling clock edges are too long, since the ground cannot sink enough current?
So, overall, is this likely to be a layout problem? Will extra grounding, or rerouting of grounding be the fix?

If they finally found this problem in July, how are they going to release R520/RV530 in October? Is that enough time for re-working it?

Does this also affect R580?...

It's interesting that Xenos escaped - is that a reflection of it being a smaller die (without ROPs)? It's smaller than R520, I expect. But at the same time I wouldn't be surprised if RV530 is smaller still...

Jawed
 
Well, if the R520XL is 500/500, I doubt that needs another respin. So they could release the R520XL earlier than the R520XT, with the XT only being produced on the respin. The idea there would probably be that the R520XL is a reply to 7800GT(X), and the R520XT to the 7800 Ultra, if there will be such a part.

Still, if that's the case, I'm not sure how ATI expects a 500/500 chip (or, well, anything below 600/600) to beat the 7800GTX, unless there's more to the ALU performance than 16-1-1-1...The 6800U already was more powerful, *per-clock*, than the Radeon X850XT-PE...

Uttar
P.S.: Anandtech says the XL will be released "early September" and the XT "early October". The CF equivalents only coming in "mid October" is ridiculous though; ATI's CF strategy is laughable at best.
 
Uttar, good thinking about the XL - if 500MHz is indeed achievable - overclockers will hate it, :LOL:

I think a Xenos style scheduler could bring a 50% performance improvement due to utilisation. As I said earlier, 16 R520 pipelines could perform, clock-for-clock, like 24 R420 pipelines.
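
For what it's worth, the arithmetic behind that figure can be sketched as below. It assumes throughput scales linearly with utilisation, clock for clock, with the same ALU capability per pipeline; the function name is just for illustration.

```python
def equivalent_r420_pipes(pipes, new_util, old_util):
    """Number of old-style pipelines needed to match `pipes` new ones,
    assuming throughput is simply pipelines x utilisation (same clock)."""
    return pipes * new_util / old_util


if __name__ == "__main__":
    # 50-70% is the R420 utilisation range quoted earlier in the thread,
    # 95% the Xenos-style figure.
    for old_util in (0.50, 0.95 / 1.5, 0.70):
        eq = equivalent_r420_pipes(16, 0.95, old_util)
        print(f"R420 at {old_util:.0%}: 16 pipes behave like {eq:.1f}")
```

The 24-pipe figure corresponds to an R420 baseline of about 63% utilisation; across the quoted 50-70% range the equivalent comes out somewhere between roughly 22 and 30 pipes.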

Jawed
 
Unknown Soldier said:
Dave did say that the RV530 would give insight to the R520 .. so I can see that it has those 3 shader arrays you're talking about.
Errr, no. That's almost the total opposite of what I said.

Uttar said:
P.S.: Anandtech says the XL will be released "early September" and the XT "early October". The CF equivalents only coming in "mid October" is ridiculous though; ATI's CF strategy is laughable at best.

They need to design new boards for it, and with the way things are they are probably working on a fairly compressed schedule. R5xx was probably too late in the design phase to make changes for any different silicon to support two boards, so they probably still need the composite chip though. However, they are not saddled with the board issues so there could be changes there - personally I think they should dump the master concept and, like the Alienware solution, put the composite chip on a separate board that could be sold for $50 or so, and then have internal connectors on the R5xx boards that connect to the composite board. Doing it this way would mean that vendors only have to carry one extra, cheap, SKU (the composite board) rather than multiple "Master" boards, and gives the end user greater flexibility in maximising their particular setup.

I've suggested that to them and have yet to get a satisfactory explanation for not doing it that way.
 
The other thing I've realised is that 48 pipes for R580, arranged in three arrays of 16 pipes each, wouldn't be such a huge device.

Xenos has 48 pipes, and fits that into 232M transistors. Sure there's no ROPs (50M+ transistors?) and there's no separate vertex pipelines (40M transistors? wild guess). But it's worth remembering that G70 has 48 ALUs in its fragment shaders, so it can hardly be beyond the bounds of 90nm tech.

It seems that R520 was just the first cut at a shader array architecture for the PC space (that should have been in the market since May...) and to keep it simple ATI designed it with just 16 pipes, making for a relatively small core - relying upon the increased utilisation brought about by the new scheduler.

This should have suited the "keep it small on a new process" brigade - under 250M transistors? - but the wheels fell off with this soft ground problem.

Ah well.

Jawed
 
Dave Baumann said:
Errr, no. That's almost the total opposite of what I said.

Oops .. sorry Dave .. now that I think of it you did say the R520 will give insight to the RV530 (or something similar). I tried looking for the quote but couldn't find it though (gotta save them quotes somewhere).

Or was it the RV530 gives insight to the R580?

Eish .. now only to find that quote.
 