RXXX Series Roadmap from AnandTech

BTW, 16 x 3 is 48 which is a number that I have seen somewhere ... What transistor difference could we expect from the R520 to the R580?

Xenos. The ALU to tex ratio between Xenos and R580 would also be very close (give or take the fact that Xenos's ALU's are also vertex processing but have a scalar unit).
 
Dave Baumann said:
Xenos. The ALU to tex ratio between Xenos and R580 would also be very close (give or take the fact that Xenos's ALU's are also vertex processing but have a scalar unit).

If I remember correctly the ratio would be 1:3, at least if you forget about the vertex/point sampling texture units. That ratio in fact seems correct for the new shader heavy applications but it's a bit low for some older applications, the likes of UT2004 where there is more texturing per arithmetic op (the correct ratio for those should be around 1:2). Not that those games require more performance from current GPUs (unless you want to beat someone on old benchmarks).

So what ratio would make sense for a mid-range GPU, the RV530, 2:3 or 1:3? What is the 2 and what is the 1? If it wasn't an architecture description only for ATI GPUs the second number could be the number of ALUs 'in cascade' and the fourth the texture to ALU ratio. But if that isn't the case I would go more for the second number being the texture:ALU ratio and the fourth being the ROP:ALU ratio. Therefore RV530 would have 1:3 texture/ALU ratio and 2:3 ROP/ALU ratio. R520 and RV515 would have 1:1 ratios for both texture/ALUs and ROP/ALUs and R580 would be 1:3 for both ratios. Xenos is 1:3 for texture and 1:6 for ROPs if my memory works well (which doesn't happen often ;).
 
So, current speculation has it that RV530 is one array of 12-fragments per clock? SIMD, one program counter shared by all 12 fragment threads?

Similar to how one array of Xenos is 16 fragments per clock?

Jawed
 
Umm, doesn't R420 already have *two* ALU per fragment shader (plus the Tex)? So the mini is gone in R520?
 
geo said:
Umm, doesn't R420 already have *two* ALU per fragment shader (plus the Tex)? So the mini is gone in R520?

I wouldn't count miniALUs/modifiers as ALUs. They are just some additional stuff that is before and after the normal FMAD/special blocks. Much like the swizzlers and the result mask.
 
Jawed said:
So, current speculation has it that RV530 is one array of 12-fragments per clock? SIMD, one program counter shared by all 12 fragment threads?

Similar to how one array of Xenos is 16 fragments per clock?

Jawed

Who knows. But if my explanation makes sense RV530 and RV515 would be just as the current R4xx GPUs with 'quad shader processors'. I don't even know if R3xx/R4xx/NV3x/NV4x are SIMD or not.
 
Jawed said:
RV530 is a beast

Fragment shader rate:

RV530 - 7200M fragments/s
X800XL - 6400M fragments/s
X850XT - 8320M fragments/s

RV515 is a beast too:

RV515 - 3200M fragment/s
X550 - 1600M fragments/s
X300 - 1300M fragments/s
X700XT - 3800M fragments/s

I cannot get the RV515 number to make sense. It seems that it is clocked at 450MHz and I really cannot make that fit the 3200M fragments/s. By my math it looks something like this:

Code:
Card/chip	Mfragments/s	Pipes	Mfragments/pipe/s	Clockspeed	Fragments/pipe/cycle
X800xl		6400		16		400		400		1
X850xt		8320		16		520		520		1
X550		1600		4		400		400		1
X300		1300		4		325		325		1
X700xt		3800		8		475		475		1
RV515		3200		4		800		450		1.78
RV530		7200		4		1800		600		3
R520 guess1	9600		16		600		600		1
R520 guess2	17066.67	16		1066.67		600		1.78
R580 guess1	28800		16		1800		600		3

A score of 1800 Mfragments/s would make more sense for the RV515 and avoid the rather strange result of 1.78 fragments per pipe per cycle. If the RV515 was clocked at 400MHz it would come to 2 fragments per pipe per cycle, which would make more sense but would not fit the semi-confirmed 4-1-1-1 number (neither does 1.78). The fragment count per cycle would fit if the RV515 was clocked at 800MHz, but that seems out of the question.
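
For anyone who wants to redo the arithmetic, here is a minimal Python sketch. The figures are the ones quoted in the table above. The function names are made up for illustration, and the 450MHz RV515 clock is the rumoured value, not a confirmed one.

```python
# (chip, Mfragments/s, pipes, core clock in MHz), as quoted in the table above.
# The RV515 and RV530 rows are rumoured figures, not confirmed specs.
CHIPS = [
    ("X800XL", 6400, 16, 400),
    ("X850XT", 8320, 16, 520),
    ("X550", 1600, 4, 400),
    ("X300", 1300, 4, 325),
    ("X700XT", 3800, 8, 475),
    ("RV515", 3200, 4, 450),
    ("RV530", 7200, 4, 600),
]


def fragments_per_pipe_per_cycle(mfragments, pipes, clock_mhz):
    """Fragment rate divided by pipe count and by core clock."""
    return mfragments / pipes / clock_mhz


def implied_clock_mhz(mfragments, pipes, per_pipe_per_cycle):
    """Core clock needed to reach a given per-pipe, per-cycle rate."""
    return mfragments / pipes / per_pipe_per_cycle


if __name__ == "__main__":
    for name, rate, pipes, clock in CHIPS:
        ratio = fragments_per_pipe_per_cycle(rate, pipes, clock)
        print(f"{name:8s} {ratio:.2f} fragments/pipe/cycle")
    # Clock the RV515 would need for a whole-number rate of 1 or 2:
    print(implied_clock_mhz(3200, 4, 1), implied_clock_mhz(3200, 4, 2))
```

The last line prints the two clocks that would give the RV515 a whole-number rate, which is where the 800MHz and 400MHz figures come from.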
 
Jawed said:
Tim, you're right. I think I misread the 800MHz memory as 800MHz core :(

No, I assumed a core speed of 450MHz (from AnandTech and others); that is how I got the numbers in the table:

X850XT: 8320/16/520MHz = 1 fragment per pipe per cycle.
RV530: 7200/4/600= 3 fragments per pipe per cycle.

Both of these make sense. For the RV515 I see three ways to get the numbers to fit, but none of them really fits the available data:

RV515-1: 3200/4/450Mhz = 1.78 fragments per pipe per cycle.

I do not think the 1.78 makes sense, I want a nice whole number.

RV515-2: 3200/4/400Mhz = 2 fragments per pipe per cycle

That makes sense, but does not fit the 4-1-1-1 data.

RV515-3: 3200/4/800MHz = 1 fragment per pipe per cycle.

That fits the 4-1-1-1 info, but 800Mhz is crazy. If the RV515 would do 1600 or 1800 Mfragments/s everything would click and fall into place - right now I simply cannot figure out how the RV515 score fits in.
 
Here's a quick comparison of shader arrays and quads, based on a "12-pipe" architecture.

For this comparison I'm going to use X850Pro (507MHz core, 520MHz memory, 33.3GB/s) with RV530 (600MHz core, 700MHz memory, 22.4GB/s), simply because the former exists.

X850 Pro:
  • 3 quads of fragment shader pipelines
  • each shader-quad has a dedicated (quad) TMU, with its own cache
  • each shader-quad is 4-way SIMD, i.e. one program counter is shared by all 4 pipelines, all executing the same instruction
  • the three shader-quads each operate independently of the others, so the shader-quads operate, overall, as 3-way MIMD
  • each shader-quad "owns" a tile of 256 pixels (a square of 16x16) in the backbuffer
(Speculation) the size of a tile corresponds with the batch size for the architecture (256 fragments). Batches are used to hide texture latency, where each instruction that is completed on the entire batch of 256 fragments will hide 64 cycles of texture latency (since this is 64 quads of fragments - in other words one batch executes in 64 phases per instruction). Texture latency can only be completely hidden if multiple instructions are executed in the shader.

This page appears to indicate that 6 instructions hide a single texture instruction's latency:

GPGPU bench results for single texture fetch in X800XT

The instruction count appears to include the texture instruction itself, so break-even comes at 6 instructions in total. It's worth noting, though, that 2 texture fetches take 8 instructions to hide, so the overall average is 4 instructions per texture fetch for 2 or more texture fetches.

So in X800XT I'm guessing that the average texture fetch requires 256 cycles to hide. The X850Pro should have similar texture fetch latency (slightly higher-clocked memory).
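
The 256-cycle guess is just phases per instruction times instructions per texture fetch. A rough Python sketch of that arithmetic follows. It assumes one phase issues per clock and that the batch really is 256 fragments, both of which are speculation as stated above, and the function names are only illustrative.

```python
def phases_per_instruction(batch_fragments, simd_width):
    """Clocks needed to issue one instruction across a whole batch.

    Assumes one SIMD-wide group (e.g. one quad) issues per clock.
    """
    return -(-batch_fragments // simd_width)  # ceiling division


def cycles_hidden(batch_fragments, simd_width, instructions):
    """Cycles of texture latency covered while `instructions` run on the batch."""
    return phases_per_instruction(batch_fragments, simd_width) * instructions


if __name__ == "__main__":
    # Guessed 256-fragment batch on a 4-wide shader-quad
    print(phases_per_instruction(256, 4))   # phases per instruction
    print(cycles_hidden(256, 4, 4))         # 4 instructions per texture fetch
```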

RV530
  • 1 array of 12 pipelines
  • TMU configuration might be:
    • 1 TMU per X pipelines or
    • a TMU array, e.g. 8 TMUs, shared by all pipelines
  • the shader array shares a single program counter, making it 12-way SIMD
(Speculation) the batch size may be a small multiple of the array size, e.g. 24 fragments. The only way to hide (say) 256 cycles of latency is with more average instructions per texture operation per batch (2 phases per batch, 128 instructions per phase per texture) or to interleave multiple triangles' threads (batches). The former configuration simply isn't practical, so to support small batches, RV530 would need to be able to interleave batches for fragment shading.

Xenos is able to interleave batches in each shader array on successive clock cycles. Unfortunately we don't know how many batches Xenos can maintain at any one time. I'm going to assume that Xenos uses 32-fragment batches (i.e. 2 phases per batch) when pixel shading, and so to hide 256-cycles of texture latency it would need 32 batches each of 4 instructions average per texture op. (32 batches x 4 instructions per batch x 2 phases = 256 cycles).

If a batch was, say, 64 fragments (4 phases), then 16 batches would need to be active at one time, etc. I'm suggesting that 1024 fragments could be in flight at one time. X800XT, with four quads, each working on 256-pixel tiles, also has 1024 fragments in flight at one time.

So assuming that RV530 has a multiple-batch scheduler like Xenos's, it would take 32 batches (each of 24 fragments in 2 phases), each with an average of 4-instructions per texture op (i.e. 3 ALU ops and 1 TMU op) to hide 256 cycles of texture latency. This would correspond with 768 fragments in flight at one time.
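
The batch counts above can be reproduced with a short Python sketch. It assumes the same speculative inputs as the text (256 cycles of latency, 4 instructions per texture op, one phase per clock), and the helper names are made up for illustration.

```python
def batches_needed(latency_cycles, batch_fragments, array_width, instr_per_tex):
    """Interleaved batches required to cover the texture latency."""
    phases = -(-batch_fragments // array_width)      # phases per batch (ceiling)
    cycles_per_batch = phases * instr_per_tex        # cycles one batch keeps the array busy
    return -(-latency_cycles // cycles_per_batch)    # ceiling division


def fragments_in_flight(latency_cycles, batch_fragments, array_width, instr_per_tex):
    """Total fragments held by the scheduler at one time."""
    n = batches_needed(latency_cycles, batch_fragments, array_width, instr_per_tex)
    return n * batch_fragments


if __name__ == "__main__":
    # (label, batch size, array width): all speculative configurations
    for label, batch, width in [("Xenos, 32-fragment batch", 32, 16),
                                ("Xenos, 64-fragment batch", 64, 16),
                                ("RV530, 24-fragment batch", 24, 12)]:
        print(label,
              batches_needed(256, batch, width, 4),
              fragments_in_flight(256, batch, width, 4))
```

Changing the first argument from 256 to some other latency figure shows how the batch count scales if the latency guess is off.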

So, does RV530 make use of a multiple-batch scheduler, like Xenos?

If so, it would imply that with such a small batch size (24) all forms of dynamic branching (loops, if...then...else) become fairly practical. What makes dynamic branching in NV40's or G70's fragment shader architecture worthless is the extremely high cost if only 1 fragment follows the worst-case execution path. It causes all other fragments in the batch to follow the same, slow, execution path.

So a smaller batch, if it's possible to implement by using multiple-batch scheduling (like Xenos) would make the worst-case execution path far less costly overall.
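
To put a rough number on why batch size matters for branching, here is a toy Python model. It assumes each fragment independently takes the slow path with some probability p, which is a big simplification (real branches are spatially coherent), and the p value is purely illustrative. The whole batch is treated as slow if any one fragment is.

```python
def batch_takes_slow_path(batch_fragments, p_fragment_slow):
    """Probability that at least one fragment in the batch takes the slow path.

    Toy model: fragments are assumed independent, which real scenes are not.
    """
    return 1.0 - (1.0 - p_fragment_slow) ** batch_fragments


if __name__ == "__main__":
    p = 0.01  # illustrative: 1 fragment in 100 needs the worst-case path
    for batch in (24, 32, 64, 256, 1024):
        print(batch, round(batch_takes_slow_path(batch, p), 3))
```

Under these assumptions a large batch almost always ends up on the slow path, while a small batch mostly avoids it.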

(Throughout this comparison, I've stuck to "256-cycle latency" for texturing. This depends on the clock rate and architecture, so in RV530 the latency could be much longer, for example. Naturally, if I've got the batch size wildly wrong, e.g. it should be 512-cycles, I believe the analysis will stand simply by changing the phase count - and the overall concept that Xenos and RV530 both schedule multiple small-batches still holds.)

Jawed
 
What's a "soft ground" issue?

So apparently ATI had investor meetings recently with brokerage(s) and one of the notes from an analyst states that:

The R520 had been sampling since Dec/04, and although the architecture and 90nm process were not a problem, ATI was not able to run the clock fast enough due to a “soft ground” issue that was discovered in late July after debugging with several re-spins. Specifically, the R520 and RV530 had functional yields, but could not run at high speeds, while the RV515 and the C1 (the 90nm Xbox graphics chip) did not have any issues.

Anybody know what that "soft ground" issue refers to?
 
Jawed said:
(Speculation) the size of a tile corresponds with the batch size for the architecture (256 fragments).

There are two things that determine the tile size:
1. There's no L2 texture cache, so there's some inefficiency at the tile boundary. So the larger the tiles the more efficient the architecture.
2. The pipelines are pre-assigned to specific tiles, so batches of relatively small triangles can cause imbalanced load between the pipelines. So the smaller the tiles the more efficient the architecture.

R300 was created with configurable tile size, but ATI found the sweet spot is 16x16 so they use that.

Given that, I wouldn't assume that there's any relation between the tile size and the batch size.
 
PurplePigeon said:
Anybody know what that "soft ground" issue refers to?

I haven't heard of the phrase in this kind of context before - usually it's safety-related. It might mean that they have had problems with high impedance on connections to the ground plane.
 
Which presumably means that as you crank it up, clock jitter caused by the increased noise sensitivity of the ground gets so bad that it can no longer stay in synch with itself.

Jawed
 
Hyp-X, I agree about those constraints. To be honest I'm not too worried about the actual batch size in R3xx...R4xx..., because we know that it's large, maybe 256 fragments, maybe 1024. Whatever the number, being large like this makes dynamic flow control impractical (as it is in NV40/G70). It was just a launching off point for me to consider batch sizes.

Though it would be great to find out what the batch size is in R420

What I'm interested by is the possibility that a multiple-batch scheduler, like that in Xenos, can make small batches (say 16 or 32 fragments, in Xenos) practical, and therefore make dynamic flow control a viable part of the SM3 feature set.

If the R5xx architecture is built upon a multiple-batch scheduler, like Xenos, then not only does it gain dynamic flow control, but it also gets the efficiency gains of Xenos, from the 50-70% utilisation that R420 can achieve to the 95% efficiency that Xenos can deliver. Roughly 50% extra utilisation means that R520's 16 pipelines could perform like 24 R420 pipelines (clock for clock). Who knows, eh? But it sounds like extreme pipelines are not needed
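
A quick Python sketch of that scaling, using only the utilisation figures quoted above. The function name is made up, and the model simply scales pipeline count by the utilisation ratio, ignoring everything else.

```python
def equivalent_pipelines(pipes, new_utilisation, old_utilisation):
    """Old-architecture pipelines matched by `pipes` at the new utilisation."""
    return pipes * new_utilisation / old_utilisation


if __name__ == "__main__":
    # 16 pipelines at 95% versus R420-style utilisation of 50%, 60% and 70%
    for old in (0.50, 0.60, 0.70):
        print(old, round(equivalent_pipelines(16, 0.95, old), 1))
```

The "24 pipelines" figure sits inside the range this gives for 50-70% utilisation.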

I wonder what this means for R580. It seems to me that a 48 pipe design (if that's really how things scale up) would be best implemented as 3 shader arrays. Which makes me think that perhaps RV530 is also 3 shader arrays.

So it all gets a bit messy at this point, and I'll hold off speculating some more...

I'm mostly interested in the possibility of small-batch support and the consequent efficiency boosts in texturing, dynamic branching and avoiding stalls, rather than numbers and types of pipelines... I think it'd be rather groovy if R5xx gets this advanced scheduler

Jawed
 