RXXX Series Roadmap from AnandTech

geo said:
So what's everyone's performance expectations for R520XL? I'm guessing that in general (non-HDR) it will fall between a stock GT and a stock GTX at a GT-like price, and that they're aiming to do to NV with it what NV did to them with the 6800GT last time.
That would be very nice if they do it, I loved the 6800GT and then the X800XL.
 
Why are people expecting a spring launch for R600 or G80 (NV50?)?

It's too early IMO, and I think these chips are coming LATE 06 or early 07.
 
overclocked said:
Why are people expecting a spring launch for R600 or G80 (NV50?)?

It's too early IMO, and I think these chips are coming LATE 06 or early 07.

I'd expect them before the launch of Vista.
 
Jawed said:
The other thing I've realised is that 48 pipes for R580, arranged in three arrays of 16 pipes each, wouldn't be such a huge device.

I would say that the pipelines will still be arranged in 4 MIMD, regionalised "quads".
 
Dave Baumann said:
I would say that the pipelines will still be arranged in 4 MIMD, regionalised "quads".

MIMD would mean a quad could work with different shaders per pixel?
 
tEd said:
MIMD would mean a quad could work with different shaders per pixel?

Self-explanatory: multiple instruction, multiple data.

I'm not entirely sure, but weren't the current Radeons' quads also described as MIMD?
 
Obviously a single quad on a current Radeon is SIMD, but across the quads they are MIMD (otherwise the tile rendering wouldn't work).
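
A toy way to picture that split (my own sketch, nothing to do with ATI's actual scheduler): within a quad there is one program counter driving all four pixels, while each quad keeps its own program counter and can sit at a different instruction, or a different polygon, than its neighbours.

```python
class Quad:
    """Four pixels in lockstep: one program counter, one instruction
    applied to all four pixels per step (SIMD within the quad)."""
    def __init__(self, name, program):
        self.name = name
        self.program = program  # the shader, as a list of instruction names
        self.pc = 0             # a single program counter for the whole quad

    def step(self):
        if self.pc >= len(self.program):
            return f"{self.name}: idle"
        instr = self.program[self.pc]
        self.pc += 1
        return f"{self.name}: {instr} on 4 pixels"

# Across quads it's MIMD: each quad has its own program counter and can be
# running a different shader, or a different point in the same shader,
# which is what lets each quad chew through its own tile independently.
quads = [
    Quad("quad0", ["MUL", "ADD", "TEX"]),
    Quad("quad1", ["TEX", "MAD"]),        # a different shader entirely
    Quad("quad2", ["MUL", "ADD", "TEX"]),
]

for cycle in range(3):
    print(f"cycle {cycle}: " + " | ".join(q.step() for q in quads))
```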
 
Dave Baumann said:
Obviously a single quad on a current Radeon is SIMD, but across the quads they are MIMD (otherwise the tile rendering wouldn't work).

Hmmm.....I'm still describing Xenos as a 3* 16-way SIMD.
 
If their tile size on R520 is the same as on R3xx/R4xx, and if a batch can't be bigger than a tile per quad, we could expect better dynamic branching performance than NV40.
 
nAo, I'm not convinced a 256-fragment batch (a 16x16 tile) is small enough to make dynamic flow control much more usable than it is in NVidia hardware. Admittedly, the batch size for NV40 seems to be vast: 4096 fragments. The batch size for G70 seems to be 1024 - but the results for the two cards in the PS3.0 branching test:

http://graphics.stanford.edu/projects/gpubench/results/

don't seem to be that different from each other. Those tests aren't documented very well - I think there was some detailed discussion of them on B3D but...
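
Here's a back-of-envelope sketch of that scepticism, with toy numbers of my own (a 1% incoherent taken-rate and made-up path costs): a lockstep batch pays for every path that any of its fragments takes.

```python
def expected_batch_cost(batch_size, p_taken, cost_if=100.0, cost_else=10.0):
    """Expected cost of one batch when the whole batch executes, in lockstep,
    every path that at least one of its fragments takes. Fragments are assumed
    to branch independently with probability p_taken, a crude stand-in for a
    fully incoherent branch; the path costs are made up."""
    p_any_if   = 1.0 - (1.0 - p_taken) ** batch_size  # someone takes the if-side
    p_any_else = 1.0 - p_taken ** batch_size          # someone takes the else-side
    return p_any_if * cost_if + p_any_else * cost_else

# One fragment in a hundred takes the expensive path:
ideal = 0.01 * 100.0 + 0.99 * 10.0  # cost with perfect per-fragment branching
for size in (16, 256, 1024, 4096):  # quad-tile guess, R520 guess, G70, NV40
    print(f"batch {size:4d}: expected cost {expected_batch_cost(size, 0.01):6.1f}"
          f" vs ideal {ideal:.1f}")
```

At that taken-rate, nearly every 256-fragment batch already contains at least one taken fragment, so it pays for both sides much like a 4096-fragment batch does; only a genuinely coherent branch pattern would let the smaller batch pull clearly ahead.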

Jawed
 
Ailuros said:
Hmmm.....I'm still describing Xenos as a 3* 16-way SIMD.
I think that's correct.

At the next higher level, you can describe the shader arrays as a group that's configured as 3-way MIMD.
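
Put as a toy model (hypothetical code, not anything official about Xenos): SIMD inside each 16-wide array, MIMD across the three arrays.

```python
class ShaderArray:
    """16 ALUs sharing one program counter: SIMD inside the array."""
    def __init__(self, name, lanes=16):
        self.name = name
        self.lanes = lanes
        self.pc = 0  # each array has its OWN pc: MIMD across arrays

    def issue(self, instr):
        self.pc += 1
        return f"{self.name} pc={self.pc}: {instr} on {self.lanes} lanes"

# Xenos as 3 * 16-way SIMD; the group of three arrays is 3-way MIMD,
# since each array can be at a different point in a different thread.
arrays = [ShaderArray(f"array{i}") for i in range(3)]
print(arrays[0].issue("MAD"))
print(arrays[1].issue("TEX"))
print(arrays[2].issue("ADD"))
print(arrays[1].issue("MAD"))  # array1 advances independently of the others
```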

Jawed
 
Dave Baumann said:
I would say that the pipelines will still be arranged in 4 MIMD, regionalised "quads".
So, you're saying that there would be 4 arrays, each of 12 pipes, rather than 3 arrays, each of 16 pipes?

Further, that each array of 12 pipes owns a "tile" in the render target, in a similar fashion to the tiles of R3xx...R4xx...?

12 seems like a bloody awkward number of pipes to make the "base unit" for an architecture. The kind of number you only use when cost/size constraints (e.g. mid-range) come into play. Hmm...

Jawed
 
nAo said:
If their tile size on R520 is the same as on R3xx/R4xx, and if a batch can't be bigger than a tile per quad, we could expect better dynamic branching performance than NV40.

Flip back to Eric's own comments on that topic:

The R3xx and the R4xx have a rather interesting way of tiling things. Our setup unit sorts primitives into tiles, based on their area coverage. Some primitives fall into 1 tile, some into a few, some cover lots. Each of our backend pixel pipes is given tiles of work. The tiles themselves are programmable in size (well, powers of 2), but, surprisingly, we haven't found that changing their size changes performance that much (within reason). Most likely due to the fact that with high-res displays, most primitives are large. There is a sweet spot in the performance at 16, and that hasn't changed in some time. Even the current X800 uses 16, though I think we need to revisit that at some point in the future. Possibly, on a per-application basis, different tile sizes would benefit things. On our long list of things to investigate.

Anyway, each pipe has huge load balancing fifos on their inputs, that match up to the tiles that they own. Each pipe is a full MIMD and can operate on different polygons, and, in fact, can be hundreds of polygons off from others. The downside of that is memory coherence of the different pipes. Increasing tile size would improve this, but also requires larger load balancing. Our current setup seems reasonably optimal, but reviewing that, performance wise, is on the list of things to do at some point. We've artificially lowered the size of our load balancing fifos, and never notice a performance difference, so we feel, for current apps, at least, that we are well over-designed.

In general, we have no issues keeping all our units busy, given the current polygons. I could imagine that if you did single-pixel triangles in one tile over and over, that performance could drop due to tiling, but memory efficiency would shoot up, so it's unclear that performance overall would be hurt. The distribution of load across all these tiles is pretty much ideal, for all the cases we've tested. Super tiling is built on top of this, to distribute work across multiple chips.

As well, just like other vendors, we have advanced sequencers that distribute ALU workload to our units, allocate registers and sequence all the operations that need to be done, in a dynamic way. That's really a basic requirement of doing shading processing. This is rarely the issue for performance.

Performance issues are still very texture-fetch bound (cache efficiency, memory efficiency, filter types) in modern apps, as well as partially ALU/register-allocation bound. There are huge performance differences possible depending on how you deal with texturing and texture fetches. Even Shadermark, if I recall correctly, ends up being texture bound in many of its cases, and it's very hard to make any assumptions on ALU performance from it. I know we've spent a lot of time in our compiler, generating various different forms of a shader, and discovering that ALU & register counts don't matter as much as texture organization. There are no clear generalizable solutions. Work goes on.

http://www.beyond3d.com/forum/showthread.php?p=342652&highlight=tiles#post342652

My highlight and question mark....;)
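
For anyone trying to picture the mechanism Eric describes, here's a minimal sketch under my own assumptions (the tile-to-pipe ownership pattern, the bounding-box coverage test and all names are hypothetical; only the 16-pixel tile size and the per-pipe FIFO idea come from his description):

```python
from collections import deque

TILE = 16   # the "sweet spot" tile size Eric mentions (a power of two)
PIPES = 4   # assume four pixel pipes, each owning an interleaved set of tiles

def tiles_covered(bbox):
    """Yield (tx, ty) for every TILE x TILE screen tile a primitive's
    bounding box touches; real hardware would test actual coverage."""
    x0, y0, x1, y1 = bbox
    for ty in range(y0 // TILE, y1 // TILE + 1):
        for tx in range(x0 // TILE, x1 // TILE + 1):
            yield tx, ty

def owner(tx, ty):
    """Static tile-to-pipe assignment (a checkerboard-style interleave)."""
    return (tx + ty) % PIPES

# One load-balancing FIFO per pipe: a pipe only ever sees tiles it owns,
# so pipes can drift hundreds of polygons apart from one another (MIMD).
fifos = [deque() for _ in range(PIPES)]

primitives = [
    ("big_quad",   (0, 0, 63, 63)),   # covers a 4x4 block of tiles
    ("small_tri",  (8, 8, 12, 12)),   # falls entirely inside one tile
    ("wide_strip", (0, 16, 63, 20)),  # covers one row of tiles
]

# The setup unit sorts primitives into tiles based on area coverage,
# then queues each tile of work on the FIFO of the pipe that owns it.
for name, bbox in primitives:
    for tx, ty in tiles_covered(bbox):
        fifos[owner(tx, ty)].append((name, tx, ty))

for p, fifo in enumerate(fifos):
    print(f"pipe {p}: {len(fifo)} tile jobs, first: {list(fifo)[:2]}")
```

The checkerboard owner() interleave is just a guess at how tiles might be spread across pipes; the point is that each pipe drains only its own FIFO, which is what lets the pipes run hundreds of polygons apart.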
 
I'd just like to jump in here and point out a bit at the end:

Performance issues are still very texture-fetch bound (cache efficiency, memory efficiency, filter types) in modern apps, as well as partially ALU/register-allocation bound. There are huge performance differences possible depending on how you deal with texturing and texture fetches.

Jawed
 
Yes, well, on the other hand, he also says, as late as Oct 04, shortly before R520 tapeout:

There are no clear generalizable solutions. Work goes on.

Which is not to say I don't hope they made a significant advancement in that area -- just being the honest-broker guy.
 