Next gen graphics and Vista.

DevilsRejection said:
Sorry to sound dumb, but when I see things like:

R520 is "16-1-1-1," R580 "16-1-3-1," RV515 "4-1-1-1," RV530 "4-1-3-2."

What exactly do these numbers mean? I know the first one is the pipeline count, but what about the rest?
If you open the threads:

http://www.beyond3d.com/forum/showthread.php?t=21970
http://www.beyond3d.com/forum/showthread.php?t=20454
http://www.beyond3d.com/forum/showthread.php?t=18270

And put the following search term into the "Search this thread" drop-down:

16-1-1-1 or 16-1-3-1 or 4-1-1-1 or 4-1-3-2

You'll get an interesting history of speculation...

I think there may be some other threads, too. But that's more than enough!

Jawed
 
Graham said:
I know this is quite off topic, but:

I was wondering if anyone knows the approximate fill rate of Xenos, both when drawing normally textured polygons and when doing z-only or stencil-only writes?

I'm very interested in its ability to render things like shadow maps, and particularly stencil shadows... Combined with what I'd expect to be phenomenal stencil-only performance and the ability to write to memory, it would be an absolute beast at generating and rendering stencil shadows.

I was surprised, for example, to see Saints Row using full-scene stencil shadows on an entire city (cast from the sun). At 720p with 4xAA, that is an awful lot of fillrate for 60fps+.

I know it is a damn fast piece of kit after seeing it at SIGGRAPH (it was too fast, ironically), but does anyone know the numbers?
4Gp/s of colour (8 ROPs running at 500MHz). Z/stencil is either double or quadruple that - it's not clear to me; I've heard conflicting suggestions.
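For a rough sense of scale against Graham's Saints Row example, here's a back-of-envelope sketch. The overdraw figure is purely my assumption; the rest are the numbers quoted in this thread.

```cpp
#include <cstdio>

// Rough stencil-fill estimate for the Saints Row case above. The
// overdraw value is an assumption for illustration only.
int main() {
    const double pixels   = 1280.0 * 720.0; // 720p
    const double samples  = 4.0;            // 4xAA
    const double fps      = 60.0;
    const double overdraw = 10.0;           // assumed shadow-volume overdraw

    const double needed = pixels * samples * fps * overdraw;
    std::printf("stencil fill needed: ~%.1f Gsamples/s\n", needed / 1e9);
    // ~2.2 Gsamples/s under these assumptions - comfortably inside even
    // the "double" (8Gsamples/s) reading of Xenos' z/stencil rate.
}
```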

What makes Xenos so healthy at that kind of thing is the fact that it can dedicate all 48 pipelines to vertex shading during the shadow pass(es).

http://www.beyond3d.com/articles/xenos/index.php?p=05
http://www.beyond3d.com/articles/xenos/

Jawed
 
Bob said:
NV4x solves this by design: all shader pipes run in lock-step (for a certain definition of lock-stepping). So you get nice long predictable coherent data streams. If you can schedule arbitrary threads at arbitrary times, you lose that.
Bob, do you know how G70 solves the "out of order" problem? It seems to me that it doesn't run its six quads in lock-step.

Is that correct? If so, how does G70 keep fragments in order and separated across the quads? Does it use tiling like ATI's designs?

Jawed
 
_xxx_ said:
No, you linked to some files which are locally on your hard drive and not on the internet :LOL:

Wrong. Maybe I don't want to prove that I put those files online before a given date. And now ... I'm going to remove the hints.
 
Jawed said:
4Gp/s of colour (8 ROPs running at 500MHz). Z/stencil is either double or quadruple that - it's not clear to me; I've heard conflicting suggestions.
The ROPs are designed to always handle FSAA, so that 4GPixels/s is also a no-penalty 16GSamples/s (MSAA) - in both cases (FSAA and non-FSAA) Z/stencil runs at twice the colour rate, so 8GSamples/s without FSAA and up to 32GSamples/s with FSAA.
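In other words, a minimal sketch of that arithmetic, using the 8-ROP/500MHz figures from earlier in the thread:

```cpp
#include <cstdio>

// Xenos ROP throughput as described above: colour rate is fixed, each
// colour pixel carries up to 4 AA samples at no penalty, and z/stencil
// always runs at twice the colour rate.
int main() {
    const double rops  = 8.0;
    const double clock = 500e6; // Hz

    const double colour  = rops * clock;   // 4 Gpixels/s
    const double aa      = colour * 4.0;   // 16 Gsamples/s (4xAA)
    const double zNoAA   = colour * 2.0;   // 8 Gsamples/s
    const double zWithAA = zNoAA * 4.0;    // up to 32 Gsamples/s

    std::printf("colour: %.0f Gpixels/s, 4xAA: %.0f Gsamples/s\n",
                colour / 1e9, aa / 1e9);
    std::printf("z/stencil: %.0f to %.0f Gsamples/s\n",
                zNoAA / 1e9, zWithAA / 1e9);
}
```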

Jawed said:
Is that correct? If so, how does G70 keep fragments in order and separated across the quads? Does it use tiling like ATI's designs?
I'm almost certain that it isn't tiling. During the G70 editors' day I had a discussion with Tony Tamasi and others about Crossfire SuperTiling and its costs, and when I pointed out that ATI's pipes are already tiling at the quad level he couldn't quite believe it. Judging by his reaction, the answer is no.
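For anyone unfamiliar with what "tiling at the quad level" means here, a toy sketch of the idea; the tile size and the checkerboard mapping are illustrative assumptions, not ATI's actual parameters:

```cpp
// Toy screen-space tiling: the screen is divided into small tiles and
// each tile is statically owned by one quad pipe. Crossfire SuperTiling
// applies the same idea across whole GPUs with larger tiles. Tile size
// and mapping here are assumptions, not ATI's real values.
int quadPipeForPixel(int x, int y, int numQuadPipes) {
    const int tileSize = 16;      // assumed tile dimension in pixels
    int tileX = x / tileSize;
    int tileY = y / tileSize;
    // Checkerboard the tiles across the pipes to keep the load balanced.
    return (tileX + tileY) % numQuadPipes;
}
```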
 
RoOoBo said:
Wrong. Maybe I don't want to prove that I put those files online before a given date. And now ... I'm going to remove the hints.

url=/docs/vmoya-ShaderPerformance.doc
url=/docs/EmbeddedGPU.pdf

^^ These are not valid URLs, and that's what was in your links. So even if you have put the files online, we can't see them since you didn't provide the correct address...

But whatever, not that important ;)
 
Dave Baumann said:
The ROPs are designed to always handle FSAA, so that 4GPixels/s is also a no-penalty 16GSamples/s (MSAA) - in both cases (FSAA and non-FSAA) Z/stencil runs at twice the colour rate, so 8GSamples/s without FSAA and up to 32GSamples/s with FSAA.
Thanks. I guess I was confused before, thinking that z/stencil-only rendering and AA were mutually exclusive - and that kinda stuck in my head.

So does that mean that when Xenos does 2xAA it is wasting AA sample bandwidth :?: The only saving with 2xAA comes from using fewer tiles to render the entire frame.

I'm almost certain that it isn't tiling. During the G70 editors' day I had a discussion with Tony Tamasi and others about Crossfire SuperTiling and its costs, and when I pointed out that ATI's pipes are already tiling at the quad level he couldn't quite believe it. Judging by his reaction, the answer is no.
It's pretty surprising they didn't know, what with all the talk (well, talk here at least - not sure where else) of the multi-GPU flight-simulator gear.

Jawed
 
_xxx_ said:
^^ These are not valid URLs, and that's what was in your links. So even if you have put the files online, we can't see them since you didn't provide the correct address...

But whatever, not that important ;)

That's quite funny ... In the end I did a little Google search for a few of the papers in the MICRO program, and most are already online. I guess it's to be expected in the current age: no one bothers to read the papers at the conference anymore ;) so at least give people the chance to read them beforehand.

I guess it depends on the people; for GH2005 at least half of the papers were online too.

Check page below.
 
Now I found it on your page :)

The link was "http://personals.ac.upc.edu/vmoya/docs/vmoya-ShaderPerformance.pdf"
and you posted "/docs/vmoya-ShaderPerformance.pdf", so you missed "http://personals.ac.upc.edu/vmoya/" in your links ;)

Interesting read.
 
Jawed said:
So does that mean that when Xenos does 2xAA it is wasting AA sample bandwidth :?: The only saving with 2xAA comes from using fewer tiles to render the entire frame.
Yes, in this implementation the costs are shifted back up the pipeline to geometry, rather than to fragment and fill, where commands span tiles. If a command spans more than one tile then it needs to be reprocessed through geometry setup, where the pixels outside the tile currently being rendered can be clipped; the fewer the tiles, the fewer commands require reprocessing.
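A crude sketch of that cost model; this is entirely schematic, with "commands" standing in for whatever granularity the hardware actually replays at:

```cpp
#include <vector>

// Schematic view of Xenos tiled rendering: any command whose screen-space
// bound overlaps a tile must be re-run through geometry setup for that
// tile, so geometry work grows with the number of tiles a command touches.
// Fewer tiles (e.g. 2xAA instead of 4xAA) means fewer replays.
struct Rect { int x0, y0, x1, y1; };

static bool overlaps(const Rect& a, const Rect& b) {
    return a.x0 < b.x1 && b.x0 < a.x1 && a.y0 < b.y1 && b.y0 < a.y1;
}

// Count how many times command geometry must pass through setup in total.
int geometryPasses(const std::vector<Rect>& commandBounds,
                   const std::vector<Rect>& tiles) {
    int passes = 0;
    for (const Rect& cmd : commandBounds)
        for (const Rect& tile : tiles)
            if (overlaps(cmd, tile))
                ++passes;  // replayed through setup once per touched tile
    return passes;
}
```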
 
Chalnoth said:
ATI, on the other hand, will only be about one year into the R5xx architecture, and will be keen on milking that architecture as much as they can.

Well, Orton has said that 2005 is the investment year and 2006 the payoff, so that strongly suggests to me that they don't see the changes they need to make in 2006 as being anywhere near as complex as those they are making for 2005.
 
RoOoBo said:
Check page below.
I've just read ShaderPerformance. Fascinating stuff and a very impressive achievement.

So, how does your "GPU" solve the green triangle/red triangle "out of order" problem mentioned earlier :?: It seems that your GPU processes groups out of order - is that right?

Lots of other goodies. The paper deserves its own thread.

Jawed
 
Dave Baumann said:
Look at a graph of a benchmark plotted over time and the FPS is bouncing around all over the place; at any of these points in time the bottlenecks encountered are shifting from one element of the processing system to another. It's next to impossible to have an even load across all the units through even a few thousand frames, let alone the course of an entire game. No, the game can never be expected to provide the balance of power between PS/VS (and GS) utilisation, nor should it be a task of the developer to try (outside the reasonable bounds of the expected hardware capabilities). The only question is whether dedicated units can still be more optimal than a unified structure at hiding/minimising such bottlenecks.

Umm, did you just say that one benefit of a unified architecture is a relative smoothing out of FPS in games? That's kinda cool. I hadn't thought of it in those terms, but I can see now how it might...
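A toy model of why that smoothing might fall out of a unified design; the unit counts and per-frame workloads below are invented purely for illustration:

```cpp
#include <algorithm>
#include <cstdio>

// Per-frame cost of a fixed 8 VS + 16 PS split versus a unified pool of
// 24 units, for a workload whose vertex/pixel balance shifts each frame.
// All numbers are invented for illustration.
int main() {
    const double vsUnits = 8, psUnits = 16, pool = 24;
    // {vertex work, pixel work} per frame, in arbitrary unit-cycles.
    const double frames[][2] = { {4, 20}, {16, 8}, {2, 30}, {20, 4} };

    for (const auto& f : frames) {
        // Dedicated: the more overloaded partition sets the frame time.
        double dedicated = std::max(f[0] / vsUnits, f[1] / psUnits);
        // Unified: the pool only has to chew through the total.
        double unified = (f[0] + f[1]) / pool;
        std::printf("dedicated %.2f vs unified %.2f\n", dedicated, unified);
    }
    // The unified column comes out both lower and far less variable.
}
```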
 
Jawed said:
So, how does your "GPU" solve the green triangle/red triangle "out of order" problem mentioned earlier :?: It seems that your GPU processes groups out of order - is that right?

Reorder queues for both vertices and fragments.
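For readers following along, a minimal sketch of what such a reorder queue does - retiring results strictly in issue order, however they complete - with the caveat that this is the idea only, not ATTILA's actual code:

```cpp
#include <cstdint>
#include <map>
#include <optional>

// Minimal reorder queue: work is tagged with a sequence number at issue,
// may complete in any order, but only drains in sequence order - so a
// later (red) triangle's quads can't pass an earlier (green) one on the
// way to the framebuffer. The idea only, not ATTILA's implementation.
template <typename T>
class ReorderQueue {
    uint64_t nextIssue_ = 0, nextRetire_ = 0;
    std::map<uint64_t, T> completed_;  // finished work awaiting its turn
public:
    uint64_t issue() { return nextIssue_++; }  // tag work in order
    void complete(uint64_t tag, T result) {
        completed_.emplace(tag, std::move(result));
    }
    // Hand results downstream strictly in issue order, or nothing yet.
    std::optional<T> retire() {
        auto it = completed_.find(nextRetire_);
        if (it == completed_.end()) return std::nullopt;  // head not done
        T out = std::move(it->second);
        completed_.erase(it);
        ++nextRetire_;
        return out;
    }
};
```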
 
RoOoBo said:
Reorder queues for both vertices and fragments.
Is that the 8-entry queue for Primitive Assembly and the 64-entry queue for Color Write in Table 1? Is the latter 64 quads of fragments?

How many batches (of fragments, say) in flight does ATTILA support? It seems a batch consists of 128 threads, so a batch is broken down into 32 groups - is that correct? Does ATTILA have the concept of a fragment batch as such, or is all scheduling done solely at the group level?

Does ATTILA's (unified architecture) scheduler prioritise batches in any way? I couldn't find a discussion of the maintenance of a batch queue. I see a discussion of the register file, but no discussion of the scheduler's assessment of shader status - merely that groups have a shader status. There's also no mention of checking how full/empty certain key queues are as input to the scheduler.

Also, it seems that each shader pipe has a dedicated TMU. Do you have any plans to simulate a pool of TMUs, in a fully decoupled organisation like that in Xenos?

It's occurred to me that because ATTILA doesn't simulate a fully decoupled TMU array, there's less "freedom" for out-of-order execution. Is that fair?

Jawed
 
I wouldn't bother with the queue sizes in the table, as I may change them with each new experiment. Eight entries for the PA are too few for the configured latency (which may be too large) of sending all the data for a triangle down the pipeline, so it may become a bottleneck for vertex-limited batches (small vertex program and few fragments generated). I think in the second paper I put it at 32 ... Color write and z/stencil have queues for 64 quads, as the third column of the table gives the size per fragment x 4. There are a few more queues that aren't mentioned anywhere, and a lot of the pipeline discussion was removed because of the paper's 10-page limit.

Until late July (the original paper was submitted in May-June or so) there wasn't a fragment distribution policy implemented for the shader units. Fragments were generated on an 8x8 tile basis, and quads would be removed before shading by HZ and ZST. The quads would then be assigned to a free shader unit on a round-robin basis. It wasn't very texture-cache friendly ... Now, after July, there still isn't a proper distribution mechanism implemented, but the assignment is made in runs of N fragments per shader unit (N being large; in the experiments I think it was set at 128), moving to the next shader unit with free resources when the current one becomes full. Very weird things happen with different configured Ns. A proper, configurable distribution mechanism is what I should be working on right now (likely to be tile based).
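If I've read that correctly, the interim policy looks something like the sketch below. This is a paraphrase of the description, not ATTILA's code; the capacity figure and structure are assumptions.

```cpp
#include <cstddef>
#include <vector>

// Sketch of the interim ATTILA fragment distribution described above:
// keep feeding quads to the current shader unit until N fragments have
// been sent or it runs out of resources, then move to the next unit
// with free resources. A paraphrase, not the actual simulator code.
struct ShaderUnit { int queuedFragments = 0; int capacity = 512; };  // assumed

class Distributor {
    std::vector<ShaderUnit>& units_;
    std::size_t current_ = 0;
    int sentToCurrent_ = 0;
    const int N_;  // run length per unit, e.g. 128 in the experiments
public:
    Distributor(std::vector<ShaderUnit>& units, int n) : units_(units), N_(n) {}

    // Assign one quad (4 fragments) to a shader unit; returns its index.
    // Assumes at least one unit has room (a real pipeline would stall).
    std::size_t assignQuad() {
        if (sentToCurrent_ >= N_ ||
            units_[current_].queuedFragments + 4 > units_[current_].capacity) {
            do { current_ = (current_ + 1) % units_.size(); }
            while (units_[current_].queuedFragments + 4 > units_[current_].capacity);
            sentToCurrent_ = 0;
        }
        units_[current_].queuedFragments += 4;
        sentToCurrent_ += 4;
        return current_;
    }
};
```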

We don't have that concept of a batch yet, and I'm unlikely to call it that - too confusing with the other kinds of batches. Maybe "shader work assignment group" or "unit" or something ...

Decoupled TMUs are on the large 'to be done' list. The shaders work out of order because the texture cache and texture unit are out of order: they can return results in a different order than the requests were issued (that was also implemented in July ...). But once a shader input is in the shader, it only gets out after processing is completed. Shader inputs are selected each cycle, in groups of four, to fetch and execute n instructions, in FIFO or round-robin order (I'm not sure which now) and only if they aren't blocked waiting for a texture result or another kind of dependency (so it could be called a thread window with some kind of priority).
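So the per-cycle selection might be pictured like this; a sketch of the "thread window" idea as described, with field names invented for illustration:

```cpp
#include <vector>

// Sketch of the shader's per-cycle input selection described above: walk
// the window in FIFO order and pick the first group of four inputs that
// isn't blocked on a texture result or other dependency. The names and
// structure are invented, not ATTILA's.
struct ShaderGroup {              // four shader inputs fetched together
    bool waitingOnTexture = false;
    bool hasDependency    = false;
    bool finished         = false;
};

// Returns the index of the group to fetch/execute this cycle, or -1.
int selectGroup(const std::vector<ShaderGroup>& window) {
    for (std::size_t i = 0; i < window.size(); ++i) {   // FIFO priority
        const ShaderGroup& g = window[i];
        if (!g.finished && !g.waitingOnTexture && !g.hasDependency)
            return static_cast<int>(i);
    }
    return -1;  // every group blocked: the shader stalls this cycle
}
```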

The only scheduling done outside the shader unit is to send vertex inputs to the shader before sending fragment inputs (vertex-first scheduling). As the shader unit has no penalty for fetching instructions from either kind of input each cycle, they just get mixed, and the number of vertex inputs is limited by the queues in the geometry pipeline. Another 'to be done' is downgrading the quite idealized shader unit to work in a SIMD way (so a whole batch must execute the same fetched instruction before starting the next). But I don't think that fetching an instruction (or group of instructions) every cycle, or every few cycles, is that problematic; CPUs implement higher fetch bandwidth at higher frequencies.
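And the vertex-first policy outside the shader reduces to something as simple as this (again schematic; the queue contents are placeholders):

```cpp
#include <deque>

// Vertex-first scheduling as described above: when handing a new input
// to a shader unit, drain pending vertex inputs before fragment inputs.
// The geometry pipeline's queues cap how many vertices can pile up.
enum class InputKind { Vertex, Fragment };

// Assumes the caller has checked that at least one queue is non-empty.
InputKind nextInput(std::deque<int>& vertexQueue,
                    std::deque<int>& fragmentQueue) {
    if (!vertexQueue.empty()) {   // vertices always go first
        vertexQueue.pop_front();
        return InputKind::Vertex;
    }
    fragmentQueue.pop_front();
    return InputKind::Fragment;
}
```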
 
geo said:
Well, Orton has said that 2005 is the investment year and 2006 the payoff, so that strongly suggests to me that they don't see the changes they need to make in 2006 as being anywhere near as complex as those they are making for 2005.
Sure, but Vista won't be out until late 2006. If one takes a pessimistic view, it may be a statement that ATI won't have a DX10 part available at the launch of Vista.
 
Chalnoth said:
Sure, but Vista won't be out until late 2006.

But when would you expect the work to support it to be more or less complete? It seems clear to me (for ATI, anyway) that this work must be principally completed ("soft ground" issues aside!) by mid-2006 or so, and thus the relative effort between 2005 and 2006 that Orton is pointing at is both relevant and indicative of a strategy of leveraging previous bits and pieces, and of the relative proportions of existing work to new work that entails.

But I've been accused of reading too much into that before. <shrugs>

Edit: All I'm saying is that if you look at the delta of work required to get from R420 to R600, I believe that by the release of R520 ATI will have completed something well north of 50% of that effort.
 