Next gen graphics and Vista.

Well, the thing is, I think the actual layout and design of the chip is such a huge portion of the cost that even though they may benefit from getting the R520 out before the R600 in terms of teething on new technology, they still have a lot of costs tied into the R520 itself that they may not have time to recoup.
 
ATI has more to gain financially (and mindshare-wise) by positioning their upcoming DX10 generation of products well, rather than dwelling on the current generation in an effort to make sure costs are recouped. Profits are profits, and if the DX10 goods can pay for themselves, pay the difference from this gen, and make a profit on top of it all (which I think they can), well then it's clear that this gen should be viewed more as a holding action in preparation for R600's launch. To tell you the truth, I'm not all that excited about R520, but I'm interested in seeing what R580 brings to the table, as it seems more the hybrid.

But all of that aside, R600 is what they've been working towards.
 
xbdestroya said:
But all of that aside, R600 is what they've been working towards.

Couldn't agree more. :LOL: And possibly I should replace R520 with R580 in mine above, but I think that just moves the needle a little bit further to the right; in my view it's all very much on the same continuum, and when we look back a few years from now we'll be more likely to draw any bright lines between R4xx and R5xx/Xenos than between R5xx/Xenos and R600.

But I reserve the right to change my mind when I see R520! :LOL:
 
Chalnoth said:
Well, the thing is, I think the actual layout and design of the chip is such a huge portion of the cost that even though they may benefit from getting the R520 out before the R600 in terms of teething on new technology, they still have a lot of costs tied into the R520 itself that they may not have time to recoup.
Those costs will be associated with the delays to R520; they wouldn't artificially hold the rest of the roadmap up because of issues with a particular chip. Silicon layout costs for a single chip are not necessarily that expensive when spread across the range of chips that the architecture covers - many of the elements will have the same layout across all the chips, with specific chip layouts "filling in the gaps" between these items.
 
Dave Baumann said:
Those costs will be associated with the delays to R520; they wouldn't artificially hold the rest of the roadmap up because of issues with a particular chip. Silicon layout costs for a single chip are not necessarily that expensive when spread across the range of chips that the architecture covers - many of the elements will have the same layout across all the chips, with specific chip layouts "filling in the gaps" between these items.
Well, the delays are a blow, certainly. But you have to see that even if ATI had launched in time for the "back to school" crowd, they'd still only have about a year of life in the R5xx architecture.

Basically, I think they kind of screwed themselves financially by deciding to make the R420 an SM2 part. The timing is just much easier on nVidia because they chose to go SM3 then, and now can just ride that same architecture all the way to Vista's release.
 
By the end of the DX9 lifecycle they will have both done pretty much the same, just at different points - ATI produced R300, with a smaller architectural change for R420, then moved to R520 with a larger change; NVIDIA had NV30, then made a larger architectural change to NV40 with a smaller change for G70 - they've both had 3 generations, and for both of them one has been a larger architectural change, with the other two being closer generational shifts. NV3x only lasted a year; what makes ATI different to NVIDIA? (Even before considering that quite a lot of ATI's next generation has already been prototyped in a fully usable environment.)
 
Dave Baumann said:
NV3x only lasted a year; what makes ATI different to NVIDIA?
Well, nVidia really f'ed up with the NV3x. I'd say if it lasted any longer, they'd basically be shooting themselves in the foot. ATI, however, seems to have planned for the R5xx to last only a year, which just doesn't seem like a great thing to do.
 
RoOoBo said:
I wouldn't bother with the queue sizes in the table as I may be changing them with each new experiment.
Ah - I was leading up to trying to see whether the green/red triangle problem was essentially incapable of arising in your architecture due solely to queue sizes, or whether it was more subtle than that (i.e. not treating the queues as strictly FIFO).

Until late July (the original paper was submitted in May-June or so) there wasn't a fragment distribution policy implemented for the shader units. Fragments were generated on an 8x8 tile basis and quads would be removed before shading by HZ and ZST. The quads would then be assigned to a free shader unit on a round-robin basis. It wasn't very texture cache friendly ... Now, after July, there still isn't a proper distribution mechanism implemented, but the assignment is made in blocks of N fragments per shader unit (N being large; in the experiments I think it was set at 128), moving on to the next shader with free resources when one becomes full. Very weird things happen with differently configured Ns. A proper and configurable distribution mechanism is what I should be working on right now (likely to be tile based).
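(A minimal C++ sketch of the two assignment policies described above - round-robin per quad versus blocks of N fragments per unit. This is purely illustrative; it is not the simulator's code, and all the names here are invented.)

Code:
#include <cstddef>
#include <cstdio>
#include <vector>

// Toy model of the two fragment-to-shader-unit assignment policies described
// above. Purely illustrative; not the simulator's actual code.

// Policy 1: each surviving quad goes to the next shader unit in turn.
// Neighbouring quads from the same 8x8 tile land on different units,
// which is what hurts texture cache locality.
std::vector<int> assignRoundRobin(std::size_t numQuads, int numUnits) {
    std::vector<int> unitOf(numQuads);
    for (std::size_t i = 0; i < numQuads; ++i)
        unitOf[i] = static_cast<int>(i % numUnits);
    return unitOf;
}

// Policy 2: keep feeding the same unit until it has taken a block of N quads
// (standing in for "N fragments / until its resources fill up"), then move on.
std::vector<int> assignBlocks(std::size_t numQuads, int numUnits, int blockSize) {
    std::vector<int> unitOf(numQuads);
    int unit = 0, inBlock = 0;
    for (std::size_t i = 0; i < numQuads; ++i) {
        unitOf[i] = unit;
        if (++inBlock == blockSize) { inBlock = 0; unit = (unit + 1) % numUnits; }
    }
    return unitOf;
}

int main() {
    const std::size_t quads = 16;          // 16 quads from one 8x8 tile
    auto rr  = assignRoundRobin(quads, 4);
    auto blk = assignBlocks(quads, 4, 8);  // e.g. 8 quads per unit per block
    for (std::size_t i = 0; i < quads; ++i)
        std::printf("quad %2zu -> round-robin unit %d, block unit %d\n", i, rr[i], blk[i]);
    return 0;
}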
In a tile-based rasteriser, presumably you could multi-thread the rasteriser. I know in the other paper you've done a shader-implemented rasteriser - which in itself suggests that multi-threading is possible (if there's more than one shader pipeline to run the rasteriser program).

Could you also multi-thread the primitive assembly and triangle setup engines? I presume you could because those steps are being removed as fixed-function in DX10, and implemented as shader programs too, aren't they?

We don't have that concept of a batch yet, and I'm unlikely to call it that - too confusing with the other batches. Maybe 'shader work assignment group' or 'unit' or something ...
I dare say you're in a good position to set the standard here, since the IHVs seem so coy about this subject.

The only scheduling that is done outside the shader unit is to send vertex inputs to the shader before sending fragment inputs (vertex-first scheduling). As the shader unit doesn't have a penalty for fetching instructions from either kind of input each cycle, they just get mixed. And the number of vertex inputs is limited by the queues in the geometry pipeline.
One thing that puzzles me about the unified pipeline is whether running multiple vertices in a work unit (e.g. in Xenos it's 16 vertices) will cause problems with vertex batch granularity. Put simply, if you've got 18 vertices to shade with a specific program, before the next batch uses a slightly different program, then in a traditional MIMD pipeline GPU, each vertex progresses individually through a pipe, quite happily. There's no issue of granularity as there is no work unit, as such.

But in a unified architecture, you have two work units: 16 vertices and 2. The second work unit wastes 14 threads' worth of resources. It just seems to me that vertex shading prefers finer-grained parallelism than fragment shading. Is that fair?
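(To make the arithmetic concrete, here is a toy C++ calculation of the waste for this hypothetical 18-vertex batch with 16-wide work units; the numbers are just the example above, not measurements from any real part.)

Code:
#include <cstdio>

// Toy arithmetic for the granularity point above: shading V vertices with a
// unified unit that issues work in groups of G lanes. Illustrative only.
int main() {
    const int V = 18;   // vertices in the batch before the program changes
    const int G = 16;   // lanes per work unit (e.g. a Xenos-style vector width)

    const int groups     = (V + G - 1) / G;   // ceil(V / G) work units issued
    const int lanesTotal = groups * G;
    const int wasted     = lanesTotal - V;

    std::printf("%d vertices in groups of %d -> %d work units, %d of %d lanes idle (%.1f%% waste)\n",
                V, G, groups, wasted, lanesTotal, 100.0 * wasted / lanesTotal);
    // A traditional MIMD vertex pipe has no such grouping, so no lanes are wasted there.
    return 0;
}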

Another 'to be done' is downgrading the quite idealized shader unit to work in a SIMD way (so a whole batch must execute the same fetched instruction before starting on the next). But I don't think that fetching an instruction (or group of instructions) every cycle or every few cycles is that problematic. CPUs implement higher fetch bandwidth at higher frequencies.
Since current GPUs are repeatedly executing a single instruction, it seems that they can "cut out" instruction decode from the primary pipeline (e.g. make it a separate task that runs "every so often" in a dedicated decode unit, delivering the decoded instruction and register file indices "just in time"). But in terms of the main pipeline organisation, does this actually amount to anything useful?
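(A minimal lockstep-SIMD sketch of that idea in C++: one fetch/decode per batch, applied across all elements before the program counter advances. The two-op "ISA" below is invented purely for illustration.)

Code:
#include <cstdio>
#include <vector>

// Lockstep SIMD sketch: each instruction is fetched and decoded once per
// batch, then every element of the batch executes it before moving on.
struct Instr { char op; float imm; };   // 'a' = add immediate, 'm' = mul immediate

void runBatch(const std::vector<Instr>& program, std::vector<float>& batch) {
    for (const Instr& ins : program) {  // fetch/decode once per batch...
        for (float& r : batch) {        // ...then all lanes run it in lockstep
            if (ins.op == 'a')      r += ins.imm;
            else if (ins.op == 'm') r *= ins.imm;
        }
    }
}

int main() {
    const std::vector<Instr> prog = { {'m', 2.0f}, {'a', 1.0f} };
    std::vector<float> lanes = { 0.f, 1.f, 2.f, 3.f };   // a 4-wide batch
    runBatch(prog, lanes);
    for (float r : lanes) std::printf("%.1f ", r);       // prints: 1.0 3.0 5.0 7.0
    std::printf("\n");
    return 0;
}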

Jawed
 
Chalnoth said:
Well, the thing is, I think the actual layout and design of the chip is such a huge portion of the cost that even though they may benefit from getting the R520 out before the R600 in terms of teething on new technology, they still have a lot of costs tied into the R520 itself that they may not have time to recoup.

From the series launch onwards, R520 incurs no further costs if production is already running. It's already earning money.
 
_xxx_ said:
From the series launch onwards, R520 incurs no further costs if production is already running. It's already earning money.
Right, but if the series doesn't last very long, it may well end up being a loss by the time it's retired.
 
Chalnoth said:
Right, but if the series doesn't last very long, it may well end up being a loss by the time it's retired.

That doesn't matter if it's a well-calculated loss and not too large. But I think they're on the less dangerous side financially; they've milked lots of money out of the R3xx architecture with comparatively little investment in it since. They're very strong with TV chips and such gadgets, so don't worry about them going down that easily.

Thanks to the work on Xbox 360 and the associated MS engineering team, they're pretty surely working on a chip 95% tailored to DX10, with the remaining 5% still subject to change. And they'll probably have someone develop 2-3 different layout bits here and there which are essentially swappable, so they're probably almost done with the basic layout and will now start the first optimization phase. Just guessing, but probable IMHO.
 
Only triangle setup is performed in the shader unit. Remember how 3D graphics cards worked in the 'dark ages' (late 90s): the CPU did all the work except fragment generation, (single) texturing and updating the framebuffer. It's just the same idea: as setup has historically been something relatively easy to implement in a CPU or geometry processor, you can move it to the shader as a third kind of input. How much hardware do you save by removing setup? I guess that depends on the triangle setup and rasterization algorithm you use ... But the 2DH method also saves you from performing true clipping, so it may be worth it.

I don't see any benefit in moving primitive assembly anywhere. Primitive assembly just takes three of (at least) four stored vertices and groups them as a triangle in the proper order. There is no computation (just a few compares to detect degenerates for indexed primitives). And I doubt the geometry pipeline requires a high vertex/triangle throughput. Xenos seems to be limited to one triangle (vertex?) per cycle, and I wonder if PC GPUs go beyond that (if they do, I doubt it's more than 2 or 3 triangles per cycle). If you are limited by triangle setup or triangle throughput (so very small triangles) in the geometry pipeline, I would say it's a sign that you have implemented the wrong architecture for that graphics application. Go Reyes or ray tracing instead.

Primitive assembly and the geometry shader (or whatever it's called) are different things. I don't implement any high-order curve/surface algorithm, but getting a shader unit to work on any other kind of input is trivial with the simulator. For the shader unit, vertices, fragments, triangles or whatever don't exist as such, only as input streams, programs with associated state, and output streams.

The inefficiency when you can't complete a group of vertices exists, but in the paper's experiments it is minimal, as only 4 vertices form a group. If all your primitive batches have fewer than 4 vertices, your problem isn't the inefficiency of shader input grouping but batches that are too small (very large state-change overhead and very large CPU overhead because of the number of API calls). For an implementation using larger input groups it still only happens once per primitive batch, and there are already quite a few more reasons why IHVs ask developers for larger primitive batches. As long as the other queues in the geometry pipeline are large enough that a whole vertex group can be constructed, there shouldn't be a problem.
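(As an illustration of how little computation the primitive-assembly step described above needs, here is a minimal C++ sketch for an indexed triangle list - just grouping indices in threes and a few compares to drop degenerates. The names and structure are invented, not taken from any real driver or the simulator.)

Code:
#include <cstddef>
#include <cstdio>
#include <vector>

// Minimal primitive assembly for an indexed triangle list: take indices three
// at a time, keep the winding order as given, and drop degenerate triangles
// (two or more equal indices) with a few compares. Illustrative only.
struct Triangle { unsigned i0, i1, i2; };

std::vector<Triangle> assembleTriangleList(const std::vector<unsigned>& indices) {
    std::vector<Triangle> tris;
    for (std::size_t i = 0; i + 2 < indices.size(); i += 3) {
        const unsigned a = indices[i], b = indices[i + 1], c = indices[i + 2];
        if (a == b || b == c || a == c)      // degenerate: cull before setup
            continue;
        tris.push_back(Triangle{a, b, c});
    }
    return tris;
}

int main() {
    // Three triangles' worth of indices; the middle one is degenerate.
    const std::vector<unsigned> idx = { 0, 1, 2,   2, 2, 3,   2, 3, 4 };
    const auto tris = assembleTriangleList(idx);
    std::printf("%zu triangles assembled\n", tris.size());   // prints: 2
    return 0;
}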
 
Figured I would attempt to add something to the discussion:

I noticed Jawed quoted 'There is an old saying: "Jack of all trades, master of none."'

Here's where it came from as far as I can tell:

Question: Are GPU architectures and Direct3D evolving toward a design where the distinction between vertex and pixel shaders essentially goes away?—davesalvator

David Kirk: For hardware architecture, I think that's an implementation detail, not a feature.

For sure, the distinction between the programming models and instruction sets of vertex shaders and pixel shaders should go away. It would be soooo nice for developers to be able to program to a single instruction set for both.

As to whether the architectures for vertex and pixel processors should be the same, it's a good question, and time will tell the answer. It's not clear to me that an architecture for a good, efficient, and fast vertex shader is the same as the architecture for a good and fast pixel shader. A pixel shader would need far, far more texture math performance and read bandwidth than an optimized vertex shader. So, if you used that pixel shader to do vertex shading, most of the hardware would be idle, most of the time. Which is better—a lean and mean optimized vertex shader and a lean and mean optimized pixel shader or two less-efficient hybrid shaders? There is an old saying: "Jack of all trades, master of none."

This came from this link: http://www.extremetech.com/article2/0,1697,1745060,00.asp
 
What is better, then: 4 or 6 vertex shaders doing nothing, or an additional unified shader quad working on fragments? What is better: 32 fragment shaders doing nothing, or working on a very large vertex program? So if your shaders are not unified, most of the vertex or fragment shader hardware is idle, most of the time ;)

The answer? As with everything else, it depends. But I think that unified (or rather, general-purpose) shader units are the future (if not already the present).

Vertex programs already support texturing, and in any case there isn't a law of physics that requires the texture unit to be directly attached to a shader unit. Fragment programs aren't at the 1:1 math-to-texture ratio anymore either. Fragment programs will require more math and less texturing, so the difference between a vertex program and a fragment program in terms of texturing and math is shrinking.

A fragment shader already has all (or at least most) of what is required to shade vertices (and with textures) at a fast rate. You don't need to downgrade it to the vertex shader level. Unused resources are the norm in any kind of processor. Your P4 or K8 (or any other superscalar CPU) isn't executing 2 integer ops, 2 FP or SIMD ops and 2 memory operations every cycle either.

And hardware resources that remain idle a significant percentage of the time aren't new to GPUs. The ROPs are mostly idle for fragment-limited shaders with no AA, and the bandwidth of high-end GPUs isn't fully used unless high AF and AA modes are enabled. Why did ATI, NVidia or others implement 2 or more TMUs per fragment pipe, when one of the TMUs could be idle in 'many' cases?
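(A back-of-the-envelope C++ model of that utilisation argument; the unit counts and workload numbers below are made up purely to illustrate the trade-off, not taken from any real part.)

Code:
#include <algorithm>
#include <cstdio>

// Toy utilisation model: the same total number of units, arranged either as
// dedicated vertex + fragment shaders or as one unified pool. All numbers
// are invented for illustration; "work" is in arbitrary ALU-cycles.
int main() {
    const double vertexWork   = 10.0;   // vertex shading demand this frame
    const double fragmentWork = 90.0;   // fragment shading demand this frame

    // Dedicated design: 6 vertex units + 16 fragment units of equal throughput.
    const double vUnits = 6.0, fUnits = 16.0;
    const double dedicatedTime = std::max(vertexWork / vUnits, fragmentWork / fUnits);
    const double dedicatedUtil = (vertexWork + fragmentWork) / ((vUnits + fUnits) * dedicatedTime);

    // Unified design: the same 22 units, each able to take either kind of work
    // (scheduling overhead ignored for simplicity).
    const double uUnits = vUnits + fUnits;
    const double unifiedTime = (vertexWork + fragmentWork) / uUnits;

    std::printf("dedicated: frame time %.2f, utilisation %.0f%%\n", dedicatedTime, 100.0 * dedicatedUtil);
    std::printf("unified:   frame time %.2f, utilisation 100%%\n", unifiedTime);
    return 0;
}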
 
Very well said.

And, anyway, NVidia's prolly done a fair amount of design work already. If NVidia's behind ATI on the actual implementation, it shouldn't be that long.

Thanks, Ged, for saving me the bother of the rummage!

Jawed
 
RoOoBo said:
What is better, then: 4 or 6 vertex shaders doing nothing, or an additional unified shader quad working on fragments? What is better: 32 fragment shaders doing nothing, or working on a very large vertex program? So if your shaders are not unified, most of the vertex or fragment shader hardware is idle, most of the time ;)
Yes, but the question still is:
Is (extra utilization) - (lost processing power due to added complexity) > (better optimization for current task) - (idle execution units) ?

I think David Kirk hit the nail on the head when he stated that unified pipelines aren't a feature, they're an implementation detail. There's a lot to like about unified pipelines, but I'm still not sold that they'll be better except in extreme circumstances.
 
Chalnoth said:
Yes, but the question still is:
Is (extra utilization) - (lost processing power due to added complexity) > (better optimization for current task) - (idle execution units) ?

I think David Kirk hit the nail on the head when he stated that unified pipelines aren't a feature, they're an implementation detail. There's a lot to like about unified pipelines, but I'm still not sold that they'll be better except in extreme circumstances.

If we could only use instrumented public drivers to see exactly what units are doing what, that question would be quite easy to answer. I think the answer might be surprising (a lot more idle silicon than people think). Of course, the IHVs have that performance data.

I think obtaining higher efficiency is one of the things to focus on as your stream processors get more generalised, and it seems going unified with a shader pool is worth the engineering and silicon budget. At least for ATI, who seem to be nailing it with C1.
 
Rys said:
If we could only use instrumented public drivers to see exactly what units are doing what, that question would be quite easy to answer.
Just a thought... as I haven't looked into it in a huge amount of detail... but you can get a lot of vendor-specific stats from ATI's plugin for PIXfW and NVidia's NVPerfHUD, from what I've seen/heard.

Of course, the really interesting (and/or sensitive) stuff is still private, but a bit of instrumentation there might well shed some light on things? :)

Cheers,
Jack
 