G80 to have unified shader architecture??!?!

Because Xenos is a unified architecture, just as you can normally mark a pixel to be killed, you can also mark a vertex to be killed - the setup engine will just ignore vertices marked for kill.
 
Ailuros said:
Are you saying that, simply because Xenos doesn't have a piece of hardware designed for this task, it can't do it?

In collaboration with the CPU, definitely; and in that department - especially given the CPUs these consoles incorporate - consoles have an advantage over the PC, IMHO, for the moment.

Sorry, I literally can't make head nor tail of what you're saying here.

Besides that, a tessellation unit is no longer needed for WGF2.0.

OK, why is this relevant to whether Xenos can tessellate or not?

I dare say it wouldn't surprise me if ATI and M$ went off into a corner and decided that regardless of WGF2.0's functionality, Xenos was going to do this.

The Xenos GPU is still not WGF2.0 compliant.

But why would that impact tessellation? Since Xenos isn't built for WGF2.0 I don't understand the point you're making.

You've just highlighted the advancements in the I/O model of WGF2.0. Can we keep tessellation and topology apart for a second?

No, I've highlighted the technique by which tessellation is performed within Xenos (the I/O model is intrinsic to this process with Xenos). The technique requires that replacement/new vertex data is written to the vertex buffer, or a post-vertex-shader buffer.

One of the rendering techniques in Xenos is the Z-only pass which, as far as we can tell, marks-up the vertex buffer with data describing which tile(s) each triangle is in. So that's another example of Xenos modifying the vertex buffer during vertex shading.
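
To make that mark-up idea concrete, here's a toy sketch of the principle (Python, purely illustrative - the tile size, the bounding-box test and the data layout are my guesses, not Xenos's actual scheme):

```python
# Toy sketch: tag each triangle with the screen tiles it touches,
# in the spirit of a Z-only pass marking up the vertex buffer.
# All numbers and the bounding-box test are illustrative guesses.
TILE_W, TILE_H = 320, 240  # example tile dimensions, not the real ones

def tiles_for_triangle(tri):
    """Return the set of (tx, ty) tiles the triangle's bounding box overlaps."""
    xs = [v[0] for v in tri]
    ys = [v[1] for v in tri]
    return {(tx, ty)
            for tx in range(min(xs) // TILE_W, max(xs) // TILE_W + 1)
            for ty in range(min(ys) // TILE_H, max(ys) // TILE_H + 1)}

# "Mark up the buffer": pair each triangle with its tile set.
triangles = [((10, 10), (300, 40), (50, 200)),
             ((310, 230), (640, 240), (400, 470))]
marked = [(tri, tiles_for_triangle(tri)) for tri in triangles]
print(marked[0][1])  # {(0, 0)} - first triangle lives in one tile
print(marked[1][1])  # second triangle spans several tiles
```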

What you are referring to with the highlights above in MEMEXPORT is part of the new I/O model, and thus what is described in the graph above as "reusable stream output" in two locations, one for the tessellation unit and one for the geometry shader.

Well how else is the GPU going to modify the vertex buffer?

Xenos can generate an arbitrary collection of new vertices in addition to or in replacement of the input vertices.

Does it delete any?

OK, now you're kidding me. Why not? It can read in a vertex buffer and re-write it how it likes. Of course it can delete.
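
If a CPU-side analogy helps (nothing Xenos-specific - just the read-filter-rewrite idea, with made-up vertex data):

```python
# Toy sketch of deleting vertices by rewriting the buffer (stream
# compaction). The vertex format and kill test are made up.
def rewrite_vertex_buffer(vertex_buffer, should_kill):
    """Read the buffer, drop vertices marked for kill, write the rest back."""
    return [v for v in vertex_buffer if not should_kill(v)]

# Example: kill any vertex behind the camera (z <= 0).
buffer = [(0.0, 0.0, 1.0), (1.0, 0.0, -2.0), (0.0, 1.0, 3.0)]
buffer = rewrite_vertex_buffer(buffer, lambda v: v[2] <= 0.0)
print(buffer)  # the z = -2.0 vertex is gone
```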

It's my understanding that the optional tessellation unit in WGF2.0 was removed (even as an option) because it turned out to be less and less programmable with every draft, at which point it made sense, IMHO, to just get rid of it altogether.

And since Xenos is not a WGF2.0 part, how is that relevant?

Jawed
 
DaveBaumann said:
Because Xenos is a unified architecture, just as you can normally mark a pixel to be killed, you can also mark a vertex to be killed - the setup engine will just ignore vertices marked for kill.
Well, it'd better ignore all triangles connected to any vertices marked for kill.
 
Jawed said:
Sorry, I literally can't make head nor tail of what you're saying here.

That not all possible tessellation functions are going to get executed by the GPU alone on Xenos; there are many cases where it'll cooperate with the CPU, and all I meant is that such cooperation is easier on a console than on a PC.

OK, why is this relevant to whether Xenos can tessellate or not?

Not relevant at all. You're the one who claims to have interpreted MEMEXPORT as the basis of "geometry shading". Now please tell me what exactly we are talking about here: the new I/O model, geometry shading or tessellation? ***

But why would that impact tessellation? Since Xenos isn't built for WGF2.0 I don't understand the point you're making.

*** see above. Why isn't the Xenos GPU WGF2.0 compliant? Because it lacks a geometry shader maybe? Your own quote:

I dunno if it's fair, but I've interpreted Xenos's MEMEXPORT functionality as the basis for geometry shading.

No, I've highlighted the technique by which tessellation is performed within Xenos (the I/O model is intrinsic to this process with Xenos). The technique requires that replacement/new vertex data is written to the vertex buffer, or a post-vertex-shader buffer.

One of the rendering techniques in Xenos is the Z-only pass which, as far as we can tell, marks-up the vertex buffer with data describing which tile(s) each triangle is in. So that's another example of Xenos modifying the vertex buffer during vertex shading.

I'm still missing the point where all that is related to geometry shading.


And since Xenos is not a WGF2.0 part, how is that relevant?

You tell me considering the above.
 
When I said "geometry shading" I meant it in a generic way (incorporating all manipulations of geometry, vertex shading, tessellation, higher-order surfaces etc.) rather than in the specific WGF2.0 sense. If I'd been familiar with the WGF2.0-specific meaning I'd have chosen another phrase - sorry :oops:

As far as I can tell Xenos is designed explicitly for general purpose tessellation, which it performs by multi-passing the vertex data - in other words it's forced to render the tessellation results into a buffer, which then provides the input for final vertex shading.
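
A toy sketch of what I mean by multi-passing (Python; the midpoint-subdivision scheme and function names are just mine for illustration, not Xenos's actual tessellator):

```python
# Pass 1 tessellates the control mesh and writes the new vertices to an
# intermediate buffer; pass 2 reads that buffer back for final vertex
# shading. The subdivision scheme is an illustrative assumption.
def tessellate_pass(control_triangles):
    """Pass 1: split each triangle into four via edge midpoints,
    writing the result to an intermediate buffer (here, a list)."""
    def mid(a, b):
        return tuple((x + y) / 2.0 for x, y in zip(a, b))
    out = []
    for (a, b, c) in control_triangles:
        ab, bc, ca = mid(a, b), mid(b, c), mid(c, a)
        out += [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]
    return out

def vertex_shade_pass(triangles, transform):
    """Pass 2: run the 'real' vertex shader over the buffered geometry."""
    return [tuple(transform(v) for v in tri) for tri in triangles]

mesh = [((0, 0, 0), (1, 0, 0), (0, 1, 0))]
buffered = tessellate_pass(mesh)            # rendered to memory on Xenos
final = vertex_shade_pass(buffered, lambda v: tuple(2 * c for c in v))
print(len(final))  # 4 triangles from 1
```

On Xenos the intermediate "list" would be a real buffer in memory, which is where the space and bandwidth cost comes from.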

Jawed
 
The NVidia patent may be describing a pipelined geometry engine within a GPU, such that an intermediate write to a buffer in main memory is not required (not sure, frankly).

I haven't read that patent in detail, but NVidia seems to be at pains to create sections that are fixed function, but at the same time provide programmable functional blocks, so that the overall result is highly programmable.

What's interesting about Xenos is that it's an extremely parallel general purpose vector processor, tuned to execute any program at peak with minimal corner cases capable of introducing stalls. Rather than adding in extra hardware pipelines to perform tessellation, ATI seems to have chosen to focus on the ability to execute a software tessellator ultra-efficiently (and very rapidly if all 48 ALUs are working on tessellation).

Obviously this costs both memory space (and bandwidth) as well as non-tessellation rendering capability (i.e. it impacts vertex and fragment shading capacity).

What's curious, to me, about NVidia's apparent approach is that they are introducing yet more functional blocks into a GPU design, which goes against the general drift of recent GPU designs towards removing functional blocks in favour of executing them as programmable operations. An example of this shift is the fog ALU, which is implemented in code in SM3.

Jawed
 
From what I can tell NVIDIA has quite a few different patents issued on either geometry shading or PPP-related stuff.

What's curious, to me, about NVidia's apparent approach is that they are introducing yet more functional blocks into a GPU design, which goes against the general drift of recent GPU designs towards removing functional blocks in favour of executing them as programmable operations. An example of this shift is the fog ALU, which is implemented in code in SM3.

Assuming GS units will be separate units (temporarily) in the first WGF2.0 architectures, I'd expect a trend like that.

Wouldn't geometry shading also suggest geometry textures eventually? In a core with unified PS/VS shader units, latency for vertex texture fetches stops being a headache.

Rather than adding in extra hardware pipelines to perform tessellation, ATI seems to have chosen to focus on the ability to execute a software tessellator ultra-efficiently (and very rapidly if all 48 ALUs are working on tessellation).

I'd guess that full on-chip programmable tessellation has been scratched from WGF2.0 because IHVs were probably screaming that it wouldn't all fit into their transistor budgets.

While the above might be quite efficient for a console (where GPU and CPU can share resources far more efficiently), I'm not so sure it'll work as well in the PC space (hence all my other related comments).

Using all 48 ALUs for one specific task would presuppose, IMHO, that there aren't any other bottlenecks in the system. While I could also imagine it using all 48 ALUs for VS functions, it'll be held back by the limitations of the triangle setup.

To avoid misunderstandings: I was hoping to see advanced tessellation in the future, but from what I can see so far, I doubt we'll even see capabilities like the XBox360's in the PC space any time soon :(
 
Ailuros said:
Using all 48 ALUs for one specific task would presuppose, IMHO, that there aren't any other bottlenecks in the system. While I could also imagine it using all 48 ALUs for VS functions, it'll be held back by the limitations of the triangle setup.

The nature of all GPUs is that work is batched. The triangle set-up might only produce a batch of vertices to be shaded every 16 cycles (i.e. a batch of 16 vertices - though it's not clear to me what batch size Xenos uses).

But once that batch is created it's simply a matter of load-balancing that batch against all the other batches that are extant. At any instant it's possible for any combination of batches to make up the 64 concurrently executing threads, with each batch accounting for 16 threads.

A tessellation batch would fall into exactly the same scheme of batching. The source data for tessellation might be 4 vertices (a quad) per command thread, with 16 command threads in a batch, for example.

The execution of tessellation batches would compete with vertex batches and fragment batches according to overall batch scheduling. As each tessellation batch proceeds it will create what are in effect vertex batches which obviously need to be load-balanced too.
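
Here's a toy model of that load-balancing (Python; the fill policy and the batch timings are my assumptions - only the 16-threads-per-batch / 64-threads-in-flight numbers come from the description above):

```python
# Toy batch scheduler: 64 threads in flight = 4 slots of 16-thread
# batches. Tessellation, vertex and fragment batches compete for the
# slots; a finished tessellation batch spawns a vertex batch, since
# tessellation output needs vertex shading. Policy/timings are guesses.
from collections import deque

THREADS_PER_BATCH = 16
SLOTS = 64 // THREADS_PER_BATCH          # 4 batches resident at once

pending = deque([["tess", 3], ["vertex", 2], ["fragment", 5],
                 ["vertex", 1], ["fragment", 2]])   # [kind, cycles left]
running, cycle = [], 0

while pending or running:
    while pending and len(running) < SLOTS:  # fill any free slots
        running.append(pending.popleft())
    for batch in running:                    # all resident batches advance
        batch[1] -= 1
    for batch in [b for b in running if b[1] == 0]:
        print(f"cycle {cycle}: {batch[0]} batch retired")
        if batch[0] == "tess":               # tessellation emits new vertex work
            pending.append(["vertex", 2])
    running = [b for b in running if b[1] > 0]
    cycle += 1
```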

I don't see tessellation as being a "special case" in Xenos - and I don't see why it would be a special case in R600 either.

If Xenos consists of 232M + 70M transistors (parent+daughter) for logic, a 300M transistor 90nm GPU in the summer of 2006 seems reasonable.

Jawed
 
What's the rated (and sustained) peak geometry throughput of Xenos? W/o being entirely sure I think it's ~500MVertices/sec (if not I stand corrected).

If Xenos consists of 232M + 70M transistors (parent+daughter) for logic, a 300M transistor 90nm GPU in the summer of 2006 seems reasonable.

That's a very daring prediction. Assuming NVIDIA targets, let's say, twice the performance of a G70, the figure above sounds more than just conservative.

I don't see tessellation as being a "special case" in Xenos - and I don't see why it would be a special case in R600 either.

Because the GPU and CPU in those consoles are able to share resources eventually.
 
Ailuros said:
What's the rated (and sustained) peak geometry throughput of Xenos? W/o being entirely sure I think it's ~500MVertices/sec (if not I stand corrected).

1 vertex per clock according to the leak - at Xenos's 500MHz core clock that works out to the ~500 MVertices/sec you quote - which is why I said a batch of 16 vertices could be made every 16 cycles (best case), assuming that Xenos's batch size is 16 vertices.

If Xenos consists of 232M + 70M transistors (parent+daughter) for logic, a 300M transistor 90nm GPU in the summer of 2006 seems reasonable.

That's a very daring prediction. Assuming NVIDIA targets, let's say, twice the performance of a G70, the figure above sounds more than just conservative.

I can't tell if you think I'm aiming too low or too high here :?

I don't see tessellation as being a "special case" in Xenos - and I don't see why it would be a special case in R600 either.

Because the GPU and CPU in those consoles are able to share resources eventually.

It'll be interesting to see if games on either console use the CPU for vertex manipulation. I think PS3 will be forced to because RSX will be pretty weak in ultra-high geometry algorithms (e.g. shadowing pass) - but I don't think Xenos is going to crumble under this kind of workload.

Also WGF2.0 is supposed to bring about a massive reduction in DirectX overheads. As far as I can tell those overheads hit vertex data pretty hard, which is basically why 3DMk05 is showing as CPU-limited.

I'm sure R600 won't enjoy the API efficiency of a console, but M$ is at least making noises that significantly greater performance in the API is a primary goal for WGF.

Jawed
 
I can't tell if you think I'm aiming too low or too high here.

Besides both IHVs throwing around meaningless GFLOP numbers, my guess is that at the end of the day the two GPUs won't have any drastic differences (even if there were, who could measure it anyway?). Of course ATI will claim superior performance for its solution, and NVIDIA the same for its own.

To be frank, I don't even buy the 1080p claim for PS3; if it happens, it'll be for very undemanding games, and AA becomes questionable (assuming a 60Hz target).

As I said, I'd expect at least twice the performance of today's PC G70; reaching or exceeding that target with merely 70M more transistors sounds quite impossible to me at this stage.

It'll be interesting to see if games on either console use the CPU for vertex manipulation. I think PS3 will be forced to because RSX will be pretty weak in ultra-high geometry algorithms (e.g. shadowing pass) - but I don't think Xenos is going to crumble under this kind of workload.

Both systems might end up using their CPUs for vertex data; if, and to what degree, remains to be seen. I don't see why RSX would have any particular problems with shadowing passes, for instance, since it has its own video memory bus which isn't shared by the CPU.

Actually, if you ask me, for both consoles the real blessings are their GPUs; the developers that have dealt with either CPU don't seem to show any enthusiasm (to put it mildly) so far.

Also WGF2.0 is supposed to bring about a massive reduction in DirectX overheads. As far as I can tell those overheads hit vertex data pretty hard, which is basically why 3DMk05 is showing as CPU-limited.

Agreed on the first. On the latter, I'm not yet sure FutureMark has really managed to "foresee" what future games will look like.

I'm sure R600 won't enjoy the API efficiency of a console, but M$ is at least making noises that significantly greater performance in the API is a primary goal for WGF.

WGF2.0 doesn't presuppose specific hardware though (at least not anymore); merely an architecture that can handle unified shader calls. I've no doubt that unified shader units ARE the future. If, though, NVIDIA chooses temporarily to stay with separate units for its first WGF2.0 GPU, it'll be an interesting match to watch.

All IMHLO (L stands for layman) :p
 
Ailuros said:
Actually, if you ask me, for both consoles the real blessings are their GPUs; the developers that have dealt with either CPU don't seem to show any enthusiasm (to put it mildly) so far.

Agreed. I can imagine there's more pleasure than pain in having to learn Cg or HLSL (for SM3+) if you've been used to programming PS2 or DX8.

Jawed
 
Jawed said:
Ailuros said:
Actually, if you ask me, for both consoles the real blessings are their GPUs; the developers that have dealt with either CPU don't seem to show any enthusiasm (to put it mildly) so far.

Agreed. I can imagine there's more pleasure than pain in having to learn Cg or HLSL (for SM3+) if you've been used to programming PS2 or DX8.

Jawed

Anand re-wrote that one (you've obviously read the final one already):

http://groups-beta.google.com/group/alt.games.video.sony-playstation2/msg/62ff83d96ea78ea9?hl=en
 
One thing that discussions of unified hardware haven't touched on so far is the "width" of a single pipeline.

In G70's fragment pipelines there's a tuned version of the fatally "too-wide" NV30, able to dual-issue two vec4 (or vec3 + scalar) MADs and other combinations/co-issues, with a free FP16 normalise thrown in for good measure.

G70's vertex pipelines are narrower, being simply Vec4+scalar - but in general they aren't a GPU's bottleneck, so I'm not going to dwell on them.

In contrast it seems that Xenos's pipelines are as narrow as they can get, supporting a single vec4+scalar (vertices require wider ALUs than pixels). And let's not forget that Xenos has 16 texturing pipelines where texture address calculation ALUs run, in a similarly "narrow" configuration (presumably vec3 ALUs?).

So what this leads to is a comparison of wide and narrow pipelines:

1. it's easier for a compiler if it only has to organise co-issues and never has to worry about finding dual-issuable instructions - dual-issue complexities limit NV40, and are not completely removed in G70 (due to texture address calculations and the FP32 register-read limit, which prevents the dual-issue of two independent FP32 MADs, for example).

2. the minimum utilisation of an ALU in Xenos is at least a single scalar operation, whereas in G70 it appears that either ALU can go entirely unused due to instruction dependencies.

3. presumably Xenos's TMU ALUs will spend a fair amount of time idle, presuming that the initial texture access is all that exercises them (loops for anisotropic filtering don't require further address calculations?) - EDIT: hmm, would these ALUs ever be used in surface texturing anyway?

Obviously, this is all at the cost of a significantly more complex thread arbiter and batch scheduler inside Xenos. So what's saved in terms of per-pipe width is traded against the increased non-pipe complexity. Presumably Xenos's huge leap in parallel pipeline count compensates for the high cost of entry into a unified architecture.
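
To put a number on point 2, here's a toy comparison (Python; the issue rules are grossly simplified - real NV4x/G70 and Xenos issue logic is far more involved):

```python
# A dual-issue (wide) pipe can only pack two instructions per cycle
# when they are independent; narrow pipes fed from many threads always
# issue one each. Purely illustrative instruction stream.
stream = [("a", None), ("b", None), ("c", "a"), ("d", "c"), ("e", "d")]
# Each entry is (name, depends_on).

def wide_cycles(stream):
    """One dual-issue pipe: pair adjacent instructions when the second
    doesn't depend on the first; otherwise one ALU idles that cycle."""
    cycles, i = 0, 0
    while i < len(stream):
        if i + 1 < len(stream) and stream[i + 1][1] != stream[i][0]:
            i += 2          # co-issue both
        else:
            i += 1          # dependency: second ALU idles this cycle
        cycles += 1
    return cycles

def narrow_cycles(stream, pipes=2):
    """Two single-issue pipes running independent threads: with enough
    threads to fill them, throughput is just len(stream) / pipes."""
    return -(-len(stream) // pipes)   # ceiling division

print(wide_cycles(stream), narrow_cycles(stream))  # 4 vs 3 cycles here
```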

So, it'll be interesting to see what NVidia chooses to build when it makes a unified GPU - will each pipeline be wide or narrow?...

Jawed
 
Well, I wouldn't call the NV4x's pipes "wide" since you don't typically run parallel instructions in the separate units: you run subsequent instructions.

Anyway, an architecture like the NV4x is going to be less transistor-intensive than an architecture like the Xenos for the peak processing power available. So the question really is: at what point does it become so hard to schedule instructions to feed multiple issues per clock that you'd be better off with more pipelines instead of more instructions per pipeline?
 
...at what point does it become so hard to schedule instructions to feed multiple issues per clock that you'd be better off with more pipelines instead of more instructions per pipeline?

That's a dilemma IHVs have obviously already faced for WGF2.0 GPUs, IMHO. A "pipeline" can only get so "wide" in the end, nor can "pipeline" counts increase forever.


So, it'll be interesting to see what NVidia chooses to build when it makes a unified GPU - will each pipeline be wide or narrow?...

Does that even make sense? Can you compare an R4xx/5xx quad (4-way SIMD) with Xenos's 16-way SIMD? Pipeline != pipeline really in that case, does it?
 
Ailuros said:
Does that even make sense? Can you compare an R4xx/5xx quad (4-way SIMD) with Xenos's 16-way SIMD? Pipeline != pipeline really in that case, does it?
Er, I don't think he's talking about that, but rather about the way, for example, that the G70 can issue two MADs in a single cycle in a single pixel pipeline. Of course, I think "wide" is totally the wrong word for this functionality.
 
What's the rated (and sustained) peak geometry throughput of Xenos? W/o being entirely sure I think it's ~500MVertices/sec (if not I stand corrected).
It is, but you won't be setup-limited if the vertex shader program is of any significant length. Serious Sam 2 has average pixel shader lengths of 15-20 instructions, but vertex shaders greater than 100. It's easy to dedicate all ALUs to vertex processing and not be setup limited (back-of-envelope: 48 ALUs x 500MHz / 100 instructions = ~240 MVertices/s, comfortably below the ~500 MVertices/s setup rate). The bigger problem is that the tons of pixels generated as a result require processing too. But, as has been stated many times before, Z-only passes, etc. should just scream by.

So the question really is: at what point does it become so hard to schedule instructions to feed multiple issues per clock that you'd be better off with more pipelines instead of more instructions per pipeline?
Doesn't the inefficiency that Xenos is designed to tackle relate more to handling the different latencies associated with each instruction than to ILP efficiency? (i.e. sure, you can get peak performance from single-cycle MADs if your shader just runs MAD after MAD with no dependencies, but real shaders use a mix of different instructions, each with different latencies, that create pipeline bubbles in the ALUs. Xenos attempts to fill those gaps by constantly switching different threads into them.)
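
A toy illustration of that, in Python - the latencies and thread count are invented, but the principle of overlapping one thread's latency with other threads' work is the one described above:

```python
# One thread waiting out every instruction's latency vs. an ALU that
# swaps in other threads during the wait. Latencies are made up:
# 1 cycle for a MAD, longer for e.g. a texture fetch.
def single_thread_cycles(latencies):
    """A lone thread stalls for each instruction's full latency."""
    return sum(latencies)

def multithreaded_cycles(latencies, threads=4):
    """With enough resident threads, the ALU issues one instruction per
    cycle; each thread's latency hides behind the others' work."""
    return max(len(latencies) * threads, max(latencies))

shader = [1, 1, 4, 1, 8, 1]                  # MADs plus two slower ops
print(single_thread_cycles(shader))          # 16 cycles for one thread
print(multithreaded_cycles(shader) / 4)      # 6.0 cycles per thread amortized
```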
 
Chalnoth said:
Ailuros said:
Does that even make sense? Can you compare an R4xx/5xx quad (4-way SIMD) with Xenos's 16-way SIMD? Pipeline != pipeline really in that case, does it?
Er, I don't think he's talking about that, but rather about the way, for example, that the G70 can issue two MADs in a single cycle in a single pixel pipeline. Of course, I think "wide" is totally the wrong word for this functionality.

OK, got that one. From that perspective it was the only way to get 48 MADs/cycle in a design for both the PC and console.
 
I think CPU evolution has shown that ILP and OoOE hit a limit. Be it a software compiler or a hardware OoOE optimizer, maximizing throughput runs up against a barrier for given workloads, and you spend a shtload of transistors on it but get diminishing returns.

Since graphics is "embarrassingly parallel" and the workloads fit the profile of TLP instead of ILP, that suggests to me that the ideal graphics architecture maximizes concurrent threads, rather than trying to use ILP to increase instruction throughput within a thread. Bulking up each pipeline with more and more execution units will invariably lead to idle units, since the workloads won't always allow optimally scheduling/packing multiple instructions into a multiple dispatch.
 