Because Xenos is a unified architecture, where you can normally mark a pixel to be killed, you can also mark a vertex to be killed - the setup engine will just ignore vertices marked for kill.
Ailuros said: Are you saying that simply because Xenos doesn't have a piece of hardware designed for this task that it can't do it?
In collaboration with the CPU, definitely; and in that department consoles, especially with the CPUs they incorporate, have an advantage over the PC for the moment, IMHO.
Besides that, a tessellation unit is no longer needed for WGF2.0.
I dare say it wouldn't surprise me if ATI and M$ went off into a corner and decided that regardless of WGF2.0's functionality, Xenos was going to do this.
The Xenos GPU is still not WGF2.0 compliant.
You've just highlighted the advancements in the I/O model of WGF2.0. Can we keep tessellation and topology apart for a second?
What you are referring to with the highlights above in MEMEXPORT is part of the new I/O model, and thus what is described in the graph above as "reusable stream output" in two locations, one for the tessellation unit and one for the geometry shader.
Xenos can generate an arbitrary collection of new vertices in addition to or in replacement of the input vertices.
Does it delete any?
It's my understanding that the optional tessellation unit in WGF2.0 was removed (even as an option) because it turned out less and less programmable with every successive draft, and at that point it makes sense to just get rid of it altogether, IMHO.
DaveBaumann said: Because Xenos is a unified architecture, where you can normally mark a pixel to be killed, you can also mark a vertex to be killed - the setup engine will just ignore vertices marked for kill.
Well, it'd better ignore all triangles connected to any vertices marked for kill.
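As a minimal sketch of the behaviour being argued over here (not Xenos' actual setup logic, which nobody in the thread has described), a setup stage that honours vertex kill could simply drop every triangle that references a killed vertex. The names `cull_killed_triangles`, `triangles` and `killed` are hypothetical:

```python
def cull_killed_triangles(triangles, killed):
    """Return only the triangles whose three vertices all survive.

    triangles: list of (i0, i1, i2) index tuples into a vertex buffer.
    killed:    set of vertex indices that the vertex shader marked for kill.
    """
    return [tri for tri in triangles if not any(i in killed for i in tri)]


# Two triangles sharing an edge; killing vertex 3 removes only the second one.
tris = [(0, 1, 2), (1, 3, 2)]
print(cull_killed_triangles(tris, killed={3}))
```

Killing a shared vertex (1 or 2 in this example) would remove both triangles, which is the case the reply above is worried about.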
Jawed said: Sorry, I literally can't make head nor tail of what you're saying here.
OK, why is this relevant to whether Xenos can tessellate or not?
But why would that impact tessellation? Since Xenos isn't built for WGF2.0 I don't understand the point you're making.
No, I've highlighted the technique by which tessellation is performed within Xenos (the I/O model is intrinsic to this process with Xenos). The technique requires that replacement/new vertex data is written to the vertex buffer, or a post-vertex-shader buffer.
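To make "replacement/new vertex data is written to the vertex buffer" concrete, here is a rough CPU-side sketch. It uses plain midpoint subdivision as a stand-in for whatever tessellation scheme a real Xenos shader would run, and a Python list as a stand-in for the buffer that MEMEXPORT would write to. `tessellate_once` and its arguments are hypothetical names, not anything from ATI's documentation:

```python
def tessellate_once(vertices, triangles):
    """Split every triangle into four by inserting edge midpoints.

    vertices:  list of (x, y, z) tuples; new vertices are appended to a copy.
    triangles: list of (i0, i1, i2) index tuples.
    Returns (new_vertices, new_triangles).
    """
    out_vertices = list(vertices)
    midpoint_cache = {}  # shared edges reuse the same new vertex

    def midpoint(a, b):
        key = (min(a, b), max(a, b))
        if key not in midpoint_cache:
            va, vb = out_vertices[a], out_vertices[b]
            out_vertices.append(tuple((p + q) / 2.0 for p, q in zip(va, vb)))
            midpoint_cache[key] = len(out_vertices) - 1
        return midpoint_cache[key]

    out_triangles = []
    for i0, i1, i2 in triangles:
        m01, m12, m20 = midpoint(i0, i1), midpoint(i1, i2), midpoint(i2, i0)
        # Three corner triangles plus the centre triangle.
        out_triangles += [(i0, m01, m20), (m01, i1, m12),
                          (m20, m12, i2), (m01, m12, m20)]
    return out_vertices, out_triangles


verts = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
new_verts, new_tris = tessellate_once(verts, [(0, 1, 2)])
print(len(new_verts), len(new_tris))
```

The original vertices stay in place and the new ones are appended, so the output holds new vertices "in addition to" the input vertices; a replacing variant would overwrite entries instead.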
One of the rendering techniques in Xenos is the Z-only pass which, as far as we can tell, marks up the vertex buffer with data describing which tile(s) each triangle is in. So that's another example of Xenos modifying the vertex buffer during vertex shading.
And since Xenos is not a WGF2.0 part, how is that relevant?
What's curious, to me, about NVidia's apparent approach is that they are introducing yet more functional blocks into a GPU design, which goes against the general drift of recent GPU designs towards removing functional blocks in favour of executing them as programmable operations. One example of this shift is the fog ALU, which is implemented in code in SM3.
Rather than adding in extra hardware pipelines to perform tessellation, ATI seems to have chosen to focus on the ability to execute a software tessellator ultra-efficiently (and very rapidly if all 48 ALUs are working on tessellation).
Ailuros said: Using all 48 ALUs for one specific task would presuppose IMHO that there aren't any other bottlenecks in the system. While I could eventually also think it can use all 48 ALUs for VS functions, it'll be held back by the limitations of the triangle setup.
If Xenos consists of 232M + 70M transistors (parent+daughter) for logic, a 300M transistor 90nm GPU in the summer of 2006 seems reasonable.
I don't see tessellation as being a "special case" in Xenos - and I don't see why it would be a special case in R600 either.
Ailuros said: What's the rated (and sustained) peak geometry throughput of Xenos? W/o being entirely sure I think it's ~500MVertices/sec (if not I stand corrected).
If Xenos consists of 232M + 70M transistors (parent+daughter) for logic, a 300M transistor 90nm GPU in the summer of 2006 seems reasonable.
That's a very daring prediction. Assuming NVIDIA targets let's say twice the performance of a G70, that one above sounds more than just conservative.
I don't see tessellation as being a "special case" in Xenos - and I don't see why it would be a special case in R600 either.
Because the GPU and CPU in those consoles are able to share resources eventually.
I can't tell if you think I'm aiming too low or too high here.
It'll be interesting to see if games on either console use the CPU for vertex manipulation. I think PS3 will be forced to because RSX will be pretty weak in ultra-high geometry algorithms (e.g. shadowing pass) - but I don't think Xenos is going to crumble under this kind of workload.
Also WGF2.0 is supposed to bring about a massive reduction in DirectX overheads. As far as I can tell those overheads hit vertex data pretty hard, which is basically why 3DMk05 is showing as CPU-limited.
I'm sure R600 won't enjoy the API efficiency of a console, but M$ is at least making noises that significantly greater performance in the API is a primary goal for WGF.
Ailuros said: Actually if you'd ask me I think for both consoles the real blessings are their GPUs; those developers that have dealt with either/or CPU don't seem to show any enthusiasm (to put it mildly) so far.
Agreed. I can imagine there's more pleasure than pain in having to learn Cg or HLSL (for SM3+) if you've been used to programming PS2 or DX8.
Jawed
...at what point does it become too hard to schedule instructions in such a way to feed the multiple instructions per clock that you'd be better off with more pipelines instead of more instructions per pipeline?
So, it'll be interesting to see what NVidia chooses to build when it makes a unified GPU - will each pipeline be wide or narrow?...
Ailuros said: Does that even make sense? Can you compare an R4xx/5xx quad (4-way SIMD) with Xenos' 16-way SIMD? Pipeline != pipeline really in that case, does it?
Er, I don't think he's talking about that, but rather about the way, for example, that the G70 can issue two MADs in a single cycle in a single pixel pipeline. Of course, I think "wide" is totally the wrong word for this functionality.
Ailuros said: What's the rated (and sustained) peak geometry throughput of Xenos? W/o being entirely sure I think it's ~500MVertices/sec (if not I stand corrected).
It is, but you won't be setup limited if the vertex shader program is of any significant length. Serious Sam 2 has average pixel shader lengths of 15-20 instructions, but vertex shaders of more than 100. It's easy to dedicate all ALUs to vertex processing and not be setup limited. The bigger problem is that the tons of pixels generated as a result require processing too. But, as has been stated many times before, Z-only passes, etc. should just scream by.
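A back-of-the-envelope version of that argument, as a sketch only. The 48 ALUs, the ~500 Mverts/s setup figure (which Ailuros himself hedges) and the 100-instruction shader length come from the thread; the 500 MHz clock and the simplification that each ALU retires one shader instruction per clock for one vertex are my assumptions:

```python
ALUS = 48                      # unified ALUs, from the thread
CLOCK_HZ = 500e6               # assumed core clock
SETUP_VERTS_PER_SEC = 500e6    # setup figure quoted (with hedging) in the thread


def shaded_vertices_per_sec(instructions_per_vertex):
    """Vertices per second the ALUs can shade if all of them run vertex work."""
    return ALUS * CLOCK_HZ / instructions_per_vertex


def is_setup_limited(instructions_per_vertex):
    """True when the ALUs could shade vertices faster than setup accepts them."""
    return shaded_vertices_per_sec(instructions_per_vertex) > SETUP_VERTS_PER_SEC


for length in (10, 48, 100):
    rate = shaded_vertices_per_sec(length) / 1e6
    print(f"{length:>3} instructions: {rate:.0f} Mverts/s shaded, "
          f"setup limited: {is_setup_limited(length)}")
```

Under these assumptions the break-even is 48 instructions per vertex, so a 100-instruction vertex shader would be shader limited (about 240 Mverts/s) rather than setup limited.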
So, the question really is, at what point does it become too hard to schedule instructions in such a way to feed the multiple instructions per clock that you'd be better off with more pipelines instead of more instructions per pipeline?
Doesn't the inefficiency that Xenos is designed to tackle relate more to handling the different latencies associated with each instruction than to ILP efficiency? (i.e. sure, you can get peak-performance single-cycle MADs if your shader just runs MAD after MAD with no dependencies, but real shaders use a mix of different instructions, each with different latencies, that create pipeline bubbles in the ALUs. Xenos attempts to fill those gaps by constantly switching different threads into them.)
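The gap-filling idea can be illustrated with a toy model. This is not Xenos' real scheduler: the single-issue ALU, the fixed latency and the function name `alu_utilization` are all assumptions made for illustration. Each thread must wait `latency` cycles after issuing before it can issue again, and the ALU picks any thread that is ready:

```python
def alu_utilization(num_threads, latency, cycles=1000):
    """Fraction of cycles in which a single-issue ALU has work to do.

    latency is the number of cycles between one instruction of a thread
    issuing and that thread's next instruction being allowed to issue
    (latency=1 means back-to-back issue with no bubble).
    """
    ready_at = [0] * num_threads  # cycle at which each thread may issue again
    issued = 0
    for cycle in range(cycles):
        for t in range(num_threads):
            if ready_at[t] <= cycle:
                ready_at[t] = cycle + latency
                issued += 1
                break  # only one instruction issues per cycle
    return issued / cycles


for threads in (1, 2, 4, 8):
    print(threads, "threads:", alu_utilization(threads, latency=4))
```

With a latency of 4 cycles, one thread keeps the ALU busy a quarter of the time, and utilisation rises with thread count until there are enough threads to cover the latency; beyond that, extra threads add nothing in this model.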