NV30, HOS and OGL2

This paper got me interested:

- what comes after 4? We already have vertex and pixel processors...
- we also need to add a primitive processor with access to the connectivity information of the mesh, so that you can destroy primitives inside the GPU (e.g. local LOD management, where it counts) and so that you can create primitives inside the GPU
- implementation of arbitrary HOS in your app, without me having to tell my architecture team exactly what you want as much as three years before you want it
Mmkay... what exactly are they talking about here? A programmable primitive processor as in what, exactly? Displacement mapping (though that shouldn't have anything to do with tessellation and LOD)? NURBS?

Remember the OGL2 rendering architecture?
[image: OGL2 rendering architecture diagram]

Where does the aforementioned "programmable primitive processor" fit in here?
I remember the question about OGL2 and HOS support being asked before, so I did a little search on the web... and came up with nothing official. The ARB meeting notes don't even touch the subject.
I did find this article at digit-life, though...
Question: How are the geometry problems solved in OpenGL 2.0?

Answer: In no way. Arrays are objects; it's possible to access them directly, but you can't use any compression or optimization of indices, or caching. ... Geometrical objects. I think it is a severe flaw by the engineers. The lack of compression in the API and the old conception of geometry structuring are not what I like.

and one old proposal from the opengl.org discussion boards:

3.) Tessellation Program. The set of commands is the same as for the vertex shader, plus 3 predefined functions:
gl_Begin(GLint);
gl_End(void);
gl_Vertex(void), or probably better gl_EjectVertex(void);
The output of the Tessellation Program can be stored as an OpenGL Array Object.
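
Just to make that concrete for myself, here's a rough sketch of what such a Tessellation Program might look like (purely my own guesswork, not part of the proposal): it splits each incoming triangle into four. Only gl_Begin/gl_EjectVertex/gl_End come from the proposal; the v0..v2 inputs, the gl_Position output and the emit() helper are names I've made up:

// hypothetical helper: set the output register, then eject a vertex
void emit(vec4 p)
{
    gl_Position = p;    // assumed output register, a la vertex shaders
    gl_EjectVertex();
}

void main(void)
{
    // midpoints of the three edges of the incoming triangle
    vec4 m01 = (v0 + v1) * 0.5;
    vec4 m12 = (v1 + v2) * 0.5;
    vec4 m20 = (v2 + v0) * 0.5;

    gl_Begin(GL_TRIANGLES);            // classic 1:4 subdivision
    emit(v0);  emit(m01); emit(m20);   // three corner triangles
    emit(m01); emit(v1);  emit(m12);
    emit(m20); emit(m12); emit(v2);
    emit(m01); emit(m12); emit(m20);   // centre triangle
    gl_End();
}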
Nothing further beyond that, though... So, is this "programmable primitive processor" THE feature in NV30 that will take the definition of a GPU a generation further? If so, will it be exposed through proprietary extensions again?
Does anyone have any insight on this topic ?

BTW, P10 is quoted as supporting "N-patches, Bezier, B-Splines, NURBS"... makes one wonder what exactly that means in terms of usable features at the moment. I would expect at least both of the DX8 caps, D3DDEVCAPS_RTPATCHES and D3DDEVCAPS_NPATCHES, to be present...
 
Again, this just sounds as though people are going down a similar route to the one 3Dlabs have already taken - building general purpose processing units rather than fixed function. This would just be the natural evolution of hardware 3D processing.

As to what P10 supports at the moment, that will largely depend on how much development the drivers have had and how much importance developers are telling them it has.

And P10 also supports creation and deletion of vertices, as does OpenGL2.0 AFAIK.
 
I wouldn't say that others are following in the P10's footsteps. Let's be realistic. Whatever ATI and NVidia are delivering has been in the pipeline for a long time.

It also doesn't sound like the "primitive processor" is replacing the vertex shader or pixel shader with a generic unit. Instead, it will probably be another stage with its own limitations, e.g.

Primitive Processor -> Vertex Shader (DX9) -> Pixel Shader (DX9)

Layered architectures are good. Trying to unify everything into a super-generic processor that can do anything just takes us back to software rendering, and will probably kill performance.
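
To put it another way, here's how I'd sketch the contracts of each stage in C (all names and types are mine, not from any spec). The point is that each stage is restricted in its own way:

#include <stddef.h>

typedef struct { float x, y, z, w; } Vec4;

/* primitive processor: n vertices in, m vertices out; the only stage
   allowed to create or destroy geometry */
size_t primitive_stage(const Vec4 *in, size_t n, Vec4 *out, size_t max_out);

/* vertex shader: strictly one vertex in, one vertex out,
   with no access to neighbouring vertices */
Vec4 vertex_stage(Vec4 v);

/* pixel shader: one rasterized fragment in, one colour out */
Vec4 pixel_stage(Vec4 fragment);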
 
I wouldn't say that others are following in the P10's footsteps. Let's be realistic.

:rolleyes: I didn't say that; I don't know why you are. I'm saying 3Dlabs got there first. I've long since said, after having learnt of P10's abilities, that more general pipelines and programmability will be the order of the day as 3D graphics progresses, which is why I also said I think this is the natural evolution of 3D processing - this is just further evidence for that belief, IMO.
 
DaveBaumann said:
building general purpose processing units rather than fixed function. This would just be the natural evolution of hardware 3D processing.

In a way, "General purpose processing units" != "hardware 3D processing". Kinda opposites, in a way. Either you have general purpose units, or dedicated 3D processing.

Or is this nitpicking? I'm not sure. It's just that the term "general purpose" seemed too broad, paired with "hardware 3D processing"; dunno. Am I nitpicking too much? ;-)
 
Or is this nitpicking? I'm not sure. It's just that the term "general purpose" seemed too broad, paired with "hardware 3D processing"; dunno. Am I nitpicking too much?

General purpose, to a degree - and possibly becoming more general purpose as time goes by; i.e. SA has already discussed bringing the vertex and pixel shaders into one unit at some point. However, although the P10's vertex array deals with vertices, it's not strictly limited to just the vertex shader elements, but handles HOS tessellation etc. as well - and it's not necessarily limited to just that either, as 3Dlabs have pointed out.
 
We'll see which is the better approach once the NV30/R300 are benchmarked against the P10 running equivalently complicated scenes.

I'm going to argue that the "pure"/"functional" (in computer science lingo) approach of vertex shaders, wherein a shader can only operate on one vertex at a time, streamed into the unit, will scale far better than any general purpose processor for the foreseeable future.

Once you introduce inter-vertex dependencies into the shader programming model, you kill opportunities for parallelism. It becomes increasingly hard to keep your 4 or 8 vertex shader units at capacity, since you now have to untangle the dependencies and try to schedule parallel execution. Introducing unrestricted branching will do the same thing. Then we're back to CPU-style optimizations: branch prediction, speculative execution, the whole shebang that AMD and Intel have been struggling with for years. The only difference is that it's a vector processor instead of a scalar one.

But we know how general purpose vector processors perform; we only need to look at the Emotion Engine or the Cray architecture. In other words, it looks to be a performance killer for the foreseeable future.
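
A toy C example of what I mean (illustrative only, all names mine). The first loop is the current vertex shader model: out[i] depends only on in[i], so every iteration can go to a different shader unit. The second has an inter-vertex dependency, and the chain forces serial execution no matter how many units you throw at it:

#include <stddef.h>

typedef struct { float x, y, z; } Vertex;

/* "pure" model: each output vertex depends only on its own input */
void transform_independent(const Vertex *in, Vertex *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i].x = in[i].x * 2.0f;   /* stand-in for a real transform */
        out[i].y = in[i].y * 2.0f;
        out[i].z = in[i].z * 2.0f;
    }
}

/* inter-vertex dependency: out[i] needs out[i-1] first */
void transform_dependent(const Vertex *in, Vertex *out, size_t n)
{
    if (n == 0) return;
    out[0] = in[0];
    for (size_t i = 1; i < n; i++) {
        out[i].x = in[i].x + out[i-1].x;   /* serial dependency chain */
        out[i].y = in[i].y + out[i-1].y;
        out[i].z = in[i].z + out[i-1].z;
    }
}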
 
I agree. Vertex and pixel shaders should not be completely flexible like general CPUs, or you go back to square one. They should complement the CPU, being very fast at doing things that don't run into branch prediction problems, or latency issues from arbitrary memory access.
 
Here's a possible way to make such interdependencies much more viable in the future (certainly not right now):

Imagine, for a moment, a scene graph that operates at the driver level. It would be able to manage the scene as well as talk to individual registers on the graphics card.

Such an implementation could conceivably allow only one object per vertex pipeline. While this would be hard on the pixel pipelines, that load could be handled by a very efficient memory controller, such as a crossbar controller combined with large caches.

There may also be possibilities in the much nearer future for API functions that give the graphics card "hints" to help the drivers optimize for certain situations.

In the end, I think it's possible, with developer help, for very interdependent pipelines to work quite efficiently, but I think that's still a ways off.
 

Mmkay... what exactly are they talking about here? A programmable primitive processor as in what, exactly? Displacement mapping (though that shouldn't have anything to do with tessellation and LOD)? NURBS?


I believe they're talking about adding an additional programmable unit before the vertex shader, as DemoCoder pointed out. Essentially taking TruForm/N-Patches to the next level, by giving the programmer the flexibility to create his own algorithms for adding vertices for tessellation and deleting vertices for custom LOD schemes. This can be based on normals, like PN-Triangles, or whatever other method is desired. Displacement mapping is something different, which would likely be applied at the vertex shader stage.
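
For illustration only, a C sketch of the normal-based flavour of this (my own simplification, not ATI's actual PN-Triangles math, which builds cubic Bezier control points): a new edge vertex is pulled from the straight midpoint toward the endpoints' tangent planes, so the surface bulges where the normals disagree:

typedef struct { float p[3]; float n[3]; } Vtx;  /* position + unit normal */

static float dot3(const float *a, const float *b)
{
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2];
}

/* hypothetical helper: new vertex inserted on the edge (a, b) */
static void curved_midpoint(const Vtx *a, const Vtx *b, Vtx *out)
{
    float m[3], da[3], db[3];
    for (int i = 0; i < 3; i++) {
        m[i]  = 0.5f * (a->p[i] + b->p[i]);   /* straight midpoint */
        da[i] = m[i] - a->p[i];
        db[i] = m[i] - b->p[i];
    }
    /* signed distances from the midpoint to each tangent plane */
    float wa = dot3(da, a->n);
    float wb = dot3(db, b->n);

    for (int i = 0; i < 3; i++) {
        /* average of the midpoint's projections onto both planes */
        out->p[i] = m[i] - 0.5f * (wa * a->n[i] + wb * b->n[i]);
        /* interpolated normal (renormalize before shading with it) */
        out->n[i] = 0.5f * (a->n[i] + b->n[i]);
    }
}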
 
DemoCoder said:
It also doesn't sound like the "primitive processor" is replacing the vertex shader or pixel shader with a generic unit. Instead, it will probably be another stage with its own limitations, e.g.

Primitive Processor -> Vertex Shader (DX9) -> Pixel Shader (DX9)

Correct.
 
Is this in any way related to the NVAutoshaper that is supposed to pre-cache graphics primitives on the card to speed up rendering?
 
The only difference is that it's a vector processor instead of a scalar one.

Not if you're 3Dlabs, since they are already scalar based.

However, if NVIDIA are going to lob another unit on the front for this type of work, then it's an indication that they are likely to stick with vector units for vertex processing for some time, and that they want a slightly more flexible unit for primitives. I seem to remember some ARB notes that also pointed to something similar.
 
I think the 3DLabs argument is dumb (at least as presented on 3D sites). 99.9% of the time you are going to be processing vec3/vec4, whether it's geometry vectors (vertices, normals, etc.) or color-space vectors (RGB, etc.). Why waste silicon on, or bother with, the less common case where you are working with a scalar or a 2D coordinate vector? The scalar performance just isn't that important, and in any case, you can always collate a bunch of scalar ops into a vec4 and do them together to hide the inefficiency.

We don't know the underlying way NVidia or ATI vertex shaders work; it could be a bunch of 32-bit FMACs that are grouped together as needed by the device driver, which compiles the vertex shader to the underlying hardware. Either way, I would hope that they are optimized to handle vec3 and vec4 operations.
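
To illustrate the kind of grouping I mean (a sketch only, names are mine): a vec4 multiply-add decomposed into four scalar FMACs. If the hardware really were a pool of scalar FMAC units, the driver could issue the four lanes to four units in parallel, and a lone scalar op would leave three of them idle:

typedef struct { float x, y, z, w; } vec4;

/* r = a * b + c, lane by lane */
static vec4 mad4(vec4 a, vec4 b, vec4 c)
{
    vec4 r;
    r.x = a.x * b.x + c.x;   /* scalar FMAC 0 */
    r.y = a.y * b.y + c.y;   /* scalar FMAC 1 */
    r.z = a.z * b.z + c.z;   /* scalar FMAC 2 */
    r.w = a.w * b.w + c.w;   /* scalar FMAC 3 */
    return r;
}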


The proof will be in the pudding. The 256-bit memory bus P10 vs the R300/NV30. Let the benchmarks roll.
 
DemoCoder said:
I think the 3DLabs argument is dumb (at least as presented on 3D sites). 99.9% of the time you are going to be processing vec3/vec4, whether it's geometry vectors (vertices, normals, etc.) or color-space vectors (RGB, etc.). Why waste silicon on, or bother with, the less common case where you are working with a scalar or a 2D coordinate vector? The scalar performance just isn't that important, and in any case, you can always collate a bunch of scalar ops into a vec4 and do them together to hide the inefficiency.

DemoCoder, you said it yourself: the majority of the data is vec3 and vec4 - and passing a vec3 through a vec4 pipeline wastes 25% of the resources, since one of the four lanes sits idle.
 
darkblu said:
DemoCoder, you said it yourself: the majority of the data is vec3 and vec4 - and passing a vec3 through a vec4 pipeline wastes 25% of the resources, since one of the four lanes sits idle.

3Dlabs seem to think that up to 30% of the instructions in a standard OpenGL transformation pipe may not be vec4 either; to say nothing of what other operations will be carried out with increased programmability.
 