Hello, first post - IGP combo

***Warning: fairly long post***

First off I wanted to say hello... Hello, this is my first post, and this might be the wrong forum for it.

Now to the nitty gritty: I have a simple idea that should be simple to implement, if it is possible at all; its possibility can only be determined by the likes of ATI, NVIDIA, and Intel. The idea is just that a lot of chipsets have integrated video, so why not enable certain functionality of the IGP to complement a graphics card when one is installed?

Case one (a good background read is NVIDIA's paper batchbatchbatch.pdf)

The following is written from the perspective of DirectX, as per my looking into making a 3D engine. I'll skip the background and jump to the need to batch triangles: depending on processor speed and how much CPU you want left over for 'other stuff' (AI, physics...), you choose a target number of batches per frame (DrawPrimitive() calls) to aim for. With that chosen (according to some PDFs from both NVIDIA and ATI, 200 is a good number), you then have to choose a method for fitting N 'spaces' into M (again, let's say 200) draw calls.

I won't talk about texture state changes and their associated speedups, but skip to transform state changes. You have the option of pre-transforming vertices on the CPU (the 'default way') or, as per the above-mentioned paper, using one-bone matrix palette skinning in a vertex shader for simple (non-skeletal) meshes, which is faster, has less CPU overhead, and allows more triangles per model since the GPU T&L engine is so much faster than the CPU.

Now at this point all is fine and dandy, until you decide on a lighting/shadowing method, which for the most part will have you process your geometry again once for each light, or once for every four lights (best case). This isn't so much a problem for non-skeletal meshes, but when skeletal meshes with skinning come into play, bam: all of a sudden not only do you need more draw calls, even worse, you can become triangle/vertex-shader limited. So you are back to transforming vertices (into world space) on the CPU once (speed is a concern, since you will be performing skinning, which is a lot more expensive on the CPU) and then using the GPU to apply the view/projection matrix for the camera and each light.

Well, here is where that trusty integrated GF4 MX or Radeon 9000 core could come in real handy. If it is possible to use the integrated T&L units to process the vertices into another vertex buffer (in world space), then you free the CPU to do other things and reduce the load on your primary GPU. This should allow for a nice frame-rate boost and more alive worlds, by allowing for the use of less static geometry.
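To make that concrete, here is a minimal sketch of the 'skin once, then reuse' loop being described (illustrative only: Mesh, BonePalette and SkinToWorldSpace are made-up stand-ins, not real engine or DirectX types; only the device calls are actual D3D9):

    #include <d3d9.h>
    #include <d3dx9.h>

    // Sketch only. The point is the shape of the loop: skin into world space ONCE,
    // then run only a cheap view/projection pass for the camera and for each light.
    struct Mesh        { UINT vertexStride; UINT triangleCount; };   // hypothetical
    struct BonePalette;                                              // hypothetical
    void SkinToWorldSpace(const Mesh&, const BonePalette&,
                          IDirect3DVertexBuffer9*);                  // hypothetical helper

    struct Viewpoint { D3DXMATRIX viewProj; };                       // camera plus each light

    void RenderFrame(IDirect3DDevice9* device,
                     const Mesh& mesh, const BonePalette& bones,
                     IDirect3DVertexBuffer9* worldSpaceVB,
                     const Viewpoint* viewpoints, int numViewpoints)
    {
        // The expensive part, done once per frame; this is the work the post
        // proposes handing to the IGP's T&L/vertex shader unit instead of the CPU.
        SkinToWorldSpace(mesh, bones, worldSpaceVB);

        // Per camera/light: only a trivial transform of the already-skinned vertices.
        for (int i = 0; i < numViewpoints; ++i)
        {
            device->SetVertexShaderConstantF(0, (const float*)&viewpoints[i].viewProj, 4);
            device->SetStreamSource(0, worldSpaceVB, 0, mesh.vertexStride);
            device->DrawPrimitive(D3DPT_TRIANGLELIST, 0, mesh.triangleCount);
        }
    }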

I was going to do cases two and three, but this is a long post as it is and my brain hurts. So what do you think? (Case 2: shadow maps on the integrated chip, or any render-to-texture for that matter. Case 3: I forgot; I'll post it when I remember.)
 
I'm not sure why you say you get more draw calls when you do skinning? Basically you'd just need to set more than one world matrix for each draw call, but that's it?
Also, why would you need the CPU for skinning? Vertex shaders should do fine?
Or are you working from the assumption that even the 'good' 3d card doesn't have shaders?

Other than that, the problem is of course that I don't know of any integrated chipsets that would allow writing out the results of the T&L stage to memory. And even if they could, you'd probably have to copy from one card to the other via the CPU, which is still expensive, especially when you're not using PCI-e.

Besides, I'm not very impressed with integrated chipset performance...
I have a laptop with a Radeon IGP340M myself... Which has a decent T&L engine with vs1.1
Anyway, I sometimes test some of my code on it, to see how compatible it is, and how well it performs...
Recently I tested a vertexshader on it, with a simple model of about 6000 vertices... The laptop got around 170 fps. My 9600Pro got about 1100 fps on the same model. So, I would say that the onboard chipset is so slow, that the speed increase would be negligible against a decent card (imagine a 9800 or even X800, those are much faster than my 9600Pro still).
And this is probably one of the better chipsets too. Intel doesn't have hardware T&L at all, and onboard NV stuff seems to be considerably slower than Radeons as well.

So my guess is that it would be more trouble than it's worth, even if it were possible.
 
As I understand it, what you're trying to say is: allow the results of a CPU vertex shader to be easily piped into the T&L unit for further processing.

It was a good idea a couple of years back (I say it was a good idea because I thought about it, too). It's not that relevant these days, since currently available hardware either supports vertex shaders, or does both shaders and T&L in software (for integrated chipsets - including ATI ones). A lot of modern hardware even has T&L implemented via shaders. Developers want to get as much work done on the GPU, and as GPUs increase in sophistication that's the way things will go.

While there's still enough hardware around that could benefit from a hybrid approach, doing it in an organised manner (that is, as part of an API and drivers) is unlikely, because driver work is done mainly for newer chips, and because even if the companies involved decided to do such a thing, even faster, more advanced chips would be available by the time there were any results, and there'd be less benefit by then.
 
No, what I'm saying is that when doing lighting, the worst case is that you have to process geometry once per light, in addition to once from the viewpoint of the camera. When doing skinning of any sort, it is more efficient to get the vertices themselves into world space ONCE (this would be a relatively long shader, let's call it SL) and then process the world-space vertices from the perspective of each camera/light (an extremely short shader, let's call it SS). If you don't process the vertices once up front, you wind up doing (SL+SS)*N shader work (where N is the number of cameras and lights), whereas if you process the vertices once ahead of time you do SL+(SS*N) shader work.

All I'm saying is that if one could do the long shader on the T&L unit of the integrated graphics, you save the CPU a lot of work, work that could be spent doing other things, i.e. AI, physics, better z-sorting for all objects... And since essentially all you are trying to do is the skinning, even a fixed-function T&L unit in an integrated GF4 MX would help. And since the output could be streamed into a vertex buffer set up in the installed video card's RAM, the integrated T&L wouldn't be as bogged down sharing bandwidth with main RAM. So the essence of what I'm saying is: if I have a T&L unit or vertex shader in my integrated graphics, why not use it to enhance my gaming experience? In ATI's case it could easily offset any gain NVIDIA got from enabling 'geometry instancing'. In addition, the installed video card's driver wouldn't even need to know about it, only the application.
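A tiny self-contained example of that arithmetic, with made-up illustrative costs rather than measured numbers:

    #include <cstdio>

    int main()
    {
        // Illustrative numbers only (NOT benchmarks): assume the skinning shader SL
        // costs 100 "units" of vertex work, the view/projection-only shader SS
        // costs 10, and N = 1 camera + 3 lights = 4 passes.
        const int SL = 100, SS = 10, N = 4;

        int naive      = (SL + SS) * N;   // re-skin the mesh for every pass
        int preskinned = SL + SS * N;     // skin once, then a cheap pass per viewpoint

        printf("re-skin every pass : %d units\n", naive);       // 440
        printf("skin once up front : %d units\n", preskinned);  // 140
        return 0;
    }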
 
I still don't get what you mean, really.
As I said before, a good graphics card is a lot faster than integrated chipsets or CPU. It is probably faster to execute the long shader N times on the fast GPU than once on the CPU, or a slow GPU. N will never be very large anyway (maybe 3-4 in most cases?).
And doing it all in a vertexshader will mean that the CPU is completely free for AI and other stuff.

Also, hardware such as gf4mx has very bad skinning options, only 2 bones at a time, non-indexed, if I'm not mistaken. This is virtually useless for most animation today. Vertexshaders can do much more, and also more efficiently (indexing will avoid splitting up your meshes into parts with only 2 bones affecting them, and rendering each mesh separately).
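For reference, indexed palette skinning amounts to the following per-vertex math (a minimal CPU sketch with made-up types; a vertex shader version reads the palette from constant registers but does the same blend):

    // Minimal illustrative types, not from any real engine or API.
    struct Vec3   { float x, y, z; };
    struct Mat4x3 { float m[3][4]; };            // bone matrix (rotation + translation)

    struct SkinnedVertex
    {
        Vec3          pos;
        unsigned char boneIndex[4];              // indices into the bone palette
        float         weight[4];                 // blend weights, summing to 1
    };

    static Vec3 Transform(const Mat4x3& m, const Vec3& v)
    {
        Vec3 r;
        r.x = m.m[0][0]*v.x + m.m[0][1]*v.y + m.m[0][2]*v.z + m.m[0][3];
        r.y = m.m[1][0]*v.x + m.m[1][1]*v.y + m.m[1][2]*v.z + m.m[1][3];
        r.z = m.m[2][0]*v.x + m.m[2][1]*v.y + m.m[2][2]*v.z + m.m[2][3];
        return r;
    }

    // Indexed 4-bone skinning: because each vertex carries its own bone indices,
    // the whole mesh can reference one big palette and be drawn in a single call,
    // instead of being split into 2-bone chunks as non-indexed hardware requires.
    Vec3 SkinVertex(const SkinnedVertex& v, const Mat4x3* palette)
    {
        Vec3 out = { 0, 0, 0 };
        for (int i = 0; i < 4; ++i)
        {
            Vec3 t = Transform(palette[v.boneIndex[i]], v.pos);
            out.x += v.weight[i] * t.x;
            out.y += v.weight[i] * t.y;
            out.z += v.weight[i] * t.z;
        }
        return out;
    }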
 
"Also, hardware such as gf4mx has very bad skinning options, only 2 bones at a time, non-indexed, if I'm not mistaken. This is virtually useless for most animation today. Vertexshaders can do much more, and also more efficiently (indexing will avoid splitting up your meshes into parts with only 2 bones affecting them, and rendering each mesh separately)."

Agreed, but in truth the post was really geared towards the ATI northbridge; the technique could be generalized to any IGP with integrated T&L or vertex shaders.

As far as a good graphics card being much faster than integrated chipsets, I'm not asking the IGP to render anything; I'm asking it to act as a geometry transformation coprocessor.

The number of lights in a scene will vary, and the simple fact of the matter is that light maps pretty much suck. When you start trying to do huge levels, all those light maps add up to a lot of megabytes and are static to boot. Why limit your content creators to four lights? Beyond that, there would be no real need for these larger floating-point colors; if all you had to deal with were four lights, you wouldn't be able to saturate. Besides, what about outdoor scenes, what about that RPG with fire arrows, or the shooter at night where the squad of eight people hunting you down all have flashlights attached to their guns... street lights?
 
Agreed, but in truth the post was really geared towards the ATI northbridge; the technique could be generalized to any IGP with integrated T&L or vertex shaders.

Well, you brought up gf4mx and skinning, so...

As far as a good graphics card being much faster than integrated chipsets, I'm not asking the IGP to render anything; I'm asking it to act as a geometry transformation coprocessor.

Slow is slow, in my book, whether it renders or not. As I said before, my Radeon IGP340M gets about 16% of the performance of my Radeon 9600Pro. That's not a whole lot. Even if it were possible to use the full power of the IGP in addition to my 3d card, without any extra overhead, I'd only get a 16% boost. Rather insignificant. I'd get more boost by just buying the next higher model videocard. Besides, my PC doesn't even have an integrated chipset... Not many do, I think. Most boards either have no integrated chipset at all, or they don't have an AGP slot to put a 3d card in.
So the way I see it, you can theoretically obtain an insignificant performance-boost on a very small subset of all computer systems.
Even if it did work, I doubt it's worth looking into.

The number of lights in a scene will vary, and the simple fact of the matter is that light maps pretty much suck. When you start trying to do huge levels, all those light maps add up to a lot of megabytes and are static to boot. Why limit your content creators to four lights?

I don't think anyone likes the limits that still exist in graphics, but we just have to work with them until more power becomes available.
I doubt that your idea will magically solve the power-shortage that we have anyway.

Beyond that, there would be no real need for these larger floating-point colors; if all you had to deal with were four lights, you wouldn't be able to saturate.

I think saturation is not at all related to the number of lights, but solely to the intensity of those lights. But that could just be me. Even with a single light source you could create lovely HDR lighting (the sun, for example).
 
Scali said:
As far as a good graphics card being much faster than integrated chipsets, I'm not asking the IGP to render anything; I'm asking it to act as a geometry transformation coprocessor.

Slow is slow, in my book, whether it renders or not. As I said before, my Radeon IGP340M gets about 16% of the performance of my Radeon 9600Pro. That's not a whole lot. Even if it were possible to use the full power of the IGP in addition to my 3d card, without any extra overhead, I'd only get a 16% boost. Rather insignificant. I'd get more boost by just buying the next higher model videocard. Besides, my PC doesn't even have an integrated chipset... Not many do, I think. Most boards either have no integrated chipset at all, or they don't have an AGP slot to put a 3d card in.
So the way I see it, you can theoretically obtain an insignificant performance-boost on a very small subset of all computer systems.
Even if it did work, I doubt it's worth looking into.

Why are you comparing the frame rate of the IGP to the graphics card? If you use the IGP as a geometry coprocessor:
1. The IGP's VS unit is still faster than the CPU at skinning.
2. It would offload work from the CPU.
3. It would give you more instruction slots, allowing more effects in the VS on the installed graphics card.
4. Just because the overall performance of the IGP is 16% of your 9600 Pro doesn't mean its geometry transformation capability is also only 16% of your 9600 Pro's.

Besides, not everyone can afford or wants to spend 500 dollars on a video card. And even if the boost in performance is only 16%, that isn't a trivial amount, especially for something someone already has in their PC and has essentially paid for. A lot of chipsets have integrated graphics, and the performance boost isn't really theoretical; it WILL help. It's just a matter of whether it's possible and whether hardware manufacturers would want to expose said functionality.
 
Why are you comparing the frame rate of the IGP to the graphics card?

It's the only way I know of to judge T&L performance. I don't know how to make it do T&L without actually rendering the triangles.

As for your points:

1. I agree; I think the CPU is about 6 times as slow as the IGP (I got 30 fps with software vertex processing enabled).
However, I didn't compare against the CPU, I compared against the 'fast' GPU, and it is significantly slower than that GPU.
2. Agreed, although it's more workload for the CPU to set up two rendering devices than just one, so it's still better to have one vertex shader handle everything, in terms of CPU usage, I suppose.
3. True, although this is purely theoretical, under the assumption that such a system could actually improve performance. Otherwise, you might as well do more render passes and do the extra effects that way, or perhaps with some help from the CPU.
4. Since my program was far from fillrate-limited, I don't think it's far off. Also, I wouldn't be surprised if the vertex shader alone is, relatively speaking, even slower than the rendering process as a whole.

Besides, not everyone can afford or wants to spend 500 dollars on a video card.

That's not the point. As I said, I have a Radeon 9600Pro, which is far from 500 dollars. I paid 100 euros for it. It's one of the cheapest SM2.0 cards available at the moment, and already it totally makes mincemeat out of the IGP. And this IGP is actually relatively good: GF2Go is much slower, and let's not even get started on Intel Extreme. The only stuff that comes close is the high-end GeForce FX or Radeon 9x00 mobile stuff, but you can't find those on regular motherboards, and they're very expensive.
The stuff on regular motherboards today is mostly around the same level as my IGP340M, or worse... meaning CPU T&L/shaders.
 
Guys, the IGP doesn't have hardware T&L or shaders. If this entire discussion is based on the premise that it does, then just forget it and move on. The discussion of what's better to do on the CPU or the IGP's T&L is just silly when the IGP's T&L is implemented on the CPU.

Secondly, regarding lighting, using T&L lighting is out of vogue. Everything is moving to per pixel lighting, and the rest can be done with spherical harmonics, whose calculation speed isn't affected by the number of lights (and although you don't get the accuracy of separate lighting calculations, that may actually be a good thing for realism).

Thirdly, as I said before, you're talking about a future chip. Yes, in the future it's planned that chips will be able to write vertex shader results back into a buffer for further processing. That might indeed be useful for skinning. But it isn't relevant to current chips, because the hardware can't do it.

But to go back to the first and most important point: there's no hardware vertex processing on IGPs!
 
Guys, the IGP doesn't have hardware T&L or shaders.

Some of the better mobile chipsets do. I'm not sure about desktop PCs though. What about nForce3 for example?

Secondly, regarding lighting, using T&L lighting is out of vogue. Everything is moving to per pixel lighting

I'm not sure if I get that one. Doing per-pixel lighting still requires the vertex setup, and therefore still the skinning. So I don't see how this changes the situation.
 
"Guys, the IGP doesn't have hardware T&L or shaders."

ati 9000 igp VS 1.4
ati 9100 igp VS 1.4
nvidia nforce/1/2/3 dx7 T&L

"Secondly, regarding lighting, using T&L lighting is out of vogue. Everything is moving to per pixel lighting"

The primary reason anyone would even do this is to offload geometry transformation, especially skinning, from the CPU. The reasons for not transforming the vertices on your primary video card are in an above post, but the basic gist is that you save yourself from doing the same work over and over again, since you would have to skin the same model every time you used it as a light occluder. Another technique that would benefit is the pre-transformation of certain vertices that don't need to be updated every frame, but only every 5-10 frames. This is normally done on the CPU, but could be accelerated if the T&L on the IGP could be set up to process the vertices for it.

"2. Agreed, although it's more workload for the CPU to set up 2 rendering devices than just one, so it's still better to have one vertexshader handle all, in terms of CPU-usage, I suppose."

I didn't say set up two rendering devices; I didn't say anything about the exact implementation. If the functionality were possible, it could be implemented/exposed in a variety of ways.

"3. True, although this is purely theoretical, under the assumption that such a system could actually improve performance. Otherwise, you might aswell do more render-passes, and do the extra effects that way, or perhaps with some help from the CPU."

I'm saying that in a case where it is more effective to pre-transform skinned meshes into world space before they hit the main GPU, and the developer is already going to do this, the application will receive a performance boost by offloading the work to the IGP's VS or T&L unit. Not only would the IGP be faster at doing the skinning, it would free the CPU to do other things, including more aggressive batching and sorting, and reduce the setup time for collision and physics before rendering of the current frame begins. Giving yourself more VS instruction slots might enable you to do things in one pass instead of two. So I really don't see what is so theoretical about the performance boost; there are no negatives.
 
ati 9000 igp VS 1.4
ati 9100 igp VS 1.4

VS 1.4 doesn't exist; I'm sure you mean PS 1.4, which is not capable of vertex processing in any way (fixed-point pixel shaders).

I didn't say set up two rendering devices; I didn't say anything about the exact implementation. If the functionality were possible, it could be implemented/exposed in a variety of ways.

However you are going to expose it, you will have two physical rendering devices, with two sets of execution units and two sets of registers. That means the CPU will have to send data to the memory and registers of both devices and send commands to both units. Whether you wrap it up into a single API or not doesn't change much about that.

So I really don't see what is so theoretical about the performance boost; there are no negatives.

It's about as theoretical as doing arithmetic with the GPU and reading back the framebuffer. On AGP-based devices, this is excruciatingly slow.
Take for example the Goldfeather CSG algorithm... In theory the idea of zbuffer merging by glReadPixels()/glDrawPixels() sounds very nice, but in practice, it's extremely slow on PCs. Your idea sounds very similar to me. In theory it would be possible, but in practice it may be more overhead than it's worth to get the data from one GPU to the other.
And the other theoretical aspect, as pointed out many times before: there aren't many IGP chipsets with hardware vertex shaders (fixed function is pretty much useless for skinning, as said before), and even if there were, it is impossible to make them output data after the T&L stage, since the vertex shaders are hardwired to the rasterizer. Some GPUs have very limited render-to-vertexbuffer support (like the Xbox), but as far as I know this is not exposed on PC anyway, and isn't supported by any IGPs either.
So as it stands, it is purely a theoretical idea: it cannot be implemented in practice without redesigning at least the IGP itself (which is not going to happen), and even if it were possible, the gain would probably be small, because of the immense performance difference between IGPs and GPUs, possible overhead issues for transferring the data from IGP to GPU, and extra setup costs.
Plenty of negatives, which have been mentioned a few times already; you just keep ignoring them. And any positives are purely theoretical.
 
The Radeon 9100 IGP lacks a hardware vertex shader unit. Vertex shaders and fixed-function T&L are executed on the host CPU's floating-point vector processor (SSE).
 
Scali said:
So as it stands, it is purely a theoretical idea: it cannot be implemented in practice without redesigning at least the IGP itself (which is not going to happen), and even if it were possible, the gain would probably be small, because of the immense performance difference between IGPs and GPUs, possible overhead issues for transferring the data from IGP to GPU, and extra setup costs.

It will likely be possible in a few years, since writing back vertex processing results will be possible with future chips, and even IGPs will have some vertex processing power once pixel and vertex shaders are unified. Also, transferring data back to the CPU should be faster with PCIe. That said, things will be different enough by then that the CPU/GPU thinking will likely be different from what it is today.
 
My mistake, I meant vertex shader 1.1; I had been reading some PDFs on PS 1.4 and the number must've stuck in my head. I looked up a few reviews of the 9100 IGP and some say 'hardware support for vertex and pixel shaders', yet others say the vertex shader unit was removed, like you said, akira888. Again, my mistake; I should've checked my information more thoroughly, seeing as the sites that claimed hardware support didn't give a number for how many vertex shader units it had. Anyhow, I still think it's a good idea, provided the IGP has some decent vertex shader units to handle skinning.

"However you are going to expose it, you will have to physical rendering devices, with two sets of execution units, and two sets of registers. Meaning that the CPU will have to set data to the memory and registers of both devices and send commands to both units. Whether you wrap it up into a single API or not doesn't change much about that."

Let's say you expose the interface as a subset of DirectX, but not as part of DirectX: take away all render states that have to do with pixel shaders and texturing. If you make the spec VS 1.x/2.x only, with no fixed function, you get rid of half of the remaining render states, including clipping, since you won't want to clip at all. What's left over is setting up your vertex shader inputs, setting the constant registers, setting which vertex shader program you'd like to run, and finally where and in what format to output the transformed vertices. In addition, add the ability to acquire the hardware in 'exclusive' mode, and a resource lock so that when not in exclusive mode the app could have a thread wait to acquire the device. All in all it seems pretty lightweight to me, especially since the functionality basically amounts to DirectX 9's ProcessVertices() call.
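Purely to illustrate the shape of such a stripped-down interface, here is a sketch; every type and method name below is hypothetical, nothing here is an existing API:

    #include <cstddef>

    // Hypothetical 'geometry coprocessor' interface, sketched in a D3D-like style.
    // It only illustrates how small the exposed surface could be:
    // shader + constants + input stream + output buffer.
    struct IGeometryCoprocessor
    {
        // Acquire the IGP's vertex pipeline, optionally in exclusive mode.
        virtual bool Acquire(bool exclusive) = 0;
        virtual void Release() = 0;

        // Program state: a vs1.x/2.x-style program and its constant registers.
        virtual void SetVertexProgram(const void* byteCode, size_t size) = 0;
        virtual void SetConstants(unsigned startRegister,
                                  const float* data, unsigned vec4Count) = 0;

        // Data: input vertex stream, and where to write the results
        // (e.g. a locked region of a vertex buffer owned by the installed GPU).
        virtual void SetInputStream(const void* vertices, unsigned count,
                                    unsigned strideBytes) = 0;
        virtual void SetOutput(void* destination, unsigned strideBytes) = 0;

        // Kick off processing; completion is signalled asynchronously
        // (conceptually similar to IDirect3DDevice9::ProcessVertices, but
        // running on the IGP instead of the CPU or the main GPU).
        virtual void ProcessVertices(unsigned firstVertex, unsigned count) = 0;
        virtual void WaitForCompletion() = 0;

        virtual ~IGeometryCoprocessor() {}
    };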

As to your saying 'in theory the idea of z-buffer...': if the IGP hardware is capable of doing vertex shaders and capable of writing the output to vertex buffers, it's already capable of writing to a buffer in a known format. When you set up a vertex buffer for dynamic vertex data it's usually placed in AGP memory, which is just system memory that is already mapped to the GPU's address space using the GART. So just lock the buffer in your app and use the exposed interface to have the IGP process the vertices into said locked VB. When it's done processing the vertices it will trigger an interrupt, and the driver will send a message or some form of IPC to the app signalling that it's done. The app can then unlock the vertex buffer and do whatever it wants with it. Since the installed video card's driver will have already decided on the optimal place for the vertex buffer (assuming you used the proper creation and locking flags), the GPU is unaware of whether the CPU or something else processed the vertices, just that it didn't do it itself. As to the vertex shaders being hardwired to the rasterizer, whether that could be changed was one of the points of the post, i.e. I was hoping someone who worked for ATI or NVIDIA would comment on the possibility of this being implemented as a feature in current or future IGPs.
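A rough sketch of that flow, assuming the hypothetical IGeometryCoprocessor interface from above; the vertex buffer Lock/Unlock calls are real D3D9, everything on the IGP side is made up:

    #include <d3d9.h>

    // Sketch of the proposed flow: the buffer handling is ordinary D3D9,
    // the IGP side uses the hypothetical interface sketched earlier.
    void PreTransformOnIGP(IDirect3DVertexBuffer9* vb,          // created on the main GPU
                           IGeometryCoprocessor*   igp,         // hypothetical
                           const void* sourceVertices,
                           unsigned vertexCount, unsigned strideBytes)
    {
        // 1. Lock the dynamic buffer; with the right creation flags the driver
        //    will have placed it where the main GPU can read it efficiently.
        void* dest = 0;
        if (FAILED(vb->Lock(0, vertexCount * strideBytes, &dest, D3DLOCK_DISCARD)))
            return;

        // 2. Hand the work to the IGP: skin/transform into the locked region.
        igp->SetInputStream(sourceVertices, vertexCount, strideBytes);
        igp->SetOutput(dest, strideBytes);
        igp->ProcessVertices(0, vertexCount);

        // 3. Wait for the driver's completion signal (the interrupt/IPC in the
        //    description above), then unlock so the main GPU can render from it.
        igp->WaitForCompletion();
        vb->Unlock();
    }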
 
if the IGP hardware is capable of doing vertex shaders and capable of writing the output to vertex buffers, it's already capable of writing to a buffer in a known format.

Yes, but:
1) It isn't capable of any of that
2) Normally the IGP writes to its own spot in system memory (I don't think we will ever see onboard chips with dedicated memory in desktops; might as well stick in a complete card, you lose the cost advantage). We don't know if it is possible to write directly to the video RAM of the other card. There are two problems here:
- The bandwidth of the entire PC's memory is reduced because of the shared memory interface.
- The CPU may have to copy the data from system-memory to videoram every frame.

So how fast will it be in practice? Perhaps it's more expensive to copy the data with the CPU every frame than to just let the fast vertex shaders do a bit more work? I think this is highly likely for reasonably detailed geometry.

When you set up a vertex buffer for dynamic vertex data it's usually placed in AGP memory, which is just system memory that is already mapped to the GPU's address space using the GART.

It is? I generally write directly to a buffer in video memory, for performance reasons. D3D likes it that way (D3DPOOL_DEFAULT and D3DUSAGE_DYNAMIC).

Edit:
Just did a small test... A simple ~6000 vertex model rendered at about 800 fps with D3DPOOL_MANAGED, and 135 fps with D3DPOOL_SYSTEMMEM... And that is without actually touching the data; the buffer was created read-only and static. In fact, that is actually slower than rendering the same model on my laptop's IGP... it still gets 175 fps. So I guess system memory is not an option (is this already one of the practical problems that your theory overlooked?).

i.e. I was hoping someone who worked for ATI or NVIDIA would comment on the possibility of this being implemented as a feature in current or future IGPs.

Don't count on that happening anytime soon. Perhaps a few years after regular hardware can render-to-vertexbuffer, when IGPs reach the same level of sophistication, but no sooner. And even then you need to hope for an API modification which will easily allow you to use more than one device, with shared resources. I think it's not going to give any significant gain, and I don't think the API modification will ever happen, because the target audience is too small, and the gain too insignificant.
 
1) It isn't capable of any of that

I think that's pretty much been established, at least going by the publicly available information on the hardware concerned.

2) Normally the IGP writes to its own spot in system memory (I don't think we will ever see onboard chips with dedicated memory in desktops; might as well stick in a complete card, you lose the cost advantage). We don't know if it is possible to write directly to the video RAM of the other card. There are two problems here:

Wow, it's like you ignored the entire sample implementation in my previous post. It will occur transparently to the installed GPU.

- The bandwidth of the entire PC's memory is reduced because of the shared memory interface.
- The CPU may have to copy the data from system-memory to videoram every frame.

If the CPU is going to transform the vertices, it's going to use the same amount of memory bandwidth as the IGP would if it processed the vertices. Again, you ignored the sample implementation: the main GPU is ignorant of the process, you output to a vertex buffer set up on the installed GPU with the proper usage flags. The installed GPU's driver chooses the best place for the vertex buffer, just as it does now. The only problem might be how to deal with lost devices, but that is delving deeper into implementation issues.

So how fast will it be in practice? Perhaps it's more expensive to copy the data with the CPU every frame than to just let the fast vertex shaders do a bit more work? I think this is highly likely for reasonably detailed geometry.

No, because the CPU won't need to copy anything, and if it did, it would have needed to do it anyway. If there aren't many skeletally animated meshes on the screen at once and there's only one light (of course this will vary depending on the exact hardware configuration), then yes, it might be faster to just do it on the installed video card. However, if I'm not mistaken, if you are doing stencil shadows you will need to generate the silhouette on the CPU anyway, so again the transformed vertices need to be accessible to the CPU for reading, which, like you said, is painfully slow over AGP.


It is? I generally write directly to a buffer in video memory, for performance reasons. D3D likes it that way (D3DPOOL_DEFAULT and D3DUSAGE_DYNAMIC).

The driver decides on the optimal location based on the flags; it's not guaranteed to be in a specific type of memory. The driver knows its hardware best and how much of it is available at the time. It might be in AGP memory, it might be in video memory; if you create the buffer with D3DUSAGE_WRITEONLY you have a better chance of ending up in video memory, otherwise AGP memory is the preferred place.
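For reference, the dynamic-buffer path being discussed looks roughly like this (a minimal sketch; the flags are the standard dynamic/write-only hints, and where the buffer actually ends up is, as said, the driver's decision):

    #include <d3d9.h>
    #include <cstring>

    // Sketch: the usual dynamic vertex buffer setup. The driver picks the actual
    // memory type (video or AGP) from these hints; the app never knows for sure.
    IDirect3DVertexBuffer9* CreateDynamicVB(IDirect3DDevice9* device, UINT byteCount)
    {
        IDirect3DVertexBuffer9* vb = 0;
        device->CreateVertexBuffer(byteCount,
                                   D3DUSAGE_DYNAMIC | D3DUSAGE_WRITEONLY,
                                   0,                  // no FVF; a vertex declaration is used
                                   D3DPOOL_DEFAULT, &vb, NULL);
        return vb;
    }

    // Per update: discard the old contents and write only, never read back.
    // In the proposed scheme, this is the region the IGP would fill instead of the CPU.
    void FillDynamicVB(IDirect3DVertexBuffer9* vb, const void* vertices, UINT byteCount)
    {
        void* data = 0;
        if (SUCCEEDED(vb->Lock(0, byteCount, &data, D3DLOCK_DISCARD)))
        {
            memcpy(data, vertices, byteCount);
            vb->Unlock();
        }
    }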

Don't count on that happening anytime soon. Perhaps a few years after regular hardware can render-to-vertexbuffer, when IGPs reach the same level of sophistication, but no sooner. And even then you need to hope for an API modification which will easily allow you to use more than one device, with shared resources. I think it's not going to give any significant gain, and I don't think the API modification will ever happen, because the target audience is too small, and the gain too insignificant.

I just wanted feedback on the idea in general, even if it isn't possible at present; after all, I don't work for a hardware company, so I don't have the information needed to know whether it is possible. The API wouldn't need to be modified; again, the sample implementation is transparent to the main GPU. I don't think the target audience is small; it would benefit high-end video cards as well as low-end ones, depending on the case at hand (stencil shadows, for example).
 
Wow, it's like you ignored the entire sample implementation in my previous post. It will occur transparently to the installed GPU.

Your sample implementation is completely unrealistic given current hardware. I might as well suggest we put more vertex shader units on the GPU and make them faster. That's just as theoretical as your 'implementation', and it would yield even better results!

Again, you ignored the sample implementation: the main GPU is ignorant of the process, you output to a vertex buffer set up on the installed GPU with the proper usage flags.

I didn't ignore it; I pointed out that this is not possible, and may never be possible.

No, because the CPU won't need to copy anything, and if it did, it would have needed to do it anyway.

That is under the assumption that the IGP can render directly into the video memory of the other GPU. Which isn't realistic, as I pointed out earlier. Perhaps you should read my posts more carefully. I feel like I'm only repeating myself.

However, if I'm not mistaken, if you are doing stencil shadows you will need to generate the silhouette on the CPU anyway, so again the transformed vertices need to be accessible to the CPU for reading, which, like you said, is painfully slow over AGP.

Not at all. There are many ways to implement shadow volumes. 3DMark03 generates them entirely through vertex shaders, for example (including skinning). Another possibility is to extrude in object space, so even if the CPU is generating the volumes, it will not need to do a transform.
And if you want transformed vertices, why would you read them over AGP? If it's faster to use the CPU to transform them, then do that.
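A minimal sketch of the object-space trick (illustrative only, using D3DX math helpers): transform the light into the object's space once per object, then extrude the object-space vertices away from it; no per-vertex world transform on the CPU is needed:

    #include <d3dx9math.h>

    // Bring the light into object space: one matrix inverse and one point
    // transform per object, instead of transforming every vertex to world space.
    D3DXVECTOR3 LightInObjectSpace(const D3DXMATRIX& world, const D3DXVECTOR3& lightWorld)
    {
        D3DXMATRIX invWorld;
        D3DXMatrixInverse(&invWorld, NULL, &world);

        D3DXVECTOR3 lightObject;
        D3DXVec3TransformCoord(&lightObject, &lightWorld, &invWorld);
        return lightObject;
    }

    // Extrude an object-space vertex away from the object-space light position;
    // the GPU then applies the usual world/view/projection transform as normal.
    D3DXVECTOR3 Extrude(const D3DXVECTOR3& v, const D3DXVECTOR3& lightObject, float distance)
    {
        D3DXVECTOR3 dir = v - lightObject;
        D3DXVec3Normalize(&dir, &dir);
        return v + dir * distance;
    }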

The driver decides on the optimal location based on the flags; it's not guaranteed to be in a specific type of memory. The driver knows its hardware best and how much of it is available at the time. It might be in AGP memory, it might be in video memory; if you create the buffer with D3DUSAGE_WRITEONLY you have a better chance of ending up in video memory, otherwise AGP memory is the preferred place.

It's nice that you can copy some info from the SDK, but trust me, the resource was created in video memory on the cards I tested with; the performance clearly indicates that.
So I don't really see your point in writing this stuff here anyway.

I just wanted feedback on the idea in general, even if it isn't possible at present

I gave you feedback, you just didn't seem to like it, and kept clinging to your idea, even though it is not possible to implement, and it's arguable whether it would even give any significant gain if it were implementable.

The API wouldn't need to be modified; again, the sample implementation is transparent to the main GPU.

The implementation of the API would need to be modified; that's what I meant.

I don't think the target audience is small; it would benefit high-end video cards as well as low-end ones, depending on the case at hand (stencil shadows, for example).

I think you need to do three things:

1) Get statistics on people owning an IGP, 3d card, or both.
2) Study ways to efficiently render shadows/skinned meshes.
3) Benchmark the speed of various things, like the vertex-processing power of actual IGPs (see how slow software emulation is, or how slow even real shaders can be, on a low budget), and resources in AGP memory.

Then rethink your idea.
 