Performance issues re: multiple vertex streams
MrFloopy
17-Jul-2002, 23:04
Has anyone performed any tests on the relative performance of using multiple vertex streams in Dx8 as opposed to only one?
MDolenc
18-Jul-2002, 17:51
Well I haven't actually measured this but still:
There should be a near-zero performance penalty on GeForce-class hardware (using the latest drivers, that is), and I would assume that this is also true for Radeon cards. However, software vertex processing will suffer badly when you use multiple vertex streams (due to its linear memory access pattern).
So use them when you run in hardware vertex processing, and try to avoid them when you fall back to software (non-hardware-T&L cards, or software vertex shaders on GeForce 1&2 and Radeon).
MrFloopy
18-Jul-2002, 19:02
Thanks, I had hoped that was the case. Since our app will only be running on Dx8 hardware this should be fine then.
Has anyone performed any tests on the relative performance of using multiple vertex streams in Dx8 as opposed to only one?
On GeForces, it depends on how many bytes you're reading in each of the streams. I've heard it claimed that it can actually be faster to use 2 streams if the length of the entry exceeds 32 bytes, although I've never measured this.
There is a read-ahead buffer for the command stream to avoid latency when reading vertex data, so I'd imagine that any additional overhead would be minimal.
MrFloopy
20-Jul-2002, 01:53
I'm hearing conflicting reports all over the place. My initial thought was that there may be little or no overhead; however, thinking about it, would there be some problem with non-sequential memory reads? If the data had to go through the CPU then I can see problems, but I guess if it's a DMA access straight to AGP, maybe this is no issue. Thoughts?
It's possible that you may incur an increase in the number of page breaks, but it's not really very predictable.
IMO this shouldn't be a major design consideration.
Unless you're trying to get that last 10% out of the GPU, it'll most likely make no difference. On a PC, assuming the vertex buffers are in local memory, it is more likely that either the CPU creating the pushbuffer or the AGP bus reading the pushbuffer will bottleneck before the DMA time for the vertices becomes an issue. Even then, it's most likely only an issue if the vertex shader is very short.
MrFloopy
20-Jul-2002, 09:16
I'm guessing that the overall traffic / memory footprint that will be saved will far outweigh any page hits. It will also make the overall design a lot simpler.
jjokiranta
02-Sep-2002, 17:27
Sorry if I am interfering in your discussion, but I think you guys can tell me the answer I need.
I'm writing an aquarium simulation as my first major Direct3D8 application. When I started to test it on other people's computers I noticed that it has a severe bottleneck. When I run it on my Duron@900MHz Matrox G450 system I get about 40-50 fps, and the numbers are just about the same on my friend's Thunderbird@1200MHz GeForce2 GTS system. He can even run it at 1600x1200 with 4xAA and score the same results.
I use D3DXMESHes and their DrawSubset calls to render all my objects. As I read through this thread it came to my mind that those meshes probably all have their own vertex and index buffers. So there must be at least 40 stream switches for each frame. At the moment my application forces software T&L.
How expensive are those stream switches really? Could they cause the bottleneck?
Thank you.
40 should be insignificant.
My guess would be that you're either explicitly CPU bound (all the time is spent in the DrawPrim calls) or implicitly CPU bound (you have a lock on a resource that you later use, causing the GPU to sit and spin waiting on the resource).
If you call Lock on a resource used during the same frame, make sure it is absolutely necessary. Lock causes GPU and CPU synchronisation, which can be extremely expensive if you don't understand how much CPU/GPU time has elapsed at that point in the frame. More often than not, double buffering the resource can avoid the lock.
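The double buffering suggested above can be sketched without any Direct3D calls. This is a minimal CPU-side model, not the poster's code: the class name `DoubleBufferedVertices` and its methods are hypothetical, and plain Python lists stand in for vertex buffers. The idea is that the CPU only ever writes into the copy that was not submitted most recently, so it never needs to wait for rendering that is still in flight.

```python
class DoubleBufferedVertices:
    """Two copies of a CPU-animated vertex array (hypothetical helper).

    The CPU fills one copy while the other, finished on the previous
    frame, may still be in use by the renderer.
    """

    def __init__(self, vertex_count):
        self.buffers = [[0.0] * vertex_count, [0.0] * vertex_count]
        self.write_index = 0  # index of the copy the CPU fills this frame

    def write_buffer(self):
        # Safe to modify: this copy is not the one most recently submitted.
        return self.buffers[self.write_index]

    def read_buffer(self):
        # The most recently completed copy; this is what gets drawn.
        return self.buffers[1 - self.write_index]

    def end_frame(self):
        # Swap roles: the copy just written becomes the one to draw.
        self.write_index = 1 - self.write_index


fish = DoubleBufferedVertices(vertex_count=4)
for frame in range(3):
    target = fish.write_buffer()
    for i in range(len(target)):
        target[i] = float(frame)  # stand-in for the animation result
    fish.end_frame()
    print(frame, fish.read_buffer())
```

The cost is one extra copy of the data in memory; in exchange, the write for frame N never touches the copy that frame N-1 is still drawing from.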
jjokiranta
02-Sep-2002, 18:28
Thank you for your quick reply ERP.
I really hoped that would have been the cause because now I'm again totally out of ideas.
I don't think my program is CPU bound as the refresh rate doesn't increase on a significantly faster system. I have also tried to leave out almost everything except rendering code but fps count still stays the same.
I use resource locks only in my BlendMeshes function that I use to animate my fish. Though it is quite an expensive call, it isn't the cause of my bottleneck.
I use resource locks only in my BlendMeshes function that I use to animate my fish. Though it is quite an expensive call, it isn't the cause of my bottleneck.
Locks can, and more often than not do, end up wasting a lot of CPU time. Here's how.
Let's say you submit a bunch of meshes to be rendered, then you lock the mesh and do work on it. The CPU has to wait until that mesh is no longer in use, which in the worst case means finishing all currently submitted rendering. OK, so now you get the lock, and again in the worst case the GPU just sits there waiting for more primitives while you work on the BlendMesh. Then you submit some more triangles and you finally get some parallelism back.
The biggest causes of performance bottlenecks in DX are massive numbers of calls to DrawPrim, followed by locks of resources used in the same frame. Always double or triple buffer any resource you lock and use in the same frame.
The other thing you need to know is that working on any vertex buffer with the CPU is very slow. This is especially true if you read the vertex buffer, since vertex buffers are usually in write-combined memory, which is uncached and has horrible read behaviour.
If you're not reading the vertex buffer, you should at least add prefetch instructions; properly placed, they can double or more the speed of memory-latency-limited functions.
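To put rough numbers on the worst case described in this post, here is a toy cost model. It assumes one frame consists of a block of CPU work and a block of GPU work, and the millisecond figures are made up for illustration. When a lock forces the two to take turns, their times add; when they overlap fully, the slower one sets the frame time.

```python
def frame_ms_serialized(cpu_ms, gpu_ms):
    # Worst case: the CPU waits for the GPU to drain, then the GPU idles
    # while the CPU works on the locked resource, so the times add up.
    return cpu_ms + gpu_ms


def frame_ms_overlapped(cpu_ms, gpu_ms):
    # Best case: CPU and GPU run in parallel, so the slower side
    # determines the frame time.
    return max(cpu_ms, gpu_ms)


cpu_ms, gpu_ms = 8.0, 12.0  # hypothetical per-frame costs
for name, model in (("serialized", frame_ms_serialized),
                    ("overlapped", frame_ms_overlapped)):
    ms = model(cpu_ms, gpu_ms)
    print(f"{name}: {ms:.1f} ms per frame, {1000.0 / ms:.1f} fps")
```

Real frames fall somewhere between these two extremes, depending on where in the frame the lock happens and how much work is queued at that point.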
First of all, this topic was initially about using multiple vertex streams simultaneously, not about switching between them...
Always double or triple buffer any resource you lock and use in the same frame.
This should be done with dynamic vertex buffers.
They are designed for this kind of operation, and when used correctly, locking them is cheap.
This should be done with dynamic vertex buffers.
They are designed for this kind of operation, and when used correctly, locking them is cheap.
Yes, I'd forgotten that dynamic vertex buffers are double buffered by D3D on the PC; the Xbox version ignores this flag and you have to double buffer explicitly.
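One common reading of "used correctly" for dynamic vertex buffers is the append-then-discard pattern: keep writing new batches into unused space, and ask for a fresh buffer only when the current one is full. The sketch below models just the bookkeeping. The class `DynamicBufferCursor` is hypothetical, and the strings "NOOVERWRITE" and "DISCARD" are stand-ins for the Direct3D lock flags `D3DLOCK_NOOVERWRITE` and `D3DLOCK_DISCARD`; no Direct3D call is made here.

```python
class DynamicBufferCursor:
    """Tracks where the next batch goes in a fixed-size dynamic buffer."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.offset = 0  # first byte not yet written in the current buffer

    def reserve(self, size_bytes):
        """Return (start_offset, lock_flag) for a write of size_bytes."""
        if size_bytes > self.capacity:
            raise ValueError("batch is larger than the whole buffer")
        if self.offset + size_bytes > self.capacity:
            # No room left: request a fresh buffer and restart at zero,
            # leaving the old contents to whatever is still drawing them.
            self.offset = 0
            flag = "DISCARD"
        else:
            # Appending into untouched space: nothing in flight is
            # overwritten, so no synchronisation should be needed.
            flag = "NOOVERWRITE"
        start = self.offset
        self.offset += size_bytes
        return start, flag


cursor = DynamicBufferCursor(capacity_bytes=1024)
for batch in (400, 400, 400):
    print(cursor.reserve(batch))
```

With a 1024-byte buffer and three 400-byte batches, the first two batches append and the third wraps around with a discard.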
Question: what exactly do you guys mean by using multiple vertex streams simultaneously? (Is this just a partitioning of data across multiple streams, i.e. positions in one stream, normals in another, etc.?)
My assumption is that multiple vertex streams are usually used for vertex data that is split up in memory, say
All Positions
All Normals
All TexCoords
versus interleaved data that would be read as
Position Normal Tex
Position Normal Tex
Position Normal Tex
.
.
.
At a hardware level on a NV2? there are just 16 DMA channels that have source address, stride and format fields, and the only associated penalties are to do with memory access patterns and cache behaviour. The driver always has to set up one "stream" for every vertex attribute.
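The "source address, stride and format" description can be made concrete with a small model. Each attribute fetch below is described by a memory block, the offset of its first element, a stride, and a format; the same fetch routine then works for both the split layout and the interleaved layout listed above. The 12/12/8-byte attribute sizes and all names are assumptions chosen for illustration; this models the idea only, not any driver or hardware interface.

```python
import struct

POSITION = struct.Struct("<3f")  # x, y, z: 12 bytes
NORMAL = struct.Struct("<3f")    # nx, ny, nz: 12 bytes
TEXCOORD = struct.Struct("<2f")  # u, v: 8 bytes

vertices = [
    ((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0)),
    ((1.0, 0.0, 0.0), (0.0, 0.0, 1.0), (1.0, 0.0)),
    ((0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0)),
]

# Layout 1: interleaved, one 32-byte record per vertex.
interleaved = b"".join(
    POSITION.pack(*p) + NORMAL.pack(*n) + TEXCOORD.pack(*t)
    for p, n, t in vertices
)

# Layout 2: one tightly packed array per attribute.
positions = b"".join(POSITION.pack(*p) for p, _, _ in vertices)
normals = b"".join(NORMAL.pack(*n) for _, n, _ in vertices)
texcoords = b"".join(TEXCOORD.pack(*t) for _, _, t in vertices)

# One "channel" per attribute: (memory, first offset, stride, format).
interleaved_channels = [
    (interleaved, 0, 32, POSITION),
    (interleaved, 12, 32, NORMAL),
    (interleaved, 24, 32, TEXCOORD),
]
split_channels = [
    (positions, 0, 12, POSITION),
    (normals, 0, 12, NORMAL),
    (texcoords, 0, 8, TEXCOORD),
]


def fetch_vertex(channels, index):
    # Every attribute is read from: first offset + stride * vertex index.
    return tuple(
        fmt.unpack_from(memory, base + stride * index)
        for memory, base, stride, fmt in channels
    )


for i in range(len(vertices)):
    print(fetch_vertex(interleaved_channels, i) == fetch_vertex(split_channels, i))
```

In this model the only difference between the two layouts is the values in the channel table, which matches the point that any penalty would come from memory access patterns rather than from the stream setup itself.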
MrFloopy
01-Oct-2002, 13:49
At a hardware level on a NV2? there are just 16 DMA channels that have source address, stride and format fields, and the only associated penalties are to do with memory access patterns and cache behaviour. The driver always has to set up one "stream" for every vertex attribute.
That's pretty much the question I asked at the beginning. What are the memory access patterns used in multi-channel DMA transfer? I'm getting conflicting answers from the HW vendors.
Some are saying that in DX8 the data is read per channel, a cache line at a time; others say per vertex. Any ideas? You'd think they would know, but I suspect that the developer relations guys are perhaps not as clued in to HW specifics as to API specifications. (I long for the support of the mid-90s, when these questions were often answered by the HW engineers.)
I don't have a definitive answer, but I believe that at least on NV hardware the reading is done in 32-byte chunks.
I've heard it speculated (by NVIDIA) that for vertices with a size over 32 bytes it's actually faster to use multiple streams, with each stream being 32 bytes and the last stream containing the remaining bytes. I have not tested this.
The largest vertex I've ever personally used in an app that I've cared about performance on was about 28 bytes.
Given the amount of caching that takes place on the front end, I would be surprised if multiple streams made a significant (>2 or 3% in either direction) difference to an application's overall performance.
Of course, all my knowledge and experience here is based on NV2?; I have no clue what ATI do.
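One way to reason about the 32-byte-chunk speculation is simply to count how many 32-byte chunks a single vertex fetch touches in each layout. The model below assumes every stream starts on a chunk boundary and that nothing is cached between fetches; both are simplifying assumptions, and the 44-byte vertex is a made-up example.

```python
CHUNK = 32  # assumed fetch granularity in bytes


def chunks_touched(start, size, chunk=CHUNK):
    """Number of chunk-sized blocks overlapped by bytes [start, start + size)."""
    first = start // chunk
    last = (start + size - 1) // chunk
    return last - first + 1


def chunks_per_vertex(index, stream_strides, chunk=CHUNK):
    """Chunks touched when fetching one vertex from tightly packed streams."""
    return sum(
        chunks_touched(index * stride, stride, chunk)
        for stride in stream_strides
    )


single_stream = [44]     # one interleaved 44-byte vertex
split_streams = [32, 12]  # 32 bytes in stream 0, remainder in stream 1

for name, strides in (("single", single_stream), ("split", split_streams)):
    counts = [chunks_per_vertex(i, strides) for i in range(8)]
    print(name, counts, sum(counts))
```

For this 44-byte example the model gives the same total for both layouts over eight vertices (18 chunk reads each), which is at least in line with the expectation that any difference is small. Since it ignores caching and bus behaviour, it is not evidence either way about real hardware.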
MrFloopy
02-Oct-2002, 01:52
Given the amount of caching that takes place on the front end, I would be surprised if multiple streams made a significant (>2 or 3% in either direction) difference to an application's overall performance.
That has been my experience so far, but I have not really stressed things enough yet to be sure.
I am sure they do 32-byte chunks, but take the worst-case scenario:
Stream 1 xyz(w)
Stream 2 ijk
Stream 3 uv1
etc
Will the HW get 32 bytes from stream 1, then 32 bytes from stream 2, or 32 bytes across the streams, vertex by vertex?
I would think the former would be better, but doubt this happens.
Of course, all my knowledge and experience here is based on NV2?; I have no clue what ATI do.
Not fussed with ATI specifically, but would be interested to know that too.
Will the HW get 32 bytes from stream 1, then 32 bytes from stream 2, or 32 bytes across the streams, vertex by vertex?
I would think the former would be better, but doubt this happens.
I'm not sure I understand what you're asking.
My understanding, and this is hearsay (so take it with a huge grain of salt), is that it behaves like a cache (32 bytes from each stream). It could work in a number of ways, but there should be no need for it to reread bytes read in a previous vertex fetch if the data is still in the cache.
MrFloopy
03-Oct-2002, 03:56
Sorry I wasn't very clear.
In a nutshell, I am concerned with CPU cache and memory access. Assuming the data for each stream is stored in separate arrays, it would make sense to load a cache line's worth of data from each stream, no? And then assemble the separate streams on the Gfx card. However, reading a single vertex from the various streams and combining them one at a time on the CPU would, I think, cause problems.
OK then my previous answer stands.
The predominant reason I believe this to be true is that the hardware has no concept of a stream in the DX9 sense.
As far as the hardware is concerned, it always has separate streams; it's just that they can be interleaved in memory.