Display Lists versus VBOs for static data on ATI/NVIDIA hardware?

Discussion in 'Rendering Technology and APIs' started by Arun, Aug 4, 2005.

  1. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    300
    Location:
    UK
    As I said in another thread, I'm currently in the process of improving - and thus also optimizing - a DX7-era engine (released in 2002) for a modding project. Obviously, we use a much bigger polygon budget than the original game did, and the problem is that the engine can't keep up, even on my 6800GT. And since we're targeting GeForce4 Ti4200/Radeon 8500+ hardware as a minimum (we'll have a fallback for pre-shading hardware, so this is purely a performance restriction), that's obviously a bit of a problem.

    Apparently, the current code is horribly CPU-limited at least when it comes to terrain rendering (with no real non-drawing overhead for terrain). My current analysis is that the CPU overhead is coming from the index buffer not being in a VBO, even though the vertex data is. I would at least assume the drivers not to be tuned for that.

    So basically I've got two options: add support for index buffer VBOs and hope for a CPU overhead reduction, or rewrite it with Display Lists. Considering the kinda crappy way the terrain engine was written, the first option would not be easier than the second, no matter how counter-intuitive that might feel. And yeah, don't get me started on that.

    So what I'm asking is basically how good display lists are for 100% static data, compared to VBOs, on modern hardware. Particularly CPU-wise, since I'd assume a proper implementation of either approach would result in near-maximum GPU throughput. Also, how memory-intensive are Display Lists? This is information I couldn't find anywhere, and I'd assume it differs from vendor to vendor.

    There already was a thread about the performance of Dynamic VBOs compared to other methods among which CVAs (and particularly so on ATI hardware), but I couldn't find anything about this, and the few threads on forums such as gamedev didn't feel very, errr, trustworthy. So I'd appreciate any suggestion on this - if required, I'll just code a small test program, but there might be pitfalls to avoid in the VBO implementation (and considering I'd base myself on the original engine's implementation, I would most likely repeat them).

    Thanks!


    Uttar
     
  2. Ostsol

    Veteran

    Joined:
    Nov 19, 2002
    Messages:
    1,765
    Likes Received:
    0
    Location:
    Edmonton, Alberta, Canada
    The amount of memory a display list uses does vary depending on the implementation, but an FAQ entry on OpenGL.org appears to indicate that they use a significant amount of memory.

    As for which is faster, I have a little old test program that generates an fBm heightmap. There's no LOD algorithm involved and there are no rendering optimizations. It's brute-force rendering using quads. With a 1024x1024 heightmap, display lists gave me 33 fps while VBOs gave me about 17 fps.

    I also had Windows Task Manager running while the program was running. With VBOs memory usage spiked up to about 90 MB while I was generating the data, but once buffering was complete and I deleted the data from system memory, the program ran using 13 MB. With display lists memory usage was up to 90 MB during data generation, then up to more than 400 MB while the display list was being generated and compiled, then finally 100 MB during rendering.

    I don't know about video memory usage for display lists, but the VBOs should be using about 24 MB for the vertices plus 16 MB for the indices. Display lists definitely have a speed advantage, but have quite a large memory overhead, both during compilation and at runtime.
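
    The 24 MB and 16 MB figures can be sanity-checked with a little arithmetic. The sketch below assumes 24 bytes per vertex (for example a 3-float position plus a 3-float normal), 32-bit indices, and one quad of 4 indices per grid cell. None of these layout details are stated in the post, and the function names are made up for illustration.

```c
#include <stdint.h>

/* Bytes of vertex data for a width x height grid of vertices.
   bytes_per_vertex is an assumption of this sketch (24 = two float3 attributes). */
uint64_t vertex_bytes(uint32_t width, uint32_t height, uint32_t bytes_per_vertex)
{
    return (uint64_t)width * height * bytes_per_vertex;
}

/* Bytes of index data when every grid cell is drawn as one quad (4 indices).
   A grid of width x height vertices has (width-1) x (height-1) cells. */
uint64_t quad_index_bytes(uint32_t width, uint32_t height, uint32_t bytes_per_index)
{
    if (width < 2 || height < 2)
        return 0; /* no complete cell, so nothing to index */
    return (uint64_t)(width - 1) * (height - 1) * 4u * bytes_per_index;
}
```

    Under those assumptions, a 1024x1024 grid works out to 25,165,824 bytes of vertex data (exactly 24 MiB) and 16,744,464 bytes of indices (just under 16 MiB), which lines up with the estimate in the post.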

    EDIT: I do recall that discussion about dynamic VBOs. From my own tests, it appeared that glBuffer(Sub)Data calls were particularly expensive, for some reason. I didn't try mapping buffers, though. . .

    EDIT2: Oh, and this was all tested only on my system: AthlonXP 2100+, 768 MB 266 MHz PC2100 DDR-SDRAM, Radeon 9700 Pro /w Cat5.7 drivers. Nothing is overclocked.
     
    #2 Ostsol, Aug 4, 2005
    Last edited by a moderator: Aug 4, 2005
  3. Arun

    Interesting stuff, thanks. I'm wondering though - are you rendering the whole thing in a single drawcall? Such a huge VBO would most likely be suboptimal according to the few devdocs I read about it, while I would assume Display Lists to love that. The engine I'm modifying works on a basic "chunk" basis (32x32 by default, but I'll most likely up it to 64x64), but it's not very different from that (although I'll implement basic per-chunk LODing if I get the time; if you saw my TODO list, you'd most likely think I'll never get around to it, and you're probably right).

    Still, even the "big block" system shouldn't cause a 50% performance drop I'd assume, so I'll probably go with Display Lists. "Spikes" shouldn't be a problem considering I'll be using 64x64 chunks, and I can easily use even smaller ones in the editor to improve editing performance.

    I'm intrigued by what that FAQ says regarding memory usage too. From that, I might be tempted to use glVertex3s and glNormal3s to save 50% memory on this. Considering the scales we're talking of in the engine and that most things are round anyway, that should work just fine for my purposes. Also, if memory is a problem, I can just free the non-display list vertex/color/etc. data completely once the display list has been initialized. Thank god I refused to implement deformable terrain ;)
    Thanks again, I'm still curious about the performance on smaller chunks though (I'd expect to batch about 1000 quads on average - and yes, considering a cool feature I've been working on, I *am* required to use quads or at least triangle lists).

    Uttar
    EDIT: Damnit, should have read that FAQ a bit more carefully - sounds like the driver stores it as it sees fit, which is generally in FP32, and specifying it in short would change nothing. Oh well.
     
    #3 Arun, Aug 4, 2005
    Last edited by a moderator: Aug 4, 2005
  4. Ostsol

    Yeah, I just tried smaller blocks (512x512, then 256x256, then 128x128) and found that VBOs now take the lead in performance. . . Looks like for you, VBOs might be the better choice. I should mention, however, that I was using an index buffer. If I don't use an index buffer, then display lists once again take the lead.
     
    #4 Ostsol, Aug 5, 2005
    Last edited by a moderator: Aug 5, 2005
  5. Arun

    Ostsol: NVIDIA performance docs say that anything above 64K vertices in a VBO is suboptimal, so that seems possible. However, your comment on indices confuses me; are you implying 16MB was with a 4:1 index ratio, and same for the VBO vs Display List performance? If so, then display lists would remain a better choice in my case, considering the low indexing possibilities of my rendering scheme.

    Uttar
     
  6. Ostsol

    Yes, that's the index ratio that I ended up having, and VBOs were faster only when an index buffer was used. Just keep in mind that display lists have a much higher memory footprint.

    I find it really weird that implementing index buffers would be so much more of a hassle than implementing display lists. . . If you're chunking up the terrain, they'd be really useful considering that you'd only need one index buffer to service all the chunks.
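
    The "one index buffer for all the chunks" idea works when every chunk lays out its vertices the same way. A minimal sketch, assuming each chunk is a regular row-major grid of (N+1) x (N+1) vertices drawn as N x N quads. That layout, the winding order, and the function name are assumptions for illustration, not something taken from the engine under discussion.

```c
#include <stddef.h>
#include <stdint.h>

/* Writes quad indices for one chunk of quads_per_side x quads_per_side quads.
   Vertices are assumed row-major with (quads_per_side + 1) vertices per row.
   Returns the number of indices written, or 0 if the output array is too
   small or the chunk would need more than 65536 vertices (16-bit indices). */
size_t build_chunk_quad_indices(uint16_t *out, size_t capacity,
                                unsigned quads_per_side)
{
    unsigned verts_per_side = quads_per_side + 1;
    size_t needed = (size_t)quads_per_side * quads_per_side * 4;
    size_t n = 0;
    unsigned x, y;

    if (out == NULL || quads_per_side == 0 || verts_per_side > 256 ||
        needed > capacity)
        return 0;

    for (y = 0; y < quads_per_side; ++y) {
        for (x = 0; x < quads_per_side; ++x) {
            unsigned v0 = y * verts_per_side + x;          /* this row      */
            out[n++] = (uint16_t)v0;
            out[n++] = (uint16_t)(v0 + 1);
            out[n++] = (uint16_t)(v0 + verts_per_side + 1); /* next row     */
            out[n++] = (uint16_t)(v0 + verts_per_side);
        }
    }
    return n;
}
```

    For a 64x64 chunk this yields 16,384 indices over 4,225 vertices, so the same 16-bit array could be reused for every chunk that shares the grid layout; only the vertex data would differ per chunk.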
     
  7. t0y

    t0y
    Newcomer

    Joined:
    Mar 28, 2004
    Messages:
    149
    Likes Received:
    6
    Location:
    Portugal
    Instead of hardcoding one of the options now, you could try to create an interface that supports both and only implement DLs for now. You can try VBOs later if you have the time.
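
    One way such an interface might look in C is a small table of function pointers, with one implementation per backend. This is only a sketch: every name here is hypothetical, and the "counting" backend is a stand-in that records calls so the shape can be exercised without a GL context. A real backend would wrap glNewList/glCallList or the VBO entry points instead.

```c
#include <stddef.h>

/* Backend-neutral interface for static terrain chunks (all names hypothetical). */
typedef struct ChunkBackend {
    /* Copies or compiles the static geometry once; returns 0 on success. */
    int  (*upload)(void *state, const float *xyz, size_t vertex_count);
    /* Issues the draw for a previously uploaded chunk. */
    void (*draw)(void *state);
    /* Backend-private data (display list id, buffer ids, ...). */
    void *state;
} ChunkBackend;

/* Stand-in backend state: just remembers what it was asked to do. */
typedef struct CountingState {
    size_t   uploaded_vertices;
    unsigned draw_calls;
} CountingState;

static int counting_upload(void *state, const float *xyz, size_t vertex_count)
{
    CountingState *s = (CountingState *)state;
    if (xyz == NULL || vertex_count == 0)
        return -1; /* nothing to upload */
    s->uploaded_vertices = vertex_count;
    return 0;
}

static void counting_draw(void *state)
{
    CountingState *s = (CountingState *)state;
    s->draw_calls++;
}

/* Builds a backend whose calls are recorded in *s. */
ChunkBackend make_counting_backend(CountingState *s)
{
    ChunkBackend b;
    s->uploaded_vertices = 0;
    s->draw_calls = 0;
    b.upload = counting_upload;
    b.draw = counting_draw;
    b.state = s;
    return b;
}
```

    The terrain code would then only ever call `upload` and `draw` through a `ChunkBackend`, so a VBO implementation could be dropped in later without touching the callers.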

    DLs are very convenient when dealing with variable-frequency data (4 vertices sharing the same texture coordinate, for example), but drivers will generally aim for maximum performance and fill the gaps for you, hence the high memory requirements. If that is why DLs are easier for you, keep in mind that nothing beats creating VBOs directly instead of letting the driver guess what's best.

    Keep in mind OpenGL is a state machine and you can still mix and match different approaches to simplify development and/or increase performance (using immediate mode, keeping some data in regular VBs, standard index buffers, etc). Look out for API calls that require the pipeline to flush, though.
     
  8. Arun

    Okay, I finally got around to doing a fair bit of testing, so here goes. I used a modified version of OpenGL Geometry Benchmark to which I added VBO support. The only REAL data used is positions in all cases, which takes 12 bytes; the rest is just padding. Also, there are 16 drawcalls, so the per-batch triangle counts are about 1/16th of the shown ones, and there are nearly 2 triangles per vertex, too.
    Anyhow, here goes. Results are a bit surprising IMO, but still roughly explainable. I was very surprised by the high performance of things like 240 bytes per vertex; there could be optimizations going on, but I'd personally guess there aren't any big ones. However, performance DIES at 256 bytes/vertex, heh. And nope, it's not buffer-size related in this case, because otherwise the smaller batch count performance wouldn't die.
    Everything was benchmarked on a non-OCed AGP 8x NVIDIA GeForce 6800GT, with the 77.72 drivers (instrumental settings disabled). Motherboard is an ASUS nForce2 A7N8X Deluxe. All rendering was as expected, obviously. glBufferSubDataARB was used for dynamic VBO testing; interestingly, performance was the same no matter the usage flags given to glBufferDataARB.

     
  9. Arun

    A small clarification: regarding dynamic VBOs (well, streaming basically, but usage flags on NVIDIA hardware gave me no performance changes), what I did in those tests is keep TWO VBOs for each sphere being rendered. So on frame A I'd use the first 16 VBOs for the 16 spheres, and on frame B I'd use the last 16 VBOs (0->15 for A, 16->31 for B). More VBOs gave me no performance benefit.
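
    The alternating-frame scheme boils down to picking a buffer slot from the frame number and the object number. A minimal sketch of that selection, with a hypothetical function name and a generalized "sets" parameter (2 for the A/B case described, 3 for A/B/C):

```c
#include <assert.h>

/* Returns the index of the buffer to stream into for `object` on `frame`,
   given `object_count` objects and `sets` buffers per object.
   Buffers are laid out set by set: set 0 holds slots 0..object_count-1,
   set 1 holds the next object_count slots, and so on. */
unsigned stream_buffer_slot(unsigned frame, unsigned object,
                            unsigned object_count, unsigned sets)
{
    assert(sets > 0);
    assert(object < object_count);
    return (frame % sets) * object_count + object;
}
```

    With 16 spheres and 2 sets, even frames map to slots 0->15 and odd frames to slots 16->31, so the buffer being written this frame is never the one that was used for drawing on the previous frame.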

    Performance with 1 VBO for all 16 spheres, so it would be streamed to 16 times per frame:
    Performance with 1 VBO per sphere:
    Performance with 2 VBOs per sphere, alternating between frames (see above):
    Performance with 3 VBOs per sphere (for frames A, B and C):
    Hopefully all this info will help someone :) Also, regarding VERY small batches (16 vertices, about 30-35 triangles I think), dynamic VBO performance is FASTER than Immediate Mode by a few % if using the "Alternating Frame" principle, and lower by a few % if not using it.
    BTW, from my limited testing, using glMapBuffer/glUnmapBuffer with a NULL glBufferDataARB beforehand is nearly as fast as using glBufferSubDataARB; slightly higher call overhead and possibly higher overhead in your own program though, obviously, but nothing significant if done right.

    Uttar
    P.S.: Testing CVAs and DLs with more data and testing VBOs with actual data rather than padding might be interesting, but I'd expect performance to be highly similar, so I didn't really bother much except for a few rare testcases iirc which gave me that rough idea too. Ah well, this is more than enough information for my own uses, although I'll have to check on another dev's 9700Pro to check if everything is working as expected on his card too.
     
    #9 Arun, Aug 12, 2005
    Last edited by a moderator: Aug 12, 2005