"Yes, but how many polygons?" An artist blog entry with interesting numbers

I benchmarked full screen quads as well. Performance dropped to about 156,000 quads per second.

Whoa there, that is better than I ever expected. In typical game scenarios, overdraw and transparencies usually hit the fillrate wall way, way sooner, with much less total overdraw on screen (many orders of magnitude less as I understand, no?). What makes your peak-performance test perform that much better?
 
That's just the time it takes to feed vertex data to the hardware that writes out the display list, so it's basically benchmarking how fast it can write pointers to the tile polygon lists. I didn't actually try rendering it to see what the real achievable fillrate is.
 
Oh, I see. If you aren't drawing anything, then how does the specific position of a vertex affect performance? Is there a speed difference between pushing verts for a fullscreen quad and a smaller 50x50px one in the screen center?
 
The position of a vertex doesn't affect performance, but the size of a strip's bounding box does.

Most people here probably know this already, but the GPU in the Dreamcast works differently than most GPUs. Instead of drawing to the frame buffer immediately when it receives a command, the GPU buffers all vertex data in video RAM first. When it actually does the rendering from the saved polygon data, it draws small sections of the screen (32x32 pixel tiles on the DC) to an on-chip mini-frame buffer and depth buffer, then writes the on-chip FB to the real FB in RAM. So if you send the GPU commands to draw stuff, it's not really a draw command, but a "write this stuff to the display list for later rendering" command.

By doing rendering on-chip, it can use wider internal buses and do rendering optimizations that allow the GPU to pretty much skip rendering any pixels that end up covered by opaque polygons regardless of submission order (saving fillrate), save video memory bandwidth by not needing to access RAM for the depth buffer, and do extra things like very efficient shadow volumes and order-independent translucency.

In order to not have to run through the entire display list for each tile, each tile gets an extendable array of pointers to what is potentially visible in that tile. Determining exactly what tiles something covers would require rasterizing it, which was way too expensive back then, so the GPU just calculates its screen space bounding box and adds pointers to every tile the box touches.
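
A minimal sketch of that binning step, written in Python just for readability. The function name and the dictionary of lists are made up for illustration; the real hardware writes its own list format into video RAM. It only shows why bounding-box binning is conservative: a thin diagonal sliver gets a pointer in every tile its box touches, including tiles it never actually covers.

```python
TILE = 32  # tile size in pixels, as described above

def bin_by_bounding_box(tile_lists, prim_id, points):
    """Append prim_id to every tile touched by the bounding box of points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    for ty in range(min(ys) // TILE, max(ys) // TILE + 1):
        for tx in range(min(xs) // TILE, max(xs) // TILE + 1):
            tile_lists.setdefault((tx, ty), []).append(prim_id)

tile_lists = {}

# A thin sliver along the diagonal: it only passes through tiles near the
# diagonal, but its bounding box is 96x96 pixels, so all 3x3 tiles get a pointer.
bin_by_bounding_box(tile_lists, "sliver", [(0, 0), (95, 95), (90, 95)])

# A small triangle that stays inside one tile gets exactly one pointer.
bin_by_bounding_box(tile_lists, "small", [(40, 40), (50, 40), (45, 50)])

for tile in sorted(tile_lists):
    print(tile, tile_lists[tile])
```

Tile (2, 0) ends up with a pointer to the sliver even though the sliver never comes near it, which is the price of skipping exact rasterization at binning time.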

The GPU works mainly on triangle strips (it also supports individual quads, but does not support fans). You can send a strip of any size, but they are broken down into substrips of either 1, 2, 4, or 6 triangles. The bounding box is calculated and pointers are written for each substrip. A long 5,000 poly triangle strip snaking across the entire screen will not have pointers to the entire strip written to the entire screen, but just the substrips and their bounding boxes.

Since it takes time to write the pointers to memory, a strip that covers a large area will take more time to write out than a strip that covers a small area. A triangle that doesn't cross tile boundaries only needs one pointer written, while the full screen 640x480 quad needs (640x480)/(32x32)=300 pointers written out, so there's a big difference in processing time. The benchmark was to get an idea of how much time it takes the GPU to write out a lot of pointers vs very little or none.
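
To put numbers on that, here is a small sketch of the pointer count for a screen-space box. The helper name and the half-open box convention are my own assumptions; only the 32x32 tile size and the 640x480 screen come from the description above.

```python
TILE = 32  # tile size in pixels

def pointers_for_box(x0, y0, x1, y1):
    """Number of tiles touched by the pixel box [x0, x1) x [y0, y1)."""
    tiles_x = (x1 - 1) // TILE - x0 // TILE + 1
    tiles_y = (y1 - 1) // TILE - y0 // TILE + 1
    return tiles_x * tiles_y

fullscreen = pointers_for_box(0, 0, 640, 480)        # 20 x 15 tiles
centered_50 = pointers_for_box(295, 215, 345, 265)   # 50x50 quad in the screen center
tiny = pointers_for_box(4, 4, 20, 20)                # stays inside a single tile

print(fullscreen, centered_50, tiny)  # 300 6 1
```

So the 50x50 quad from the question costs 6 pointer writes in this model, against 300 for the fullscreen one. Note that the small quad's count depends on where it sits relative to the tile grid, since a box that straddles a tile boundary touches more tiles than one that doesn't.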

The time it takes to write the pointers can slow the CPU down if the CPU submits vertices directly. The part of the GPU that writes the display lists has an input buffer, and when the buffer gets full, whatever is writing data to it has to stall. Another approach is to buffer commands in main RAM, then DMA them to the GPU. That way the DMA controller gets the stall signal rather than the CPU, which can continue to do work (I think that might be what happens. The DMA is driven by the GPU, so it might work by requesting buffer-sized chunks or something.). The downside of that is that you have to reserve a large buffer to store the commands, and it wastes bandwidth since all the commands need to cross the bus twice (CPU to buffer, buffer to GPU) rather than once (CPU to GPU).

Writing directly to the GPU with the CPU might not get stalls if the data isn't written too quickly and doesn't have large bounding boxes. If you do all your T&L then submit the vertices (the most CPU efficient way) you will get stalls because you're doing back-to-back submissions that will fill the GPU's input buffer quickly. But if you do T&L between each vertex sent, that might give the GPU enough time to write the polygon data without the CPU stalling. You can get more parallelism by having command writing and T&L happen at the same time. My 6 Mvert/sec routine works like this.
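
A toy model of that trade-off, as a sketch only. Every number here (cycles per vertex of T&L, cycles the GPU needs per vertex, buffer depth) is invented for illustration and is not a measured Dreamcast figure, and the one-entry-per-vertex buffer is a simplification.

```python
def simulate(n_verts, tnl_cycles, drain_cycles, fifo_depth, interleaved):
    """Return (total CPU cycles, cycles the CPU spent stalled).

    tnl_cycles:   CPU cost of transforming/lighting one vertex (hypothetical)
    drain_cycles: time the GPU needs to process one buffered vertex (hypothetical)
    fifo_depth:   how many vertices the input buffer can hold (hypothetical)
    """
    t = 0              # CPU clock
    stall = 0
    done_times = []    # when each submitted vertex leaves the input buffer
    if not interleaved:
        t += n_verts * tnl_cycles          # do all T&L up front
    for i in range(n_verts):
        if interleaved:
            t += tnl_cycles                # T&L for this vertex only
        # The buffer is full if vertex (i - fifo_depth) is still in it.
        if i >= fifo_depth and done_times[i - fifo_depth] > t:
            stall += done_times[i - fifo_depth] - t
            t = done_times[i - fifo_depth]
        t += 1                             # the write itself
        start = max(t, done_times[-1]) if done_times else t
        done_times.append(start + drain_cycles)
    return t, stall

batch = simulate(100, tnl_cycles=10, drain_cycles=8, fifo_depth=4, interleaved=False)
mixed = simulate(100, tnl_cycles=10, drain_cycles=8, fifo_depth=4, interleaved=True)
print("batch T&L then submit:", batch)   # (1770, 670)
print("T&L between vertices: ", mixed)   # (1100, 0)
```

With these made-up numbers the back-to-back submission spends most of its submit phase stalled, while the interleaved version never stalls because the GPU finishes each vertex before the next one arrives. The model ignores the fact that batching T&L is usually more efficient on the CPU side by itself, so it only illustrates the stall behaviour, not the full picture.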

From what I've heard, most games using the official SDK used DMA only (and they double buffered the main RAM buffers, so something like Shenmue II probably spends ~3MB of main RAM for DMA buffers).

Something I'd like to test would be a mixed approach. If the object isn't animated, is prelit or has a single light, and has small triangles, you should submit it directly. For very large triangles, or more complex things like skeletal animation, write the results to a buffer and DMA it.
 
Yeah, of course. The DC had a deferred renderer GPU... I never knew about all these details of its operation. Thank you so much for such a thorough answer, this was an incredibly fun read!
 
So, just out of curiosity, as I understand it: the dev sends full strips from CPU to GPU, and those are broken down into substrips by the GPU itself as it writes them to its input buffer. Then later, the strips in the input buffer get bounding box tested against the screen tiles, and for every positive, a pointer with that strip's ID is written into each touched tile's individual render list. Then all tiles are rendered one by one, taking advantage of on-chip memory for the small tile's framebuffer to get incredibly fast fillrate and z-testing out of that.

So, @TapamN, does a dev know the state of each one of those steps the GPU goes through or is it all black boxed to him? Does it fill its input buffer completely before doing the bounding box tests or is that a parallel procedure? Are the substrips generated consistently across frames with models in different positions, or does the GPU try to do something clever to get optimal substrips there? Is the GPU loading the next frame's strips into its geometry buffers as it is rasterizing the current frame in parallel? Does the dev get to ask it when to raster/load verts and when to not if he wants to?

Thanks again.
 
Fun times. ArtStation gives a peek at what artists would prefer to do while still being somewhat reasonable. We'll see what the XSX and PS5 are actually up to, but I've often seen objects in this poly range (4.5k tris axe).
neil-houari-2.jpg

3x a detailed character model from Dreamcast... just for an axe. Characters, without specific game concerns, often end up in 80k-160k range over there.
 
So, @TapamN, does a dev know the state of each one of those steps the GPU goes through or is it all black boxed to him?
I guess a lot of it's black boxes. Once you have the hardware set up to receive polygons, you just send "set render state" commands and vertex commands and it does everything automatically. The advantage of knowing how it works is that you get a better idea of how to optimize for the hardware. By measuring the timing and some internal registers, you can get an idea of how the hardware works.

For example, there's a register that you can use to measure how much RAM is used by polygon data. So you can send a single triangle and see that it takes up 84 bytes of space. Then you try sending a 2-triangle strip and it takes up 108 bytes of space, 24 bytes more. You can guess that the extra 24 bytes are the position (12 bytes), UVs (8 bytes), and color (4 bytes). So the formula for space used for a strip is probably 12 + 24 * vert_count bytes. You try a 3-triangle strip, and it takes 192 bytes. That's odd, the formula didn't work. Then you try a 4-triangle strip. That one follows the formula. Oh, that document on the GPU mentioned something about strip sizes of 1, 2, 4, and 6, but not 3. Maybe the rendering side of the hardware doesn't support 3-triangle strips. If you treat the 3-triangle strip as a 2-triangle strip plus a 1-triangle strip, the formula works again.

Doing experiments like that can figure out a lot.
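
The arithmetic in that experiment can be checked with a few lines. The constant names are mine; the 12-byte overhead and 24 bytes per vertex are the guesses from the measurements above, not documented values.

```python
HEADER_BYTES = 12   # guessed fixed overhead per substrip
VERTEX_BYTES = 24   # position (12) + UVs (8) + color (4)

def substrip_bytes(tri_count):
    """Guessed storage for one substrip; a strip of n triangles has n + 2 vertices."""
    vert_count = tri_count + 2
    return HEADER_BYTES + VERTEX_BYTES * vert_count

one_tri = substrip_bytes(1)                               # measured: 84
two_tri = substrip_bytes(2)                               # measured: 108
three_tri_naive = substrip_bytes(3)                       # formula alone says 132
three_tri_split = substrip_bytes(2) + substrip_bytes(1)   # measured: 192

print(one_tri, two_tri, three_tri_naive, three_tri_split)  # 84 108 132 192
```

The 60-byte gap between 132 and 192 is the extra overhead plus the two vertices that have to be stored a second time when the strip is cut in two (12 + 2 * 24).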

Does it fill its input buffer completely before doing the bounding box tests or is that a parallel procedure?
It probably starts writing stuff to memory as soon as it either gets an end-of-strip vertex or the buffer fills up enough that you have a 6-triangle substrip ready. It would be tricky to measure exactly when it triggers a write, but I think it's pretty likely that's how it works. There's actually a "finish this list" command that you need to send once you've submitted everything, which causes it to flush some unwritten data before you render the list, but I'm not sure exactly what it writes.

Are the substrips generated consistently across frames with models in different positions, or does the GPU try to do something clever to get optimal substrips there?
I think substrip generation was just "keep using the largest substrip size we can until we're done", so on-screen position doesn't really matter. You don't really need to do more than that to be optimal within a single strip. When you generate the "full" strips when creating an entire model, it's probably worth trying to avoid odd numbered strip lengths, so that it doesn't need to generate single triangle substrips.
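
If that guess is right, the splitting would look something like the sketch below. This is the "largest size first" rule as described, not confirmed hardware behaviour, and the function name is made up.

```python
SUBSTRIP_SIZES = (6, 4, 2, 1)  # allowed substrip lengths, largest first

def split_strip(tri_count):
    """Greedily cut a strip of tri_count triangles into allowed substrips."""
    sizes = []
    remaining = tri_count
    for size in SUBSTRIP_SIZES:
        while remaining >= size:
            sizes.append(size)
            remaining -= size
    return sizes

print(split_strip(3))    # [2, 1]
print(split_strip(10))   # [6, 4]
print(split_strip(11))   # [6, 4, 1]
print(len(split_strip(5000)))  # 834 substrips: 833 of six triangles, one of two
```

Under this rule every odd-length strip ends in a single-triangle substrip and no even-length strip ever produces one, which is why avoiding odd lengths when building a model's strips looks worthwhile.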

Is the gpu loading the next frame's strips into its geometry buffers as it is rasterizing the current frame in parallel?
The polygon data is double buffered, so while you're submitting data for one frame, it can also be rendering the previous frame.

Does the dev get to ask it when to raster/load verts and when to not if he wants to?
I'm not sure I understand what you're getting at. Er, the hardware runs when you ask it to? You aren't forced to immediately start rendering when a list finishes, or submit polygons when rendering finishes.

To build a display list, you initialize some stuff in video RAM, set some registers to tell the hardware where you want the list written to, then write a register to tell it to get ready to accept polygon data. Then you feed it vertices, and when you're done, you tell it to finish the list. Then you get an interrupt when the list is ready to render.

To render a display list, you set some registers to tell it where the list is, then write another register to tell it to start. When it finishes, you get another interrupt.

Both of these can be happening at the same time.
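
As a rough picture of that overlap, here is a sketch of the schedule with two display list buffers. It models nothing about the actual registers or interrupts; the function and the (frame, buffer) tuples are purely illustrative.

```python
def frame_schedule(frame_count):
    """List of (building, rendering) pairs, each a (frame, buffer) tuple or None."""
    steps = []
    for step in range(frame_count + 1):
        # The list for the current frame is built in one buffer...
        building = (step, step % 2) if step < frame_count else None
        # ...while the previous frame's finished list is rendered from the other.
        rendering = (step - 1, (step - 1) % 2) if step > 0 else None
        steps.append((building, rendering))
    return steps

for building, rendering in frame_schedule(3):
    print("build:", building, "| render:", rendering)
```

At every step where both are active, the list being built and the list being rendered live in different buffers, so neither has to wait for the other.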
 
That answers all my questions. Great write up. I'm impressed you even remember some of this stuff so well today.
 
Next, Shenmue 2. It was a pain to get the models, so I inspected even less than in part 1. The tools seem to be designed to work with the recent PC release, but I got them to work barebones. The maps seem to be a lot bigger in Shenmue 2. They also seem to be denser polygon-wise, and Ryo's model seems to be slightly different from part 1. I forgot to note where I got the files for the mountain paths, but they are from the last disc.

Ryo Hazuki gameplay - 2,061 tris
shenmue2-6ryo.jpg


Ryo Hazuki's high poly hand - right - 604 tris
shenmue2-5hands.jpg


Pier - warehouses - 61,057 tris
shenmue2-4pier.jpg


Lucky quarter - 57,181 tris
shenmue2-3luckqr.jpg


Winding mountain path - last disc - 193,188 tris
shenmue2-1.jpg


No idea where but it leads to a cave - 91,515 tris
shenmue2-2.jpg
 
Almost 200k tris in a DC game... Holy fuuuuuuck! Can't believe it! Are there PS2 or PSP games that can reach that level of geometry?
 
Gonna put up a PSP game now. I figure this is relevant because a lot of people consider the PSP and Dreamcast to be more or less on par with each other, so it will be interesting to see how a high-end PSP game stacks up. This is The 3rd Birthday by Square Enix. Square was always good with their art, and that's something the Dreamcast sorely lacked support on. The polygon counts for this game are actually super modest. If anything, it reminds me of Illbleed on the DC in that regard.



Main character - 1,582 tris
3rd1.jpg


Main character in Lightning costume - 1,602 tris
3rd2.jpg


NPC Hyde - 1,050 tris
3rd3.jpg


Pistol - 50 tris
3rd4.jpg


Assault rifle - 92 tris
3rd5.jpg


Final boss platform - 31,947 tris
3rd6.jpg


City street - 14,571 tris
3rd7.jpg


Regular enemy - 600 tris
3rd8.jpg


Final boss - 2,998 tris
3rd9.jpg
 
It's not all rendered on screen at the same time; that's a whole level.

There was Jak and Daxter 3, which was pushing more than 10 million polygons per second on screen.
 
Found a nice article about poly count evolution in some PS games:

https://blog.us.playstation.com/201...evolution-of-5-iconic-playstation-characters/
I think they made a mistake with Kratos poly count in God of War 3. It says 20k in this article as opposed to 64k.
https://www.playstationlifestyle.net/2010/02/27/god-of-war-iii-god-of-war-ii-comparison/
20k is much more likely for a PS3 game's poly budget and would suit the 4x increase to the PS4's 80k Kratos model far better.
Assuming a typical new-gen increase, we should see a 320k-400k Kratos on PS5, and games like Detroit or Until Dawn might push to a 1 million poly main character for the giggles.
 
That really shows how important good art direction is in a game... Those models look far better than their real polycount suggests, especially the character models... I can't believe those characters are less than 2,000 tris. So, can we safely assume 3rd Birthday would be possible on Dreamcast 1:1?
 
I found something really interesting. Someone on Sonic Retro ripped the models from the Genesis version of Virtua Racing! (I was planning to do this myself at some point...) For anyone who wants to look at them, the .OBJs are here.

Here are some poly counts:
1P Player's car body without shadow, no wheels: 87 tris
Player car wheel: 22 tris
The four wheels combined are 1 more triangle than the car body!
Player car shadow: 4 tris
1P Player's car total: 179 tris

upload_2020-4-17_23-1-55.png
Beginner course: 5816 tris
Medium course: 6584 tris
Expert course: 7712 tris

Super high poly credits sequence car: 520 tris body, 108 tris each wheel, 952 total

Judging by the credits sequence, it looks like the game does around 1000 tris per frame at 15 FPS. When they advertised it as doing 300-500 polys per frame, they must have meant quads.
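
The totals above are easy to sanity check. The numbers are the ones listed in this post; the per-second figure just multiplies the rough 1000 tris per frame estimate by 15 FPS.

```python
# Player car: body + four wheels + shadow
body, wheel, shadow = 87, 22, 4
car_total = body + 4 * wheel + shadow

# Credits sequence car: body + four wheels
credits_total = 520 + 4 * 108

# Rough throughput from the credits sequence estimate
tris_per_second = 1000 * 15

print(4 * wheel - body)   # 1 (the wheels beat the body by one triangle)
print(car_total)          # 179
print(credits_total)      # 952
print(tris_per_second)    # 15000
```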
 
Oh wow, that's awesome. Always wondered how this game performed. So 15,000 a second, huh. I wonder what this means for the 32X version, which has more detail and, from what I hear, runs at 20 fps. Hmm, I haven't checked PS1 games, but I always heard Vagrant Story did 3,000 polygons per frame at 30 fps. Virtua Racing on Genesis isn't doing too bad at all.

BTW TapamN, did you get activated at sega16? Or maybe I should repost that request.
 