"Yes, but how many polygons?" An artist blog entry with interesting numbers

milk · May 4, 2020

TapamN said:
I benchmarked full screen quads as well. Performance dropped to about 156,000 quads per second.

Whoa there, that is better than I ever expected. In typical game scenarios, overdraw and transparencies usually hit the fillrate wall way sooner, with much less total overdraw on screen (many orders of magnitude less, as I understand it, no?). What makes your peak-performance test perform that much better?

TapamN · May 4, 2020

That's just the time it takes to feed vertex data to the hardware that writes out the display list, so it's basically benchmarking how fast it can write pointers to the tile polygon lists. I didn't actually try rendering it to see what the real achievable fillrate is.

milk · May 4, 2020

Oh, I see. If you aren't drawing anything, then how does the specific position of a vertex affect performance? Is there a speed difference between pushing verts for a fullscreen quad and a smaller 50x50px one in the screen center?

TapamN · May 4, 2020

The position of a vertex doesn't affect performance, but the size of a strip's bounding box does.

Most people here probably know this already, but the GPU in the Dreamcast works differently than most GPUs. Instead of drawing to the frame buffer immediately when it receives a command, the GPU buffers all vertex data in video RAM first. When it actually does the rendering from the saved polygon data, it draws small sections of the screen (32x32 pixel tiles on the DC) to an on-chip mini-frame buffer and depth buffer, then writes the on-chip FB to the real FB in RAM. So if you send the GPU commands to draw stuff, it's not really a draw command, but a "write this stuff to the display list for later rendering" command.

By doing rendering on-chip, it can use wider internal buses and do rendering optimizations that allow the GPU to pretty much skip rendering any pixels that end up covered by opaque polygons regardless of submission order (saving fillrate), save video memory bandwidth by not needing to access RAM for the depth buffer, and do extra things like very efficient shadow volumes and order-independent translucency.

In order to not have to run through the entire display list for each tile, each tile gets an extendable array of pointers to what is potentially visible in that tile. Determining exactly what tiles something covers would require rasterizing it, which was way too expensive back then, so the GPU just calculates its screen space bounding box and adds pointers to every tile the box touches.

The GPU works mainly on triangle strips (it also supports individual quads, but does not support fans). You can send a strip of any size, but they are broken down into substrips of either 1, 2, 4, or 6 triangles. The bounding box is calculated and pointers are written for each substrip. A long 5,000 poly triangle strip snaking across the entire screen will not have pointers to the entire strip written to the entire screen, but just the substrips and their bounding boxes.

Since it takes time to write the pointers to memory, a strip that covers a large area will take more time to write out than a strip that covers a small area. A triangle that doesn't cross tile boundaries only needs one pointer written, while the full screen 640x480 quad needs (640x480)/(32x32)=300 pointers written out, so there's a big difference in processing time. The benchmark was to get an idea of how much time it takes the GPU to write out a lot of pointers vs very little or none.
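To put numbers on the bounding-box cost described above, here is a minimal Python sketch. The function name `tiles_touched` and the inclusive pixel-coordinate convention are illustrative assumptions, not anything from hardware documentation. The 32x32 tile size, the 640x480 screen, and the 156,000 quads per second figure come from the posts above, and the 50x50 quad placement is just one hypothetical way to center it.

```python
TILE = 32  # tile size in pixels on the DC, per the post above


def tiles_touched(x0, y0, x1, y1, tile=TILE):
    """Count tiles overlapped by an inclusive screen-space bounding box.

    One pointer is assumed to be written per touched tile.
    """
    cols = x1 // tile - x0 // tile + 1
    rows = y1 // tile - y0 // tile + 1
    return cols * rows


full_screen = tiles_touched(0, 0, 639, 479)      # 20 x 15 tiles
small_tri = tiles_touched(40, 40, 60, 60)        # stays inside one tile
centered_50 = tiles_touched(295, 215, 344, 264)  # hypothetical 50x50 quad

# Implied pointer-write rate if every benchmarked quad costs 300 pointers.
pointer_writes_per_sec = 156_000 * full_screen

print(full_screen, small_tri, centered_50, pointer_writes_per_sec)
```

Under these assumptions the centered 50x50 quad touches 6 tiles versus 300 for the full-screen quad, which is the kind of gap the benchmark was probing.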

The time it takes to write the pointers can slow the CPU down if the CPU submits vertices directly. The part of the GPU that writes the display lists has an input buffer, and when the buffer gets full, whatever is writing data to it has to stall. Another approach is to buffer commands in main RAM, then DMA them to the GPU. That way the DMA controller gets the stall signal rather than the CPU, which can continue to do work. (I think that might be what happens. The DMA is driven by the GPU, so it might work by requesting buffer-sized chunks or something.) The downside is that you have to reserve a large buffer to store the commands, and it wastes bandwidth since all the commands need to cross the bus twice (CPU to buffer, buffer to GPU) rather than once (CPU to GPU).

Writing directly to the GPU with the CPU might not get stalls if the data isn't written too quickly and doesn't have large bounding boxes. If you do all your T&L then submit the vertices (the most CPU efficient way) you will get stalls because you're doing back-to-back submissions that will fill the GPU's input buffer quickly. But if you do T&L between each vertex sent, that might give the GPU enough time to write the polygon data without the CPU stalling. You can get more parallelism by having command writing and T&L happen at the same time. My 6 Mvert/sec routine works like this.
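The stall argument can be illustrated with a toy queue model in Python. Everything here is an assumption made for illustration: the cycle costs, the FIFO depth, the one-cycle write, and the `simulate` function itself are hypothetical and are not measurements of the real hardware. The model only shows why back-to-back writes can stall while interleaved writes may not.

```python
def simulate(num_verts, tnl_cycles, drain_cycles, fifo_depth, interleave):
    """Return (total_cycles, stall_cycles) for a toy CPU -> GPU FIFO model.

    All parameters are hypothetical illustration values.
    """
    time = 0
    stalls = 0
    completions = []  # time at which the GPU finishes consuming each vertex

    if not interleave:
        time += num_verts * tnl_cycles  # do all T&L up front

    for i in range(num_verts):
        if interleave:
            time += tnl_cycles  # T&L for this vertex just before sending it

        # The FIFO is full while vertex (i - fifo_depth) is still unconsumed.
        if i >= fifo_depth and completions[i - fifo_depth] > time:
            stalls += completions[i - fifo_depth] - time
            time = completions[i - fifo_depth]

        time += 1  # one cycle to write the vertex into the FIFO
        start = max(time, completions[-1] if completions else 0)
        completions.append(start + drain_cycles)

    return time, stalls


batched = simulate(100, tnl_cycles=10, drain_cycles=8, fifo_depth=4, interleave=False)
interleaved = simulate(100, tnl_cycles=10, drain_cycles=8, fifo_depth=4, interleave=True)
print(batched, interleaved)
```

With these made-up numbers the batched version stalls for 670 cycles while the interleaved one never stalls. The model charges the same T&L cost either way, so it leaves out whatever makes the batched loop more CPU efficient in practice.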

From what I've heard, most games using the official SDK used DMA only (and they double buffered the main RAM buffers, so something like Shenmue II probably spends ~3MB of main RAM for DMA buffers).

Something I'd like to test would be a mixed approach. If the object has no animation, is prelit or has a single light, and has small triangles, you should submit it directly. For very large triangles, or more complex things like skeletal animation, write the results to a buffer and DMA it.

milk · May 4, 2020

Yeah, of course, the DC had a deferred renderer GPU... but I never knew about all these details of its operation. Thank you so much for such a thorough answer, this was an incredibly fun read!

milk · May 4, 2020

So, just out of curiosity, as I understand it: the dev sends full strips from CPU to GPU, and those are broken down into substrips by the GPU itself as it writes them to its input buffer. Then later, the strips in the input buffer get bounding-box tested against the screen tiles, and for every hit a pointer with that strip's ID is written into each touched tile's individual render list. Then all tiles are rendered one by one, taking advantage of on-chip memory for the small tile's framebuffer to get incredibly fast fillrate and z-testing out of that.

So, @TapamN, does a dev know the state of each one of those steps the GPU goes through or is it all black boxed to him? Does it fill its input buffer completely before doing the bounding box tests or is that a parallel procedure? Are the substrips generated consistently across frames with models in different positions, or does the GPU try to do something clever to get optimal substrips there? Is the GPU loading the next frame's strips into its geometry buffers as it is rasterizing the current frame in parallel? Does the dev get to ask it when to raster/load verts and when to not if he wants to?

Thanks again.

Frenetic Pony · May 5, 2020

Fun times. ArtStation gives a peek at what artists would prefer to do while still being somewhat reasonable. We'll see what the XSX and PS5 are actually up to, but I've often seen objects in this poly range (4.5k tris axe).

3x a detailed character model from Dreamcast... just for an axe. Characters, without specific game concerns, often end up in the 80k-160k range over there.

TapamN · May 5, 2020

milk said:
So, @TapamN, does a dev know the state of each one of those steps the GPU goes through or is it all black boxed to him?

I guess a lot of it's black boxes. Once you have the hardware set up to receive polygons, you just send "set render state" commands and vertex commands and it does everything automatically. The advantage of knowing how it works is that you get a better idea of how to optimize for the hardware. By measuring the timing and some internal registers, you can get an idea of how the hardware works.

For example, there's a register that you can use to measure how much RAM is used by polygon data. So you can send a single triangle and see that it takes up 84 bytes of space. Then you try sending a 2-triangle strip and it takes up 108 bytes of space, 24 bytes more. You can guess that the extra 24 bytes are the position (12 bytes), UVs (8 bytes), and color (4 bytes). So the formula for space used for a strip is probably 12 + 24 * vert_count bytes. You try a 3-triangle strip, and it takes 192 bytes. That's odd, the formula didn't work. Then you try a 4-triangle strip. That one follows the formula. Oh, that document on the GPU mentioned something about strip sizes of 1, 2, 4, and 6, but not 3. Maybe the rendering side of the hardware doesn't support 3-triangle strips. If you treat the 3-triangle strip as a 2-triangle and a 1-triangle strip, the formula works again.

You can figure out a lot by doing experiments like that.
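The guessed formula can be checked against the measured sizes with a few lines of Python. The function names are made up for this sketch, and the 12-byte header plus 24 bytes per vertex is the guess from the post, not a documented layout.

```python
HEADER_BYTES = 12  # guessed per-substrip overhead
VERTEX_BYTES = 24  # position (12) + UVs (8) + color (4)


def substrip_bytes(tris):
    """Guessed storage for one substrip; n triangles in a strip use n + 2 vertices."""
    return HEADER_BYTES + VERTEX_BYTES * (tris + 2)


def stored_bytes(substrips):
    """Total storage when a strip is stored as the given list of substrip sizes."""
    return sum(substrip_bytes(t) for t in substrips)


print(stored_bytes([1]))     # 84, matches the single triangle
print(stored_bytes([2]))     # 108, matches the 2-triangle strip
print(stored_bytes([3]))     # 132, what the formula predicts for 3 triangles
print(stored_bytes([2, 1]))  # 192, matches the measured 3-triangle strip
print(stored_bytes([4]))     # 156, the formula's prediction for 4 triangles
```

The only way the guess reproduces the measured 192 bytes is by treating the 3-triangle strip as two separate substrips, each paying its own 12-byte overhead.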

milk said:
Does it fill its input buffer completely before doing the bounding box tests or is that a parallel procedure?

It probably starts writing stuff to memory as soon as it either gets an end-of-strip vertex or the buffer fills up enough that you have a 6-triangle substrip ready. It would be tricky to measure exactly when it triggers a write, but I think it's pretty likely that's how it works. There's actually a "finish this list" command that you need to send once you've submitted everything, which causes it to flush some unwritten data before you render the list, but I'm not sure exactly what it writes.

milk said:
Are the substrips generated consistently across frames with models in different positions, or does the GPU try to do something clever to get optimal substrips there?

I think substrip generation was just "keep using the largest substrip size we can until we're done", so on-screen position doesn't really matter. You don't really need to do more than that to be optimal within a single strip. When you generate the "full" strips when creating an entire model, it's probably worth trying to avoid odd numbered strip lengths, so that it doesn't need to generate single triangle substrips.
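A minimal sketch of that greedy rule in Python, assuming the allowed substrip sizes are the 1, 2, 4, and 6 triangles mentioned earlier. The function name `split_strip` is hypothetical, and this models the guess in the post rather than confirmed hardware behavior.

```python
SUBSTRIP_SIZES = (6, 4, 2, 1)  # allowed sizes, largest first


def split_strip(tris):
    """Greedily split a strip of `tris` triangles into allowed substrips."""
    parts = []
    remaining = tris
    while remaining > 0:
        # Take the largest allowed size that still fits.
        size = next(s for s in SUBSTRIP_SIZES if s <= remaining)
        parts.append(size)
        remaining -= size
    return parts


for n in (3, 5, 8, 9, 16):
    print(n, split_strip(n))
```

Under this rule every odd length ends in a single-triangle substrip, while even lengths never produce one, which is why even strip lengths look preferable.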

milk said:
Is the gpu loading the next frame's strips into its geometry buffers as it is rasterizing the current frame in parallel?

The polygon data is double buffered, so while you're submitting data for one frame, it can also be rendering the previous frame.

milk said:
Does the dev get to ask it when to raster/load verts and when to not if he wants to?

I'm not sure I understand what you're getting at. Er, the hardware runs when you ask it to? You aren't forced to immediately start rendering when a list finishes, or submit polygons when rendering finishes.

To build a display list, you initialize some stuff in video RAM, set some registers to tell the hardware where you want the list written to, then write a register to tell it to get ready to accept polygon data. Then you feed it vertices, and when you're done, you tell it to finish the list. Then you get an interrupt when the list is ready to render.

To render a display list, you set some registers to tell it where the list is, then write another register to tell it to start. When it finishes, you get another interrupt.

Both of these can be happening at the same time.

milk · May 5, 2020

That answers all my questions. Great write up. I'm impressed you even remember some of this stuff so well today.

Cloofoofoo · May 5, 2020

Next, Shenmue 2. It was a pain to get the models, so I inspected even less than for part 1. The tools seem to be designed to work with the recent PC release, but I got them to work in a barebones way. The maps seem to be a lot bigger in Shenmue 2. They also seem to be denser polygon-wise, and Ryo's model seems to be slightly different from part 1. I forgot to note where I got the files for the mountain paths, but they are from the last disc.

Ryo Hazuki gameplay - 2,061 tris

Ryo Hazuki's high poly hand - right - 604 tris

Pier - warehouses - 61,057 tris

Lucky quarter - 57,181 tris

Winding mountain path - last disc - 193,188 tris

No idea where but it leads to a cave - 91,515 tris

xaeroxcore · May 6, 2020

Almost 200k tris in a DC game... Holy fuuuuuuck! Can't believe it! Are there PS2 or PSP games that can reach that level of geometry?

Cloofoofoo · May 9, 2020

Gonna post a PSP game now. I figure this is relevant because a lot of people consider the PSP and Dreamcast to be more or less on par with each other, so it will be interesting to see how a high-end PSP game stacks up. This is The 3rd Birthday by Square Enix. Square was always good with their art, and that's something the Dreamcast sorely lacked support on. The polygon counts for this game are actually super modest. If anything, it reminds me of Illbleed on the DC in that regard.

Main character - 1,582 tris

Main character in Lightning costume - 1,602 tris

NPC Hyde - 1,050 tris

Pistol - 50 tris

Assault rifle - 92 tris

Final boss platform - 31,947 tris

City street - 14,571 tris

Regular enemy - 600 tris

Final boss - 2,998 tris

Karamazov · May 9, 2020

xaeroxcore said:
Almost 200k tris in a DC game... Holy fuuuuuuck! Can't believe it! Are there PS2 or PSP games that can reach that level of geometry?

It's not all rendered on screen at the same time; that's a whole level.

There was Jak and Daxter 3, which was pushing more than 10 million polygons per second on screen.

Karamazov · May 9, 2020

found a nice article about poly count evolution in some PS games

https://blog.us.playstation.com/201...evolution-of-5-iconic-playstation-characters/

ultragpu · May 11, 2020

I think they made a mistake with Kratos's poly count in God of War 3. This article says 20k, as opposed to 64k:
https://www.playstationlifestyle.net/2010/02/27/god-of-war-iii-god-of-war-ii-comparison/
20k is much more likely for a PS3 game's poly budget and would better fit the 4x increase to the PS4's 80k Kratos model.
Assuming a typical new-gen increase, we should see a 320k-400k Kratos on PS5, and games like Detroit or Until Dawn might push to a 1 million poly main character just for giggles.

xaeroxcore · May 13, 2020

Cloofoofoo said:
Gonna post a PSP game now. I figure this is relevant because a lot of people consider the PSP and Dreamcast to be more or less on par with each other, so it will be interesting to see how a high-end PSP game stacks up. This is The 3rd Birthday by Square Enix. Square was always good with their art, and that's something the Dreamcast sorely lacked support on. The polygon counts for this game are actually super modest. If anything, it reminds me of Illbleed on the DC in that regard.

Main character - 1,582 tris

Main character in Lightning costume - 1,602 tris

NPC Hyde - 1,050 tris

Pistol - 50 tris

Assault rifle - 92 tris

Final boss platform - 31,947 tris

City street - 14,571 tris

Regular enemy - 600 tris

Final boss - 2,998 tris

That really shows how important good art direction is in a game... Those models look far better than their real polycount suggests, especially the character models. I can't believe those characters are less than 2000 tris. So, can we safely assume 3rd Birthday would be possible on Dreamcast 1:1?

TapamN · May 14, 2020

I found something really interesting. Someone on Sonic Retro ripped the models from the Genesis version of Virtua Racing! (I was planning to do this myself at some point...) For anyone who wants to look at them, the .OBJs are here.

Here are some poly counts:
1P Player's car body without shadow, no wheels: 87 tris
Player car wheel: 22 tris
The four wheels combined are 1 more triangle than the car body!
Player car shadow: 4 tris
1P Player's car total: 179 tris

Beginner course: 5816 tris
Medium course: 6584 tris
Expert course: 7712 tris

Super high poly credits sequence car: 520 tris body, 108 tris each wheel, 952 total

Judging by the credits sequence, it looks like the game does around 1000 tris per frame at 15 FPS. When they advertised it as doing 300-500 polys per frame, they must have meant quads.
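The totals above are easy to sanity-check in Python. The variable names are just for this sketch. All the counts are the ones listed in this post, and the per-second figure simply multiplies the estimated 1000 tris per frame by 15 FPS.

```python
# Triangle counts as listed in the post.
body, wheel, shadow = 87, 22, 4
car_total = body + 4 * wheel + shadow

credits_body, credits_wheel = 520, 108
credits_total = credits_body + 4 * credits_wheel

tris_per_frame, fps = 1000, 15
tris_per_second = tris_per_frame * fps
quads_per_frame = tris_per_frame // 2  # two triangles per quad

print(car_total, credits_total, tris_per_second, quads_per_frame)
```

The 500 quads per frame that falls out of this sits at the top of the advertised 300-500 range, which fits the reading that the advertised polys were quads.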

Karamazov · May 14, 2020

BILLIONS OF POLYGONS PER SECOND !

Cloofoofoo · May 14, 2020

TapamN said:
I found something really interesting. Someone on Sonic Retro ripped the models from the Genesis version of Virtua Racing! (I was planning to do this myself at some point...) For anyone who wants to look at them, the .OBJs are here.

Here are some poly counts:
1P Player's car body without shadow, no wheels: 87 tris
Player car wheel: 22 tris
The four wheels combined are 1 more triangle than the car body!
Player car shadow: 4 tris
1P Player's car total: 179 tris

Beginner course: 5816 tris
Medium course: 6584 tris
Expert course: 7712 tris

Super high poly credits sequence car: 520 tris body, 108 tris each wheel, 952 total

Judging by the credits sequence, it looks like the game does around 1000 tris per frame at 15 FPS. When they advertised it as doing 300-500 polys per frame, they must have meant quads.

Oh wow, that's awesome. I always wondered how this game performed. So 15,000 a second, huh. I wonder what this means for the 32X version, which has more detail and, from what I hear, runs at 20 fps. Hmm, I haven't checked PS1 games, but I always heard Vagrant Story did 3,000 polygons per frame at 30 fps. Virtua Racing on Genesis isn't doing too badly at all.

BTW TapamN, did you get activated at Sega-16? Or maybe I should repost that request.

Karamazov · May 14, 2020

Virtua Racing had an accelerator in the cartridge. It was so awesome at the time.