Fast software renderer

One way to think about register pressure is to consider the maximum amount of thread state possible while the ALUs still hide their own systematic latency. On GT200, 6 threads per multiprocessor are required to hide latency (register read-after-write); on RV770, 4 threads (clause switching).

So per thread:
  • GT200 - 64KB / (32 strands * 6 threads) ≈ 341 bytes
  • RV770 - 256KB / (64 strands * 4 threads) = 1024 bytes
On NVidia it's possible to get away with only 2 threads, but to maintain full throughput the code must be compiled so that no serially dependent instructions fall within 3 instructions of each other. One thread will work on NVidia, giving 2KB per strand, but I'm doubtful that will use more than half the ALU cycles available. One thread on ATI will only use half the ALU cycles (with additional clause-switch overhead), and with a fair bit of kludging by combining normal registers (1KB) and strand-shared registers (2KB) it would be possible to get 3KB.
:oops: Sigh, total brain fade on the strand-shared registers - that's completely wrong and basically should be ignored. Only 128 of these can be allocated.

So the best case for 1 thread on ATI is 2KB - really pointless to use 1 thread instead of 2, since 2 threads also get 2KB each.
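
A minimal sketch of the arithmetic above (the register file sizes, strand counts and thread counts are just the figures quoted in this post, nothing queried from real hardware):

[code]
#include <cstdio>

// Per-strand register budget when the register file is shared by
// (strands per thread) * (threads needed to hide ALU latency).
static unsigned bytesPerStrand(unsigned regFileBytes, unsigned strands, unsigned threads)
{
    return regFileBytes / (strands * threads);
}

int main()
{
    std::printf("GT200, 6 threads: %u bytes/strand\n", bytesPerStrand(64 * 1024, 32, 6));  // ~341
    std::printf("RV770, 4 threads: %u bytes/strand\n", bytesPerStrand(256 * 1024, 64, 4)); // 1024

    // Dropping to a single thread on NVidia yields 2KB per strand, at the cost of
    // no longer hiding ALU latency; on ATI the per-strand allocation limit caps it
    // at 2KB anyway (per the correction above), so 1 thread buys nothing there.
    std::printf("GT200, 1 thread:  %u bytes/strand\n", bytesPerStrand(64 * 1024, 32, 1));  // 2048
    return 0;
}
[/code]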

Jawed
 
I can see how Crysis at the highest quality level on a G98 has major parts of the chip sitting idle, waiting for a shader to complete. But on a well-proportioned RV770 or G98 or GT200? Come on...
At 40fps at 1920x1200 that's 92.2M pixels/s, while RV770 is capable of a fillrate of 12000M pixels/s. So the game is running at a headline rate of 0.77% utilisation - yep, I know, not meaningful because of multiple passes of various kinds per frame. Still, that's enough time for 130 full-resolution render passes per visible frame :oops: The numbers are even more absurd for GT200, equipped with 32 ROPs.

So, a fixed-function part of the GPU is designed to support 1920x1200 rendering at >5000fps, while high-end games are running at 20-100fps. Let's say there's 5x overdraw per pixel; that means the fixed-function hardware is, on average, ~20x bigger than it needs to be, if you want to talk die area.
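
A quick back-of-the-envelope check of those numbers (the 16-ROP, 750MHz figure behind RV770's 12000 Mpix/s peak and the 5x overdraw are just the assumptions stated above):

[code]
#include <cstdio>

int main()
{
    // 1920x1200 at 40fps versus RV770's peak fillrate of 16 ROPs * 750MHz.
    const double visiblePixelsPerSec = 1920.0 * 1200.0 * 40.0; // ~92.2 Mpix/s
    const double peakFillRate        = 16.0 * 750.0e6;         // 12000 Mpix/s

    const double utilisation = visiblePixelsPerSec / peakFillRate;
    std::printf("headline ROP utilisation:  %.2f%%\n", utilisation * 100.0); // ~0.77%
    std::printf("full-res passes per frame: %.0f\n", 1.0 / utilisation);     // ~130

    // At 5x overdraw the ROPs are still roughly 26x wider than the average
    // load needs - the same ballpark as the ~20x figure above.
    std::printf("overprovisioning at 5x overdraw: ~%.0fx\n", 1.0 / (5.0 * utilisation));
    return 0;
}
[/code]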

Yes, there's a whole pile of caveats: the API is frequently CPU-bound, bandwidth is limited, and absolute latency is key (raising the peak:average ratio required for performance).

Having to allocate die area for fixed-function units for the fastest possible case makes them hugely wasteful for the common cases.

Jawed
 
The ARM instruction set was so simple and powerful that you could actually enjoy writing assembler.
Over a time span of a decade I wrote dozens of graphical demos and algorithms including the original FQuake.

^^^^ What he said.

The ARM ISA was a wet dream back in the days of Z80 and 6502. I had the How To Write ARM ASM book before I had my Acorn Archimedes (and I had one of the first to be made).

I wrote my first ray-tracer in ARM machine code; it wasn't quite real-time, but a couple of seconds per frame rather than hours per frame made me feel good.

It's quite charming now to see folks here rave about this newly discovered ISA. :/

Anyway... enough of the black-and-white and sepia-tint, normal service resumes in this thread after this short message from our sponsors...

http://www.acornuser.com/acornuser/year6/issue61.html
 
Nothing that made it into the public domain. Just a school-kid having fun!

Ok.

One of the reasons I ported FQuake to PC is that I recently got Virtual Acorn RPC, which turns a PC into an emulated RPC running RISC OS 4.
It actually runs considerably faster than my real 200MHz StrongARM RPC.
Running the ARM FQuake under emulation doesn't look so great on a big screen, so this inspired me to port it and make it better.

I'm curious about the upcoming ARM smartbooks and the OSes they will run :)
 
I've just included a DX9 version.

This allows software rendering via SwiftShader.
Compared to the DX10 WARP version, performance is in the same league: slightly slower for the more 'complex' scenes, slightly faster for simple scenes.

Strangely the same can be said for pure GPU rendering, comparing DX9 to DX10.

(I've updated the d3dx9_42 and d3dx10_41 DLLs as they were 64-bit versions instead of 32-bit)
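
For anyone curious how the two software paths are typically selected, here's a rough illustration (not the demo's actual code): the DX10 build can request WARP explicitly as a driver type, while the DX9 build creates an ordinary device and ends up on SwiftShader simply because SwiftShader's drop-in d3d9.dll sits next to the executable.

[code]
#include <windows.h>
#include <d3d10.h> // D3D10_DRIVER_TYPE_WARP needs a Windows 7-era SDK
#include <d3d9.h>

// DX10 path: software rasterisation via WARP is an explicit driver type.
ID3D10Device* createWarpDevice()
{
    ID3D10Device* device = nullptr;
    D3D10CreateDevice(nullptr, D3D10_DRIVER_TYPE_WARP, nullptr,
                      0, D3D10_SDK_VERSION, &device);
    return device; // nullptr on failure
}

// DX9 path: nothing special in code. A normal device is created, and if a
// replacement d3d9.dll (SwiftShader) sits beside the .exe, that DLL is the
// one that gets loaded and does the rendering on the CPU.
IDirect3DDevice9* createD3D9Device(HWND hwnd)
{
    IDirect3D9* d3d = Direct3DCreate9(D3D_SDK_VERSION);
    if (!d3d) return nullptr;

    D3DPRESENT_PARAMETERS pp = {};
    pp.Windowed      = TRUE;
    pp.SwapEffect    = D3DSWAPEFFECT_DISCARD;
    pp.hDeviceWindow = hwnd;

    IDirect3DDevice9* device = nullptr;
    d3d->CreateDevice(D3DADAPTER_DEFAULT, D3DDEVTYPE_HAL, hwnd,
                      D3DCREATE_SOFTWARE_VERTEXPROCESSING, &pp, &device);
    return device; // nullptr on failure
}
[/code]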
 
After replacing my GTX280 with an HD5870, CPU rendering speed has increased from 750 Mpix/s to 800 Mpix/s. The ATI card seems to have a less CPU-intensive way of copying images from CPU to GPU memory.
CPU usage now seems to be well over 90%.
It's not so easy to measure actual CPU usage, as running the Task Manager performance monitor itself reduces rendering speed by more than 5%.
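
For context on where that CPU time goes: the frame produced by the CPU renderer has to be pushed into a GPU texture every frame, and the copy (plus whatever conversion the driver does afterwards) is pure CPU work. A hypothetical sketch of such an upload, assuming a D3D9-style dynamic texture and 32-bit pixels - the function and its parameters are illustrative, not the demo's actual code:

[code]
#include <windows.h>
#include <d3d9.h>
#include <cstring>

// Upload one software-rendered frame into a D3DUSAGE_DYNAMIC texture.
// 'pixels' is the CPU-side framebuffer, 'pitch' its row size in bytes.
bool uploadFrame(IDirect3DTexture9* tex, const void* pixels,
                 unsigned width, unsigned height, unsigned pitch)
{
    D3DLOCKED_RECT lr;
    // DISCARD lets the driver hand back fresh memory instead of stalling
    // until the GPU has finished reading the previous frame's contents.
    if (FAILED(tex->LockRect(0, &lr, nullptr, D3DLOCK_DISCARD)))
        return false;

    // Row-by-row copy into driver-owned memory: this memcpy, plus whatever
    // swizzling/conversion the driver does later, is where the CPU cost of
    // copying images from CPU to GPU memory shows up.
    const unsigned rowBytes = width * 4; // assuming 32-bit pixels
    for (unsigned y = 0; y < height; ++y)
        std::memcpy(static_cast<char*>(lr.pBits) + y * lr.Pitch,
                    static_cast<const char*>(pixels) + y * pitch,
                    rowBytes);

    tex->UnlockRect(0);
    return true;
}
[/code]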
 