Fast software renderer

One way to think about register pressure is to consider the maximum amount of thread state possible while the ALUs still hide their own systematic latency. On GT200, 6 threads per multiprocessor are required to hide latency (register read-after-write); on RV770, 4 threads (clause switching).

So per thread:
  • GT200 - 64KB / (32 strands * 6 threads) ≈ 341 bytes
  • RV770 - 256KB / (64 strands * 4 threads) = 1024 bytes
On NVidia it's possible to get away with only 2 threads, but to maintain full throughput the code must be compiled so that no serially dependent instructions fall within 3 instructions of each other. One thread will work on NVidia, giving 2KB per strand, but I'm doubtful that will use more than half the ALU cycles available. One thread on ATI will only use half the ALU cycles (with additional clause-switch overhead), and with a fair bit of kludging by combining normal registers (1KB) and strand-shared registers (2KB) it would be possible to get 3KB.
:oops: Sigh, total brain fade on the strand-shared registers - that's completely wrong and basically should be ignored. Only 128 of these can be allocated.

So the best case for 1 thread on ATI is 2KB - really pointless to use 1 thread instead of 2, since 2 threads also get 2KB each.
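
A minimal sketch of the arithmetic above (the register file sizes, strand counts and thread counts are just the figures quoted in this post, nothing queried from real hardware):

[code]
#include <cstdio>

// Per-strand register budget when the register file is shared by
// (strands per thread) * (threads needed to hide ALU latency).
static unsigned bytesPerStrand(unsigned regFileBytes, unsigned strands, unsigned threads)
{
    return regFileBytes / (strands * threads);
}

int main()
{
    std::printf("GT200, 6 threads: %u bytes/strand\n", bytesPerStrand(64 * 1024, 32, 6));  // ~341
    std::printf("RV770, 4 threads: %u bytes/strand\n", bytesPerStrand(256 * 1024, 64, 4)); // 1024

    // Dropping to a single thread on NVidia yields 2KB per strand, at the cost of
    // no longer hiding ALU latency; on ATI the per-strand allocation limit caps it
    // at 2KB anyway (per the correction above), so 1 thread buys nothing there.
    std::printf("GT200, 1 thread:  %u bytes/strand\n", bytesPerStrand(64 * 1024, 32, 1));  // 2048
    return 0;
}
[/code]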

Jawed
 
I can see how Crysis at the highest quality level on a G98 has major parts of the chip sitting idle, waiting for a shader to complete. But on a well-proportioned RV770 or G98 or GT200? Come on...
At 40fps at 1920x1200 that's 92.2M pixels/s, while RV770 is capable of a fillrate of 12000M pixels/s. So the game is running at a headline rate of 0.77% utilisation - yep, I know, not meaningful because of multiple passes of various kinds per frame. Still, that's enough time for 130 full-resolution render passes per visible frame :oops: The numbers are even more absurd for GT200, equipped with 32 ROPs.

So, a fixed-function part of the GPU is designed to support 1920x1200 rendering at >5000fps, while high-end games are running at 20-100fps. Let's say there's 5x overdraw per pixel; that means the fixed-function hardware is, on average, ~20x bigger than it needs to be, if you want to talk die area.
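
A quick back-of-the-envelope check of those numbers (the 16-ROP, 750MHz figure behind RV770's 12000 Mpix/s peak and the 5x overdraw are just the assumptions stated above):

[code]
#include <cstdio>

int main()
{
    // 1920x1200 at 40fps versus RV770's peak fillrate of 16 ROPs * 750MHz.
    const double visiblePixelsPerSec = 1920.0 * 1200.0 * 40.0; // ~92.2 Mpix/s
    const double peakFillRate        = 16.0 * 750.0e6;         // 12000 Mpix/s

    const double utilisation = visiblePixelsPerSec / peakFillRate;
    std::printf("headline ROP utilisation:  %.2f%%\n", utilisation * 100.0); // ~0.77%
    std::printf("full-res passes per frame: %.0f\n", 1.0 / utilisation);     // ~130

    // At 5x overdraw the ROPs are still roughly 26x wider than the average
    // load needs - the same ballpark as the ~20x figure above.
    std::printf("overprovisioning at 5x overdraw: ~%.0fx\n", 1.0 / (5.0 * utilisation));
    return 0;
}
[/code]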

Yes, there's a whole pile of caveats: the API is frequently CPU-bound, bandwidth is limited, and absolute latency is key (raising the peak:average ratio required for performance).

Having to allocate die area for fixed-function units for the fastest possible case makes them hugely wasteful for the common cases.

Jawed
 
The ARM instruction set was so simple and powerful that you could actually enjoy writing assembler.
Over a time span of a decade I wrote dozens of graphical demos and algorithms including the original FQuake.

^^^^ What he said.

The ARM ISA was a wet dream back in the days of Z80 and 6502. I had the How To Write ARM ASM book before I had my Acorn Archimedes (and I had one of the first to be made).

I wrote my first ray-tracer in ARM machine code; it wasn't quite real-time, but a couple of seconds per frame rather than hours per frame made me feel good.

It's quite charming now to see folks here rave about this newly discovered ISA. :/

Anyway... enough of the black-and-white and sepia-tint, normal service resumes in this thread after this short message from our sponsors...

http://www.acornuser.com/acornuser/year6/issue61.html
 
Nothing that made it into the public domain. Just a school-kid having fun!

Ok.

One of the reasons I ported FQuake to PC is that I recently got Virtual Acorn RPC, which turns a PC into an emulated RPC running RISC OS 4.
It actually runs considerably faster than my real 200MHz StrongARM RPC.
Running the ARM FQuake under emulation doesn't look so great on a big screen, so this inspired me to port it and make it better.

I'm curious about the upcoming ARM smartbooks and the OSes they will run :)
 
I've just included a DX9 version.

This allows software rendering via SwiftShader.
Compared to the DX10 WARP version, performance is in the same league: slightly slower for the more 'complex' scenes, slightly faster for simple scenes.

Strangely the same can be said for pure GPU rendering, comparing DX9 to DX10.

(I've updated the d3dx9_42 and d3dx10_41 DLLs as they were 64-bit versions instead of 32-bit)
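
For anyone curious how the two software paths are typically selected, here's a rough illustration (not the demo's actual code): the DX10 build can request WARP explicitly as a driver type, while the DX9 build creates an ordinary device and ends up on SwiftShader simply because SwiftShader's drop-in d3d9.dll sits next to the executable.

[code]
#include <windows.h>
#include <d3d10.h> // D3D10_DRIVER_TYPE_WARP needs a Windows 7-era SDK
#include <d3d9.h>

// DX10 path: software rasterisation via WARP is an explicit driver type.
ID3D10Device* createWarpDevice()
{
    ID3D10Device* device = nullptr;
    D3D10CreateDevice(nullptr, D3D10_DRIVER_TYPE_WARP, nullptr,
                      0, D3D10_SDK_VERSION, &device);
    return device; // nullptr on failure
}

// DX9 path: nothing special in code. A normal device is created, and if a
// replacement d3d9.dll (SwiftShader) sits beside the .exe, that DLL is the
// one that gets loaded and does the rendering on the CPU.
IDirect3DDevice9* createD3D9Device(HWND hwnd)
{
    IDirect3D9* d3d = Direct3DCreate9(D3D_SDK_VERSION);
    if (!d3d) return nullptr;

    D3DPRESENT_PARAMETERS pp = {};
    pp.Windowed      = TRUE;
    pp.SwapEffect    = D3DSWAPEFFECT_DISCARD;
    pp.hDeviceWindow = hwnd;

    IDirect3DDevice9* device = nullptr;
    d3d->CreateDevice(D3DADAPTER_DEFAULT, D3DDEVTYPE_HAL, hwnd,
                      D3DCREATE_SOFTWARE_VERTEXPROCESSING, &pp, &device);
    return device; // nullptr on failure
}
[/code]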
 
After replacing my GTX280 with an HD5870, CPU rendering speed has increased from 750 Mpix/s to 800 Mpix/s. The ATI card seems to have a less CPU-intensive way of copying images from CPU to GPU memory.
CPU usage now seems to be well over 90%.
It's not so easy to measure actual CPU usage, as running the Task Manager performance monitor itself reduces rendering speed by more than 5%.
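
For context on where that CPU time goes: the frame produced by the CPU renderer has to be pushed into a GPU texture every frame, and the copy (plus whatever conversion the driver does afterwards) is pure CPU work. A hypothetical sketch of such an upload, assuming a D3D9-style dynamic texture and 32-bit pixels - the function and its parameters are illustrative, not the demo's actual code:

[code]
#include <windows.h>
#include <d3d9.h>
#include <cstring>

// Upload one software-rendered frame into a D3DUSAGE_DYNAMIC texture.
// 'pixels' is the CPU-side framebuffer, 'pitch' its row size in bytes.
bool uploadFrame(IDirect3DTexture9* tex, const void* pixels,
                 unsigned width, unsigned height, unsigned pitch)
{
    D3DLOCKED_RECT lr;
    // DISCARD lets the driver hand back fresh memory instead of stalling
    // until the GPU has finished reading the previous frame's contents.
    if (FAILED(tex->LockRect(0, &lr, nullptr, D3DLOCK_DISCARD)))
        return false;

    // Row-by-row copy into driver-owned memory: this memcpy, plus whatever
    // swizzling/conversion the driver does later, is where the CPU cost of
    // copying images from CPU to GPU memory shows up.
    const unsigned rowBytes = width * 4; // assuming 32-bit pixels
    for (unsigned y = 0; y < height; ++y)
        std::memcpy(static_cast<char*>(lr.pBits) + y * lr.Pitch,
                    static_cast<const char*>(pixels) + y * pitch,
                    rowBytes);

    tex->UnlockRect(0);
    return true;
}
[/code]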
 