Pavlos said:
Premature optimization is the root of all evil, as Knuth said.
It's not premature; I have been focusing on software rendering for the last three years.
I've done a lot of premature optimizations before, but now I've learned to keep optimizations separate from the design...
I don't think counting instructions and memory latencies is a good idea at this stage of development. High-level optimizations usually offer bigger benefits. For example, deferring the shading calculations can increase speed by an order of magnitude in complex scenes. And if you do this by implementing a tile-based algorithm, the increased cache hit rate (color/z/stencil staying in L2) can provide further improvements.
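To sketch the shape of the idea (every name below is hypothetical; it's the structure that matters): rasterize a tile depth-only first, remember the winning fragment per pixel, then shade each covered pixel exactly once while color and z are still hot in the cache.

    // Sketch of tile-based deferred shading; all types and names are
    // illustrative, not from a real renderer.
    const int TILE = 32;                      // sized so z + color stay in L2

    struct Fragment { float z; int shader; float u, v; };

    struct TileBuffer
    {
        float    z[TILE][TILE];               // assumed cleared to 1.0f (far plane)
        Fragment frag[TILE][TILE];            // visible fragment per pixel
        unsigned color[TILE][TILE];
    };

    unsigned runShader(const Fragment& f);    // hypothetical shader dispatch

    // Pass 1: called per rasterized pixel; depth test only, no shading.
    void visibility(TileBuffer& t, int x, int y, const Fragment& f)
    {
        if (f.z < t.z[y][x]) { t.z[y][x] = f.z; t.frag[y][x] = f; }
    }

    // Pass 2: shade each covered pixel once, regardless of overdraw depth.
    void shadeTile(TileBuffer& t)
    {
        for (int y = 0; y < TILE; y++)
            for (int x = 0; x < TILE; x++)
                if (t.z[y][x] < 1.0f)
                    t.color[y][x] = runShader(t.frag[y][x]);
    }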
That's correct, but what if you already do this, or the optimizations don't influence the implementation? If you have 'algorithmic perfection' at the high level, there is no other way to get more performance than to go low-level. I know such a thing doesn't exist, but there comes a time when the high-level design isn't going to change much any more. Besides, it can all be abstracted a lot thanks to automatic register allocation. Assembly code becomes a lot more reusable because you don't have to worry about what gets stored where; a register is as simple as a symbolic name in C++.
SoA offers free differential operators, while with AoS the shader state for each dsx/dsy instruction must be pushed/popped. Also, SoA introduces an overhead on the instructions inside a branch (not big if conditional moves are supported) to check whether each pixel is active, while AoS pays a penalty for CPU branch mispredictions. It seems obvious to me that most shaders with many differential operators will execute faster with SoA, and most shaders with long branches will execute faster with AoS. The problem is that high-quality shaders make extensive use of both branches and differential operators.
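To make the branch trade-off concrete, here is the same conditional written both ways (an illustrative sketch, not code from either renderer): SoA executes both sides for the whole quad and blends by a mask, AoS takes a real CPU branch per pixel.

    #include <xmmintrin.h>

    // SoA: each __m128 holds one component of four pixels.
    __m128 select(__m128 mask, __m128 a, __m128 b)
    {
        return _mm_or_ps(_mm_and_ps(mask, a), _mm_andnot_ps(mask, b));
    }

    __m128 shadeSoA(__m128 x)
    {
        __m128 mask = _mm_cmpgt_ps(x, _mm_setzero_ps());  // per-pixel condition
        __m128 thenSide = _mm_mul_ps(x, x);               // both sides always run
        __m128 elseSide = _mm_sub_ps(_mm_setzero_ps(), x);
        return select(mask, thenSide, elseSide);          // masked move, no branch
    }

    // AoS: one pixel at a time; cheap when the branch predicts well,
    // a misprediction penalty when it doesn't.
    float shadeAoS(float x)
    {
        if (x > 0.0f) return x * x;
        return -x;
    }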
I agree; that's why I have to try both. But I'm getting more and more convinced that SoA will be the winner for a ps 3.0 implementation. Thanks to Dio! The only thing I haven't studied yet is the cost of converting from AoS to SoA and back...
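For what it's worth, the conversion itself should be just a 4x4 transpose per quad; with SSE the standard _MM_TRANSPOSE4_PS macro does it in a handful of shuffle/unpack operations (this is the generic technique; I haven't measured it in context yet):

    #include <xmmintrin.h>

    // In:  p0..p3 each hold one pixel's (r,g,b,a)             -> AoS
    // Out: p0 = four r's, p1 = four g's, p2 = b's, p3 = a's   -> SoA
    void aosToSoA(__m128& p0, __m128& p1, __m128& p2, __m128& p3)
    {
        _MM_TRANSPOSE4_PS(p0, p1, p2, p3);
    }

    // The same transpose converts back from SoA to AoS.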
For my renderer I have decided to take a different approach for time-critical shaders (currently I'm using a custom virtual machine for every shader). I will convert them to ANSI C code and compile them to a DSO (dynamic shared object; a DLL in Windows lingo) using the platform's C compiler. This way I can let the C compiler perform the automatic vectorization, with optimal scheduling for the specific platform.
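In rough strokes the pipeline looks like this (file names, compiler flags and the entry-point name are all illustrative):

    #include <cstdio>
    #include <cstdlib>
    #include <dlfcn.h>

    typedef void (*ShaderFn)(float* out, const float* in, int count);

    ShaderFn compileShader(const char* cSource)
    {
        FILE* f = fopen("/tmp/shader.c", "w");          // 1. dump generated C
        fputs(cSource, f);
        fclose(f);

        // 2. let the platform compiler vectorize and schedule the loops
        system("cc -O3 -shared -fPIC -o /tmp/shader.so /tmp/shader.c");

        void* dso = dlopen("/tmp/shader.so", RTLD_NOW); // 3. load the DSO
        return (ShaderFn)dlsym(dso, "shader_main");     //    entry name assumed
    }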
Converting to C will be complex and slow. SoftWire uses run-time intrinsics: functions with the names of assembly instruction mnemonics. When they are called, the corresponding machine code is generated with a few simple lookup operations. There's no lexing, parsing or syntax checking; that's all done at compile time by the C++ compiler (except for the shader code, of course). Basically it's a shortcut to intermediate code. The main advantage of fast code generation is that I can have tens of specifically optimized shaders per frame. Good vectorizing C compilers are also rare and expensive.
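To give a flavor of the mechanism (a much-simplified sketch, not SoftWire's actual interface): each intrinsic call just looks up an encoding and appends bytes, so 'compiling' a shader is nothing more than running ordinary C++.

    #include <vector>

    enum Xmm { xmm0, xmm1, xmm2, xmm3 };

    class Assembler
    {
        std::vector<unsigned char> code;

        void emit(unsigned char op, Xmm dst, Xmm src)
        {
            code.push_back(0x0F);
            code.push_back(op);
            code.push_back(0xC0 | (dst << 3) | src);    // ModRM: reg, reg
        }

    public:
        void mulps(Xmm dst, Xmm src) { emit(0x59, dst, src); }
        void addps(Xmm dst, Xmm src) { emit(0x58, dst, src); }
        const std::vector<unsigned char>& buffer() const { return code; }
    };

    // A shader instruction 'mul r0, r0, r1' becomes a single call:
    //     Assembler x86;
    //     x86.mulps(xmm0, xmm1);   // bytes 0F 59 C1 appended on the spot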
SoftWire is probably a little faster for x86, especially with hand-optimized assembly, but it's not easily portable as far as I know. And the loops produced by the conversion to C can be trivially vectorized, so I don't think the compilers will have any problem doing the job. Maybe you should experiment with this approach too.
I'm convinced that SoftWire makes it possible to create code very close to optimal. Translating shader instructions into SSE instructions is quite straightforward, and you don't have to worry about the registers. So it's not much more difficult than writing C code (once you know SSE). Furthermore, the texture sampling instructions can be optimized a lot with MMX. Since these instructions are not as straightforward, a C vectorizer can't make full use of them. To port to another platform I just have to rewrite part of SoftWire and manually translate the shader instructions again; I'm sure I could do it in a few days. So, since I'm interested in every percentage of performance, I'm not interested in the C compiler approach. But of course it's very interesting for your project.
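Here's the kind of MMX trick I mean for texture sampling, one a C vectorizer is unlikely to find on its own (a generic example, not my actual sampler): lerping two packed ARGB8 texels with 16-bit multiplies.

    #include <mmintrin.h>

    // result = (t0 * (256 - f) + t1 * f) / 256 per channel, f in [0, 255].
    // All intermediates stay within unsigned 16 bits, so the logical
    // shift at the end is exact.
    unsigned lerpTexels(unsigned t0, unsigned t1, int f)
    {
        __m64 zero = _mm_setzero_si64();
        __m64 a = _mm_unpacklo_pi8(_mm_cvtsi32_si64(t0), zero);  // 8 -> 16 bit
        __m64 b = _mm_unpacklo_pi8(_mm_cvtsi32_si64(t1), zero);

        __m64 fw = _mm_set1_pi16((short)f);
        __m64 gw = _mm_set1_pi16((short)(256 - f));

        __m64 r = _mm_add_pi16(_mm_mullo_pi16(a, gw), _mm_mullo_pi16(b, fw));
        r = _mm_srli_pi16(r, 8);                                 // / 256

        unsigned out = _mm_cvtsi64_si32(_mm_packs_pu16(r, zero));
        _mm_empty();                    // restore the FPU state after MMX
        return out;
    }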
Small triangles are a major problem for any classic scanline rasterization algorithm, and not because of shading. The problem is that with small triangles you can't take advantage of the spatial coherence inside the triangle. The “triangle setup” cost (the cost of calculating the interpolation constants) cannot be amortized over many pixels, so the whole algorithm becomes sub-optimal even before the shading stage. Scanline interpolation simply doesn't make sense when each triangle is only going to be shaded/sampled a few times.
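Written out, the fixed cost per triangle is the usual plane-equation solve per interpolant (generic setup code, just to show what fails to amortize):

    // Gradients of one interpolant a over the triangle (x0,y0)..(x2,y2),
    // from the plane a(x,y) = a0 + dadx*(x-x0) + dady*(y-y0).
    struct Gradients { float dadx, dady; };

    Gradients setup(float x0, float y0, float a0,
                    float x1, float y1, float a1,
                    float x2, float y2, float a2)
    {
        float det = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0);
        float inv = 1.0f / det;                  // one divide per triangle
        Gradients g;
        g.dadx = ((a1 - a0) * (y2 - y0) - (a2 - a0) * (y1 - y0)) * inv;
        g.dady = ((a2 - a0) * (x1 - x0) - (a1 - a0) * (x2 - x0)) * inv;
        return g;
    }

    // A one-pixel triangle pays this whole chain, per interpolant, to
    // produce a single sample; a 100-pixel triangle pays it once.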
Extra triangle setup is acceptable, as it's an 'expected' effect. Besides, I test whether more than one pixel is covered before doing the interpolation setup. But when we compare 2x2 block rendering to the classic scanline approach, there's a huge difference with tiny triangles. Anyway, with SoA it's only possible to render four pixels at once, so there's no point in trying to optimize it further.
The problem is solved by the REYES (Renders Everything You Ever Saw) architecture, which is based on flat-shaded sub-pixel micropolygons. I predict that in the near future hardware architectures will start to look like REYES. Otherwise they will be unable to efficiently shade the sea of polygons produced by the upcoming programmable tessellators.
Well, once all polygons are equal in size (a pixel), it's a constant cost of course.
Not really an attractive solution yet...
Good luck with your exams.
Thanks!