When to expect ps 3.0 hardware?

Very interesting discussion... one thing to remember is that memory loads are relatively less expensive on P4s and Athlons compared to the P3, because both have more L1 cache bandwidth. I believe the P4's L1 can sustain a 128-bit load or 128-bit store per cycle, and the Athlon's is (semi) dual-ported. This could be important for SSE code.

Honestly I believe the issue with memory accesses from spill code is not latency (as Dio says, modern OOOE hardware can easily unroll tight loops and hoist the "hidden" loads in a load-execute instruction) but throughput. Register files are usually highly ported, so to make CISC-style code with lots of memory operands comparably efficient you have to have fast L1 caches.

In the case of the P3 the L1 cache's bandwidth is listed at 16 GB/s at 1 GHz. That means it can sustain at most one 64-bit load or store per cycle, which means every 128-bit SSE load will probably incur a stall cycle rather than being fully pipelined.
 
Dio said:
My immediate guess would be that the register-operand version has some kind of long dependency chain that doesn't exist in the memory-operand version.
I'm stupid... in the previous thread you warned me about uninitialized registers being denormal and such. Well that time initialization didn't make a difference, but this time... :oops: Now the results are:

Memory operands: 0.038 s
Register operands: 0.028 s

Which finally is according to expectations.
Yes, it is. But the loads are free, because they are not on the dependent path and so can be hoisted during the waits, and as you say the register is an 'internal' one that therefore eases your register pressure.
Well, the above timings show that the loads are not free. I'll test it on a Pentium 4 as soon as possible but I expect similar results. You're right that theoretically it can all be loaded in advance, but there are practical limits. Address calculations still require resources and can cause extra dependencies, and read-after-write dependencies through memory have large latencies. Also, things like texture lookups often block execution of further shader instructions, so you don't want that high-priority memory access to wait behind less important accesses. A good way to achieve that, aside from scheduling the texture read as soon as possible, is to minimize the number of low-priority memory accesses.
Does it? :) I'm not sure it would... of course, it might.
What I meant is, if shader registers that are located in memory get read a lot, you have a bigger performance loss because you spend so much time loading the same data over again. If it can stay in a physical register it's a lot more efficient especially if that register is reused a lot. Sorry for the confusion.
Yep, it's pretty limited if the code's working on 4D vectors, although both Pavlos and I think that 4D vectors are rare in pixel shader code - I generally see a mix of 3D vectors and scalars. It's not much better for 3D vectors, of course!
I agree, this is the biggest strength of SoA. But I also see many arguments against it so only really trying it will tell who's right. I'll keep you updated in September...
At last! :) But it's my fault - I should have posted this code example before. It took a similar example to knock me out of AoS into SoA, and the instant I implemented it I was a complete convert.
Yes, the example really opened my eyes that it's not so bad. In particular, my argument about having too few registers wasn't totally correct. Thanks a lot!
 
Well, the above timings show that the loads are not free. I'll test it on a Pentium 4 as soon as possible but I expect similar results

I expect it to change on a P4. One way to test this is to use the scalar SSE instructions to see if bandwidth from cache really is limiting.
 
IlleglWpns said:
Honestly I believe the issue with memory accesses from spill code is not latency (as Dio says, modern OOOE hardware can easily unroll tight loops and hoist the "hidden" loads in a load-execute instruction) but throughput. Register files are usually highly ported, so to make CISC-style code with lots of memory operands comparably efficient you have to have fast L1 caches.
I just received the results from my brother's Pentium 4 2400:

Memory operands: 1.891 s
Register operands: 1.375 s

(that's for 100 million iterations)

So that's again over 30% difference...
In the case of the P3 the L1 cache's bandwidth is listed at 16 GB/s at 1 GHz. That means it can sustain at most one 64-bit load or store per cycle, which means every 128-bit SSE load will probably incur a stall cycle rather than being fully pipelined.
The result for my Celeron 1200 confirms this: the 0.038 seconds is exactly 128 bits per two clock cycles.
 
Looking at the P4 optimization guide, SSE/MMX/FP reg-mem instructions have an additional six cycles of latency, as opposed to two. So it seems that my supposition that latency was not the problem may be wrong in this case. Honestly I don't know why the latency is so large on the P4.

Out of curiosity have you tried this code on an Athlon? I have a couple of machines here that I'd be willing to test for you.
 
IlleglWpns said:
I expect it to change on a P4. One way to test this is to use the scalar SSE instructions to see if bandwidth from cache really is limiting.
Well it didn't change with my brother's Pentium 4...

Coincidentally he has a 2.4 GHz processor and the code is 24 instructions run 100 million times, so the time would be exactly 1 second if it executed one instruction per clock cycle (not counting loop instructions). So the accurately measured 1.375 seconds is not that bad considering that SSE instructions consist of many micro-instructions on a Pentium 4. The only explanation I have for the 30% performance loss when using memory operands is that they add more micro-instructions. Even though these are independent of the other micro-instructions, there's a limit on the number of micro-instructions that can be issued per clock cycle, if I recall correctly.

If anyone has a better explanation please let me know!
 
IlleglWpns said:
Out of curiosity have you tried this code on an Athlon? I have a couple of machines here that I'd be willing to test for you.
Yes, a dorm mate was willing to test it on his Athlon XP 1900+:

Memory operands: 2.671 s
Register operands: 2.297 s

Only 16% difference here. Probably due to a lower extra latency for memory operands?
 
Out of the three architectures you're looking at the Athlon/Opteron is probably most suited to this "style" of coding. It has fast address generation and the load-execute instructions don't take up any additional space in the schedulers (in fact they increase decode bandwidth and code density).
 
You know, what's weird is that I can't figure out where the penalty is coming from.

Looking at this particular snippet that you pointed out earlier:

Code:
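; xmm0-xmm3 hold one channel each for four pixels (SoA); shader register r1 is kept in memory as r1_r/g/b/a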
mulps xmm0, [r1_r] 
mulps xmm1, [r1_g] 
mulps xmm2, [r1_b] 
mulps xmm3, [r1_a] 
mulps xmm0, [r1_r] 
mulps xmm1, [r1_g] 
mulps xmm2, [r1_b] 
mulps xmm3, [r1_a]

Let's say the first load starts on cycle x. Then the first mulps finishes on cycle x+11. The fifth load starts on cycle x+4 and finishes on cycle x+10. This is before the first mulps operation even completes, so it's clear that computational latency is the factor, not load latency (assuming this is on a P4).

At this point I'm stumped. Someone with more experience will have to comment.
 
That's why I said I'd expect a delta of under 5% in that case - Nick measured that one at 8% (the 30% is a different case).

I don't think you can quite do the analysis like that because you're assuming the processor is starting from a standing start with a clean reorder buffer - but then you should really factor in the other pipe stages as well. I think steady-state analysis is easier and certainly more useful for inner loops.

And for steady-state analysis I just count latencies. I found with the P4 that the reorder buffer is so large (considering the relatively slow execution times of SSE instructions) that fine-grained scheduling is irrelevant. I never found significant scheduling gains unless I could reorder code by 15 instructions or more - but shorten that crucial dependency chain by 1 mul and you're 6 cycles faster. Probably the most surprising discovery was that loads and stores really were free as long as they could be hoisted.

There are so few real assembler programmers left. :)
 
IlleglWpns said:
Out of the three architectures you're looking at the Athlon/Opteron is probably most suited to this "style" of coding. It has fast address generation and the load-execute instructions don't take up any additional space in the schedulers (in fact they increase decode bandwidth and code density).
Yes, the way it handles memory operands is nice, but overall SSE performance isn't that fantastic. It's nearly linear with clock frequency (1600 MHz) when comparing to my Celeron 1200. This was also confirmed by my demo, which also uses a lot of MMX.

I'm also curious about Prescott SSE performance. It should have the legacy FPU and SSE execution units separated. The latency of the instructions could drop, so there would be less slack to hide the memory reads. But that's blind speculation...

So although the Athlon XP is very suited for SoA and will clearly benefit from it, the Pentium 4 simply does more operations per second. Whether SoA will gain me anything could depend on the situation, but it certainly looks promising. My fingers are itching to start implementing it but I have to wait till September.
 
Premature optimization is the root of all evil, as Knuth said.
I don't think counting instructions and memory latencies is a good idea at this stage of development. High-level optimizations usually offer bigger benefits. For example, deferring the shading calculations can increase the speed by an order of magnitude in complex scenes. And if you do this by implementing a tile-based algorithm, the increased cache hit rate (color/z/stencil in the L2) can provide further improvements.

SoA vs AoS:
SoA offers free differential operators, while with AoS the shader state for each dsx/dsy instruction must be pushed/popped. Also, SoA introduces an overhead on the instructions inside a branch (not big if conditional moves are supported), to check if the pixel is active, while AoS pays a penalty for CPU branch mispredictions. It seems obvious to me that most shaders with many differential operators will execute faster with SoA and most shaders with long branches will execute faster with AoS. The problem is that high-quality shaders use both branches and differential operators extensively.
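
To make the SoA branch overhead concrete, here is a minimal sketch (my own illustration, not code from either renderer) of the usual mask-and-blend trick with SSE intrinsics - both sides of the branch are executed and a per-pixel mask selects the result:

Code:
#include <xmmintrin.h>

// SoA version of "if (x > 0) result = a; else result = b;" for four pixels at once.
// 'mask' has all bits set in the lanes where the condition holds.
static inline __m128 select4(__m128 mask, __m128 a, __m128 b)
{
    // (mask & a) | (~mask & b): both sides are computed, the mask picks per pixel.
    return _mm_or_ps(_mm_and_ps(mask, a), _mm_andnot_ps(mask, b));
}

int main()
{
    __m128 x    = _mm_set_ps(-1.0f, 2.0f, -3.0f, 4.0f);
    __m128 mask = _mm_cmpgt_ps(x, _mm_setzero_ps());      // per-pixel "x > 0"
    __m128 r    = select4(mask, _mm_set1_ps(1.0f), _mm_set1_ps(0.0f));

    float out[4];
    _mm_storeu_ps(out, r);   // lanes where x > 0 end up as 1.0, the rest as 0.0
    return 0;
}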

For my renderer I have decided to take a different approach for time-critical shaders (currently I'm using a custom virtual machine for every shader). I will convert them to ANSI C code and compile them to a DSO (dynamic shared object - a DLL in Windows lingo) using the platform's C compiler. This way I can let the C compiler perform the automatic vectorization, using optimal scheduling for the specific platform. SoftWire is probably a little bit faster for x86, especially with hand-optimized assembly, but it's not easily portable as far as I know. And the loops produced by the conversion to C can be trivially vectorized, so I don't think the compilers will have any problems doing the job. Maybe you should experiment with this approach too.
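
For the loading side, a minimal Linux sketch of the DSO approach could look like this (the file and function names are made up for the example):

Code:
#include <dlfcn.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical shader entry point exported by the generated C code.
typedef void (*ShadeFunc)(const float* input, float* output, int count);

int main()
{
    // 1. Compile the generated C source to a shared object (example names only).
    std::system("cc -O3 -shared -fPIC shader_gen.c -o shader_gen.so");

    // 2. Load the DSO and look up the entry point.
    void* handle = dlopen("./shader_gen.so", RTLD_NOW);
    if (!handle) { std::fprintf(stderr, "%s\n", dlerror()); return 1; }

    ShadeFunc shade = (ShadeFunc)dlsym(handle, "shade_main");
    if (!shade) { std::fprintf(stderr, "%s\n", dlerror()); return 1; }

    // 3. Call the compiled shader like any other function.
    float in[16] = {0}, out[16] = {0};
    shade(in, out, 4);

    dlclose(handle);
    return 0;
}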

About Small Triangles:
Small triangles are a major problem for any classic scanline rasterization algorithm, and not because of shading. The problem is that with small triangles you can't take advantage of the spatial coherence inside the triangle. The "triangle setup" cost (the cost of calculating the interpolation constants) cannot be amortized over many pixels, so the whole algorithm becomes sub-optimal even before the shading stage. Scanline interpolation simply doesn't make any sense when each triangle is going to be shaded/sampled only a few times.
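
To make the setup cost concrete, here is the per-triangle work for a single interpolant (an illustrative sketch, not code from either renderer):

Code:
struct Vertex { float x, y, z; };

// "Triangle setup" for one interpolated value (here z): the screen-space gradients.
// This fixed cost is paid once per triangle, however few pixels it covers.
void zGradients(const Vertex& a, const Vertex& b, const Vertex& c,
                float& dzdx, float& dzdy)
{
    float area = (b.x - a.x) * (c.y - a.y) - (c.x - a.x) * (b.y - a.y);
    dzdx = ((b.z - a.z) * (c.y - a.y) - (c.z - a.z) * (b.y - a.y)) / area;
    dzdy = ((c.z - a.z) * (b.x - a.x) - (b.z - a.z) * (c.x - a.x)) / area;
}

In a real rasterizer the same thing is done for every interpolated quantity (colors, texture coordinates, 1/w, ...), which is exactly the cost that cannot be amortized over a one-pixel triangle.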

The problem is solved with the REYES (Renders Everything You Ever Saw) architecture, which is based on flat-shaded sub-pixel micropolygons. I predict in the near future the hardware architectures will start to look like REYES. Otherwise they will be unable to efficiently shade the sea of polygons produced from the upcoming programmable tessellators.

Good luck with your exams.

The SRT Rendering Toolkit
 
Pavlos said:
Premature optimization is the root of all evil, as Knuth said.
It's not premature - I have been focussing on software rendering for the last three years. ;) I've done a lot of premature optimizations before, but now I've learned to keep optimizations separate from the design...
I don't think counting instructions and memory latencies is a good idea at this stage of development. High-level optimizations usually offer bigger benefits. For example, deferring the shading calculations can increase the speed by an order of magnitude in complex scenes. And if you do this by implementing a tile-based algorithm, the increased cache hit rate (color/z/stencil in the L2) can provide further improvements.
That's correct, but what if you already do this, or the optimizations don't influence the implementation? If you have 'algorithmic perfection' at the high level there is no other way to get more performance than to go low level. I know such a thing doesn't exist, but there comes a time when the high-level design isn't going to change much any more. Besides, it can all be abstracted a lot thanks to automatic register allocation. Assembly code becomes a lot more reusable because you don't have to worry about what gets stored where; it's as simple as a symbolic name in C++.
SoA offers free differential operators, while with AoS the shader state for each dsx/dsy instruction must be pushed/popped. Also, SoA introduces an overhead on the instructions inside a branch (not big if conditional moves are supported), to check if the pixel is active, while AoS pays a penalty for CPU branch mispredictions. It seems obvious to me that most shaders with many differential operators will execute faster with SoA and most shaders with long branches will execute faster with AoS. The problem is that high-quality shaders use both branches and differential operators extensively.
I agree, that's why I have to try both. But I'm getting more and more convinced that SoA will be the winner for a ps 3.0 implementation. Thanks to Dio! The only thing I haven't studied yet is the cost to convert from AoS to SoA and back...
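
For reference, the conversion itself is just a 4x4 transpose; a minimal sketch with SSE intrinsics (xmmintrin.h provides _MM_TRANSPOSE4_PS for exactly this):

Code:
#include <xmmintrin.h>

// Convert four AoS pixels {r,g,b,a} into SoA channel vectors {r0..r3}, {g0..g3}, ...
// in place. The macro expands to roughly eight unpack/shuffle instructions, and the
// same transpose converts back from SoA to AoS.
void aosToSoA(__m128& p0, __m128& p1, __m128& p2, __m128& p3)
{
    _MM_TRANSPOSE4_PS(p0, p1, p2, p3);
}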
For my renderer I have decided to take a different approach for time-critical shaders (currently I'm using a custom virtual machine for every shader). I will convert them to ANSI C code and compile them to a DSO (dynamic shared object - a DLL in Windows lingo) using the platform's C compiler. This way I can let the C compiler perform the automatic vectorization, using optimal scheduling for the specific platform.
Converting to C will be complex and slow. SoftWire uses run-time intrinsics, which are functions with the names of assembly instruction mnemonics. Once they are called, the corresponding machine code is generated with some simple lookup operations. There's no lexing, parsing or syntax checking, that's all done at compile-time by the C++ compiler (except for the shader code of course). Basically it's a shortcut to intermediate code. The main advantage of fast code generation is that I can have tens of specifically optimized shaders per frame. A good vectorizing C compiler is also rare and expensive.
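
As a toy illustration of the idea only (this is a simplified sketch, not the actual SoftWire interface), a run-time intrinsic is just an ordinary function named after a mnemonic that appends an encoding to a buffer when called:

Code:
#include <cstdint>
#include <vector>

// Toy sketch of the run-time intrinsics idea: each "intrinsic" appends its x86
// encoding to a code buffer. A real generator also handles memory operands, fixups
// and register allocation, and copies the buffer to executable memory before use.
class ToyAssembler
{
public:
    void movaps(int dst, int src) { emit(0x0F); emit(0x28); modRM(dst, src); }
    void mulps (int dst, int src) { emit(0x0F); emit(0x59); modRM(dst, src); }

    const std::vector<std::uint8_t>& code() const { return buffer; }

private:
    void emit(std::uint8_t b)   { buffer.push_back(b); }
    void modRM(int reg, int rm) { emit(std::uint8_t(0xC0 | (reg << 3) | rm)); }

    std::vector<std::uint8_t> buffer;
};

int main()
{
    ToyAssembler a;
    a.movaps(0, 4);   // movaps xmm0, xmm4
    a.mulps (0, 1);   // mulps  xmm0, xmm1
    return 0;         // a.code() now holds the encoded instruction bytes
}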
SoftWire is probably a little bit faster for x86, especially with hand-optimized assembly, but it's not easily portable as far as I know. And the loops produced by the conversion to C can be trivially vectorized, so I don't think the compilers will have any problems doing the job. Maybe you should experiment with this approach too.
I'm convinced that SoftWire makes it possible to create code very close to optimal. Translating shader instructions into SSE instructions is quite straightforward and you don't have to worry about the registers, so it's not much more difficult than writing C code (once you know SSE). Furthermore, the texture sample instructions can also be optimized a lot with MMX. Since those instructions are not as straightforward, a C vectorizer can't make full use of them. To port to another platform I just have to rewrite part of SoftWire and manually translate the shader instructions again; I'm sure I can do it in a few days. So, since I'm interested in every percentage of performance, I'm not interested in the C compiler approach. But of course it's very interesting for your project.
Small triangles are a major problem for any classic scanline rasterization algorithm, and not because of shading. The problem is that with small triangles you can't take advantage of the spatial coherence inside the triangle. The "triangle setup" cost (the cost of calculating the interpolation constants) cannot be amortized over many pixels, so the whole algorithm becomes sub-optimal even before the shading stage. Scanline interpolation simply doesn't make any sense when each triangle is going to be shaded/sampled only a few times.
Extra triangle setup is acceptable, as it's an 'expected' effect. Besides, I test if more than one pixel is covered before doing interpolation setup. But when we compare 2x2 block rendering to the classic scanline approach there's a huge difference when using tiny triangles. Anyway, with SoA it's only possible to render four pixels at once so there's no point trying to optimize it.
The problem is solved with the REYES (Renders Everything You Ever Saw) architecture, which is based on flat-shaded sub-pixel micropolygons. I predict in the near future the hardware architectures will start to look like REYES. Otherwise they will be unable to efficiently shade the sea of polygons produced from the upcoming programmable tessellators.
Well once all polygons are equal in size (a pixel) it's a constant cost of course. :D Not really an attractive solution yet...
Good luck with your exams.
Thanks!
 
Nick said:
Well once all polygons are equal in size (a pixel) it's a constant cost of course.
This is kind-of-OT but:

Developers really shouldn't allow polygons to get this small. Once polygons are so small that they aren't typically touching at least one pixel you can get geometry aliasing in the same way that you get texture aliasing - you end up rendering pseudorandomly sampled polygons and so can get unusual effects, notably around the edges of objects (and other places where geometry is close to edge-on to the viewpoint - as small polygons cluster in such places)...

Of course, this is a cost of approximating with polygons. If we can move to 'better' surface approximations this problem goes away (or at least moves into the hardware).
 
Not everything can be correctly modeled with curved surfaces (at least not without massively expanding the amount of data needed to describe the object).

John.
 
I don't necessarily mean curved surfaces. It might be subdivision, or something else, maybe even some method not currently popular.
 
I don't think you can quite do the analysis like that because you're assuming the processor is starting from a standing start with a clean reorder buffer - but then you should really factor in the other pipe stages as well. I think steady-state analysis is easier and certainly more useful for inner loops.

Yeah, I know that treating it like an in-order processor isn't quite right, but in this case I felt it was okay because

a) the inner loop was so large as to probably fill the scheduler (i.e. no internal loop unrolling), and

b) the code was essentially one large dependency chain (i.e. no real opportunity for the hardware to reorder anything).


There are so few real assembler programmers left

I'm more of a dabbler with an interest in computer architecture; this debate is mostly academic for me (but still very interesting!).

Anyway, can you estimate how large the P4's integer or FP scheduler is? I can't find this sort of information anywhere.
 
Even with view-dependent multiresolution surfaces it can be nice to dice stuff into triangles before rasterization. The consensus on the "right" surface seems to have been pretty constant over the years: subdivision surfaces with displacement mapping for detail, and normal (and preferably roughness) maps for shading.

When hardware starts supporting it, hopefully uptake will be faster than it was for normal maps. It should be, since high-detail modelling and automated simplification are already pretty widespread for normal mapping, and a lot of research on conversion to displaced subdivision surfaces has already been published.
 
Nick said:
That's correct, but what if you already do this, or the optimizations don't influence the implementation? If you have 'algorithmic perfection' at the high level there is no other way to get more performance than to go low level. I know such a thing doesn't exist, but there comes a time when the high-level design isn't going to change much any more.
Unless you have finished adding all the functionality in your engine/library/renderer, no matter how much time you have spent designing the whole thing, chances are that something will come up and invalidate part of your design/code. The whole issue with the partial derivatives, I think, is a good example. Maybe in some months you will decide to use a tile-based deferred renderer using the A-buffer/Z3 algorithm :)

Also, Intel will be introducing new vector instructions on every major processor release. The horizontal vector instructions of Prescott can easily invalidate any decision between SoA and AoS. Not to mention Apple’s AltiVec. Anyway, your project seems nearly finished, so probably you will not have any problems.
Nick said:
....There's no lexing, parsing or syntax checking, that's all done at compile-time by the C++ compiler (except for the shader code of course). Basically it's a shortcut to intermediate code. The main advantage of fast code generation is that I can have tens of specifically optimized shaders per frame....
Do you have any reason to recompile the shaders for each frame? Do you re-optimize the shader for every object or something??? Usually shaders are compiled at content creation time, so compilation speed is not an issue, and I think this can be the case with real-time rendering (correct me if I’m wrong).

Nick said:
A good vectorizing C compiler is also rare and expensive.
Actually, the Intel C/C++ compiler for GNU/Linux is free for non-commercial use. And I hope GCC will get decent vectorizing capabilities soon. When I finish with the high-level optimizations I will certainly make a test with SoftWire to see the speed difference for a real shader (I saw your tutorial and it's really very easy to use!). And you are right about the MMX-optimized sampling functions. Actually, the shader DLL will be calling many "hooks" in the renderer for hand-optimized functions like noise, texture sampling, ray tracing etc.

It's good to hear that you can easily port SoftWire to other platforms. I think portability is vital for any project. I don't want my code tied to a specific architecture or operating system. For example, Apple's G5 (aka IBM's PPC970) is an amazing (and too expensive :( ) platform for software rendering. Using a G5, Pixar was rendering an untouched frame from Finding Nemo at Siggraph in only 4 minutes!

The SRT Rendering Toolkit
 
Dio said:
Developers really shouldn't allow polygons to get this small. Once polygons are so small that they aren't typically touching at least one pixel you can get geometry aliasing in the same way that you get texture aliasing - you end up rendering pseudorandomly sampled polygons and so can get unusual effects, notably around the edges of objects (and other places where geometry is close to edge-on to the viewpoint - as small polygons cluster in such places)...

Huh. Hadn't thought about that before... :oops:

But--and correct me if I'm wrong--can't this effect be combated through the antialiasing techniques we already have? OK, multisampling would presumably have to use centroid sampling to avoid substantial color artifacts, but otherwise it should be just fine, right? And I'm presuming supersampling is the way offline REYES renderers handle the issue already.

Related question--do current engines that do automatic tessellation therefore take screen resolution into account when determining the proper LOD? (In a way quite analogous to how mipmap selection takes screen resolution into account?) If so, I suppose my assumption that geometry workloads remain constant across resolution changes is in fact often wrong... :?

And finally--why should this effect be fixed by moving to better surface descriptions, e.g. subdivision surfaces? It's not a matter of the underlying model not having the proper detail, but rather of undersampling. Or do you mean that the way geometry is currently specified lends itself more to the situation where ordinary FSAA doesn't help because the underlying surface description is for some reason not sub-pixel accurate?
 