When to expect ps 3.0 hardware?

Dave H said:
Related question--do current engines that do automatic tessellation therefore take screen resolution into account when determining the proper LOD? (In quite analogous way to how mipmap selection takes screen resolution into account?) If so, I suppose my assumption that geometry workloads remain constant with resolution changes is in fact often wrong... :?

You could do this, but it wouldn't help much against aliasing once you reach the 1 pixel/polygon ratio. You would need some sort of 3D anisotropic filtering :)
 
Pavlos said:
And I hope GCC will get decent vectorizing capabilities soon.

Since you are aware of the platform, I guess you are aware of IBM/Apple's efforts in this area. There could be some problems with getting it adopted into gcc, but given that gcc doesn't really understand the instruction grouping that is central to 970 scheduling, I hope the powers that be will be sympathetic towards the auto-vectorizing efforts. I'm sure you know where to hunt for the pertinent info.

Entropy
 
They are building support for vectorization into the tree-SSA branch, which seems set to be the future of GCC (unfortunately ... from a technical standpoint it seems inferior to LLVM).
 
Pavlos said:
Unless you have finished adding all the functionality in your engine/library/renderer, no matter how much time you have spent designing the whole thing, chances are that something will come up and invalidate part of your design/code. The whole issue with the partial derivatives, I think, is a good example. Maybe in some months you will decide to use a tile-based deferred renderer using the A-buffer/Z3 algorithm :)
I'm not finished adding functionality, but I already have a complete list of things to be added which I try not to touch any more. So there should only be small design changes (like those needed for dsx/dsy).
Also, Intel will be introducing new vector instructions on every major processor release. The horizontal vector instructions of Prescott can easily invalidate any decision between SoA and AoS. Not to mention Apple’s AltiVec. Anyway, your project seems nearly finished, so probably you will not have any problems.
Well, the conditional run-time compilation makes it very easy to optimize specifically for one processor. For the switch to SoA some more work is required, but it all remains nicely abstract, unlike the mess that hard-coded assembly brings.
Do you have any reason to recompile the shaders for each frame? Do you re-optimize the shader for every object or something??? Usually shaders are compiled at content creation time, so compilation speed is not an issue, and I think this can be the case with real-time rendering (correct me if I’m wrong).
Well, the first example that pops into my head is optimizing minification and magnification filtering in the fixed-function pipeline. For magnification a bilinear filter is sufficient, since there are no higher-resolution mipmaps. For minification a trilinear filter is desired. The usual way to implement this is to check per pixel whether there is minification or magnification.

The way hardware handles this is simple, a handful of transistors decide whether to use bilinear or trilinear filtering, at virtually no extra cost. In software it's less advantageous because the cost of the check (and possible mispredictions) could in total be higher than what can be won by bilinear filtering. So most software renderers just do the full trilinear.

My method is to generate versions of the fixed function pipeline specifically for bilinear and trilinear filtering. If a triangle has magnification at all vertices, it can be rendered with the bilinear version, else I use the trilinear version. For the fixed function pipeline the list of specializations that can be done this way is endless (I cache the last 16 combinations of render settings).

Another example is that with bilinear filtering, if the mipmap LOD is the same at every vertex, then no mipmap coefficient has to be interpolated and recomputed. For shaders similar things are possible, because the application can switch certain render states while keeping the same shader. The conclusion is that it wouldn't have been possible to do this if I couldn't compile several shaders per frame...
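The per-triangle specialization decision described above can be sketched like this (names and the LOD convention are mine for illustration, not SoftWire's actual API):

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical sketch of the per-triangle filter selection.
// Convention assumed here: LOD <= 0 at a vertex means magnification.
enum class Filter { Bilinear, Trilinear };

// If every vertex of the triangle is magnified, the cheaper bilinear
// specialization can be used; otherwise fall back to trilinear.
Filter selectFilter(float lod0, float lod1, float lod2) {
    float maxLod = std::max({lod0, lod1, lod2});
    return (maxLod <= 0.0f) ? Filter::Bilinear : Filter::Trilinear;
}
```

The check runs once per triangle instead of once per pixel, which is what makes the specialization cheaper than the per-pixel branch.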
It's good to hear that you can easily port SoftWire to other platforms. I think portability is vital for any project. I don't want my code tied to a specific architecture or operating system. For example, Apple's G5 (aka IBM's PPC970) is an amazing (and too expensive :( ) platform for software rendering. Using a G5, Pixar was rendering at Siggraph an untouched frame from Finding Nemo in only 4 minutes!
Sweet! The most abstract I can go with SoftWire is to make it like a C language with vector types. For example:
float4 a = ...;
float4 b = ...;
a += b;
If float4 is a SoftWire class which creates the corresponding code for the operations performed on it (using operator overloading), you get the ultimate in portability, readability and performance.
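A minimal compilable sketch of that idea, with the operators computing directly; in the SoftWire version the same overloads would emit SSE or AltiVec instructions instead of doing the arithmetic immediately:

```cpp
#include <cassert>

// Plain struct computing directly; a code-generating version would emit
// the corresponding vector instructions from these same overloads.
struct float4 {
    float x, y, z, w;
    float4& operator+=(const float4& b) {
        x += b.x; y += b.y; z += b.z; w += b.w;
        return *this;
    }
};
```

Because the client code only sees `float4` and its operators, swapping the direct implementation for a code-generating one doesn't change a line of the caller.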

Any sponsors for a Mac version? ;)
 
Nick said:
Well the first example that pops into my head is ...
Very clever optimizations!
Nick said:
Sweet! The most abstract I can go with SoftWire is to make it like a C-language with vector types....

That would be very interesting. AltiVec is clearly superior to MMX/SSE/SSE2/SSE3. Each instruction has 4 operands: two source registers, one destination register and one filter/modifier register for write masks and arbitrary permutations. And with 32 128-bit registers, this architecture is made to run shading code :). Probably you can't expose the filter/modifier register with an abstract C-like language, because Intel processors need several instructions to emulate this functionality. But you can use it under the hood to accelerate the AltiVec implementation. And preferably you must also offer a way to choose between AoS and SoA.

The SRT Rendering Toolkit
 
Pavlos said:
That would be very interesting. AltiVec is clearly superior to MMX/SSE/SSE2/SSE3. Each instruction has 4 operands: two source registers, one destination register and one filter/modifier register for write masks and arbitrary permutations. And with 32 128-bit registers, this architecture is made to run shading code :).
That's impressive! So it's clearly very well suited for AoS.
Probably, you can’t expose the filter/modifier register with an abstract C-like language because Intel processors need several instructions to emulate this functionality. But you can use it under the hood to accelerate the AltiVec implementation. And preferably you must also offer a way to choose between AoS and SoA.
I can't see why that would be a problem? The SSE implementation would need extra instructions when using swizzles or write masks, that's all.
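To illustrate the "extra instructions" point: a write mask that AltiVec applies in a single vsel instruction takes an and/andnot/or sequence on SSE hardware of that era. A scalar sketch of that select (illustrative, not taken from either implementation):

```cpp
#include <cassert>
#include <cstdint>

// Scalar sketch of a write mask: bits come from 'a' where the mask is 1,
// and from 'b' where it is 0. AltiVec does this in one vsel; emulating it
// with SSE mirrors the three-operation and/andnot/or sequence shown here.
uint32_t select_bits(uint32_t a, uint32_t b, uint32_t mask) {
    return (a & mask) | (b & ~mask);
}
```

So the abstraction still maps to both architectures; the SSE path just pays a few more instructions per masked write.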

Choosing between AoS and SoA is only possible by writing the two versions. But once you've done that it should work with AltiVec and SSE...
 
Nick said:
I can't see why that would be a problem? The SSE implementation would need extra instructions when using swizzles or write masks, that's all.

I just think something like that would conflict with the reason assembly exists: optimizing by writing directly to the metal.

I think a better idea is to expose directly the instruction set of the underlying hardware and let the "third party" programmer do the abstraction for the necessary platforms. D3D already tries to do an assembly abstraction of a vector language and I think it fails. Hardware architectures are too different to be abstracted at a low level without sacrificing the flexibility and the performance of every platform. ARB realized this and opted for an abstraction on a higher level with GLslang. And the RenderMan people realized this decades ago. That's why I believe letting the compiler do the automatic vectorization of a high-level C representation of the shader is a good idea (but I have to test the performance also. I don't expect more than a 10-15% penalty, but I understand this is unacceptable in your case).

To summarize: there is no such thing as "abstract assembly". A cross-platform vector library would be nice and interesting, but don't try to mix in low-level characteristics like swizzles and write masks. That's my opinion.

The SRT Rendering Toolkit
 
Pavlos said:
I just think something like that would conflict with the reason assembly exists: optimizing by writing directly to the metal.
Not really. If you regard it as an intermediate vector code representation, you see you still have all options open. You can write a peephole optimizer and a scheduler that can easily beat hand-written assembly. Actually the only advantage it gives is that you can write formulas and let the C++ compiler figure out the evaluation order, but it's still extremely close to writing assembly. I'm certainly going to experiment with this!
I think a better idea is to expose directly the instruction set of the underlying hardware and let the "third party" programmer do the abstraction for the necessary platforms. D3D already tries to do an assembly abstraction of a vector language and I think it fails. Hardware architectures are too different to be abstracted at a low level without sacrificing the flexibility and the performance of every platform. ARB realized this and opted for an abstraction on a higher level with GLslang. And the RenderMan people realized this decades ago. That's why I believe letting the compiler do the automatic vectorization of a high-level C representation of the shader is a good idea (but I have to test the performance also. I don't expect more than a 10-15% penalty, but I understand this is unacceptable in your case).
I think the main reason for a C-like shading language is that it's a bit simpler. You don't have to worry any more about how to evaluate a formula, and you can work with symbolic names. This greatly improves readability and development time, and since you're using the same instruction set and optimization rules there is no advantage to writing low-level shaders.

Using an ANSI C compiler, thus without built-in vector types with swizzling and masking, seems far less optimal. The compiler will have difficulties selecting the optimal instruction sequence and will sometimes fail completely to vectorize code.
To summarize: there is no such thing as "abstract assembly". A cross-platform vector library would be nice and interesting, but don't try to mix in low-level characteristics like swizzles and write masks. That's my opinion.
You're right, there is no abstract assembly. But most modern CPUs (and GPUs) have SIMD instructions with four elements, so with this prerequisite a C-like shader language with built-in vector types (possibly with swizzling and masking) could be a good abstraction and still perform well.
 
Nick said:
Not really. If you regard it as an intermediate vector code representation....
But why expose the intermediate level to the user, when you can do the abstraction on a higher level? This intermediate level has neither the advantages of assembly nor the advantages of a high-level language. It's only introducing restrictions. And already the model used by the 3Dlabs P10 is quite different from the traditional SIMD approach. Sun also has some similar ideas, but who cares? :). But since in the CPU world the instruction sets are more "static", probably this is not an issue.

Nick said:
Using an ANSI C compiler, thus without built-in vector types with swizzling and masking, seems far less optimal. The compiler will have difficulties selecting the optimal instruction sequence and will sometimes fail completely to vectorize code.
All the swizzling and masking instructions can be substituted by doing the corresponding computations with independent scalars and then merging the results back to a vector. This is, I think, much more elegant and readable. And since the scalars are independent a good compiler can probably vectorize them.
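A small sketch of what that substitution looks like: a swizzle written as independent scalar assignments rather than an explicit permute, leaving the merge to the compiler's vectorizer (names are hypothetical):

```cpp
#include <cassert>

// Each output component is an independent scalar assignment; a good
// vectorizer is free to merge the four of them back into a single
// shuffle/permute instruction.
struct Vec4 { float x, y, z, w; };

Vec4 swizzle_yxwz(const Vec4& a) {
    Vec4 b;
    b.x = a.y;
    b.y = a.x;
    b.z = a.w;
    b.w = a.z;
    return b;
}
```

Whether the compiler actually emits one permute or four scalar moves is exactly the open question being debated here.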
But I can’t really speak about the performance of shading using automatic vectorization until I test it.
Does anyone have a good link that compares autovectorization with hand-optimized assembly?

I’m not saying that’s something wrong with your approach, just I find the concept of automatic vectorization more elegant. And I hope the performance penalty is small. If someone needs ultimate performance or dynamic compilation, writing directly in assembly with SoftWire is probably more useful for him/her. I’m not sure if there’s any demand for an intermediate level vector abstraction.

The SRT Rendering Toolkit
 
ANSI C isn't too useful; you really need C99 at the very least (since it includes restrict). Really though, for good vectorization you need to use all the non-standard hints and pragmas that vectorizing compilers provide.

Vectorizing compilers aren't all that smart.
 
Pavlos said:
But why expose the intermediate level to the user, when you can do the abstraction on a higher level? This intermediate level has neither the advantages of assembly nor the advantages of a high-level language. It's only introducing restrictions. And already the model used by the 3Dlabs P10 is quite different from the traditional SIMD approach. Sun also has some similar ideas, but who cares? :). But since in the CPU world the instruction sets are more "static", probably this is not an issue.
Intermediate code has the advantages of assembly if it's closely related. I think that's the case for both SSE and AltiVec, and I'm sure there are many other similar architectures. But you're right, it won't be advantageous on completely different architectures. So it's more like a compromise between high- and low-level programming, adding an abstraction layer. You also still get the advantage of working with symbolic variables instead of registers, and you don't have to worry about optimizations.
All the swizzling and masking instructions can be substituted by doing the corresponding computations with independent scalars and then merging the results back to a vector. This is, I think, much more elegant and readable. And since the scalars are independent a good compiler can probably vectorize them.
Well it -can- be optimized, but there's no guarantee it will be. What I mean is, theoretically there should be no reason why compilers can't produce code as optimal as hand-tuned assembly, but practice shows different. Developing such a compiler takes years, while translating ps 2.0 instructions to SSE was done in -one- week. So you're lucky if someone else has done it for you, but there's usually a lot more effort involved in writing a vectorizing compiler than an assembler for vector code.

Of course for your situation it is probably very close to ideal since you have different objectives...
But I can’t really speak about the performance of shading using automatic vectorization until I test it. Does anyone have a good link that compares autovectorization with hand-optimized assembly?
Beware if you find such comparisons that they are often heavily biased. Of course developers of vectorizing compilers will tell you that it equals or beats manually written assembly every time, but ask DivX compression fanatics and you'll hear a different story. And the hand-optimized SIMD code at the Intel site is beaten by either of the two. :rolleyes:
I’m not saying that’s something wrong with your approach, just I find the concept of automatic vectorization more elegant. And I hope the performance penalty is small. If someone needs ultimate performance or dynamic compilation, writing directly in assembly with SoftWire is probably more useful for him/her. I’m not sure if there’s any demand for an intermediate level vector abstraction.
I absolutely agree that automatic vectorization is more elegant, and I'm confident it's the way to go for your project. But for my fixed-function pipeline I'm sure that it's nearly unbeatable. Intermediate abstraction is only really useful when wanting guaranteed performance on multiple architectures.
 
MfA said:
Vectorizing compilers aren't all that smart.
Really nice paper, thanks for the link! Unlike most other papers that discuss vectorization on exotic architectures, it shows practical results for real-life applications! I think this says it all:
The mismatch between the C language and the underlying MME architecture also widens the gap between traditional and MME vectorization.
So if we want code that vectorizes well -today-, we have to introduce new data types like float4 and things like swizzle/mask, or some sort of portable intermediate representation...
 
MfA said:
Vectorizing compilers aren't all that smart.

But they don't have to be smart to optimize the code produced from converting shaders. The paper you have linked doesn't use any "shading" benchmark or anything with a similar workload. I agree that vectorizing a general multimedia application is very tough. I don't expect to recompile my renderer and get a 4x speedup :).

I have found a paper from Intel that reports some very interesting results. When computing integer and floating-point kernels, mainly a dot product on a big array, the speedup over the sequential code is much more than the ideal 4x (approaching 10x in some cases)!? I can easily translate shaders into loops like the one used in the tests, but I want to see some comparisons with hand-optimized assembly. The linked paper provides one comparison with hand-optimized assembly and the performance is almost identical, but this is a paper from Intel :). Probably I have to do my own tests.
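The kind of kernel being described is easy to reproduce; a minimal sketch (not the paper's actual benchmark code):

```cpp
#include <cassert>
#include <cstddef>

// A plain dot-product kernel: apart from the reduction variable there is
// no cross-iteration dependence, which is exactly the shape that
// autovectorizers handle best (4-wide partial sums, then a final
// horizontal add).
float dot(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

Speedups beyond 4x on such loops usually come from more than SIMD alone, e.g. removed loop overhead and better use of memory bandwidth, which is worth keeping in mind when reading vendor numbers.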

Nick,
I agree with your previous post. After all, I'll try it and if it fails I'll use SoftWire.

The SRT Rendering Toolkit
 
IlleglWpns said:
Anyway, can you estimate how large the P4's integer or FP scheduler is? I can't find this sort of information anywhere.
The best data I had on this was when I was experimenting with hand-scheduling code to improve what was getting into the reorder buffer. I found that unless you could move an instruction by 10 instructions you never really saw a benefit, and it wasn't usually much until it reached 50 instructions.

So in general I don't interleave dependency chains on my P4 code as much as I used to do on P3 code. It makes it much easier to read and doesn't affect performance.
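The kind of hand-interleaving being discussed can be sketched as splitting one long dependency chain into two independent accumulators (illustrative only):

```cpp
#include <cassert>

// Two independent accumulators instead of one serial dependency chain:
// s0 and s1 never depend on each other, so their additions can overlap
// in the pipeline. Assumes n is even, for brevity.
float sum_two_chains(const float* a, int n) {
    float s0 = 0.0f, s1 = 0.0f;
    for (int i = 0; i + 1 < n; i += 2) {
        s0 += a[i];        // chain 0
        s1 += a[i + 1];    // chain 1
    }
    return s0 + s1;
}
```

The point above is that a deep out-of-order window, like the P4's, finds this overlap on its own, so the readable single-chain version costs little.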
 
Dave H said:
And finally--why should this effect be fixed by moving to better surface descriptions, e.g. subdivision surfaces? It's not a matter of the underlying model not having the proper detail, but rather of undersampling.
It's a matter of semantics. As you get more semantic information you can make more intelligent choices. Semantic information is also usually 'higher level' - e.g 'this is a man' rather than a load of polygons, so each individual bit of data specifying the pixel is specifying 'more pixels'.

But I wouldn't say this problem is either (a) large or (b) solved, and I don't know that much about it anyway, just that '<1 pixel polygons are bad'. :)
 
Hellbinder said:
So Uttar..

I can clearly *not choose* the Information in front of you... ;)

(Princess Brideism)

Unless you have built up a resistance to iocaine powder :)
 
Simon F said:
Hellbinder said:
So Uttar..

I can clearly *not choose* the Information in front of you... ;)

(Princess Brideism)

Unless you have built up a resistance to iocaine powder :)

Excuse me for being stupid, but I understood exactly 0% of those two posts.
Someone care to explain them to me? :oops:


Uttar
 