Welcome, Unregistered.

If this is your first visit, be sure to check out the FAQ by clicking the link above. You may have to register before you can post: click the register link above to proceed. To start viewing messages, select the forum that you want to visit from the selection below.

Reply
Old 11-Aug-2004, 08:58   #1
Goragoth
Member
 
Join Date: Feb 2003
Location: NZ
Posts: 363
Default Fast matrix/vector math with SIMD?

I'm currently writing a simple software renderer for a university project (rendering a bezier patch basically). Now, I've noticed that I'm doing a lot of matrix/vector multiplication (in the bezier patch tesselator and for all the transform/projection calcualtions for all the vertices). I might download the AMD analyzer program to see if that really is a big bottleneck but I suspect so (especially since I'm not using indexed triangle lists at the moment, so lots of duplicate vertices - I'll fix that soon though).

So I'm wondering if it is wise to use SSE for the matrix/vector multiplication? Does anybody have some code for that they might like to share, because unfortunately I know close to no assembly and looking at the Intel SSE documentation wasn't all that helpful. If its too complex I'll skip it and just try to optimize what I can but since I've heard that using SSE can double the speed of these sorts of parallel calculations I would think it worth a shot.
Goragoth is offline   Reply With Quote
Old 11-Aug-2004, 09:44   #2
Simon F
Tea maker
 
Join Date: Feb 2002
Location: In the Island of Sodor, where the steam trains lie
Posts: 4,396
Default

Out of curiousity, what method are you using to do the tessellation? If it's recursive subdivision then, IIRC, the subdivision matrix contains a lot of very simple constants so using a matrix multiply might be a bit of a brute force approach. Why not take a look at the code in Watt&Watt's "Advanced Animation and Rendering Techniques"?

There are other approaches, of course, but the recursive subdiv' algorithm is a simple one to start with and would be challenging enough for a Uni' project.
__________________
"Your work is both good and original. Unfortunately the part that is good is not original and the part that is original is not good." -(attributed to) Samuel Johnson

"I invented the term Object-Oriented, and I can tell you I did not have C++ in mind." Alan Kay
Simon F is offline   Reply With Quote
Old 11-Aug-2004, 10:01   #3
Nick
Senior Member
 
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
Default

Like Simon F sais, you should first look for the most optimal algorithm. I can imagine that since you're not using indexed vertices, some are re-used up to eight times. So even though assembly can help, just using an efficient algorithm can yield much more impressive results.

Anyway, I'd be glad to help you out with the SIMD code, I just first want to make sure you've got all other optimizations implemented first. It happens a lot that assembly optimizations become useless when the algorithm is improved. So try to work from top to bottom as much as possible.

Good luck!
Nick is offline   Reply With Quote
Old 11-Aug-2004, 10:06   #4
Goragoth
Member
 
Join Date: Feb 2003
Location: NZ
Posts: 363
Default

Right now its very brute force with matrix multiplication because that's what we got taught and it seemed easy. The book we are using is Foley/Van Dam and that explains it the same way. I was looking at an old gamedev mag article on central differencing though, that approach looks interesting and I might experiment with it.

No matter what though each vertex needs to be projected and that's a lot of multiplies (Matrix * Vector). I figured it wouldn't be all that hard to load 4 floats in one SSE register, another set in another register and then do a combined multiply, from what I've read on SIMD that's exactly the kind of thing it is for or am I completely wrong? It's not that important, there are probably lots of other ways of optimizing my code that will make more of a difference, I just thought it would be easy, sort of three or four lines of inline assembly.
Goragoth is offline   Reply With Quote
Old 11-Aug-2004, 11:04   #5
Scali
Naughty Boy!
 
Join Date: Nov 2003
Posts: 2,127
Send a message via ICQ to Scali Send a message via MSN to Scali
Default

You could use the D3DX library from the DirectX SDK. It contains matrix/vector/quaternion functions which are optimized for x87, 3dnow!, SSE and SSE2. The best path for the CPU is automatically selected at runtime.
Scali is offline   Reply With Quote
Old 11-Aug-2004, 12:15   #6
Simon F
Tea maker
 
Join Date: Feb 2002
Location: In the Island of Sodor, where the steam trains lie
Posts: 4,396
Default

Quote:
Originally Posted by Goragoth
Right now its very brute force with matrix multiplication because that's what we got taught and it seemed easy. The book we are using is Foley/Van Dam and that explains it the same way.
So you're using...:
Code:
[ 8 0 0 0]
[ 4 4 0 0] * 1/8
[ 2 4 2 0]
[ 1 3 3 1]


You can reformulate that to save a LOT of calculations. I really suggest you take a look at Watt & Watt.

Quote:
I was looking at an old gamedev mag article on central differencing though, that approach looks interesting and I might experiment with it.
I guess that's Clark's method. That's also described in Watt&Watt. It makes trying to decide when to terminate due to flatness cheaper as well.

Quote:
No matter what though each vertex needs to be projected and that's a lot of multiplies (Matrix * Vector).
Ahhhh. The projection matrix +division is very simple so doesn't really cost much.

The local->world matrix, OTOH is more expensive but subdivision is a linear operation... so you can apply your transform/projection matrix to the original control points before subdividing. Does that make you happier?
__________________
"Your work is both good and original. Unfortunately the part that is good is not original and the part that is original is not good." -(attributed to) Samuel Johnson

"I invented the term Object-Oriented, and I can tell you I did not have C++ in mind." Alan Kay
Simon F is offline   Reply With Quote
Old 11-Aug-2004, 13:22   #7
Dio
Senior Member
 
Join Date: Jul 2002
Location: UK
Posts: 1,758
Default

Man overboard?
Dio is offline   Reply With Quote
Old 11-Aug-2004, 13:38   #8
Goragoth
Member
 
Join Date: Feb 2003
Location: NZ
Posts: 363
Default

Quote:
So you're using...:
Code:

[ 8 0 0 0]
[ 4 4 0 0] * 1/8
[ 2 4 2 0]
[ 1 3 3 1]

Question
Ok, to be exact I'm doing x(s,t) = S * Mb *Gbx * Mbt * Tt
where Mb =
Code:
[ -1  3 -3  1 ]
[  3 -6  3  0 ]
[ -3  3  0  0 ]
[  1  0  0  0 ]
Mbt is its transpose (itself) and S and T are [s^3, s^2, s, 1 ] and [t^3, t^2, t, 1 ]

Gbx is the geometry matrix of all the x values of the control points. The middle part (Mb * Gbx * Mbt) is the same for all points during each tesselation so I compute that before a huge loop through s and t. It is very slow but Foley/Van Dam doesn't cover anything else and I don't have access to Watt & Watt although I'll have a look if there is a copy at the uni library and if they don't have it I might order it because it sounds really worthwhile.


Quote:
Ahhhh. Very Happy The projection matrix +division is very simple so doesn't really cost much.

The local->world matrix, OTOH is more expensive but subdivision is a linear operation... so you can apply your transform/projection matrix to the original control points before subdividing. Does that make you happier? Smile
I see what you are saying but I have several matrices for the final projection that get concatonated together:
view transform - translates the scene down z and rotates around x and y.
projection - projects the points
viewport transform - moves all the points so that world coordinates of 0,0 match to the center of the view window rather than 0,0 on the view window and also scales the scene.

Now when I go to redraw the scene I apply those to each of the points without re-tesselating the bezier surface. I also don't really want to couple tesselation and projection because of how I've structured my code.
At maximum tesselation (four divisions of the control mesh) I worked out there are 13824 verticies (indexing would make a huge difference, I know and will implement that soon!) which makes that many vector*matrix operations and each of those is 16 multiplies and 12 adds if my maths is right so that's a lot of calculations going on there, plus each point needs to be homogonized. And I'm not doing any lighting calculations yet. Although I was thinking of computing lighting right after tesselating so that when the user spins the view around the patch it doesn't need to be calculated every frame.

So there's it really. Any suggestions on where I should optimize are welcome.

Looking at the swShader source has actually helped a lot, especially when I was creating my vector and matrix classes, I just wanted to say thank you Nick.
Goragoth is offline   Reply With Quote
Old 11-Aug-2004, 14:35   #9
Simon F
Tea maker
 
Join Date: Feb 2002
Location: In the Island of Sodor, where the steam trains lie
Posts: 4,396
Default

Quote:
Originally Posted by Goragoth
Ok, to be exact I'm doing x(s,t) = S * Mb *Gbx * Mbt * Tt
where Mb =
Code:
[ -1  3 -3  1 ]
[  3 -6  3  0 ]
[ -3  3  0  0 ]
[  1  0  0  0 ]
Mbt is its transpose (itself) and S and T are [s^3, s^2, s, 1 ] and [t^3, t^2, t, 1 ]
Oh no! That's merely the common way to succinctly specify the equations - you should not use that to evaluate the curve. It's inefficient and mathematically unstable.
Quote:
I don't have access to Watt & Watt although I'll have a look if there is a copy at the uni library
Go and look. It is worth your time.


Quote:
I see what you are saying but I have several matrices for the final projection that get concatonated together:
As I would expect. Just concatenate all of them except for the final projection to cannonical clip/Projection coordinates.

Quote:
Now when I go to redraw the scene I apply those to each of the points without re-tesselating the bezier surface.
OK, if you are only doing the tessellation once but changing the viewpoint frequently, then it's not worth doing.
__________________
"Your work is both good and original. Unfortunately the part that is good is not original and the part that is original is not good." -(attributed to) Samuel Johnson

"I invented the term Object-Oriented, and I can tell you I did not have C++ in mind." Alan Kay
Simon F is offline   Reply With Quote
Old 12-Aug-2004, 03:00   #10
Goragoth
Member
 
Join Date: Feb 2003
Location: NZ
Posts: 363
Default

Quote:
Oh no! That's merely the common way to succinctly specify the equations - you should not use that to evaluate the curve. It's inefficient and mathematically unstable.
Well, I knew it was bad but I'm definatively going to look for another way to do tesselation now
I got out the Watt & Watt book from the library and had a look over that. It looks mildly complicated but I'll try to implement the central differencing as soon as I have indexed lists, simple lighting and zbuffering working.

If anyone is interested the app so far can be had here. Its about 320kb zipped up (the size is mostly from using wxWindows for the graphics). Be sure to read the readme for the commands.
Goragoth is offline   Reply With Quote

Reply

Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off

Forum Jump

Similar Threads
Thread Thread Starter Forum Replies Last Post
Vertex Shaders: MIMD versus SIMD Jawed 3D Architectures & Chips 9 24-Jun-2005 21:50
Fast writes, they are actually useful!?! Colourless 3D Architectures & Chips 20 08-May-2003 04:18
Nvidia CEO: working on something twice as fast? megadrive0088 3D Architectures & Chips 35 18-Feb-2003 12:36
specialized HW is fast, but new CPUs are fast as well... Panajev2001a 3D Architectures & Chips 64 24-Dec-2002 23:38
VIA Launches LAN-on-Motherboard Fast Ethernet Controller marco Press Releases 0 19-Apr-2002 07:12


All times are GMT +1. The time now is 01:07.


Powered by vBulletin® Version 3.8.6
Copyright ©2000 - 2013, Jelsoft Enterprises Ltd.