Brainfart: 0x86 CPU; line rasterization algorithm performance

K.I.L.E.R · Jan 30, 2007

Sorry but after the last discussion on CPU architecture I kind of decided as of late to see how 3D graphics algorithms perform on a CPU(not GPU).

Funny thing I found was that DDA was faster than Bresenham.
I'm thinking I may have stuffed up somewhere:

Code:

Avg time normal: 53764.058450000004
Avg time modded: 52061.156820000004

"modded" = DDA
"normal" = Bresenham

I've written and rewritten the test several times, each differently and came up with this final version.
Under all cases I saw almost alike results or the DDA being a bit faster.
The DDA was initially slower when I used Math.round

under the "dda" routine, however when I replaced it with "(int)" it became faster.

Is there a more efficient Bresenham algorithm (I stole this one off wikipedia than work it out)?
I believe the reason it's faster is due to 2 factors, as were also discussed in my other thread:

Casting being cheaper than conditional check.
Floating point calcs being faster than fixed point on modern CPUs.

Are my results normal or have I done something dumb?
I have attached my test so people can point at me and laugh.

Simon F · Jan 30, 2007

Kiler,
Using fractional maths (with truncation/rounding) should be much faster than a branch-based line drawing algorithm.

What would probably help is to compute several pixels (e.g. 2 or 4) at the same time so that you break up chains of dependency. Make each of these a direct calculation from the starting pixel rather than adding to the immediate predecessor.

K.I.L.E.R · Jan 30, 2007

Computing 2 pixels ahead of time is a bit faster with DDA vs 1 at a time.
I'm trying to find a better balance.
Thanks.

Graham · Jan 31, 2007

Um.
just a thought,
but I wouldn't be surprised if image.getRaster().setPixel is your bottleneck, might be making any other significant differences seem less so

K.I.L.E.R · Jan 31, 2007

If you look in the code you will see that specific line of code is inside all the methods I compared.
I'm only interested in which algorithm is faster under today's CPU hardware.

Graham said:
Um.
just a thought,
but I wouldn't be surprised if image.getRaster().setPixel is your bottleneck, might be making any other significant differences seem less so

Simon F · Jan 31, 2007

K.I.L.E.R said:
If you look in the code you will see that specific line of code is inside all the methods I compared.
I'm only interested in which algorithm is faster under today's CPU hardware.

But it might dominate the time so much that the time for the other calculations is merely noise. Why not set the pixel in an internal array to test the speed? (And do a blit of the entire array later if you want to check that the algorithm is working?)

ERP · Feb 1, 2007

The cost of SetPoint will dominate from the call overhead alone. As Simon says rasterise onto an array or alternatively store the list of points in the array and remove the SetPoint calls cost from the test.

I'd expect a fixed point DDA version to be significantly faster than a Bresenham algoythm.
Floating point will only be faster IF you don't ever convert to integer, which is unavoidable in raterisation. If you can pack the computation into SSE registers and do them 4 wide it might be a win.
If you must use FP make sure you use SSE to round to int, and not the slow as ass C style rounding.

Bresenham was faster on a 6502 where it was difficult to do 16bit plus arithmetic, but it's real value is it's invariance in direction of rasterisation. Although there are ways around this with DDA algorythms aswell.

Npl · Feb 1, 2007

judging CPU performance by using something like Java aint a good idea. You never know what obscure things are happening when running. Thats a bit like benchmarking the C64 by running an emulator.

@ERP: what do you mean with "slow as ass C style rounding"? a cast or calling floor() ?
I dont think theres a reason a cast should be slower than SSE (apart from x86 just being plain weird of course :smile: )

ERP · Feb 1, 2007

Npl said:
judging CPU performance by using something like Java aint a good idea. You never know what obscure things are happening when running. Thats a bit like benchmarking the C64 by running an emulator.

@ERP: what do you mean with "slow as ass C style rounding"? a cast or calling floor() ?
I dont think theres a reason a cast should be slower than SSE (apart from x86 just being plain weird of course :smile: )

The standard libraries do a function call and set the fpu rounding mode by doing fldcw and then fistp which is hideously slow.

Those two instructions alone would dominate a well written DDA algorythm. You really want to use cvtss2si.

With line drawing algorythms and scan converters your really in the realm of micro optimisations. I'm not an X86 Guru by any means, but if I were writing it, I'd spend as much time looking at the assembler output from the compiler, as writing it. And I'd probably curse the compiler then write in in assembler.

MMX might even be the fastest way to go.

Simon F · Feb 1, 2007

Npl said:
@ERP: what do you mean with "slow as ass C style rounding"?

C requires the float to int conversion to use truncate but, IIRC, the default of x86-style processors is to use round to nearest. The compiler, therefore, has to insert extra code to meet the C-standard.

AHH - I see ERP had already replied.

K.I.L.E.R · Feb 1, 2007

Are you telling me that float to int conversions are implemented via a single op on CPUs via a register?

ERP · Feb 1, 2007

in C or C++ on x86 float to int will insert "call _ftoi", that function will consist of fldcw and fistp instructions. Try running vtune on a few games, I've seen ftoi clock upwards of 5 % of the total cpu time (a lot of devs write there own version to try and mitigate that), and games have far more work between conversions.

Even ignoring the function call overhead, those two instructions are extremly expensive. Given the minimal amount of work a DDA loop does, the conversion will dominate the time.

It's actually the case that float to int conversions are extremely expensive on many modern architectures, though not always for the same reason.

If you really want to compare line drawing algorythms your going to have to get your hands dirty and at least look at the generatad code.

K.I.L.E.R · Feb 2, 2007

Thanks all.
Very interesting stuff here.

I've read that there is an integer version of the DDA algorithm but I cannot find anything about it, just mentioned in a few papers.

Simon F · Feb 2, 2007

K.I.L.E.R said:
Thanks all.
Very interesting stuff here.

I've read that there is an integer version of the DDA algorithm but I cannot find anything about it, just mentioned in a few papers.

I think you'll find that the original intent was that DDA be done in "integer"

To solve your problem, just use fixed point.

K.I.L.E.R · Feb 2, 2007

That's it? Nothing special?

Simon F said:
I think you'll find that the original intent was that DDA be done in "integer"

To solve your problem, just use fixed point.

Simon F · Feb 2, 2007

K.I.L.E.R said:
That's it? Nothing special?

Well you could stand on one leg and whistle "Click go the shears" if you like but....

BTW, are you just drawing single pixel lines?

K.I.L.E.R · Feb 2, 2007

No. Just wanting to see the performance between the 2 and I ended up getting free tips on performance.

Here is my revised algorithm.

* 1/1e3 due to numerical issues. (Overflow)

Math proof:
y = dy/dx * x + c; c = 0
*dx => dx.y = dy.x
solve for dy and dx:
dy = dx.dy / x
dx = dy.dx/y

m = dy/dx

Sub:
m = y^2.dx / x^2.dy

Numerically to avoid floating point computations, break apart the fraction.

Code:

[B]int yi = y1;
        int dx = x2 - x1;
        int dy = y2 - y1;
        int ySqr = dy * dy;
        int xSqr = dx * dx;
        int topFract = (ySqr*dx) / 1000;
        int fraction = (xSqr*dy) / 1000;
        
        yi /= 1000;
        
        for (int x = x1; x < x2; x++) {
            image.getRaster().setPixel(x, yi/fraction, data);
            yi += topFract;
        }[/B]

It's slower than the last 2 algorithms. :/
I thought itneger division was faster than that.

d-f-b · Feb 2, 2007

i vaguely remember coding alternate versions of the dda,
ones for when it's steep, ones for when it's shallow , etc. then pick which version to use

and , i'm tempted to say that the 6502 comment isnt true

. though only if u use fairly low accuracy fixed point

.

d-f-b · Feb 2, 2007

K.I.L.E.R said:
No. Just wanting to see the performance between the 2 and I ended up getting free tips on performance. Here is my revised algorithm.

Code:

[B]int yi = y1; int dx = x2 - x1; int dy = y2 - y1; int ySqr = dy * dy; int xSqr = dx * dx; int topFract = (ySqr*dx) / 1000; int fraction = (xSqr*dy) / 1000; yi /= 1000; for (int x = x1; x < x2; x++) { image.getRaster().setPixel(x, yi/fraction, data); yi += topFract; }[/B]

use 1024 , so you can shift and mask
but still the image.getraster will be taking far more time than the other bits..

d-f-b · Feb 2, 2007

i'd also say,,
is that how the dda is coded?? i dont remember having a "/" in the loop!?

Brainfart: 0x86 CPU; line rasterization algorithm performance

K.I.L.E.R

Retarded moron

Attachments

Simon F

Tea maker

K.I.L.E.R

Retarded moron

Graham

Hello :-)

K.I.L.E.R

Retarded moron

Simon F

Tea maker

ERP

Npl

ERP

Simon F

Tea maker

K.I.L.E.R

Retarded moron

ERP

K.I.L.E.R

Retarded moron

Simon F

Tea maker

K.I.L.E.R

Retarded moron

Simon F

Tea maker

K.I.L.E.R

Retarded moron

d-f-b

d-f-b

d-f-b

Similar threads