Advanced Rasterization

Ah, see, you start getting it.

It's also up to me, if you get it ;)

I could mention your name to some managers, if you tell me your true name.

Okay

Yes, but internet, educational and CAD applications are not asking for a new raytracer. Well, I'm sure there's a market for it of course, but I can't possibly compete against professional tools that already have run-time SIMD compilation technology. If I did, it wouldn't be a hobby project any more. But I'll keep it in mind for the future...

Do they, though?
I have not heard of any offline renderer that has runtime compilation.
And if you ask me, internet, educational apps and CAD ask for hardware acceleration :)
If the internet didn't require hardware acceleration, I'd probably be a rich man now with my Java engine :)

Sure, but just because these cheap graphics cards are available doesn't mean people immediately 'upgrade'. Or would you replace a GeForce 4 Ti with a GeForce FX 5200?

If they want SM2.0 they probably will. Besides, if they can afford a GF4Ti, they can also afford a decent SM2.0 card.

Furthermore, I'll be SM 4.0 compatible long before the hardware is affordable. Not only amateurs love that...

You said the same thing about SM3.0, but so far I don't think you've even implemented SM2.0 and lower completely.

What are you trying to say? This is my fourth generation renderer.

I recall you saying that it was your first renderer. Perhaps you've rewritten it a few times, but everyone does that ;)

Either way it sells. They didn't include it in Unreal Tournament for nothing. And I would be more than happy to sell it for a fraction of their price.

Dunno, I tried the patch for Medal of Honor, and it was completely unplayable on a P4 2.4 GHz. Funny, that game runs on an age-old GF2 in 1024x768 with little trouble. I'm sure it runs on pretty much any onboard GPU too. So yes, as far as I'm concerned, they did include it for nothing. I don't think you can find a system with a CPU that is fast enough, coupled with a GPU that is slow enough, to make Pixomatic a better alternative than the GPU.
And I don't see how your renderer will improve that, because you simply have this barrier of limitations in memory bandwidth and processing power. After all, that is why we have hardware acceleration.

Refrast and your Java renderer, although respectable, don't use the CPU optimally. Neither is an option for a fallback. People who design their software to run on all x86 systems won't choose refrast.

Your renderer doesn't use the CPU optimally either, since you try to shoehorn it into a hardware API, that was the point. And until you can render a Q3 level at a respectable framerate I don't think your renderer is an option for a fallback either :)

You would have given up long ago, wouldn't you?

Well yes. Unless someone pays me to optimize a software renderer, I see no reason to continue working on one. I am more interested in shading and animation now, and the things I am doing wouldn't stand a chance in software.

The magic algorithm might not exist, but I'm getting close, and results will be satisfying, to me...

Make a demo again then, and this time, add some skinned characters to a Q3 level. And do per-pixel lighting instead of that texture*lightmap thing.
Then let's see what framerates you get. After all, if you implement D3D9 shaders, you might as well use them.
 
Nick,

This reminds me of the rudimentary scanline work I did. First did a software renderer in Pascal way back, then sped it up with assembler. When I started doing some terrain stuff, I decided to do scanline conversion to tell me which tiles of the terrain were in the view frustum. That latter part may be applicable to you, since I had to check if any part of the tile was in a triangle, not just the centre as in ordinary rasterization.

Anyway, can't you just do scanline conversion the ordinary way, and just pair up scanlines? If you had the endpoints of two consecutive scanlines, couldn't you figure out coverage quite easily using those numbers? Then once you have the 2x2 blocks, you can make 4x4 blocks if you want in the same way. Or you can try to handle 4 scanlines at once, but I doubt it'd be faster.
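Something roughly like this is what I have in mind (just a sketch to show the idea; the struct and names are made up, and it assumes you already have the left/right endpoints of each scanline from ordinary edge stepping):

// Rough sketch (hypothetical names): given the endpoints of two consecutive
// scanlines, derive the coverage of a 2x2 block starting at column x.
struct Span { int xLeft, xRight; };   // inclusive pixel range of one scanline

// Returns a 4-bit mask: bit 0 = (x,y), bit 1 = (x+1,y),
// bit 2 = (x,y+1), bit 3 = (x+1,y+1).
inline unsigned coverage2x2(const Span &top, const Span &bottom, int x)
{
    unsigned mask = 0;
    if(x     >= top.xLeft    && x     <= top.xRight)    mask |= 1;
    if(x + 1 >= top.xLeft    && x + 1 <= top.xRight)    mask |= 2;
    if(x     >= bottom.xLeft && x     <= bottom.xRight) mask |= 4;
    if(x + 1 >= bottom.xLeft && x + 1 <= bottom.xRight) mask |= 8;
    return mask;
}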

Maybe I'm misinterpreting your goal here...
 
Scali said:
If the internet didn't require hardware acceleration, I'd probably be a rich man now with my Java engine :)
No, because you'd have to compete with other Java 3D engines that have similar performance and features closer to that of hardware rendering. But if they get the option for even more performance and the well-known DX9 interface, I'm sure there will be some interest. Once again: there's nothing like my software on the market. Besides, you're talking about being a "rich man". As I told you before, that's not my goal, though it already helps pay for my education.
You said the same thing about SM3.0, but so far I don't think you've even implemented SM2.0 and lower completely.
SM 2.0 is complete. SM 3.0 requires rendering quads for the gradient instructions, but apart from that I already have most SM 3.0 features. And for SM 4.0 I will have a lot of time, since Longhorn won't be out for several years.
I recall you saying that it was your first renderer. Perhaps you've rewritten it a few times, but everyone does that ;)
I never really restarted from absolute zero, but I don't think anyone ever does that. My first renderer was for DOS. It rendered a Gouraud-shaded cube in Mode 13h. Then I 'discovered' DirectDraw and 32-bit mode and I bought a real C++ compiler. None of the old code really survived that transition, so that's when I started my second renderer. It evolved into Real Virtuality, where being able to render Quake 3 scenes was my primary objective. That's also when I started developing and using SoftWire. But once I started optimizing for Quake 3, it became hard to use it for anything else. SoftWire also evolved from assembling text to run-time intrinsics, which opened up a lot of new possibilities. That's when I started swShader, my third renderer, where I wanted to strictly separate the application from the renderer, which evolved into using a DirectX 9-like interface and eventually the DLL. I also used new state management to really make use of the capabilities of run-time intrinsics. The fourth generation renderer is largely based on swShader, but that will change. I'm focusing on optimizing data structures and streaming processing. SoftWire is evolving into a higher-level optimizing compiler, which makes it easier to write SIMD code. State and context management will be changed completely to allow deferred rendering techniques, and finally overdraw reduction methods will be implemented. So I really feel like I'm working on a new, fourth project. I hope that cleared it up...
So yes, as far as I'm concerned, they did include it for nothing. I don't think you can find a system with a CPU that is fast enough, coupled with a GPU that is slow enough, to make Pixomatic a better alternative than the GPU.
Laptops and low-end systems still often have graphics chips that either are not supported by the application, or don't do any reasonable 3D operations. Pixomatic was received with enthusiasm. It's pretty close to playing Unreal Tournament 1 in software mode, which was really good.
And I don't see how your renderer will improve that, because you simply have this barrier of limitations in memory bandwidth and processing power. After all, that is why we have hardware acceleration.
I mainly 'only' improve the feature set and the interface. Performance versus quality should scale all the way from Quake 1 performance to SM 4.0 quality. Memory bandwidth is quite high even for Durons and Celerons, and processing power has evolved far beyond what was possible when hardware acceleration took over. And yes, we have hardware acceleration to push those barriers further, but sometimes you just don't have that hardware or the features. Either way, you're again only talking about games.
Your renderer doesn't use the CPU optimally either, since you try to shoehorn it into a hardware API, that was the point.
So what exactly is it that makes a hardware API terrible for a software renderer? Or let me put that differently, what feature that is so important to software rendering can I not expose through that API?
And until you can render a Q3 level at a respectable framerate I don't think your renderer is an option for a fallback either :)
Real Virtuality had only a regular 32-bit z-buffer. Furthermore only the pixel pipeline was completely assembly optimized. So your "until" is now.
Well yes. Unless someone pays me to optimize a software renderer, I see no reason to continue working on one. I am more interested in shading and animation now, and the things I am doing wouldn't stand a chance in software.
I am getting money for this, and that's not the only "reason to continue working" on it. Congratulations on your shading and animation, I hope you're making a fortune out of it.
Make a demo again then, and this time, add some skinned characters to a Q3 level. And do per-pixel lighting instead of that texture*lightmap thing.
The cool thing about using the DirectX 9 interface is that I don't have to "make" demos. I'm running the SDK samples right now, which include skinned characters and per-pixel lighting. Without this interface I would have had to do all this myself. Furthermore, this gives me a lot of testing material to make sure everything works. These advantages far outweigh any disadvantages I might get from using this interface.
Then let's see what framerates you get. After all, if you implement D3D9 shaders, you might as well use them.
It's funny that you look forward so much to seeing my demos, while at the same time advising me never to do any software rendering again. :rolleyes:

The discussion ends here Scali. You absolutely can't stop me from doing software rendering, and I fail to see what you would gain by that. There are many more reasons, even ones I can't tell you, that make my hobby really worthwhile. So if you have anything left to say, let it be on topic, otherwise keep it to yourself or write me an e-mail that I can ignore so you don't fill the forum. Thanks.
 
Hi Mintmaster,
This reminds me of the rudimentary scanline work I did. First did a software renderer in Pascal way back, then sped it up with assembler. When I started doing some terrain stuff, I decided to do scanline conversion to tell me which tiles of the terrain were in the view frustum. That latter part may be applicable to you, since I had to check if any part of the tile was in a triangle, not just the centre as in ordinary rasterization.
That sounds interesting! How did that algorithm work for checking triangle intersection with the tiles?
Anyway, can't you just do scanline conversion the ordinary way, and just pair up scanlines? If you had the endpoints of two consecutive scanlines, couldn't you figure out coverage quite easily using those numbers? Then once you have the 2x2 blocks, you can make 4x4 blocks if you want in the same way. Or you can try to handle 4 scanlines at once, but I doubt it'd be faster.
The 'problem' with that approach is that I would always lose fillrate. It starts with the "ordinary" algorithm and then adds operations to process 2x2 blocks. That's sort of a detour. So my goal is to have a new rasterization method that is more optimal for rasterizing 2x2 blocks. The half-space functions algorithm seemed a great starting point since SSE makes it easy to compute write masks for partially covered 2x2 blocks. Anyway, I don't want to limit myself to this algorithm, but with a few tweaks it's starting to turn out pretty well...
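For example, something along these lines (a simplified sketch, not the actual code; it assumes the three half-space functions have already been evaluated at the four pixels of a 2x2 block, one SSE register per edge):

#include <xmmintrin.h>

// Simplified sketch: e0, e1, e2 each hold one edge (half-space) function
// evaluated at the four pixels of a 2x2 block, one pixel per SSE lane.
// A pixel is covered when all three edge functions are >= 0.
inline int coverageMask2x2(__m128 e0, __m128 e1, __m128 e2)
{
    __m128 zero = _mm_setzero_ps();

    __m128 inside = _mm_and_ps(_mm_cmpge_ps(e0, zero),
                    _mm_and_ps(_mm_cmpge_ps(e1, zero),
                               _mm_cmpge_ps(e2, zero)));

    return _mm_movemask_ps(inside);   // 4-bit write mask, one bit per pixel
}

The mask can then be used to mask the stores of the quad pipeline.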

Thanks!
 
No, because you'd have to compete with other Java 3D engines that have similar performance and features closer to that of hardware rendering.

If you think my engine didn't have features that were close to hardware rendering, there is nothing to discuss.

But if they get the option for even more performance and the well-known DX9 interface, I'm sure there will be some interest.

If you think the API is important, think again.
And if you think the DX9 interface is the ideal interface for a renderer, think again.

SM 2.0 is complete.

So you have also implemented the 1.x shaders, and you have cubemaps and 3D textures and the like? As in: can I actually run my code in it now?

So I really feel like I'm working on a new, fourth project. I hope that cleared it up...

Sounds to me like you made your first test renderer in DOS, then started on the project you are doing now... Always been a z-buffered sw triangle rasterizer anyway, even though some parts may have been redesigned.

So what exactly is it that makes a hardware API terrible for a software renderer? Or let me put that differently, what feature that is so important to software rendering can I not expose through that API?

It's not about features, it's about design.
The DX9 design only allows you to send triangles to the hardware, and only allows you to do processing per-vertex or per-pixel.
It also doesn't force any kind of order on the triangles, because it assumes that the hardware handles HSR.

So you have very strict limits on your input and where and when you should do processing.
An extreme example: Imagine a raytracer using the D3D9 interface.

Real Virtuality had only a regular 32-bit z-buffer. Furthermore only the pixel pipeline was completely assembly optimized. So your "until" is now.

I got between 5 and 20 fps on my 1800+. Imagine adding characters and game-logic... It would not be very playable, would it? While a PII 233 with a TNT2 or something would probably get over 50 fps most of the time?
So how exactly would it be a fallback? I'd need to upgrade my CPU to get the framerates I got with a PC like the one I upgraded from years ago.
Oh and if you think that in a Q3 level you need to optimize vertex-processing and triangle-setup in asm, as well as the pixel-pipeline, erm... :)
How many polys are there onscreen on average? About 100?

It's funny that you look forward so much to seeing my demos, while at the same time advising me never to do any software rendering again.

That is not what I am advising. And yes, I do want to see how slowly your renderer will render some real D3D code on my PC.
The ATi car was already about 10 times slower with swShader than on my ancient GF2GTS with a 4-pass rendering scheme, and about 6 times slower than on an Intel Extreme with a 3-pass rendering scheme, and that was only the bare essential shading stuff of today's software.
Sadly for you, I didn't wait on swShader any longer, and wasted €100 on SM2.0 hardware, so now I do it in one pass, and about 50 times faster :)

The discussion ends here Scali. You absolutely can't stop me from doing software rendering, and I fail to see what you would gain by that.

You don't get it at all. I do not want you to stop doing software rendering, I want you to apply your stuff in areas where it would have more of an effect.
 
Scali said:
If you think the API is important, think again.
Great, then since it isn't important anyway, I choose DirectX 9. I'll just assume that it's coincidental that people like using the well-known interface, that I have tons of testing material, and that there are no limitations to extend the interface. I'm even using and passing many DCT tests, coincidentally. :rolleyes:
And if you think the DX9 interface is the ideal interface for a renderer, think again.
Please tell me all about the ideal one then. :!:
So you have also implemented the 1.x shaders, and you have cubemaps and 3D textures and the like? As in: can I actually run my code in it now?
As you obviously know, shaders are passed to the DirectX DLL as a stream of tokens. They are highly compatible, so that version 1.1 shaders can easily be implemented using shader 2.0 capable hardware, or software. I have sufficient testing material, but feel free to send me your application and I'll make it run.
It's not about features, it's about design.
The DX9 design only allows you to send triangles to the hardware, and only allows you to do processing per-vertex or per-pixel.
That's not going to change any time soon. What would be the alternatives anyway?
It also doesn't force any kind of order on the triangles, because it assumes that the hardware handles HSR.
Which isn't a problem at all. Behind the interface I'm in control. I can reorder, batch, cull whatever I want. I can split render operations into a deferred pass and totally eliminate overdraw. Whatever works best, there are many options. And I can still extend the interface with new methods, if that would be useful.
So you have very strict limits on your input and where and when you should do processing.
See above.
An extreme example: Imagine a raytracer using the D3D9 interface.
This isn't "extreme" to imagine. After all, you just pass geometry, material and light information. That's all you need to start raytracing. If I'm not mistaken there's a project that uses the OpenGL interface for ray-tracing.
I got between 5 and 20 fps on my 1800+. Imagine adding characters and game-logic... It would not be very playable, would it? While a PII 233 with a TNT2 or something would probably get over 50 fps most of the time?
I'm only a small factor behind low-end hardware. In five years we'll have the CPU processing power to play today's games in software, and you'll still be complaining... You have to see things in the right perspective. My rendering quality and performance is far beyond that of the last software-rendered games. And for the tenth time; it's not all about games.
So how exactly would it be a fallback? I'd need to upgrade my CPU to get the framerates I got with a PC like the one I upgraded from years ago.
Oh and if you think that in a Q3 level you need to optimize vertex-processing and triangle-setup in asm, as well as the pixel-pipeline, erm... :)
How many polys are there onscreen on average? About 100?
1000 really, and about ten times more that don't make it to the screen but still need processing. And Real Virtuality's vertex pipeline was totally C code without any decent optimization. It sucks compared to swShader's programmable pipeline with advanced culling, batching and caching. And I'm not even exploiting SIMD to the fullest yet. So yes, vertex processing was the second bottleneck and optimizing it would have given me 30 FPS.
That is not what I am advising. And yes, I do want to see how slowly your renderer will render some real D3D code on my PC.
I can intentionally put some dead loops in it to satisfy you. I'll make a SCALI_MODE option in the initialization file. :rolleyes:
The ATi car was already about 10 times slower with swShader than on my ancient GF2GTS with a 4-pass rendering scheme, and about 6 times slower than on an Intel Extreme with a 3-pass rendering scheme, and that was only the bare essential shading stuff of today's software.
That's only six times slower. And I used 32-bit floating-point color precision and a prototype vertex pipeline back then. With fixed-point quad processing pipelines I might equal Intel Extreme Graphics II. And let's not forget that this hardware has no support for shaders at all, so you'd have to rewrite every application with multi-pass rendering to get similar results. They sold and still sell like hot cakes, so there are going to be a lot of people out there interested in a software fallback.

And although 'beating' hardware rendering performance-wise is still not my goal, an SSE z-buffer implementation is several times faster than this hardware's z-buffer. And that's the raw power, without tiling or anything like that. So in cases with high overdraw, like rendering CAD models, I 'beat' hardware.

There is no better alternative for a fallback.
You don't get it at all. I do not want you to stop doing software rendering, I want you to apply your stuff in areas where it would have more of an effect.
Then what do you really want me to do? What's "my stuff" except for a software renderer? And why don't you do it then? I'm happy with what I do.
 
Great, then since it isn't important anyway, I choose DirectX 9. I'll just assume that it's coincidental that people like using the well-known interface, that I have tons of testing material, and that there are no limitations to extend the interface. I'm even using and passing many DCT tests, coincidentally.

I doubt that anyone actually LIKES using the DirectX API. Since it's relatively low-level it doesn't make coding any easier.

Please tell me all about the ideal one then.

That is exactly the point. There is no ideal interface. The interface should be dictated by the type of renderer, but in your case it's the other way around. The DX9 interface dictates that your renderer pretty much has to work like hardware, which is not always what you want. You may think it is what you want, but it is not.

As you obviously know, shaders are passed to the DirectX DLL as a stream of tokens. They are highly compatible, so that version 1.1 shaders can easily be implemented using shader 2.0 capable hardware, or software. I have sufficient testing material, but feel free to send me your application and I'll make it run.

You mean you don't have support for it.
Let me point out that the DirectX specs require any driver for shader model X to support every shader model < X.
And my code, like most other D3D code, uses a nice blend of shader versions and fixed-function, always using the lowest possible version, to improve compatibility and reusability.

That's not going to change any time soon. What would be the alternatives anyway?

Adaptive NURBS tessellation, voxels, raytracing, REYES, ...?

Which isn't a problem at all. Behind the interface I'm in control. I can reorder, batch, cull whatever I want. I can split render operations into a deferred pass and totally eliminate overdraw. Whatever works best, there are many options. And I can still extend the interface with new methods, if that would be useful.

My point exactly, BEHIND the interface. So after the application has just generated 10 MB of render data, you are going to batch it all, and reorder, and preprocess, etc...
While this could have been done BEFORE the application handed it over to the interface, saving a lot of extra overhead.
Also, extending the interface with new methods pretty much breaks the idea of being DX-compatible anyway. People will still need to learn a new interface, and their code will no longer run on DX hardware as-is.
Might as well design an optimal interface then.

This isn't "extreme" to imagine. After all, you just pass geometry, material and light information. That's all you need to start raytracing. If I'm not mistaken there's a project that uses the OpenGL interface for ray-tracing.

If you want to raytrace on raw triangle meshes without pregenerated BSP-trees or the like, you are more naive than I thought.
This demonstrates my point as painfully as it can be: You get a raw set of triangles... Two choices:
1) Generate the BSP-tree on-the-fly, which could take QUITE a while.
2) Skip the BSP-tree and bruteforce the ray against all triangles in the mesh, which could take QUITE a while.

While you could implement the raytracer with O(log N) complexity, you now have to implement it with O(N) complexity because the interface doesn't give enough information.

Of course, you could argue that as long as the vertex data is static, you can reuse the same tree, so you only generate it when the data changes... But that doesn't help when you are using vertex shaders, does it?

And how would you make use of the advantages of raytracing, like shadows or reflections?

Also, if you mean OpenRT... that's a project with an OpenGL-like interface. It's not the same, but it bears resemblance to OpenGL.

I'm only a small factor behind low-end hardware. In five years we'll have the CPU processing power to play today's games in software, and you'll still be complaining...

Yes, why would I want to play today's games on my new high-end CPU in five years? I can already play them today on a simple Radeon 9600 or GeForce FX 5700, which are very cheap. Cheaper than that high-end CPU will be in 5 years. In 5 years, I want to play games that are released then, not now.

My rendering quality and performance is far beyond that of the last software-rendered games.

So was my Java engine, big deal.

So yes, vertex processing was the second bottleneck and optimizing it would have given me 30 FPS.

That code must have been REALLY shitty, if 1000 polys make it 50% slower than optimum code.

That's only six times slower. And I used 32-bit floating-point color precision and a prototype vertex pipeline back then. With fixed-point quad processing pipelines I might equal Intel Extreme Graphics II.

So fixed-point is about 6 times faster than floating point?

And although 'beating' hardware rendering performance-wise is still not my goal, an SSE z-buffer implementation is several times faster than this hardware's z-buffer.

Erm, why exactly would SSE be faster than hardware?

Then what do you really want me to do? What's "my stuff" except for a software renderer? And why don't you do it then? I'm happy with what I do.

Your "stuff" is programmable vertex and pixel processing.
I'd do it if you pay me. Else I have better things to do with my time.
 
Sure Scali. Bye bye now.

Everyone else, thanks for the ideas! The performance of my new rasterizer is higher than that of my previous one, for small triangles as well as big ones. Some further improvement might be possible, but I'm really happy with the result now and the code is simply elegant. I also noticed that cache misses due to texturing nearly halved, thanks to more local accesses. This might be a significant win for scenes with many textures.
 
Fine, then don't answer the technical questions I ask you. In that case I will just assume that you don't have an answer, or you don't like the answer enough to give it. Either way, it would seem that I am right.
At least I make you think about what you do, because you have to defend it.
 
Scali and Nick - your discussion is beyond simple technical querying; you're deliberately trying to provoke an argument here. May I suggest that you take it into PMs or simply drop it altogether?
 
Nick,

Think you could tell us exactly what you found to be the best solution?

I am considering picking my rasterizer back up to play with some stuff. I don't care that much about speed, but I don't aim to be too slow either.

-Evan
 
Scali and Nick - your discussion is beyond simple technical querying; you're deliberately trying to provoke an argument here. May I suggest that you take it into PMs or simply drop it altogether?

Well, I would at least like Nick to explain why his SSE-based zbuffer is faster than hardware, or how his fixed-point processing will be as fast as Intel Extreme II, since apparently his float processing is 6 times slower.
I am interested in the optimizations that he used to get here (and also a binary, so I can check for myself... I've only seen some rather dated binaries of his stuff, apparently).

For the rest, I just tried to open Nick's eyes to other types of rendering, where software will be more suitable.
 
Neeyik said:
Scali and Nick - your discussion is beyond simple technical querying; you're deliberately trying to provoke an argument here. May I suggest that you take it into PMs or simply drop it altogether?
Hi Neeyik,

Scali hasn't really tried to contribute a single post related to the topic, has he? I asked him kindly to e-mail me about the things he wanted to discuss, but he didn't listen. With my last 'response' I tried to "drop it altogether", but well... He attacks the things I'm doing and tells me to do it differently, but he simply doesn't know how to do it better. I take that personally and find it hard not to reply to it. Please understand that I'm trying hard not to provoke anything, but sometimes it's just inevitable when people keep ignoring technical and personal arguments.

Sincerely,

Nicolas
 
ehart said:
Think you could tell us exactly what you found to be the best solution?

I am considering picking my rasterizer back up to play with some stuff. I don't care that much about speed, but I don't aim to be too slow either.
Ok, the method that I found to be quite successful is to first test whether an 8x8 pixel block has any coverage. This is possible by computing the half-space functions at the corners of the block. By comparing them as described above, this gives an exact result.

When an 8x8 pixel block has no coverage, it means you can skip 64 pixels at once! When it does have coverage, I compute the complete coverage mask for it. This is done all at once in an unrolled loop, which turned out to be faster than using a dynamic loop where I jump out at the 'end' of the scanline. I zig-zag through the 8x8 block so the functions can be evaluated incrementally (they are linear). The complete 8x8 coverage mask is then sent to the pixel pipeline(s). It's easy to skip several quads at once there, and to send only covered quads to the shader pipeline.

That's the high-level algorithm, which can be implemented using SSE and MMX quite efficiently. I don't think it would be very suited for a C implementation.
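Just to make the corner test concrete, here's what the block rejection boils down to (a scalar sketch for clarity only; the names are illustrative, and the real version keeps the values in SSE registers and steps them incrementally):

// Sketch: exact 8x8 block rejection against one edge function
// E(x, y) = A*x + B*y + C, where E >= 0 means 'inside' the half-space.
// The block covers the pixels [x0, x0+7] x [y0, y0+7].
bool blockOutsideEdge(float A, float B, float C, int x0, int y0)
{
    // Evaluate the edge function at the four corner pixels of the block.
    float e00 = A * (x0    ) + B * (y0    ) + C;
    float e10 = A * (x0 + 7) + B * (y0    ) + C;
    float e01 = A * (x0    ) + B * (y0 + 7) + C;
    float e11 = A * (x0 + 7) + B * (y0 + 7) + C;

    // The function is linear, so if all four corners are outside,
    // every pixel in the block is outside.
    return e00 < 0 && e10 < 0 && e01 < 0 && e11 < 0;
}

If this returns true for any of the triangle's three edges, all 64 pixels of the block can be skipped; otherwise the full coverage mask is computed as described above.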

Good luck!
 
Scali said:
Well, I would at least like Nick to explain why his SSE-based zbuffer is faster than hardware, or how his fixed-point processing will be as fast as Intel Extreme II, since apparently his float processing is 6 times slower.
I am interested in the optimizations that he used to get here (and also a binary, so I can check for myself... I've only seen some rather dated binaries of his stuff, apparently).
These are the last things I'm going to answer here. Everything else you can still ask by e-mail or somewhere else, but not in this thread.

Intel Extreme Graphics II has a clock of 266 MHz. It has only one pixel pipeline, so its fillrate is 266 Mpixels/s. That's also the theoretical limit of its z-buffer. With SSE I can do 4 z-tests in a little over 4 clock cycles. On a 2.66 GHz CPU that's 2.66 Gpixels/s. Even though software rendering obviously still has a lot of extra things to compute, this particular stage is ten times faster than hardware rendering, raw power. So that's without any kind of more sophisticated algorithm. The Extreme Graphics II chip is fixed and will never be able to exceed its fillrate.
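In intrinsics form, such a 4-wide z-test looks roughly like this (a simplified sketch, assuming a 16-byte aligned 32-bit floating-point depth buffer; the names are just for illustration):

#include <xmmintrin.h>

// Sketch: test four interpolated depth values against four z-buffer values
// at once, write back the passing depths, and return a 4-bit pass mask.
inline int depthTest4(float *zBuffer, __m128 z)
{
    __m128 stored = _mm_load_ps(zBuffer);      // read 4 depth values
    __m128 pass   = _mm_cmple_ps(z, stored);   // pass where z <= stored

    // Keep the old depth where the test fails, the new depth where it passes.
    __m128 merged = _mm_or_ps(_mm_and_ps(pass, z),
                              _mm_andnot_ps(pass, stored));
    _mm_store_ps(zBuffer, merged);             // write 4 depth values back

    return _mm_movemask_ps(pass);              // 4-bit pass mask
}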

The 32-bit floating-point, single-pixel shader pipeline turned out to be six times slower than Intel Extreme Graphics II for the simplified ATI car demo. MMX is more than twice as fast as SSE, but not six times. However, processing quads is a lot more efficient than processing single pixels. For example, computing a cross product for a single pixel takes 11 instructions, while it takes just 12 instructions to compute the cross products for a whole quad. This is an extreme example, but speedups between 2 and 3 are not uncommon. Swizzling is entirely for free when processing quads. Masking is also for free - it even eliminates instructions. That together with my rasterization and vertex processing optimizations could make it close to six times faster.
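To illustrate what quad processing means here (a simplified sketch; the struct and names are made up): each SSE register holds one vector component for all four pixels of the quad, so the cross products for the whole quad are just multiplies and subtracts, with no swizzling at all.

#include <xmmintrin.h>

// Sketch of structure-of-arrays quad processing: each __m128 holds one vector
// component for the four pixels of a quad. The cross product r = a x b then
// takes only 6 multiplies and 3 subtracts for all four pixels together.
struct Vec3Quad { __m128 x, y, z; };

inline Vec3Quad cross(const Vec3Quad &a, const Vec3Quad &b)
{
    Vec3Quad r;
    r.x = _mm_sub_ps(_mm_mul_ps(a.y, b.z), _mm_mul_ps(a.z, b.y));
    r.y = _mm_sub_ps(_mm_mul_ps(a.z, b.x), _mm_mul_ps(a.x, b.z));
    r.z = _mm_sub_ps(_mm_mul_ps(a.x, b.y), _mm_mul_ps(a.y, b.x));
    return r;
}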
For the rest, I just tried to open Nick's eyes to other types of rendering, where software will be more suitable.
You did that many times. My eyes are wide open, and I still choose to continue with what I'm doing now. I'm perfectly happy with that choice.

Thank you.
 
Scali hasn't really tried to contribute a single post related to the topic, has he?

I did actually. I described a way to determine coverage on blocks at once. But apparently you just ignored it.

With SSE I can do 4 z-tests in a little over 4 clock cycles. On a 2.66 GHz CPU that's 2.66 Gpixels/s. Even though software rendering obviously still has a lot of extra things to compute, this particular stage is ten times faster than hardware rendering, raw power.

Right, and memory bandwidth wouldn't have anything to do with that, I suppose? If we assume a 24- or 32-bit z-buffer, we need 4 bytes per pixel; also, you will need to read and write each pixel in many cases. Which means we need 2.66 * 4 * 2 = ~20 GB/s of memory bandwidth.
You will not get near that kind of bandwidth in practice. 1/10th of it would be more realistic, which brings us back to the level of Intel Extreme II. So perhaps you can beat it with z-only (if rendering a lot front-to-back), but the Intel Extreme II can render colour as well, at very little extra cost, which you cannot. Intel Extreme II can also do z-buffering while the CPU does the geometry setup for the next triangles, which you cannot. So with a reasonably high polycount, the Intel Extreme II will probably still be lots faster, because it processes in parallel.
Besides, I would want to see you beat it before I believe it. In no way is it 10 times faster than an Intel Extreme II in practice, I am sure.

That together with my rasterization and vertex processing optimizations could make it close to six times faster.

Again, I will believe that when I see it.

You did that many times. My eyes are wide open, and I still choose to continue with what I'm doing now. I'm perfectly happy with that choice.

I disagree; your answers have shown that you have no idea how raytracing works, and it seems you are completely foreign to other types of rendering as well. Which makes me think that your choices are ill-informed. I don't care about that, but don't go bragging about a renderer which you cannot provide.
Anyway, provide me with a drop-in replacement for D3D9.dll, which includes 1.x shader support and cubemaps, and we can do some testing.
 
Scali said:
I did actually. I described a way to determine coverage on blocks at once. But apparently you just ignored it.
I didn't ignore it, it was just not on-topic. Here's what you suggested:
Oh, as for the hardware solution... Of course the writemask for the quad is very simple there. Skipping entire quads is often done via early-z and a hierarchical z-buffer of some sort (or I believe some people call it z-compression).
The basic idea is to have a minimum z-value for all pixels in a quad, and the maximum z-value for the same pixels in the zbuffer. If the first is smaller than the second, then all pixels are rejected.
This can be applied at multiple levels, so even larger parts than just single quads could be rejected from a triangle early.

Obviously in hardware this solution is good, because it's very easy to make the circuitry that finds these z values, and insert them early in the pipeline so there is no additional cost. In software this solution makes very little sense, because getting the min/max z-values is not 'free'.
To start with, you're saying this algorithm makes little sense in software, which already makes it off-topic. And that's also why I didn't think you expected me to reply to it.

But it would at least have been interesting if it had been about rasterization at all. While the algorithm you describe is absolutely correct (*), and it does speed up rendering in the wide sense, it happens at a completely different level. Namely visibility determination. Rasterization, in the narrow sense, means finding the pixels a triangle covers - a triangle without any depth buffer context. One good example of this is rendering only a skybox. There is no z-buffering, the number of pixels is constant, so all that matters is how fast the rasterization algorithm works.

So the algorithm you described isn't going to help rasterization in that scenario. Or many other scenarios. For every convex part of an object, it would still scan the whole triangle's bounding rectangle for covered pixels. It is orthogonal to the rasterization algorithm, so I can still add it later on to speed things up even more, due to visibility, not triangle coverage.

(*) It's false actually, tiles get rejected when the minimum z-value of the tile is bigger than the maximum z-value in the z-buffer for that tile. But that's a detail of course.

Anyway, thanks for the attempt, but I didn't just 'ignore' it for no reason. And I kindly suggest you do something about your noise/signal ratio. You do sometimes have great ideas, but try to stick to the point, start new threads if you like, or e-mail me about your concerns. Cheers.
 
Nick said:
I didn't ignore it, it was just not on-topic. Here's what you suggested:

...

Actually, the passage directly above that is the one I was referring to:

Scali said:
As for your problem, I am not too familiar with the alternative homogeneous rasterizing method, but it seems to me like it introduces more problems with the generation of this mask than regular scanline conversion?
With regular scanline conversion it's quite simple, I guess: you can interpolate 2 scanlines at the same time, and both start at 'outside'... The moment they are both 'inside', you can assume 2x2 blocks to be inside, until the right edge comes near (at all times you know exactly how far each edge is from your current rendering position).

While the algorithm you describe is absolutely correct (*), and it does speed up rendering in the wide sense, it happens at a completely different level. Namely visibility determination.

True, as I said, the masks for coverage themselves are trivial in hardware, and this part is added because hardware does z-testing on quads as well (or even larger). You can implement the first part (edges inside/outside), but not the second part.
The first part is quite simple actually... e.g. if you have the edge values for all 4 pixels in a single SSE register, a single cmp instruction will generate the correct mask, I suppose.

I can ignore the rest of your post, since it ignores the rasterization part that is missing from your quote, but not from my original post.

(*) It's false actually, tiles get rejected when the minimum z-value of the tile is bigger than the maximum z-value in the z-buffer for that tile. But that's a detail of course.

I did mention that it can be applied at multiple levels. If the tile is not rejected, quads inside the tile can still be rejected.
But as you mentioned, that part is orthogonal to rasterizing, and tiles have no direct relation to quads and their writemasks, unlike the early-z per quad.

I am not going to email you about my concerns, because you still don't seem to understand most of them anyway, and I see no reason to start new threads. If you want my input, you know where to find me.

PS: I thought 266 Mpix/s was a bit low for the Extreme Graphics II, so I looked it up: http://www.extremetech.com/article2/0,1558,1617353,00.asp
There are apparently 2 pipelines, and a total of 533 Mpix/s fillrate.
This makes more sense, considering the speed I got from it with the test programs.
The original one did have 1 pipeline though. Then again, the memory didn't have more bandwidth at that time anyway.
 
Nick said:
And that one I didn't ignore at all, I replied to it.

I don't think you understood it. What I said was the same as what ehart said, but when ehart said it, you acted like it was the first time you heard it.


I think that particular website got the specs mixed up with the original Extreme Graphics.
It also doesn't mention the bicubic texture-filtering, for example.

This site compares the 2 generations: http://www.rojakpot.com/default.aspx?location=3&var1=98&var2=2

That would seem more logical. The other specs would indicate that there is no difference between generation 1 and 2.
Sadly Intel doesn't seem to publish the specs. They only boast about Zone Rendering 2.
 