I can do fake trilinear too, but I don't call it trilinear. If I did, I could also say dithering is bilinear filtering. I know real trilinear takes twice the number of operations, and that with hardware this just means twice the number of transistors.

Everybody on this forum (except me) has a graphics card with pixel shader support. But how many casual or average game players do you think still have a TNT2 or similar hardware? It's a really big part of the market. Counter-Strike is often still played in software mode and it's just as fun. Also think of all the mid-range laptops with crappy 3D cards but a fast Pentium III. In all these cases, a software renderer like mine can bring things like bilinear filtering and shader effects. Since the first Voodoo, the CPU has evolved a lot, but no one ever looked at software rendering again! Of course you are focused on maximum performance; I am too, but only on the CPU. It's another point of view.

A software renderer with unlimited shader capabilities can also bring truly infinite possibilities for offline rendering, especially for multi-processor systems. A GPU will also always lag behind in new features. For example, one of my interests is to get acceptable image quality at low framerates by doing semi-analytical motion blur in one pass. Edge antialiasing algorithms can also be improved, so we can have many more shades on the edge without huge fillrate and bandwidth needs. In general, a software renderer allows you to be 'smarter' instead of using brute force, because it has no limitations at all.

Thanks. If the above explanation didn't suffice, allow me to throw the ball back. What's so 'meaningful' about a GeForce 4 Ti or a Radeon 9700? Name a few games that really wouldn't be fun without them, please. I admit pretty graphics are nice to look at, but have they improved the games? I'm not a big gamer, but I don't see an important difference between Quake 1 and Quake 3, or Unreal Tournament in software versus hardware mode.
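To illustrate what a "fake trilinear" could look like (this is a sketch of the general dithering idea, not the actual renderer's code; the function name and threshold matrix are mine): instead of blending two mip levels per pixel, pick one of the two levels using an ordered-dither threshold on the fractional LOD. Averaged over neighbouring pixels this approximates the blend at half the filtering cost.

```cpp
// Hypothetical "fake trilinear" mip selection via a 2x2 ordered dither.
// For a continuous level of detail 'lod' (>= 0), returns the mip level to
// sample at screen position (x, y). Pixels whose dither threshold falls
// below the LOD fraction round up to the coarser mip, so a 2x2 block of
// pixels approximates the trilinear blend on average.
int dithered_mip(int x, int y, float lod) {
    static const float threshold[2][2] = {
        {0.125f, 0.625f},
        {0.875f, 0.375f},
    };
    int base = (int)lod;        // floor, since lod >= 0
    float frac = lod - base;    // blend weight real trilinear would use
    return frac > threshold[y & 1][x & 1] ? base + 1 : base;
}
```

With an LOD of 2.5, two of the four pixels in a 2x2 block pick mip 2 and two pick mip 3, so the average matches the true blend weight.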
Sure, but it's hard to speak of parallelism here. You can only say it keeps the CPU busy with other tasks while it waits for a result. On a Pentium III, still only three micro-instructions are retired per clock cycle, and this has to happen in order. It can execute up to 40 micro-instructions ahead in ideal situations, but my pixel loop is mostly around 100 instructions long, so it's not like it can do the same operation in parallel. And even if this buffer were longer, it would still have to wait until the execution units are free, which probably won't happen before the first pixel has almost completely finished. An Itanium solves most of these problems, but I don't expect one on my desk in the next few years. Hyper-Threading is also a good way forward for improving parallelism, but we'll need more execution units to see the difference.
I wouldn't bother putting too much effort in. The Celeron and P3 are somewhat sensitive to scheduling, the Athlon is pretty insensitive, and the P4 is totally insensitive. In my experience, unless you can reorder across 15 to 20 instructions or interleave algorithms (rare, because of register pressure), scheduling will have no effect on the P4.
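The kind of interleaving meant here can be shown with a toy reduction (a sketch of the general technique, not the shader assembler itself): a naive sum is one serial dependency chain, while splitting it across several accumulators gives the out-of-order core independent chains to keep in flight.

```cpp
#include <cstddef>
#include <vector>

// Naive sum: a single serial dependency chain; each add must wait for the
// previous one to complete.
float sum_serial(const std::vector<float>& v) {
    float s = 0.0f;
    for (float x : v) s += x;
    return s;
}

// The same reduction split over four independent accumulators. Because the
// four chains don't depend on each other, an out-of-order core can overlap
// their adds - this is the effect manual interleaving tries to achieve.
float sum_interleaved(const std::vector<float>& v) {
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0, n = v.size();
    for (; i + 4 <= n; i += 4) {
        s0 += v[i]; s1 += v[i + 1]; s2 += v[i + 2]; s3 += v[i + 3];
    }
    for (; i < n; ++i) s0 += v[i];  // leftover elements
    return (s0 + s1) + (s2 + s3);
}
```

As the post says, this mostly pays off on chips that are sensitive to scheduling; the P4's deep out-of-order window does much of it on its own.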
I think that's a touch unfair. If it's true and it really does have comparable quality, then it's an impressive project and my hat's off to him. I know you acknowledged this too, but bragging rights are important even for software renderers, and he has some. If you consider that the fastest CPUs now are perhaps half an order of magnitude ahead of that Celeron, while the fastest GPUs are less than two orders of magnitude ahead of the TNT, it's an impressive effort.
Too late! Currently it results in a 2-20% speedup, but that's for short 'artificial' benchmark situations. I hope that in real situations the shader assembler will reach 25%. I've still got two problems to solve. The first is to quickly and accurately time the differences. RDTSC gives an accurate clock cycle count, but not when the thread was interrupted by another thread. I can set the priority to real-time, but it still isn't optimal. The other problem is to find an algorithm that will quickly converge toward an ideal schedule. Surprisingly, my random algorithm gives the best results when run long enough, probably because of the processor's highly unpredictable behaviour. I still want to put some effort into it, because every percent counts and because this method is processor-independent.
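One common workaround for the interruption problem is to time the same code many times and keep the minimum: a context switch can only ever inflate a sample, never shrink it, so the minimum converges on the undisturbed cost. A portable sketch (using std::chrono here rather than raw RDTSC; the helper name is mine):

```cpp
#include <chrono>
#include <cstdint>

// Run fn() 'runs' times and return the fastest measurement in nanoseconds.
// Interrupts and thread switches only make individual samples longer, so
// taking the minimum filters them out. With RDTSC you would read the
// counter the same way, just with a cheaper (cycle-granular) clock.
template <typename F>
std::int64_t best_of(F fn, int runs) {
    std::int64_t best = INT64_MAX;
    for (int i = 0; i < runs; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        fn();
        auto t1 = std::chrono::steady_clock::now();
        std::int64_t ns =
            std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        if (ns < best) best = ns;
    }
    return best;
}
```

This also sidesteps the real-time-priority trick: you don't need to prevent interruptions, only to reject the samples they corrupt.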
One note about RDTSC: although the docs state it isn't a serialising instruction, I find it usually behaves like one. Don't call it more often than a few hundred times per frame, and certainly don't wrap your loop body in it unless the loop is doing a thousand iterations, or it will skew your results badly. I once achieved a 3x speedup (and it wasn't a small loop) by taking out two RDTSC instructions.
That's pretty easy to explain. On a Pentium III, RDTSC gets decoded into 31 micro-instructions, but the reorder buffer is only 40 entries long, so not many instructions will be able to run in parallel with it. The decoder makes this even worse: because RDTSC can only be handled by the first decoder, it takes about 15 clock cycles before it is completely decoded. In this time, all previous instructions will have finished, and other instructions can't be decoded in parallel while the first decoder handles a long instruction. So RDTSC is effectively choking the processor. It's not a serialising instruction, because it doesn't wait until the pipeline is empty like CPUID does, but it does make it very hard for instructions before and after it to execute in parallel.
Well, that explains _everything_. It fits very well with my practical observations. And you obviously know to be afraid of it.
Critical factors at .13 microns, and most especially smaller, will be current leakage and electromigration. The smaller CPUs and GPUs become, the less advisable it is to overclock them, and especially to overvolt them. As gates get physically closer together, it takes less and less current to migrate material all over the place. Many P4 overclockers are reporting very short-lived overclocking experiences with the current Northwoods, and this is why. As chips get smaller and run faster, overclocking will slowly become a lost art, I predict, with the possible exception of the exotic camp who go in for water cooling, freezing and the like. Just wait until we hit production .09 microns for things to really start getting interesting.

Current leakage is a very big problem. P4s and Athlons consume so much power because they waste so much of it. There are lots of proposals for how to defeat it, but none of them very good so far, IMO. I read something by Intel a few months ago about the company looking into "sleep transistors", possibly for .09, but definitely for .065 and smaller. These are "manager" transistors which regulate power to sections of the chip, which is cordoned off into discrete electrical units. When an area of the chip isn't being used, the sleep transistor shuts down power to it until it is needed again. But you can see how cumbersome this approach really is, which illustrates the magnitude of the problem. They really need to come at it some other way, and we've seen some interesting things about possible advances in basic gate design coming out of AMD and Intel in the past several months. But right now these are just tentative, scholarly approaches being done on campuses under limited and controlled conditions, so nobody much knows what, if anything, will come out of this research.
But something has to, that's for sure: either current leakage gets stopped (or at least heavily subdued) or chip design does. At some point economic dynamics may tip the balance, and we'll begin to see lateral advances in parallel processing weigh in to fill the need. So far it's been cheaper to design single chips and jack up the MHz in ever-rising spirals, but it may soon become cheaper to branch out for processing power in ways other than MHz. I mean, IPC is already a very important dynamic in CPU design, and it wouldn't surprise me to see it completely overshadow the current trend (at least at Intel) of less IPC and more MHz. Already with the P4's Hyper-Threading circuitry we see Intel recognizing the IPC dynamic in a very concrete fashion (of course, AMD has paid homage to it for years). Even so, current leakage hurts the HT P4 even in its first forms. But this isn't pessimistic at all; it just foreshadows a change towards an environment where IPC overshadows MHz, and I personally see that as a Very Good Thing. But those are the current trends as I see them.
Current leakage can be kept under control (by applying thick gate oxides and making transistor threshold voltages high), but the cost is slower transistors and sublinear scaling of performance with feature size. I suspect that the jump from 0.13 to 0.09 and beyond will exhibit much less than linear clock scaling for this reason (and others as well, like interconnect delay), and that the main benefit of 0.09 and smaller will simply be that you can cram more circuits into a given chip area. If this is true, clock speeds will flatten out, and the way to get more performance from new processes will mainly be through parallelism and caching. A 1 GHz GPU may be further away than one might think.
What's with the rude tone? Jealous? From what I've followed in the thread, he was responding to epicstruggle's question with an answer that would be more complete with detail regarding his software renderer - so he wasn't bragging at all, simply giving a detailed answer. Personally, I find software rendering a really interesting topic. But even if I didn't, I wouldn't go bashing other people's efforts on the subject. So, what's the point of your post, really? That you can make a post bashing other people's work and make yourself look like an asshole? Well, in that case, congratulations! You're an accomplished asshole! PS: You should make yourself shut up instead of bashing someone's work; that would be cool and useful.
Dsukio: It's beneath me to give your insulting post any more attention than to say that I wasn't being rude and I didn't bash his efforts, and it seems YOU'RE the one with the problem - not me. Now let's get back on topic, shall we? *G*
This is a common misconception. The TNT could do proper trilinear. The only problem (and a big problem, at that) was that it could only perform true trilinear filtering when single-texturing was used, meaning that only half the fillrate was available with trilinear filtering enabled, and the bandwidth savings of multitexturing were not available.
Thanks Chalnoth, that's pretty interesting. So what they do is use the same texture in both pipelines, but with the mip level one higher/lower, and then blend these bilinearly filtered samples together with the mip level fraction? That's pretty cool, but it's not very useful for my software renderer, because then I'd do all the other operations twice too. It's better for me to just write the trilinear filter directly.
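Writing the trilinear filter directly amounts to exactly that blend, just done in one pass: two bilinear lookups in adjacent mip levels, weighted by the fractional LOD. A minimal single-channel sketch (a real renderer would use packed RGBA and fixed point; all names here are illustrative, not the actual renderer's code):

```cpp
#include <cmath>
#include <vector>

// One mip level of a single-channel texture.
struct Mip { int w, h; std::vector<float> texel; };

// Texel fetch with clamp-to-edge addressing.
static float fetch(const Mip& m, int x, int y) {
    if (x < 0) x = 0;
    if (x >= m.w) x = m.w - 1;
    if (y < 0) y = 0;
    if (y >= m.h) y = m.h - 1;
    return m.texel[y * m.w + x];
}

// Bilinear sample; (u, v) in texels of this mip level.
static float bilinear(const Mip& m, float u, float v) {
    int x0 = (int)std::floor(u), y0 = (int)std::floor(v);
    float fx = u - x0, fy = v - y0;  // fractional position inside the texel
    float top = fetch(m, x0, y0)     * (1 - fx) + fetch(m, x0 + 1, y0)     * fx;
    float bot = fetch(m, x0, y0 + 1) * (1 - fx) + fetch(m, x0 + 1, y0 + 1) * fx;
    return top * (1 - fy) + bot * fy;
}

// Trilinear: bilinear samples from two adjacent mips, blended by the
// fractional LOD. (u, v) are normalized to 0..1 so the same coordinates
// address both levels; the -0.5 offset samples at texel centers.
float trilinear(const std::vector<Mip>& mips, float u, float v, float lod) {
    int l0 = (int)std::floor(lod);
    int l1 = (l0 + 1 < (int)mips.size()) ? l0 + 1 : l0;
    float f = lod - l0;
    float a = bilinear(mips[l0], u * mips[l0].w - 0.5f, v * mips[l0].h - 0.5f);
    float b = bilinear(mips[l1], u * mips[l1].w - 0.5f, v * mips[l1].h - 0.5f);
    return a * (1 - f) + b * f;
}
```

Only the two texture lookups are duplicated; everything else in the pixel loop runs once, which is the advantage over the two-pipeline trick described above.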
Well that explains where Mike A has been for the last year or so. Looks like Nick's system is reasonably comparable.
Right. Does anyone have some inspiration for a name for my project? I want to put it at sourceforge under the GPL or LGPL license. Thanks.