Gigahertz GPU?

Grall said:
TNT can't do *proper* trilinear, but it has some kind of fake trilinear I think. Anyway, a Rage128 has proper trilinear and it's the same gen as TNT. A CPU really takes it on the chin having to do the extra work to do trilinear. As did GPUs in the past of course, but not today. :)
I can do fake trilinear too, but I don't call it trilinear. If I did, I could also call dithering bilinear filtering :) I know real trilinear takes twice the number of operations, and that in hardware this just means twice the number of transistors.
Grall said:
Anyway, bragging that a celly might compete with a TNT as a software renderer once a bunch of optimizations/cheats are put in place is kinda impressive until you remember the TNT is slow as a snail today, much slower than the celly is compared to the most recent CPUs. I mean, when the GFFX benchmarks Q3A at 200+ fps at 1600*1200 *WITH* antialias on buggy hardware and early drivers... *Ahem*
Everybody on this forum (except me) has a graphics card with pixel shader support. But how many casual or average game players do you think still have a TNT2 or similar hardware? It's a really big part of the market. Counter-Strike is often still played in software mode and it's just as fun. Also think of all the mid-range laptops with crappy 3D cards but a fast Pentium III. In all these cases, a software renderer like mine can bring things like bilinear filtering and shader effects. Since the first Voodoo, the CPU has evolved a lot, but no one has ever looked at software rendering again!

Of course you are focused on maximum performance; so am I, but only on the CPU. It's another point of view. A software renderer with unlimited shader capabilities can also bring truly infinite possibilities for offline rendering, especially on multi-processor systems.

A GPU will also always lag behind in new features. For example, one of my interests is getting acceptable image quality at low frame rates by doing semi-analytical motion blur in a single pass. Edge antialiasing algorithms can also be improved, so we can have many more shades along an edge without huge fillrate and bandwidth needs. In general, a software renderer allows you to be 'smarter' instead of using brute force, because it has no fixed limitations.
Grall said:
So, what's your point, really? That you can write a software engine that runs a four year old game at an at least somewhat playable speed on a fairly weak CPU using basic quality settings, well in that case, congratulations! :) You're an accomplished programmer.
Thanks :)
Grall said:
But even you can't say it's really meaningful in any other sense than as a programming exercise, right?
If the above explanation didn't suffice, allow me to throw the ball back. What's so 'meaningful' about a GeForce 4 Ti or a Radeon 9700? Name a few games that really wouldn't be fun without them, please. I admit pretty graphics are nice to look at, but have they improved games? I'm not a big gamer, but I don't see an important difference between Quake 1 and Quake 3, or between Unreal Tournament in software and hardware mode.
Grall said:
I mean, even if you had really really uber CPU clockspeed that could offset the inherent parallelism in a GPU, memory bandwidth would still strangle you compared to the real thing...
Intel's upcoming Prescott will have an 800 MHz bus and a huge cache. It still won't be as good as the newest graphics card, but do we absolutely need more?
Grall said:
PS: You should make your software engine into a winamp visual plugin or something, that would be cool and useful. :)
LOL, you just answered your own question ;) See, a software renderer with shaders can run on any computer with a GHz processor. Many office computers have a graphics card integrated on the motherboard with only basic triangle rasterization capabilities. But wouldn't it be interesting to let even these people enjoy, for example, online 3D animations? Why pay for a graphics card if your CPU can give satisfactory graphics?

Anyway, I got the info I needed: using the name "Gigahertz" or anything similar for my renderer would only be relevant for a year or two. Yup, I only started this thread to have a name for my software renderer :D Any better suggestions? ;) I was thinking of something with "retro" in it...
 
Gubbi said:
Thanks to the wonders of modern out-of-order processors with register renaming, the next pixel can actually begin execution before the current one is finished being processed. At least as long as you don't spill temporaries to memory.
Sure, but it's hard to speak of parallelism here. You can only say it keeps the CPU busy with other tasks while it waits for a result. On a Pentium III there are still only three micro-instructions retired per clock cycle, and this has to happen in order.

It can look up to 40 micro-instructions ahead in ideal situations, but my pixel loop is mostly around 100 instructions long, so it can't really execute the same operation for two pixels in parallel. And even if this buffer were longer, it would still have to wait until the execution units are free, which probably won't happen before the first pixel is almost completely finished.

An Itanium solves most of these problems, but I don't expect one on my desk in the next few years. HyperThreading is also a good way forward for improving parallelism, but we'll need more execution units to see the difference.
 
Nick said:
I'm also working on an exhaustive instruction scheduler.
I wouldn't bother putting too much effort in.

The Celeron and P3 are somewhat sensitive to scheduling, the Athlon is pretty insensitive, and the P4 is totally insensitive. In my experience, unless you can reorder across 15 to 20 instructions or interleave algorithms (rare, because of register pressure), scheduling has no effect on the P4.
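To illustrate the "interleave algorithms" idea, here is a minimal C++ sketch; the ShadePixel and Step helpers are hypothetical stand-ins, not code from the renderer discussed in this thread. Shading two pixels per loop iteration gives the out-of-order core two independent dependency chains to work on, at the cost of roughly double the register pressure.

```cpp
#include <cstdint>

struct Interpolants { float u, v, z; };

// Stand-in for the ~100-instruction pixel routine; any pure function of the
// interpolants works for illustration.
static inline uint32_t ShadePixel(const Interpolants& p)
{
    return static_cast<uint32_t>(p.u * 255.0f) ^ static_cast<uint32_t>(p.v * 255.0f);
}

// Advance the interpolants by one pixel.
static inline Interpolants Step(Interpolants p, float du)
{
    p.u += du;
    return p;
}

void ShadeSpan(uint32_t* dest, Interpolants i, float du, int count)
{
    int x = 0;
    for (; x + 1 < count; x += 2)
    {
        Interpolants a = i;
        Interpolants b = Step(i, du);   // pixel x+1's inputs don't depend on pixel x's result

        // The two calls share no data, so their instructions can be interleaved
        // by the compiler's scheduler and the CPU's out-of-order core.
        dest[x]     = ShadePixel(a);
        dest[x + 1] = ShadePixel(b);

        i = Step(b, du);
    }
    if (x < count)
        dest[x] = ShadePixel(i);        // leftover odd pixel
}
```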
 
Grall said:
Anyway, bragging that a celly might compete with a TNT as a software renderer once a bunch of optimizations/cheats are put in place is kinda impressive until you remember the TNT is slow as a snail today, much slower than the celly is compared to the most recent CPUs.
I think that's a touch unfair. If it's true and it really does have comparable quality, then it's an impressive project and my hat is off to him. I know you acknowledged this too, but bragging rights are important even for software renderers, and he has some :)

If you consider that the fastest CPUs now are perhaps half an order of magnitude ahead of that Celeron, while the fastest GPUs are less than two orders of magnitude ahead of the TNT, it's an impressive effort.
 
Dio said:
I wouldn't bother putting too much effort in.
Too late! ;) Currently it gives a 2-20% speedup, but that's for short, 'artificial' benchmark situations. I hope it will reach 25% in real situations with the shader assembler.

I still have two problems to solve. The first is to time the differences quickly and accurately. RDTSC gives an accurate clock cycle count, but not when the code is interrupted by another thread. I can set the thread priority to real-time, but it still isn't optimal.

The other problem is finding an algorithm that quickly converges towards an ideal schedule. Surprisingly, my random algorithm gives the best results when run long enough, probably because of the processor's highly unpredictable behaviour.

I still want to put some effort into it, because every percent counts and because this method is processor-independent.
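A rough sketch of such a random scheduling search, for anyone curious; the Instruction struct, dependency test and timeSchedule callback here are hypothetical scaffolding, not the actual shader assembler. The idea is simply to swap adjacent, independent instructions at random and keep the new order only when it measures faster.

```cpp
#include <cstdint>
#include <functional>
#include <random>
#include <vector>

struct Instruction { uint32_t readMask, writeMask; /* opcode, operands, ... */ };

// True if 'later' must stay after 'earlier' (RAW, WAR or WAW hazard).
static bool DependsOn(const Instruction& earlier, const Instruction& later)
{
    return (later.readMask  & earlier.writeMask) ||
           (later.writeMask & earlier.readMask)  ||
           (later.writeMask & earlier.writeMask);
}

// 'timeSchedule' assembles and times a candidate ordering (e.g. with RDTSC)
// and returns a cycle count; how that is done is left to the caller.
std::vector<Instruction> RandomSearch(
    std::vector<Instruction> best,
    int attempts,
    const std::function<uint64_t(const std::vector<Instruction>&)>& timeSchedule)
{
    if (best.size() < 2)
        return best;

    std::mt19937 rng(1234);
    uint64_t bestTime = timeSchedule(best);

    for (int i = 0; i < attempts; i++)
    {
        std::vector<Instruction> candidate = best;
        std::uniform_int_distribution<size_t> pick(0, candidate.size() - 2);
        size_t a = pick(rng);

        // Only swap adjacent, independent instructions so the program stays correct.
        if (DependsOn(candidate[a], candidate[a + 1]))
            continue;
        std::swap(candidate[a], candidate[a + 1]);

        uint64_t t = timeSchedule(candidate);
        if (t < bestTime)   // keep the swap only if it measured faster
        {
            best = candidate;
            bestTime = t;
        }
    }
    return best;
}
```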
 
One note about RDTSC: although the docs state it isn't a serialising instruction, I find it usually behaves like one.

Don't call it more often than a few hundred times per frame - certainly don't wrap your loop in it unless the loop is doing a thousand iterations, or it will skew your results badly.

I once achieved a 3x speedup (and it wasn't a small loop) by taking out two rdtsc instructions.
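In other words, read the counter once around a big batch of iterations and divide, rather than wrapping every iteration. A minimal sketch using the `__rdtsc` intrinsic (GCC/Clang header shown; MSVC provides the same intrinsic in `<intrin.h>`); the loop body is just a stand-in for the code under test:

```cpp
// Time a whole batch of iterations with one RDTSC pair and divide,
// instead of wrapping each iteration (which skews the result badly).
#include <cstdint>
#include <cstdio>
#include <x86intrin.h>   // __rdtsc

volatile uint32_t sink;   // keeps the loop from being optimized away

int main()
{
    const int iterations = 100000;

    uint64_t start = __rdtsc();
    for (int i = 0; i < iterations; i++)
    {
        sink = sink * 1664525u + 1013904223u;   // stand-in for the code under test
    }
    uint64_t cycles = __rdtsc() - start;

    std::printf("~%llu cycles per iteration\n",
                static_cast<unsigned long long>(cycles / iterations));
    return 0;
}
```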
 
Dio said:
One note about RDTSC is that although in the docs it states it isn't a serialising instruction I find it usually behaves like one.
That's pretty easy to explain. On a Pentium III, RDTSC gets decoded into 31 micro-instructions, but the reorder buffer is only 40 entries long, so not many instructions can run in parallel with it.

The decoder makes this even worse. Because RDTSC can only be handled by the first decoder, it takes about 15 clock cycles before it is completely decoded. In that time, all previous instructions will have finished. Other instructions also can't be decoded in parallel while the first decoder handles such a long instruction. So RDTSC effectively chokes the processor.

It's not a serialising instruction because it doesn't wait until the pipeline is empty, like CPUID does, but it does make it very hard for instructions before and after it to execute in parallel.
 
Well, that explains _everything_. It fits very well with my practical observations. And you obviously know to be afraid of it :)
 
Nick said:
Hi,

Can anyone estimate when GPUs will break the GHz limit? The GeForce FX will probably start at 500 MHz, but it's not unlikely that card manufacturers will clock them a bit higher. How soon do you expect extreme overclockers to reach 1000 MHz, and when do you think an official GHz GPU will be available?

The .13 micron process has proven very stable in the first few GHz for CPUs, so what can we expect for GPUs? Do you think GPU clock speed is an important factor for performance, or do we just need more transistors? What will be the consequences for cooling?

Lots of questions, just curious ;)

Critical factors at .13 microns, and especially smaller, will be current leakage and electromigration. The smaller CPUs and GPUs become, the less advisable it is to overclock them, and especially to overvolt them. As gates get physically closer together, it takes less and less current to migrate material all over the place. Many P4 overclockers are reporting very short-lived overclocking experiences with the current Northwoods, and this is why. As chips get smaller and run faster, overclocking will slowly become a lost art, I predict, with the possible exception of the exotic camp who go in for water cooling, freezing and the like. Just wait until we hit production .09 microns for things to really start getting interesting.

Current leakage--very big problem. The P4 and Athlon consume so much power because they waste so much of it. There are lots of proposals on how to defeat it--but none of them very good so far, IMO. I read something from Intel a few months ago about the company looking into "sleep transistors"--possibly for .09, but definitely for .065 and smaller. These are "manager" transistors which regulate power to sections of the chip that have been cordoned off into discrete electrical units. When an area of the chip isn't being used, the sleep transistor shuts off its power until it is needed again.

But you can see how cumbersome this approach really is, which illustrates the magnitude of the problem. They really need to come at it some other way, and we've seen some interesting things about possible advances in basic gate design coming out of AMD and Intel in the past several months. But right now these are just tentative, scholarly approaches being tried on campuses under limited and controlled conditions--so nobody really knows what, if anything, will come out of this research. But something has to, that's for sure, because either current leakage gets stopped (or at least heavily subdued) or chip design itself gets stopped--that's about it.

At some point economic dynamics may tip the balance and we'll begin to see lateral advances in parallel processing weigh in to fill the need. So far it's been cheaper to design single chips and jack up the MHz in ever-rising spirals, but it may soon become cheaper to branch out for processing power in ways other than MHz. I mean, IPC is already a very important dynamic in CPU design, and it wouldn't surprise me to see it completely overshadow the current trend (at least at Intel) of less IPC and more MHz. Already with the P4's Hyper-Threading circuitry we see Intel recognizing the IPC dynamic in a very concrete fashion (of course, AMD has paid homage to it for years). Even so, current leakage hurts the HT P4 even in its first forms.

But this isn't pessimistic at all; it just foreshadows a change towards an environment where IPC overshadows MHz, and I personally see that as a Very Good Thing...;) But those are the current trends as I see them.
 
Current leakage can be kept under control (by applying thick gate oxides and making transistor threshold voltages high), but the cost is that you get slower transistors and sublinear scaling of performance with feature size - I suspect that the jump from 0.13 to 0.09 and further will exhibit much less than linear clock scaling for this reason (and other reasons as well, like interconnect delay), and that the main benefit of 0.09 and smaller will just be that you can cram more circuits into a given chip area.

If this is true, clock speeds will flatten out and the way to get more performance from new processes will mainly be through parallelism and caching - a 1 GHz GPU may be further away than one might think.
 
Grall said:
Anyway, bragging that a celly might compete with a TNT as a software renderer once a bunch of optimizations/cheats are put in place is kinda impressive until you remember the TNT is slow as a snail today, much slower than the celly is compared to the most recent CPUs. I mean, when the GFFX benchmarks Q3A at 200+ fps at 1600*1200 *WITH* antialias on buggy hardware and early drivers... *Ahem*

What's with the rude tone? Jealous? From what I've followed of the thread, he was responding to epicstruggle's question with an answer made more complete by the detail about his software renderer - so he wasn't bragging at all, simply giving a detailed answer. Personally, I find software rendering a really interesting topic. But even if I didn't, I wouldn't go bashing other people's efforts on the subject.

Grall said:
So, what's your point, really? That you can write a software engine that runs a four year old game at an at least somewhat playable speed on a fairly weak CPU using basic quality settings, well in that case, congratulations! :) You're an accomplished programmer. But even you can't say it's really meaningful in any other sense than as a programming exercise, right? I mean, even if you had really really uber CPU clockspeed that could offset the inherent parallelism in a GPU, memory bandwidth would still strangle you compared to the real thing...


*G*

PS: You should make your software engine into a winamp visual plugin or something, that would be cool and useful. :)

So, what's the point of your post, really? That you can make a post bashing other people's work and make yourself look like an asshole? Well, in that case, congratulations! :) You're an accomplished asshole!

PS: You should make yourself shut up instead of bashing someone's work, that would be cool and useful. :)
 
Dsukio:

It's beneath me to give your insulting post any more attention than to say that I wasn't being rude and I didn't bash his efforts, and it seems YOU'RE the one with the problem - not me.

Now let's get back on topic, shall we?


*G*
 
Grall said:
TNT can't do *proper* trilinear, but it has some kind of fake trilinear I think. Anyway, a Rage128 has proper trilinear and it's the same gen as TNT. A CPU really takes it on the chin having to do the extra work to do trilinear. As did GPUs in the past of course, but not today. :)

This is a common misconception. The TNT could do proper trilinear. The only problem (and a big problem, at that) was that it could only perform true trilinear filtering when single-texturing is used, meaning that only half the fillrate was available with trilinear filtering enabled, and the bandwidth savings of multitexturing was not available.
 
Chalnoth said:
This is a common misconception. The TNT could do proper trilinear. The only problem (and a big problem, at that) was that it could only perform true trilinear filtering when single-texturing is used, meaning that only half the fillrate was available with trilinear filtering enabled, and the bandwidth savings of multitexturing was not available.
Thanks Chalnoth, that's pretty interesting. So what they do is use the same texture in both pipelines, but one mip level higher/lower, and then blend the two bilinearly filtered samples with the mip-level fraction? That's pretty cool, but it's not very useful for my software renderer, because then I'd do all the other operations twice too. It's better for me to just write the trilinear filter directly.
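That relationship, spelled out as a C++ sketch (hypothetical FetchTexel callback, texel-space coordinates, no wrapping; not code from the TNT or from the renderer discussed here): trilinear is just two bilinear samples from adjacent mip levels blended by the fractional LOD, i.e. roughly double the work of a single bilinear sample.

```cpp
#include <cmath>
#include <cstdint>

struct Color { float r, g, b, a; };

// Tiny stand-in for a texture: a callable returning the texel at (x, y) of a
// given mip level. In a real renderer this would index a mip chain in memory.
using FetchTexel = Color (*)(int mip, int x, int y);

static inline Color Lerp(const Color& a, const Color& b, float t)
{
    return { a.r + (b.r - a.r) * t,
             a.g + (b.g - a.g) * t,
             a.b + (b.b - a.b) * t,
             a.a + (b.a - a.a) * t };
}

// One bilinear sample: 4 texel fetches + 3 lerps.
static Color SampleBilinear(FetchTexel fetch, int mip, float u, float v)
{
    int   x  = static_cast<int>(std::floor(u));
    int   y  = static_cast<int>(std::floor(v));
    float fu = u - x;
    float fv = v - y;
    Color top    = Lerp(fetch(mip, x, y),     fetch(mip, x + 1, y),     fu);
    Color bottom = Lerp(fetch(mip, x, y + 1), fetch(mip, x + 1, y + 1), fu);
    return Lerp(top, bottom, fv);
}

// Trilinear = two bilinear samples from adjacent mip levels, blended by the
// fractional LOD - exactly the blend described above.
Color SampleTrilinear(FetchTexel fetch, float lod, float u, float v)
{
    int   mip  = static_cast<int>(std::floor(lod));
    float frac = lod - mip;
    return Lerp(SampleBilinear(fetch, mip,     u, v),
                SampleBilinear(fetch, mip + 1, u, v),
                frac);
}
```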
 
Grall said:
It's beneath me to give your insulting post any more attention than to say that I wasn't being rude and I didn't bash his efforts, and it seems YOU'RE the one with the problem - not me.
I'll give you that. :)

Grall said:
Now let's get back on topic, shall we?
Ok! (Check your PMs!)
 
Well that explains where Mike A has been for the last year or so. Looks like Nick's system is reasonably comparable.
 
Grall said:
Now let's get back on topic, shall we?
Right. Does anyone have some inspiration for a name for my project? I want to put it on SourceForge under the GPL or LGPL license. Thanks.
 