Gigahertz GPU?

Nick

Veteran
Hi,

Can anyone estimate when GPUs will break the GHz limit? The GeForce FX will probably start at 500 MHz, but it's not unlikely that card manufacturers will clock them a bit higher. How soon do you expect extreme overclockers to reach 1000 MHz, and when do you think an official GHz GPU will be available?

The 0.13 micron process has proven capable of reaching the first few GHz for CPUs, so what can we expect for GPUs? Do you think GPU clock speed is an important factor for performance, or do we just need more transistors? What will the consequences be for cooling?

Lots of questions, just curious ;)
 
Either NV40 or NV45, IMHO, for official 1 GHz core clock speeds.
With insane cooling, an NV35 with low-k might reach a 1 GHz core.

CPUs rely on ever higher clock speeds because increasing die size doesn't make as much of a difference. With GPUs, increasing die size makes a big difference, because the work is all parallel and little information is needed to draw something: the GPU can process several vertices, triangles, pixels and so on at the same time.

So for GPUs, die size and clock speed are equally important. David Kirk also said somewhere (I think it was in an ExtremeTech interview) that this is why they're beating Moore's law and will continue to do so. They benefit from two factors while CPUs mostly benefit from one.

Consequences for cooling? *cough* Flow FX *cough* Or maybe water cooling, too.


Uttar
 
I have no idea. :) But how fast would a CPU have to be to replace what a GPU can do? Hope that makes sense, or is this a very stupid question?
 
1 GHz GPU: NV50 on a 0.065u process is my guess. Increasing clock speeds too sharply increases the amount of power needed to draw each pixel as well as circuit development time, so I doubt NV/ATI will try that hard to stay ahead of linear scaling with process feature size.

A CPU that can match today's GPUs? I'd guess at ~30-60 GHz CPU speed to match e.g. an R300, assuming of course that the IPC of CPUs keeps up with today's Pentium 4s.
 
Weren't NVidia going to a two-chip solution next, with the pixel shader running at 1.5 times the speed of the vertex shader, to do True-time T&L?

If so, one of the chips might hit 1 GHz within a year to 18 months, I suppose?
 
epicstruggle said:
I have no idea. :) But how fast would a CPU have to be to replace what a GPU can do? Hope that makes sense, or is this a very stupid question?
My software renderer handles Quake 3 scenes at 20 FPS on my 1.2 GHz Celeron, while my TNT2 at 125 MHz renders it at 80 FPS at comparable settings and quality. So to get near a GeForce 4 (about 20 times faster than a TNT2?) we probably need a 100 GHz CPU :LOL:

So can I conclude that a GHz GPU isn't to be expected in the next two or three years? I'm talking about an official chip of course, with a conventional cooling solution for normal PCs...
 
Nick said:
epicstruggle said:
I have no idea. :) But how fast would a CPU have to be to replace what a GPU can do? Hope that makes sense, or is this a very stupid question?
My software renderer handles Quake 3 scenes at 20 FPS on my 1.2 GHz Celeron, while my TNT2 at 125 MHz renders it at 80 FPS at comparable settings and quality. So to get near a GeForce 4 (about 20 times faster than a TNT2?) we probably need a 100 GHz CPU :LOL:

So can I conclude that a GHz GPU isn't to be expected in the next two or three years? I'm talking about an official chip of course, with a conventional cooling solution for normal PCs...

NV45 is to be released in nearly 3 years if nVidia doesn't get any delays again...
I'd be surprised if the NV40 had a 1 GHz core (but I'd be delighted by that) - but the NV45 likely will IMO. After all, a mature 0.09 micron process and low-k (or maybe something else) could do miracles...

Uttar
 
Not that far away. NV40-NV50 most likely.

BTW, the PlayStation 3's GPU, currently thought to be the 'Visualizer' (also known as Graphics Synthesizer 3), is thought to be clocked at 2 GHz when the PS3 reaches the market in 2005. The EE3/Broadband Engine CPU is currently indicated to be 4 GHz, with the GPU likely being clocked at half that rate. Check the console forum thread "Playstation III Architecture".

I would assume Xbox 2/Xbox Next to have at least a GHz GPU (NV50/55) if not faster. Maybe the next Xbox will have multiple GPUs on a single die, to counter Cell/Visualizer. I don't see Intel doing that. But that's for another thread...
 
Nick:
"My software renderer handles Quake 3 scenes at 20 FPS on my 1.2 GHz Celeron"

With what kind of texture filtering? I have a hunch it's not proper trilinear... Also, note that Q3A really doesn't feature that advanced graphics, it's just a base texture and a lightmap after all. Throw in one layer of dot3 bumpmapping and performance on your celly goes right down the tubes instantly. :) Forget about any pixel shading stuff... We're talking slideshow city here. :D

Yes, even on the fastest of fast P4s and stuff.


*G*
 
Didn't realize that a CPU would perform so poorly. Any major reasons why? (I know that there is a fair amount of parallelism involved :)) But is this why the difference is so large?
 
A vanilla TNT2 would outrender a 3.06 GHz P4.

It's because of the sheer amount of calculation that goes into each pixel; for example, trilinear filtering needs to take eight texture samples and blend them. x86 is VERY bad for those ops.
 
If we're talking extreme overclocking with LN2, then maybe you'll see it on the NV30 or NV35. It's really hard to say, not knowing anything at all about the chip. Also, performance at 1 GHz would be crap, because its RAM would totally be holding it back.
 
epicstruggle said:
Didn't realize that a CPU would perform so poorly. Any major reasons why? (I know that there is a fair amount of parallelism involved :)) But is this why the difference is so large?
Just think about doing bilinear filtering. You need 4 samples from a single texture. To combine them, you need 3 adds and a divide (or shift), so that's 4 memory accesses plus 4 instructions. If you're doing this in the pixel shader, that becomes 4 FPU instructions. Some GPUs can do 8 of these operations per cycle.

Now, toss in vertex shaders, fog, alpha blending, depth checks, etc, and you get the picture that there are a lot of computations involved and all of these are done per pixel. Of course, you can optimize some of this, but there is massive parallelism here.
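
To make that concrete, here's a rough C++ sketch of a single bilinear fetch (the Texture struct and the fixed-point weights are just for illustration, not anyone's actual renderer code):

[code]
#include <algorithm>
#include <cstdint>

// Minimal 32-bit RGBA texture description (illustrative only).
struct Texture { int width, height; const uint32_t* texels; };

static uint32_t channel(uint32_t t, int shift) { return (t >> shift) & 0xFF; }

// One bilinear fetch: four texel reads plus a pile of multiplies,
// adds and shifts per colour channel. A GPU does several of these
// per clock; a CPU executes every instruction one by one.
uint32_t bilinear(const Texture& tex, float u, float v)  // u, v in [0,1]
{
    float x = u * (tex.width  - 1);
    float y = v * (tex.height - 1);
    int x0 = (int)x, y0 = (int)y;
    int x1 = std::min(x0 + 1, tex.width  - 1);
    int y1 = std::min(y0 + 1, tex.height - 1);
    int fx = (int)((x - x0) * 256.0f);   // 8-bit fixed-point weights
    int fy = (int)((y - y0) * 256.0f);

    uint32_t t00 = tex.texels[y0 * tex.width + x0];   // 4 memory accesses
    uint32_t t10 = tex.texels[y0 * tex.width + x1];
    uint32_t t01 = tex.texels[y1 * tex.width + x0];
    uint32_t t11 = tex.texels[y1 * tex.width + x1];

    uint32_t result = 0;
    for (int shift = 0; shift < 32; shift += 8)       // blend each channel
    {
        uint32_t top = channel(t00, shift) * (256 - fx) + channel(t10, shift) * fx;
        uint32_t bot = channel(t01, shift) * (256 - fx) + channel(t11, shift) * fx;
        result |= ((top * (256 - fy) + bot * fy) >> 16) << shift;
    }
    return result;
}
[/code]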
 
Not to mention that CPUs aren't really designed specifically for these types of tasks. Dedicated hardware beats out generalized hardware in almost every case.
 
epicstruggle said:
Didn't realize that a CPU would perform so poorly. Any major reasons why? (I know that there is a fair amount of parallelism involved :)) But is this why the difference is so large?

Rendering is by its very nature a highly parallel operation (each pixel is independent, and hence can be processed simultaneously).

This means that adding more and more functional units to a GPU gives you tangible benefits in performance terms. For CPUs this is not the case, as most programs at some level require operations to be performed in sequential order, otherwise the program simply doesn't work. Therefore adding extra functional units to a CPU (beyond a given number, roughly 2-4) doesn't yield worthwhile performance gains.
 
Grall said:
With what kind of texture filtering? I have a hunch it's not proper trilinear...
Bilinear filtering at full precision. A TNT2 doesn't support trilinear either.
Grall said:
Also, note that Q3A really doesn't feature that advanced graphics, it's just a base texture and a lightmap after all. Throw in one layer of dot3 bumpmapping and performance on your celly goes right down the tubes instantly. :)
I don't see why dot3 would be a big performance hit. It's just another texel read, a dot3 and a few simple operations. Fillrate would probably halve.
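Roughly, one dot3 pixel on the CPU is something like this (an illustrative sketch only, not code from my renderer; the light vector is assumed to be in tangent space already):

[code]
#include <algorithm>
#include <cstdint>

struct Vec3 { float x, y, z; };

// One dot3 bump-mapped pixel: an extra texel read (normalTexel),
// a dot product and a per-channel multiply into the base colour.
uint32_t dot3Pixel(uint32_t baseTexel, uint32_t normalTexel, Vec3 light)
{
    // Expand the normal map texel from [0,255] to [-1,1].
    Vec3 n = { ((normalTexel >> 16) & 0xFF) / 127.5f - 1.0f,
               ((normalTexel >>  8) & 0xFF) / 127.5f - 1.0f,
               ( normalTexel        & 0xFF) / 127.5f - 1.0f };

    // N.L, clamped to zero.
    float intensity = std::max(0.0f, n.x * light.x + n.y * light.y + n.z * light.z);

    // Modulate the base colour channel by channel, keeping alpha as-is.
    uint32_t result = baseTexel & 0xFF000000;
    for (int shift = 0; shift < 24; shift += 8)
        result |= (uint32_t)(((baseTexel >> shift) & 0xFF) * intensity) << shift;
    return result;
}
[/code]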
Grall said:
Forget about any pixel shading stuff... We're talking slideshow city here. :D
I use run-time conditional compilation of x86 code. In other words, I only use shaders. The x86 instruction set is much richer than the DX9 shader instruction set, and I have access to all of RAM, which is a big advantage. The CPU is much more versatile and can produce effects that are not yet available on the GPU.
Grall said:
Yes, even on the fastest of fast P4s and stuff.
My renderer is not even half as optimized as it could be. I still need to put much of the vertex pipeline in SSE code. I can clear my z-buffer in zero time by inverting the compare instruction, or I can totally eliminate the z-buffer by using zero-overdraw portal clipping. Perspective correction can be made faster by only doing one division every eight pixels; the difference is not noticeable. For minified textures I can use a cheaper averaging filter without quality loss. Things like mipmap selection can be improved by using lookup tables. I'm also working on an exhaustive instruction scheduler.

All this and more should bring me near a stable 40 FPS, on my Celeron that is. On the newest Pentium 4 it could probably compete with a TNT2, and have better quality.

Sure, GPUs are optimized to do complex tasks in one clock cycle in parallel, but that doesn't mean a GHz CPU has "seconds per frame" performance. I don't want to compete with the GPU, only provide an acceptable alternative with its own advantages...
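
For those wondering about the perspective correction trick: the idea is to interpolate u/z, v/z and 1/z linearly across the scanline, do the actual divide only at the end of every 8-pixel span, and step affinely inside the span. A rough sketch (simplified; drawPixel is just a stand-in for the inner loop, not my actual code):

[code]
#include <algorithm>

// One perspective divide per 8-pixel span; affine stepping inside the span.
// The error over 8 pixels is small enough to be invisible.
void drawScanline(int x0, int x1,
                  float uz, float vz, float iz,     // u/z, v/z, 1/z at x0
                  float duz, float dvz, float diz,  // per-pixel gradients
                  void (*drawPixel)(int x, float u, float v))
{
    float rz = 1.0f / iz;                 // divide for the left edge
    float u0 = uz * rz, v0 = vz * rz;

    for (int x = x0; x < x1; x += 8)
    {
        int span = std::min(8, x1 - x);

        // Step u/z, v/z, 1/z to the end of the span and divide once.
        uz += duz * span;  vz += dvz * span;  iz += diz * span;
        rz = 1.0f / iz;
        float u1 = uz * rz, v1 = vz * rz;

        // Affine interpolation between the two perspective-correct ends.
        float du = (u1 - u0) / span, dv = (v1 - v0) / span;
        float u = u0, v = v0;
        for (int i = 0; i < span; i++, u += du, v += dv)
            drawPixel(x + i, u, v);

        u0 = u1;  v0 = v1;
    }
}
[/code]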
 
epicstruggle said:
Didn't realize that a CPU would perform so poorly. Any major reasons why? (I know that there is a fair amount of parallelism involved :)) But is this why the difference is so large?
One pixel in my software renderer takes roughly 60 clock cycles in ideal circumstances. The CPU has to process every instruction separately and largely in order. It also doesn't know that the next pixel is independent of the current one, so it can't execute that code yet.

A GPU does not have to do the operations in linear order, because it knows in advance that for every pixel it will have to do perspective correction, z-compare, mipmap selection, texel fetching, blending operations, etc. It can even start the next few pixels before the current pixel gets through the pipeline. A CPU doesn't have that advantage because it doesn't know its operations in advance. Put eight of those pipelines in parallel and you can have eight pixels per clock cycle. On the other hand, a GPU can't perform tasks that it wasn't designed for. It 'expects' to render triangles and is optimized for only that task.

OK, CPUs do have some level of out-of-order execution and parallelism, but it's much more limited because a CPU is designed to be versatile. An EPIC processor like the Itanium takes parallelism for CPUs to the maximum, but it's not available for desktops yet...
 
Nick:

TNT can't do *proper* trilinear, but it has some kind of fake trilinear I think. Anyway, a Rage128 has proper trilinear and it's the same gen as TNT. A CPU really takes it on the chin having to do the extra work to do trilinear. As did GPUs in the past of course, but not today. :)

Anyway, bragging that a celly might compete with a TNT as a software renderer once a bunch of optimizations/cheats are put in place is kinda impressive, until you remember the TNT is slow as a snail today - it lags current GPUs far more than the celly lags the most recent CPUs. I mean, the GFFX already benchmarks Q3A at 200+ fps at 1600*1200 *WITH* antialiasing, on buggy hardware and early drivers... *Ahem*

So, what's your point, really? That you can write a software engine that runs a four-year-old game at a somewhat playable speed on a fairly weak CPU using basic quality settings? Well, in that case, congratulations! :) You're an accomplished programmer. But even you can't say it's really meaningful in any other sense than as a programming exercise, right? I mean, even if you had a really, really uber CPU clock speed that could offset the inherent parallelism of a GPU, memory bandwidth would still strangle you compared to the real thing...


*G*

PS: You should make your software engine into a Winamp visualization plugin or something, that would be cool and useful. :)
 
Nick said:
The CPU has to process every instruction separately and largely in order. It also doesn't know that the next pixel is independent of the current one, so it can't execute that code yet.

Thanks to the wonders of modern out-of-order processors with register renaming, the next pixel can actually begin execution before the current one has finished being processed. At least as long as you don't spill temporaries to memory.
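
For instance (a toy sketch; shadePixel just stands in for the real per-pixel work):

[code]
#include <cstdint>
#include <vector>

// Stand-in for the real per-pixel work: a few dependent ops, all in registers.
static inline uint32_t shadePixel(int x, int y)
{
    uint32_t c = (uint32_t)(x * 31 + y * 17);
    return c * c + 1;
}

// The two calls per iteration are independent and their temporaries never
// spill to memory, so an out-of-order core with register renaming can
// overlap them even though the source is written sequentially.
void shadeScanline(std::vector<uint32_t>& dest, int width, int y)
{
    for (int x = 0; x + 1 < width; x += 2)
    {
        uint32_t a = shadePixel(x, y);
        uint32_t b = shadePixel(x + 1, y);
        dest[x]     = a;
        dest[x + 1] = b;
    }
}
[/code]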

Cheers
Gubbi.
 