Tim's thoughts

Ailuros said:
Did that already a page or two before. I switched between HW T&L/VS off and on in UT2k3. Here again for those who missed it:

SW T&L on 2.0GHz Athlon: 27.9 fps
HW T&L/VS on 4 VS@325MHz + 2.0GHz Athlon: 50.4 fps
It would be far more interesting if you turned on SW rasterisation (make sure you enable full alpha and not stippled, otherwise the comparison isn't valid)....
 
Crazyace said:
Did that already a page or two before. I switched between HW T&L/VS off and on in UT2k3. Here again for those who missed it:

SW T&L on 2.0GHz Athlon: 27.9 fps
HW T&L/VS on 4 VS@325MHz + 2.0GHz Athlon: 50.4 fps

Hi Ailuros,

The interesting point about your comparison comes when you consider that running UT2k3 'game' code probably takes 50% of your CPU cycles... So effectively the SW T&L will probably be just as quick on its own as the HW T&L.
You can't actually say that without more information - since the hardware T&L time should be effectively parallelised with the CPU time, it is impossible to extract from these numbers much meaningful information on how much time the hardware T&L is taking. Consider the following horrible ASCII art (which ignores pixel rendering time for clarity) -

Code:
   HWTL Case                      SWTL case

HWTL  |  |                           |
Time     |                           |
         |                           |
         | Gameloop time             |  Gameloop time
         |                           |
         |                           |
         |... 50 fps                 |
                                       |
                                       |
                                       |
                                       |   SWTL Time
                                       |
                                       |
                                       | ... 28 fps

In this case we have the game with HWTL running at 50 fps and with SWTL at around 28 fps, but the hardware TL time is parallelised with the gameloop (since they are running decoupled by a large command buffer on independent processors). In the case shown the hardware transform is taking about 1/7th of the time of the software transform and yet we get something that looks like the performance case given by Ailuros. If we were to remove the gameloop we could see that the hardware is actually transforming much faster than the software version, but is being limited by the gameloop speed. Similarly the hardware TL could actually be taking just as much time as the SWTL, but we would still get the performance case that Ailuros showed because of the parallelism. You cannot extract the relative performances of the HW and SW transform paths from the numbers that were given.
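To make the same point numerically, here is a minimal sketch (my own illustration; the millisecond costs below are invented assumptions chosen only to land near the 50 and 28 fps figures above). In the software case the gameloop and T&L costs add; in the hardware case the frame time is set by whichever side is slower, because the two run in parallel behind the command buffer.

Code:
#include <stdio.h>

/* Hypothetical per-frame costs in milliseconds (assumptions, not measurements). */
#define GAMELOOP_MS 20.0   /* game code, physics, driver overhead          */
#define SWTL_MS     16.0   /* software transform & lighting on the CPU     */
#define HWTL_MS      4.0   /* the same work on the GPU's vertex units      */

static double max2(double a, double b) { return a > b ? a : b; }

int main(void)
{
    /* Software T&L runs on the same CPU as the game loop, so the costs add. */
    double sw_frame_ms = GAMELOOP_MS + SWTL_MS;

    /* Hardware T&L runs on the GPU, decoupled by a command buffer, so the
       frame time is limited by whichever side is slower, not by the sum.   */
    double hw_frame_ms = max2(GAMELOOP_MS, HWTL_MS);

    printf("SW T&L: %.1f fps\n", 1000.0 / sw_frame_ms);  /* ~27.8 fps */
    printf("HW T&L: %.1f fps\n", 1000.0 / hw_frame_ms);  /* 50.0 fps  */
    return 0;
}

With these made-up numbers the hardware transform is five times cheaper than the software one, yet the only thing visible in the fps figures is the gameloop limit - which is exactly why the relative transform speeds can't be read off the benchmark results.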
 
For TL, I would expect a fast CPU to already be competitive with a hardware solution, because TL is easy.

The real performance problem is in rasterisation and texturing, where you have to handle scan conversion, per-pixel mip calculations, texel-miss-per-pixel, and filtering...
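As one small, concrete piece of that per-pixel cost, here is a rough sketch of the standard mip-level selection a software rasteriser has to perform for every textured pixel; the function name and the log2-based formulation are my own for illustration, not anything described above.

Code:
#include <math.h>
#include <stdio.h>

/* Pick a mip level for one pixel from the screen-space texture-coordinate
   derivatives, i.e. how many texels the pixel covers along x and y.       */
int mip_level(float du_dx, float dv_dx, float du_dy, float dv_dy,
              int tex_w, int tex_h, int num_mips)
{
    /* Pixel footprint in texel space along each screen axis. */
    float sx = du_dx * (float)tex_w, tx = dv_dx * (float)tex_h;
    float sy = du_dy * (float)tex_w, ty = dv_dy * (float)tex_h;
    float rho_x = sqrtf(sx * sx + tx * tx);
    float rho_y = sqrtf(sy * sy + ty * ty);
    float rho   = rho_x > rho_y ? rho_x : rho_y;

    int level = (rho > 1.0f) ? (int)log2f(rho) : 0;   /* two sqrts + a log2 */
    if (level >= num_mips)
        level = num_mips - 1;
    return level;
}

int main(void)
{
    /* A pixel that covers roughly 5x5 texels of a 256x256, 9-mip texture. */
    printf("mip level: %d\n", mip_level(0.02f, 0.0f, 0.0f, 0.02f, 256, 256, 9));
    return 0;
}

Two square roots and a log2 for every pixel, before a single texel has been fetched or filtered, is the kind of cost dedicated hardware absorbs and a software rasteriser pays in full.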
 
JohnH,

I don't think I can get to SW rasterization in UT2k3, but could be wrong. I think they're planning to re-enable it in UT2k4.

Here are a couple of 3dmark2k1 scores; no idea if they could be of any use:

Software T&L

  • High Polygon Count 1 light: 5.3 MTriangles/sec
  • High Polygon Count 8 lights: 2.8 MTriangles/sec
  • Vertex Shader: 69.4 fps
  • Point Sprites: 11.4 MSprites/sec

Pure Hardware T&L

  • High Polygon Count 1 light: 63.1 MTriangles/sec
  • High Polygon Count 8 lights: 16.0 MTriangles/sec
  • Vertex Shader: 192.3 fps
  • Point Sprites: 35.3 MSprites/sec

***edit:

I also ran Nature on both occasions, but AFAIK any PS effects it has also ran with SW T&L: 47.4 vs 91.3 fps.
 
I may be wrong, but IIRC there's a patch for UT2K3 enabling SW rendering... I think I remember some comments by Tim Sweeney saying that by removing most effects, the game was playable at 800x600 on very fast CPUs.
 
" but IIRC there's a patch for UT2K3 enabling SW rendering... I think I remember some comments by Tim Sweeney saying that by removing most effects, the game was playable at 800x600 on very fast CPUs."
-------------------------------------------------------------------------------------
This is correct, although you have to download a separate renderer along with the patch to get it to work. As impressive as it is, it's still a blurry mess. The software renderer is said to have a full DX7 feature set, but it must be emulating current nVidia hardware because the filtering looks like scrot. Seriously, I don't think it's doing bilinear at all, but rather a full-screen blend effect (or maybe it's bilinear on the final 2D image). Some effects are stripped down, and you can't hit high res without it choking. I know I couldn't get 800x600 running playably, but perhaps someone with a fast P4 could. Honestly, it's about the same effect you would get if you ran the game on older (Voodoo3, Intel 810, etc.) hardware, only with worse filtering.

http://unreal.epicgames.com/
or directly to the renderer....
http://unreal.epicgames.com/files/pixodrv.zip

and the ini changes....

[PixoDrv.PixoRenderDevice]
FogEnabled=True
Zoom2X=True
LimitTextureSize=True
LowQualityTerrain=True
TerrainLOD=10
SkyboxHack=True
FilterQuality3D=1
FilterQualityHUD=1
HighDetailActors=False
SuperHighDetailActors=False
ReduceMouseLag=False
DesiredRefreshRate=0
DetailTexMipBias=0.000000
Use16bitTextures=False
Use16bit=True
UseStencil=False
UseCompressedLightmaps=False
DetailTextures=False
UsePrecaching=True
SimpleMaterials=True

should then allow you to select it by modifying your UT2003.ini file as follows.

RenderDevice=PixoDrv.PixoRenderDevice
;RenderDevice=D3DDrv.D3DRenderDevice
;RenderDevice=Engine.NullRenderDevice
;RenderDevice=OpenGLDrv.OpenGLRenderDevice

c:
 
"(i.e. very few people that I talk to outside of computer graphics seem to notice any real difference between Toy Story 1's graphics, and Finding Nemo's)"

Try comparing Toy Story to Revolutions. I think even the lamest of laymen could tell a big difference.


As far as his CPU comments are concerned... perhaps an integration of the two, in one form or another, into a single chip with embedded memory? I'm not nearly as qualified to speculate as most here, but I'll indulge myself this once. :p
 
gurgi said:
"(i.e. very few people that I talk to outside of computer graphics seem to notice any real difference between Toy Story 1's graphics, and Finding Nemo's)"

Try comparing Toy Story to Revolutions. I think even the lamest of laymen could tell a big difference.

Agreed. Finding Nemo is a stylised, computer-based animation. Most people seeing computer graphics in mainstream movies don't even realise that is what they are watching, because the graphics can be so realistic. Until we can reach "indistinguishable from real life" levels of realtime rendering, there is still a long way to go before we start to see a slowdown in the increasing quality we see every year. Increases in processing power will be constantly eaten up by increasingly realistic rendering.
 
Dio said:
For TL, I would expect a fast CPU to already be competitive with a hardware solution, because TL is easy.

The real performance problem is in rasterisation and texturing, where you have to handle scan conversion, per-pixel mip calculations, texel-miss-per-pixel, and filtering...

I agree with everything you've stated, but in a fundamental, governing-dynamic way I think it's becoming apparent that CPUs will utilize the influx of sub-100nm logic by going concurrent.

This is going to intersect the current trend in consumer 3D, which is towards a unified shader, sooner rather than later. These are the tasks I look at as the core of tomorrow's processing load. The tasks you listed are all "dumb" in that they're almost entirely fragment ops which scale linearly and rely on sheer parallel processing for their speed-up - hell, stuff like that should be done in dedicated logic constructs/pipelines.

But IMHO the future is in shaders and flexible, 'plastic' computational resources - an area in which the CPU is quickly evolving into a niche filler. Beyond the tasks you listed (which aren't that logic-heavy), what good will a GPU be in 2010? Other than an additional 200mm^2 chunk of logic? Architecturally, what's going to distinguish nVidia's or ATI's Shaders/PPP front-end from anything more than a computationally restricted and arbitrarily defined CPU?
 
By concurrency, I presume you mean hyperthreading or (if it ever happens) desktop multicore.

Texel-miss-per-pixel is not a parallel processing or concurrency issue.

It is the fundamental reason that graphics chips are designed the way they are, and the fundamental reason it will become very hard for a general-purpose CPU to catch up. Depending on how you define it, you could say that most of the transistors on a GPU are tasked with solving this problem. :)

At the moment it isn't particularly a limiting factor for software renderers because the rest of it is so slow, but if you're trying to generate 1 pixel in 10-20 cycles then it damn well will be.

I can't look as far ahead as 2010 - as I show above, my five-year predictions of graphics performance have always been wrong. There is indeed a possibility by that point that the graphics chip will be reduced to 'add-in' features rather than must-haves (as sound cards are now).

Architecturally, what's going to distinguish nVidia's or ATI's Shaders/PPP front-end from anything more than a computationally restricted and arbitrarily defined CPU?
Nothing. But a few carefully placed restrictions can greatly increase performance. The aim of graphics manufacturers right now seems to be to relax the restrictions in a manner that doesn't reduce performance.
 
Dio said:
For TL, I would expect a fast CPU to already be competitive with a hardware solution, because TL is easy.
First of all, fast CPUs aren't competitive.

But even then, the main reason to go hardware T&L has nothing to do with how fast the hardware is at it (and it won't until hardware supports robust higher-order surfaces, for scalability of triangle count across disparate hardware), but rather with the very significant number of CPU cycles freed by offloading the processing.
 
Dio said:
I can't look as far ahead as 2010 - as I show above, my five-year predictions of graphics performance have always been wrong. There is indeed a possibility by that point that the graphics chip will be reduced to 'add-in' features rather than must-haves (as sound cards are now).
Heh. The way Intel develops graphics chips, and the nonexistence of any graphics chips by ATI, make me think it'll be a very long time before CPUs have integrated graphics that are remotely usable for games.
 
Ailuros said:
If I were to count, say, the R3xx's 4 VS and 8 PS ALUs as SSE2 units in a metaphorical sense, I'd already reach a total of 12 SSE2 units, and even then the R3xx family isn't close to the arithmetic or computational efficiency I could imagine for PS/VS3.0 or DX-Next compliant GPUs.

And now take into account clock rate differences...

A 3.2 GHz P4 would need to be able to complete 1.5 SSE2 operations per cycle to be slightly faster than a Radeon 9800 Pro. Right now it only does ~1 SSE2 op per cycle, but this could be increased in the future if it was needed. If there was a market, it wouldn't be unheard of for a future P4 to have 4 concurrent SSE2 units, which would make a future hypothetical P4 significantly faster than a hypothetical 16-pipe R420+.
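A back-of-the-envelope version of that comparison, with the assumptions made explicit: the 12-vector-ALU count comes from the quoted post, while the roughly 380 MHz 9800 Pro core clock and "one 4-wide op per ALU per clock" figure are my own assumptions for illustration.

Code:
#include <stdio.h>

int main(void)
{
    /* Assumed figures, for illustration only. */
    double gpu_clock_hz     = 380e6;  /* Radeon 9800 Pro core clock (assumed) */
    double gpu_vec4_per_clk = 12.0;   /* 8 PS + 4 VS ALUs, 1 vec4 op each     */
    double cpu_clock_hz     = 3.2e9;  /* P4 at 3.2 GHz                        */

    double gpu_vec4_per_s = gpu_clock_hz * gpu_vec4_per_clk;    /* ~4.56e9 */

    /* How many 4-wide SSE2 ops per cycle would the P4 need to match it? */
    double needed_sse2_per_clk = gpu_vec4_per_s / cpu_clock_hz; /* ~1.4   */

    printf("GPU: %.2f G vec4 ops/s\n", gpu_vec4_per_s / 1e9);
    printf("P4 needs ~%.2f SSE2 ops/cycle to match\n", needed_sse2_per_clk);
    return 0;
}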

The thing to remember is that CPUs already do all the hard stuff. GPUs only do the simple stuff, but as the interface moves to PS/VS 3 and beyond, the GPUs will have to start doing the hard stuff and run into all the problems that CPUs solved decades ago... starting with the branch.
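A tiny sketch of why the branch is awkward for this kind of hardware (my own illustration, not anything from the post above): a machine that runs many pixels in lockstep cannot really jump, so the usual trick is to evaluate both sides of an if/else for every lane and blend the results with a mask, paying for both paths.

Code:
#include <stdio.h>

#define LANES 8   /* pixels processed in lockstep */

int main(void)
{
    float x[LANES] = { 1, -2, 3, -4, 5, -6, 7, -8 };
    float out[LANES];

    /* Per-lane "if (x > 0) out = x * 2; else out = -x;" on lockstep hardware:
       both branches are computed for every lane, then selected by a mask.   */
    for (int i = 0; i < LANES; i++) {
        float then_val = x[i] * 2.0f;   /* taken path,     computed always */
        float else_val = -x[i];         /* not-taken path, computed always */
        int   mask     = x[i] > 0.0f;   /* per-lane predicate              */
        out[i] = mask ? then_val : else_val;
    }

    for (int i = 0; i < LANES; i++)
        printf("%g ", out[i]);
    printf("\n");
    return 0;
}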


Aaron Spink
speaking for myself inc
 
Well, each shader unit can do a 4-wide madd per cycle. Across the 12 units that's 96 floating-point operations per cycle, or the equivalent of 24 SSE(1/2) units. So even with an 8x frequency advantage, the P4 would still need 2 full SSE units more to match the peak floating-point performance of the R350.
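Unpacking that arithmetic as a quick sketch, using the 12-unit count from earlier in the thread and counting a madd as two floating-point ops and one SSE unit as one 4-wide op (4 flops) per cycle:

Code:
#include <stdio.h>

int main(void)
{
    int units          = 12;  /* 8 PS + 4 VS vector ALUs on R350        */
    int madds_per_unit = 4;   /* one 4-wide madd per unit per clock     */
    int flops_per_madd = 2;   /* multiply + add                         */

    int gpu_flops_per_clk = units * madds_per_unit * flops_per_madd;   /* 96 */

    int sse_flops_per_clk = 4;  /* one 4-wide SSE op counted as 4 flops */
    int equiv_sse_units   = gpu_flops_per_clk / sse_flops_per_clk;     /* 24 */

    /* With an 8x clock advantage the P4 still needs 24/8 = 3 such units. */
    printf("%d flops/clk on the GPU, equivalent to %d SSE units;\n",
           gpu_flops_per_clk, equiv_sse_units);
    printf("at 8x the clock that is still %d SSE units' worth per cycle.\n",
           equiv_sse_units / 8);
    return 0;
}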

That and loads of bandwidth.

I don't see it happening.

CPUs and GPUs are built to solve different problems. GPUs fall into the much-touted streaming architecture category, ie built primarily to exploit spatial locality, whereas modern CPUs use a growing percentage of the die for cache RAM simply because the problems that you want to solve on a CPU have good (or better) temporal locality, as found in your typical pointer-chasing C/C++/Java object-oriented spaghetti.

Cheers
Gubbi
 
Simon F said:
One that executes 100s of instructions in parallel? That would be a remarkably "high end" cpu.

Define instruction. I've yet to see a GPU that executes more than 2 instructions from a program in parallel. I've seen some that execute a lot of programs in parallel, but people have been doing that since the dark ages; I've seen none that execute more than 2 instructions from a single program in parallel.

Sorry, you've lost me. If you mean graphics (or, for that matter, some stream processing applications) then GPUs are already faster and cheaper than CPUs.

Not for a large class of streaming and graphics applications. Most GPUs are toys at best, lacking any illusion of being Turing complete. As such, there is a wide variety of workloads they simply can't handle at a reasonable speed.

We've seen this sort of cycle in the past, where dedicated HW was replaced by HW made from commodity (programmable) chips, which again was replaced by custom HW **. At the moment, powerful CPUs are relatively expensive and I can't see that changing any time in the near future.

Actually, I would wager that powerful CPUs are a lot cheaper than pretty much any discrete GPU you can buy. Both AMD and Intel have huge cost savings from being vertically integrated along their manufacturing and design chain.

Oh, you mean for you to buy? Well that has more to do with perceived value than anything else. It is no secret that Nvidia and ATI sell silicon for 1/10th the price per mm^2 that AMD and Intel do. If AMD/Intel wanted the GPU market, they could easily integrate it. Right now though, they just make more money not doing it.


Aaron Spink
speaking for myself inc
 
The amount of die area you have to waste to go beyond 1 or 2 IPC simply isn't worth it for the kind of workloads we want to run on it (games); MIMD with low IPC for any given thread of execution is plainly the better approach. CPUs might not be designed for that at the moment, and Intel might make a good living keeping it that way... but IMO that is only the momentum of legacy.

Let's see how well Sony does and where the future lies. I think wide superscalar machines are on their way out.
 
The argument about exactly what Sony are doing is what I've been trying to stay away from. That's a fight for a different thread.

It just isn't relevant to this, because this is the PC space we are talking about. Within 3-4 years, I doubt we'll see either the 'PC space' moving away from x86, or an x86-class chip with a 'Sony-like' architecture.
 
gurgi said:
"(i.e. very few people that I talk to outside of computer graphics seem to notice any real difference between Toy Story 1's graphics, and Finding Nemo's)"

Try comparing Toy Story to Revolutions. I think even the lamest of laymen could tell a big difference.

While I thought Reloaded's CGI was fairly obvious (I assume Revolutions' is the same, though I won't waste my time seeing the movie to verify that), and Finding Nemo's is obviously not meant to be realistic (though The Incredibles has some incredible human animation, leaps beyond Toy Story), I think that new Nike football commercial has some pretty incredible CGI animation.
 
Gubbi said:
Well, each shader unit can do a 4-wide madd per cycle. Across the 12 units that's 96 floating-point operations per cycle, or the equivalent of 24 SSE(1/2) units. So even with an 8x frequency advantage, the P4 would still need 2 full SSE units more to match the peak floating-point performance of the R350.

Bah, it is one 3/4-element SIMD op and 0/1 scalar op per cycle per unit; MACs have nothing to do with it. That's a max of 8 4-element SIMD ops per cycle, which is exactly what I said. Don't get hung up on MACs, they are a detail. In fact, execution units are pretty much insignificant details in today's processor designs. SIMD units are very, very easy to design and build at high speed.




Gubbi said:
CPUs and GPUs are built to solve different problems. GPUs fall into the much-touted streaming architecture category, ie built primarily to exploit spatial locality, whereas modern CPUs use a growing percentage of the die for cache RAM simply because the problems that you want to solve on a CPU have good (or better) temporal locality, as found in your typical pointer-chasing C/C++/Java object-oriented spaghetti.

Yet GPUs are also using a growing portion of their die for cache....

GPUs were originally designed to offload simple fixed-function operations from the CPU, and they have done a good job. There is no reason to break a texture filter up into its basic elements and execute each separately. You know you want to filter the texture, just do it. These are the kinds of operations that GPUs were designed for.
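As a rough illustration of the kind of indivisible fixed-function operation being described, here is a sketch of a single bilinear texture fetch; the texture layout and names are invented for the example, and real filtering hardware does the equivalent work in dedicated logic rather than as a sequence of general-purpose instructions.

Code:
#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* A minimal greyscale texture: one byte per texel, row-major. */
typedef struct {
    const uint8_t *texels;
    int width, height;
} Texture;

/* Point-sample with wrap addressing (assumes power-of-two dimensions). */
static float fetch(const Texture *t, int x, int y)
{
    x &= t->width - 1;
    y &= t->height - 1;
    return (float)t->texels[y * t->width + x];
}

/* Bilinear filter at normalised coordinates (u, v): 4 texel reads and
   3 lerps; dedicated filtering hardware does all of this in one shot. */
static float sample_bilinear(const Texture *t, float u, float v)
{
    float x = u * (float)t->width  - 0.5f;
    float y = v * (float)t->height - 0.5f;
    int   x0 = (int)floorf(x), y0 = (int)floorf(y);
    float fx = x - (float)x0,  fy = y - (float)y0;

    float a = fetch(t, x0,     y0);
    float b = fetch(t, x0 + 1, y0);
    float c = fetch(t, x0,     y0 + 1);
    float d = fetch(t, x0 + 1, y0 + 1);

    float top    = a + (b - a) * fx;
    float bottom = c + (d - c) * fx;
    return top + (bottom - top) * fy;
}

int main(void)
{
    const uint8_t data[4] = { 0, 255, 0, 255 };  /* 2x2 texture: dark/bright columns */
    Texture t = { data, 2, 2 };
    /* Sampling dead centre lands halfway between the two column values. */
    printf("%.1f\n", sample_bilinear(&t, 0.5f, 0.5f));
    return 0;
}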

The issue is that the programming model for GPUs is shifting them out of that domain and into the domain of general processors. They are no longer just given a set of inputs that select which fixed-function unit to use or how that fixed-function unit should output something. They are being asked to run increasingly complex programs, with the end goal being to run arbitrary-length programs with any degree of flow control that the programmer wishes.

The language used to program these GPUs will move from the assembly/C-like HLSL to more complex programming languages analogous to C++. The issues faced by modern CPUs will be faced by what are then called GPUs. These GPUs of the future will have more in common with our modern-day CPUs than with our modern-day GPUs.

Aaron Spink
speaking for myself inc
 
x86 becoming a less viable platform for entertainment (and for HPC for that matter; traditional CPUs may have a small heyday in that area at the moment, but that will soon be history too) will be a force for change. I don't really care about the exact time schedule when it will be felt.

Modern CPUs are wide superscalar monsters not designed for area-efficiency in parallel situations... I don't see GPUs (or Sony) heading that way at all; their cores will have more in common with CPUs of a decade ago than with anything we see today.

Hell, IBM's Bluegene processors are pretty much a throwback compared to modern CPUs.
 