Price of Graphics boards

Sxotty · Jul 28, 2003

When Intel and AMD started their fight for the performance crown, prices dropped a staggering amount, as good capitalists would expect.

What is the deal with graphics boards? New ones come out and the fight is on, but there has not been a large drop in price. The 9700 Pro is still relatively expensive; other boards just disappear instead of ever getting cheap. It seems that waiting for the right time to buy is a non-issue and the right time is always now, because things are always expensive.

Dave Baumann · Jul 28, 2003

For a high end P4 Intel is pulling in about $180, for an R350 or NV35 the IHV's are pulling in about $18 - and these chips are more complicated than P4's. As you can see, there is plenty of room for the CPU manufacturers to move, but not a hell of a lot for the 3d manufacturers.

digitalwanderer · Jul 28, 2003

Video cards are like cars...

They depreciate in value DRAMATICALLY just after being bought, I've been buying 'em used for the last couple of 3 cards and REALLY getting a lot more bang for me buck!

Nick · Jul 28, 2003

DaveBaumann said:
...and these chips are more complicated than P4's.

What? In transistor count they win, but I wouldn't exactly call them more complicated. Does a GPU have unlimited stored programs? Does it have virtual addressing? Does it do out-of-order execution? Jump prediction? Task switching? Exception handling? Does it allow you to access all memory? Does it have 8-way associative caches? Does it have a trace cache? Does it have forwarding? Hyper-Threading?

I know quite a lot about the CPU architecture, so if you're serious about GPUs being more complicated then please educate me...

Pete · Jul 28, 2003

My amateur perspective: I still view the 3D market ATi and nVidia compete in as primarily a luxury market, whereas Intel's and AMD's business is really more vital. Also, ATi and nVidia are at the mercy of third parties to provide basically their whole product, from materials to manufacture, whereas Intel and AMD own their own fabs.

Sxotty · Jul 28, 2003

So Dave, do you have any idea what the chips actually cost to produce?

I am curious what percentage is the memory, board, chip, engineering costs and so forth. The reason I wonder is that obviously then one could see what costs if lowered would reduce the price considerably.

andypski · Jul 28, 2003

Nick said:
...

Many of the things you describe don't occur in VPUs, however there are plenty of other things that do.

Does it allow you to access all memory?

It certainly accesses all the memory that is attached to it, yes. From an individual program that you would write, typically no.

Does it have 8-way associative caches?

Certainly has associative caches. Whether they're 8-way or not would be private implementation details.

Hyper-Threading?

Might well be running an awful lot of threads.

Task switching?

Every time it processes a new pixel?

Let's pick out some more examples -

Does a processor have a built in memory controller? (Well, I guess Opteron does now). Dual memory interface (AGP and local)? Display hardware? Video stream processing? IDCT? Full featured BitBLT with ROPs? Line drawing? Anti-aliasing? Hierarchical depth-culling? Colour compression? Depth compression? Texture decompression? Colourspace conversion? Input and output gamma correction? Primitive assembly? High order surface tessellation? Triangle rasterisation? Texturing? Trilinear filtering? Anisotropic filtering? Clipping? Scissoring? Overlay? Alpha blending? Fogging? Stencil buffering?

How about doing all of these simultaneously? How about doing all of these simultaneously while also running vertex and pixel shading programs on multiple vectors?

There's plenty of complexity in a VPU. A CPU's complexity is in the program. A VPU's is in all the many things that it does simultaneously to give you a high-speed graphics display.

A CPU can do many of the above things, but it doesn't have dedicated hardware for it, and will do it extremely slowly. It's hard work making a VPU.

- Andy.

WaltC · Jul 28, 2003

I think it's the size of the markets. New cpu sales amount to 100 million or more a year. The high-end 3D-card gaming market is probably 5% of that or less annually. In fact, I'd estimate the whole 3D gaming market (from avid gamers buying 8-10 games a year to people who buy 1-2 games a year) is no greater than 15M worldwide at any given time. So I think the higher costs of the higher-end 3D cards reflect the need to make much higher profits per unit sold in order to drive R&D, and the small size of the 9700P-class 3D market compared to the general cpu market at large.

It's not really as different as it might seem, though. If I want to I can go out and buy a cpu from either Intel or AMD which costs a multiple of what my R9800P costs---not including the motherboard I might also have to buy to support those cpus. Volume and yields are a major part of the equation.

Bambers · Jul 28, 2003

Also, a cpu is just pretty much that, a chip.

On a graphics card you also have memory (fast and expensive stuff on high end boards) and a board to add to the cost.

Take the cpu price, add at least the ram and possibly some of the mobo cost, and now compare.

Nick · Jul 28, 2003

Ok it seems like there are lots of things about GPU architecture I don't understand:

andypski said:
It certainly accesses all the memory that is attached to it, yes. From an individual program that you would write, typically no.

As far as I know the CPU's memory access method is a lot more complicated than a GPU's. It has to calculate addresses, look it up in the L1 cache, translate to physical address, lookup in L2 cache, generate a page interrupt when data is not in RAM, etc. As far as I know a GPU works with very limited caches, addresses are hardwired, unpaged and always physical. Overall less complicated and dedicated.

Certainly has associative caches. Whether they're 8-way or not would be private implementation details.

That makes little sense to me. For example a texture cache doesn't need associativity because you only need texels close together. Or is there a unified cache for all texture units? In that case it would be logical but I expect every texture unit can have its own tiny cache. Don't know, just sounds most probable to me...

Might well be running an awful lot of threads.

Really? I thought a GPU just had hardwired control? Has this always been the case or only recently? I'm just wondering because else I would expect GPUs are usable for much more than rendering. If I understand correctly the GPU is just a bunch of SIMD units?

Every time it processes a new pixel?

Euh, you got me confused again. Can't that be hardwired? I mean, many GPUs have 4 pixel pipelines, not 4 pixel threads, don't they? I'm particularly interested in this because my software rendering project (see sig.) can't process pixels in complete parallel, and I thought a GPU could.

Does a processor have a built in memory controller? (Well, I guess Opteron does now).

You answered your own question

I thought a memory controller was not very complicated in design? All it has to do is generate row and column signals from a linear address?

Dual memory interface (AGP and local)?

Hmm, I'm getting intrigued. How is AGP handled? Does it contain control about where the data has to be stored on the card's memory or is that also directed by the GPU? Or am I completely wrong?

Display hardware?

You mean the RAMDAC? That's a separate chip which doesn't have to be very complicated as far as I know.

Video stream processing?

Ok I don't really count that to the GPU, but you're right, nowadays it's also integrated on the chip. Isn't it implemented with a programmable DSP?

IDCT?

Same as above? Doesn't really have anything to do with 3D rendering?

Full featured BitBLT with ROPs?

A blit doesn't seem a complicated operation to me, and ROPs only need a basic ALU. Or I'm probably terribly wrong again...

Line drawing?

If I recall correctly, very little hardware has line drawing capabilities and most just draw a thin rectangle?

Anti-aliasing?

Well I'm not up to date with anti-aliasing techniques, but super-sampling just requires a bigger frame buffer and color averaging (with gamma correction). The BitBLT unit could maybe help here?

Hierarchical depth-culling?

Let's see. All you need is a few comparators and calculating the addresses of the hierarchical depth buffers. Of course a lot more complicated than straightforward depth buffers but it doesn't seem like it needs a radical design change.

Colour compression? Depth compression? Texture decompression? Colourspace conversion? Input and output gamma correction?

Also seem like 'plugins' that don't influence the rest of the chip much. So not too complicated.

Primitive assembly? High order surface tessellation?

Really wouldn't know. Seems complicated

Triangle rasterisation?

Hardwired Bresenham algorithm and interpolators?

Texturing? Trilinear filtering? Anisotropic filtering?

Can also be done with dedicated units? Again this seems like a 'component' to me that hasn't changed much in 'functionality' over the years.

Clipping? Scissoring?

I'm sure clipping also can be hardwired. Don't know anything about scissoring.

Overlay? Alpha blending? Fogging? Stencil buffering?

Also seems like basic extensions of the pipelines...

How about doing all of these simultaneously? How about doing all of these simultaneously while also running vertex and pixel shading programs on multiple vectors?

Doesn't every unit just work independently? It seems nowhere near as complicated as out-of-order execution where everything is shared and there are hundreds of exceptions. Just to mention a few: jump misprediction, register renaming, resource dependency, interrupts, monitoring, address generation interlock, locked instruction execution, blocking instructions, etc. There are no independent components that handle this.

There's plenty of complexity in a VPU. A CPU's complexity is in the program. A VPU's is in all the many things that it does simultaneously to give you a high-speed graphics display.

There surely is a lot of complexity in the microcode, but it wouldn't be complex if the hardware wasn't complex. In a GPU every unit can just work nearly independently. Control isn't influenced much by the states of other units. It just processes what comes in and passes it to the next unit. Again, I could be very wrong about this because I never learned GPU architecture at university, but that's how I see things. If I'm terribly wrong please correct me.

A CPU can do many of the above things, but it doesn't have dedicated hardware for it, and will do it extremely slowly. It's hard work making a VPU.

Sure I never implied a GPU is 'easy', but I find it a bit illogical to think it's more 'complicated' than a CPU. The way I see it, if you change one thing in a CPU, the whole design has to change. For a GPU it seems that certain things are completely reusable for every design and can relatively simply be extended in functionality without depending on the rest of the chip's implementation.

I'm not sure of this, but I think a modern CPU, compared to a GPU with only one pipeline, has more 'functional' transistors. I mean leaving the caches and such aside. I'm sure a GPU can do thousands of operations per clock, but a CPU has hundreds of micro-instructions 'in flight'. But a main difference is that each of these micro-instructions can change the control of the whole pipeline. A GPU works much more 'linear' and computed results don't influence the execution of other parts of the chip directly.

Thanks

OpenGL guy · Jul 28, 2003

Nick said:
andypski said:

It certainly accesses all the memory that is attached to it, yes. From an individual program that you would write, typically no.

As far as I know the CPU's memory access method is a lot more complicated than a GPU's. It has to calculate addresses, look it up in the L1 cache, translate to physical address, lookup in L2 cache, generate a page interrupt when data is not in RAM, etc. As far as I know a GPU works with very limited caches, addresses are hardwired, unpaged and always physical. Overall less complicated and dedicated.

A lot of what you said about CPUs applies to GPUs as well. Except what you said about "hardwired addresses". Huh? Nothing's hardwired in that way because addresses can change (textures/backbuffers/Z buffers can be in local or AGP at any available address within the address space).

Certainly has associative caches. Whether they're 8-way or not would be private implementation details.

That makes little sense to me. For example a texture cache doesn't need associativity because you only need texels close together. Or is there a unified cache for all texture units? In that case it would be logical but I expect every texture unit can have its own tiny cache. Don't know, just sounds most probable to me...

Associativity has little to do with locality and everything to do with multiple data sources.

Might well be running an awful lot of threads.

Really? I thought a GPU just had hardwired control? Has this always been the case or only recently? I'm just wondering because else I would expect GPUs are usable for much more than rendering. If I understand correctly the GPU is just a bunch of SIMD units?

Even a single pixel pipeline could be multi-threaded. Imagine that you're doing 8 layer multitexturing. A single pixel will take a long time because you have to access 8 textures. If you waited for each texture before going to the next you would waste a lot of texture cache performance. Another way to do it is to have the pixel get the first texture, then work on the next pixel while the first is waiting for the next texture, and so on. You can expand this to pixel shaders too...
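The latency-hiding idea above can be illustrated with a toy cycle count. This is only a sketch under assumed rules, not a model of any real chip: the pipeline may issue one texture fetch per cycle, each fetch returns after `fetch_latency` cycles, and a pixel cannot issue its next fetch until the previous one has returned. The function name `simulate` and all parameter values are made up for the example.

```python
from collections import deque

def simulate(num_pixels, num_textures, fetch_latency, max_in_flight):
    """Count cycles to finish all pixels with at most max_in_flight pixels active."""
    waiting = deque(range(num_pixels))   # pixels not yet started
    in_flight = {}                       # pixel id -> [fetches_left, ready_at_cycle]
    cycle = 0
    finished = 0
    while finished < num_pixels:
        # Retire pixels whose last fetch has returned.
        done = [p for p, (left, ready) in in_flight.items()
                if left == 0 and ready <= cycle]
        for p in done:
            del in_flight[p]
            finished += 1
        # Admit new pixels into any free slots.
        while waiting and len(in_flight) < max_in_flight:
            in_flight[waiting.popleft()] = [num_textures, cycle]
        # Issue at most one fetch this cycle, for the first pixel that is ready.
        for state in in_flight.values():
            if state[0] > 0 and state[1] <= cycle:
                state[0] -= 1
                state[1] = cycle + fetch_latency
                break
        if finished < num_pixels:
            cycle += 1
    return cycle

# Hypothetical numbers: 64 pixels, 8 texture layers, 20-cycle fetch latency.
one_at_a_time = simulate(64, 8, 20, max_in_flight=1)
interleaved = simulate(64, 8, 20, max_in_flight=20)
print(one_at_a_time, interleaved)
```

With one pixel at a time, every fetch stalls the pipe for its full latency. With as many pixels in flight as the latency is long, the pipe can issue a fetch nearly every cycle, so the same work finishes in far fewer cycles.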

Display hardware?

You mean the RAMDAC? That's a separate chip which doesn't have to be very complicated as far as I know.

Separate chip? We've had integrated RAMDACs for years.

<snip>

You seem to be under the impression that "hardwired" equals "simple" when that's not always the case.

I'm not sure of this, but I think a modern CPU, compared to a GPU with only one pipeline, has more 'functional' transistors. I mean leaving the caches and such aside. I'm sure a GPU can do thousands of operations per clock, but a CPU has hundreds of micro-instructions 'in flight'. But a main difference is that each of these micro-instructions can change the control of the whole pipeline. A GPU works much more 'linear' and computed results don't influence the execution of other parts of the chip directly.

How many pixels do you think a GPU has in flight at any given moment? And functional transistors? How many ALUs does a P4 have? How many does an R300 have?

RussSchultz · Jul 28, 2003

How many pixels ARE in flight at once?

Dave Baumann · Jul 28, 2003

Nick said:
As far as I know the CPU's memory access method is a lot more complicated than a GPU's. It has to calculate addresses, look it up in the L1 cache, translate to physical address, lookup in L2 cache, generate a page interrupt when data is not in RAM, etc. As far as I know a GPU works with very limited caches, addresses are hardwired, unpaged and always physical. Overall less complicated and dedicated.

You might want to read this for at least one implementation of memory addressing and threading within a graphics processor.

PSarge · Jul 28, 2003

RussSchultz said:
How many pixels ARE in flight at once?

Hundreds

Nick · Jul 28, 2003

Sorry for taking this completely off-topic but I'm kind of intrigued...

OpenGL guy said:
A lot of what you said about CPUs applies to GPUs as well. Except what you said about "hardwired addresses". Huh? Nothing's hardwired in that way because addresses can change (textures/backbuffers/Z buffers can be in local or AGP at any available address within the address space).

What I meant is that you have dedicated hardware for computing addresses, at virtually no cost. While on a Pentium 4 they have to be computed separately. But that was my idea of it before reading your answer. So a GPU also requires extra cycles for setting up, computing and incrementing addresses? It's still not virtual addresses, but very close...

Associativity has little to do with locality and everything to do with multiple data sources.

If data is very local and you only have one data source you don't need associativity, right? That was my idea of how a texture cache works, with every texture unit having its own cache. But apparently a unified cache is used?

Even a single pixel pipeline could be multi-threaded. Imagine that you're doing 8 layer multitexturing. A single pixel will take a long time because you have to access 8 textures. If you waited for each texture before going to the next you would waste a lot of texture cache performance. Another way to do it is to have the pixel get the first texture, then work on the next pixel while the first is waiting for the next texture, and so on. You can expand this to pixel shaders too...

That second technique is what I referred to as pipelining. But I see that having 8 texturing stages isn't economical. So I see what you mean by threads. But are those threads on micro-instruction level (i.e. they are hard-coded control signals, in turn controlled by certain variables like number of texture layers), or real instruction level threads like on a CPU?

Separate chip? We've had integrated RAMDACs for years.

Oh, had a TNT2 for too long...

But still it isn't really a complicated component, right? I mean in a CPU you have little that could be separated from the rest...

You seem to be under the impression that "hardwired" equals "simple" when that's not always the case.

No, of course not, but with a Pentium 4 nearly nothing is dedicated and therefore needs complicated control. But operations like bilinear filtering, does a GPU have specialized 'hardwired' units for that or does it also use fully programmable add and mul units?

How many pixels do you think a GPU has in flight at any given moment? And functional transistors? How many ALUs does a P4 have? How many does an R300 have?

I know a lot about x86 programming and its architecture, but I really wouldn't know how a GPU works. Does it have instructions like 'compute texel offset from u and v coordinates' (a very common operation) or is that micro-coded or is that hardwired? In my software renderer it's implemented with SIMD instructions, which in turn are micro-coded. Is it similar on a GPU or does it have dedicated silicon for this operation, hardwired to the rest of the pipeline? I always thought the latter was the case, but can we really call it an ALU then? I must be way off here...
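For readers unfamiliar with the operation being asked about, here is a minimal sketch of 'compute texel offset from u and v coordinates' in software. It assumes nearest-neighbour sampling, wrap addressing and a plain row-major texture layout; the function name and parameters are illustrative, and actual hardware may use tiled or swizzled layouts instead.

```python
import math

def texel_offset(u, v, width, height, bytes_per_texel=4):
    """Byte offset of the texel nearest to normalised coordinates (u, v)."""
    # Scale to texel space, then wrap so coordinates outside [0, 1) repeat.
    x = math.floor(u * width) % width
    y = math.floor(v * height) % height
    # Row-major layout: rows of `width` texels stored one after another.
    return (y * width + x) * bytes_per_texel

# Hypothetical 256x256 texture with 32-bit texels.
offset = texel_offset(0.5, 0.25, 256, 256)
print(offset)
```

Even this simple version is a multiply, a floor, a wrap and a multiply-add per coordinate, which is why it is a natural candidate for either SIMD code or dedicated address logic.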

But I still don't get a few things. If GPUs are nearly as versatile as a CPU, then why don't we have raytracing implemented on a Geforce, or at least ps 3.0 support? Do the driver developers have access to the finest control like micro-instructions or is it much more high-level? And on the other side, why the hell are CPUs so slow then and use so many transistors if they have a lot less execution units? I mean look at the die of a Pentium 4: A big portion of it is floating-point execution units, but they only do a few operations per clock, while a modern GPU has dozens of floating-point units on roughly the same die space? If this is really caused by the 'outdated' x86 architecture then how come the competition doesn't have much faster CPUs?

andypski · Jul 28, 2003

I'll try to cut this down a bit as it's rather long otherwise.

Nick said:
As far as I know the CPU's memory access method is a lot more complicated than a GPU's. It has to calculate addresses, look it up in the L1 cache, translate to physical address, lookup in L2 cache, generate a page interrupt when data is not in RAM, etc. As far as I know a GPU works with very limited caches, addresses are hardwired, unpaged and always physical. Overall less complicated and dedicated.

Certainly memory accesses in VPUs may go through a less complex path. That wasn't what you asked about originally, though.

Certainly has associative caches. Whether they're 8-way or not would be private implementation details.

That makes little sense to me. For example a texture cache doesn't need associativity because you only need texels close together. Or is there a unified cache for all texture units? In that case it would be logical but I expect every texture unit can have its own tiny cache. Don't know, just sounds most probable to me...

Don't you need associativity? Consider the multitexturing case - You can be accessing up to 16 different textures in a PS2.0 part - those textures can come from locations that immediately thrash each other in a direct-mapped cache. You won't get much benefit from caching if every texel you read evicts the ones you have...
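The thrashing case can be reproduced with a small cache simulation. This is a sketch only: the cache size, line size, LRU replacement and the two texture base addresses are all assumptions chosen so that both textures map to the same cache sets, and none of it reflects a real VPU's texture cache.

```python
from collections import OrderedDict

def hit_rate(addresses, num_lines, ways, line_size=64):
    """Hit rate of a set-associative LRU cache; ways=1 means direct-mapped."""
    num_sets = num_lines // ways
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = 0
    for addr in addresses:
        line = addr // line_size
        cache_set = sets[line % num_sets]
        if line in cache_set:
            hits += 1
            cache_set.move_to_end(line)        # mark as most recently used
        else:
            if len(cache_set) >= ways:
                cache_set.popitem(last=False)  # evict least recently used
            cache_set[line] = True
    return hits / len(addresses)

CACHE_BYTES = 64 * 64                  # 64 lines of 64 bytes (assumed)
tex_a, tex_b = 0, CACHE_BYTES * 4      # bases that collide in the cache

# A shader reading adjacent 4-byte texels alternately from both textures.
trace = []
for texel in range(256):
    trace.append(tex_a + 4 * texel)
    trace.append(tex_b + 4 * texel)

direct_mapped = hit_rate(trace, num_lines=64, ways=1)
two_way = hit_rate(trace, num_lines=64, ways=2)
print(direct_mapped, two_way)
```

In this trace the direct-mapped cache never hits, because each read evicts the line the other texture just loaded, while the 2-way cache of the same total size keeps both lines resident.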

Every time it processes a new pixel?

Euh, you got me confused again. Can't that be hardwired? I mean, many GPUs have 4 pixel pipelines, not 4 pixel threads, don't they? I'm particularly interested in this because my software rendering project (see sig.) can't process pixels in complete parallel, and I thought a GPU could.

Is each pipeline necessarily running only one pixel thread? For example - how would I keep my execution units busy in a 4 pipeline architecture if the execution units have a latency of, say, 5 clocks and instructions are co-dependent in a single dependency chain?

Does a processor have a built in memory controller? (Well, I guess Opteron does now).

You answered your own question. I thought a memory controller was not very complicated in design? All it has to do is generate row and column signals from a linear address?

It's not just physical memory control. Arbitration of requests between different clients must be managed efficiently to keep the system operating effectively, also ensuring that there are no deadlock cases or examples where high-priority clients can completely starve low-priority ones.

A VPU can be trying to do a lot of memory operations every cycle. Think depth-fetch, colour-fetch, vertex-fetch, texture-fetch, colour-write, z-write, 2D blit, command fetch...
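The starvation problem mentioned above shows up even in a toy arbiter. The sketch below compares two textbook policies, fixed priority and round-robin, for clients that all request memory every cycle. The client numbering is hypothetical, and nothing here is claimed to be how any actual memory controller arbitrates.

```python
from collections import Counter

def fixed_priority(requests_per_cycle):
    """Each cycle, grant the requesting client with the lowest index."""
    grants = Counter()
    for requesting in requests_per_cycle:
        if requesting:
            grants[min(requesting)] += 1
    return grants

def round_robin(requests_per_cycle, num_clients):
    """Each cycle, grant the first requester at or after a rotating pointer."""
    grants = Counter()
    pointer = 0
    for requesting in requests_per_cycle:
        for offset in range(num_clients):
            client = (pointer + offset) % num_clients
            if client in requesting:
                grants[client] += 1
                pointer = (client + 1) % num_clients
                break
    return grants

# Hypothetical clients: 0 = texture fetch, 1 = colour write, 2 = command fetch.
always_busy = [{0, 1, 2}] * 90
fp = fixed_priority(always_busy)
rr = round_robin(always_busy, num_clients=3)
print(dict(fp), dict(rr))
```

Under fixed priority the lowest-priority client is never served while the others keep requesting. Round-robin avoids starvation but ignores urgency, which hints at why a practical arbiter has to be more elaborate than either.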

Dual memory interface (AGP and local)?

Hmm, I'm getting intrigued. How is AGP handled? Does it contain control about where the data has to be stored on the card's memory or is that also directed by the GPU? Or am I completely wrong?

My understanding is that AGP requests go through the GART translation on the host, however a memory interface optimised entirely for local memory will be poor with AGP memory because of the very long latencies involved.

Full featured BitBLT with ROPs?

A blit doesn't seem a complicated operation to me, and ROPs only need a basic ALU. Or I'm probably terribly wrong again...

So? It's all extra. All not included in a typical CPU. All of this stuff has to go on every VPU. Some of it is changed extensively from generation to generation, some parts less so.

Many things sound simple - branch prediction sounds simple - you have a table of known branch locations and the current results of prediction. As you take/don't take the branch you alter your expectations of what happens the next time you see that branch.

I'm sure that Intel would tell me that there's a little more to it than that though.
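The "sounds simple" version of branch prediction described above is close to the classic two-bit saturating counter scheme, sketched here. The table size, the indexing by address modulo table size and the example branch address are assumptions for illustration; shipping CPUs use considerably more sophisticated predictors.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters: 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, table_size=1024):
        self.table_size = table_size
        self.counters = [1] * table_size   # start weakly not-taken

    def predict(self, pc):
        return self.counters[pc % self.table_size] >= 2

    def update(self, pc, taken):
        i = pc % self.table_size
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

def accuracy(predictor, trace):
    """Fraction of branches predicted correctly, updating after each one."""
    correct = 0
    for pc, taken in trace:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct / len(trace)

# A loop branch at a made-up address: taken nine times, then falls through.
loop_trace = [(0x400, i % 10 != 9) for i in range(1000)]
loop_accuracy = accuracy(TwoBitPredictor(), loop_trace)
print(loop_accuracy)
```

On this regular loop the counter mispredicts only around the loop exit. Irregular or correlated branches are where the simple table stops being enough.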

Anti-aliasing?

Well I'm not up to date with anti-aliasing techniques, but super-sampling just requires a bigger frame buffer and color averaging (with gamma correction). The BitBLT unit could maybe help here?

Multi-sampling requires more. Jittered sparse grid multisampling requires more still. Gamma-corrected sparse grid multisampling with colour and depth compression requires more still. All of this has to be invented, architected, designed, implemented, verified, performance optimised...
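As one small piece of that list, the difference gamma correction makes when resolving samples can be shown in a few lines. This is a sketch that assumes a simple power-law display gamma of 2.2 and four samples per pixel; the function names are illustrative and the sample positions, compression and everything else in the list are left out.

```python
GAMMA = 2.2  # assumed display gamma (simple power law)

def resolve_naive(samples):
    """Average the stored (gamma-encoded) sample values directly."""
    return sum(samples) / len(samples)

def resolve_gamma_correct(samples):
    """Decode to linear light, average, then re-encode for display."""
    linear = [s ** GAMMA for s in samples]
    return (sum(linear) / len(linear)) ** (1.0 / GAMMA)

# A pixel half covered by a white polygon over a black background.
edge_samples = [0.0, 0.0, 1.0, 1.0]
naive = resolve_naive(edge_samples)
corrected = resolve_gamma_correct(edge_samples)
print(naive, corrected)
```

The naive average stores 0.5, which a gamma 2.2 display shows as well under half brightness, so edges look darker than they should. The corrected resolve stores a higher value that displays as true half coverage.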

Hierarchical depth-culling?

Let's see. All you need is a few comparators and calculating the addresses of the hierarchical depth buffers. Of course a lot more complicated than straightforward depth buffers but it doesn't seem like it needs a radical design change.

I guarantee that it was a radical design change when it was first implemented. Just like instruction or data caching on CPUs were major design changes when they first appeared.

The depth buffer interacts with the pixel shader, doesn't it? After all the pixel shader can generate Z values. Does the Z check happen pre or post shading? There's more complex control involved than you seem to think.

Colour compression? Depth compression? Texture decompression? Colourspace conversion? Input and output gamma correction?

Also seem like 'plugins' that don't influence the rest of the chip much. So not too complicated.

You could say that instruction scheduling is a plugin that doesn't affect the rest of the chip much, it just makes it run more efficiently by choosing a better order to execute an instruction stream.

In reality that's a complete fallacy. One section of a design almost always affects others, often in terms of the amount of performance they must provide if nothing else. I may decide to split a single pipe design into a dual pipe one. Then I might call the pipes U and V, and make V less capable than U because my analysis shows that to get a good speedup on most common applications I only need to have a subset of instructions execute in V to frequently get a dual issue. Is this a small change? No, because with my increased performance I now have to redesign my entire back-end interface to deal with up to twice the amount of data being generated each cycle. I have to redesign my entire front end to avoid starving the execution pipes.

Nothing is truly separate. Everything is connected.

eg. Mightn't the addition of stencil completely change how my depth unit has to work?

Texturing? Trilinear filtering? Anisotropic filtering?

Can also be done with dedicated units? Again this seems like a 'component' to me that hasn't changed much in 'functionality' over the years.

I think you'll find that every element of VPUs changes at least as much between generations as their CPU equivalents. The basic functions of these units are well understood, but this doesn't prevent there being innovation and redesign in each generation.

How about doing all of these simultaneously? How about doing all of these simultaneously while also running vertex and pixel shading programs on multiple vectors?

Doesn't every unit just work independently?

No. Every unit works together in a complex execution stream that has to be appropriately balanced for each new design to eliminate bottlenecks and get the highest efficiency of execution. How much buffering do I need? How many pixels and vertices do I have in flight? How many Z calculations do I do each clock? How many pixels do I work on? How wide are my data paths? How do I keep performance up when doing indirected texture accesses?

It seems nowhere near as complicated as out-of-order execution where everything is shared and there are hundreds of exceptions. Just to mention a few: jump misprediction, register renaming, resource dependency, interrupts, monitoring, address generation interlock, locked instruction execution, blocking instructions, etc. There are no independent components that handle this.

Yes - this is all very complex, and there is no doubt that it makes modern CPU design a very complex thing.

For a GPU it seems that certain things are completely reusable for every design and can relatively simply be extended in functionality without depending on the rest of the chip's implementation.

I expect that at least as much stuff is redesigned between each VPU generation as between each CPU generation.

I'm not sure of this, but I think a modern CPU, compared to a GPU with only one pipeline, has more 'functional' transistors. I mean leaving the caches and such aside. I'm sure a GPU can do thousands of operations per clock, but a CPU has hundreds of micro-instructions 'in flight'. But a main difference is that each of these micro-instructions can change the control of the whole pipeline. A GPU works much more 'linear' and computed results don't influence the execution of other parts of the chip directly.

If you leave the caches aside most of the transistors from a modern CPU are gone

(That may be a joke, but it has more than an element of truth about it)

Your point is taken. Certain elements of CPU design are much more complex in terms of overall internal control than VPUs. I don't think that you're right about a CPU having more 'functional' transistors though.

How many transistors does a P4 have? Around 55 million? How much of that is cache?

How many transistors does a Radeon 9700 have? Around 100 million. How much of that is cache?

Dio · Jul 28, 2003

I think the point is this. Both sides of the argument seem to be 'Your difficult bits are simple, while my difficult bits are difficult'. It's a reasonably pointless argument. Both CPUs and VPUs are very complicated.

But to throw in my 5p: as far as I can see, the 'architectural' aspects of the CPU aren't particularly complex. The data paths are generally simpler. There's less communication, and communication requires protocols and synchronisation, and that's difficult.

Where the CPU is vastly more complex than the VPU is when it comes to proved correctness (particularly around the complex edges of the x86 ISA - look at the number of Intel and AMD errata involving things like interrupt-during-TSS), and the extensive design engineering required to achieve clock rates more than five times what a VPU achieves. Both those are huge, difficult jobs - but probably relatively cheap in terms of silicon area.

Coming back to the original point of this thread: as someone who has seen a heatsinkless R300, and a heatsinkless Athlon XP, I know which is smaller. It ain't the Athlon.

Dave H · Jul 28, 2003

Blah Blah Blah.

A ground-up implementation of a world-class high-performance CPU (concept, design, simulation, layout, debugging, fabbing, and testing) would take several times (WAG roughly 5x) the engineer man-hours as for a world-class high-performance GPU. Of course neither is designed from scratch each time around, but the ratio for the iterative design process they do undergo would be about the same.

CPUs are more complex than GPUs.

Nick · Jul 28, 2003

I'll try to cut it down too

I'm beginning to realize that I'm greatly underestimating modern VPU design, and I was still thinking that roughly the same architecture as a TNT2 is still used.

Certainly memory accesses in VPUs may go through a less complex path. That wasn't what you asked about originally, though.

Oh, sorry if that wasn't clear, but it's indeed what I meant. VPU memory access is a lot more direct. To add a bit to the picture: a CPU also has to respect read/write protection and privileges.

Don't you need associativity? Consider the multitexturing case - You can be accessing up to 16 different textures in a PS2.0 part - those textures can come from locations that immediately thrash each other in a direct-mapped cache. You won't get much benefit from caching if every texel you read evicts the ones you have...

My error. I thought every sampler had its own cache...

Is each pipeline necessarily running only one pixel thread? For example - how would I keep my execution units busy in a 4 pipeline architecture if the execution units have a latency of, say, 5 clocks and instructions are co-dependent in a single dependency chain?

I read things like "card X has N pixel pipelines and thus generates N pixels per clock". So I still don't get it: what exactly are these pipelines? Does having four pipelines just mean you have four independent execution units of every type?

A VPU can be trying to do a lot of memory operations every cycle. Think depth-fetch, colour-fetch, vertex-fetch, texture-fetch, colour-write, z-write, 2D blit, command fetch...

So, first come first serve doesn't work? I think I get it... if the front end desperately needs data to keep the pipeline full, it gets prioritized.

So? It's all extra. All not included in a typical CPU. All of this stuff has to go on every VPU. Some of it is changed extensively from generation to generation, some parts less so.

I think that's my biggest problem. I completely can't keep the different generations apart.

Multi-sampling requires more. Jittered sparse grid multisampling requires more still. Gamma-corrected sparse grid multisampling with colour and depth compression requires more still. All of this has to be invented, architected, designed, implemented, verified, performance optimised...

Yummy, now I -definitely- want to become a driver developer when I graduate.

The depth buffer interacts with the pixel shader, doesn't it? After all the pixel shader can generate Z values. Does the Z check happen pre or post shading? There's more complex control involved than you seem to think.

How about this: move all pixel shader instructions involved in the calculation of 'oDepth' to the beginning of the shader. As soon as 'mov oDepth, ...' is executed you compare it to the z-buffer... no? That's how I might optimize it with my emulator. But in hardware things will probably go differently...

<snip>

Nothing is truly separate. Everything is connected.

eg. Mightn't the addition of stencil completely change how my depth unit has to work?

Thanks, that really opened my eyes!

No. Every unit works together in a complex execution stream that has to be appropriately balanced for each new design to eliminate bottlenecks and get the highest efficiency of execution. How much buffering do I need? How many pixels and vertices do I have in flight? How many Z calculations do I do each clock? How many pixels do I work on? How wide are my data paths? How do I keep performance up when doing indirected texture accesses?

I'm just starting to realize how 'simple' my software renderer is.

Nothing is really computed in parallel and I don't have to worry about load balancing.

How many transistors does a P4 have? Around 55 million? How much of that is cache?

How many transistors does a Radeon 9700 have? Around 100 million. How much of that is cache?

Roughly half of a Pentium is cache... so let's say 25 million functional transistors. A Radeon 9700's cache is probably negligible, but it has 8 pipelines so it only has a complexity of 12.5 million transistors?

Got to sleep now... Thanks for all the answers!

Dave H · Jul 29, 2003

Dio said:
Where the CPU is vastly more complex than the VPU is when it comes to proved correctness (particularly around the complex edges of the x86 ISA - look at the number of Intel and AMD errata involving things like interrupt-during-TSS), and the extensive design engineering required to achieve clock rates more than five times what a VPU achieves. Both those are huge, difficult jobs - but probably relatively cheap in terms of silicon area.

Me too.
