NVIDIA on GPU architecture.

Kewl! Thanks.

As both clock speeds and chip sizes increase, the amount of time it takes for a signal to travel across an entire chip, measured in clock cycles, is also increasing. On today’s fastest processors, sending a signal from one side of a chip to another typically requires multiple clock cycles, and this amount of time increases with each new process generation. We can characterize this trend as an increase in the cost of communication when compared to the cost of computation.
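A rough back-of-envelope sketch of that claim. Every number below (die size, clock speed, effective wire speed) is an assumption chosen only to show the order of magnitude, not a measurement of any particular chip:

```python
# How many clock cycles does a signal need to cross a die edge-to-edge?
# Every number here is an illustrative assumption.

die_width_mm = 18.0          # assumed die edge length
clock_ghz = 3.0              # assumed clock frequency
wire_speed_fraction = 0.1    # assume long repeated on-chip wires move data at ~10% of c

speed_of_light_mm_per_ns = 300.0
wire_speed_mm_per_ns = speed_of_light_mm_per_ns * wire_speed_fraction

crossing_time_ns = die_width_mm / wire_speed_mm_per_ns   # ~0.6 ns
cycles_to_cross = crossing_time_ns * clock_ghz           # ns * (cycles per ns)

print(f"~{crossing_time_ns:.2f} ns to cross, ~{cycles_to_cross:.1f} cycles at {clock_ghz:.0f} GHz")
```

Push the clock higher or the die wider and the cycle count only grows, which is the trend the quote describes.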

Why doesn't this mean Very Bad Things for generalizing units to be used amongst multiple kinds of higher-class units? I'm thinking of the unification of PS/VS here. Won't that require a substantially greater volume of communication amongst them than is currently the case?
 
Actually, in this instance a more generalised unit can reduce communication. Let's say we have a task that requires some vertex processing, the results of that go to the pixel shaders, and then the results of that may go back to the VS again (for whatever reason). With a discrete VS/PS implementation the data will be passed from one end of the chip to the other, but with a unified approach the communication overhead will be lower since the same units act as both the VS and PS.

Communication overhead can be lowered for larger chips by the use of different internal communication lines to those used today (extrapolated onto larger chips) - take a look at Cell's internal memory bus as an example.
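A toy illustration of that round-trip point (purely schematic; the positions and the distance counting are made up, not a model of any real chip):

```python
# Count how far the intermediate data travels for a VS -> PS -> VS task,
# with positions normalized so 1.0 = one full trip across the die.

def data_travel(positions, stages):
    """Total distance the results move between consecutive pipeline stages."""
    return sum(abs(positions[a] - positions[b]) for a, b in zip(stages, stages[1:]))

stages = ["VS", "PS", "VS"]

# Discrete layout: VS block at one end of the die, PS block at the other.
discrete = {"VS": 0.0, "PS": 1.0}
# Unified layout: the same ALU array runs both stages, so the data stays put.
unified = {"VS": 0.5, "PS": 0.5}

print("discrete:", data_travel(discrete, stages))  # 2.0 die-crossings
print("unified: ", data_travel(unified, stages))   # 0.0
```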
 
DaveBaumann said:
Actually, in this instance a more generalised unit can reduce communication. Let's say we have a task that requires some vertex processing, the results of that go to the pixel shaders, and then the results of that may go back to the VS again (for whatever reason). With a discrete VS/PS implementation the data will be passed from one end of the chip to the other, but with a unified approach the communication overhead will be lower since the same units act as both the VS and PS.

Communication overhead can be lowered for larger chips by the use of different internal communication lines to those used today (extrapolated onto larger chips) - take a look at Cell's internal memory bus as an example.

Hmm! So the kind of generalization you're pointing at is actually a performance "win", at the potential cost of more transistors (to make the generalization happen) and more of those potentially "on the dole" in any given cycle. This would explain why it would take a major node shift like 90nm (or 65?) to make it happen --you need a sudden infusion of more transistors to make the switch without giving up any of your current functionality or number of units. You'd also, I think, have to feel pretty good about your current fill rate as not being a bottleneck --otherwise you might get killed when the other fellow adds four more pipes rather than generalize.

Bad, bad would be a pool of units "off yonder" that could work with any of the pipelines? That's where you'd get killed on communication costs?
 
Dunno, in a GPU I would presume that vertex/pixel shaders take up a very large chunk of the die area in any case, so even if you stay within the vertex/pixel shader portion of the core you will get slammed with communication overhead once you need to move data from one pipeline to another. This doesn't get any better just because the vertex and pixel shader blocks are merged.

As for Cell, I would guess that its internal buses are heavily buffered and pipelined; IIRC it could handle something like 16 outstanding transfers per SPE, which suggests that it is built to tolerate large latencies.
 
One more thot, Dave --how well would such an architecture "downsize" into the mid-range and low-end? Didn't Orton make some noises in your interview that it is getting increasingly hard to make one arch that is suitable as you take it down the range?
 
arjan de lumens said:
Dunno, in a GPU I would presume that vertex/pixel shaders take up a very large chunk of the die area in any case, so even if you stay within the vertex/pixel shader portion of the core you will get slammed with communication overhead once you need to move data from one pipeline to another. This doesn't get any better just because the vertex and pixel shader blocks are merged.

Take a look at the number of stages between the VS and PS in the document at the moment; sure, some of those won't necessarily be growing at the rate the shader processors are, but elements will be scaling up as well. With a unified structure you may not actually be moving from one pipeline to another either.

As for Cell, I would guess that its internal buses are heavily buffered and pipelined; IIRC it could handle something like 16 outstanding transfers per SPE, which suggests that it is built to tolerate large latencies.

Graphics processors have to deal with texture lookups, so they know all about latencies! ;)
 
arjan de lumens said:
As for Cell, I would guess that its internal buses are heavily buffered and pipelined; IIRC it could handle something like 16 outstanding transfers per SPE, which suggests that it is built to tolerate large latencies.
From a CELL presentation:

In order to leverage the bandwidth to a main memory that has a latency of, say, 1K cycles, and transfer granule of, say, 8 cycles, 128 transfers need to be pipelined to fully leverage the available bandwidth.
16 outstanding DMA requests per SPE x 8 SPEs = 128 pipelined transfers ;)
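Spelled out, the arithmetic from the quote (the only rule in play is keeping enough transfers in flight to cover the latency; "1K" is taken as 1024 here so the numbers line up):

```python
# Numbers from the CELL presentation quote above.

memory_latency_cycles = 1024    # "latency of, say, 1K cycles"
transfer_granule_cycles = 8     # "transfer granule of, say, 8 cycles"

# To keep the bus busy, the latency must be covered by transfers already in flight.
transfers_in_flight_needed = memory_latency_cycles // transfer_granule_cycles
print(transfers_in_flight_needed)         # 128

outstanding_per_spe = 16
num_spes = 8
print(outstanding_per_spe * num_spes)     # 128 -- matches
```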
 
geo said:
One more thot, Dave --how well would such an architecture "downsize" into the mid-range and low-end? Didn't Orton make some noises in your interview that it is getting increasingly hard to make one arch that is suitable as you take it down the range?

That issue already manifests itself on current products - witness RV350-380 dropping HierZ from the core and 5200/6200 dropping various compression techniques. There will always be elements that may work for the high end, and the die size it's targeting, but won't be right for the lower-end parts. If there are elements being built into new high-end chips specifically to alleviate the die sizes they are reaching, then these may not be appropriate for the lower-end parts, as those are likely to have similar transistor quantities to today's high end.
 
DaveBaumann said:
That issue already manifests itself on current products - witness RV350-380 dropping HierZ from the core and 5200/6200 dropping various compression techniques. There will always be elements that may work for the high end, and the die size it's targeting, but won't be right for the lower-end parts. If there are elements being built into new high-end chips specifically to alleviate the die sizes they are reaching, then these may not be appropriate for the lower-end parts, as those are likely to have similar transistor quantities to today's high end.

Yeah, I get the early part of that. I was wondering if this makes this particular problem worse, better, or has no effect. What I get from the end part of the above is "possibly a little worse".
 
Dave B(TotalVR) said:
The answer is a spherical GPU core, so you reduce the maximum distance any signal has to travel ;)

The real answer is a temporal core so it can travel any distance instantaneously. ;)
 
Changing topics within the subject, why don't the Photoshops & PaintShop Pros of the world take advantage of all that horsepower & fast memory in modern GPUs? Wouldn't all that parallelism and arithmetic horsepower be right up their alley? They did some optimisations in the past for SSE, right?
 
Maybe b/c video cards don't have enough RAM for the typical PS workload, or having the GPU access system RAM was too slow with AGP? PCIe may change this.
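A rough sketch of the readback problem. The bandwidth figures are assumed ballpark numbers (AGP readback was far slower than its downstream rate; PCIe x16 is nominally symmetric at roughly 4 GB/s), not benchmarks:

```python
# Time to pull an edited image back from the card to system RAM.
# All bandwidth figures are assumed ballpark numbers, not measurements.

image_bytes = 4000 * 3000 * 8        # ~96 MB: a 12-megapixel image at 16-bit RGBA

agp_readback_bytes_per_s = 150e6     # assumed: AGP readback was notoriously slow
pcie_x16_bytes_per_s = 4e9           # nominal PCIe x16, same in both directions

print(f"AGP readback:  {image_bytes / agp_readback_bytes_per_s * 1000:.0f} ms")   # ~640 ms
print(f"PCIe readback: {image_bytes / pcie_x16_bytes_per_s * 1000:.0f} ms")       # ~24 ms
```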
 
Thanks for the link DeanoC. It really is quite brilliant what Core Image is capable of. MS is hyping Avalon, but is it really the best MS can do to counter Core Image?
 
DeanoC said:
Pete said:
Maybe b/c video cards don't have enough RAM for the typical PS workload, or having the GPU access system RAM was too slow with AGP? PCIe may change this.

Apple's Core Image has shown that a GPU-based image processing architecture is the future.

http://www.appleclub.com.hk/macosx/tiger/core.html

Hmm! A very impressive list of filters, too (not included here):
Until now, harnessing the power of the GPU required in-depth knowledge of pixel-level programming. Core Image allows developers to easily leverage the GPU for blistering-fast image processing that can eliminate rendering time delays. Effects and transitions can be expressed with a few lines of code. Core Image handles the rest, optimizing the path to the GPU. The result is real-time, interactive responsiveness as you select and apply filters.

Supported graphics cards:
ATI Radeon 9800 XT
ATI Radeon 9800 Pro
ATI Radeon 9700 Pro
ATI Radeon 9600 XT
ATI Radeon 9600 Pro
ATI Mobility Radeon 9700
ATI Mobility Radeon 9600
NVIDIA GeForceFX Go 5200
NVIDIA GeForceFX 5200 Ultra

One might wonder if the demo guys and gals at the IHVs should do something with Photoshop to show the world. . .and expand the market for top-end gpus.
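Not Core Image itself, but a CPU-side sketch (NumPy, with a made-up sepia matrix) of the kind of per-pixel work these filters do. Every output pixel depends only on its own input pixel, which is exactly the embarrassingly parallel shape a GPU's shader pipelines are built for:

```python
import numpy as np

def sepia(image):
    """Per-pixel colour transform: each output pixel depends only on its own input pixel."""
    matrix = np.array([[0.393, 0.769, 0.189],
                       [0.349, 0.686, 0.168],
                       [0.272, 0.534, 0.131]])
    return np.clip(image @ matrix.T, 0.0, 1.0)

# A random 1024x1024 RGB "image" with channel values in [0, 1]
image = np.random.rand(1024, 1024, 3)
print(sepia(image).shape)   # (1024, 1024, 3)
```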
 