From GPU Gems 2, a good explanation of why GPUs are designed the way they are.
http://download.nvidia.com/developer/GPU_Gems_2/GPU_Gems2_ch29.pdf
As both clock speeds and chip sizes increase, the amount of time it takes for a signal to travel across an entire chip, measured in clock cycles, is also increasing. On today’s fastest processors, sending a signal from one side of a chip to another typically requires multiple clock cycles, and this amount of time increases with each new process generation. We can characterize this trend as an increase in the cost of communication when compared to the cost of computation.
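To put rough numbers on the chapter's claim (the figures below are illustrative assumptions, not taken from the chapter): assume a 20 mm die and an effective on-chip signal propagation speed of about 10% of the speed of light, and see how the cost of one chip-crossing, measured in clock cycles, grows with clock frequency.

```python
C = 3e8                # speed of light, m/s
SIGNAL_FRACTION = 0.1  # assumed effective on-chip signal speed as a fraction of c
DIE_WIDTH_M = 0.02     # assumed 20 mm die width

def cycles_to_cross(clock_hz):
    """Clock cycles for a signal to traverse the die once."""
    travel_s = DIE_WIDTH_M / (C * SIGNAL_FRACTION)
    return travel_s * clock_hz

for ghz in (1, 2, 4):
    print(f"{ghz} GHz: {cycles_to_cross(ghz * 1e9):.2f} cycles per chip crossing")
```

Doubling the clock doubles the cycle cost of the same wire, which is exactly the communication-vs-computation trend the quote describes.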
DaveBaumann said:Actually, in this instance a more generalised unit can reduce communication. Let's say we have a task that requires some vertex processing, the results of that go to the pixel shaders, and the results of that may go back to the VS again (for whatever reason). With a discrete VS / PS implementation the data will be passed from one end of the chip to the other, but with a unified approach the communication overhead will be lower, since the same units act as both the VS and PS.
Communication overhead can be lowered for larger chips by using different internal communication lines to those used today (extrapolated onto larger chips) - take a look at Cell's internal memory bus as an example.
arjan de lumens said:Dunno, in a GPU I would presume that vertex/pixel shaders take up a very large chunk of the die area in any case, so even if you stay within the vertex/pixel shader portion of the core you will get slammed with communication overhead once you need to move data from one pipeline to another. This doesn't get any better just because the vertex and pixel shader blocks are merged.
As for Cell, I would guess that its internal buses are heavily buffered and pipelined; IIRC it could handle something like 16 outstanding transfers per SPE, which suggests that it is built to tolerate large latencies.
arjan de lumens said:As for Cell, I would guess that its internal buses are heavily buffered and pipelined; IIRC it could handle something like 16 outstanding transfers per SPE, which suggests that it is built to tolerate large latencies.
16 outstanding DMA requests per SPE x 8 SPEs = 128 pipelined transfers. From a CELL presentation:
In order to leverage the bandwidth to a main memory that has a latency of, say, 1K cycles, and a transfer granule of, say, 8 cycles, 128 transfers need to be pipelined to fully leverage the available bandwidth.
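The arithmetic in that presentation quote is Little's law applied to latency hiding: the number of transfers that must be in flight equals the memory latency divided by the transfer granule. A quick sketch with the quoted figures:

```python
def transfers_in_flight(latency_cycles, granule_cycles):
    """Outstanding transfers needed to keep the memory bus fully busy
    (Little's law: concurrency = latency / service time per transfer)."""
    return latency_cycles // granule_cycles

# Figures from the CELL presentation quote: "1K"-cycle latency (taken here
# as 1024 cycles) and an 8-cycle transfer granule.
needed = transfers_in_flight(1024, 8)
print(needed)  # 128 - matching 16 outstanding DMA requests x 8 SPEs
```

Fewer transfers in flight than this leaves the bus idle while waiting on latency; the 16-deep DMA queue per SPE is sized so that eight SPEs together can just saturate it.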
geo said:One more thought, Dave -- how well would such an architecture "downsize" into the mid-range and low end? Didn't Orton make some noises in your interview that it is getting increasingly hard to make one architecture that is suitable as you take it down the range?
DaveBaumann said:That issue already manifests itself on current products - witness RV350-380 dropping HierZ from the core, and the 5200/6200 dropping various compression techniques. There will always be elements that work for the high end, and the die size it's targeting, but won't be right for the lower-end parts. If there are elements being built into new high-end chips specifically to alleviate the sizes they are reaching, then these may not be appropriate for the lower-end parts, as these are likely to have transistor counts similar to today's high end.
Dave B(TotalVR) said:The answer is a spherical GPU core, so you reduce the maximum distance any signal has to travel
Pete said:Maybe b/c video cards don't have enough RAM for the typical PS workload, or having the GPU access system RAM was too slow with AGP? PCIe may change this.
DeanoC said:Pete said:Maybe b/c video cards don't have enough RAM for the typical PS workload, or having the GPU access system RAM was too slow with AGP? PCIe may change this.
Apple's Core Image has shown that a GPU-based image processing architecture is the future.
http://www.appleclub.com.hk/macosx/tiger/core.html
Until now, harnessing the power of the GPU required in-depth knowledge of pixel-level programming. Core Image allows developers to easily leverage the GPU for blistering-fast image processing that can eliminate rendering time delays. Effects and transitions can be expressed with a few lines of code. Core Image handles the rest, optimizing the path to the GPU. The result is real-time, interactive responsiveness as you select and apply filters.
Supported graphics cards:
ATI Radeon 9800 XT
ATI Radeon 9800 Pro
ATI Radeon 9700 Pro
ATI Radeon 9600 XT
ATI Radeon 9600 Pro
ATI Mobility Radeon 9700
ATI Mobility Radeon 9600
NVIDIA GeForceFX Go 5200
NVIDIA GeForceFX 5200 Ultra