Implication of core clock .. ATi Vs nVidia?

Sure, like GPU is the recursive acronym for "GPU's probably underutilized".
 
That question was answered in a patent or two, if I recall correctly. ALUs are easy to pipeline to get up to those speeds. I don't remember the details, but I recall a blurb about scheduler/dispatch logic being happier with lower clocks, maybe because some parts of that process just can't be broken down into more granular stages.

Simply put, it's much easier and more productive to custom-design ALUs than control logic. Control logic changes too frequently; it's basically a PLA programmed as a finite state machine. I think ATi does this in microcode, and the instructions themselves contain more information.

Also, pipelining does not necessarily make a processor clock higher in itself. Itanium has 8 stages and runs at 1.7 GHz; GPUs probably have 25-30 stages and run at 1.4 GHz.
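As a back-of-envelope illustration of why frequency gains from deeper pipelining are at best sub-linear, here's a toy model in Python; the logic depth and per-stage flop/skew overhead are assumed numbers, not figures from any real chip:

def f_max_ghz(logic_delay_ps, stages, overhead_ps):
    # cycle time = logic split across stages + fixed per-stage flop/skew overhead
    cycle_ps = logic_delay_ps / stages + overhead_ps
    return 1000.0 / cycle_ps  # ps -> GHz

LOGIC_PS = 4000.0     # assumed total combinational delay of the unpipelined path
OVERHEAD_PS = 90.0    # assumed flop setup + clk-to-Q + skew budget per stage

for stages in (4, 8, 16, 24, 32):
    print(f"{stages:2d} stages -> {f_max_ghz(LOGIC_PS, stages, OVERHEAD_PS):.2f} GHz")

Going from 8 to 32 stages in this model buys nowhere near 4x the clock, because the per-stage overhead doesn't shrink as the logic is split finer.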
 
Also, pipelining does not necessarily make a processor clock higher in itself. Itanium has 8 stages and runs at 1.7 GHz; GPUs probably have 25-30 stages and run at 1.4 GHz.

Yep, 24 on GT200 according to Volkov's analysis - http://mc.stanford.edu/cgi-bin/images/6/65/SC08_Volkov_GPU.pdf

But is that comparison even valid? What about voltages, power consumption etc. - do they not factor into clock speed as well? Sure, they only run at 1.4 GHz, but there are a whole lot more ALUs in a GPU.
 
That would make sense for VLIW. Long pipelines will create huge dependency issues for wide instruction issue; in fact, half of the logic in Itanium's adder is for bypassing. Latency is a huge problem too, but it seems that heavily threaded approaches like wavefronts/warps solve that problem pretty well. Just for comparison, I remember seeing some SPECint data on Itanium: 10-95% of the time spent on memory stalls was to the L2 cache, which is 5-7 cycles away (I forget the number of read/write ports). Just goes to show how important the ISA is.

Yep, 24 on GT200 according to Volkov's analysis - http://mc.stanford.edu/cgi-bin/images/6/65/SC08_Volkov_GPU.pdf

But is that comparison even valid? What about voltages, power consumption etc. - do they not factor into clock speed as well? Sure, they only run at 1.4 GHz, but there are a whole lot more ALUs in a GPU.
It is really difficult to compare microprocessors, IMO. There is a lot of grey area, like you pointed out, which leaves a lot of arguments based on nothing or on unknown information and assumptions. My point is that adding more pipeline stages won't make a chip clock higher in itself; from pipelining you will get at best linear frequency gains and at best linear performance gains.

My theory is that it is related to the clock architecture. Intel is really good at this. Designing a fast, low-skew PLL is extremely difficult. PLLs are so important that most of the chip is impossible to build without them. Since it is an analog circuit, it requires custom design and deep knowledge of the transistor characteristics of the process. Another challenge is getting the clock to arrive at all areas of the chip with low skew while using as little power as possible; doing this with multiple clocks is nearly impossible at high speeds. Most of the delay of a clock cycle is from skew and not the pulse itself. Then again, would trying to make ROP/control logic run at full speed be a larger bottleneck?
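To put the skew point into numbers, here's a toy timing budget at a 1.4 GHz shader clock; the clock-to-Q, setup and skew/jitter figures below are purely illustrative assumptions, not data from any process:

freq_ghz = 1.4
cycle_ps = 1000.0 / freq_ghz        # ~714 ps per shader-clock cycle

clk_to_q_ps    = 60.0               # assumed flop clock-to-Q delay
setup_ps       = 40.0               # assumed flop setup time
skew_jitter_ps = 120.0              # assumed clock skew + jitter budget

logic_ps = cycle_ps - clk_to_q_ps - setup_ps - skew_jitter_ps
print(f"cycle {cycle_ps:.0f} ps, usable for logic {logic_ps:.0f} ps "
      f"({100 * logic_ps / cycle_ps:.0f}% of the cycle)")

Even with made-up numbers, a sizeable slice of every cycle is eaten by overhead that gets worse, not better, as clocks rise.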
 
Why and how is that obvious?
I meant the number of shader instructions executed on the whole chip, which is larger in ATi chips compared to Nvidia.

The question about clocks can be re-formulated: "why isn't NVidia running the entire chip (control, scheduling, ROPs etc.) at ~1.5 GHz and using half of the units?" The chips would be a lot smaller.
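The peak-rate arithmetic behind that reformulation, with hypothetical unit counts chosen only to show the equivalence:

def peak_gops(units, clock_ghz, ops_per_unit_per_clock=1):
    # peak ALU rate = units * ops per unit per clock * clock
    return units * ops_per_unit_per_clock * clock_ghz

slow_wide = peak_gops(units=480, clock_ghz=0.75)   # everything at the "core" clock
fast_half = peak_gops(units=240, clock_ghz=1.50)   # half the units at twice the clock
print(slow_wide, fast_half)   # identical peak rates; the difference is area, power and latency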

Yes, good question!

Factor in ATI's 5-wide VLIW units, and with a single thread (or a low number of threads) nvidia's GPUs get a sound beating. They simply need higher thread counts than ATI GPUs to achieve peak performance.
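A rough per-thread sketch of why that is; the 850 MHz core clock and 1.4 GHz hot clock are assumed typical figures, and co-issue/dependency limits are ignored:

ati_per_thread = 5 * 0.85e9   # one thread can fill up to 5 VLIW slots per cycle at ~850 MHz
nv_per_thread  = 1 * 1.40e9   # one thread issues 1 scalar op per hot clock at ~1.4 GHz
print(f"ATI per-thread peak: {ati_per_thread / 1e9:.2f} Gops/s")
print(f"NV  per-thread peak: {nv_per_thread / 1e9:.2f} Gops/s")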
So even though nvidia ALUs are clocked higher, they have longer latencies? Is it because they use a longer pipeline?

I was also referring to TMUs and ROPs. Of course halving their sizes means reducing their count and sharing them with more ALUs - so you get into a communication trade-off.

There's an all-to-all crossbar between the ROPs and ALUs, whose complexity could be cut by halving the ROP count. But there's another trade-off there, as ROPs would need beefier crossbars between them and the MCs, and that's theoretically the highest data rate on the chip...
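A crude way to count that trade-off; the cluster, ROP and MC counts below are hypothetical, only the scaling matters:

def crossbar_links(producers, consumers):
    # an all-to-all crossbar needs a path from every producer to every consumer
    return producers * consumers

clusters, rops, mcs = 10, 8, 8          # hypothetical unit counts
for r in (rops, rops // 2):
    alu_rop = crossbar_links(clusters, r)
    rop_mc  = crossbar_links(r, mcs)
    bw_per_rop = rops / r               # relative bandwidth each remaining ROP must carry
    print(f"{r} ROPs: ALU-ROP links {alu_rop}, ROP-MC links {rop_mc}, "
          f"{bw_per_rop:.1f}x bandwidth per ROP")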
I see ..

As for intra-SM scheduling/dispatch, part of the trade-off is the latency of the memories. Area can be saved by bursting data (reduced addressing density/clocking). The area saved that way might exceed the area saved by double-clocking.
I am sorry, I didn't get this part, could you elaborate?

And if nvidia needs full custom to run at 1.6 GHz, then they have a lot of other problems. Simple datapath P&R will hit that.
I don't get it, what is the problem in doing a custom design?
 
I am sorry, I didn't get this part, could you elaborate?
NVidia's current configuration perhaps uses less space than running something that's half the size at twice the clock.

The issue is making the alternative version, with twice the clock, have low enough latencies. Meeting that constraint adds area, so there's a trade-off in the two approaches...
 
So even though nvidia ALUs are clocked higher, they have longer latencies? Is it because they use a longer pipeline?
Yes, that is the reason. As Jawed said, ATI is at 8 cycles of latency, nvidia at ~24 cycles with G80/GT200 (this appears to be reduced to 18 in GF100).
So in absolute numbers (i.e. time in ns) ATI GPUs have shorter latencies. Furthermore, they generally have fewer units (you have to count the VLIW units, not the marketing SP numbers; comparing Cypress vs. GTX480 gives 320 vs. 480 units). Both together simply mean ATI GPUs need fewer threads to hide the latencies.

Of course, ALU latencies are not everything; the memory access latencies have to be taken into account, too. I would expect them to be comparable for both manufacturers. But GF100 is supposed to have an edge there with its cache structure (though the bandwidth of the L2 could be better).
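For reference, converting those cycle counts to absolute time, with assumed typical shader clocks:

cases = {
    "ATI   (8 cycles @ 0.85 GHz)":  8 / 0.85,
    "GT200 (24 cycles @ 1.40 GHz)": 24 / 1.40,
    "GF100 (18 cycles @ 1.40 GHz)": 18 / 1.40,
}
for name, ns in cases.items():
    print(f"{name}: ~{ns:.1f} ns")   # roughly 9.4, 17.1 and 12.9 ns respectively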
 
ATI parts need fewer threads to hide latencies also because they quad-pump every instruction. I believe just 2 threads (pardon, wavefronts) are required to hide most register hazards.
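The quad-pump arithmetic, assuming a 64-thread wavefront issued over 4 cycles on 16-wide SIMDs, and a 32-thread warp issued over 4 hot clocks G80/GT200-style:

import math

def groups_to_hide(alu_latency_cycles, issue_cycles_per_instruction):
    # back-to-back dependent instructions from the same wavefront/warp must be
    # spaced by the ALU latency; other wavefronts/warps fill the gap in between
    return math.ceil(alu_latency_cycles / issue_cycles_per_instruction)

print("ATI  :", groups_to_hide(8, 4), "wavefronts")   # 8-cycle latency / 4 -> 2
print("GT200:", groups_to_hide(24, 4), "warps")       # 24-cycle latency / 4 -> 6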
 
Yes, that is the reason. As Jawed said, ATI is at 8 cycles of latency, nvidia at ~24 cycles with G80/GT200 (this appears to be reduced to 18 in GF100).
So in absolute numbers (i.e. time in ns) ATI GPUs have shorter latencies.
Do they add up or is it just first throughput latency as in latency vs. streaming throughput?
 