So is that what you see when you look at a die shot of GF100?
A GPU?
What are you asking?
That was answered in a patent or two, if I recall correctly. ALUs are easy to pipeline up to those speeds. I don't remember the details, but I recall a blurb about scheduler/dispatch logic being happier with lower clocks, maybe because some parts of that process just can't be broken down into more granular stages.
Also, pipelining does not necessarily make a processor clock higher in itself. Itanium has 8 stages and runs at 1.7 GHz; GPUs probably have 25-30 stages and run at 1.4 GHz.
ATI is 8 stages for up to 1 GHz or so.
It is really difficult to compare microprocessors, IMO. There is a lot of grey area, as you pointed out, which leaves a lot of arguments resting on nothing, or on unknown information and assumptions. My point is that adding more pipeline stages won't make a chip clock higher by itself; from pipelining you get at best linear frequency gains and at best linear performance gains.

Yep, 24 on GT200 according to Volkov's analysis - http://mc.stanford.edu/cgi-bin/images/6/65/SC08_Volkov_GPU.pdf
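A minimal sketch of that "at best linear" point, assuming a toy timing model (cycle time = un-pipelined logic delay / stage count + a fixed per-stage latch/skew overhead); all the numbers are made up for illustration, not measured silicon:

```python
# Toy model: t_cycle = t_logic / n_stages + t_overhead.
# The per-stage overhead doesn't shrink as you add stages, so frequency
# gains from deeper pipelining are sub-linear and eventually flatten out.

def max_clock_ghz(t_logic_ns: float, n_stages: int, t_overhead_ns: float = 0.05) -> float:
    """Achievable clock (GHz) for a pipeline with n_stages under the toy model."""
    t_cycle_ns = t_logic_ns / n_stages + t_overhead_ns
    return 1.0 / t_cycle_ns

total_logic_ns = 5.0  # hypothetical un-pipelined critical path
for stages in (1, 4, 8, 16, 32):
    print(f"{stages:2d} stages -> ~{max_clock_ghz(total_logic_ns, stages):.2f} GHz")
```

Doubling the stage count never doubles the clock, and any logic that can't be split further (the scheduler/dispatch case mentioned above) stops the per-stage delay shrinking at all.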
But is that comparison even valid? What about voltages, power consumption, etc. - do they not factor into clock speed as well? Sure, they only run at 1.4 GHz, but there are a whole lot more ALUs in a GPU.
> Why and how is that obvious?
I meant the number of shader instructions executed on the whole chip, which is larger in ATI chips compared to Nvidia's.
The question about clocks can be re-formulated: "why isn't NVidia running the entire chip, control, scheduling, ROPs etc. at ~1.5GHz and using half of the units?" The chips would be a lot smaller.
> Factor in ATI's 5-wide VLIW units, and with a single thread (or a low number of them) Nvidia's GPUs get a sound beating. They simply need higher thread counts than ATI GPUs to achieve peak performance.
So even though Nvidia ALUs are clocked higher, they have longer latencies? Is it because they use a longer pipeline?
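To put a rough number on "higher thread counts", here is a Little's-law style sketch using the latency figures quoted in this thread; the issue-rate assumptions (a 32-wide warp issued over 4 clocks on a G80/GT200 SM, a 64-wide wavefront over 4 clocks on an ATI SIMD) follow Volkov's slides and the usual descriptions of those parts, so treat the exact counts as ballpark:

```python
# Threads needed in flight so back-to-back dependent ALU ops never stall:
# roughly ALU latency (cycles) * threads issued per cycle.

def threads_to_hide(alu_latency_cycles: int, group_size: int, issue_cycles_per_group: int) -> int:
    """Threads per SM/SIMD needed to cover the dependent-ALU latency."""
    groups_needed = -(-alu_latency_cycles // issue_cycles_per_group)  # ceiling division
    return groups_needed * group_size

print("G80/GT200 SM:", threads_to_hide(24, 32, 4), "threads")  # ~192
print("GF100 SM    :", threads_to_hide(18, 32, 4), "threads")  # ~160, if latency really is 18
print("ATI SIMD    :", threads_to_hide(8, 64, 4), "threads")   # ~128
```

And on top of needing fewer threads, each ATI thread carries a 5-wide VLIW bundle, so at low thread counts it also gets more ALU work issued per thread - which is where the "sound beating" comes from.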
> I was also referring to TMUs and ROPs. Of course halving their sizes means reducing their count and sharing them with more ALUs - so you get into a communication trade-off.
I see..
There's an all-to-all crossbar between the ROPs and ALUs, whose complexity could be cut by halving the ROP count.. But there's another trade-off there as ROPs would need beefier crossbars between them and the MCs, and that's theoretically the highest data rate on the chip...
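To make that trade-off concrete, a toy count (the unit numbers are placeholders, not real GF100 figures): an all-to-all crossbar needs one link per (source, destination) pair, so halving the ROP partition count halves the ALU-side wiring, but the same total DRAM bandwidth then has to funnel through half as many partitions:

```python
# Toy crossbar sizing for the ROP-count trade-off.
# Unit counts below are placeholders, not real GF100 figures.

def crossbar_links(sources: int, destinations: int) -> int:
    """An all-to-all crossbar has one link per (source, destination) pair."""
    return sources * destinations

sm_clusters = 16
for rop_partitions in (8, 4):
    links = crossbar_links(sm_clusters, rop_partitions)
    share = 1.0 / rop_partitions  # fraction of total DRAM bandwidth per partition
    print(f"{rop_partitions} ROP partitions: {links} ALU<->ROP links, "
          f"each partition carries {share:.1%} of DRAM bandwidth")
```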
> As for intra-SM scheduling/despatch, part of the trade-off is the latency of the memories. Area can be saved by bursting data (reduced addressing density/clocking). That area saved might exceed the area saved in double-clocking.
I am sorry, I didn't get this part, could you elaborate?
> And if nvidia needs full custom to run at 1.6 GHz, then they have a lot of other problems. Simple datapath P&R will hit that.
I don't get it, what is the problem in doing a custom design?
> I am sorry, I didn't get this part, could you elaborate?
NVidia's current configuration perhaps uses less space than running something that's half the size at twice the clock.
> So even though Nvidia ALUs are clocked higher, they have longer latencies? Is it because they use a longer pipeline?
Yes, that is the reason. As Jawed said, ATI is at 8 cycles latency, Nvidia at ~24 cycles with the G80/GT200 (it appears to be reduced to 18 in the GF100).
> Yes, that is the reason. As Jawed said, ATI is at 8 cycles latency, Nvidia at ~24 cycles with the G80/GT200 (it appears to be reduced to 18 in the GF100).
Do they add up, or is it just the latency to the first result, as in latency vs. streaming throughput?
So in absolute numbers (i.e. time in ns) ATI GPUs have shorter latencies.
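Putting ballpark figures on that (the clocks below are rough - ~850 MHz for ATI's ALUs, ~1.4 GHz for Nvidia's hot clock - so the exact nanosecond values are illustrative):

```python
# Convert the latencies quoted above from cycles to nanoseconds.
# Clock speeds are ballpark, not exact SKU numbers.

def latency_ns(cycles: int, clock_ghz: float) -> float:
    return cycles / clock_ghz

print("ATI  :  8 cycles @ 0.85 GHz -> %.1f ns" % latency_ns(8, 0.85))   # ~9.4 ns
print("GT200: 24 cycles @ 1.40 GHz -> %.1f ns" % latency_ns(24, 1.40))  # ~17.1 ns
print("GF100: 18 cycles @ 1.40 GHz -> %.1f ns" % latency_ns(18, 1.40))  # ~12.9 ns
```

So even the shallower GF100 pipeline still leaves Nvidia with a longer absolute dependent-ALU latency than ATI, which is what the higher thread counts have to cover.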