Implications of the core clock: ATi vs. NVidia?

Everyone knows ATi's main clock, the "core clock", drives the whole GPU: SPs, texture units, ROPs, even the scheduler.

I wonder if there is a necessity for this. ATi chips obviously need more parallelism, but does it help to have the whole pipeline operating at the same speed, smoothing the data flow or something?

If that is true, then what is the implication of decoupling the SP clock from the rest of the core in Nvidia chips? Operating the SPs at double the frequency might create some bottlenecks, since the scheduler is slower in that case. What are the advantages of this?
 
ATi chips obviously need more parallelism
I'm not sure... If we take, for example, dependent DP MADs, what's the minimum number of threads on each chip to reach a given performance, say 500 GFlops?

Increasing clock speeds implies increasing the number of pipeline stages, including some buffers in the pipeline; if the task already takes more than one clock cycle, its latency will likely increase.
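
Just to put rough numbers on that, here's a back-of-envelope sketch (Python; the latencies, clocks and DP lane counts are made-up figures for illustration, not vendor specs). With dependent DP MADs a thread can only complete one MAD every ALU-latency cycles, so, Little's-law style, the threads needed scale with the target throughput times the latency.

import math

# Back-of-envelope: minimum threads in flight to hit a DP MAD throughput
# target when every MAD depends on the previous one.  All latency/clock/lane
# figures are assumptions for illustration, not vendor specifications.
def min_threads(target_gflops, alu_latency_cycles, clock_ghz, dp_lanes):
    peak_gflops = dp_lanes * 2 * clock_ghz               # hard ceiling of the chip
    if target_gflops > peak_gflops:
        return None                                      # target unreachable
    per_thread_gflops = 2 * clock_ghz / alu_latency_cycles
    return math.ceil(target_gflops / per_thread_gflops)

# Hypothetical "hot-clocked, long-latency" design vs a
# "core-clocked, shorter-latency" design:
print(min_threads(500, alu_latency_cycles=24, clock_ghz=1.4, dp_lanes=256))   # ~4286 threads
print(min_threads(500, alu_latency_cycles=8,  clock_ghz=0.85, dp_lanes=320))  # ~2353 threads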

If that is true, then what is the implication of decoupling the SP clock from the rest of the core in Nvidia chips? Operating the SPs at double the frequency might create some bottlenecks, since the scheduler is slower in that case. What are the advantages of this?
Do you accept a guess?

Well... they may be trying to save transistors. The SPs may need many specialized structures (shifters, adders, multipliers, special functions, something else) that won't be reused as often as the TMUs and ROPs. How big would the die have to be to reach the same performance if the SPs were running at core clock?
 
I wonder if there is a necessity for this. ATi chips obviously need more parallelism, but does it help to have the whole pipeline operating at the same speed, smoothing the data flow or something?
Why and how is that obvious?

The advantage of having only one clock is that you don't have to cross any clock domains. Naturally, it simplifies the design at the interfaces. But interfacing clock domains is such a widespread practice that I imagine it is second nature to the design team.

If that is true, then what is the implication of decoupling the SP clock from the rest of the core in Nvidia chips? Operating the SPs at double the frequency might create some bottlenecks, since the scheduler is slower in that case. What are the advantages of this?

Decoupling the SP clock from the rest of the chip, and clocking it higher, allows more flops/area. However, increasing the clock speed means the SPs need somewhat more area, so the flops/area growth is not proportional to the clock speed increase.

I believe the scheduler is clocked at exactly half the SP rate, and deliberately so, so I am assuming they must have taken care of possible bottlenecks there.
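
To put a toy number on that trade-off (the lane counts, clocks and the 30% area overhead below are assumptions, not measured figures), a quick sketch in Python:

def flops_per_area(lanes, clock_ghz, area_mm2):
    return lanes * 2 * clock_ghz / area_mm2   # 2 flops per MAD, result in GFLOPS/mm^2

base = flops_per_area(lanes=64, clock_ghz=0.75, area_mm2=10.0)
# Same lane count at double the clock, with an assumed 30% area penalty for
# the extra pipeline registers and tighter timing:
hot = flops_per_area(lanes=64, clock_ghz=1.50, area_mm2=13.0)

print(round(hot / base, 2))   # ~1.54x flops/mm^2, not the naive 2x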
 
The relatively high ALU clocks in NVidia chips since G80 seem to have forced custom design for the hot-clocked areas of the chip. Those high frequencies chew through a lot of power, theoretically, so they need hand-tuning, I suppose.

Since there seems to be a lot of hand-tuning around design rules when using libraries on a given process at a fab (though I'm not sure of the level of detail), it's hard to have much idea of how much extra work custom design actually amounts to.
 
The relatively high ALU clocks in NVidia chips since G80 seem to have forced custom design for the hot-clocked areas of the chip. Those high frequencies chew through a lot of power, theoretically, so they need hand-tuning, I suppose.

Since there seems to be a lot of hand-tuning around design rules when using libraries on a given process at a fab (though I'm not sure of the level of detail), it's hard to have much idea of how much extra work custom design actually amounts to.

Custom designs are known to involve a lot of human design effort.
 
But NVIDIA doesn't do a true custom design. I don't know about their GF100 methodology, but basically they originally did automatic place-and-route on (comparatively) very large specialised/optimised cells. They might have moved to structured custom nowadays though (manual placement of small cells, automatic routing) but I'm not certain. Both require a lot less effort than true full-custom (although certainly more than classic design).
 
… operating the SPs at double the frequency might create some bottlenecks, since the scheduler is slower in that case. What are the advantages of this?

Let me take a guess: Nvidia wants to go in the direction of general-purpose processing on GPUs - that became very clear with G80, which is also the first chip to have a massively faster ALU clock than base clock. Compared to a CPU, they dedicate vastly more die space to improving parallel throughput instead of making serial operations run fast. Traditionally, CPUs have a clock speed advantage of about a factor of 4 or more.

That basically means even very powerful GPUs suck at serial workloads, because that's not what they're primarily designed for - plus, according to Amdahl's Law, poor serial performance prevents them from reaching their full potential on code that isn't 100% parallelizable.

So IMHO it just makes sense to try and close the gap a little by at least offsetting the vast clock speed advantage CPUs have. I also imagine Nvidia was originally planning to be at around 2 GHz at this point in time, not at 1.4 GHz for their highest-performing chip or 1.8 GHz for their highest-clocking one.
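
For what it's worth, here is the Amdahl's Law point from above as a small sketch (Python; the 4x serial deficit and 50x parallel speedup are purely illustrative numbers):

# Amdahl's Law with a GPU that is *slower* than the CPU on the serial part.
def speedup(parallel_fraction, parallel_speedup, serial_slowdown=1.0):
    serial = (1 - parallel_fraction) * serial_slowdown
    parallel = parallel_fraction / parallel_speedup
    return 1.0 / (serial + parallel)

print(speedup(0.90, parallel_speedup=50, serial_slowdown=4.0))   # ~2.4x
print(speedup(0.99, parallel_speedup=50, serial_slowdown=4.0))   # ~16.7x

Even with 99% parallelizable code, the assumed 4x serial deficit caps the overall speedup at roughly 17x, which is why narrowing the clock gap would matter.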
 
But NVIDIA doesn't do a true custom design. I don't know about their GF100 methodology, but basically they originally did automatic place-and-route on (comparatively) very large specialised/optimised cells. They might have moved to structured custom nowadays though (manual placement of small cells, automatic routing) but I'm not certain. Both require a lot less effort than true full-custom (although certainly more than classic design).
When I look at a die picture of an NVidia GPU I see structures that look like the structures I see on a CPU die shot. I don't see anything like those structures on the RV770 die shot.

I think you're exaggerating the "semi" in the semi-custom NVidia supposedly uses.
 
So IMHO it just makes sense to try and close the gap a little by at least offsetting the vast clock speed advantage CPUs have. I also imagine Nvidia was originally planning to be at around 2 GHz at this point in time, not at 1.4 GHz for their highest-performing chip or 1.8 GHz for their highest-clocking one.
CPUs can get away with those super-high clocks, in terms of power budget, because so little of the chip is working hard at any one time.

The question about clocks can be re-formulated: "why isn't NVidia running the entire chip, control, scheduling, ROPs etc. at ~1.5GHz and using half of the units?" The chips would be a lot smaller.
 
That basically means even very powerful GPUs suck at serial workloads, because that's not what they're primarily designed for - plus, according to Amdahl's Law, poor serial performance prevents them from reaching their full potential on code that isn't 100% parallelizable.

So IMHO it just makes sense to try and close the gap a little by at least offsetting the vast clock speed advantage CPUs have. I also imagine Nvidia was originally planning to be at around 2 GHz at this point in time, not at 1.4 GHz for their highest-performing chip or 1.8 GHz for their highest-clocking one.
Alright, but traditionally ATI GPUs beat Nvidia GPUs in "single-thread performance". Only Fermi (especially GF104) closes that gap a bit (because of the reduced ALU latency and the increased potential use of ILP). Nvidia never gained the factor-of-2.5 advantage in clock speed they would need to compensate for their far higher ALU latencies. Factor in ATI's 5-wide VLIW units, and with a single thread (or a low number of them) Nvidia's GPUs get a sound beating. They simply need higher thread counts than ATI GPUs to achieve peak performance.
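
A rough latency-hiding comparison along those lines (all figures below are assumptions for illustration, not official numbers): threads needed per SM/SIMD to keep the ALUs busy on dependent instructions.

def threads_to_hide(alu_latency, issue_cycles_per_group, group_size):
    # ceiling division: groups of threads needed to cover the ALU latency
    groups = -(-alu_latency // issue_cycles_per_group)
    return groups * group_size

# Hypothetical hot-clocked design: ~24-cycle ALU latency, a 32-thread warp
# issuing over 4 hot clocks.
print(threads_to_hide(24, 4, 32))   # 192 threads per SM

# Hypothetical core-clocked VLIW design: ~8-cycle latency, a 64-thread
# wavefront occupying the SIMD for 4 cycles (plus up to 5-way ILP per thread).
print(threads_to_hide(8, 4, 64))    # 128 threads per SIMD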
 
ATi GPUs have additional latency when they switch clauses, hence ALU latency isn't everything.
I know, but that isn't an issue for computationally dense kernels (i.e. long clauses), as the latency can be hidden with 5 instructions per clause or something in that range. For memory accesses it's also no problem, as the access latency is far higher than the clause-switch latency (AFAIR it is 40 cycles). So the only problematic cases probably involve a lot of cascaded (and very short) control flow structures (where performance probably takes a nosedive anyway because of branch divergence).
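
A quick sanity check on that "5 instructions per clause" figure (the per-instruction SIMD occupancy below is an assumption):

clause_switch_cycles = 40        # clause-switch latency quoted above (AFAIR)
cycles_per_alu_instruction = 8   # assumed SIMD occupancy per ALU instruction

# ceiling division: ALU instructions needed to cover the clause-switch latency
instructions_to_hide = -(-clause_switch_cycles // cycles_per_alu_instruction)
print(instructions_to_hide)      # 5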
 
I know, but that isn't an issue for computationally dense kernels (i.e. long clauses), as the latency can be hidden with 5 instructions per clause or something in that range. For memory accesses it's also no problem, as the access latency is far higher than the clause-switch latency (AFAIR it is 40 cycles). So the only problematic cases probably involve a lot of cascaded (and very short) control flow structures (where performance probably takes a nosedive anyway because of branch divergence).

Yes, but you weren't talking about computationally dense kernels earlier. You were referring to single-threaded performance, and such code is likely to have a lot of branches and a low ALU:memory ratio.
 
The question about clocks can be re-formulated: "why isn't NVidia running the entire chip, control, scheduling, ROPs etc. at ~1.5GHz and using half of the units?" The chips would be a lot smaller.

That was answered in a patent or two, if I recall correctly. ALUs are easy to pipeline up to those speeds. I don't remember the details, but I recall a blurb about scheduler/dispatch logic being happier at lower clocks, maybe because some parts of the process just can't be broken down into more granular stages.
 
That was answered in a patent or two, if I recall correctly. ALUs are easy to pipeline up to those speeds. I don't remember the details, but I recall a blurb about scheduler/dispatch logic being happier at lower clocks, maybe because some parts of the process just can't be broken down into more granular stages.
I was also referring to the TMUs and ROPs. Of course, halving their sizes means reducing their count and sharing them among more ALUs - so you get into a communication trade-off.

There's an all-to-all crossbar between the ROPs and ALUs, whose complexity could be cut by halving the ROP count. But there's another trade-off there, as the ROPs would then need beefier crossbars between them and the MCs, and that's theoretically the highest data rate on the chip...

As for intra-SM scheduling/dispatch, part of the trade-off is the latency of the memories. Area can be saved by bursting data (reduced addressing density/clocking), and the area saved there might exceed the area saved by double-clocking.
 
But NVIDIA doesn't do a true custom design. I don't know about their GF100 methodology, but basically they originally did automatic place-and-route on (comparatively) very large specialised/optimised cells. They might have moved to structured custom nowadays though (manual placement of small cells, automatic routing) but I'm not certain. Both require a lot less effort than true full-custom (although certainly more than classic design).

Just an FYI, full-custom IS classic design!
 
For a GPU?

For semiconductors, period. And yes, even for GPUs, dating back to SGI et al.

And pretty much all leading edge products contain at least some full custom design even now.

And if Nvidia needs full custom to run at 1.6 GHz, then they have a lot of other problems. Simple datapath P&R will hit that.
 
And if Nvidia needs full custom to run at 1.6 GHz, then they have a lot of other problems. Simple datapath P&R will hit that.

Is it possible that they might be doing full custom design (even if only for a part of the chip) aiming to lower area and power, and not with an aim to increase clocks?
 