AMD: R9xx Speculation

Indeed. And actually, the P4 is still the fastest x86 CPU ever in terms of clock speed. The fastest commercial version was clocked at 3.8GHz, and if I'm not mistaken it still holds the overclocking record at above 8GHz. And that was on 65nm...
It might lose the "nominal" fastest clock soon. The Core i5-680 is supposedly clocked at 3.6GHz, and 3.86GHz with Turbo, so if you count Turbo as non-OC (since it is an officially sanctioned clock), that would break the P4's 3.8GHz. That is on 32nm though, and IIRC the OC record for Clarkdale is indeed "only" around 7GHz...
 
They don't necessitate, they *allow*. Big difference.

But yeah, I would like a detailed explanation myself.

I can give you a simple explanation, not a detailed one ;)

I stole this line from 3dilettante:
"If 9 stages of a 10-stage pipeline take 1ns, but the 10th takes 5ns, the cycle time must be 5ns."

Imagine you break an instruction down into 10 micro-ops, each one executed in a stage of a pipeline.
Now imagine you break the instruction down into 31 stages (aka Prescott): each stage is simpler, therefore it can be executed faster.
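
To put rough numbers on that (the stage delays below are made up purely to show the arithmetic), a quick Python sketch:

# Toy model: the clock period is set by the slowest stage in the pipeline.
def cycle_time(stage_delays_ns):
    # The clock period must accommodate the slowest stage.
    return max(stage_delays_ns)

# 10-stage pipeline where one stage is much slower than the rest
unbalanced = [1.0] * 9 + [5.0]
print(cycle_time(unbalanced))   # 5.0 ns per cycle -> ~200 MHz

# Split the slow stage into five balanced stages: 14 stages in total
rebalanced = [1.0] * 14
print(cycle_time(rebalanced))   # 1.0 ns per cycle -> ~1 GHz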
 
I can give you a simple explanation, not a detailed one ;)

I stole this line from 3dilettante:
"If 9 stages of a 10-stage pipeline take 1ns, but the 10th takes 5ns, the cycle time must be 5ns."

Imagine you break an instruction down into 10 micro-ops, each one executed in a stage of a pipeline.
Now imagine you break the instruction down into 31 stages (aka Prescott): each stage is simpler, therefore it can be executed faster.

What would be the limiting factor here? Why don't they break it into xxx stages and allow the instruction cycle to be executed faster, plus opening room for higher clocks...

Also, what would take more power: a read/write in the processor cache, an instruction cycle, or any other process a microprocessor can execute?
And how would this compare to a GPU? I think in the GPU case, a lot more power is required when reaching texture units and things in cache (which is done a lot in a GPU) - that would explain the higher TDP per transistor.
 
I guess there is a limit to how far an instruction can be broken down.

If you liken a pipeline to an assembly line, the less work each person on the line performs, the less time it takes them, and therefore the conveyor belt can move faster.
 
What would be the limiting factor here? Why don't they break it into xxx stages and allow the instruction cycle to be executed faster, plus opening room for higher clocks...
Splitting stages in the pipeline is not free.
Let's say a pipeline stage takes 2ns, and we split it into two stages.
The expectation is that the two stages together will take 2.x ns, not 2ns.

Splitting a stage means placing pipeline registers to hold the result of the previous stage for the next clock cycle. When the stage starts, some amount of time is needed for the voltages in the stage to stabilize at the appropriate levels.

So every stage has a latch setup period, a propagation period where the signals go through the logic and head to the latch on the other end, and then for various reasons designers can have extra slack in the timing.

You can't divide the stages forever because setup time and overhead do not scale down.

If the cycle time is reduced to below what it takes for setup, no work gets done.

Before that happens, overhead in the form of branch mispredicts gets significantly worse the more stages there are, because more and more instructions are in-flight that have to be discarded.
Without the stream of good instructions, the longer time it takes to process an individual instruction is revealed.
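
A toy model of those two effects together (fixed per-stage overhead that doesn't shrink, plus flush costs that grow with depth). The numbers are invented and exaggerated just to show the shape of the curve:

# total_logic_ns: total logic delay of the whole instruction, split evenly across stages.
# overhead_ns: per-stage latch setup + slack, which does NOT shrink as stages are added.
def clock_period(total_logic_ns, n_stages, overhead_ns=0.2):
    return total_logic_ns / n_stages + overhead_ns

def throughput(total_logic_ns, n_stages, overhead_ns=0.2, mispredict_rate=0.1):
    # Instructions per ns, assuming each mispredict throws away roughly
    # n_stages worth of in-flight work.
    period = clock_period(total_logic_ns, n_stages, overhead_ns)
    cycles_per_instr = 1 + mispredict_rate * n_stages
    return 1 / (period * cycles_per_instr)

for n in (5, 10, 20, 30, 60):
    print(n, round(throughput(10.0, n), 3))
# Throughput rises at first, then flattens and falls once the fixed overhead
# and the flush cost dominate.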
 
Additionally, if you had, say, 30 stages, which would theoretically allow 30 instructions in flight at once, and there are some dependencies, then you've now got to flush the entire pipeline and redo it. This incurs a rather large performance deficit. It's one of the reasons (among many) that the P4 wasn't faster per MHz than the P3. But since the P4 was able to reach much higher clocks, it still ended up faster overall.
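
Purely illustrative arithmetic (not measured P3/P4 figures), just to show how a lower per-clock design can still win overall:

# Made-up numbers: a design with lower per-clock throughput can still win
# if it clocks high enough.
shallow = {"ipc": 1.0, "clock_ghz": 1.4}   # shorter pipeline, P3-like
deep    = {"ipc": 0.7, "clock_ghz": 3.0}   # longer pipeline, P4-like

for name, d in (("shallow", shallow), ("deep", deep)):
    print(name, round(d["ipc"] * d["clock_ghz"], 2), "billion instructions/s")
# deep wins here: 0.7 * 3.0 = 2.1 vs 1.0 * 1.4 = 1.4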

I'm sure I might not be entirely spot on with the above, it's just my layman's understanding of it.

Regards,
SB
 
there are some dependencies, then you've now got to flush the entire pipeline and redo it

No, typically dependencies are just stalled (held) in something like a decode stage. Suppose instruction B is dependent on instruction A. B is currently held in the decode stage. After A executes, the result can be forwarded to the decode stage (meaning B doesn't even have to wait for A to fully travel through the pipeline, i.e. write to memory, before it can execute).

What I think you are referring to is when a branch gets executed. After a branch gets executed (and taken) any instruction behind it gets squashed (hence why branch prediction is important). But even then, the whole pipeline isn't flushed (the instructions in stages behind the branch instruction are just marked so they can pass through the whole pipeline without executing/writing to memory/etc).
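
A minimal sketch of that forwarding idea, using the classic textbook five-stage pipeline (IF/ID/EX/MEM/WB) rather than any real x86 core:

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def issue_cycle_of_dependent(forwarding):
    # Cycle (0-based) in which dependent instruction B can enter EX,
    # if producer instruction A enters IF at cycle 0.
    a_ex = STAGES.index("EX")   # A produces its result here (cycle 2)
    a_wb = STAGES.index("WB")   # ...and architecturally commits here (cycle 4)
    return a_ex + 1 if forwarding else a_wb + 1

print(issue_cycle_of_dependent(forwarding=True))    # 3: B runs right after A's EX
print(issue_cycle_of_dependent(forwarding=False))   # 5: B stalls until after A's WB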
 
What I think you are referring to is when a branch gets executed. After a branch gets executed (and taken) any instruction behind it gets squashed (hence why branch prediction is important). But even then, the whole pipeline isn't flushed (the instructions in stages behind the branch instruction are just marked so they can pass through the whole pipeline without executing/writing to memory/etc).

Yeah, if the branch prediction misses, you get the disadvantage of a pipelined approach.

Now, back on topic: is AMD willing to sacrifice the small-die-area concept to overcome the memory bandwidth limitation due to the 256-bit interface? They cannot pull another GDDR5 out of the hat again. 512-bit I think would be too expensive and big, but a 384-bit interface would leave room for the beloved sideport. However, if the fab process remains the same (40nm), then the chip size would grow a lot... more transistors, more memory channels, sideport... omg
 
What memory bandwidth limitation? I don't think it's ever been demonstrated that memory b/w is a limiting factor for Cypress; certainly some code is shader bound, other code memory bound. Balancing in the middle is the name of the game. Triangle setup and on-the-fly optimizations from the compiler seem to be the areas needing the most attention... unless they make a FireStream card to market to Radeon customers and dump a bunch of cache on there.

I wonder when the grouping of stream cores will be re-evaluated, perhaps change from 4+1 = thread processor to 3+2? I speculate wildly.
 
It's 1+1+1+1+1. :smile:

There are two ideas:
They make the "Rys-ALU" (the T-unit) thinner: no MADD for it (Gipsel's idea).
They merge the W-unit and the T-unit to one unit (found in the iXBT forums).
 
It's 1+1+1+1+1. :smile:

There are two ideas:
They make the "Rys-ALU" (the T-unit) thinner: no MADD for it (Gipsel's idea).
They merge the W-unit and the T-unit to one unit (found in the iXBT forums).

You forgot the PhysX unit. As for the 3+2 idea, is RV870 bottlenecked by having only one T-unit? (I assume not.)
 
What would be the limiting factor here? Why don't they break it into xxx stages and allow the instruction cycle to be executed faster, plus opening room for higher clocks...

At some point you have one simple gate per stage, say a 2-input NAND or NOR. This simple gate takes, say, 10ps of delay. The setup, hold, and uncertainties for the flop per stage take up, say, 90ps. At this point, you've obviously gone off the deep end: only 10% of your cycle time is spent doing useful work, and you are burning a lot of area and a lot of power.
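
Same arithmetic as a tiny sketch, reusing the made-up 10ps/90ps figures from above:

def useful_fraction(logic_ps, flop_overhead_ps=90.0):
    # Fraction of each cycle spent doing real work rather than paying
    # the fixed per-stage register overhead.
    return logic_ps / (logic_ps + flop_overhead_ps)

for logic_ps in (10, 100, 500, 1000):
    print(logic_ps, "ps of logic ->", round(useful_fraction(logic_ps) * 100), "% useful")
# 10 ps   -> ~10% useful (the absurd one-gate-per-stage case)
# 1000 ps -> ~92% useful (fewer, fatter stages amortize the overhead)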
 
No, typically dependencies are just stalled (held) in something like a decode stage. Suppose instruction B is dependent on instruction A. B is currently held in the decode stage. After A executes, the result can be forwarded to the decode stage (meaning B doesn't even have to wait for A to fully travel through the pipeline, i.e. write to memory, before it can execute).

There are times when you cannot know whether something is dependent on something else. This is even more so in a register-memory architecture like x86.
 
What memory bandwidth limitation? I don't think it's ever been demonstrated that memory b/w is a limiting factor for Cypress; certainly some code is shader bound, other code memory bound. Balancing in the middle is the name of the game. Triangle setup and on-the-fly optimizations from the compiler seem to be the areas needing the most attention... unless they make a FireStream card to market to Radeon customers and dump a bunch of cache on there.

I wonder when the grouping of stream cores will be re-evaluated, perhaps change from 4+1 = thread processor to 3+2? I speculate wildly.

The memory bandwidth limitation that a 256-bit interface has... I never said RV8xx suffers from a memory limitation.
 
Problems with increasing the clock speed of a GPU include increased power usage and complexity. The GPU's pipeline is more than just ALUs, and shortening each pipeline stage means more pipeline registers are needed, so you end up with a lot of extra registers and thus a lot of extra power.

Nvidia tackled this by putting the shaders on a separate clock from the rest of the pipeline. AMD has a reduced shader clock relative to Nvidia, but more shaders.

Also, CPU designs use a more painstaking layout process which increases development time in addition to increasing clock rate.
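
A rough sketch of the power side of that trade-off, using the usual dynamic-power approximation P ~ C * V^2 * f with invented values:

def dynamic_power(cap, voltage, freq_ghz):
    # Switched capacitance * voltage squared * frequency.
    return cap * voltage**2 * freq_ghz

base   = dynamic_power(cap=1.00, voltage=1.0, freq_ghz=1.0)
# Deeper pipelining adds pipeline registers (more switched capacitance),
# and chasing the higher clock usually needs a voltage bump as well.
deeper = dynamic_power(cap=1.20, voltage=1.1, freq_ghz=1.5)

print(round(deeper / base, 2), "x the dynamic power for 1.5x the clock")
# ~2.18x the power for 1.5x the clock in this made-up example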
 