AMD: R9xx Speculation

If you include the bulk half nodes, it is the other way round.

Depends how you look at it :smile:

From where I'm sitting, AMD is hinting in various interviews that they are preparing the N.I. architecture for this year's refresh; on the other hand, we have GloFo and TSMC process roadmaps stating that 28nm is basically Q1 2011 at best for volume.
This leaves us with either a refresh on 40nm, or 32nm SOI from GloFo, which should be volume-production ready by Q3 2010.

I'm hoping it will be 32nm SOI but realistically I'm expecting a 40nm refresh (it's wise not to do a new process and a new architecture in one go :)). As a nice surprise, ahead-of-schedule 28nm bulk would do!
 
If NI is @28nm, then AMD has a chance to nuke nv out of orbit :devilish:

However, a 40 nm chip on a new architecture seems way more likely.
 
However, a 40 nm chip on a new architecture seems way more likely.
I'd say it's obvious - though "new architecture" might be stretching it.

What would AMD add on to Cypress? What did AMD cut out of Cypress in order to meet the W7 deadline last year? How big does AMD want to go? Will AMD care about an X2 version - why bother if the new chip is way faster?...

Jawed
 
If they stick to Cypress's die budget for its successor, then I'd say that to recover the R&D cost, if not to make people go out and buy new cards, it needs at least a 10-15% perf/mm advantage.

OTOH, rv670->rv770->cypress has seen a ~30% die size increase with each new chip, despite the sweet spot strategy. :rolleyes:
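(Going by the commonly quoted die sizes, that roughly checks out: RV670 at ~192 mm² to RV770 at ~256 mm² is about +33%, and RV770 to Cypress at ~334 mm² is about +30%.)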

I wonder what is gonna give?
 
If they stick to Cypress's die budget for its successor, then I'd say that to recover the R&D cost, if not to make people go out and buy new cards, it needs at least a 10-15% perf/mm advantage.
Why stick to Cypress's die budget?

OTOH, rv670->rv770->cypress has seen a ~30% die size increase with each new chip, despite the sweet spot strategy. :rolleyes:
Sweet-spot strategy isn't "smallest die we can make".

I wonder what is gonna give?
See my sig. Nothing needs to give.

Jawed
 
I'd say it's obvious - though "new architecture" might be stretching it.

Do you consider gt200->fermi a major architectural overhaul? I would say so. And I think it will be something in a similar vein. This is AMD's first gpu whose development probably began right around the same time as Bulldozer. I would not be surprised at all if it was redesigned to merge with Bulldozer at the core level, not just at the sit-on-the-same-die level like Llano.

While we are at it, I'll wager that in the BD refresh (if not @ 22nm SOI) they are planning to take a couple of SIMD engines (somewhere between 1 and 4) and put them into a Bulldozer module. :) The SIMDs will have their private L1 data and texture caches (just like the BD cores have theirs), and the L2 caches of the CPU and SIMD cores will be unified.
 
Do you consider gt200->fermi a major architectural overhaul? I would say so.
I agree. Almost nothing carries over.

And I think it will be something in a similar vein. This is AMD's first gpu whose development probably began right around the same time as Bulldozer. I would not be surprised at all if it was redesigned to merge with Bulldozer at the core level, not just at the sit-on-the-same-die level like Llano.
Bulldozer's been work in progress for yonks, longer than the next GPU.

To merge with Bulldozer it needs to integrate with BD's memory system.

That's fundamentally a cache/MC question. Which, incidentally, is what I think got chopped out of Cypress. AMD retained the R700 memory system for Cypress.

I wouldn't be at all surprised if the next chip has the same ALU, TU and ROP counts as Cypress.

While we are at it, I'll wager that in the BD refresh (if not @ 22nm SOI) they are planning to take a couple of SIMD engines (somewhere between 1 and 4) and put them into a Bulldozer module. :) The SIMDs will have their private L1 data and texture caches (just like the BD cores have theirs), and the L2 caches of the CPU and SIMD cores will be unified.
BD needs a GPU on-die, basically. AMD's strategy is OpenCL: CPU and GPU integrated makes for a compute monster. SSE will become irrelevant if you want throughput.

The idea of putting a GPU SIMD engine "in-line" as part of a BD module is problematic - it has a very different concept of registers and instruction streams. To do this would require an entirely new GPU SIMD. I won't say that's unlikely, but I do think it's a long way off since the x,y,z,w,t set in current GPUs is heavily refined.

I suppose it's possible to strip down the x,y,z,w,t set for implementation within a BD module (re-work it for core clocks, remove the optimisations it has for clause-by-clause execution), but then it's hardly different or novel from what BD already has, except for the t lane (which is actually pretty useful, to be fair).

Jawed
 
Bulldozer's been work in progress for yonks, longer than the next GPU.

AFAIK,
  • Barcelona taped out in late 2006, early 2007
  • RV770's development began in 2005
  • Evergreen's development began in 2006
  • If it takes 3 years to build a GPU, starting from the beginning, then NI is the first GPU designed under AMD's roof from its conception. Hence, overall, BD and NI are the first chips to be designed start to finish under one roof.

To merge with Bulldozer it needs to integrate with BD's memory system.

Llano integrates <some-bastard-tree-of-evergreen> with K10.5's mem system.

That's fundamentally a cache/MC question. Which, incidentally, is what I think got chopped out of Cypress. AMD retained the R700 memory system for Cypress.

I wouldn't be at all surprised if the next chip has the same ALU, TU and ROP counts as Cypress.

BD needs a GPU on-die, basically. AMD's strategy is OpenCL: CPU and GPU integrated makes for a compute monster. SSE will become irrelevant if you want throughput.
Actually, I feel SSE has more relevance than AVX going forward (i.e. 5 years down the line), but not for raw throughput.

The idea of putting a GPU SIMD engine "in-line" as part of a BD module is problematic - it has a very different concept of registers and instruction streams. To do this would require an entirely new GPU SIMD. I won't say that's unlikely, but I do think it's a long way off since the x,y,z,w,t set in current GPUs is heavily refined.

I suppose it's possible to strip down the x,y,z,w,t set for implementation within a BD module (re-work it for core clocks, remove the optimisations it has for clause-by-clause execution), but then it's hardly different or novel from what BD already has, except for the t lane (which is actually pretty useful, to be fair).

I guess a picture is worth a thousand words.

[Attached image: amd_bulldozer_gpu_simd.jpg]

I had something like this in mind. Keep the SIMDs as they are. Just integrate them tightly with the cache hierarchy and kill the driver latency. A little bit like the TMUs moving inside the SMs in Fermi, push the SIMD engines inside a BD module.
 
AFAIK,
  • Barcelona taped out in late 2006, early 2007
  • RV770's development began in 2005
  • Evergreen's development began in 2006
  • If it takes 3 years to build a GPU, starting from the beginning, then NI is the first GPU designed under AMD's roof from its conception. Hence, overall, BD and NI are the first chips to be designed start to finish under one roof.
I've got no trouble with that. I just disagree that BD started development as recently as your earlier message implies:

This is AMD's first gpu whose development probably began right around the same time as Bulldozer.

I guess a picture is worth a thousand words.

I had something like this in mind. Keep the SIMDs as they are. Just integrate them tightly with the cache hierarchy and kill the driver latency. A little bit like the TMUs moving inside the SMs in Fermi, push the SIMD engines inside a BD module.
Oh, that's much looser than I thought you were saying. I thought you meant the GPU style SIMDs would replace the shared FMA SIMDs that are currently in BD.

It seems to me that loosely coupled GPU SIMDs that are module-dedicated are troublesome - if they're loosely coupled, then they might as well be shared by the CPU cores as a whole, with L3 taking the strain.

Jawed
 
I've got no trouble with that. I just disagree that BD started development as recently as your earlier message implies:

>4 years to make a new uarch? WTH are they trying to do with BD? re-implement x86 from scratch? :rolleyes:

Anyway, with >4 years in the oven and a new gpu being built alongside it, it seems reasonable to assume that there has been a lot of influence both ways.

Oh, that's much looser than I thought you were saying. I thought you meant the GPU style SIMDs would replace the shared FMA SIMDs that are currently in BD.
Nah, the SSE unit will remain there until the end of AMD or the end of x86, whichever is earlier.

It seems to me that loosely coupled GPU SIMDs that are module-dedicated are troublesome - if they're loosely coupled, then they might as well be shared by the CPU cores as a whole, with L3 taking the strain.

What troubles do you foresee with this kind of coupling?

The idea is that by putting a SIMD engine next to a full-blown x86 core, the gpu kernels can be programmed to call back into x86 code when they finish, with their return data sitting right there. Or vice versa, with the x86 core setting up launch parameters in the L2 and then issuing kernel calls. IOW, reduce the data latency between cpu and gpu.
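To make the handshake concrete, here's a rough host-side sketch of how it has to be done today with stock OpenCL (the queue/kernel/buf objects and the on_kernel_done function are placeholders I'm making up; only the cl* calls are real API):

#include <CL/cl.h>
#include <stdio.h>

/* x86 code that picks up as soon as the GPU kernel completes */
void CL_CALLBACK on_kernel_done(cl_event ev, cl_int status, void *user_data)
{
    printf("kernel finished (status %d), results sitting in the buffer\n", status);
}

void launch_and_call_back(cl_command_queue queue, cl_kernel kernel, cl_mem buf)
{
    size_t global = 1024, local = 64;   /* 16 work-groups of 64 work-items */
    cl_event done;

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, &done);

    /* today this callback fires only after a round trip through the driver;
       with the SIMDs inside the module the handoff would be a hit in the shared L2 */
    clSetEventCallback(done, CL_COMPLETE, on_kernel_done, NULL);
    clFlush(queue);
}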

Personally, I am hoping for the kernel launch granularity to come down from a bunch of work groups to just one work group. With recursive block calls, i.e. work groups calling work groups (same size of course), and with the ability to launch x86 functions at the end of workgroups, it will lead to better synergy between cpu and gpu code.

Power management is another plus. With increasingly finer granularity of core level power gating, the same infrastructure can be reused for gpu power gating.
 
>4 years to make a new uarch? WTH are they trying to do with BD? re-implement x86 from scratch? :rolleyes:
CPUs, particularly new designs that have the complexity of a multicore high-performance OoO x86, do take that long.
It's not clear just how long elements of Bulldozer have been in development, as various x86 projects to succeed K8 have been cancelled, and the internal project code-named Bulldozer may have been restarted or had its codename transferred at some point.

However, it's not exactly like GPUs are as fleet of foot as some may think. The upper bounds of GPU gestation periods overlap pretty well with the lower bounds of CPU design periods as of late.
 
>4 years to make a new uarch? WTH are they trying to do with BD? re-implement x86 from scratch? :rolleyes:
I really don't care much about x86, it's boring (except Larrabee, but that's only because it hasn't arrived, once it's here it'll be even more boring). But hasn't AMD junked at least two designs since K8?

What troubles do you foresee with this kind of coupling?
The GPU would munch through L2, starving the core, which has other things to do. Better to stream through L2 into L3 for the GPU-SIMD to use, I suspect.

Also, with module-local, you can only maximise GPU-SIMD usage by pushing work through all the modules, enforcing fine-grained tasks when only a coarse-grained (single host thread) task is required.

The idea is that by putting a SIMD engine next to a full-blown x86 core, the gpu kernels can be programmed to call back into x86 code when they finish, with their return data sitting right there. Or vice versa, with the x86 core setting up launch parameters in the L2 and then issuing kernel calls. IOW, reduce the data latency between cpu and gpu.
Going through L3 seems better to me. The difference in latency won't break your balls. These are still OpenCL kernels, they're not a few in-thread SSE instructions you're trying to speed-up.

Personally, I am hoping for the kernel launch granularity to come down from a bunch of work groups to just one work group.
Going to be hard to avoid having at least 1 workgroup per SIMD.
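(Cypress has 20 SIMD engines, for instance, so merely keeping the chip occupied already means 20 work-groups in flight.)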

With recursive block calls, i.e. work groups calling work groups (same size of course), and with the ability to launch x86 functions at the end of workgroups, it will lead to better synergy between cpu and gpu code.
Do things need to be more tangled than the fairly clean OpenCL event model?
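For illustration, with stock OpenCL calls (ka, kb and queue are placeholder names I've invented), a dependent kernel plus the CPU picking up afterwards is already just an event chain:

#include <CL/cl.h>

/* kernel B runs only after kernel A's event completes; the host (x86)
   resumes once B is through - no extra plumbing required */
void chain(cl_command_queue queue, cl_kernel ka, cl_kernel kb)
{
    size_t global = 1024, local = 64;
    cl_event a_done, b_done;

    clEnqueueNDRangeKernel(queue, ka, 1, NULL, &global, &local, 0, NULL, &a_done);
    clEnqueueNDRangeKernel(queue, kb, 1, NULL, &global, &local, 1, &a_done, &b_done);

    clWaitForEvents(1, &b_done);   /* CPU code carries on from here */
}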

Power management is another plus. With increasingly finer granularity of core level power gating, the same infrastructure can be reused for gpu power gating.
I don't see how that's affected by the tightness of coupling.

Jawed
 
>4 years to make a new uarch? WTH are they trying to do with BD? re-implement x86 from scratch? :rolleyes:

Anyway, with >4 years in the oven and a new gpu being built alongside it, it seems reasonable to assume that there has been a lot of influence both ways.

If I recall correctly, Bulldozer was being talked about even at the beginning of the Intel Core era. It has been pushed back ever since.

Back in the day, AMD also delayed the Hammer architecture for quite a while.

Edit: oops, 3dilettante already "produced" a better alternative to my post; yeah, BD may have been just a name.
 
Who's doing 28nm? and when?

*whistles*

Global Foundries 28nm??? He does say that in his lil snippet.

Now if we go by the piece in the Industry forums, whose party are they going to try and show up for on time? Could this be the first time ever that they release the laptop parts before they get the mainline desktop parts out the door? No one sneezes at a 140mm^2 Juniper replacement with better power/performance in the laptop space, and that's a pretty good pipe cleaner as well.

Does anyone see any reason why they shouldn't release from the middle and work up/downwards? Say a 66xx part first, followed by 64xx, then 67xx, and finally 68xx/69xx, with the IGP and Fusion solutions bringing up the rear? Laptops are a bigger market now than desktops, and time to market for this segment is paramount given the lead times for manufacturing. A 66xx would be an excellent first OEM part, IMO!
 
I really don't care much about x86, it's boring (except Larrabee, but that's only because it hasn't arrived, once it's here it'll be even more boring). But hasn't AMD junked at least two designs since K8?
I don't care about x86 either. I have no idea about the mess AMD has made internally. I always thought BD began after Barcelona was finished.


The GPU would munch through L2, starving the core, which has other things to do. Better to stream through L2 into L3 for the GPU-SIMD to use, I suspect.
Yeah, the gpu will eat up the L2. But if you share via L3, it'll eat up L3 too. The only way out seems to be to put an upper limit on the amount of cache the gpu can hog. And if you are gonna cap L3 usage, might as well put it together with a module and cap L2 usage as well.

Also, with module-local, you can only maximise GPU-SIMD usage by pushing work through all the modules, enforcing fine-grained tasks when only a coarse-grained (single host thread) task is required.
Good point, this approach will need some kind of cross-module load balancer to throw work groups at other, free SIMDs. Probably better off implemented in sw with some hw support.

Going through L3 seems better to me. The difference in latency won't break your balls. These are still OpenCL kernels, they're not a few in-thread SSE instructions you're trying to speed-up.


Going to be hard to avoid having at least 1 workgroup per SIMD.


Do things need to be more tangled than the fairly clean OpenCL event model?


I don't see how that's affected by the tightness of coupling.

Jawed

Sharing via L3 means it is no better (architecturally at least) than Llano, though they may very well go down that road. The overall direction is to make it system-level programmable, instead of using the gpu as a device visible to apps. I feel they'll take their cues (both regarding what to do and what not to do) from Cell's execution model (definitely not its programming model though :smile:), which is amenable to providing a JIT-level interface and allows flexible communication/signalling patterns between the cpu and gpu cores.
 
"Lurkermode off"

What if ATI has been quite happy with being first to the table with HD5xxx and 40 nm in spite of the problems at TSMC? They are not that afraid of (and are quite good at) being early adopters of a smaller process node.

It has several advantages:

1) They can be aggressive while playing it safe by doubling up with GF and TSMC, making two layouts for the chip, minimizing the risk of a repeat of the problematic 40 nm adoption.

2) Whichever fab is first ready with 28 nm makes the launch date. If GloFo is delayed, TSMC may come through, and vice versa.

3) If TSMC and GloFo are ready more or less simultaneously, there will be fewer supply constraints on cards at launch, since two 28 nm lines are used to satisfy demand. Even if early Cypress yields were low, say 30-40%, ATI may have realized that one line operating at 90% yield would not have been enough for early demand.

This could have been clear while they were still in negotiation with TSMC and GloFo about NI. They just went for both.
 