AMD: R9xx Speculation

I'm not so sure ... a fundamental problem with Larrabee's approach is that it can only handle certain high-latency events cheaply; the cost of pushing/popping context is far too high to pay on every type of memory access. Everything else has to be rare.

In a throughput-optimized architecture, doesn't that strike you as strange? I would personally want an architecture that can deal well with any type of cache miss through vertical multithreading ... Larrabee ain't getting me there, not enough threads.
The developer, or compiler in the case of graphics, is supposed to do their own fibre-based multithreading on top of the basic hardware threading.
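To make the fibre idea concrete, here's a toy Python model (my own sketch, not how Larrabee's runtime actually works, and the latency number is made up): each fibre yields at a long-latency fetch, and a round-robin scheduler switches to another fibre instead of stalling the hardware thread.

```python
# Toy model of fibre-based latency hiding on one hardware thread.
from collections import deque

MEM_LATENCY = 4  # cycles a fetch takes to complete (made-up number)

def fibre(name, fetches):
    for i in range(fetches):
        yield MEM_LATENCY          # issue a fetch, hand control back
        # ...ALU work on the fetched data would go here...

def run(fibres):
    """Round-robin scheduler; returns (total cycles, stalled cycles)."""
    ready = deque((f, 0) for f in fibres)   # (fibre, cycle it becomes ready)
    cycle, stalled = 0, 0
    while ready:
        f, t = ready.popleft()
        if t > cycle:                        # nothing else runnable: stall
            stalled += t - cycle
            cycle = t
        try:
            lat = next(f)
            cycle += 1                       # one cycle to issue
            ready.append((f, cycle + lat))   # runnable once the fetch lands
        except StopIteration:
            pass
    return cycle, stalled

busy1 = run([fibre("a", 8)])                  # one fibre: mostly stalled
busy4 = run([fibre(c, 8) for c in "abcd"])    # four fibres overlap fetches
print(busy1, busy4)
```

With one fibre the thread stalls for most of every fetch; with four, the fetches overlap and the stall fraction collapses, which is the whole argument for fibres on top of Larrabee's small hardware-thread count.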

On the other side ATI and NVidia GPUs, on compute workloads, will often be running with less than 10 hardware threads per SIMD.

ATI has 4-cycle instructions and NVidia has 2-cycle instructions. That's halfway to fibre-based - they don't expose their 16-wide SIMDs as raw 16-wide SIMDs the way Larrabee does. Give Larrabee 4 fibres per hardware thread.
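The arithmetic behind that cycle-multiplexing, using the numbers from this post rather than vendor documentation: the logical vector width is the physical SIMD width times the cycles an instruction occupies issue.

```python
# Logical vector ("wavefront"/"warp") width = SIMD lanes x issue cycles.
def logical_width(simd_lanes, issue_cycles):
    return simd_lanes * issue_cycles

ati_wavefront = logical_width(16, 4)   # ATI: 16 lanes over 4 cycles
nv_warp       = logical_width(16, 2)   # NVidia: 16 lanes over 2 cycles
print(ati_wavefront, nv_warp)          # 64 and 32
```

Which lands on the familiar 64-element wavefront and 32-element warp - each hardware thread is already behaving like several fibres' worth of lanes.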

Or, maybe Larrabee should have 8 hardware threads instead of 4. Then again Larrabee is supposed to be the mind-melding of software and hardware guys...

Jawed
 
Exactly, if this is truly 67xx, I don't think they can or will simply yank the NI core out and replace it with the Evergreen shader core.
I can't reconcile this sentence...
The shader core per se is relatively easy, isn't it?
... with this sentence. I'm not sure what you're saying here.

Otherwise they might as well produce something like a Cedar-like NI in 40nm to save some R&D.
RV740 has ALU level (burst fetch) and ROP-level (doubling of ROPs per MC) advances beyond any of the prior R7xx GPUs.

One thing I'm musing over is the peculiar dual-tessellator architecture of Evergreen. There's the old-style tessellator for compatibility with R600's tessellator, which is limited to a tessellation factor of 16, and then there's the new "more programmable" (which I can't see, frankly) tessellator that handles higher tessellation factors/D3D11.

My suspicion is that the D3D11 tessellator is a short-term kludge to get past the "missing" rasteriser/ALU architecture in Evergreen. Because stuff was cut, AMD had to retreat to an adjunct. What I'm wondering is whether tessellation is ultimately meant to be kernel-based, not fixed-function.

My problem with TS is that a lot of it is simply interpolation. A feature of Evergreen is that the interpolation for vertex attributes (Shader Pipe Interpolators) was deleted and became an ALU instruction. Why isn't TS using the same interpolation instruction?

If TS becomes a kernel (see the Export Shader kernel for an example of a "hidden" kernel in ATI) then this has ramifications on throughput - i.e. vastly higher than the current D3D11 TS, necessitating, in my view, an overhauled rasteriser and overhauled sequencer/SIMDs/ALUs.

Though I admit I haven't worked out what a TS kernel would look like - seems pretty fiddly.

Quite unlikely, Fermi-like cache doesn't really translate to real world performance,
How do you conclude that?

neither is Cypress cache-bound
How do you conclude that?

and GPGPU is not SI's forte anyway.
How do you conclude that? :oops:

Fact is, GPGPU is a major focus of D3D now, so one way or another ATI needs to catch up.

Like I said above, improving GPGPU-related performance will be highly unlikely on SI GPUs which is either a stop-over or the next 67x0, ie. 57x0 even removed DP capability.
I don't think it's wishful thinking to imagine that 1 year+ later than Evergreen, progress on all fronts would be made.

Clearly the disappearance of 32nm is potentially a big knock, perhaps causing feature loss as seen in Cypress. And I agree, we don't know if the next chip is just a minor refresh-style increment.

Anyway, you asked for suggestions on spending up to 20% extra die space - there it is. And 20% really isn't very much.

On the other hand, SI/NI might share some miracle-worker (MC/ROP) from Llano to drastically reduce or at least mitigate bandwidth requirement.
You mean like a Fermi-style cache hierarchy? ;)

Otherwise even GDDR5+ won't do it on NI, and SI won't even have faster memory parts. At least a 20% increase in real-life performance, which should be the minimal expectation six months from now, can't just come out of nowhere.
32nm's disappearance spoils most guesstimations here :???:

Jawed
 
What is all this rubbish about variable latency RFs? The operand collector is really there to handle banking conflicts and improve bandwidth.
And there are banking conflicts (or if you prefer, orchestrated reads) in fetching operands from the RF for a single instruction issue (see the patent I linked in post 737). Not to mention that in GT200 the three ALUs - MAD, MI and DP - all return their results at different rates, making a real mess of sequential-instruction-issue dependencies: stuff that falls upon the scoreboards and operand collector to straighten out.
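A toy model of why the operand collector earns its keep (illustrative only, not GT200's actual bank count or mapping): operands for one instruction live in a banked RF, reads to the same bank in the same cycle serialize, so gathering three operands can take anywhere from one to three cycles depending on which banks the registers happen to map to.

```python
# Toy banked register file: reads hitting the same bank serialize, so the
# worst-case bank decides how many cycles operand gathering takes.
from collections import Counter

def gather_cycles(operand_regs, num_banks=4):
    banks = Counter(r % num_banks for r in operand_regs)
    return max(banks.values())

print(gather_cycles([0, 1, 2]))  # three different banks: 1 cycle
print(gather_cycles([0, 4, 8]))  # all map to bank 0: 3 cycles
print(gather_cycles([0, 4, 1]))  # two on bank 0, one on bank 1: 2 cycles
```

That variability per instruction is exactly the "orchestrated reads" problem - the collector buffers partially-gathered operands so issue doesn't have to stall on the worst case every time.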

Nobody makes variable latency RFs AFAIK, because it would cause an absolute shit storm for the compiler and rest of the pipeline.
This, the convoluted register-dependency-analysis/operand-readiness/instruction-issue scoreboarding, appears to be a key reason why G80->GT200 have such awful GFLOPs per mm².

What I'm curious about is how much of this complexity has been cleaned out in Fermi?

Jawed
 
The developer, or compiler in the case of graphics, is supposed to do their own fibre-based multithreading on top of the basic hardware threading.
I see the same conflict in this as others see in having both local memory and cache. They serve largely the same purpose and one could be subsumed in the other with the unification creating a more efficient use of resources and area.
 
My problem with TS is that a lot of it is simply interpolation. A feature of Evergreen is that the interpolation for vertex attributes (Shader Pipe Interpolators) was deleted and became an ALU instruction. Why isn't TS using the same interpolation instruction?
Most of the interpolation is done in the DS anyway, and also the tessellation stage is much easier to implement with fixed point math, unless you like to see cracks between primitives... :)
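The crack argument can be shown in a few lines (my own illustration, not ATI's actual scheme): a shared edge whose parameters are computed in float from opposite ends need not produce bit-identical values, and any mismatch between the two triangles is a crack. Integer fixed-point math along a canonical edge direction is exact and reproducible.

```python
# Float edge parameters computed from opposite ends of a shared edge can
# disagree in the last ULP; integer fixed point along one canonical
# direction is deterministic and lands exactly on the endpoints.
def edge_params_float(n, reverse=False):
    if reverse:
        return [1.0 - (n - k) / n for k in range(n + 1)]
    return [k / n for k in range(n + 1)]

def edge_params_fixed(n, one=1 << 16):
    # both sides agree to walk the edge in one canonical direction
    return [k * one // n for k in range(n + 1)]

n = 3
print(edge_params_float(n) == edge_params_float(n, reverse=True))  # False
fx = edge_params_fixed(n)
print(fx[0] == 0 and fx[-1] == (1 << 16))                          # True
```

So a TS kernel built on fp32 interpolation would have to be very careful about evaluation order on shared edges, which is presumably part of why the fixed-function block sticks to fixed point.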
 
This may not be the best thread to link it, but I would like someone to explain what's going on in those tests at computerbase.de.
It looks like Cypress is fine with tessellation itself but it struggles when you add 4xAA.
 
Debugged ~10h today, so delete my message please :D.

Anyway, interesting results there.
 
Most of the interpolation is done in the DS anyway, and also the tessellation stage is much easier to implement with fixed point math, unless you like to see cracks between primitives... :)
I suppose this is like the rasterisation dichotomy: interpolation or handed-walk? Interpolation has to end up producing the same result as a handed-walk.

Fixed point seems to be mostly an implementation factor for an FSM, i.e. it's needless to normalise at each state, particularly if the precision/range of the output is known in advance. We see a similar thing in the transcendental functions.

Is the precision of fp32 math a problem for tessellation? Do you end up having to emulate higher precision?

Jawed
 
This may not be the best thread to link it, but I would like someone to explain what's going on in those tests at computerbase.de.
It looks like Cypress is fine with tessellation itself but it struggles when you add 4xAA.
It takes a big hit with 4xAA in that game even without tessellation.

The problem is probably memory related. Does that game use deferred shading?
 
Wasn't there an announcement by GlobalFoundries that customers would announce real products on the 28nm process in Q1? I haven't seen any such announcements.
 
Wasn't there an announcement by GlobalFoundries that customers would announce real products on the 28nm process in Q1? I haven't seen any such announcements.

Which year's Q1 would that be?:p

Offtopic PS: Your username could easily be a part name of an upcoming ATI series!;)
 
Which year's Q1 would that be?:p

Offtopic PS: Your username could easily be a part name of an upcoming ATI series!;)

I meant Q1 2010 of course :) Here's the press release in question http://www.globalfoundries.com/pdf/GF_HKMG_TrifoldBrochure.pdf

The part I'm talking about is

Proof #5
Customer Product Introductions in Q1 2010:
GLOBALFOUNDRIES’ customers will begin to
announce HKMG product results in early 2010 - no
test chips, not 64M SRAMS, not IP shuttle results
but full products. As with the 45/40nm ramp, this
will be far ahead of any other pure play foundry.
So yeah, where are these full product announcements?

Haha my username is actually from the game Freespace :D
 
Early 2010 doesn't specify Q1, though; as long as it's during H1 it can be stretched to mean "early".
 
I'd expect January (like e.g. R580, RV620, RV635 etc.) or March (like R481, G71, GF100...). February is unlikely due to the Chinese New Year, and I don't believe they'd call Q2 "early 2010"...
 