HD 5000 series: New architecture or more of a major refresh?

Cypress is more like Rampage+Sage on one chip...
 
Definitely not a simple refresh, which I usually associate with a tweaked chip (higher clocks, a reduction/increase in transistors, etc.) or just a shrink from one process node to the next.

But it's definitely not a new architecture either. Cypress is basically an HD 4870 X2 on one chip, or 2 x RV770. I call it a major refresh, as per the poll options.

Oh god...

Wavey, your guys should have fluffed Evergreen much more. :LOL:
 
It's a whole new architecture since it's DX11, has tessellation hardware, double precision stream processors, and good filtering.

ATI has had a tessellation unit since R600 (or even Xenos, to be more precise), yet obviously didn't have support in the ALUs for the hull & domain shader stages until now. Double precision has been on their GPUs since RV670, and filtering (i.e. anisotropic) has been at a very good level for several generations now.

If you had stopped after DX11, your point would be valid. ATI certainly didn't need a whole new architecture for the latter three.
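
To put the hull & domain point in concrete terms: in D3D11 these are two genuinely new pipeline stages that the application binds, not a mode of the old fixed-function path. A minimal C++ sketch (assuming an existing device/context and compiled shader blobs; the function name and parameters are illustrative):

```cpp
// Minimal sketch: binding the two new D3D11 tessellation stages.
// Assumes the device/context and compiled HS/DS bytecode already
// exist -- the function name and parameters are illustrative.
#include <d3d11.h>

void BindTessellationStages(ID3D11Device* device,
                            ID3D11DeviceContext* context,
                            const void* hsBytecode, SIZE_T hsSize,
                            const void* dsBytecode, SIZE_T dsSize)
{
    ID3D11HullShader*   hs = nullptr;
    ID3D11DomainShader* ds = nullptr;

    // Hull shader: runs per patch and emits the tessellation factors.
    device->CreateHullShader(hsBytecode, hsSize, nullptr, &hs);
    // Domain shader: runs per tessellated vertex and evaluates the surface.
    device->CreateDomainShader(dsBytecode, dsSize, nullptr, &ds);

    context->HSSetShader(hs, nullptr, 0);
    context->DSSetShader(ds, nullptr, 0);

    // The tessellator itself sits between HS and DS and is never programmed
    // directly; the app just switches to a patch topology and draws.
    context->IASetPrimitiveTopology(
        D3D11_PRIMITIVE_TOPOLOGY_3_CONTROL_POINT_PATCHLIST);

    hs->Release(); // the context keeps its own references
    ds->Release();
}
```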
 
ATI has had a tessellation unit since R600 (or even Xenos, to be more precise), yet obviously didn't have support in the ALUs for the hull & domain shader stages until now. Double precision has been on their GPUs since RV670, and filtering (i.e. anisotropic) has been at a very good level for several generations now.

If you had stopped after DX11, your point would be valid. ATI certainly didn't need a whole new architecture for the latter three.
The tessellation unit is all new.

http://www.geeks3d.com/20100210/test-hardware-tessellation-on-radeon-in-opengl-radeon-hd-5000-tessellators-details/
On Evergreen family, both tessellation engines (fixed and programmable) are physically separated.
 

What did I say exactly that contradicts what's stated in the article?

...yet obviously didn't have support in the ALUs for the hull & domain shader stages until now.

If you now want to call that a "programmable tessellation unit", there's nothing to object to there either, unless you want to argue semantics over the "unit" description.

From the article above:

The tessellator of Radeon HD 2000, 3000 and 4000 is a fixed function unit and we can’t program it with a shader...

On Radeon HD 5000 series (Evergreen family), things are different. The Radeon HD 5000 includes the fixed tessellator of HD 2000, 3000 and 4000 AND a new programmable tessellation unit.

A bit clearer: http://www.geeks3d.com/20100209/test-hardware-tessellation-on-radeon-in-opengl-part-12/
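
For what it's worth, the fixed tessellator those tests exercise is exposed in OpenGL through the GL_AMD_vertex_shader_tessellator extension, which shows how thin the pre-DX11 interface was. A rough C++ sketch (tokens and typedefs as I remember them from the extension spec; entry-point loading and error handling omitted):

```cpp
// Rough sketch of driving the pre-Evergreen *fixed* tessellator through
// GL_AMD_vertex_shader_tessellator. Tokens/typedefs follow the extension
// spec as I remember it; the loader and draw setup are illustrative.
#include <GL/gl.h>

#ifndef GL_CONTINUOUS_AMD
#define GL_DISCRETE_AMD   0x9006
#define GL_CONTINUOUS_AMD 0x9007
#endif

typedef void (APIENTRY *PFNGLTESSELLATIONFACTORAMDPROC)(GLfloat factor);
typedef void (APIENTRY *PFNGLTESSELLATIONMODEAMDPROC)(GLenum mode);

// Loaded via wglGetProcAddress/glXGetProcAddress in the usual way.
extern PFNGLTESSELLATIONFACTORAMDPROC glTessellationFactorAMD;
extern PFNGLTESSELLATIONMODEAMDPROC   glTessellationModeAMD;

void DrawTessellated(GLsizei indexCount)
{
    // CONTINUOUS interpolates smoothly between levels; DISCRETE snaps
    // to integer levels. The fixed unit tops out around a 15x factor.
    glTessellationModeAMD(GL_CONTINUOUS_AMD);
    glTessellationFactorAMD(14.0f);
    glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, 0);
}
```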
 
What did I say exactly that contradicts what's stated in the article?
You said AMD didn't need a whole new architecture for the last three things Secessionist listed, and while I don't disagree with that, I was pointing out that the tessellation hardware is in fact different. Most people don't know that. Microsoft created a different algorithm than the one used in the Xbox, and this required new hardware.
 
There is a structure that, when looked at from a high level, has similarities. There (more or less) isn't a single part of the architecture that hasn't changed, though. Dig through the documentation and play with some of the stream arch and you'll see it. (And note: internally this is deemed a new graphics IP number.)

Just because something has changed doesn't necessarily mean that it's really a brand-new, from-scratch microarchitecture. Penryn changed from Core 2 and had a different microarchitecture, but it was obviously a derivative.

Even Nehalem is an obvious derivative of the P6. Almost everything in the microarchitecture was improved, but it's still a derivative of the same underlying microarchitecture: the way the pipeline works in terms of renaming, issuing, etc.

Ditto for Barcelona, Shanghai and Istanbul - they are all obvious descendants of the K8, which is itself descended from the K7....

Back to the topic at hand: the underlying microarchitecture for the shaders in ATI's GPUs has changed an awful lot, but seems to retain the same fundamentals introduced in R600. Is it a new microarchitecture? Absolutely. But it's pretty clear that the more recent GPUs are in fact derivatives of the older ones.


Just to be clear, with a few examples here: the P5-->P6 was a brand new uarch, and the P6-->P4 was a brand new uarch. The P4-->Prescott could be considered a new uarch, but it's a lot less clear. It was definitely a very, very different core.

GPU-wise, I think the DX9-->DX10 transition was pretty obviously a huge switch and comparable to the differences cited above in CPUs. I don't really see DX10-->DX11 being a very big shift.

Of course, there may have been a from-scratch effort where they simply decided it was best to recycle the underlying uarch they were using before, but I hardly consider that a change.

David
 
RV770 and RV870/Cypress are direct evolutionary changes to R600, which was designed over a longer time frame. It's obviously a polished-up design now, just as the P6 evolved into today's Nehalem core, adding more instructions per cycle and better packing, blended with some HT along the way.

So when they redesign the outer logic of RV870 to better cope with outgrowing loads in the next polishing pass, should we consider that a new design? Or when they finally redesign the SIMD-like cores after that, should we call that a new design?

I wouldn't call G80 a new design, by far. NV40->G70 was a seamless transition that is still with us today, incarnated as CUDA core clusters. Or is just the (inner) SP dataflow redesign considered the major step through? Then we'll need to wait for something as 'revolutionary' as G80, but how will we recognize it?

Even R520->R600 wasn't a fully major change, because the outer memory ringbus and RISC scheduler shared the same design.


Should AMD go from the current superscalar cores to scalar cores with a separate clock domain in the future, that would be a new architecture IMO, even if that arch was still DX11.

That would be a major step back if we look at the G80->GF100 evolution. Or even at the unreleased Larrabee-wannabe GPU hybrid. But should they convince us it's better that way, we might accept it? :devilish:


Ditto for Barcelona, Shanghai and Istanbul - they are all obvious descendants of the K8, which is itself descended from the K7....

In a PC performance overview, these uPs (K8L) are a bigger step from K8 than K8, with its innovative IMC and x86-64 support, ever was over K7.
 
I would prefer they stick to the 4+1 VLIW setup, but give each VLIW its own instruction cache and decoder :) Go from a Larrabee wannabe to a Larrabee shamer.
 
I would prefer they stick to the 4+1 VLIW setup, but give each VLIW its own instruction cache and decoder :) Go from a Larrabee wannabe to a Larrabee shamer.

IOW, go serial, right? I guess, long term, 5-way VLIW is looking to have less and less of a future.
 
IOW, go serial, right?
Approximately. You'd still be connected to the local memory; if your code wanted to use that for efficient communication, it would still have to pay close attention to its neighbours. Also, programming-wise you'd still just be running normal kernels (except it wouldn't stall threads on branch divergence).
I guess, long term, 5-way VLIW is looking to have less and less of a future.
I don't think pure MPMD would be possible at all with scalar cores; too much overhead.

I think 5-wide VLIW presents just about the right size of core to keep the overhead of the extra instruction caches and decoders small enough. You could of course use 4-wide SIMDs instead, but why would you? You can use VLIW to run scalar threads just fine, but you can't use scalar lanes to run a single program. VLIW is a superset of scalar SIMD, and I doubt the overhead is enough to justify going with the lesser brother.
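
A toy model of that superset claim, just to make it concrete (illustration only, nothing like the real ISA):

```cpp
// Toy model (illustration only, not real hardware): a 5-slot VLIW bundle
// can carry five independent ops from ONE thread, or one op from each of
// five threads. Five scalar lanes can only do the latter -- which is the
// sense in which VLIW is a superset of scalar issue.
#include <cstdio>
#include <functional>

struct Op { std::function<void()> run; };

struct VliwBundle {
    Op slot[5]; // x, y, z, w + t (the 4+1 setup)
    void issue() const { for (const Op& op : slot) op.run(); }
};

int main() {
    float a = 1, b = 2, c = 3, d = 4, e = 5;

    // Packed case: five independent ops from a single thread in one issue.
    VliwBundle packed = {{
        {[&]{ a *= 2; }}, {[&]{ b *= 2; }}, {[&]{ c *= 2; }},
        {[&]{ d *= 2; }}, {[&]{ e *= 2; }},
    }};
    packed.issue();

    // Degenerate case: one op, the other slots idle -- this is all a
    // single scalar lane could ever give you per issue.
    VliwBundle scalarish = {{
        {[&]{ a += 1; }}, {[]{}}, {[]{}}, {[]{}}, {[]{}},
    }};
    scalarish.issue();

    std::printf("%g %g %g %g %g\n", a, b, c, d, e); // 3 4 6 8 10
}
```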
 
So when they redesign the outer logic of RV870 to better cope with outgrowing loads in the next polishing pass, should we consider that a new design? Or when they finally redesign the SIMD-like cores after that, should we call that a new design?

I don't know what an 'outgrowing load' is and I don't know what 'outer logic' is and frankly I have a hard time understanding what you are saying at all.

DK
 
I would prefer they stick to the 4+1 VLIW setup, but give each VLIW it's own instruction cache and decoder :) Go from a Larrabee wannabe, to a Larrabee shamer.

Do you mean a decoder per cluster on the currently 16-wide SIMDs?
Making use of this would require independent control units.
I think 16x as many instruction sequencers would be notable.

The LDS and GDS are banked to fit the current setup, and without the SIMD aligning accesses, the contention and potential for bank conflicts would go up.
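
A toy model of why (assuming a 32-bank LDS of 32-bit words, which is roughly the Evergreen arrangement; the access patterns are made up for illustration):

```cpp
// Toy bank-conflict model (assuming a 32-bank LDS of 32-bit words --
// roughly the Evergreen arrangement). With the SIMD in lockstep, lanes
// hit consecutive addresses and each bank is touched once; let each
// cluster run free and the same traffic can pile onto a few banks,
// serializing the access.
#include <cstdio>

const int BANKS = 32;

// Worst-case serialization factor: the most accesses landing on one bank.
int conflictFactor(const unsigned (&wordAddr)[16]) {
    int hits[BANKS] = {0}, worst = 0;
    for (unsigned a : wordAddr) {
        int bank = a % BANKS;
        if (++hits[bank] > worst) worst = hits[bank];
    }
    return worst;
}

int main() {
    unsigned aligned[16], strided[16];
    for (unsigned i = 0; i < 16; ++i) {
        aligned[i] = i;      // lockstep SIMD: unit stride, one hit per bank
        strided[i] = i * 32; // independent clusters, unlucky stride: all
                             // sixteen accesses fall into bank 0
    }
    std::printf("aligned: %dx serialized\n", conflictFactor(aligned)); // 1x
    std::printf("strided: %dx serialized\n", conflictFactor(strided)); // 16x
}
```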

A number of items on the chip look like they are designed to work with the current SIMD width and would require duplication and arbitration if that assumption is removed.
The TMU and ALU sections are pretty tightly intertwined, which could not be done so simply if the ALUs weren't in lockstep.
 
Aren't the "stream processors" actually arranged in quads (so 4 quads per SIMD array)?
Would it be "easier" to go with an instruction cache and decoder per quad?
 
Do you mean a decoder per cluster on the currently 16-wide SIMDs?
Making use of this would require independent control units.
I think 16x as many instruction sequencers would be notable.
Why? Apart from the automatic insertion of PV/PS references by the hardware (I don't even understand why they put that in, since it doesn't always work) and the branching (which it already needs independent logic for anyway), how much sequencing is there actually to be done? You'd need per-thread logic to switch to a different thread on texture clauses, but that too is almost trivial.

I'd imagine the cost would be dominated by the instruction caches.
The LDS and GDS are banked to fit the current setup, and without the SIMD aligning accesses, the contention and potential for bank conflicts would go up.
It's an option, not a mandate.
The TMU and ALU sections are pretty tightly intertwined, which could not be done so simply if the ALUs weren't in lockstep.
I don't believe that's true; they are decoupled by quite a lot of hardware/buffering already ... they have to be, because of the completely random latency of the actual lookup.
 
Why? Apart from the automatic insertion of PV/PS references by the hardware (I don't even understand why they put that in, since it doesn't always work) and the branching (which it already needs independent logic for anyway), how much sequencing is there actually to be done?
The SIMD can swap threads out every four cycles. I may have been inaccurate in saying sequencers, as the terms were that there are dual arbiter/scheduler pairs per SIMD.
With 16-wide, the work needed in setting up a thread is uniform for all 16 clusters. One and only one instruction group needs to be fetched.

Among other things, that means there can be at most one Icache miss. 16 separate decoders would indicate they are not fetching the same instruction; otherwise what's the point?
That's 16 potential cache misses, which will need monitoring and handling. At present, a miss of this sort puts the thread to sleep.

I don't believe that's true; they are decoupled by quite a lot of hardware/buffering already ... they have to be, because of the completely random latency of the actual lookup.
This might explain why the latency of a cache hit is on the order of 180 cycles, unless I've misinterpreted the posts by prunedtree for the SGEMM code for RV770 and onward.
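
Back-of-envelope on what ~180 cycles would mean for occupancy (toy arithmetic; the clause lengths and the "+1" are my assumptions, not measured figures):

```cpp
// Back-of-envelope (toy arithmetic): if a texture/cache access takes L
// cycles even on a hit, and each wavefront can issue A cycles of ALU work
// before stalling on the result, you need roughly ceil(L/A) other
// wavefronts to cover the stall, plus the stalled one itself.
#include <cstdio>

int wavefrontsToHide(int latency, int aluCycles) {
    return (latency + aluCycles - 1) / aluCycles + 1; // ceil + the stalled one
}

int main() {
    const int clauseLengths[] = {8, 16, 32, 64}; // ALU cycles between fetches
    for (int alu : clauseLengths)
        std::printf("%2d ALU cycles/clause -> %2d wavefronts in flight\n",
                    alu, wavefrontsToHide(180, alu));
}
```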
 