No. No, it's nothing like that. But it's definitely not a new architecture either. Cypress is basically a HD 4870 X2 on one chip...
Cypress is more like Rampage+Sage on one chip...
Definitely not a simple refresh, which I usually associate with a tweaked chip (higher clocks, reduction/increase in transistors, etc) or just a shrink from one process node to the next.
But it's definitely not a new architecture either. Cypress is basically a HD 4870 X2 on one chip or 2 x RV770. I call it a major refresh, as per poll options.
It's a whole new architecture since it's DX11, has tessellation hardware, double precision stream processors, and good filtering.
The tessellation unit is all new.
ATI has had a tessellation unit since R600 (or even Xenos, to be more precise), yet obviously didn't have support in the ALUs for hull & domain shader stages until now. Double precision has been on their GPUs since RV670, and filtering (i.e. anisotropic) has been at a very good level for several generations now.
If you had stopped after DX11, your point would be just as valid. ATI certainly didn't need a whole new architecture for the latter three.
On Evergreen family, both tessellation engines (fixed and programmable) are physically separated.
The tessellation unit is all new.
http://www.geeks3d.com/20100210/tes...n-opengl-radeon-hd-5000-tessellators-details/
...yet obviously didn't have until now support in the ALUs for hull & domain shader stages.
The tessellator of Radeon HD 2000, 3000 and 4000 is a fixed function unit and we can’t program it with a shader...
On Radeon HD 5000 series (Evergreen family), things are different. The Radeon HD 5000 includes the fixed tessellator of HD 2000, 3000 and 4000 AND a new programmable tessellation unit.
You said AMD didn't need a whole new architecture for the last 3 things Secessionist listed, and while I don't disagree with that, I was pointing out that the tessellation hardware is in fact different. Most people don't know that. Microsoft created a different algorithm than was used in the Xbox and this required new hardware.
What did I say exactly that contradicts what's stated in the article?
There is a structure that, when looked upon at a high level, has similarities. There (more or less) isn't a single part of the architecture that hasn't changed though. Dig through the documentation and play with some of the stream arch and you'll see it. (And note, internally this is deemed a new graphics IP number.)
Should AMD go from the current superscalar cores to scalar cores with a separate clock domain in the future, that would be a new architecture IMO, even if that arch was still DX11.
Ditto for Barcelona, Shanghai and Istanbul - they are all obvious descendants of the K8, which is itself descended from the K7....
I would prefer they stick to the 4+1 VLIW setup, but give each VLIW its own instruction cache and decoder. Go from a Larrabee wannabe, to a Larrabee shamer.
Approximately, you'd still be connected to the local memory; if your code wanted to use that for efficient communication, it would still have to pay close attention to its neighbours. Also, programming-wise you'd still just be running normal kernels (except it wouldn't stall threads on branch divergence).
IOW, go serial, right?
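To put the branch-divergence remark above in rough numbers, here is a minimal Python sketch. It is a toy cost model, not a description of the actual hardware: the function names, the 16-lane group and the cycle costs are all made up for illustration. It compares a group of lanes run in lockstep (every path taken by any lane is executed by the whole group) with lanes that each have their own sequencing.

```python
def lockstep_cycles(lane_takes_if, if_cost, else_cost):
    """Toy model: all lanes share one sequencer, so the group runs
    every branch path that at least one lane needs."""
    cycles = 0
    if any(lane_takes_if):        # at least one lane takes the 'if' path
        cycles += if_cost
    if not all(lane_takes_if):    # at least one lane takes the 'else' path
        cycles += else_cost
    return cycles


def independent_cycles(lane_takes_if, if_cost, else_cost):
    """Toy model: each lane has its own sequencer, so the group finishes
    when its slowest lane finishes."""
    return max(if_cost if taken else else_cost for taken in lane_takes_if)


# Hypothetical 16-lane group where a single lane diverges.
lanes = [True] * 15 + [False]
print(lockstep_cycles(lanes, if_cost=10, else_cost=30))
print(independent_cycles(lanes, if_cost=10, else_cost=30))
```

With these made-up costs the lockstep group pays for both paths, while the independent lanes only pay for the longer one. The model ignores the cost of the extra sequencers themselves, which is exactly what the rest of the thread argues about.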
I don't think pure MPMD would be possible at all with scalar cores, too much overhead.
I guess, long term, 5-way VLIW is looking to have less and less of a future.
So when they redesign the outer logic of RV870 to better cope with growing loads in the next round of polishing, should we consider that a new design? Or when they finally redesign the SIMD-like cores after that, should we call that a new design?
I would prefer they stick to the 4+1 VLIW setup, but give each VLIW its own instruction cache and decoder. Go from a Larrabee wannabe, to a Larrabee shamer.
Why? Apart from the automatic insertion of PV/PS references by the hardware (I don't even understand why they put that in, since it doesn't always work) and the branching (which it already needs independent logic for anyway), how much sequencing is there actually to be done? You'd need per-thread logic to switch to a different thread on texture clauses, but that too is almost trivial.
Do you mean a decoder per cluster on the currently 16-wide SIMDs?
Making use of this would require independent control units.
I think 16x as many instruction sequencers would be notable.
It's an option, not a mandate.
The LDS and GDS are banked to fit the current setup, and without a SIMD aligning access, the contention and potential for bank conflicts would go up.
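For anyone who wants to play with the bank-conflict point above, here is a small Python sketch. The bank count (32) and word size (4 bytes) are assumptions picked for illustration, not figures from this thread or from AMD's docs, and `worst_bank_conflict` is a made-up helper name. It just counts how many distinct words land on the same bank for one group of accesses, which is a rough proxy for how many passes the access would be serialized into.

```python
from collections import defaultdict


def worst_bank_conflict(byte_addresses, num_banks=32, word_bytes=4):
    """Return the largest number of distinct words mapped to a single bank.

    num_banks and word_bytes are illustrative assumptions. A result of 1
    means conflict-free; N means one bank is hit by N different words.
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        word = addr // word_bytes
        # Consecutive words are assumed to map to consecutive banks.
        words_per_bank[word % num_banks].add(word)
    return max((len(words) for words in words_per_bank.values()), default=0)


# 16 lanes reading consecutive 4-byte words: each lane hits its own bank.
aligned = [4 * lane for lane in range(16)]
# 16 lanes striding by 32 words: every lane hits the same bank.
strided = [4 * 32 * lane for lane in range(16)]

print(worst_bank_conflict(aligned))
print(worst_bank_conflict(strided))
```

Lanes that all read the same word count as one access here (a broadcast), which is a simplification. Once lanes stop issuing their accesses together, the address patterns look less like the first case and more like an arbitrary mix, which is where the extra contention would come from.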
I don't believe that's true, they are decoupled by quite a lot of hardware/buffering already ... they have to be because of the completely random latency of the actual lookup.
The TMU and ALU sections are pretty tightly intertwined, which could not be done so simply if the ALUs weren't in lockstep.
The SIMD can swap threads out every four cycles. I may have been inaccurate in saying sequencers, as the terms were that there are dual arbiter/scheduler pairs per SIMD.
Why? Apart from the automatic insertion of PV/PS references by the hardware (I don't even understand why they put that in, since it doesn't always work) and the branching (which it already needs independent logic for anyway), how much sequencing is there actually to be done?
This might explain why the latency of a cache hit is on the order of 180 cycles, unless I've misinterpreted the posts by prunedtree for the SGEMM code for RV770 and onward.
I don't believe that's true, they are decoupled by quite a lot of hardware/buffering already ... they have to be because of the completely random latency of the actual lookup.