AMD: R7xx Speculation

Hmm, it might explain why they're easy to buy right now - is HD4870 going to be supply-constrained? If so, due to GDDR5 or yields?

Also, if SIMDs are "horizontal" now, it would mean that losing a dud TU would only affect one SIMD - whereas in R6xx losing a TU affects all SIMDs.

Jawed

Hmmm, that would enable coarse-grained redundancy. But I don't see any compelling reason for them to make that switch now all of a sudden.

Clocks on the 4870 are already 20% higher; 25% more units on top of that would push the 4870 to 50% faster. I think the expectation is that the two parts are much closer than that.
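
For the record, the compounding behind that 50% figure, assuming theoretical throughput scales linearly with both clock and unit count:

$$1.20 \times 1.25 = 1.50$$

i.e. 20% higher clocks on 25% more units works out to 50% more on paper.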
 
If this were true, with the addition of GDDR5 and much higher clocks, wouldn't that make the 4870 a monster? I'll disbelieve this to keep my hopes down.
Considering the much more robust cooling for the 4870 (~1:1 replica of the 2900XT's) and the dual power inputs, that's quite a plausible scenario. :rolleyes:
 
Considering the much more robust cooling for the 4870 (~1:1 replica of the 2900XT's) and the dual power inputs, that's quite a plausible scenario. :rolleyes:

Also considering that you go from a 110 W TDP to ~160 W TDP. I don't think just a 20% core speed increase would account for that.

And considering GDDR5 is supposed to consume less power than GDDR3.

Hmmm, but if that were the case, wouldn't there be a larger increase than going from 1.0 TFLOPS to 1.2 TFLOPS?
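
Working that through, assuming theoretical flops scale linearly with clock and the unit count stays fixed:

$$1.0\,\text{TFLOPS} \times 1.20 = 1.2\,\text{TFLOPS}$$

The quoted throughput rise is fully accounted for by the 20% clock bump alone; enabling extra rows of units would have to push it higher.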

So, I'm doubting this coarse redundancy theory. Unless 2 rows are also disabled in the 4870.

Regards,
SB
 
It doesn't make them completely independent either, as you can 'only' schedule 5 instructions per clock on the same pixel.
NVIDIA's architecture is still way more flexible.

But you're talking about R6xx's VLIW MIMD architectural/scheduling limitations, as opposed to the ability to perform scalar operations.
 
Also considering that you go from a 110 W TDP to ~160 W TDP. I don't think just a 20% core speed increase would account for that.
I can believe an increase like that if there's a voltage bump along with the clock increase.

A logic-heavy chip running near its clock ceiling, on a given foundry process and with limited binning, can lead to a SKU with a major power jump for a small clock rise.

If the chip is near the top range of its expected clock envelope, even a small bump can push the watts up.
We don't know what the expected clock range is, but a higher TDP would allow more devices to bin to the top SKU for decent volume.
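
As a back-of-the-envelope check (the 10% voltage bump here is purely illustrative, not a leaked figure), dynamic power scales roughly as $P \propto f V^2$, so:

$$P' \approx 110\,\text{W} \times 1.20 \times 1.10^2 \approx 160\,\text{W}$$

A modest voltage increase on top of the 20% clock bump is enough to land in the rumoured TDP range.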
 
How do GPGPU workloads map to these architectures nowadays? I imagine there are a lot of highly data parallel workloads that have very low ILP. That would be one area where Nvidia could establish a foothold as it essentially drops AMD's stuff to 1/5th theoretical throughput.
Theoretically the CUDA programmer never needs to concern themselves with ILP, simply programming in terms of scalar data types.
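
For illustration, a minimal sketch of that scalar style (a hypothetical saxpy kernel, not taken from anywhere in the thread): one element per thread, plain float throughout, no vector types in sight.

#include <cuda_runtime.h>

// One element per thread, scalar float throughout: the "naive" CUDA idiom.
__global__ void saxpy_scalar(float alpha, const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = alpha * x[i] + y[i];  // one independent op per thread, no ILP
}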


But I think the reality is more complex:
  • register file usage varies considerably depending on how "effective-ILP" is compiled into the code (e.g. how loops are unrolled or how vector data types are splatted into serial-scalar operations). Register file usage then has a direct impact on performance due to the number of threads in flight versus latency-hiding needed for memory reads/writes
  • transcendentals introduce serial-dependencies in code, thus risking lowered utilisation
  • ILP helps avoid read-after-write (register) timing issues
That's not to say that a kernel running on a set of purely scalar datatypes is going to run badly - merely that performance isn't necessarily as clear-cut as it first appears.
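
To make the first and third bullets concrete, a hedged sketch (hypothetical dot-product kernels; the names and the unroll factor of 4 are invented for illustration): the naive version carries one serial dependency chain per thread, while the unrolled version keeps four independent accumulators, hiding register read-after-write latency at the cost of more registers per thread, and therefore fewer threads in flight.

#include <cuda_runtime.h>

// Per-thread partial dot products; a reduction over out[] would follow.
__global__ void dot_naive(const float* a, const float* b, float* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    float acc = 0.0f;                        // single dependency chain
    for (int i = tid; i < n; i += stride)
        acc += a[i] * b[i];                  // each MAD waits on the previous one
    out[tid] = acc;
}

__global__ void dot_ilp4(const float* a, const float* b, float* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    int i = tid;
    for (; i + 3 * stride < n; i += 4 * stride) {
        acc0 += a[i]              * b[i];               // four independent
        acc1 += a[i + stride]     * b[i + stride];      // chains: more ILP,
        acc2 += a[i + 2 * stride] * b[i + 2 * stride];  // but four live
        acc3 += a[i + 3 * stride] * b[i + 3 * stride];  // accumulators held
    }
    for (; i < n; i += stride)
        acc0 += a[i] * b[i];                            // remainder
    out[tid] = (acc0 + acc1) + (acc2 + acc3);
}

The compiler does some of this unrolling on its own; how much of it ends up in the compiled code is exactly the "effective-ILP" knob above.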

Under CAL/Brook+ the developer is forced to futz with ILP, using packed (vector) data types and some seriously unruly looking code - otherwise the performance really will be fractional.

---

When you have a data-parallel kernel with no ILP you can generally re-program it to have ILP simply by giving it more than one element to compute. Though you then increase the efficiency losses arising from poor coherency in dynamic branching.

Programming with ILP is likely to improve the performance of a CUDA kernel. But it needs to be tuned.
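
To illustrate that re-programming trick (a hypothetical kernel again, with ELEMS as a made-up tuning knob, echoing the scalar saxpy sketch earlier): each thread takes ELEMS elements instead of one, so the iterations are independent and can overlap, at the price of extra registers and of every thread now standing for ELEMS elements when branches diverge.

#include <cuda_runtime.h>

#define ELEMS 4  // elements per thread; a tuning knob, not a magic number

// Each block covers blockDim.x * ELEMS elements; the strided indexing keeps
// neighbouring threads' memory accesses coalesced. Launch enough blocks that
// gridDim.x * blockDim.x * ELEMS covers n.
__global__ void saxpy_multi(float alpha, const float* x, float* y, int n)
{
    int base = blockIdx.x * blockDim.x * ELEMS + threadIdx.x;
    #pragma unroll
    for (int k = 0; k < ELEMS; ++k) {
        int i = base + k * blockDim.x;
        if (i < n)
            y[i] = alpha * x[i] + y[i];  // the ELEMS iterations are independent
    }
}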

Jawed
 
I understand that ATi's SPs cannot be scheduled independently, but this does not make them any less capable of performing scalar calculations, as nao implied by referring to them as vec5.

I think we're all in agreement now.
 
But you're talking about R6xx's VLIW MIMD architectural/scheduling limitations, as opposed to the ability to perform scalar operations.
I'm not sure I'm following you here. I'm just saying that it still makes logical (and probably even physical) sense to group 5 ALUs together, as they operate on a clock-per-clock basis on the same entity (vertex/pixel/whatever...).
Obviously I'm assuming that in this case RV770 is no different from RV670.
 
Under CAL/Brook+ the developer is forced to futz with ILP, using packed (vector) data types and some seriously unruly looking code - otherwise the performance really will be fractional.

What's to stop a CTM implementation where computational entities are packed 5 to a primitive?
It's not like the VLIW should care, though it would waste any form of swizzling across lanes. (I can't remember where I read it, but isn't it also the case that the slim ALU lanes can interchange results with each other easily, while the fat ALU is off on its own?)

Well, besides likely blowing out the register file, code expansion, and assuming you only want the subset of ops the slim ALUs currently offer?
That and the insanely expanded effective batch size.
64 items per batch, with each item being a packet of 5 scalar values, is 320 elements.
 
I understand that ATi's SPs cannot be scheduled independently, but this does not make them any less capable of performing scalar calculations, as nao implied by referring to them as vec5.
No offense but maybe you should re-read what I wrote, I have not implicated anything like that. Thank you.
 
You still have to tie the operation to the pixels and to the SIMD operation. So it's not really independent.

Well yeah but both major GPU architectures have these same limitations.

the reality is more complex:
  • register file usage varies considerably depending on how "effective-ILP" is compiled into the code (e.g. how loops are unrolled or how vector data types are splatted into serial-scalar operations). Register file usage then has a direct impact on performance due to the number of threads in flight versus latency-hiding needed for memory reads/writes
  • transcendentals introduce serial-dependencies in code, thus risking lowered utilisation
  • ILP helps avoid read-after-write (register) timing issues

Aren't those all considerations for a VLIW architecture as well? Or are there things you need to worry about in a "scalar" architecture that you don't when dealing with VLIW?
 