AMD: Southern Islands (7*** series) Speculation/ Rumour Thread

My guess for 7870,

- 32 SIMDs
That would equal just 8 CUs (512 vALUs) and probably 32 TMUs. I doubt that would be faster than Barts on a lot of workloads, efficiency gains notwithstanding, as long as AMD doesn't increase the clocks significantly.
 
A user on another forum suggested that, the way he reads those slides, the new architecture is a lot closer to Larrabee than any previous NV or AMD design, excluding the fact that it isn't x86, obviously.

Is he completely off or right on the money?
 
If that's true, we should be expecting at least 32 CUs. In which case it should demolish everything.

http://www.hardware.fr/news/11648/afds-architecture-futurs-gpus-amd.html
I just wanted to cite that too. Hardware.fr goes on to speculate about at least 30 CUs (with half speed DP :oops:), so in the same ballpark as your guess. From the presentation, the number of CUs should be divisible by 4, though. What about 40 CUs to really gain the high ground? :LOL:

Let's see what would come out of this:

32 CUs = 2048 vALUs (+ 32 scalar ones) = 4096 flop/cycle => 3.5 TFlop/s @ 850 MHz
Sounds reasonable for the next 28nm high end. Maybe they are aiming at 4 TFlop/s (975 MHz) for the fastest part to claim at least a 50% increase in raw flops compared to Cayman. If all works out, this could easily mean an outright doubling of typical gaming performance, in corner cases maybe even a factor of 3-4.
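The flops arithmetic above can be checked with a quick back-of-the-envelope calculation. The per-CU figures here are my assumptions based on the thread (64 vector ALUs per CU, one FMA per lane per cycle), not confirmed specs:

```python
# Back-of-the-envelope FLOP estimate for the rumored 32-CU part.
# Assumed: 64 vector ALUs per CU, 2 flops per ALU per cycle (one FMA).
CUS = 32
VALUS_PER_CU = 64
FLOPS_PER_VALU_PER_CYCLE = 2  # a fused multiply-add counts as two flops

flops_per_cycle = CUS * VALUS_PER_CU * FLOPS_PER_VALU_PER_CYCLE
print(flops_per_cycle)                 # 4096 flop/cycle
print(flops_per_cycle * 850e6 / 1e12)  # ~3.48 TFlop/s @ 850 MHz
print(flops_per_cycle * 975e6 / 1e12)  # ~3.99 TFlop/s @ 975 MHz
```

Cayman's ~2.7 TFlop/s puts the 975 MHz figure right at the claimed ~50% raw-flops increase.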
 
This may have been made more clear elsewhere and I missed it, but I wonder how much of the scalar pipe and arbitration logic was already present in the sequencer blocks and thread engine in current chips, just not exposed to the outside world. At the very least, some of that hardware would be repurposed and made available to the software stream.

The CU is no longer capable of 4 simultaneous ALU op issues, but this is compensated for by having 4-cycle execution. Basically, there is a 3-cycle spin-up period of successive vector issues before we see the same utilization as a best-case fully-packed VLIW instruction.
Any remaining loss is probably offset by better utilization and the removal of certain latency penalties related to clause switches and other contributors to spin-up latency that were present but not explicitly mentioned.

The vector ALU and register system strikes me as taking Cayman SIMD and putting it on its side.
Instead of a 16-lane SIMD of 4-way ALU clusters with a 4-banked register file, we have four 16-lane SIMDs and 4 register files.

This removes the rather baroque register file read system employed in the VLIW system, or at least hides it in hardware.

What does it do for multi-lane special ops, though?
 
A user on another forum suggested that, the way he reads those slides, the new architecture is a lot closer to Larrabee than any previous NV or AMD design, excluding the fact that it isn't x86, obviously.

Is he completely off or right on the money?
Are you referring to me? Because that's not what I was saying. I was wondering how well the chip could handle a software rendering model.
If not... well, just ignore my post ;)
 
That would equal just 8 CUs (512 vALUs) and probably 32 TMUs. I doubt that would be faster than Barts on a lot of workloads, efficiency gains notwithstanding, as long as AMD doesn't increase the clocks significantly.

Oops, :oops:

I meant 32 cores, i.e. 32 CUs.
 
A user on another forum suggested that, the way he reads those slides, the new architecture is a lot closer to Larrabee than any previous NV or AMD design, excluding the fact that it isn't x86, obviously.

Is he completely off or right on the money?

Completely off.

The scalar unit is not exposed to the user. Compute-wise, it is more like Fermi than LRB.
 
This may have been made more clear elsewhere and I missed it, but I wonder how much of the scalar pipe and arbitration logic was already present in the sequencer blocks and thread engine in current chips, just not exposed to the outside world. At the very least, some of that hardware would be repurposed and made available to the software stream.

The CU is no longer capable of 4 simultaneous ALU op issues, but this is compensated for by having 4-cycle execution. Basically, there is a 3-cycle spin-up period of successive vector issues before we see the same utilization as a best-case fully-packed VLIW instruction.
Any remaining loss is probably offset by better utilization and the removal of certain latency penalties related to clause switches and other contributors to spin-up latency that were present but not explicitly mentioned.

The vector ALU and register system strikes me as taking Cayman SIMD and putting it on its side.
Instead of a 16-lane SIMD of 4-way ALU clusters with a 4-banked register file, we have four 16-lane SIMDs and 4 register files.

This removes the rather baroque register file read system employed in the VLIW system, or at least hides it in hardware.

What does it do for multi-lane special ops, though?

My guess is that transcendentals are processed as in Fermi: no multi-lane magic to get them to work.
 
I'm a bit puzzled by the "Scalable Graphics Engine" block diagram (link).
What exactly is the RB? Is it the same as the RBE/ROP, and if so, what is it doing as part of the Primitive Pipe (the one for scan-out conversion)?
 
Completely off.

The scalar unit is not exposed to the user. Compute-wise, it is more like Fermi than LRB.

Yeah, it shares a lot with Fermi but takes it another step further. A CU resembles an SM and even has similar throughput (64 scalar ALUs @ core clock vs 32 scalar ALUs @ 2x core clock). Should be much easier to compare the two architectures going forward.
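The per-unit throughput comparison in parentheses works out to a wash, which is easy to verify (the 2x "hot clock" ratio is Fermi's published design; the equal-core-clock framing is the post's simplification):

```python
# Per-unit scalar op throughput: GCN CU vs Fermi SM at the same core clock.
core_clock = 1.0  # arbitrary units; only the ratio matters
cu_ops_per_clock = 64 * core_clock        # 64 ALUs at core clock
sm_ops_per_clock = 32 * (2 * core_clock)  # 32 ALUs at 2x core clock

assert cu_ops_per_clock == sm_ops_per_clock  # identical: 64 ops per core clock
```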
 
Is this the first GPU that can actually claim MIMD, given that Larrabee never really materialized anywhere?

 
Any indication of the cache latency and/or the minimum number of workgroups needed to cover it?
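As a rough way to frame that question: with the 4-cycle vector execution model, each extra in-flight wavefront per SIMD buys about 4 cycles of latency tolerance. The latency figure below is purely illustrative; nothing has been disclosed:

```python
# Rough latency-hiding estimate (my assumptions, not disclosed figures):
# each wavefront occupies its SIMD for 4 cycles per vector op, so the
# SIMD stays fed if enough other wavefronts can issue while one waits.
def wavefronts_to_cover(latency_cycles, cycles_per_issue=4):
    # one wavefront beyond latency / issue-interval keeps the SIMD busy
    return latency_cycles // cycles_per_issue + 1

print(wavefronts_to_cover(80))  # e.g. ~21 wavefronts per SIMD for an
                                # assumed 80-cycle cache latency
```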
 