My guess for the 7870:
- 32 SIMDs

That would equal just 8 CUs (512 vALUs) and probably 32 TMUs. I doubt that would be faster than Barts on a lot of workloads, irrespective of the efficiency gains, as long as AMD does not increase the clocks significantly.
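A quick back-of-the-envelope check of that arithmetic, as a rough sketch: the 4 SIMDs per CU and 16 lanes per SIMD come from the presentation, but the 4 TMUs per CU is my assumption, not something AMD confirmed.

```python
# Rough sanity check of the "32 SIMDs = 8 CUs" arithmetic.
# Assumption (mine, not from the slides): 4 TMUs per CU.
SIMDS_GUESSED = 32
LANES_PER_SIMD = 16
SIMDS_PER_CU = 4
TMUS_PER_CU = 4  # assumed

cus = SIMDS_GUESSED // SIMDS_PER_CU       # 32 / 4  = 8 CUs
valus = SIMDS_GUESSED * LANES_PER_SIMD    # 32 * 16 = 512 vector ALUs
tmus = cus * TMUS_PER_CU                  # 8 * 4   = 32 TMUs

print(cus, valus, tmus)  # 8 512 32
```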
Regarding the cost of this new architecture, AMD told us it is only slightly higher than that of current architectures: some parts are more complex, but others are simplified. It should not be a barrier to increasing the number of compute units.
Is Eric's speech today or tomorrow?
I just wanted to cite that too. Hardware.fr goes on to speculate about at least 30 CUs (with half-speed DP), so in the same ballpark as your guess. From the presentation, the number of CUs should be divisible by 4, though. What about 40 CUs to really gain the high ground?

If that's true, we should be expecting at least 32 CUs, in which case it should demolish everything.
http://www.hardware.fr/news/11648/afds-architecture-futurs-gpus-amd.html
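For scale, a minimal sketch of the peak-throughput arithmetic behind those CU counts. The 64 lanes per CU follows from the 4 SIMD-16s shown in the slides, but the FMA-per-lane-per-clock rate and the 900 MHz clock are my assumptions, not figures from the presentation.

```python
# Peak single-precision throughput for a few speculated CU counts.
# Assumptions (mine): 2 FLOPs per lane per clock (FMA) and a 900 MHz clock.
LANES_PER_CU = 64            # 4 SIMDs x 16 lanes, per the presentation
FLOPS_PER_LANE_PER_CLK = 2   # fused multiply-add, assumed
CLOCK_HZ = 900e6             # assumed clock

for cus in (30, 32, 40):
    gflops = cus * LANES_PER_CU * FLOPS_PER_LANE_PER_CLK * CLOCK_HZ / 1e9
    print(f"{cus} CUs -> {gflops:.0f} GFLOPS SP")
# 30 CUs -> 3456, 32 CUs -> 3686, 40 CUs -> 4608 GFLOPS SP
```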
Are you referring to me? Because that's not what I was saying. I wonder how well the chip could handle a software rendering model.

A user on another forum suggested that, the way he reads those slides, the new architecture is a lot closer to Larrabee (excluding the fact that this isn't x86, obviously) than any previous NV or AMD design.
Is he completely off or right on the money?
This may have been made more clear elsewhere and I missed it, but I wonder how much of the scalar pipe and arbitration logic was already present in the sequencer blocks and thread engine in current chips, just not exposed to the outside world. At the very least, some of that hardware would be repurposed and made available to the software stream.
The CU is no longer capable of 4 simultaneous ALU op issues, but this is compensated for by having 4-cycle execution. Basically, there is a 3-cycle spin-up period of successive vector issues before we see the same utilization as a best-case fully-packed VLIW instruction.
That spin-up cost is probably compensated for by better utilization and by the removal of certain latency penalties related to clause switches and other contributors to spin-up latency that were present before but not explicitly mentioned.
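A toy model of that spin-up behaviour, assuming one vector issue per cycle rotated round-robin across the four SIMDs and a 4-cycle execution cadence; the numbers are chosen to illustrate the point, not taken from the slides.

```python
# Toy issue model: one vector instruction issued per cycle, rotated round-robin
# across 4 SIMDs; each issued wavefront keeps its SIMD busy for 4 cycles.
SIMDS = 4
EXEC_CYCLES = 4
TOTAL_CYCLES = 12

busy_until = [0] * SIMDS
for cycle in range(TOTAL_CYCLES):
    simd = cycle % SIMDS                      # round-robin issue target
    busy_until[simd] = cycle + EXEC_CYCLES
    active = sum(1 for b in busy_until if b > cycle)
    print(f"cycle {cycle}: {active}/{SIMDS} SIMDs busy")
# Cycles 0-2 are the spin-up (1, 2, then 3 SIMDs busy); from cycle 3 on,
# all 4 SIMDs stay busy -- matching a fully packed VLIW4 issue in steady state.
```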
The vector ALU and register system strikes me as taking Cayman SIMD and putting it on its side.
Instead of one 16-lane SIMD of 4-way ALU clusters with a 4-banked register file, we have four 16-lane SIMDs and four register files.
This removes the rather baroque register file read system employed in the VLIW system, or at least hides it in hardware.
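A minimal sketch of that reorganization as plain data, just to show that both layouts end up with 64 ALUs and four register banks/files per block; the field names are my own structuring, not AMD's terminology.

```python
# Cayman-style SIMD: 16 lanes, each lane a 4-way VLIW cluster, one 4-banked RF.
cayman_simd = {
    "lanes": 16,
    "alus_per_lane": 4,        # VLIW4 cluster per lane
    "register_file_banks": 4,  # banked reads feed the 4 VLIW slots
}

# GCN-style CU: 4 independent 16-lane SIMDs, each with its own register file.
gcn_cu = {
    "simds": 4,
    "lanes_per_simd": 16,
    "register_files": 4,       # one per SIMD
}

assert cayman_simd["lanes"] * cayman_simd["alus_per_lane"] == 64
assert gcn_cu["simds"] * gcn_cu["lanes_per_simd"] == 64
print("Both organizations: 64 ALUs, 4 register banks/files")
```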
What does it do for multi-lane special ops, though?
Completely off. The scalar unit is not exposed to the user, and compute-wise it is more like Fermi than Larrabee.