My guess for the 7870:
- 32 SIMDs

That would equal just 8 CUs (512 vALUs) and probably 32 TMUs. I doubt that would be faster than Barts on a lot of workloads, irrespective of the efficiency gains, as long as AMD does not increase the clocks significantly.
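A quick back-of-the-envelope check of that arithmetic, as a rough sketch: the 4 SIMDs per CU and 16 lanes per SIMD come from the presentation, but the 4 TMUs per CU is my assumption, not something AMD confirmed.

```python
# Rough sanity check of the "32 SIMDs = 8 CUs" arithmetic.
# Assumption (mine, not from the slides): 4 TMUs per CU.
SIMDS_GUESSED = 32
LANES_PER_SIMD = 16
SIMDS_PER_CU = 4
TMUS_PER_CU = 4  # assumed

cus = SIMDS_GUESSED // SIMDS_PER_CU       # 32 / 4  = 8 CUs
valus = SIMDS_GUESSED * LANES_PER_SIMD    # 32 * 16 = 512 vector ALUs
tmus = cus * TMUS_PER_CU                  # 8 * 4   = 32 TMUs

print(cus, valus, tmus)  # 8 512 32
```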
Regarding the cost of this new architecture, AMD told us it is only slightly higher than that of current architectures: some parts are more complex, but others are simplified. It should not be a barrier to increasing the number of compute units.
Is Eric's speech today or tomorrow?
I just wanted to cite that too. Hardware.fr goes on to speculate about at least 30 CUs (with half-speed DP), so in the same ballpark as your guess. From the presentation, the number of CUs should be divisible by 4, though. What about 40 CUs to really gain the high ground?

If that's true, we should be expecting at least 32 CUs, in which case it should demolish everything.
http://www.hardware.fr/news/11648/afds-architecture-futurs-gpus-amd.html
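For scale, a minimal sketch of the peak-throughput arithmetic behind those CU counts. The 64 lanes per CU follows from the 4 SIMD-16s shown in the slides, but the FMA-per-lane-per-clock rate and the 900 MHz clock are my assumptions, not figures from the presentation.

```python
# Peak single-precision throughput for a few speculated CU counts.
# Assumptions (mine): 2 FLOPs per lane per clock (FMA) and a 900 MHz clock.
LANES_PER_CU = 64            # 4 SIMDs x 16 lanes, per the presentation
FLOPS_PER_LANE_PER_CLK = 2   # fused multiply-add, assumed
CLOCK_HZ = 900e6             # assumed clock

for cus in (30, 32, 40):
    gflops = cus * LANES_PER_CU * FLOPS_PER_LANE_PER_CLK * CLOCK_HZ / 1e9
    print(f"{cus} CUs -> {gflops:.0f} GFLOPS SP")
# 30 CUs -> 3456, 32 CUs -> 3686, 40 CUs -> 4608 GFLOPS SP
```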
Are you referring to me? Because that's not what I was saying. I wonder how well the chip could handle a software rendering model.

A user on another forum suggested that, the way he reads those slides, the new architecture is a lot closer to Larrabee (excluding the fact that this isn't x86, obviously) than any previous NV or AMD design.
Is he completely off or right on the money?
This may have been made more clear elsewhere and I missed it, but I wonder how much of the scalar pipe and arbitration logic was already present in the sequencer blocks and thread engine in current chips, just not exposed to the outside world. At the very least, some of that hardware would be repurposed and made available to the software stream.
The CU is no longer capable of 4 simultaneous ALU op issues, but this is compensated for by having 4-cycle execution. Basically, there is a 3-cycle spin-up period of successive vector issues before we see the same utilization as a best-case fully-packed VLIW instruction.
That spin-up cost is probably compensated for by better utilization and by the removal of certain latency penalties related to clause switches and other contributors to spin-up latency that were present before but not explicitly mentioned.
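A toy model of that spin-up behaviour, assuming one vector issue per cycle rotated round-robin across the four SIMDs and a 4-cycle execution cadence; the numbers are chosen to illustrate the point, not taken from the slides.

```python
# Toy issue model: one vector instruction issued per cycle, rotated round-robin
# across 4 SIMDs; each issued wavefront keeps its SIMD busy for 4 cycles.
SIMDS = 4
EXEC_CYCLES = 4
TOTAL_CYCLES = 12

busy_until = [0] * SIMDS
for cycle in range(TOTAL_CYCLES):
    simd = cycle % SIMDS                      # round-robin issue target
    busy_until[simd] = cycle + EXEC_CYCLES
    active = sum(1 for b in busy_until if b > cycle)
    print(f"cycle {cycle}: {active}/{SIMDS} SIMDs busy")
# Cycles 0-2 are the spin-up (1, 2, then 3 SIMDs busy); from cycle 3 on,
# all 4 SIMDs stay busy -- matching a fully packed VLIW4 issue in steady state.
```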
The vector ALU and register system strikes me as taking Cayman SIMD and putting it on its side.
Instead of one 16-lane SIMD of 4-way ALU clusters with a 4-banked register file, we have four 16-lane SIMDs and four register files.
This removes the rather baroque register file read system employed in the VLIW system, or at least hides it in hardware.
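A minimal sketch of that reorganization as plain data, just to show that both layouts end up with 64 ALUs and four register banks/files per block; the field names are my own structuring, not AMD's terminology.

```python
# Cayman-style SIMD: 16 lanes, each lane a 4-way VLIW cluster, one 4-banked RF.
cayman_simd = {
    "lanes": 16,
    "alus_per_lane": 4,        # VLIW4 cluster per lane
    "register_file_banks": 4,  # banked reads feed the 4 VLIW slots
}

# GCN-style CU: 4 independent 16-lane SIMDs, each with its own register file.
gcn_cu = {
    "simds": 4,
    "lanes_per_simd": 16,
    "register_files": 4,       # one per SIMD
}

assert cayman_simd["lanes"] * cayman_simd["alus_per_lane"] == 64
assert gcn_cu["simds"] * gcn_cu["lanes_per_simd"] == 64
print("Both organizations: 64 ALUs, 4 register banks/files")
```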
What does it do for multi-lane special ops, though?
Completely off. The scalar unit is not exposed to the user, and compute-wise it is more like Fermi than Larrabee.