It appears that the biggest concern about unifying the CPU and (integrated) GPU is a clock frequency mismatch: sequential scalar workloads demand a design that runs at ~4 GHz, while running wide SIMD units for graphics and compute workloads at that frequency is not power efficient. So I've been thinking about an architecture that runs its SIMD units at half the base frequency, while still being homogeneous and offering plenty of throughput...
I think a key part of the solution is to do the reverse of what Bulldozer does: have one scalar execution cluster shared between two threads, and two vector execution clusters, each dedicated to one of the threads. The vector clusters would run at half the frequency of the scalar cluster. As a better alternative to Hyper-Threading, they could support AVX-1024 instructions that execute on 512-bit SIMD units, so each instruction keeps a unit busy for more than one cycle, which helps hide latency.
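To make the latency-hiding argument a bit more concrete, here is a rough Python sketch. It assumes a fully pipelined SIMD unit that accepts one 512-bit pass per cycle; the 4-cycle latency is a made-up figure for illustration, and the function name is mine, not anything from a real ISA or tool.

```python
import math

def independent_instructions_needed(latency_cycles: int,
                                    instruction_bits: int,
                                    unit_bits: int = 512) -> int:
    """Independent instructions that must be in flight to keep one
    pipelined SIMD unit busy (simplified model, see assumptions above)."""
    # An instruction wider than the unit is executed in several passes,
    # so it occupies that many consecutive issue slots.
    passes = max(instruction_bits // unit_bits, 1)
    # The unit needs `latency_cycles` issue slots of independent work
    # before the first result comes back.
    return math.ceil(latency_cycles / passes)

# Hypothetical 4-cycle FMA latency (assumption, not a measured number).
print(independent_instructions_needed(4, 512))
print(independent_instructions_needed(4, 1024))
```

Under this model, doubling the instruction width relative to the unit width halves the number of independent instructions the thread has to supply, which is the same job a second hardware thread does with Hyper-Threading.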
One possible implementation for the vector cluster is to have two identical FMA-capable SIMD units that each start an operation on alternating (odd/even) cycles of the full-speed clock, plus one SIMD unit for simple logic operations that runs at full frequency. This still corresponds relatively closely to Intel's current three SIMD units and thus minimizes the impact on legacy vector workloads.
Just four of these modules would deliver about 1 TFLOPS (peak, single precision) of power-efficient, homogeneous throughput computing. You could program it in any language you like, without quirky abstraction layers or unexpected overhead. This architecture would also fully retain legacy scalar performance.
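For anyone who wants to check the 1 TFLOPS figure, here is the back-of-the-envelope arithmetic as a small Python sketch. The assumptions are mine: a 4 GHz scalar clock (the post only says ~4 GHz), 32-bit single-precision lanes, and an FMA counted as two floating-point operations. The function and parameter names are just for illustration.

```python
def peak_gflops(modules: int = 4,
                clusters_per_module: int = 2,    # one vector cluster per thread
                fma_units_per_cluster: int = 2,
                simd_bits: int = 512,
                lane_bits: int = 32,             # assumption: single precision
                scalar_ghz: float = 4.0,         # assumption: exactly 4 GHz
                vector_clock_ratio: float = 0.5  # vector clusters at half speed
                ) -> float:
    """Peak FMA throughput in GFLOPS for the proposed design."""
    lanes = simd_bits // lane_bits
    flops_per_unit_per_cycle = lanes * 2         # FMA = multiply + add
    vector_ghz = scalar_ghz * vector_clock_ratio
    return (modules * clusters_per_module * fma_units_per_cluster
            * flops_per_unit_per_cycle * vector_ghz)

print(peak_gflops())              # 1024.0 GFLOPS, single precision
print(peak_gflops(lane_bits=64))  # 512.0 GFLOPS, double precision
```

These are peak numbers only; the logic-only SIMD unit is left out because it contributes no FMA throughput.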
Thoughts?