Bulldozers are faster in reverse

Right now, when I'm playing a game on a system that has a separate GPU and CPU, the CPU handles the game logic and the GPU handles the graphics. Ideally, they could both be fully utilized. On your proposed system with, say, 8 threads and 8 FPUs, if 4 threads were doing logic and 4 doing graphics, 4 FPUs would be fully utilized and 4 would be almost completely dark.
I'm thinking more like 8 cores, 16 threads, 16 SIMD clusters with 2 FMA units each. That would still be barely bigger than 100 mm² at 14 nm, but could pump out 2 TFLOPS.
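
For reference, the back-of-the-envelope arithmetic behind that figure, assuming 512-bit FMA units running at roughly 2 GHz (half the scalar clock, as proposed further down) - a sketch, not a spec:

#include <stdio.h>

int main(void)
{
    /* Assumed configuration: 16 SIMD clusters x 2 FMA units, each 512 bits
       wide (16 SP lanes), 2 FLOPs per FMA, ~2 GHz SIMD clock. */
    double flops_per_cycle = 16 * 2 * 16 * 2;        /* = 1024 */
    double peak = flops_per_cycle * 2e9;
    printf("Peak: %.2f TFLOPS SP\n", peak / 1e12);   /* ~2.05 TFLOPS */
    return 0;
}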

Leaving some of it dark isn't an issue at all. It's a necessity due to the power wall and the bandwidth wall. The important thing is to have the right execution logic for any type of workload, right where the data is.
Should you share or dedicate some resources, the FPU should be the first one shared and last one dedicated.
AMD tried that with Bulldozer, and look where it got them. Even for Intel, the highest Linpack scores are obtained when turning off Hyper-Threading. Granted, that's a synthetic benchmark and Hyper-Threading typically does help, but it shows that it comes with overhead that needs to be 'overcome' before it starts to contribute anything. That's why I'm proposing to execute AVX-1024 instructions on 512-bit units instead. It helps hide latency while doubling the scheduling window and lowering front-end activity.
Rather than see lots of rarely utilized specialized execution resources, I expect we'll just see cores swimming in a sea of cache.
Unified cores then, right?
 
I'm thinking more like 8 cores, 16 threads, 16 SIMD clusters with 2 FMA units each. That would still be barely bigger than 100 mm² at 14 nm, but could pump out 2 TFLOPS.

Leaving some of it dark isn't an issue at all. It's a necessity due to the power wall and the bandwidth wall. The important thing is to have the right execution logic for any type of workload, right where the data is.

I disagree on the dark silicon issue, as I already posted. We already have a way to profitably use every last transistor (even if with diminishing returns), we're not going to start reducing cache to plop in extra FPUs that will idle for most of the time.

AMD tried that with Bulldozer, and look where it got them.
I still hold that the FPU is not the weak point. Every single FPU-heavy load I've ever run on the BD has been cache or memory throughput-limited, not compute-limited. Every last one. I have never managed to do anything that approaches looking like real work and that runs out of execution resources on BD. The caches are that bad.

I honestly think that for a lot of real problems, a write-back L1 would double the total throughput of the system.

Even for Intel, the highest Linpack scores are obtained when turning off Hyper-Threading. Granted, that's a synthetic benchmark
Not only is Linpack so unrepresentative of real work that it should never be mentioned, but this also completely skirts my point. My point wasn't "HT speeds up FPU-heavy loads", it was "in mixed-load environments, HT increases efficiency". Run Linpack and 4 scalar threads at the same time, and you bet you're going to see very good gains from HT.

but it shows that it comes with overhead that needs to be 'overcome' before it starts to contribute anything.

This overhead is cache pressure. There is no overhead to overcome in the actual execution parts. Dedicating FPU clusters does not help overcome the cache pressure overhead in any way, shape or form.

Unified cores then, right?

Eventually, but I think this will take a long time still.

Note that I don't necessarily think this is the best way to go (I haven't studied the problem long enough to pick a position), but I do think that Intel is institutionally predisposed to go this way, and thus it will happen.
 
One question Nick: Where do you put the shared resources for the integrated GPU's math engines (which you want to merge with the Vector units of the CPU core), and how do you propose the work should be fused afterwards?
 
I disagree on the dark silicon issue, as I already posted. We already have a way to profitably use every last transistor (even if with diminishing returns), we're not going to start reducing cache to plop in extra FPUs that will idle for most of the time.
Why settle for diminishing returns already? That can always work as a last resort. Intel has been able to keep the L3 cache for four cores at 8 MB for three process nodes now, using the increasing transistor budget for better purposes. I'd rather know that each core is concentrating on one task, achieving maximum performance/Watt, than ensure full utilization of a few units while leaving performance on the table.

There are three basic kinds of workloads to deal with: high ILP, high TLP, and high DLP. For high ILP, you want four scalar execution ports like Haswell. For high TLP, you want to share those between two threads. And for high DLP, you just want wide SIMD units, and lots of them, at a lower frequency and hiding latency while maximizing data locality with long-running instructions.

The architecture I'm proposing here is designed for each of these, and any mix of them. Basically, CPUs like Haswell are already very good at ILP and TLP, but they need GPU-like SIMD units, without lowering data locality by running lots of threads, and without the other overhead associated with that.
I still hold that the FPU is not the weak point. Every single FPU-heavy load I've ever run on the BD has been cache or memory throughput-limited, not compute-limited. Every last one. I have never managed to do anything that approaches looking like real work and that runs out of execution resources on BD. The caches are that bad.
I have no first-hand experience with Bulldozer, so I'm not doubting your findings at all. But improving the caches and lowering the latencies would still not make it a good step towards unified computing. To match my proposal, it would need 32 scalar cores. There's no use for that in the consumer market, and it would waste a lot of space (and I'm not talking about low utilization, I'm talking about a complete waste - even using it for more cache would have been better).

Of course the root of the problem is that AMD doesn't want unified computing at all. It wants small scalar cores that share an FPU for legacy purposes, and a big GPU to handle all throughput computing needs. That may sound good on paper, but it's fraught with heterogeneous computing issues. They're hoping for developers to miraculously deal with that, while Intel is pampering developers with a better ROI proposal.
Not only is Linpack so unrepresentative of real work that it should never be mentioned, but this also completely skirts my point. My point wasn't "HT speeds up FPU-heavy loads", it was "in mixed-load environments, HT increases efficiency". Run Linpack and 4 scalar threads at the same time, and you bet you're going to see very good gains from HT.

This overhead is cache pressure. There is no overhead to overcome in the actual execution parts. Dedicating FPU clusters does not help overcome the cache pressure overhead in any way, shape or form.
Like I said, I mentioned Linpack only to illustrate that Hyper-Threading has an overhead. I'm not disputing that it increases utilization in mixed-load environments, but I do argue that it doesn't offer the best performance/Watt, precisely due to that inherent overhead.

Then why do I propose keeping Hyper-Threading for the scalar portion of the core? Because there's a tipping point where low utilization becomes a waste: having two threads share four scalar execution ports during high-TLP workloads is more power efficient than one thread using only a couple of them on average. Those four ports do still matter for increasing IPC in single-threaded workloads, though, due to Amdahl's law.

So it's all about finding the right balance for each type of workload. I don't think SMT is optimal for the SIMD units. They suffer the most from cache pressure when the thread count is increased. AVX-1024 offers the necessary latency hiding qualities to increase utilization (the good kind that improves power efficiency) while lowering front-end and scheduling overhead.

I'm open to alternatives, but I really don't think Bulldozer sets a good example.
Eventually, but I think this will take a long time still.

Note that I don't necessarily think this is the best way to go (I haven't studied the problem long enough to pick a position), but I do think that Intel is institutionally predisposed to go this way, and thus it will happen.
I think it will happen regardless of Intel's desire for it. That, plus their process advantage, will just make it happen sooner than the sheer necessity dictates. I have continuously underestimated how fast they would converge things, and I think that's saying something. If Skylake features 512-bit SIMD units, then we're looking at a 32-fold increase in computing power between a dual-core Westmere and an 8-core Skylake, in five years' time. That would probably put getting rid of the integrated GPU on the roadmap next.
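
To show where the 32x comes from (assuming similar clock speeds, and two 512-bit FMA ports per Skylake core - the latter is my assumption, not a confirmed spec):

#include <stdio.h>

int main(void)
{
    /* Dual-core Westmere: 128-bit SSE, separate add and mul ports,
       so 8 SP FLOPs per cycle per core. */
    double westmere = 2 * 8;
    /* Hypothetical 8-core Skylake: 2 FMA ports x 16 SP lanes x 2 FLOPs. */
    double skylake = 8 * (2 * 16 * 2);
    printf("%.0fx\n", skylake / westmere);   /* prints 32x */
    return 0;
}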
 
One question Nick: Where do you put the shared resources for the integrated GPU's math engines (which you want to merge with the Vector units of the CPU core), and how do you propose the work should be fused afterwards?
I think it can be done in the execution units instead of requiring a separate math box. Sines and cosines can be closely approximated with mainly just a handful of FMA operations. SSE/AVX also already has support for pipelined reciprocal and reciprocal square root approximation (although it could use some improvement). Next, gather instructions can be used for table look-ups for piece-wise approximations. The starting point for the exponential and logarithm functions is to extract/insert the IEEE-754 exponent field, for which Xeon Phi has some interesting instructions. Extending the 'Bit Manipulation Instructions' to AVX would also help with that.
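
To make the sine/cosine point concrete, here's a minimal sketch of a polynomial evaluated with a handful of FMAs (Horner form). The coefficients are just the Taylor ones and the input is assumed to be range-reduced to around [-pi/4, pi/4]; a real implementation would use minimax coefficients, but the structure is the same:

#include <math.h>   /* fmaf */

/* Approximate sin(x) for small, range-reduced x using three fused
   multiply-adds: x - x^3/6 + x^5/120 - x^7/5040. */
static float sin_approx(float x)
{
    float x2 = x * x;
    float p = -1.0f / 5040.0f;               /* x^7 coefficient */
    p = fmaf(p, x2,  1.0f / 120.0f);         /* + x^5 term      */
    p = fmaf(p, x2, -1.0f / 6.0f);           /* + x^3 term      */
    p = fmaf(p, x2,  1.0f);                  /* + x^1 term      */
    return x * p;
}

Vectorized over a 512-bit register, that's one such polynomial per lane, which is exactly the kind of work the regular FMA units can absorb.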
 
I think we're talking about different things here Nick. Maybe I should not have called the Execution Units "Math Engines" - too close to the Intel term for special function units perhaps.

What I actually meant was: where to put the front-end, if you distribute/merge the execution units into/with the Vector units of the CPU cores. You'll need some kind of front end (work distribution) and some kind of back end (fusion) which unifies all the involved cores.
 
I think we're talking about different things here Nick. Maybe I should not have called the Execution Units "Math Engines" - too close to the Intel term for special function units perhaps.

What I actually meant was: where to put the front-end, if you distribute/merge the execution units into/with the Vector units of the CPU cores. You'll need some kind of front end (work distribution) and some kind of back end (fusion) which unifies all the involved cores.
I'm not sure if I'm interpreting you correctly this time, but there are no different cores involved. Instead of like today's CPU cores with Hyper-Threading where both the scalar and vector execution units are shared, my proposal is to have dedicated vector execution units for each thread. That's really the gist of it. The front-end and back-end basically stay the same.

Just look at Bulldozer. It has two dedicated scalar clusters, one shared vector cluster, and one shared front-end and back-end. I just want the reverse for the scalar and vector clusters.

To keep that many vector units occupied with just a four-instruction-wide front-end, there would be AVX-1024 instructions that split into two 512-bit operations on issue, executed sequentially. Running the SIMD clusters at half frequency would save power and further lessen the burden on the front-end.
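
Purely to illustrate the software-visible effect of that split (there is no actual AVX-1024 ISA; this just expresses a hypothetical 1024-bit FMA as the two 512-bit halves the scheduler would crack it into at issue):

#include <immintrin.h>

/* Hypothetical "AVX-1024" FMA over 32 floats, executed as two AVX-512
   halves back to back. In hardware the front-end would see a single
   instruction; only the execution stage double-pumps. Needs AVX-512F. */
static void fma1024_ps(float *dst, const float *a, const float *b, const float *c)
{
    __m512 lo = _mm512_fmadd_ps(_mm512_loadu_ps(a),
                                _mm512_loadu_ps(b),
                                _mm512_loadu_ps(c));
    __m512 hi = _mm512_fmadd_ps(_mm512_loadu_ps(a + 16),
                                _mm512_loadu_ps(b + 16),
                                _mm512_loadu_ps(c + 16));
    _mm512_storeu_ps(dst,      lo);
    _mm512_storeu_ps(dst + 16, hi);
}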
 
I think he is referring to the fact that a lambda function is divided into many work-items that progress in parallel on a GPU.
Imagine a shader that copies a texture, with some custom pattern, some dependency on other parallel copies, and some IFs in between that make the flow diverge.
Unless you want your compiler to manage all of this, you need 'something' that deals with it (including managing possibly non-coherent retirement of data).

i.e. you cannot spawn hundreds of threads on a CPU as on a GPU - that is not convenient.
 
Software scheduling FTW. :D

But you are right, it is in my opinion much more convenient to run larger tasks asynchronously and have them automatically distributed over the available resources, which effectively necessitates a separate scheduler with access to all vector units if you want to do it in hardware. Wait! That wouldn't be that homogeneous anymore. So scratch it.
Otherwise you would need a specialized scheduler in the OS handling this (and some feedback to the app of course, to spawn the right number of threads at the right time). Basically it would amount to some kind of runtime environment for these tasks (which does the thread creation, work distribution and maybe even the scheduling [or at least helps the normal thread scheduler of the OS]).
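
A minimal sketch of what such a runtime could look like at its simplest - a parallel_for that statically splits an index range over a fixed set of worker threads (the names parallel_for/kernel_fn are made up for illustration; real runtimes like OpenMP or TBB add work stealing, affinity and feedback to the application; compile with -pthread):

#include <pthread.h>
#include <stddef.h>

typedef void (*kernel_fn)(size_t begin, size_t end, void *arg);

typedef struct {
    kernel_fn fn;
    size_t begin, end;
    void *arg;
} chunk_t;

static void *worker(void *p)
{
    chunk_t *c = (chunk_t *)p;
    c->fn(c->begin, c->end, c->arg);
    return NULL;
}

/* Statically partition [0, n) over nthreads workers (nthreads <= 64 here). */
static void parallel_for(size_t n, int nthreads, kernel_fn fn, void *arg)
{
    pthread_t tid[64];
    chunk_t   chunk[64];
    size_t per = (n + nthreads - 1) / nthreads;

    for (int t = 0; t < nthreads; t++) {
        size_t begin = (size_t)t * per;
        chunk[t] = (chunk_t){ fn, begin > n ? n : begin,
                              begin + per > n ? n : begin + per, arg };
        pthread_create(&tid[t], NULL, worker, &chunk[t]);
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
}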
 
Current high frequency Haswell chips achieve 6 GFLOPS/Watt, while GPUs achieve up to 18 GFLOPS/Watt. But with 512-bit SIMD units, two clusters of them, at half frequency, running AVX-1024 instructions, how much could it achieve?
I'm thinking more like 8 cores, 16 threads, 16 SIMD clusters with 2 FMA units each. That would still be barely bigger than 100 mm² at 14 nm, but could pump out 2 TFLOPS.
Knights Landing is supposed to be around 3+ TFLOPS on 14 nm, and Haswell already reaches Knights Corner-levels of FLOPS/W, so the difference between Haswell and Xeon Phi is not very big once TDP is factored in. Do you see Xeon Phi as being relevant 5 or so years from now?
 
[Attached image: Intel roadmap slide citing 3+ TFLOPS for Knights Landing]


It may not be a measured 3+ TFLOPS, but it is at least their plan.
 
I only see GFLOPs/Watt for the Knights, not for Haswell.

But we have wide access to Haswell! Peak FLOPS for the 4770 is ~500 GFLOPS and its TDP is ~100 W (rough estimates, can't remember exactly). iMac's statement seems fair to me.
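
For the record, the rough peak arithmetic (the i7-4770 is actually an 84 W part):

#include <stdio.h>

int main(void)
{
    /* i7-4770: 4 cores, 2 x 256-bit FMA ports, 8 SP lanes, 2 FLOPs per FMA,
       3.4 GHz base clock (closer to ~500 GFLOPS at turbo clocks). */
    double peak = 4 * 2 * 8 * 2 * 3.4e9;
    printf("%.0f GFLOPS SP\n", peak / 1e9);   /* ~435 GFLOPS */
    return 0;
}

At 84 W that works out to roughly 5-6 SP GFLOPS/W, in line with the 6 GFLOPS/Watt figure quoted earlier in the thread.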
 
But we have wide access to Haswell! Peak FLOPS for the 4770 is ~500 GFLOPS and its TDP is ~100 W (rough estimates, can't remember exactly). iMac's statement seems fair to me.

DP FLOPS is half of that though. So Haswell is roughly half of KC's FLOPS/W in DP.
 
The Xeon E3-1280 v3 (integrated GPU disabled) is 3.6 GHz at 82 W, for 2.8 DP GFLOPS/W (3.6 GHz x 4 cores x 16 DP FLOPs/cycle ≈ 230 GFLOPS), while Knights Corner parts range from 3.3 to 4.5 DP GFLOPS/W (unless I missed some parts). So admittedly not equal, but not half either. [I wasn't thinking when I made the quick calculations in the above post.]
 
But we have wide access to Haswell! Peak FLOPS for the 4770 is ~500 GFLOPS and its TDP is ~100 W (rough estimates, can't remember exactly). iMac's statement seems fair to me.
Yeah, I just calculated and was pretty shocked that Larrabee is so weak compared to Haswell, even with 60 cores. Especially with the power it uses.

Or maybe it's the very fat 512-bit GDDR5 controller that burns a lot of power?
 
LRB isn't that much different from GK110 if you look at FLOPS/W. One outstanding point is the 30 MB of cache (I think GK110 has 1.5 MB?). While caches are very optimized, I'd guess it still has some impact on the power consumption.

I've played with my new Haswell this weekend, and I'm really impressed by how efficiently you can run code on it. I got some stuff vectorized/optimized to reach 90% of the theoretical peak, and with HT it went those last 10% further (the kind of FMA loop I mean is sketched below). I've tried some matrix*matrix (SP FMA) and MD5 (AVX2).
My older i7 could get up to a 30% boost from HT, but never got closer than 90% of peak performance.
I'd be really curious how well I could optimize some code for the Xeon Phi parts. I feel like comparing just the raw numbers doesn't do justice to Haswell's excellent efficiency.
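
For context, the kind of inner loop that gets that close to peak looks roughly like this (a throughput toy under my own assumptions, not the actual GEMM/MD5 code mentioned above): enough independent 256-bit FMA chains to cover Haswell's 5-cycle FMA latency across its two FMA ports.

#include <immintrin.h>

/* 10 independent FMA accumulators keep both 256-bit FMA ports busy
   (5-cycle latency x 2 ports). Compile with -mavx2 -mfma. */
static float fma_burn(long iters)
{
    __m256 acc[10];
    const __m256 a = _mm256_set1_ps(1.000001f);
    const __m256 b = _mm256_set1_ps(1e-7f);

    for (int i = 0; i < 10; i++)
        acc[i] = _mm256_set1_ps((float)(i + 1));

    for (long n = 0; n < iters; n++)
        for (int i = 0; i < 10; i++)
            acc[i] = _mm256_fmadd_ps(acc[i], a, b);   /* acc = acc*a + b */

    __m256 sum = acc[0];
    for (int i = 1; i < 10; i++)
        sum = _mm256_add_ps(sum, acc[i]);

    float out[8];
    _mm256_storeu_ps(out, sum);
    return out[0];   /* keep the optimizer from discarding the work */
}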
 