So you proved it with your wishful thinking?
No, those were separate things. I proved my claim of his prejudice with current data.
For your allegation of wishful thinking (which isn't about lacking hard proof, since we're talking about a prediction, but about going against all logic), you would need to show that there are no trends supporting unification. There clearly are several trends that do support it, and others that make it a big challenge. So my theory is at least as good as yours, and you cannot accuse me of wishful thinking. Otherwise anyone with a theory and arguments to support it would be a wishful thinker. If that's what you call all of us, I'll take it as a compliment from now on.
But sorry, you proved nothing.
I never claimed I had full proof. And neither do you have hard proof for the opposite claim, because there are no true unified architectures yet (that cannot clearly be significantly improved upon). It's just one theory against another for now. There will be many more arguments and experiments before either of us can start proving anything.
All you offered is some kind of (in my opinion flawed) conjectures.
Good. All we can do is offer conjectures, and agree or disagree on them. I value your arguments and opinions.
I suggest you think again about that. It is not fully unified. The FPU in BD/PD/SR is even shared between two cores. How can this be fully unified with the integer core?
I never defined a unified architecture as having floating-point units as part of the integer core. So I apologize for not specifying it before but you're attacking a straw man. To me a unified architecture has a homogeneous ISA for high ILP and high DLP workloads, and very low overhead for switching between them. Basically, it should be able to take any code in which the loops with independent iterations have been vectorized, with practically no risk of being slower than legacy CPUs.
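To make that concrete, here's a minimal sketch in C (the function and names are just mine for illustration) of the kind of loop I mean: independent iterations that a compiler can vectorize for whatever SIMD width the unified core offers, and that compiles down to perfectly ordinary scalar code when it has to.

```c
#include <stddef.h>

/* Independent iterations: each a[i] depends only on b[i] and c[i],
 * so the compiler is free to vectorize this for SSE, AVX2, AVX-512, ...
 * The same source runs unchanged, and no slower, as scalar code. */
void saxpy_like(float *restrict a, const float *restrict b,
                const float *restrict c, float k, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        a[i] = k * b[i] + c[i];
}
```

Code like that should map onto a unified architecture with zero switching overhead, which is the whole point of the definition.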
I'm very open to alternate implementations, but it appears clear that each integer core will need access to a nearby SIMD cluster that feeds off the same instruction stream. I wouldn't recommend sharing such a SIMD cluster between two scalar cores like AMD currently does, but I wouldn't say that makes it not a unified architecture. It's just a weak one right now, with a seemingly bleak future. But I'd love to be proven wrong in that regard.
You are wrong. You miss the important point here. Vertex and pixel processing loads do not differ in some fundamental point of view. Both are throughput oriented workloads on 32bit floats.
Yes, but that wasn't the case a priori. Vertex and pixel processing did differ very much in the early days. Polygon counts were really low, in part because vertex processing used to be done on the CPU and it didn't have a lot of computing power. And pixel pipelines were fixed-function and integer only. It would have been madness back then to suggest unifying them because of some common ground. The differences far outweighed the similarities - which is why they were processed by separate hardware back then. It's only after vertex processing became programmable, after it used multiple cores, after polygons became smaller, after pixel processing became programmable, and after pixel processing became floating-point, that unification became a viable theory. I still recall though that there were naysayers on this very forum, only days before the GeForce 8800 announcement. So no, unification was never an obvious thing. Nowadays it's a given, because the software started to fully exploit the unification. We take it for granted and find it very valuable. But it is a posteriori knowledge talking when we call it obvious.
Unification of the CPU and GPU may not seem very obvious to you right now because there is no software to really exploit it, which is only the case because there are no unified architectures with such high compute density yet. It's a chicken-and-egg issue that is only going to get resolved gradually. This lack of software also makes it hard to see the common ground. But the CPU and GPU are already trading blows when it comes to OpenCL performance (which ironically was targeted specifically at the GPU). So there are some faint signs that in my opinion should not be brushed off as irrelevant.
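For what it's worth, the 'trading blows' part is easy to check yourself, since the same OpenCL kernel source can be built for either device type. A minimal host-side sketch (plain OpenCL 1.x calls, assuming nothing beyond an installed OpenCL runtime) that just lists which of your devices are CPUs and which are GPUs:

```c
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platforms[8];
    cl_uint num_platforms = 0;
    clGetPlatformIDs(8, platforms, &num_platforms);
    if (num_platforms > 8) num_platforms = 8;

    for (cl_uint p = 0; p < num_platforms; ++p) {
        cl_device_id devices[8];
        cl_uint num_devices = 0;
        if (clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 8,
                           devices, &num_devices) != CL_SUCCESS)
            continue;
        if (num_devices > 8) num_devices = 8;
        for (cl_uint d = 0; d < num_devices; ++d) {
            char name[256] = {0};
            cl_device_type type = 0;
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(name), name, NULL);
            clGetDeviceInfo(devices[d], CL_DEVICE_TYPE, sizeof(type), &type, NULL);
            /* The same clBuildProgram/clEnqueueNDRangeKernel calls work on both. */
            printf("%s %s\n", (type & CL_DEVICE_TYPE_GPU) ? "GPU:" :
                              (type & CL_DEVICE_TYPE_CPU) ? "CPU:" : "?:", name);
        }
    }
    return 0;
}
```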
Really the common ground between the CPU and GPU is 'computing'. Any application is a mix of ILP, DLP and TLP. It won't make sense much longer to drive a wedge down the middle. Intel is adding AVX2+ to its CPUs because there's a lot of DLP left to exploit. GPUs are lowering their average instruction latency to increase efficiency on complex workloads with limited parallelism and to keep scaling under limited bandwidth conditions.
Once this convergence culminates into a unified architecture, it will spark the development of new applications that depend on it, and we'll soon wonder how we could ever live with heterogeneous compute hardware.
But you want to overcome the fundamental difference between latency and throughput oriented workloads. That is not going to happen so easily as the processed data types won't change anything on that distinction.
I beg to differ. Despite being throughput oriented, today's GPUs cannot deal with an average instruction latency of a hundred cycles or more, like they used to in the fixed-function pixel pipeline days. So clearly they have become far more latency oriented too, whether you like it or not. It's really simple: Once you're out of data parallelism, everything becomes latency sensitive. That includes graphics.
So it's becoming harder to label these workloads as fundamentally different. Sure, these so-called throughput oriented workloads can use thousands of ALUs instead of a few, but from the individual ALU's point of view it's all just a thread with similar characteristics to the one you'd feed a scalar core. Again, once you're out of data parallelism, which means your SIMD units can't be made wider without losing performance, everything becomes latency sensitive.
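To illustrate what 'out of data parallelism' means in practice, here's a textbook recurrence (the names are mine, just for illustration): every iteration depends on the previous one, so no SIMD width in the world helps, and performance is set purely by the latency of the multiply-add chain, on a GPU just as much as on a CPU.

```c
/* Exponentially weighted moving average: a loop-carried dependence.
 * There are no independent iterations to vectorize, so throughput
 * hardware gains nothing from wider SIMD here; only the per-instruction
 * latency of the chain matters. Assumes n >= 1. */
float ewma(const float *x, int n, float alpha)
{
    float y = x[0];
    for (int i = 1; i < n; ++i)
        y = alpha * x[i] + (1.0f - alpha) * y;  /* y depends on previous y */
    return y;
}
```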
That was discussed already endless times I think. A throughput oriented task usually accesses amounts of data not fitting into the L1 or L2. Streaming it through a cache structure optimized for low latency access is just wasteful, also from a power perspective. Calling an asynchronous subroutine on a dedicated throughput unit with its own L1/L2 specialized to those tasks doesn't increase data movement at all. The data just gets moved from the LLC or the memory hierarchy to a different place than your latency optimized core.
The problem is that you're assuming "a throughput oriented task usually accesses amounts of data not fitting into the L1 or L2" and you somehow appear to think that's something that cannot and should not be changed, as if it's a good thing. It is not a good thing. For argument's sake let's say L1 and L2 are your only caches, so everything else is a RAM access. When your ALU count doubles with the next silicon node, your RAM bandwidth does not. So the only way to feed your ALUs is to get more use out of your caches. To increase the hit ratio you need fewer threads, and that can only be achieved through latency-oriented techniques.
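A quick back-of-envelope to show the shape of the problem (the numbers are made up for illustration, not measurements): if off-chip bandwidth stays fixed while the ALUs' demand doubles each node, the required hit ratio climbs so that the miss ratio halves every time.

```c
#include <stdio.h>

int main(void)
{
    /* Illustrative assumptions, not measurements: */
    double dram_bw = 200e9;   /* bytes/s of off-chip bandwidth (fixed)   */
    double demand  = 1e12;    /* bytes/s the ALUs want to consume        */

    for (int node = 0; node < 3; ++node) {
        /* Misses must fit in DRAM bandwidth: (1 - hit) * demand <= dram_bw */
        double min_hit = 1.0 - dram_bw / demand;
        printf("demand %6.1f GB/s -> caches must satisfy >= %.1f%% of it\n",
               demand / 1e9, 100.0 * min_hit);
        demand *= 2.0;        /* next node: twice the ALUs, same DRAM bandwidth */
    }
    return 0;
}
```

With these numbers the required hit ratio goes 80%, 90%, 95%: the tolerable miss ratio halves every node, and the only way to get there is fewer threads touching the same cache.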
This isn't some recent problem. It has been going on for years. Early graphics chips relied on multi-pass techniques to make pixels a little more interesting. All data was read from RAM, sent through the pipeline, written back to RAM, read back from RAM in the second pass, and then written back to RAM. Things evolved to single-pass techniques because we ran out of RAM bandwidth to keep doing that. But that meant we needed on-chip storage for the soon to be reused data. To avoid needing tons of it to cover the entire latency of the fixed-function pipeline while the ALU:TEX ratio kept increasing, the solution was to decouple the arithmetic and texture pipelines. That's a latency-oriented optimization, and it was a necessary step toward the much celebrated unification of vertex and pixel processing.
The bandwidth wall is real. And it has to be dealt with every time a new process node gives us more transistors to work with. It's not going away, and it's only getting worse. And yes, the dark silicon issue is real too, but fighting it by moving data around only makes the bandwidth wall hit you in the face faster. Fighting the bandwidth wall head-on by aggressively providing more bandwidth isn't smart either, because an off-chip DRAM access takes far more power than an SRAM access.
The solution for the near future is to make it an on-chip DRAM access. But everyone agrees that's only going to work for one or two process generations, and it comes with a considerable added cost. So it's not the only avenue being pursued. NVIDIA has revealed that it is also experimenting with an architecture that has a really tiny register file right next to the ALUs. The purpose of this seems very similar to the CPU's bypass network: to minimize the latency between dependent arithmetic instructions.
If you look back a bit, AMD's VLIW architectures could issue up to 5 different ops from the same thread in one cycle. That actually even beats Intel's 4 wide OoOE cores. Does it help a lot or is it even necessary? Obviously not as usually one can exchange ILP for DLP and TLP in throughput oriented cores quite easily. It's just a matter of balancing execution latencies and data access latencies with the expected amount of DLP and TLP to keep everything running at full load with minimized hardware effort.
I mostly agree. Note though that Intel has had only 3 arithmetic ports so far (Haswell adds a fourth), and out-of-order scheduling and Hyper-Threading improve the occupancy. Trying to statically schedule 5 instructions from a single thread was clearly overkill. VLIW4 improved things, but there are signs that AMD went too far in abandoning multi-issue with GCN. Dual-issue appears to be the sweet spot for this generation.
As detailed above, the pressure is on to keep improving the IPC per thread to reduce the number of threads. Multi-issue plays a role in that but as you correctly point out it should not be exaggerated. It is clear that future GPUs will need moderate amounts of many different techniques. But they have one thing in common: they're all used by CPU architectures. CPUs in turn can learn from GPUs by utilizing multi-core, SMT, wider SIMD units, gather/scatter, FMA, long-running instructions, etc.
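To make those last few items concrete on the CPU side, here's a small AVX2/FMA sketch (function and array names are mine, purely for illustration) doing a gathered multiply-accumulate with the intrinsics Haswell introduces:

```c
#include <immintrin.h>

/* a[i] += w[idx[i]] * x[i], 8 floats per iteration.
 * Gather and FMA are exactly the kind of GPU-ish features AVX2 brings. */
void gathered_fma(float *a, const float *w, const int *idx,
                  const float *x, int n)
{
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256i vi = _mm256_loadu_si256((const __m256i *)(idx + i));
        __m256  vw = _mm256_i32gather_ps(w, vi, 4);   /* gather (AVX2)      */
        __m256  vx = _mm256_loadu_ps(x + i);
        __m256  va = _mm256_loadu_ps(a + i);
        va = _mm256_fmadd_ps(vw, vx, va);             /* fused multiply-add */
        _mm256_storeu_ps(a + i, va);
    }
    for (; i < n; ++i)                                /* scalar tail        */
        a[i] += w[idx[i]] * x[i];
}
```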
The convergence may appear slow and bumpy when comparing one year to the next, but over multiple years there has clearly been a phenomenal exchange of technology between the CPU and GPU already. The most logical hypothesis is that the old but strong forces which have caused/allowed this will continue to drive more convergence. I also like to think that the thousands of Intel engineers who put so much effort into AVX2 believe it will be of great relevance for many years to come (and not become a burden instead), and that they made it extendable to 1024-bit for good reason.
You mean like the 8 MB of register files of a Tahiti GPU? I want to see that in a CPU with heavily multiported reg files running at 4 GHz.
Tahiti does not have 8 MB of register files. They may call them that, and they may be used that way from a software perspective, but they are not register files in the same sense as the ones you're trying to compare against in the CPU. They are SRAM banks. And with that cleared up, the CPU does have copious amounts of SRAM already. Furthermore, the experimental NVIDIA architecture I mentioned earlier has an 'operand register file' of just 16 entries.
So you really have to look at the entire storage hierarchy and not single out any layers that happen to be called the same due to historic reasons more than anything else. That said, the CPU can increase its latency hiding capabilities with long-running vector instructions, which necessitates growing the register file. AVX-1024 on 512-bit execution units would increase the architectural register space fourfold over what we have today and double the latency hiding per thread. How much the physical register space would grow depends on the interaction with out-of-order execution and Hyper-Threading, but in any case the register space would no longer be small compared to the L1 space, which is certainly atypical compared to a legacy CPU, and more GPU-like.
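The back-of-envelope behind that fourfold figure, assuming the architectural register count stays at 16 like today's AVX:

```c
#include <stdio.h>

int main(void)
{
    const int regs = 16;  /* architectural vector registers in x86-64 AVX */
    printf("AVX2, 256-bit regs : %d bytes\n", regs * 256 / 8);   /*  512 B          */
    printf("AVX-1024 regs      : %d bytes\n", regs * 1024 / 8);  /* 2048 B, i.e. 4x */
    /* And a 1024-bit instruction issued to 512-bit execution units occupies
     * them for two cycles instead of one, so each instruction in flight
     * covers twice as much latency per thread. */
    return 0;
}
```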
Or all the write combining and memory access coalescing going on to use the available bandwidth as efficiently as possible, which causes a massive (about one order of magnitude) latency penalty that would be completely prohibitive for a latency oriented task? Putting this on top of a latency optimized cache architecture would just burn an unnecessary amount of power. It's very likely more efficient to keep these structures separate, maybe save for the LLC (with special provisions to avoid excessive thrashing of the data for latency optimized tasks by throughput tasks).
It used to make a lot of sense to keep the GPU's memory separate, and yet for a while now the most frequently sold GPU is one that uses system memory. And now you're telling me it makes sense to share the LLC... Can't you see the trend here? I call that convergence, and apparently with every step of it we obtained qualities that are considered better than what separation gave us. It used to also be inconceivable to unify vertex and pixel processing due to prohibitive differences, and yet one step at a time and perhaps not fully consciously we were converging them, until the final leap was made.
What I'm trying to say here is that CPU-GPU unification will not happen overnight, and there are most definitely many more issues to overcome. But rather than looking at these as prohibitive and throwing in the towel, there are qualities of unification that are worth pursuing which should at least motivate us to try and take the next convergence step.
I'm taking it as a challenge. It is why I started this thread: to look for part of the solution for the remaining fixed-function logic. The write combining and read coalescing aspects are definitely also interesting, but again I see ways to continue converging them one step at a time. The CPU already does write combining, and Intel claims that Haswell does not suffer from bank conflicts despite having two read ports and a write port, while it also implements gather. Furthermore, long-running vector instructions would easily allow the minimum latency to be a bit higher. Also note that Nehalem increased the access latency by one cycle and it had no measurable effect on single-threaded performance. SIMD accesses also take a few more cycles than scalar integer accesses. With long-running instructions and Hyper-Threading I'm sure that many more additional cycles of latency can be covered. Also note that FO4 continues to go down while clock frequency stagnates. More work can be done per cycle, which has already resulted in some instruction latencies going down. It is also what allows Haswell to implement FMA at no added latency, and to have four scalar integer units with zero cycle bypass. And despite all that, Haswell is going to be more power efficient. So if improved coalescing is required, I'm sure they can handle it, even if it takes a couple more generations.