First of all, thanks a lot for the feedback, I really appreciate it; there are many good points I agree with. I replied point-by-point, so my post is even longer than yours, sorry about that...
One possible pitfall I foresee in treating such a broad topic holistically is that, without a very structured approach, the reader might lose direction. One can get lost in a sea of anecdotes without being able to synthesize the information.
Agreed, though I'm hopeful that's less of a problem when looking at specific real-world examples. Of course, any exotic design worth its salt is composed of many noteworthy design decisions, so even then it's far from obvious...
One area where I had a little trouble following initially was the example of fixed-function hardware using floating point as an example. [...] Where is the line that says FP is fixed function?
Yeah, that's definitely a more complex problem than I made it out to be, both technically and semantically. The way I look at it is this: any ALU is basically a fixed-function unit, which is more or less configurable based on the implementation. When you add a *separate* FP unit, you simply add *more* fixed-function logic, making the design more fixed-function centric.
If you replace the integer unit with a more advanced unit that is capable of both fast integer and floating-point operations, with a significant amount of hardware reuse between the two modes (i.e. the mantissa datapath), then you have increased both the amount of fixed-function logic *and* the configurability of the unit. In that case you have not necessarily made the design more fixed-function-centric overall, although you have made it more specialized. At that level it truly becomes a semantic problem in my mind, and we could debate it for centuries to come, so I'll refrain from wasting too much time on it unless we can imagine a more fundamental disagreement that I couldn't think of...
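To make that a bit more concrete, here's a completely made-up toy sketch in C (not any real design, and obviously not how actual hardware is described): one adder block reused both as the integer ALU and as the mantissa adder of a very dumbed-down FP path.

```c
#include <stdint.h>
#include <stdio.h>

/* Toy model: one shared 32-bit adder "block" used both as the integer
 * ALU and as the mantissa adder of a very simplified floating-point
 * unit. The point is only to illustrate how adding FP capability can
 * reuse existing fixed-function logic instead of duplicating it. */

static uint32_t shared_adder(uint32_t a, uint32_t b)   /* the reused block */
{
    return a + b;
}

/* Integer mode: the adder simply is the ALU. */
static uint32_t int_add(uint32_t a, uint32_t b)
{
    return shared_adder(a, b);
}

/* "FP" mode: a made-up format with an integer exponent and a mantissa
 * (no sign, no rounding, no normalisation), just to show the mantissa
 * path going through the same adder. */
typedef struct { int32_t exp; uint32_t mant; } toy_fp;

static toy_fp fp_add(toy_fp a, toy_fp b)
{
    /* Align exponents by shifting the smaller mantissa right. */
    if (a.exp < b.exp) { toy_fp t = a; a = b; b = t; }
    if (a.exp - b.exp < 32) b.mant >>= (a.exp - b.exp);
    else                    b.mant = 0;

    /* Mantissa addition reuses the integer adder block. */
    toy_fp r = { a.exp, shared_adder(a.mant, b.mant) };
    return r;
}

int main(void)
{
    toy_fp x = { 10, 0x400000 }, y = { 8, 0x600000 };
    toy_fp z = fp_add(x, y);
    printf("int: %u  fp: exp=%d mant=0x%06x\n",
           (unsigned)int_add(3, 4), z.exp, (unsigned)z.mant);
    return 0;
}
```

The FP "mode" only adds alignment logic around logic that was already there, which is the kind of reuse I mean: more specialization, but not necessarily more fixed-function-centric overall.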
As a counterexample to the claim that porting part of the functionality to software can lead to a simpler implementation, I should point out the x87.
Well, yeah. The only thing I can reply is that beyond the usual advantages/disadvantages of a given trade-off, the quality of the specific design choice is of very high importance. For example, simply off-loading everything to a fixed-function accelerator isn't going to improve the design's efficiency if the accelerator's implementation is awful...
The dichotomy between specialization and unification is a valid one, I guess.
I tend to think of it as specialization versus generality, since unification implies things about physical location and sort of betrays using GPUs as a starting point. I'm not sure how often "unification" as a term dominates chip discussions not related to the graphics pipeline, although it has been used before.
In theory, we could have separate fully programmable units that aren't specialized, but still physically divided.
We could have specialized units pulled into a functional group or cluster, which leaves the hardware at least partially specialized, but physically unified.
Okay, I think I see the main reason we're not looking at it the same way: you're considering where the processor physically is; I'm considering what executes the computation. The basic example would be a chip with a CPU, a DSP, and a bunch of other logic. I don't really care about whether the CPU and DSP are next to each other or at opposite corners of the chip; in either case, I might still have the choice of running a piece of signal processing software on either the CPU or the DSP. These computations would run on a highly unspecialized processor if done on the CPU, and on a more or less specialized one if done on the DSP.
Based on that definition, you could argue one chip is more specialized than another aimed at the same workload if its computations run in more specialized blocks *on average*. So to increase hardware specialization for that piece of software, you might add fixed-function accelerators or a more specialized DSP, and to reduce hardware specialization you remove the DSP and run everything on the CPU (assuming it's fast enough for the intended market). In the latter case, hardware unification has clearly increased - efficiency has also gone down, unless average DSP utilization was very low.
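Just to illustrate what I mean by 'on average', here's a trivial sketch with entirely made-up numbers (the 'specialization scores' and cycle counts are hypothetical, not a metric anyone actually uses):

```c
#include <stdio.h>

/* Hypothetical illustration of "specialization on average": each block
 * that can execute work gets a made-up specialization score, and the
 * workload's overall hardware specialization is the score of whatever
 * actually ran it, weighted by the cycles it spent there. */

typedef struct {
    const char *unit;    /* where the work ran                        */
    double      spec;    /* 0 = fully general, 1 = fixed-function     */
    double      cycles;  /* share of the workload's cycles            */
} placement;

static double avg_specialization(const placement *p, int n)
{
    double weighted = 0.0, total = 0.0;
    for (int i = 0; i < n; i++) {
        weighted += p[i].spec * p[i].cycles;
        total    += p[i].cycles;
    }
    return total > 0.0 ? weighted / total : 0.0;
}

int main(void)
{
    /* The same workload mapped two ways onto a hypothetical chip. */
    placement with_dsp[] = {
        { "CPU",        0.1, 40 },
        { "DSP",        0.6, 50 },
        { "FF decoder", 0.9, 10 },
    };
    placement cpu_only[] = {
        { "CPU",        0.1, 100 },
    };

    printf("with DSP: %.2f\n", avg_specialization(with_dsp, 3));
    printf("CPU only: %.2f\n", avg_specialization(cpu_only, 1));
    return 0;
}
```

Removing the DSP and decoder drops the average from 0.43 to 0.10 in this toy example: unification up, specialization down, and whether efficiency follows depends on how well those blocks were actually being used.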
The problem when you think about it in terms of generality is that you risk confusing it with programmability. To take an extreme case, I could imagine a processor that is a Universal Turing Machine also being more specialized (accelerators, VLIW, FP, etc.) than one which is not, resulting in lower hardware unification. In my opinion, you do need to be able to make a distinction between the two.
The specific case of identical (possibly even copy-pasted at the silicon level!) processor cores on the same chip is a problematic one from a semantic point of view. The 'easy way out' is to remember that unification is always relative: compared to a processor with only one such core but more fixed-function hardware to achieve a similar performance level for the given workload, unification is clearly higher. In the GPU world, you could imagine offloading some or all of triangle setup to the shader core, possibly also allowing for more ALUs in a given die size... This would clearly increase unification, but not necessarily generality or programmability.
One thing that could also be explored is the implied position that, just because not every algorithm is serial, the serial hardware pipeline isn't the fastest implementation.
Technically, in the absence of other physical constraints, it possibly could be.
If we could arbitrarily speed up a single-threaded pipeline, its speedup would be much more universal than a parallel one.
You want to look at unrealistic corner cases? I can do that too!
Assuming that information cannot move faster than the speed of light, and that an internal movement of information is necessary for computation, there must be an upper limit to maximum serial performance. Now, assuming that the algorithm is truly serial all steps of the way and that there is always some overhead to adding the potential for parallelism, I guess you could argue that there are corner cases where serial hardware must indeed be faster. I'm not sure it really matters in practice, though.
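For what it's worth, here's the back-of-the-envelope version of that argument, assuming (my numbers, purely illustrative) that one serial step requires information to cross a 20 mm die at the speed of light - real wires are much slower, so the true ceiling is far lower than this:

```c
#include <stdio.h>

/* Back-of-the-envelope version of the speed-of-light argument above.
 * Assumption: one "serial step" needs information to cross a 20 mm die,
 * and nothing propagates faster than c. */

int main(void)
{
    const double c        = 3.0e8;    /* m/s */
    const double die_size = 0.020;    /* 20 mm, arbitrary */

    double t_cross = die_size / c;    /* seconds per crossing   */
    double f_max   = 1.0 / t_cross;   /* crossings per second   */

    printf("one crossing: %.1f ps\n", t_cross * 1e12);
    printf("upper bound : %.0f GHz\n", f_max / 1e9);
    return 0;
}
```

That works out to roughly 67 ps per crossing, or about 15 GHz as an absolute ceiling under those (generous) assumptions - finite, which is the only point that matters for the argument.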
The claim that statistical speculation is always inferior to static programming guided by knowledge of the problem is not technically true, either.
Some problems are of a nature that you can know fully well what the various optimal software behaviors are, but the inflexibility of static software or the lack of dynamic information leads to inferior results.
Hmm, that's indeed a very good point. The only reply I could come up with is that, theoretically, that's a limitation of the instruction set; if the hardware can gather statistics about something, I could just as well force-feed them to it. But assuming imperfect knowledge of the data (at least on the compiler's side!), even then that's not necessarily better than speculating about it in hardware.
In practice, the proof is in the pudding, and clearly statistical speculation with no explicit static knowledge hits its limits fairly quickly. It's probably true, however, that for optimal results you'd want a pragmatic combination of the two, and that's something I did a very bad job of pointing out.
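Just so we're talking about the same thing, this is the kind of 'statistical speculation with no static knowledge' I have in mind - a minimal sketch of the classic 2-bit saturating counter used in dynamic branch predictors (the outcome sequence is made up), as opposed to a compiler hint that fixes the prediction once and for all:

```c
#include <stdio.h>

/* Minimal sketch of purely statistical speculation: a single 2-bit
 * saturating counter, the classic dynamic branch predictor building
 * block. A static compiler hint would fix the prediction once;
 * the counter adapts to whatever it observes at runtime. */

static int counter = 2;          /* 0..1 predict not-taken, 2..3 taken */

static int predict(void)         { return counter >= 2; }

static void update(int taken)
{
    if (taken  && counter < 3) counter++;
    if (!taken && counter > 0) counter--;
}

int main(void)
{
    /* Branch outcomes: mostly taken, with occasional not-taken. */
    int outcomes[] = { 1, 1, 1, 0, 1, 1, 0, 1, 1, 1 };
    int correct = 0, n = sizeof outcomes / sizeof outcomes[0];

    for (int i = 0; i < n; i++) {
        if (predict() == outcomes[i]) correct++;
        update(outcomes[i]);
    }
    printf("dynamic predictor: %d/%d correct\n", correct, n);
    return 0;
}
```

It does fine on this stream with no static knowledge at all, but your point stands: with the right dynamic information exposed to software, or a combination of the two, you could in principle do at least as well.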
The claim in an *Advanced* section about using the maximum number of NAND gates in parallel in a given stage does run into physical constraints. Circuits don't do unbounded fan-out and fan-in.
There's also a philosophical question about whether Von Neumann's machine works so well in part because it maps so well to the inflexibility of silicon.
The circuit point makes sense. So much for trying to be 'advanced'...
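For the record, here's roughly why it breaks down, as I understand it: with a realistic fan-in limit k, an N-input gate has to be built as a tree of k-input gates, so the 'single stage' actually has a depth around log base k of N. Quick sketch (the fan-in of 4 is just an arbitrary example):

```c
#include <stdio.h>

/* Why "unbounded NAND in one stage" isn't physical: with a fan-in
 * limit, an N-input gate becomes a tree of smaller gates, and the
 * depth of that tree grows with N. */

static int tree_depth(int inputs, int fan_in)
{
    int depth = 0;
    while (inputs > 1) {
        inputs = (inputs + fan_in - 1) / fan_in;  /* gates at next level */
        depth++;
    }
    return depth;
}

int main(void)
{
    int fan_in = 4;
    for (int n = 4; n <= 4096; n *= 4)
        printf("%4d inputs -> depth %d\n", n, tree_depth(n, fan_in));
    return 0;
}
```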
As for the second one, hmmm, that looks like a good question to me at first glance, but I never really thought about it in these terms - it's 1:42AM as I write this sentence, so probably not a good idea to start thinking about it now, hehe.
Debates about specialization and generalization are moot if you're an FPGA or neuroscience nut...
I'm not sure about that. From what I've heard, modern FPGAs are not practical to reconfigure non-stop in real time - so you still need to implement the right level of configurability, or even programmability, in what you choose to put on your FPGA. As for the human brain, obviously the neurons are all very similar, but there is still specialization into processing 'modules', both physically and functionally. This is probably in part because of the links each neuron has created with other neurons, and how those neurons connect, directly or indirectly, to external sensors like the eyes; perhaps also in part because of some kind of storage inside each neuron that somehow influences its future responses to signals.
However, I guess your point is that there definitely are cases where it matters less, and that seems correct. You'll have to excuse me while I focus on the ones where it does, then...