What is really awkward in those charts is that they're comparing a GK110 cluster with a GM107 cluster; I do get the point the slide is trying to make even with 192 vs. 4*32. Are there no dedicated FP units in Maxwell, or were some of the marketing guys just too overeager and thought 256 looked "prettier" on the left?
I agree, this is odd [I assume by "fp" you're referring to the dp units]. Earlier I had asked where the dp units were and the logical response was that these units weren't in the block diagram in the previous release; now we're sitting here with these two charts which would seem to contradict that point. I don't think I buy the aesthetic or ignorance arguments -- not with Tegra K1 throwing around the 192 term with abandon and the marketing team making crop circles. A more likely explanation would be the dilution of "192" as a magic marketing term, but even that seems like a stretch. So, are the smaller Maxwell chips not dp-capable?
That seems likely at first blush -- we've violently agreed that there's no real market being fulfilled, so it seems completely reasonable to bifurcate your product line. But, then, why put the GK110 slide next to your GMxx7 slide? If there's no expectation for dp in your consumer product line, why raise the issue?
Another possibility is that these two models are somehow comparable, so the GM107 is capable of dp. Would it be possible that they made the alus capable of half-rate dp? Half of 128 alus is comparable to the 64 units in GK110, and presumably half-rate logic is cheaper by area and power.
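To put numbers on that guess, here is a back-of-the-envelope sketch in Python. The unit counts are the ones from the two slides; the half-rate figure is purely my speculation, not anything NVIDIA has stated, and the function name and its parameters are made up for illustration.

```python
# Per-cluster unit counts as shown on the two slides under discussion.
GK110_SMX_FP32 = 192      # single-precision alus per SMX
GK110_SMX_DP = 64         # dedicated dp units per SMX
GM107_SMM_FP32 = 4 * 32   # four blocks of 32 alus per SMM


def dp_lanes_per_clock(fp32_alus, dp_rate_divisor=None, dedicated_dp=0):
    """dp results per clock for one cluster (hypothetical helper).

    dp_rate_divisor: None if the fp32 alus cannot do dp at all,
    2 for half-rate, 4 for quarter-rate, and so on.
    dedicated_dp: number of separate full-rate dp units, if any.
    """
    shared = 0 if dp_rate_divisor is None else fp32_alus // dp_rate_divisor
    return shared + dedicated_dp


# GK110: dp comes only from the dedicated units.
gk110_dp = dp_lanes_per_clock(GK110_SMX_FP32, dedicated_dp=GK110_SMX_DP)

# GM107: ASSUMPTION, no dedicated units but every alu runs dp at half rate.
gm107_dp_if_half_rate = dp_lanes_per_clock(GM107_SMM_FP32, dp_rate_divisor=2)

print("GK110 SMX boxes on the slide:", GK110_SMX_FP32 + GK110_SMX_DP)
print("GK110 SMX dp lanes/clock:", gk110_dp)
print("GM107 SMM dp lanes/clock (if half-rate):", gm107_dp_if_half_rate)
```

Under that assumption both clusters come out at 64 dp lanes per clock, which is the only reason the side-by-side comparison would make sense to me. It says nothing about clocks or the number of clusters per chip.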
Also curious -- the wording of increased performance per alu. Had the increase in performance been compared at the SMX to SMM level, an increase in utilization would be the reasonable assumption, but at the alu level, it implies the alu is capable of more. Can you issue separate mul & add, fp32 & int32, are there currently instructions that take more than one clock cycle to issue that can now get better throughput, or is there something else? I similarly find it odd that there is one scheduler and two dispatchers per 32 alus -- why does one need two dispatchers? 16 alu-wide dispatch? One external (tmu/sfu/???) and one internal (in which case, why co-locate them in the diagram)? Or is there some kind of co-issuing being done here? [Or, are those not dispatch units?]
Lots of questions, I wonder how many answers we'll get on Tuesday....