Is it just me, or is there no antenna visible on either die shot?
My other question is the TDP: isn't 5 watts a bit high?
nVidia is well known for painting over die shots. That said, I don't think any SoC has an integrated
antenna, or even an RF transceiver. I think you're thinking of an integrated baseband, which is mostly logic with some mixed-signal interface stuff.
And no, 5W isn't high if we're talking full tilt for the whole SoC or even just the CPU cores. Several mobile SoCs today can hit around that point and are allowed to maintain it for at least some duration of time. For tablets that may even be okay. Qualcomm for example has stated that Snapdragon 800 is meant to consume 5W in tablet environments, IIRC.
Let's say we're really talking about just those two CPU cores at 5W, or 2.5W each. For such a big, bad, wide CPU at 2.5GHz, on let's say TSMC 20nm, I would be surprised to see power consumption that low. On the process note, Anandtech says that because nVidia already has silicon back, the Denver K1 must be 28nm. I call bunk; nVidia could have gotten 20nm engineering-sample silicon back by now.
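Just to put rough numbers on that, here's a back-of-envelope dynamic power calculation (P ≈ C_eff·V²·f). Every figure in it is an assumption for illustration - nVidia hasn't published capacitance or voltage numbers:

```c
/* Back-of-envelope dynamic power: P ≈ C_eff * V^2 * f.
   All numbers below are assumptions for illustration, not Denver specs. */
#include <stdio.h>

int main(void) {
    double c_eff = 1.2e-9;   /* assumed effective switched capacitance, ~1.2 nF per core */
    double v     = 0.95;     /* assumed supply voltage in volts */
    double f     = 2.5e9;    /* 2.5 GHz clock */
    double p     = c_eff * v * v * f;
    printf("Estimated dynamic power per core: %.2f W\n", p);  /* ~2.7 W with these numbers */
    return 0;
}
```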
Here's a copy + paste of the post I made on Anandtech's forums; apologies for cross-posting if that bugs anyone:
"Very surprised to hear Denver is hitting an SoC before Parker. This is a pretty aggressive move for nVidia, at least suggesting a six month cadence with genuinely different SoCs that are both targeting the high end. With the first one kind of laughably following not far off the heels of Tegra 4i.
These "K1v2" figures do seem... weird. 7-way superscalar? I can't think of a single even remotely general purpose CPU that's so wide at the decode level, not even IBM's POWER processors. It might be technically feasible, especially if they can only handle that throughput in AArch64 mode. But the cost of being able to actually rename the operands of 7 instructions is high, finding enough parallelism to actually even come close to using that decode bandwidth even a small percentage of the time is slim, and they'd need a much wider than 7 ports backend to facilitate all that execution which means a lot in terms of register file ports, forwarding network complexity, and so on. I also don't think they'd get terribly far without quite a bit of L1 cache parallelism which isn't cheap. I could possibly see them sort of reaching this width if it involves SMT, but even then it seems pretty overboard. Maybe not if it's not full SMT and there are limits to what a single thread can utilize.
What I'm suspicious of is that they're counting the widths of A15 and Denver at different points in the pipeline. It would make sense to have 7 execution ports/pipelines; counted that way, A15 has 8. I've seen some (scarce, unfortunately) mention that A57 is consolidating the number of ports, which is almost certainly a power consumption optimization. So 7 seems like more than enough even for a pretty aggressive high-end design - after all, Ivy Bridge only had 6 (while Haswell extended it to 8).
I know Cyclone was purported to be capable of decoding 6 instructions per cycle (and sustaining 6 IPC execution) but until I see the exact methodology of this test I'm skeptical of it as well.
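For reference, the usual way these sustained-IPC claims get probed is with a microbenchmark along these lines: a long stream of independent integer ops, timed against an assumed clock. This is just a sketch of the approach - the 2.5GHz figure and the loop body are my own assumptions, and you'd want to inspect the generated code (or hand-write the loop in assembly) so the compiler doesn't fold the adds away:

```c
/* Sketch of a sustained-IPC microbenchmark: six independent adds per
   iteration, timed, with IPC derived from an assumed core clock.
   Compile without aggressive optimization, or the loop gets folded. */
#include <stdio.h>
#include <time.h>

#define ITERS    100000000L
#define CLOCK_HZ 2.5e9   /* assumed core clock, replace with the real one */

int main(void) {
    long a = 1, b = 2, c = 3, d = 4, e = 5, f = 6;
    volatile long sink;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < ITERS; i++) {
        /* six independent adds: no data dependence between them */
        a += 1; b += 2; c += 3; d += 4; e += 5; f += 6;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    sink = a + b + c + d + e + f;
    (void)sink;
    double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double insns = 6.0 * ITERS;   /* ignoring loop overhead */
    printf("approx IPC: %.2f\n", insns / (secs * CLOCK_HZ));
    return 0;
}
```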
One other consideration is that some of that number may be accounted for by instruction fusion. This could include x86-style branch fusion but possibly other classes of instructions as well, although none immediately spring to mind.
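As a hypothetical example of the sort of pair fusion usually targets - the compare + conditional branch below is the classic candidate; this is purely illustrative, not anything confirmed about Denver:

```c
/* Hypothetical illustration of compare-and-branch fusion. A decoder that
   fuses an adjacent compare + conditional branch would treat the two
   architectural instructions in the loop body as one macro-op, which can
   inflate "width" numbers depending on where you count. */
#include <stdio.h>

static int count_negatives(const int *a, int n) {
    int c = 0;
    for (int i = 0; i < n; i++) {
        if (a[i] < 0)   /* roughly: cmp wX, #0 ; b.ge skip -- a classic fusion candidate */
            c++;
    }
    return c;
}

int main(void) {
    int v[] = { 3, -1, 4, -1, -5 };
    printf("%d\n", count_negatives(v, 5));   /* prints 3 */
    return 0;
}
```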
The 128KB L1 figure, presumably for the instruction cache, is also far out there. The only place I recall seeing such a large L1 instruction cache was Itanium, where the VLIW-ish nature of the instructions led to relatively low code density. A possible consideration here is that some of the frontend, including the L1 icache, is shared between the two cores, Bulldozer style. Would be interesting, to say the least, although even Steamroller still doesn't reach such a big shared L1 icache. I hope they're not actually storing decoded instructions in some wider format; that seems like it'd be pretty wasteful even as a strategy to support AArch32 + AArch64.
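To put that capacity in perspective (the 8-byte decoded-op size below is purely my assumption):

```c
/* Effective instruction capacity of a 128 KB L1 icache under different
   storage formats. The 8-byte "decoded op" size is an assumed figure for
   illustration, not anything nVidia has disclosed. */
#include <stdio.h>

int main(void) {
    const int size = 128 * 1024;
    printf("4-byte ARM encodings:  %d instructions\n", size / 4);       /* 32768 */
    printf("8-byte decoded format: %d instructions\n", size / 8);       /* 16384 */
    printf("typical 32 KB icache:  %d instructions\n", 32 * 1024 / 4);  /*  8192 */
    return 0;
}
```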
With such big caches and such wide execution at such a (relatively) high clock speed, we could be looking at long pipelines and long L1 latencies, coupled with some really deep OoOE buffering to try to keep up with it. In other words, some relatively gigantic cores, which is more or less what you'd expect with nVidia only offering two of them. Unlike Apple, though, they have the most to stand behind on core counts, since they offered the first mobile quad-core SoC with Tegra 3 and defended it pretty aggressively. I don't think they'd go dual core here unless the cost difference was huge; I think they'd go quad core even if it meant they could only run all four at a greatly reduced frequency (which is basically what Tegra 4 does anyway).
Also, no mention of a companion core for Denver, and I don't think we'll see one. Pairing an A7 cluster with it would be very interesting, and would mean quite a particular design investment on nVidia's part, which I don't see happening. But who knows - we didn't learn about Tegra 3's companion core until pretty late in the game.
So that's two things nVidia would have to eat some crow on... which I doubt we'll see much actual discourse on, but it'd be pretty fun.
Two final thoughts: I wonder if the Denver part is legitimately meant to replace the A15 part, or if the former is going to target phones while the latter targets tablets or even beyond that. If that's the case, then it's possible nVidia will continue to license ARM cores for some parts, and this isn't just a time-to-market feasibility thing. Lastly, I noticed that nVidia had quite good documentation in its anti-Qualcomm propaganda.. er.. technical white papers, where they went into a fair amount of detail on how A15 operates. Now that they're using their own core, I sure hope we get something even more thorough. Compare that to Qualcomm, who say nary a thing about their tech. Fingers crossed."