The wide issue width could point to an LIW or VLIW processor core for Nvidia, if the patents turn out to be implemented.
The patent notes that there is a hardware decoder for non-native instructions and that native instructions bypass it. There's some ambiguity here as to whether that means there's a decode engine plus an implied native decoder, or whether Nvidia is going for a really old-school VLIW internal representation. I'd feel more comfortable with the idea that at least some level of decode is still present, even if the patent doesn't mention it.
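The bypass the patent describes can be sketched as a two-path front end. To be clear, this is purely illustrative: the tags, formats, and function names below are my own invention, not anything from the patent.

```python
# Illustrative model of a front end where native instructions bypass
# the hardware decoder and non-native ones are translated first.
# All names and encodings here are hypothetical, not from the patent.

NATIVE_TAG = "native"

def decode_non_native(insn):
    """Hardware decoder path: expand a non-native (e.g. ARM)
    instruction into one or more native ops."""
    return [("native_op", insn)]

def front_end(fetch_stream):
    ops = []
    for insn in fetch_stream:
        kind, payload = insn
        if kind == NATIVE_TAG:
            ops.append(payload)                   # bypass: already native
        else:
            ops.extend(decode_non_native(insn))   # translate non-native
    return ops

print(front_end([("native", "op1"), ("arm", "ldr")]))
```

The open question is whether the "native" path above still involves some decoding of its own, or whether native code really is raw internal format.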
Since Nvidia wants into HPC, it should be encouraged to disclose details like this.
On the other hand, if it is a code-morphing processor, it's a mark against transparency if Transmeta's behavior is any guide. RWT had an article about reverse-engineering it.
One handy possibility isn't so much ARM or x86 compatibility as compatibility with the GPU ISA: the core could take over a kernel if, for example, it showed bad divergence or small granularity.
Kepler's inclusion of dependence data in its ISA would make creating instruction packets easier, but that's more a what-if and not a KNL speculation.
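As a toy illustration of why compiler-encoded dependence info helps packetization: if each op carries a flag saying it must wait on an earlier in-flight result, grouping independent ops into issue packets is a simple scan, with no hardware dependence check needed. The encoding below is entirely invented; Kepler's actual control-word format is not publicly documented in detail.

```python
# Toy packetizer: each op carries a compiler-supplied "waits" flag
# (hypothetical encoding) marking a dependence on an earlier op.
# Independent ops are packed together up to the issue width.

def packetize(ops, width=2):
    packets, current = [], []
    for op in ops:
        if op["waits"] or len(current) == width:
            if current:
                packets.append(current)
            current = [op]
        else:
            current.append(op)
    if current:
        packets.append(current)
    return packets

ops = [
    {"name": "mul", "waits": False},
    {"name": "add", "waits": False},  # independent: joins the same packet
    {"name": "st",  "waits": True},   # depends on an earlier result: new packet
]
print([[o["name"] for o in p] for p in packetize(ops)])
# [['mul', 'add'], ['st']]
```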
As far as KNL goes, if it uses Silvermont as a basis, it will be interesting to see how the core handles a 4-way SMT setup combined with the vector requirements.
That core's rename resources are too small for 4-way threading to avoid significant resource pressure; the load and store queues are going to fill up with outstanding accesses with that many threads in a stream-compute context; and the FP rename capacity is simply not big enough for an AVX3 32-register context.
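To put a number on the FP register point, a back-of-the-envelope calculation: just the architectural vector state for four threads of a 32-register, 512-bit (AVX3-class) context is 8 KB, before a single rename register is added.

```python
# Back-of-the-envelope: architectural vector register state per core
# for a hypothetical 4-way SMT core with an AVX3-class register file.
REG_WIDTH_BITS = 512   # AVX3 / AVX-512 vector width
ARCH_REGS = 32         # architectural vector registers per thread
THREADS = 4            # 4-way SMT

bytes_per_thread = ARCH_REGS * REG_WIDTH_BITS // 8   # 2 KB per thread
total_bytes = bytes_per_thread * THREADS             # 8 KB per core

print(bytes_per_thread, total_bytes)  # 2048 8192
```

That dwarfs the FP rename capacity of a small out-of-order core like Silvermont.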
The design choices in Silvermont would point to a multithreading scenario where register files get duplicated per thread, rather than renaming via pointers into a shared physical register file as Haswell and the like do.
I'm curious if the explicit breakout of the VPU means a new domain or possibly a coprocessor model like AMD's FPU, or something even more separate.
The VPUs are going to make heavy use of gather, keep many accesses outstanding, tolerate latency, and want a lot of bandwidth. Itanium, for comparison, routed FP accesses straight to the L2.
What if they combined all of this and gave the VPU a separate ALU and memory domain that hooks into that new, custom L2?
(edit: I'm not sure how they'd approach the reorder buffer; sharing one at the current size risks it becoming a hazard, but its size and whether it is shared or duplicated are both knobs that can be turned)