Since Shader Model 2.0, there has been no specification for "dual issue" at all. The token stream comes in and it's up to the driver's compiler to pack things optimally for the HW on which the shader will be run.
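To make that concrete, here's a rough, hypothetical C++ sketch of the kind of independence a driver compiler hunts for when packing for vec3+scalar co-issue hardware; all names and values are invented for illustration, not taken from any real shader:

```cpp
#include <cstdio>

struct Vec3 { float x, y, z; };

Vec3 scale(Vec3 v, float s) { return { v.x * s, v.y * s, v.z * s }; }

int main() {
    Vec3  baseColor = { 0.2f, 0.5f, 0.8f };
    float ndotl     = 0.75f;   // stand-in for a lighting term
    float fogFactor = 0.9f;    // independent scalar work

    // The rgb math and the alpha math share no data, so a compiler
    // targeting vec3+scalar co-issue hardware is free to pack them
    // into the same instruction slot.
    Vec3  rgb   = scale(baseColor, ndotl); // vec3 lane
    float alpha = 1.0f - fogFactor;        // scalar lane, no dependency on rgb

    std::printf("rgb = (%.2f, %.2f, %.2f), alpha = %.2f\n",
                rgb.x, rgb.y, rgb.z, alpha);
    return 0;
}
```

The point is just that nothing in the SM2.0+ token stream mandates or forbids that pairing; whether the two lanes actually co-issue is entirely the driver compiler's call.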
The main improvement of DX10.1 over DX10 is the increased flexibility of the pipeline. For example, you can specify that some interpolants be computed per sample instead of per pixel, allowing for better results when multisampling. These improvements increase the application's (and HW's) efficiency because they remove the need to multipass certain effects.
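In HLSL this shows up as the `sample` interpolation modifier on pixel shader inputs. As a language-neutral illustration, here's a toy C++ sketch of the difference between evaluating an interpolant once at the pixel center versus once per MSAA sample; the interpolant and sample pattern are invented for the example:

```cpp
#include <cstdio>

// Toy linear interpolant across the screen, standing in for a pixel shader
// input (e.g. a texture coordinate). Purely illustrative values throughout.
float interpolant(float x, float y) { return 0.5f * x + 0.25f * y; }

int main() {
    // 4x MSAA-style sample offsets within one pixel (hypothetical pattern).
    const float offs[4][2] = { {0.25f, 0.25f}, {0.75f, 0.25f},
                               {0.25f, 0.75f}, {0.75f, 0.75f} };
    float px = 10.0f, py = 20.0f; // pixel origin

    // DX10 default: one evaluation at the pixel center, shared by all samples.
    float center = interpolant(px + 0.5f, py + 0.5f);
    std::printf("per-pixel value : %f\n", center);

    // DX10.1 per-sample interpolation: one evaluation per sample position,
    // so gradient/edge detail survives into each MSAA sample.
    for (int s = 0; s < 4; ++s)
        std::printf("per-sample[%d]  : %f\n", s,
                    interpolant(px + offs[s][0], py + offs[s][1]));
    return 0;
}
```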
I'm not CGP, but isn't that basically what I said (or not)? Now they pack more things to be executed simultaneously. And after all, DX11 (multithreaded) is what DX10 was expected to be (as they claim in the other thread), had nV not crapped out some DX10-grafted versions of their earlier DX9.0c parts. Well, they had their own way of packing these samples inside their core.
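For what "multithreaded" actually buys you in DX11: deferred contexts let worker threads record command lists that the immediate context replays later. A minimal sketch against the Windows SDK, with error handling omitted for brevity:

```cpp
#include <d3d11.h>
#pragma comment(lib, "d3d11.lib")

int main() {
    ID3D11Device*        device    = nullptr;
    ID3D11DeviceContext* immediate = nullptr;
    D3D11CreateDevice(nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, 0,
                      nullptr, 0, D3D11_SDK_VERSION,
                      &device, nullptr, &immediate);

    // Deferred context: records commands without touching the GPU.
    // In a real engine this recording happens on a worker thread.
    ID3D11DeviceContext* deferred = nullptr;
    device->CreateDeferredContext(0, &deferred);

    // ... record state changes / draws on 'deferred' here ...

    ID3D11CommandList* cmdList = nullptr;
    deferred->FinishCommandList(FALSE, &cmdList);

    // Only the immediate context submits work to the GPU.
    immediate->ExecuteCommandList(cmdList, FALSE);

    cmdList->Release();
    deferred->Release();
    immediate->Release();
    device->Release();
    return 0;
}
```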
Meanwhile ATi can (theoretically) shuffle all data between their SIMDs easily, and the main hit to R600 performance came from its large but still insufficient global (inter-pipeline) cache. They did a pretty good job patching those cache issues with the DX10.1 superset, since all those samples no longer had to be flushed into the global cache, fragmenting it even more, just to store some final results, as it was originally supposed to do. Did I get it?
But they do use GFLOPS when trying to describe GPGPU computing power, and depending on how tightly smaller fragments can be packed and processed for full-width data execution, they get real performance closer to that peak (nominal) figure, while drawing the same amount of power (the same number of transistors), with all of them doing some useful job instead of just leaking current between drain and source.
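Back-of-envelope version of that peak-vs-achieved gap, with made-up numbers in the rough ballpark of an R600-class part (not quoted from any spec sheet):

```cpp
#include <cstdio>
#include <initializer_list>

int main() {
    const double alus        = 320;   // scalar ALUs (hypothetical count)
    const double flopsPerClk = 2;     // a MAD counts as multiply + add
    const double clockGHz    = 0.75;  // hypothetical clock

    const double peak = alus * flopsPerClk * clockGHz; // nominal GFLOPS

    // "Packing efficiency": the fraction of ALU lanes actually filled
    // with useful work each cycle. Tighter fragment packing pushes it
    // toward 1.0 on the same transistor/power budget.
    for (double eff : { 0.4, 0.7, 0.95 })
        std::printf("packing %3.0f%% -> %6.1f of %.1f peak GFLOPS\n",
                    eff * 100.0, peak * eff, peak);
    return 0;
}
```

Better packing moves the achieved number toward the nominal one without adding a single transistor, which is exactly the argument above.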