I was thinking about an architecture idea where one instruction is issued to multiple register files in a core. It would be implemented like this: you would have the opcode followed by two sets of three operands, one set per register file. You could also use vector instructions to run a lot of data on two or four threads in a core. You would only need to fetch and decode one long instruction for all of them, which should save a lot of fetch bandwidth and energy.
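To make the idea concrete, here is a rough software sketch of it, not a hardware design. It assumes each thread has its own register file (modeled as a plain list) and that the long instruction carries one opcode plus one (dst, src1, src2) operand set per register file. The names `LongInstruction`, `Operands`, `issue`, and the small `OPS` table are all made up for illustration.

```python
from dataclasses import dataclass
import operator

# Hypothetical opcode table; the opcodes listed here are only examples.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


@dataclass(frozen=True)
class Operands:
    """One set of three operands: destination and two source register indices."""
    dst: int
    src1: int
    src2: int


@dataclass(frozen=True)
class LongInstruction:
    """One opcode shared by several operand sets (one set per register file)."""
    opcode: str
    operand_sets: tuple


def issue(instr, register_files):
    """Decode the opcode once, then apply it to every register file."""
    if len(instr.operand_sets) != len(register_files):
        raise ValueError("need exactly one operand set per register file")
    op = OPS[instr.opcode]  # the single shared decode step
    for regs, ops in zip(register_files, instr.operand_sets):
        regs[ops.dst] = op(regs[ops.src1], regs[ops.src2])


# Two threads, each with its own 4-entry register file.
thread0 = [0, 10, 20, 0]
thread1 = [0, 3, 4, 0]

# One "add" with two operand sets: r3 = r1 + r2 on thread 0, r0 = r1 + r2 on thread 1.
instr = LongInstruction("add", (Operands(3, 1, 2), Operands(0, 1, 2)))
issue(instr, [thread0, thread1])

print(thread0)  # [0, 10, 20, 30]
print(thread1)  # [7, 3, 4, 0]
```

The point of the sketch is that the `OPS[instr.opcode]` lookup happens once per long instruction no matter how many register files it feeds, which is where the fetch and decode savings would come from. Going from two threads to four would just mean four operand sets in the same instruction.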