...I was hoping that AMD would take a big leap forward in their primitive processing performance. Nvidia is much faster at rejecting invisible triangles.
...
I actually had a conversation about instruction pre-fetch just a few weeks ago on Twitter. I couldn't find any public documents describing GCN (1.0-1.2) instruction pre-fetch. This is important to know if you want to build a "jump table" style shader system. Sort all shaders by GPR count and bucket them into GCN occupancy classes (http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2014/05/gcn_vgpr_table.png). This results in 10 different shaders in total (one per occupancy class), allowing you to execute any number of different shaders in just 10 compute dispatches. Nice trick for tiled/clustered lighting, for example (especially when combined with deferred texturing). Simple pre-fetch (load the first N instructions of each shader) is not a good idea for shaders like this, because execution jumps away from the start of the shader almost immediately. Pre-fetch after the jump would of course work just fine.
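A minimal sketch of the bucketing step, written in Python just to show the arithmetic. It assumes 256 VGPRs per SIMD lane, an allocation granularity of 4 VGPRs and at most 10 waves per SIMD, which is how I read the linked table. It ignores SGPR and LDS limits, and the shader names and VGPR counts are made up for illustration.

```python
from collections import defaultdict

# Assumptions (not taken from any official API):
VGPRS_PER_SIMD_LANE = 256    # VGPR budget shared by all waves on a SIMD
VGPR_ALLOC_GRANULARITY = 4   # VGPRs are allocated in blocks of 4
MAX_WAVES_PER_SIMD = 10      # occupancy cap, giving 10 classes


def max_waves(vgpr_count):
    """Return the occupancy class (waves per SIMD) for a VGPR count."""
    if not 1 <= vgpr_count <= VGPRS_PER_SIMD_LANE:
        raise ValueError("VGPR count out of range: %r" % vgpr_count)
    # Round the request up to the allocation granularity.
    blocks = -(-vgpr_count // VGPR_ALLOC_GRANULARITY)
    allocated = blocks * VGPR_ALLOC_GRANULARITY
    # As many waves as fit in the register file, capped at the hardware limit.
    return min(MAX_WAVES_PER_SIMD, VGPRS_PER_SIMD_LANE // allocated)


def bucket_shaders(shaders):
    """Group shader names by occupancy class.

    `shaders` maps a (hypothetical) shader name to its VGPR count.
    Each resulting bucket would become one jump-table shader and
    one compute dispatch.
    """
    buckets = defaultdict(list)
    for name, vgprs in sorted(shaders.items(), key=lambda kv: kv[1]):
        buckets[max_waves(vgprs)].append(name)
    return dict(buckets)


# Hypothetical material shaders and their VGPR counts.
example = {"lambert": 22, "skin": 30, "hair": 62, "cloth": 90}
example_buckets = bucket_shaders(example)
```

With this grouping, every shader in a bucket runs at the same occupancy, so merging them behind one jump does not cost the cheap shaders any waves. The number of dispatches is bounded by the number of occupancy classes, not by the number of shaders.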