DemoCoder said:
Since graphics is "embarrassingly parallel" and the workloads fit the profile of TLP instead of ILP, that suggests to me that the ideal graphics architecture maximizes concurrent threads, rather than trying to use ILP to increase instruction throughput within a thread. Bulking up each pipeline with more and more execution units will invariably lead to idle units, since the workloads won't always allow optimal scheduling/packing of multiple instructions into a multiple dispatch.
Ah, but what if increasing ILP is cheaper than adding full parallelism? In this case, there's some optimal amount of ILP, which will be architecture-dependent.
Edit: Oh, and this isn't exactly ILP, since the NV4x and G70 have two units in series, not in parallel (with the additional possibility of parallel instructions within each unit, though).
Particularly with the G70, though, the two units are quite similar, and thus can be kept active without worrying quite so much about which order the instructions come in.
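To make the "optimal amount of ILP" point a bit more concrete, here's a toy cost model in Python. It's only a sketch under made-up assumptions: the AVAILABLE_ILP workload, the pipeline_overhead parameter and the function names are all hypothetical, and it models units issuing in parallel, which is not how the NV4x or G70 actually arrange their two units. The idea is that each pipeline carries a fixed overhead (control logic, registers and so on) plus a cost per execution unit, and a pipeline can only issue as many instructions per cycle as the thread has independent ones ready.

```python
# Hypothetical workload: how many independent instructions one thread
# can offer in each cycle. These are made-up numbers, not measurements.
AVAILABLE_ILP = [1, 2, 1, 3, 1, 2, 4, 1, 2, 1]


def issued_per_cycle(width):
    """Average instructions issued per cycle by one pipeline with `width` units."""
    return sum(min(width, n) for n in AVAILABLE_ILP) / len(AVAILABLE_ILP)


def utilization(width):
    """Fraction of the pipeline's execution units that are busy on average."""
    return issued_per_cycle(width) / width


def throughput_per_cost(width, pipeline_overhead, unit_cost=1.0):
    """Issued instructions per cycle divided by an assumed pipeline cost.

    `pipeline_overhead` stands in for everything a whole extra pipeline
    needs besides its execution units (an assumption of this sketch).
    """
    cost = pipeline_overhead + width * unit_cost
    return issued_per_cycle(width) / cost


def best_width(pipeline_overhead, max_width=4):
    """Width (units per pipeline) that maximizes throughput per unit of cost."""
    return max(range(1, max_width + 1),
               key=lambda w: throughput_per_cost(w, pipeline_overhead))


if __name__ == "__main__":
    for w in range(1, 5):
        print(f"width {w}: utilization {utilization(w):.2f}")
    for overhead in (0, 2, 10):
        print(f"overhead {overhead}: best width {best_width(overhead)}")
```

In this model, utilization drops as units are added to a pipeline, which is DemoCoder's idle-units point. But once an extra pipeline has a non-trivial fixed cost, the best width moves above one unit, and where it lands depends on that overhead, i.e. on the architecture.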