No, don't! Use an index that you update only once per iteration. Most CPUs can address memory as base pointer plus index, so incrementing three pointers per loop is often a bad idea (apart from the code being bloody hard to read!).
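To make the comparison concrete, here is a minimal sketch of the two loop styles for an element-wise add. The function names `add_ptr` and `add_idx` are made up for illustration. Whether the indexed form is actually faster depends on the target and the compiler, since an optimizer may well turn one form into the other.

```c
#include <assert.h>
#include <stddef.h>

/* Pointer-bumping style: three pointer increments (plus the counter)
   are updated on every iteration. */
void add_ptr(float *dst, const float *a, const float *b, size_t n)
{
    assert(n == 0 || (dst && a && b));
    for (; n != 0; --n)
        *dst++ = *a++ + *b++;
}

/* Indexed style: only i changes per iteration; each access is
   expressed as base pointer + index, which many CPUs can encode
   directly in the load/store addressing mode. */
void add_idx(float *dst, const float *a, const float *b, size_t n)
{
    assert(n == 0 || (dst && a && b));
    for (size_t i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

Both functions compute the same result, e.g. `add_idx(out, x, y, 4)` for three 4-element `float` arrays. The only difference is how many values the loop body has to update.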
The only realistic alternative to multithreading for very wide issue processors is good old vector processing. Most of the time, though, parallelism that suits a vector processor suits a multithreaded one just as well, but not vice versa.
For a floating-point processor with a fair bit of cache I don't think VLIW makes much sense. Dual issue with SIMD is the sweet spot IMO. With control circuitry already being a very small factor, you might as well stay with superscalar.
They report IPC ranging from 100+ for hand-coded data-flow kernels (based on a synthesizable RTL model and cycle-accurate simulators) down to 1-2 on SPEC2000.