It's not that hard to generate code with a high degree of ILP in the arithmetic portions. Your big tool is loop unrolling combined with not reusing registers between iterations, which trades higher register usage for more usable ILP. Here's an example of what I'm talking about: a naive loop that serially sums up the numbers in a buffer.
Naive:
MOV $r9,BufferPointer   ; $r9 = current element address
MOV $r10,BufferSize     ; $r10 = bytes remaining
MOV $r0,0               ; $r0 = running sum
LOOP:
ADD $r0,$r0,[$r9]       ; accumulate one 4-byte element
ADD $r10,$r10,-4        ; 4 fewer bytes to go
ADD $r9,$r9,4           ; advance the pointer
JPGZ $r10,LOOP          ; loop while the remaining size is > 0
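For comparison, here's a rough C sketch of the naive version. The 32-bit element type and the size-in-bytes interface are my assumptions, chosen to match the asm's increments of 4:
#include <stdint.h>
#include <stddef.h>
/* Serial sum: one accumulator, so every add depends on the previous one. */
int32_t sum_naive(const int32_t *buf, size_t size_bytes)
{
    int32_t sum = 0;
    for (size_t i = 0; i < size_bytes / 4; i++)
        sum += buf[i];
    return sum;
}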
With loop unrolling (assume the element count is divisible by 4, i.e. the byte size is divisible by 16):
MOV $r9,BufferPointer
MOV $r10,BufferSize
MOV $r0,0
LOOP:
ADD $r0,$r0,[$r9]
ADD $r0,$r0,[$r9+4]
ADD $r0,$r0,[$r9+8]
ADD $r0,$r0,[$r9+12]
ADD $r10,$r10,-16
ADD $r9,$r9,16
JPGZ $r10,LOOP
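Roughly the same thing as a C sketch (same assumptions as above); note that the single accumulator still chains every add onto the previous one:
#include <stdint.h>
#include <stddef.h>
/* Unrolled by 4, but still one accumulator: the adds still form one long
   dependency chain. Assumes the element count is divisible by 4. */
int32_t sum_unrolled(const int32_t *buf, size_t size_bytes)
{
    int32_t sum = 0;
    for (size_t i = 0; i < size_bytes / 4; i += 4) {
        sum += buf[i];
        sum += buf[i + 1];
        sum += buf[i + 2];
        sum += buf[i + 3];
    }
    return sum;
}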
With loop unrolling and register optimization:
MOV $r9,BufferPointer
MOV $r10,BufferSize
MOV $r0,0
MOV $r1,0
MOV $r2,0
MOV $r3,0
LOOP:
ADD $r10,$r10,-16       ; counter update moved to the top, away from the jump that reads it
ADD $r0,$r0,[$r9]
ADD $r1,$r1,[$r9+4]
ADD $r2,$r2,[$r9+8]
ADD $r3,$r3,[$r9+12]
ADD $r9,$r9,16
JPGZ $r10,LOOP
ADD $r0,$r0,$r1         ; reduce the partial sums: $r0 = ($r0+$r1) + ($r2+$r3)
ADD $r1,$r2,$r3
ADD $r0,$r0,$r1
Without any unrolling, we waste a lot of time updating pointers and counters. Unrolling fixes this, making most of the instructions inside the loop contribute to the actual computation. However, the repeated adds to $r0 create read-after-write (RAW) hazards, meaning the pipeline will stall before each one! To fix that, we split the accumulation across 4 independent registers and reduce them at the end, which gets rid of all the RAW hazards between the adds. We also moved the loop-counter update to the top of the loop, so the decrement has time to complete before the jump reads it, avoiding the RAW hazard on the jump comparison.
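Here's the same multi-accumulator idea as a C sketch (again assuming 32-bit elements and an element count divisible by 4; whether a compiler does this transformation on its own depends on whether it's allowed to reassociate the additions):
#include <stdint.h>
#include <stddef.h>
/* Four independent accumulators: the four adds in an iteration have no
   RAW dependency on each other, so they can overlap in the pipeline. */
int32_t sum_unrolled_4acc(const int32_t *buf, size_t size_bytes)
{
    int32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (size_t i = 0; i < size_bytes / 4; i += 4) {
        s0 += buf[i];
        s1 += buf[i + 1];
        s2 += buf[i + 2];
        s3 += buf[i + 3];
    }
    /* Final reduction of the partial sums, mirroring the three ADDs after the loop. */
    return (s0 + s1) + (s2 + s3);
}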
Latencies between arithmetic instructions can be hidden fairly easily through this type of optimization, simply because they're so short. Things change on a superscalar CPU: a 6-cycle dependency that an in-order scalar core could cover with 6 independent instructions now needs, say, 4*6 = 24 of them, because 4 parallel pipelines are issuing every cycle! On top of that, the pipelines themselves are longer due to the high target clock speeds. This is where out-of-order execution (OoO) starts being relevant.
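On such a machine you just unroll further and add more accumulators. A hedged sketch of what a wider unroll might look like (8 accumulators here, purely illustrative; the right count depends on the actual latency and issue width of the machine):
#include <stdint.h>
#include <stddef.h>
/* Eight independent accumulators to feed a wider/longer pipeline.
   Assumes the element count is divisible by 8. */
int32_t sum_unrolled_8acc(const int32_t *buf, size_t size_bytes)
{
    int32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0, s4 = 0, s5 = 0, s6 = 0, s7 = 0;
    for (size_t i = 0; i < size_bytes / 4; i += 8) {
        s0 += buf[i];     s1 += buf[i + 1];
        s2 += buf[i + 2]; s3 += buf[i + 3];
        s4 += buf[i + 4]; s5 += buf[i + 5];
        s6 += buf[i + 6]; s7 += buf[i + 7];
    }
    return ((s0 + s1) + (s2 + s3)) + ((s4 + s5) + (s6 + s7));
}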
There is, of course, a pink elephant in the room: memory transactions. When you suddenly have to cover a 200-cycle access to DRAM, finding enough independent instructions to hide it is sketchy at best. Your best hope is a really good cache system that avoids as many of those transactions as possible, or else another handy thread to switch to.