As mentioned above, compact recursive code can probably be cached very easily.
On that count, at least, recursion does fine. On the other hand, a loop of similar code size would benefit just as much, and modern hardware design seems pretty firmly fixed on loop-based code as an optimization target.
There would be some complications resulting from how recursion is carried out. It is usually implemented as repeated function calls, and I am not sure whether that makes it harder to schedule instructions on an out-of-order processor. I know speculation can carry across branches, but I don't know enough to say whether it extends across a function call.
A function call usually means the processor has to put something on the stack, which means at the very least a cache access to store the return address and set up the stack frame. If there are local variables, they are placed in the stack frame as well.
On an out-of-order processor, or in a properly optimized and unrolled loop, this overhead is either hidden or reduced. In addition, intelligent loop unrolling can eliminate redundant variable allocation and skip a lot of address calculations.
I do not believe modern hardware is smart enough to recognize recursive calls and find more efficient ways to handle the stack and allocation. Software optimizations, such as a compiler's tail-call elimination, might be.
Then there's the danger of a stack overflow if the recursion goes too deep. Even trivial recursive solutions can exhaust the limited stack space available. How the stack behaves in a multiprocessor environment might also be an issue, if the processors repeatedly contend for access to it.
Unfortunately, I am not very proficient in the low-level details of recursion on modern hardware. I welcome corrections to any misconceptions I may have.