I'm no E.E., so if I've misinterpreted the predicted flow of events then feel free to correct me, as I have no inside information WRT this matter.
Well, I'm no E.E. either, but I do agree that hardware TMUs will probably go away eventually... at least for the more complicated parts. Bilinear filtering at 8 bits per component is one case that I've been told is a hell of a lot more efficient to do in hardware, but maybe trilinear, aniso, or >8-bit/component filtering would make sense to move into software.
What I was asking about, though, is this: when you said "rolled into ALUs", what extras in particular do you expect to be added to the ALUs? I mean, you can already do texture filtering entirely in software if you want, and once you start throwing in features to accelerate it (which tend to be, at a minimum, bilinear filtering math/hardware nearer to the memory), you're back to a TMU again...
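To make "entirely in software" concrete, here is a rough sketch of a bilinear fetch written as plain ALU work. The texture layout (a list of rows of 0..255 values), the clamp-to-edge addressing and the name bilinear_sample are all assumptions for illustration, not how any particular GPU or API does it.

```python
import math

def bilinear_sample(texture, u, v):
    """Bilinearly filter a single-channel 8-bit texture at normalized (u, v).

    Assumed layout: texture is a list of rows, each a list of ints in 0..255.
    Texel centers sit at half-integer coordinates; addressing clamps to edge.
    """
    height = len(texture)
    width = len(texture[0])

    # Move into texel space so that integer coordinates land on texel centers.
    x = u * width - 0.5
    y = v * height - 0.5
    x0 = math.floor(x)
    y0 = math.floor(y)
    fx = x - x0  # horizontal blend weight
    fy = y - y0  # vertical blend weight

    def fetch(ix, iy):
        # Clamp-to-edge addressing (an assumption; wrap would also work).
        ix = min(max(ix, 0), width - 1)
        iy = min(max(iy, 0), height - 1)
        return texture[iy][ix]

    # Four memory fetches and three lerps per sample.
    top = fetch(x0, y0) * (1 - fx) + fetch(x0 + 1, y0) * fx
    bottom = fetch(x0, y0 + 1) * (1 - fx) + fetch(x0 + 1, y0 + 1) * fx
    return top * (1 - fy) + bottom * fy
```

The arithmetic is trivial; the cost is in the address math and the four fetches per sample, which is exactly the part that dedicated hardware near the memory tends to take over.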
You only know your ray is going to hit an object when the intersection occurs, so I don't see it.
Ah, but if you trace nearby rays they will tend to follow the same path down the spatial acceleration structure, and thus will tend to be cache coherent. There are a number of improvements that you can make to the storage layout of the accelerators as well to help even further. Take a look at some of the Cell ray tracing literature for more info. The cache/local working set is explicitly managed there which makes it more obvious what's going on, but the concept is the same on a hardware-managed cache, except it's implicit.
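A toy way to see this, assuming a fixed-depth 2D quadtree over the unit square as a stand-in for a real accelerator (the structure, the function names and the node-path encoding are made up for illustration): record which nodes each ray touches and compare the sets.

```python
import math

def ray_hits_box(origin, direction, lo, hi):
    """2D slab test: does the ray (t >= 0) overlap the box [lo, hi]?"""
    t_near, t_far = 0.0, math.inf
    for o, d, l, h in zip(origin, direction, lo, hi):
        if d == 0.0:
            # Ray is parallel to this slab; it must start inside it.
            if o < l or o > h:
                return False
            continue
        t0, t1 = (l - o) / d, (h - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near = max(t_near, t0)
        t_far = min(t_far, t1)
    return t_near <= t_far

def visited_nodes(origin, direction, lo=(0.0, 0.0), hi=(1.0, 1.0), depth=4, path=""):
    """Return the set of quadtree nodes the ray touches, named by child-index path."""
    if not ray_hits_box(origin, direction, lo, hi):
        return set()
    nodes = {path}
    if depth == 0:
        return nodes
    mx, my = (lo[0] + hi[0]) / 2, (lo[1] + hi[1]) / 2
    children = [
        ((lo[0], lo[1]), (mx, my)),
        ((mx, lo[1]), (hi[0], my)),
        ((lo[0], my), (mx, hi[1])),
        ((mx, my), (hi[0], hi[1])),
    ]
    for index, (child_lo, child_hi) in enumerate(children):
        nodes |= visited_nodes(origin, direction, child_lo, child_hi,
                               depth - 1, path + str(index))
    return nodes

def overlap(a, b):
    """Fraction of touched nodes that two rays have in common (Jaccard index)."""
    return len(a & b) / len(a | b)

# Three horizontal rays: two neighbours and one far away.
ray_a = visited_nodes((-1.0, 0.30), (1.0, 0.0))
ray_near = visited_nodes((-1.0, 0.32), (1.0, 0.0))
ray_far = visited_nodes((-1.0, 0.80), (1.0, 0.0))
print(overlap(ray_a, ray_near), overlap(ray_a, ray_far))
```

The two neighbouring rays share every node above the leaf level, while the distant ray shares only the root. Shared nodes are shared memory fetches, which is where the cache coherence comes from.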
You could do raytracing down to a given level and then, if you end up with a branch with potentially visible voxels, give up on image-order rendering for a moment, switching instead to object-order rendering and splatting the voxels beneath that level onto the screen before continuing with raytracing (with the splatting providing Z-buffer data that gives all subsequent rays hitting that cube a trivial intersection test).
Right, well that's basically what an optimized rasterization engine does, except with fairly large "leaf nodes" since that keeps the number of batches down.
However, even if you ray trace down to the very finest level, coherent rays that all hit the same triangle will run in lockstep, with no incoherent branching and perfectly coherent caching too. This is basically the same case as a hierarchical rasterizer, which is why I take every opportunity to point out the insane amount of similarity to people who dogmatically approach the rasterization vs ray tracing argument.
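As a small illustration of the lockstep point, here is a sketch that runs a Moller-Trumbore ray/triangle test over a "packet" of rays and checks whether every ray takes the same hit/miss branch. The packet representation and the helper names are assumptions for illustration; a real SIMD tracer would evaluate the rays in vector lanes rather than a Python loop.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def intersect(origin, direction, v0, v1, v2, eps=1e-9):
    """Moller-Trumbore ray/triangle test; returns hit distance t or None."""
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < eps:
        return None  # ray is parallel to the triangle plane
    inv = 1.0 / det
    s = sub(origin, v0)
    u = dot(s, p) * inv
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(direction, q) * inv
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv
    return t if t > eps else None

def packet_is_coherent(origins, direction, triangle):
    """True if every ray in the packet takes the same hit/miss branch."""
    outcomes = {intersect(o, direction, *triangle) is not None for o in origins}
    return len(outcomes) == 1

triangle = ((0.0, 0.0, 5.0), (4.0, 0.0, 5.0), (0.0, 4.0, 5.0))
direction = (0.0, 0.0, 1.0)
# A 2x2 block of neighbouring rays, all landing inside the triangle.
tight_packet = [(1.0 + 0.01 * i, 1.0 + 0.01 * j, 0.0) for i in range(2) for j in range(2)]
# The same block with one ray replaced by a stray that misses.
stray_packet = tight_packet[:3] + [(10.0, 10.0, 0.0)]
print(packet_is_coherent(tight_packet, direction, triangle))
print(packet_is_coherent(stray_packet, direction, triangle))
```

When the whole packet agrees, one branch decision covers every lane, which is the same situation a rasterizer is in when a block of pixels sits fully inside a triangle.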