There are a few problems with that approach. First, the surface textures may affect how rays are cast. For example, a mirror that has a normal/height map to encode deformations requires texture lookups to properly bounce rays. Second, if an object is using 2048x2048 textures, there's no way you're going to fit that into cache. More likely, it will have more than one of these, and then you're really screwed.
No, you need to be more granular than just per object or surface. This is why I proposed the virtual tessellation system in my above post. It doesn't require any branching, other than a loop here and there, and it will partition the hits on any surface into a set of local areas. This way you only need to load a small piece of the texture into cache, and furthermore you know which areas you will be rendering next, so you can start preloading that data too. Then you can efficiently calculate all your interactions without ever thrashing the cache.