As I was thinking about what you said, I came to the realisation that I had confused photon mapping and ray tracing in my crazy mind. But then I came to see that ray tracing is really just a specialised form of photon mapping, where instead of scattering photons, you scatter observation points. So, using the same algorithm as for photon mapping, which is basically what I have here, you can scatter your observation points, taking care to maintain a tree of their dependencies (i.e. from this observation point, these child points were generated). Then at the end of the frame you simply walk the tree, gathering the observations to create your final image. Thus, you have the duality of scatter and gather.
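Here's a rough sketch of what I mean by the dependency tree and the gather walk. All the names, the node layout and the trivial averaging are just made up for illustration; a real renderer would obviously weight the children properly.

```cpp
// Minimal sketch of the scatter/gather duality: observation points get
// scattered into a dependency tree, then the tree is walked at the end of
// the frame to gather each point's contribution back up to its parent.
#include <memory>
#include <vector>

struct Colour { float r = 0, g = 0, b = 0; };

struct ObservationPoint {
    Colour local;                                              // light sampled at this point
    float weight = 1.0f;                                       // how much the children count
    std::vector<std::unique_ptr<ObservationPoint>> children;   // points spawned from this one
};

// Gather stage: walk the tree depth-first, folding each child's gathered
// colour back into its parent.
Colour gather(const ObservationPoint& p) {
    Colour result = p.local;
    if (!p.children.empty()) {
        Colour sum;
        for (const auto& child : p.children) {
            Colour c = gather(*child);
            sum.r += c.r; sum.g += c.g; sum.b += c.b;
        }
        float inv = p.weight / static_cast<float>(p.children.size());
        result.r += sum.r * inv;
        result.g += sum.g * inv;
        result.b += sum.b * inv;
    }
    return result;
}

int main() {
    // One primary observation point (a camera ray hit) that scattered two children.
    ObservationPoint primary;
    primary.local = {0.2f, 0.2f, 0.2f};
    auto a = std::make_unique<ObservationPoint>();
    a->local = {1.0f, 0.0f, 0.0f};
    auto b = std::make_unique<ObservationPoint>();
    b->local = {0.0f, 1.0f, 0.0f};
    primary.children.push_back(std::move(a));
    primary.children.push_back(std::move(b));
    Colour final = gather(primary);   // the per-pixel result you'd write to the image
    return final.r > 0 ? 0 : 1;
}
```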
I further concluded that the reason rasterisation is faster is that it doesn't require the scatter step; instead it gathers the contributions directly from the geometry itself. Of course, that is less dynamic than if you include the scatter stage. Plus, if you are doing photon mapping for GI, which I assume is the case since we are talking about very advanced lighting, you can scatter your observers alongside the photons, leaving only the gather stage and making it conceptually just as fast as rasterisation.
Now, as for the issues with the algorithm, the idea isn't to have nodes absolutely locked onto a region, but to give them an awareness of the overall flow of points (be they photons or observers) and exploit the efficiencies such knowledge allows. If you know a certain volume of points tends to flow through a certain region of space, you can make a proportional number of processors responsible for that space. This keeps the data associated with a region of space in cache, saving system bandwidth at the expense of inter-node bandwidth, which is much cheaper assuming all the nodes are on one chip. Of course, to get this information you have to be running the simulation, so the hardware has to be able to respond to changes in flow as they occur. This is similar to the efficiencies of the whole unified shader thing, and it's where the PPE in the CELL would probably come into play as a manager.
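To make the proportional assignment concrete, here's a toy sketch. The region counts, the node budget and the crude leftover handling are all assumptions; the point is just that a manager (the PPE, say) could redo this cheaply every frame as the flow changes.

```cpp
// Toy sketch: assign a fixed budget of processing nodes to spatial regions in
// proportion to how many points flowed through each region last frame.
#include <cstdio>
#include <vector>

std::vector<int> assignNodes(const std::vector<int>& pointsPerRegion, int totalNodes) {
    long long total = 0;
    for (int c : pointsPerRegion) total += c;
    std::vector<int> nodes(pointsPerRegion.size(), 0);
    if (total == 0) return nodes;

    int assigned = 0;
    for (size_t i = 0; i < pointsPerRegion.size(); ++i) {
        // Proportional share, rounded down; busier regions get more nodes.
        nodes[i] = static_cast<int>((static_cast<long long>(pointsPerRegion[i]) * totalNodes) / total);
        assigned += nodes[i];
    }
    // Hand any leftover nodes to the single busiest region (simplistic, but
    // enough to show the idea).
    while (assigned < totalNodes) {
        size_t busiest = 0;
        for (size_t i = 1; i < pointsPerRegion.size(); ++i)
            if (pointsPerRegion[i] > pointsPerRegion[busiest]) busiest = i;
        ++nodes[busiest];
        ++assigned;
    }
    return nodes;
}

int main() {
    std::vector<int> counts = {1200, 300, 50, 4500};   // points seen per region last frame
    std::vector<int> nodes = assignNodes(counts, 7);   // e.g. 7 SPE-like worker nodes
    for (size_t i = 0; i < nodes.size(); ++i)
        std::printf("region %zu -> %d nodes\n", i, nodes[i]);
    return 0;
}
```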
Looking at the converse, when a ray goes in an odd direction, diverging from the main flow and entering an area with little activity, this can be solved with a bit of caching in a central manager (the PPE). You simply store the ray in a list associated with its region, waiting for a sufficient volume to build up to warrant assigning a processing node. Collectively, these lists should be small enough to remain in the L2, or at least mostly in it. At the end of the main body of processing any remaining points can be dealt with if there is enough time, or ignored entirely. Ignoring them should be fairly safe, because if a point is just a straggler in a region with no major flow, its small contribution should be of little consequence. It's a sort of built-in importance sampling.
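Something along these lines is what I'm picturing for the manager. The Ray struct, the threshold and the dispatch callback are all placeholders, not a real interface.

```cpp
// Rough sketch of the central manager's straggler lists: rays that diverge
// into quiet regions get buffered per region, and a region is only handed to
// a processing node once enough work has built up.
#include <cstdio>
#include <functional>
#include <unordered_map>
#include <vector>

struct Ray { float ox, oy, oz, dx, dy, dz; };

class StragglerManager {
public:
    StragglerManager(size_t threshold, std::function<void(int, std::vector<Ray>&&)> dispatch)
        : threshold_(threshold), dispatch_(std::move(dispatch)) {}

    // Called when a ray leaves the main flow; regionId identifies the volume it entered.
    void add(int regionId, const Ray& ray) {
        auto& list = pending_[regionId];
        list.push_back(ray);
        if (list.size() >= threshold_) {              // enough volume: assign a node
            dispatch_(regionId, std::move(list));
            pending_.erase(regionId);
        }
    }

    // End of the main processing body: leftovers can be processed if there is
    // time, or simply dropped (the built-in importance sampling mentioned above).
    void flushOrDrop(bool haveTime) {
        if (haveTime)
            for (auto& [region, list] : pending_)
                dispatch_(region, std::move(list));
        pending_.clear();
    }

private:
    size_t threshold_;
    std::function<void(int, std::vector<Ray>&&)> dispatch_;
    std::unordered_map<int, std::vector<Ray>> pending_;   // should stay small enough for L2
};

int main() {
    StragglerManager mgr(3, [](int region, std::vector<Ray>&& rays) {
        std::printf("dispatching %zu rays for region %d\n", rays.size(), region);
    });
    for (int i = 0; i < 4; ++i) mgr.add(7, Ray{0, 0, 0, 1, 0, 0});  // third add dispatches, fourth stays pending
    mgr.flushOrDrop(false);                                          // out of time: drop it
    return 0;
}
```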
As I said before, the observer points would stream out a tree of their dependencies into main memory. However, this data is not used until after the scatter operation, so any latency is a non-issue.
Don't forget, there is a whole host of other optimisations you can do when there are constraints on the observer's or light's movement. You could exploit temporal coherency by recording the regions each point passes through, streamed out alongside the observer tree, and only reworking points that passed through disturbed regions. This would be useful, for example, with a fixed camera or light, and even if it rotates you could probably still reuse some of the data. You could also precalculate an initial state for these sorts of objects to shorten loading times. I'm sure there are many more methods.
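The disturbed-region test could look something like this; the record layout is an assumption rather than a worked-out streaming format.

```cpp
// Sketch of the temporal-coherency idea: alongside the observer tree we keep
// the list of regions each point passed through last frame, and we only
// rework points whose recorded regions were disturbed this frame.
#include <unordered_set>
#include <vector>

struct PointRecord {
    int pointId;
    std::vector<int> regionsTouched;   // regions this point passed through last frame
};

// Returns the ids of points that must be re-traced because something moved in
// a region they previously passed through; everything else can be reused.
std::vector<int> pointsToRework(const std::vector<PointRecord>& records,
                                const std::unordered_set<int>& disturbedRegions) {
    std::vector<int> rework;
    for (const auto& rec : records) {
        for (int region : rec.regionsTouched) {
            if (disturbedRegions.count(region)) {
                rework.push_back(rec.pointId);
                break;
            }
        }
    }
    return rework;
}

int main() {
    std::vector<PointRecord> records = {
        {0, {3, 5}},    // point 0 passed through regions 3 and 5
        {1, {8}},       // point 1 stayed in region 8
        {2, {5, 9}},
    };
    std::unordered_set<int> disturbed = {5};                       // something moved in region 5
    std::vector<int> rework = pointsToRework(records, disturbed);  // -> points 0 and 2
    return static_cast<int>(rework.size());
}
```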