Stacks are hard on the GPU. Even with the PDC (Parallel Data Cache), you have to share the space with all the threads in the warp, and you have to be careful about bank-access conflicts. In the GPGPU community we adapt data structure traversal to support "restart" or "backtrack" methods; see Foley et al.'s paper from Graphics Hardware last year or Horn et al.'s paper from the upcoming I3D, both on kd-tree raytracing. The latter emulates a small fixed-size stack in the register file, using looping constructs instead of pure streaming. With scatter and gather you could emulate stacks in GPU memory (and even in host memory on ATI chips with fine-grained scatter), but it becomes *extremely* expensive. You are now talking about tremendous amounts of latency to cover, and you still have to define a bounded region of memory for each thread, basically allocating for the worst-case stack usage. Someone could probably extend Lefohn's Glift GPU data structure work to make this easier to use, but it's likely still going to be expensive.
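To make the short-stack idea concrete, here is a minimal CUDA-style sketch of a fixed-size per-thread stack inside a traversal loop. It is not code from either paper; the node layout, stack depth, and overflow handling are my own assumptions, and the compiler may keep the array in registers or spill it to local memory.

```cuda
// Minimal sketch of a fixed-size per-thread "short stack" for tree traversal,
// in the spirit of the short-stack approach. Node layout, STACK_DEPTH, and the
// traversal policy are illustrative assumptions, not either paper's code.
#define STACK_DEPTH 8   // small, fixed bound

struct Node { int left, right; };  // negative child index = leaf (assumed sentinel)

__device__ int traverse(const Node* nodes, int root)
{
    int stack[STACK_DEPTH];   // per-thread, bounded; no dynamic allocation
    int top  = 0;
    int node = root;
    int hits = 0;

    while (true) {
        if (node >= 0) {                      // interior node: descend
            int nearChild = nodes[node].left; // assume "near" child first
            int farChild  = nodes[node].right;
            if (top < STACK_DEPTH)            // push far child if there is room...
                stack[top++] = farChild;
            // ...otherwise drop it; a real short stack would fall back to
            // "restart" traversal from the root when the stack overflows.
            node = nearChild;
        } else {                              // leaf: count it, then pop
            ++hits;
            if (top == 0) break;              // stack empty -> done
            node = stack[--top];
        }
    }
    return hits;
}
```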
The main issue with recursion or stacks is that the memory requirements are unbounded. On the CPU this really isn't a problem, but once you have to handle hundreds to thousands of threads, as on a GPU today or a CPU far enough into the future, the required space gets quite high.
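To put rough numbers on that (purely illustrative figures I am assuming, not measurements from this thread), a bounded per-thread stack still has to be allocated for the worst case, so the cost scales directly with thread count:

```cuda
#include <cstdio>

// Back-of-the-envelope worst-case stack allocation. All numbers below are
// assumptions chosen for illustration only.
int main()
{
    const size_t stack_entries     = 64;     // assumed worst-case stack depth
    const size_t bytes_per_entry   = 8;      // e.g. node index + a t-value
    const size_t threads_in_flight = 12288;  // enough threads to hide latency

    size_t per_thread = stack_entries * bytes_per_entry;  // 512 bytes
    size_t total      = per_thread * threads_in_flight;   // ~6 MB reserved,
                                                           // used or not
    printf("%zu bytes per thread, %zu bytes total\n", per_thread, total);
    return 0;
}
```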
I have now done both a grid-based and a BIH-tree-based raytracer in CUDA. Both are pretty naive, giving me about 1 MRays/sec for the grid and 3 MRays/sec for the BIH on million-polygon models. I simply put the stacks for the BIH version in shared memory; there is enough space there for one stack per thread. I probably have a ton of bank conflicts though, and lots of divergent execution.
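For what it's worth, here is a minimal sketch of per-thread stacks in shared memory. This is not the kernel described above; the block size, stack depth, node layout, and traversal policy are all assumptions, and the comments only point at why the naive layout tends to cause bank conflicts.

```cuda
// Sketch of per-thread traversal stacks in shared memory (assumed layout and
// parameters, not the poster's actual kernel).
#define THREADS_PER_BLOCK 64
#define STACK_DEPTH       16   // 64 * 16 * 4 bytes = 4 KB of shared memory

__global__ void traceKernel(const int* nodes, int root, int* out)
{
    // Naive layout: each thread gets a contiguous STACK_DEPTH-entry slice.
    // Threads pushing/popping at the same depth then access addresses that
    // are STACK_DEPTH words apart, which lands them in the same banks -- a
    // likely source of the bank conflicts mentioned above.
    __shared__ int stacks[THREADS_PER_BLOCK * STACK_DEPTH];
    int* myStack = &stacks[threadIdx.x * STACK_DEPTH];

    // An alternative is to interleave the stacks so entry i of thread t sits
    // at stacks[i * THREADS_PER_BLOCK + t]; neighboring threads then hit
    // different banks when they push or pop at the same depth.

    int top  = 0;
    int node = root;   // assumed layout: two ints per node, negative = leaf
    int hits = 0;

    while (true) {
        if (node < 0) {                                // leaf
            ++hits;
            if (top == 0) break;
            node = myStack[--top];
        } else {                                       // interior node
            if (top < STACK_DEPTH)
                myStack[top++] = nodes[2 * node + 1];  // push "far" child
            node = nodes[2 * node];                    // descend "near" child
        }
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = hits;
}
```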