It's unknown whether this feature is in GP104. But again, preemption support is a different feature from mixing graphics and compute kernels on one SM.
Preemption overhead is on the order of 10 us, pretty similar to kernel launch overhead. It's likely slightly lower on P100 because of the higher-bandwidth memory. So doing hundreds of preemptions per second (well under 1% of the GPU's time) won't be a performance problem, but 10,000 per second (about 10%) would be. But again, that's not the same as async, which is not quite the same as mixing compute and graphics on the same SM, and it may be different on P100 versus GP104. (It gets confusing to me since I'm a compute-only CUDA guy.)
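
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It assumes a flat 10 us per preemption (only the order-of-magnitude figure from above, not a measured value) and that preemptions don't overlap with useful work. The function name `preemption_overhead` and the sample rates are mine, purely for illustration.

```python
# Assumed cost per preemption: ~10 microseconds (order-of-magnitude estimate, not a measurement).
PREEMPT_COST_S = 10e-6

def preemption_overhead(preemptions_per_second, cost_s=PREEMPT_COST_S):
    """Fraction of each second spent preempting instead of doing useful work."""
    return preemptions_per_second * cost_s

# Compare a few hypothetical preemption rates.
for rate in (100, 500, 10_000):
    print(f"{rate:>6} preemptions/s -> {preemption_overhead(rate):.1%} of GPU time")
```

Under those assumptions, a few hundred preemptions per second cost a fraction of a percent, while 10,000 per second cost about a tenth of the GPU's time. The real cost will depend on how much state has to be saved and on the chip's memory bandwidth.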