Obviously it would need to keep HW threads alive and so on, but does it affect any of the scheduling beyond just putting a thread to sleep? Globally, it could certainly starve a CU if it was still waiting on an earlier-launched thread to issue the "ordered" operation; what happens if that thread never issues it due to dynamic control flow or otherwise? Does it have to wait until the thread ends completely?
Let's assume the simplest possible (global) GPU scheduler. The scheduler splits the kernel into thread groups. Each thread group executes threads with id in [N, N+M-1], where M is the thread group size and N is a (scheduler) counter that is increased by M every time the scheduler issues a new thread group to a compute unit. When a compute unit finishes any thread group (of any shader, assuming that compute units can run thread groups of multiple shaders), the scheduler checks whether that compute unit has enough resources (GPRs, threads, LDS space, etc) to start a new thread group. If it has enough resources, the scheduler gives it the next thread group.
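A toy simulation can make this scheduler concrete. This is a hypothetical sketch (the slot-based resource model, the `slots_per_cu` parameter, and the lock-step completion assumption are my simplifications, not how any real GPU works); the point it demonstrates is that groups are issued strictly in increasing id order, so group N is never resident before group N-1:

```python
def schedule(num_groups, num_cus, slots_per_cu):
    """Simulate the simple in-order global scheduler described above.

    Toy model: each compute unit (CU) has `slots_per_cu` resource
    slots; a resident thread group occupies one slot. Groups are
    issued strictly in increasing id order, to whichever CU has a
    free slot. Returns the issue order as (group_id, cu_id) pairs.
    """
    issue_order = []
    free_slots = [slots_per_cu] * num_cus
    next_group = 0
    while next_group < num_groups:
        # Issue in order: group N is never issued before group N-1.
        for cu in range(num_cus):
            if free_slots[cu] > 0 and next_group < num_groups:
                free_slots[cu] -= 1
                issue_order.append((next_group, cu))
                next_group += 1
        # Model every CU finishing its oldest resident group,
        # freeing one slot each (groups take roughly equal time).
        free_slots = [min(s + 1, slots_per_cu) for s in free_slots]
    return issue_order
```

With equal per-group durations this degenerates to a simple round-robin: `schedule(8, 2, 2)` issues group 0 to CU 0, group 1 to CU 1, group 2 to CU 0, and so on, always in launch order.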
This kind of scheduling ensures that all previous thread groups have always been issued to compute units (have resources allocated and are ready to execute), meaning that the previous thread groups will eventually also finish. There are no deadlock cases. Waiting for a previous thread group to reach point X in the shader is a safe operation. All modern GPUs have a similar latency hiding mechanism: put waves/warps that are waiting for a memory operation to sleep and immediately switch to another wave/warp. I don't see a problem implementing ordered atomics using this existing mechanism.
Performance:
Let's assume the GPU has the simple global scheduler described above. In the beginning the compute units will be scheduled thread groups [0, K-1], where K is the number of compute units. When the scheduler is done scheduling groups [0, K-1], it notices that each compute unit still has free resources, and schedules thread groups [K, 2*K-1]. So compute unit 0 will receive groups 0 and K, compute unit 1 will receive groups 1 and K+1, etc... If we assume that each thread group takes approximately the same time to complete, the waits will be minimal, because the thread groups were issued in order and will roughly finish in order. And this continues throughout the execution: the scheduler gives a new thread group to the compute unit that finishes a thread group first, the next one to the compute unit that finishes second, and so on.
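To illustrate why the waits stay minimal under these assumptions, here is a toy timing model (entirely my construction: one group per CU at a time, equal group durations, and the ordered point exactly mid-shader). Under these idealized conditions no group ever stalls at the ordered atomic, because its predecessor has always passed the same point earlier:

```python
def ordered_waits(num_groups, num_cus, group_time=10.0):
    """Toy timing model (my simplification, not real HW): each CU runs
    one group at a time, every group takes `group_time` cycles, the
    ordered atomic sits exactly mid-shader, and group g may pass it
    only after group g-1 has. Returns each group's wait at the
    ordered atomic."""
    start = [0.0] * num_groups   # per-group start time
    passed = [0.0] * num_groups  # time each group passes the ordered point
    waits = []
    for g in range(num_groups):
        prev_on_cu = g - num_cus  # previous group on the same CU (round-robin)
        # A group starts when its CU's previous group finishes.
        start[g] = passed[prev_on_cu] + group_time / 2 if prev_on_cu >= 0 else 0.0
        reach = start[g] + group_time / 2        # reaches the ordered atomic
        gate = passed[g - 1] if g > 0 else 0.0   # predecessor's ordered point
        passed[g] = max(reach, gate)
        waits.append(passed[g] - reach)
    return waits
```

If the group durations are made unequal, the same model starts producing nonzero waits, which is exactly the stall that the latency hiding discussed below has to cover.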
In my example prefix sum implementation, the ordered atomic would be in the middle of the shader. When the previous thread group is not yet finished, the GPU would behave similarly to hitting a cache miss in the middle of the shader: other waves/warps (and thread groups) on the same compute unit would execute their instructions to hide the stall, until they also reach the ordered atomic. If there are enough instructions (and waves/warps), the GPU should be able to hide the stall quite well. It also helps that there are instructions at the end of the shader that have no dependencies; these can be executed while another thread group on the same compute unit is waiting for the ordered atomic. So it all boils down to this: are there enough instructions in the local prefix sum shader to hide the ordered atomic latency (the time to bounce the counter value from one compute unit to another)?
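The shader structure being discussed can be sketched as a serialized chained scan. This is a hypothetical Python stand-in for the GPU code, not the actual implementation: `running_total` plays the role of the ordered global counter, and the sequential loop stands in for thread-group launch order:

```python
def chained_prefix_sum(data, group_size):
    """Sketch of the ordered-atomic prefix sum pattern: each thread
    group computes a local prefix sum, then the ordered atomic hands
    it the running total of all earlier groups, which the tail of the
    shader adds to the local results before writing them out."""
    out = []
    running_total = 0  # the "ordered" global counter
    for base in range(0, len(data), group_size):
        group = data[base:base + group_size]
        # Local (inclusive) prefix sum within the group: independent
        # work that can run before the ordered atomic.
        local = []
        acc = 0
        for x in group:
            acc += x
            local.append(acc)
        # Ordered atomic: fetch-add of the group's total, returning
        # the sum of all previous groups in launch order.
        prev_total = running_total
        running_total += local[-1]
        # Tail of the shader: apply the global offset (no further
        # dependencies on other groups).
        out.extend(v + prev_total for v in local)
    return out
```

On a real GPU only the fetch-add in the middle is ordered; the local scan before it and the offset-apply after it are the independent instructions that hide the counter-bounce latency.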
Divergent control flow:
Practically the same issue as barriers + divergent control flow: an ordered atomic inside divergent control flow is invalid, and the compiler should emit an error.
DISCLAIMER: My analysis assumes a super simple global scheduler. Most likely real GPUs have more complex distributed schedulers, making all this much more complex (for both correctness and performance).