IMHO you are making this much too complicated.
Let’s assume that every thread has an instruction pointer and a state flag. When a thread enters the shader unit, the IP is zero and the flag says ready. Every time the scheduler needs a new thread, it looks in its thread pool for ready threads and selects one. The thread is then moved to an execution unit and the flag is changed. The scheduler will not touch this thread again until it is back from the execution unit and the flag is set back to ready. This saves all the synchronization work, since only one side ever touches a thread at a time.
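To make the idea concrete, here is a rough sketch in Python of the scheme described above. It is only a software model, not how any real hardware does it. The names `State`, `Thread`, `Scheduler`, `pick` and `execute` are made up for illustration, and picking the first ready thread in the pool is just an assumption, since any selection policy would do.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class State(Enum):
    READY = 0      # the scheduler may select this thread
    EXECUTING = 1  # the thread belongs to an execution unit


@dataclass(eq=False)  # compare threads by identity, not by field values
class Thread:
    ip: int = 0                 # instruction pointer, zero on entry
    state: State = State.READY  # state flag, ready on entry


class Scheduler:
    def __init__(self) -> None:
        self.pool: List[Thread] = []

    def add(self, thread: Thread) -> None:
        """A thread enters the shader unit."""
        self.pool.append(thread)

    def pick(self) -> Optional[Thread]:
        """Select a ready thread and hand it to an execution unit.

        Assumed policy: take the first ready thread in the pool.
        """
        for thread in self.pool:
            if thread.state is State.READY:
                # Changing the flag hides the thread from later scans.
                thread.state = State.EXECUTING
                return thread
        return None  # nothing is ready right now


def execute(thread: Thread, steps: int) -> None:
    """Hypothetical execution unit: advance the IP, then give the thread back."""
    thread.ip += steps
    thread.state = State.READY
```

With two threads in the pool, `pick()` hands out both and then returns `None` until `execute()` has returned one of them. The scheduler never looks at the IP of a thread that is flagged as executing, so it needs no extra bookkeeping for threads that are in flight.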