Barriers (at the task invocation level) are a mechanism that requires careful usage. They don't simply enforce that a kernel invocation waits behind another kernel invocation (or copy); they prevent the successor from starting until the predecessor has completely finished. Sometimes they are necessary: when the first task scatters writes across random parts of the target, the second cannot use that target as input until every one of those writes has landed.
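To make that concrete in D3D12 terms, here's a minimal sketch of the scatter-then-read case: the second dispatch sits behind a UAV barrier on the scattered-to resource. The names and group count are placeholders, and PSO/root-signature/descriptor binding is assumed to have happened elsewhere.

```cpp
#include <d3d12.h>

// Sketch only: 'cl' is an open compute command list with its pipeline and
// bindings already set; 'scatterTarget' is the UAV both dispatches touch.
void RecordScatterThenGather(ID3D12GraphicsCommandList* cl,
                             ID3D12Resource* scatterTarget,
                             UINT groupsX)
{
    cl->Dispatch(groupsX, 1, 1);                     // task 1: scattered writes

    D3D12_RESOURCE_BARRIER barrier = {};
    barrier.Type = D3D12_RESOURCE_BARRIER_TYPE_UAV;  // all prior UAV accesses to
    barrier.UAV.pResource = scatterTarget;           // this resource must complete
    cl->ResourceBarrier(1, &barrier);

    cl->Dispatch(groupsX, 1, 1);                     // task 2: consumes task 1's writes
}
```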
Any time you ask a GPU to "completely finish" you're going to waste a lot of shader/TEX/ROP/memory cycles. The invocation that's finishing will gradually go from using the whole GPU down to using none of the GPU as work runs out. The invocation that starts after the barrier will take time to fill the GPU with work, too.
So, if you can, you should avoid an algorithm that requires task-level barriers.
Alternatively, you can create multiple parallel queues, each with its own sequence of tasks separated by barriers. But now you need an algorithm that can be chopped into pieces small enough to spread across multiple queues.
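A rough sketch of that arrangement in D3D12, assuming `device` is a live device, the `listA*`/`listB*` command lists are pre-recorded compute work, and whatever barriers sit between A1 and A2 (or B1 and B2) are recorded inside those lists; error handling and cleanup are omitted.

```cpp
#include <windows.h>
#include <d3d12.h>

// Sketch only: two independent compute queues, each running its own
// barrier-separated chain. While chain A drains at a barrier, chain B
// can keep feeding the GPU.
void SubmitParallelChains(ID3D12Device* device,
                          ID3D12CommandList* listA1, ID3D12CommandList* listA2,
                          ID3D12CommandList* listB1, ID3D12CommandList* listB2)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;

    ID3D12CommandQueue* queueA = nullptr;
    ID3D12CommandQueue* queueB = nullptr;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queueA));
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queueB));

    ID3D12CommandList* chainA[] = { listA1, listA2 };   // A1 -> (barrier) -> A2
    ID3D12CommandList* chainB[] = { listB1, listB2 };   // B1 -> (barrier) -> B2
    queueA->ExecuteCommandLists(2, chainA);
    queueB->ExecuteCommandLists(2, chainB);
}
```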
If you can chop up an algorithm like this, then you can also stripe the tasks in a single queue: A1, B1, C1, A2, B2, C2.
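And a sketch of the striped single-queue version, with made-up `psoA`/`bufA` names for three independent streams, each writing to its own UAV; per-resource UAV barriers are used between the stripes so each stream only waits on its own target, though how much overlap you actually get is up to the driver and hardware.

```cpp
#include <d3d12.h>

// Sketch only: root-signature/descriptor binding is elided; the group count
// of 64 is a placeholder.
void RecordStripedStreams(ID3D12GraphicsCommandList* cl,
                          ID3D12PipelineState* psoA, ID3D12Resource* bufA,
                          ID3D12PipelineState* psoB, ID3D12Resource* bufB,
                          ID3D12PipelineState* psoC, ID3D12Resource* bufC)
{
    // First stripe: A1, B1, C1.
    cl->SetPipelineState(psoA); cl->Dispatch(64, 1, 1);
    cl->SetPipelineState(psoB); cl->Dispatch(64, 1, 1);
    cl->SetPipelineState(psoC); cl->Dispatch(64, 1, 1);

    // Per-resource UAV barriers rather than one "everything must finish" barrier:
    // A2 only has to wait on writes to bufA, B2 on bufB, C2 on bufC.
    ID3D12Resource* targets[3] = { bufA, bufB, bufC };
    D3D12_RESOURCE_BARRIER barriers[3] = {};
    for (int i = 0; i < 3; ++i)
    {
        barriers[i].Type = D3D12_RESOURCE_BARRIER_TYPE_UAV;
        barriers[i].UAV.pResource = targets[i];
    }
    cl->ResourceBarrier(3, barriers);

    // Second stripe: A2, B2, C2.
    cl->SetPipelineState(psoA); cl->Dispatch(64, 1, 1);
    cl->SetPipelineState(psoB); cl->Dispatch(64, 1, 1);
    cl->SetPipelineState(psoC); cl->Dispatch(64, 1, 1);
}
```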
The upshot is that it takes time to examine these implementation alternatives, multiplied by the number of GPU types you might encounter. And there may be algorithms that obviate the need to wait for an invocation to completely finish, but they have to be found.
Hopefully low-level D3D12 coders will accept that they can write auto-tuning task managers rather than drudge through the full nightmare of catering for every GPU's quirks. And PC gamers are used to frobnicating their graphics options to get the performance/IQ balance they want, so finding algorithms whose performance profile is relatively stable across IQ settings and GPU capabilities is prolly more important.