The overhead is always greater than the performance gain; I don't think there's any case where SMT improves perf/W.
I think there is, once you max out all cores: running two threads per core would probably yield higher perf/W. That use case doesn't exist in the mobile space, though.
If you're not at max, it's a non-starter. Say SMT buys you 20% more throughput at a given frequency: you can run two threads on one SMT core at 2.5 GHz or on two cores at 1.5 GHz, for the same total throughput (1.2 x 2.5 = 2 x 1.5 = 3.0). The latter wins because of (very) non-linear DVFS scaling.
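To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It assumes dynamic power goes as C * V^2 * f and that voltage has to rise roughly linearly with frequency, so per-core power goes as f^3. The cubic exponent is my assumption (the point only needs "worse than linear"), and static/leakage power and SMT's own area and power overhead are ignored. The function names are made up for illustration.

```python
# Toy DVFS model: per-core dynamic power ~ f^3 (assumes V scales linearly with f).
# Ignores leakage power and the extra power an SMT core burns; illustrative only.

SMT_GAIN = 0.20  # the 20% throughput gain assumed in the post


def core_power(freq_ghz, exponent=3.0):
    """Relative power of one core at the given frequency (arbitrary units)."""
    return freq_ghz ** exponent


def throughput(freq_ghz, cores, smt=False):
    """Relative throughput: frequency times core count, plus the SMT bonus."""
    per_core = freq_ghz * ((1.0 + SMT_GAIN) if smt else 1.0)
    return cores * per_core


def perf_per_watt(freq_ghz, cores, smt=False):
    """Throughput divided by total power of all active cores."""
    return throughput(freq_ghz, cores, smt) / (cores * core_power(freq_ghz))


smt_option = perf_per_watt(2.5, cores=1, smt=True)    # one SMT core at 2.5 GHz
dual_option = perf_per_watt(1.5, cores=2, smt=False)  # two cores at 1.5 GHz

print(f"1 SMT core @ 2.5 GHz: {smt_option:.3f} perf/W")
print(f"2 cores    @ 1.5 GHz: {dual_option:.3f} perf/W")
```

Under these assumptions both options deliver the same throughput, but the two slower cores draw well under half the power (2 x 1.5^3 = 6.75 vs 2.5^3 = 15.625), so they come out ahead on perf/W. With a gentler exponent the gap shrinks, and once every core is already busy the "spread out and clock down" option is gone, which is the case where SMT could still help.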
Cheers