Try running this on Tahiti for example:
http://galaxy.u-aizu.ac.jp/trac/note/wiki/Astronomical_Many_Body_Simulations_On_RV770
With Cayman you'll get 780/1030/1500/1770/1800 sustained GFLOPS for the five predefined values of n.
With Tahiti, you should be able to beat that by at least 40% on clocks and unit count alone.
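As a rough sanity check on that 40% figure, here is a minimal sketch of the peak-rate arithmetic. The unit counts and clocks are my assumption of the retail parts (HD 6970 for Cayman, HD 7970 for Tahiti), not something stated above, and scaling the sustained numbers by the peak ratio assumes the kernel stays ALU-bound.

```python
# Assumed retail specs (not from the post above): HD 6970 (Cayman), HD 7970 (Tahiti).
CAYMAN = {"alu_lanes": 1536, "clock_mhz": 880}
TAHITI = {"alu_lanes": 2048, "clock_mhz": 925}

def peak_gflops(gpu):
    # One multiply-add per lane per clock, counted as 2 flops.
    return gpu["alu_lanes"] * gpu["clock_mhz"] * 2 / 1000.0

ratio = peak_gflops(TAHITI) / peak_gflops(CAYMAN)

# Sustained GFLOPS quoted for Cayman at the five predefined n's.
cayman_sustained = [780, 1030, 1500, 1770, 1800]
tahiti_estimate = [round(g * ratio) for g in cayman_sustained]

print(f"peak ratio: {ratio:.2f}")
print("Tahiti estimate if scaling were perfect:", tahiti_estimate)
```

Anything well short of those estimates on Tahiti would point at the compiler or register allocation rather than the hardware.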
That's fairly heavy on transcendentals, and I dare say the compiler might be outputting "compute" transcendentals rather than graphics transcendentals. I suspect the former are much slower.
Though GCN is supposed to have "helpers" for compute transcendentals, as far as I understand it, to reduce the hit from "precise" transcendentals that's seen on prior GPUs.
But this application appears to be written in IL, which throws extra variables into the mix. E.g. I don't know if it is written using PS compute (unlikely, but not impossible) or CS. The former would be more likely to use graphics transcendentals.
In GCN I can hypothesise that a transcendental (SQRT, say) that appears in a compute shader is always "slow", because the compiler sees that it is compute, not pixel shading. Whereas on R700/Evergreen the compiler would issue the literal instruction, the standard "imprecise" SQRT, regardless of whether the IL is written in PS or CS mode.
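To make "heavy on transcendentals" concrete: the inner loop of a brute-force N-body evaluates one reciprocal square root per particle pair, so a slow SQRT/RSQ path is paid on every interaction. A minimal sketch in plain Python follows. It is not the IL from the linked page; the softening `eps` and the function name are mine, for illustration only.

```python
import math

def accelerations(pos, mass, eps=1e-3):
    """Brute-force O(n^2) gravitational accelerations, with G = 1."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        xi, yi, zi = pos[i]
        for j in range(n):
            dx = pos[j][0] - xi
            dy = pos[j][1] - yi
            dz = pos[j][2] - zi
            # Softening keeps r2 > 0, so the i == j term is finite and
            # contributes nothing because dx = dy = dz = 0.
            r2 = dx * dx + dy * dy + dz * dz + eps * eps
            # The one transcendental per interaction: 1/sqrt(r2).
            inv_r = 1.0 / math.sqrt(r2)
            s = mass[j] * inv_r * inv_r * inv_r
            acc[i][0] += s * dx
            acc[i][1] += s * dy
            acc[i][2] += s * dz
    return acc
```

Everything else in the loop is plain multiplies and adds, which is why the speed of that single instruction matters so much here.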
---
Another issue is register allocation. GCN's peak register allocation per work item is half R600's. Brute-force N-body is a perfect place to do maximal vectorisation, i.e. stuffing as many particles into a work item as possible.
A kernel tuned that way may be over-stuffed for GCN, leaving too few threads per core and so running at much lower performance.
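A minimal sketch of how that plays out, assuming the GCN limits as I understand them from public material (256 vector registers per SIMD lane, at most 10 wavefronts resident per SIMD; neither number is from this thread) and ignoring allocation granularity. The per-particle register cost is made up purely for illustration. "Threads per core" here maps to wavefronts per SIMD.

```python
# Assumed GCN limits (from public GCN material, not from the post above).
VGPRS_PER_LANE = 256      # vector registers available to each SIMD lane
MAX_WAVES_PER_SIMD = 10   # cap on resident wavefronts per SIMD

def waves_per_simd(vgprs_per_work_item):
    """Resident wavefronts per SIMD when VGPRs are the only limit."""
    if not 1 <= vgprs_per_work_item <= VGPRS_PER_LANE:
        raise ValueError("VGPR count out of range")
    return min(MAX_WAVES_PER_SIMD, VGPRS_PER_LANE // vgprs_per_work_item)

# Hypothetical kernel: a fixed overhead plus a few registers for each
# particle held in the work item. Both numbers are invented.
BASE_VGPRS = 16
VGPRS_PER_PARTICLE = 8

for particles in (1, 2, 4, 8, 16):
    vgprs = BASE_VGPRS + VGPRS_PER_PARTICLE * particles
    print(f"{particles:2d} particles/work item -> {vgprs:3d} VGPRs -> "
          f"{waves_per_simd(vgprs)} waves/SIMD")
```

With these made-up costs, packing 16 particles into a work item leaves a single wavefront per SIMD, which is the over-stuffed case: nothing left to hide latency with.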
Horrible register allocation is a long-standing problem for AMD. GCN may be extra-horrible in its youth. It should improve...