Not grim IMO, but rather it shows what will become important. For example, note how the BG/P OS doesn't do disk-backed memory: pages are always physically pinned, so the DMA engine has low latency and the CPU doesn't touch pages during communication. What I gather from all of it is that eventually the hardware is going to consist of cores plus an interconnect that provides dedicated hardware support for the most important parallel communication patterns, so that the cores aren't involved in communication which is latency bound. Things like CPUs manually doing all the work on interrupts (preemption) just aren't going to scale ... nor are ALUs doing atomic operations on shared queues between cores ... etc. I think all of this goes away at some point in favour of dedicated hardware, and a different model of general-purpose computing.
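Roughly the same idea in CUDA terms (just an illustrative sketch of mine, not the BG/P API): a page-locked allocation gives the DMA engine a stable physical address, so it can run the transfer asynchronously while the CPU stays out of the way.

[code]
// Illustrative sketch only: CUDA's page-locked ("pinned") host memory is the
// closest everyday analogue to always-pinned pages -- the copy engine can DMA
// to/from it without the CPU or the VM system touching the pages mid-transfer.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = 1 << 20;
    float *host_buf = 0, *dev_buf = 0;

    // Page-locked allocation: never swapped out, physical address is stable,
    // so the DMA engine can read it directly.
    cudaHostAlloc((void**)&host_buf, bytes, cudaHostAllocDefault);
    cudaMalloc((void**)&dev_buf, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Async copy: the CPU only issues the request and is then free to do other
    // work; the DMA engine owns the transfer.
    cudaMemcpyAsync(dev_buf, host_buf, bytes, cudaMemcpyHostToDevice, stream);

    cudaStreamSynchronize(stream);
    printf("transfer done\n");

    cudaFree(dev_buf);
    cudaFreeHost(host_buf);
    cudaStreamDestroy(stream);
    return 0;
}
[/code]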
Yeah, in GPUs the hardware is effectively providing a queue for atomics and the pre-emptive scheduler is sleeping the context when it runs out of non-dependent instructions, and then using other contexts. GPUs currently have no useful concept of multi-GPU atomics, which is where AFR comes from.
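For concreteness, a minimal CUDA sketch of what I mean (a generic example, not from any of the papers): the ALU merely issues the atomic, the memory subsystem resolves it, and the issuing warp can simply be slept until the result comes back.

[code]
// Minimal sketch: per-thread atomic increments on a counter in global memory.
// The hardware serialises these at the memory subsystem; the issuing warp can
// be descheduled while the atomic is in flight and other warps run instead.
__global__ void count_hits(const float *values, int n, unsigned int *counter) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && values[i] > 0.0f) {
        // The ALU only issues this; completion is handled off-core, so the
        // scheduler is free to switch to another warp in the meantime.
        atomicAdd(counter, 1u);
    }
}
[/code]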
My query is: why can't software threading (fibre-based) enjoy the same benefit? In software threading the scheduler would sleep/queue fibres.
The way I see it, where the atomic operation is performed has no bearing on whether the hardware runs 10s or 100s of contexts in order to hide latency, or on whether software threading is used.
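To make the fibre idea concrete, here's a toy host-side sketch (all names are mine, nothing standard): a fibre that hits a long-latency operation goes back on the run queue and the scheduler picks another, which is exactly the sleep/queue behaviour the hardware gives warps.

[code]
// Toy sketch of fibre-style scheduling (all names hypothetical): a fibre that
// starts a long-latency operation yields, gets re-queued, and the scheduler
// resumes another runnable fibre -- mirroring the hardware warp scheduler.
#include <deque>
#include <functional>
#include <utility>

struct Fibre {
    std::function<bool()> step;   // returns true when the fibre has finished
};

class Scheduler {
public:
    void spawn(std::function<bool()> step) { runnable_.push_back(Fibre{std::move(step)}); }

    void run() {
        while (!runnable_.empty()) {
            Fibre f = runnable_.front();
            runnable_.pop_front();
            // Run the fibre until it finishes or yields (e.g. because it
            // issued a long-latency memory or atomic operation).
            if (!f.step()) {
                runnable_.push_back(f);   // not done: queue it, run another
            }
        }
    }

private:
    std::deque<Fibre> runnable_;
};
[/code]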
Going further out of my depth, how are atomics implemented in virtualised processors?...
My little brother (James Lottes, different last name) worked at Argonne in the MCS Division on tough scaling issues for Blue Gene (until he decided to go back to get his PhD this year; now he works there on and off). An interesting paper related to the issues of scaling algorithms in interconnect-limited cases,
http://www.iop.org/EJ/article/1742-...quest-id=12293745-5238-4326-9be2-43b91b4c4753, covers how they adjust data exchange strategies for the problem to lower network latency.
Being able to use the dedicated reduction hardware seems to be the biggest win there (it was put there with good reason, eh?) but the crystal routing is none-too-shabby!
If you haven't read this PTX simulator paper,
http://www.ece.ubc.ca/~aamodt/papers/gpgpusim.ispass09.pdf, you might find it interesting. Their results showed performance more sensitive to interconnection network bisection bandwidth rather than latency.
Ooh, that's interesting, and it's Wilson Fung again. A nice range of applications and not too much low-hanging fruit computationally.
I've got a number of issues with that paper:
- it doesn't use, as a baseline, the cache sizes that Volkov has indicated exist - though I'm still unclear on whether caching has any effect on global memory fetches (as opposed to fetches through the texturing hardware)
- it doesn't allow for developers to re-configure their algorithms to the hardware configurations they evaluated - this is particularly serious as CUDA programming is very much about finding a sweet-spot for the hardware in hand
- the evaluations all seem too one-dimensional, with the exception of "are more threads better?" which groups several changes, but seemingly doesn't take the criticisms they make as clues for which other variables (e.g. memory controller queue-length) to take into account
- PTX is fairly distant from what the processor executes, both because of the dual-ALU configuration, which isn't simulated, and because of the re-ordering of program flow that the hardware can perform. The very low MAD count makes me wonder if the driver-compiler optimisation of MUL + ADD into MAD is one of the things they missed - though I know that some developers steadfastly try to prevent this particular optimisation from occurring (see the sketch just below)
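By way of illustration (my guess at the sort of thing those developers do, not something from the paper): the compiler is free to contract a multiply and add into a single MAD, and the __fmul_rn/__fadd_rn intrinsics are the usual way to stop it.

[code]
// Illustrative only: two ways of writing x*y + z in CUDA. The first may be
// contracted into a single MAD by the compiler; the second uses the
// __fmul_rn/__fadd_rn intrinsics, which are never contracted, so the MUL and
// ADD stay separate (what some developers do deliberately to control rounding).
__device__ float mad_allowed(float x, float y, float z) {
    return x * y + z;                      // may become one MAD instruction
}

__device__ float mad_prevented(float x, float y, float z) {
    return __fadd_rn(__fmul_rn(x, y), z);  // forced to stay MUL then ADD
}
[/code]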
There are some truly cruel IPCs shown there.
Warp occupancy (inverse of branch-divergence) seems pretty decent overall, but MUM is a disaster zone.
All the worst-performing applications (BFS, DG, MUM, NN, WP - NQU isn't an application if you ask me, and I can't actually see its performance there anyway) show that "perfect memory" would make a very substantial difference in performance.
With regard to on-die communication topology I wonder if these GPUs are using a single communications network. The diagrams for ATI GPUs clearly indicate multiple networks and 3dilettante's point:
http://forum.beyond3d.com/showpost.php?p=1292923&postcount=1103
about texture cache traffic being uni-directional is a powerful one that massively affects the simulations performed.
It's also interesting that the ring doesn't look very good there. Did they give it enough bandwidth?
They also added a cache in their simulation, which indeed helped some of the apps, but also reduced the performance of a lot of them.
"a lot"? CP suffers due to a simulation artefact. RAY and FWT may suffer from cache policy. And without developer-optimisation, as I said earlier, it's not saying much - developers have optimised for whatever cache the apps have on the hardware they've tested, e.g. MUM is 2D texturing and benefits greatly. LIB is making heavy use of local memory (private to the thread in video memory is the definition of local memory, I guess) yet shows a performance decline as more and more cache is added that isn't explained.
Jawed