It's weird though: they say it can do 4 multiplies and 4 adds, so it looks like those would happen at the same time.
They can do multiplies and adds at the same time -- as separate instructions on separate registers, issued to separate execution units. This is different from FMA, where the add is applied immediately to the result of the multiply and a third register. Which is better? It's complicated. Separate units often have lower latency for FADD, which helps in some situations, but a longer total latency when you're doing multiply-then-add sequences (the FMA pattern), which is what a lot of the vector math you do will be all about.
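To make the distinction concrete, here's a minimal sketch in C intrinsics (my own example, not anything from the slides). Jaguar has no FMA instruction, so the fused version below is what an FMA-capable core (e.g. Piledriver, compiled with -mfma) would run:

```c
#include <immintrin.h>

/* Separate multiply and add: two instructions on two execution units.
   The add depends on the multiply's result, so the dependency chain
   pays FMUL latency plus FADD latency. */
__m128 mul_then_add(__m128 a, __m128 b, __m128 c) {
    return _mm_add_ps(_mm_mul_ps(a, b), c);
}

/* Fused multiply-add: one instruction, one rounding step, one issue
   slot -- but only on cores that actually have an FMA unit. */
__m128 fused(__m128 a, __m128 b, __m128 c) {
    return _mm_fmadd_ps(a, b, c);
}
```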
Throughput should be as good as having a single FMA unit, with the caveat that you need two instructions, and that the chip frontend can only decode two instructions per clock. It's not as bad as it sounds, because in x86 an ALU op can also contain a memory op (and you effectively need at least one of those for every ALU op). However, since there is always some overhead, I'd expect that the chip can't quite reach its theoretical max due to issue throughput, unless it can decode two 256-bit AVX ops per clock. That would be awesome. (I'm not too optimistic about that, though -- BD can't do that.)
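To make the memory-op point concrete, here's a hedged sketch (my naming; the exact codegen depends on the compiler):

```c
#include <immintrin.h>

/* With -O2 -mavx, compilers typically fold the load into the multiply,
   emitting something like
       vmulps xmm0, xmm1, [rdi]
       vaddps xmm2, xmm2, xmm0
   so one mul+load and one add fill the two decode slots of a clock. */
void madd4(const float *x, __m128 coeff, __m128 *acc) {
    __m128 v = _mm_loadu_ps(x);   /* load, foldable into vmulps */
    *acc = _mm_add_ps(*acc, _mm_mul_ps(coeff, v));
}
```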
That would be nice if doable; I honestly don't know enough to say one way or the other.
FYI, hardware.fr (so in French) states that the L2 interface can handle 24 read/write operations simultaneously. I don't know whether that would be a limitation when adding cores (or at what point it would start to become a bottleneck).
The hard part of putting more cores in a system is cache coherency, or snooping. Every time you write to a cache line for the first time, or read in a new cache line, you effectively need to ask every cache in the system if they have that line. In older cache systems, that's literally what happened. As you add more cores, that doesn't scale.
So instead you make your LLC (Last Level Cache, the cache that will pass requests to memory if they are not found there -- L3 for Intel, L2 for Jaguar) inclusive, so that every cached line must also be present in it, and then deal with coherency there. Now you only have to ask one place. Having two L2 caches means you have to ask both your own and the foreign one, which is worse than the cleaner single-LLC system, but not yet horrible (and it's probably much easier to do than building a single LLC that can support the request volume of 8 cores).
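Here's a toy model of the two lookup schemes, just to show the shape of the scaling argument (all names and sizes are made up, and a real protocol tracks line state, not mere presence):

```c
#include <stdbool.h>
#include <stdio.h>

#define NUM_CORES       8
#define LINES_PER_CACHE 4

typedef struct {
    unsigned long tags[LINES_PER_CACHE];  /* addresses of lines held */
    int used;
} Cache;

/* Old-style broadcast snoop: ask every private cache in the system.
   The work grows linearly with core count, which is why it stops scaling. */
static bool snoop_broadcast(const Cache caches[], int n, unsigned long line) {
    for (int c = 0; c < n; c++)
        for (int i = 0; i < caches[c].used; i++)
            if (caches[c].tags[i] == line)
                return true;
    return false;
}

/* Inclusive LLC: any line cached anywhere must also sit in the LLC,
   so a single lookup answers the question no matter how many cores. */
static bool snoop_inclusive(const Cache *llc, unsigned long line) {
    for (int i = 0; i < llc->used; i++)
        if (llc->tags[i] == line)
            return true;
    return false;
}

int main(void) {
    Cache cores[NUM_CORES] = {0};
    Cache llc = {0};

    /* Core 3 pulls in line 0x1000; inclusion forces a copy into the LLC. */
    cores[3].tags[cores[3].used++] = 0x1000;
    llc.tags[llc.used++] = 0x1000;

    printf("broadcast: %d caches asked, hit=%d\n", NUM_CORES,
           snoop_broadcast(cores, NUM_CORES, 0x1000));
    printf("inclusive: 1 cache asked,  hit=%d\n",
           snoop_inclusive(&llc, 0x1000));
    return 0;
}
```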
That, by the way, is exactly how ppc470 cores are laid out. They are coupled with L2 in "clusters" of 4, and then these clusters can be combined on the bus.
The slides released are completely silent on the Jaguar system architecture, so all this is just speculation. However, I really do think that the very last point on the L2 slide points to this -- the 16 additional snoop queue spots would be very useful if you wanted to put more Jaguar "clusters" in a system.
As a side note, from your post I can tell that you have a positive view of this architecture.
BD disappointed me; Bobcat was a very positive surprise. Frankly, given the design restrictions, I did not expect it to be anywhere near as good as it is. It has a few shortcomings (integer divide, wtf), but all in all it's a very efficient, simple architecture. It's basically what Atom should have been.
A lot of people seem disappointed at the apparent lack of oomph in Jaguar. Don't be. It won't reach numbers as high as the best of them, but it is freakishly efficient compared to what console devs are used to. For walking down decision trees and the like it will easily be at least 5x better per clock, and with intelligent optimizing even more than that. (I mean, this is a CPU with a 1-cycle latency, 1/2-cycle reciprocal throughput conditional move...) And in the areas where it is at its weakest, a lot of the work can be offloaded to the GCN CUs.
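Since the conditional move is the whole trick for tree walking, here's a minimal branchless sketch (my own toy structure; whether the ternary actually becomes a cmov is up to the compiler, though at -O2 it typically does):

```c
/* Toy binary decision tree: leaves are encoded as negative indices
   (~leaf_id), so the walk needs no branches at all inside the loop. */
typedef struct {
    int   feature;     /* which input feature this node tests */
    float threshold;
    int   left, right; /* child indices; negative means leaf */
} Node;

int walk(const Node *nodes, const float *features, int root) {
    int idx = root;
    while (idx >= 0) {
        const Node *n = &nodes[idx];
        /* Compiles to a 1-cycle cmov instead of an unpredictable
           branch -- per level you pay latency, not mispredictions. */
        idx = (features[n->feature] < n->threshold) ? n->left : n->right;
    }
    return ~idx;  /* decode leaf id */
}
```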
So hypothetically:
200mm² SoC:
8 Jaguars @ 2GHz, ~64 GFLOPS of AVX throughput
128 GFLOPS: 8 cores * 2 instructions per clock (FMUL + FADD) * 4 elements per vector * 2 GHz. And that includes doing the operand loads.
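Spelled out as a sanity check (all figures are the hypothetical ones above):

```c
#include <stdio.h>

int main(void) {
    const double cores = 8;   /* Jaguar cores in the hypothetical SoC */
    const double ipc   = 2;   /* FMUL + FADD issued per clock */
    const double lanes = 4;   /* 4 floats per 128-bit vector op */
    const double ghz   = 2;   /* clock frequency in GHz */
    printf("%.0f GFLOPS\n", cores * ipc * lanes * ghz);  /* prints 128 */
    return 0;
}
```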