It is a dual-issue architecture: it has two FPU/SIMD pipes, so it can issue separate MUL and ADD instructions at the same time. There is nothing misleading about that.
And AVX does not include FMA either; it's a separate extension (or actually two separate ones, if you count FMA3 and FMA4).
But what is more interesting is that the DP multiplier is obviously only 1:4 rate. While Jaguar can do 4 SP muls (+ 4 SP adds), it can do only a single DP mul (+ 2 DP adds) in a cycle.
They can do multiplies and adds at the same time -- as separate instructions on separate registers, issued to separate execution units. This is different from doing an FMA, where you do an add immediately on the result of the multiply and a third register. Which is better? It's complicated. Separate units often have lower latency for FADD, which helps in some situations, but longer latency when you are doing FMA-style work, which is what a lot of the vector math you do will be all about.
Throughput should be as good as having a single FMA unit, with the caveat that you need two instructions, and that the chip frontend can only decode two instructions per clock. It's not as bad as it sounds, because in x86 an ALU op can also contain a memory op (and you effectively need at least one of those for every ALU op). However, since there is always some overhead, I'd expect that the chip can't quite reach its theoretical max due to issue throughput, unless it can decode two 256-bit AVX ops per clock. That would be awesome. (I'm not too optimistic about that, though -- BD can't do that.)
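To make the difference concrete, here's a minimal C sketch of the two patterns (the fused variant needs the FMA3 extension, which is separate from AVX and not something Jaguar has; the function names are just for illustration):

```c
#include <immintrin.h>

/* a*b + c as two separate instructions: the pattern a dual-pipe design
 * like Jaguar can issue in parallel (MUL down one pipe, ADD down the
 * other), at the cost of two ops and an intermediate rounding step. */
static inline __m128 mul_then_add(__m128 a, __m128 b, __m128 c)
{
    return _mm_add_ps(_mm_mul_ps(a, b), c);
}

#ifdef __FMA__
/* The same math as one fused instruction; requires FMA3 (-mfma),
 * which Jaguar does not implement. */
static inline __m128 fused_mul_add(__m128 a, __m128 b, __m128 c)
{
    return _mm_fmadd_ps(a, b, c);
}
#endif
```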
The hard part of putting more cores in a system is cache coherency, or snooping. Every time you write to a cache line for the first time, or read in a new cache line, you effectively need to ask every cache in the system if they have that line. In older cache systems, that's literally what happened. As you add more cores, that doesn't scale.
So, instead you make your LLC (Last Level Cache, the cache that will pass requests on to memory if they are not found; L3 for Intel, L2 for Jaguar) inclusive, so that every cached line must be found in it, and then deal with coherency there. Now you only have to ask one place. Having two L2 caches means you have to ask your own and the foreign cache, which is worse than the cleaner single-LLC setup, but it's not yet horrible (and it's probably much easier to do than making an LLC that can support the volume of requests needed by 8 cores).
That, by the way, is exactly how ppc470 cores are laid out. They are coupled with L2 in "clusters" of 4, and then these clusters can be combined together on the bus.
The slides released are completely silent on the Jaguar system architecture, so all this is just speculation. However, I really do think that the very last point on the L2 slide points to this -- the 16 additional snoop queue spots would be very useful if you wanted to put more Jaguar "clusters" in a system.
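To put the snooping argument in toy form (this is just probe counting, nothing like a real MESI/MOESI implementation, and the cluster sizes are the ones discussed above):

```c
#include <stdio.h>

/* Count how many tag lookups a coherency event costs under three schemes. */

/* Broadcast snooping: probe every other private cache in the system. */
static int probes_broadcast(int total_cores) { return total_cores - 1; }

/* One inclusive LLC: it holds (a superset of) every privately cached line,
 * so a single lookup in its tags answers for the whole system. */
static int probes_inclusive_llc(void) { return 1; }

/* Two Jaguar-style clusters, each with its own inclusive L2: check your
 * own L2 plus the foreign cluster's L2. */
static int probes_two_clusters(void) { return 2; }

int main(void)
{
    printf("broadcast, 8 cores   : %d probes\n", probes_broadcast(8));
    printf("single inclusive LLC : %d probe\n",  probes_inclusive_llc());
    printf("two L2 clusters      : %d probes\n", probes_two_clusters());
    return 0;
}
```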
BD disappointed me; Bobcat was a very positive surprise. Frankly, given the design restrictions, I did not expect it to be anywhere near as good as it is. It has a few shortcomings (integer divide, wtf), but all in all it's a very efficient, simple architecture. It's basically what Atom should have been.
A lot of people seem disappointed at the apparent lack of oomph in Jaguar. Don't be. It won't reach numbers as high as the best of them, but it is freakishly efficient compared to what console devs are used to. For walking down decision trees and the like it will easily be at least 5x better per clock, and with intelligent optimizing even more than that. (I mean, this is a CPU with a 1-cycle latency, 1/2-cycle reciprocal throughput conditional move...) And in the areas where it will be at its weakest, a lot of the work can be offloaded to the GCN CUs.
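For what it's worth, this is the sort of branchless tree walk where a cheap CMOV pays off; the node layout here is a made-up illustration, not anything from the leaked specs:

```c
#include <stddef.h>

typedef struct {
    float threshold;   /* split value at this node       */
    int   feature;     /* which input feature to compare */
} tree_node;

/* Descend a complete binary tree of the given depth stored as an array
 * (children of node i are 2i+1 and 2i+2); returns the final node index.
 * The index update is branchless, so the compiler can emit a compare plus
 * a conditional select (CMOV) instead of a hard-to-predict branch. */
size_t walk_tree(const tree_node *nodes, int depth, const float *features)
{
    size_t i = 0;
    for (int level = 0; level < depth; ++level) {
        size_t go_right = features[nodes[i].feature] > nodes[i].threshold;
        i = 2 * i + 1 + go_right;
    }
    return i;
}
```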
128GFlops. 8 cores * 2 instructions per clock (FMUL + FADD) * 4 elements per vector * 2GHz. And that includes doing the operand loads.
Yes. One Jaguar core has half the peak flops of one Ivy Bridge core (at the same clock). So an 8-core Jaguar should match a 4-core (8-thread) Ivy Bridge if both are clocked the same. That's pretty good for such a small CPU. I wonder if they are planning 16-core (four-module) versions as well (would be nice for some server setups).
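A quick back-of-the-envelope check of those peak numbers (the 2.0GHz clock is only an assumption for the comparison, not a confirmed spec):

```c
#include <stdio.h>

/* Peak single-precision GFLOPS = cores * FLOPS per cycle per core * GHz. */
static double peak_gflops(int cores, int flops_per_cycle, double ghz)
{
    return cores * flops_per_cycle * ghz;
}

int main(void)
{
    /* Jaguar: 128-bit MUL + 128-bit ADD per cycle = 8 SP FLOPS/cycle/core. */
    printf("8-core Jaguar @ 2.0GHz     : %.0f GFLOPS\n", peak_gflops(8, 8, 2.0));
    /* Ivy Bridge: 256-bit AVX MUL + ADD per cycle = 16 SP FLOPS/cycle/core. */
    printf("4-core Ivy Bridge @ 2.0GHz : %.0f GFLOPS\n", peak_gflops(4, 16, 2.0));
    return 0;
}
```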
I think 16 core would start to have serious problems with coherency. The costs go up exponentially with agent count, and also with distance between the coherency agents (which would necessarily be much more in a 4-"cluster" system than in a 2-cluster one).
Besides, the kinds of server loads where Bobcat would be interesting don't really want large systems; they want a lot of systems per rack unit. So AMD might well manufacture a server SoC with 16 (or even more -- the chip wouldn't be *that* huge) Jaguar cores in it, but I would expect it to contain 4 separate systems and a shared IO hub, sort of like the things Seamicro wanted to build before the acquisition.
Because of how this sounds I just want to clarify that the person I got that info from has dev kit access. They went with another's guesstimate because they didn't have their own.
I think most people are expecting the GPU to be above 1TFLOP.
Besides, even if BG's info is up to date, a 1.8TFLOP GPU would still fit, since anything between 1 and 2TFLOPs would still fall in line with the 1+TFLOP quote.
The 1.8TFLOP is for PS4. 1+TFLOP for Xbox 3. Just sounds like you were crossing the two for Xbox 3. And while not recent info, I still haven't heard if any changes have been made at this time.
No, AlphaWolf has it right. I read Ruskie's post as saying 1.1-1.5 didn't fit within the 1+ estimate from your source. My point was that anything from 1.1-1.9 technically falls in line with that original estimate. I purposely used 1.8TF as an example since that's currently the highest rumored benchmark for GPU performance in these upcoming consoles. :smile:
If it is still the best number we have, the closest shipping GPU on the market is definitely the HD 7850.
I don't remember whether any rumors gave us a more specific number, like the number of SIMDs/CUs.
Somebody definitely has to open a thread akin to the one Acert opened for the next Xbox, one that would centralize the rumors we've heard so far. Damn, it's tough to keep track of what is going on, what the latest rumors are, which have been debunked, etc. I'm a bit lost.
If Sony uses a quad-core Jaguar they could definitely have something sexy. Pitcairn is barely bigger than 200 mm^2, and I would put the four Jaguar cores + cache at ~25-30 mm^2. With a pretty mature 28nm process, production should go well.
With regard to the bus width, I've wondered if Sony should go with a 256 bit bus without much room for a shrink or go for a 192 bit bus that might allow for one shrink.
It's disputable; it looks like shrinking costs more and more in R&D, new lithography nodes are more and more expensive, the same goes for the wafers, etc.
My POV is that they should go for something cheap (I know pretty much everybody disagrees with me here), so I would favor a 192-bit bus and a design free of coarse-grained redundancy (no CUs or memory controllers disabled). They should be comfortable with it without having to plan on shrinks.
To me the perfect chip would be 4 Jaguar cores, 16 CUs, 24 ROPs and a 192-bit bus. They should try to get it as tiny as possible (south of 200mm^2).
For the memory I wish (I'm a crazy person, I know) they would go for 2GB of fast GDDR5 in the same set-up as Nvidia's GTX 660 Ti (2x 512MB on two controllers and 1GB on the last one). It's a trade-off (the last 512MB effectively only sees one controller's worth of bandwidth), but once it's all said and done it works pretty well.
What I want is for Sony to have a product they can price competitively while being comfortable with the BOM. I don't want them to go head to head with MSFT either; the latter, if they bring a serious OS to their platform, are imo going to enjoy a serious competitive advantage.
Imo Sony has to deliver a good "core" system, priced and specced in accordance with their fiscal situation and the gloomy economic situation worldwide.
My wish would be that Sony sticks to its free PSN, further upgrades PSN+, and delivers a system at a really competitive price. I hope they could launch a single "HDD-less" SKU at $199 next October/November (2013) with some flash storage on the board.
The system would have slots for PSV SD cards (at least two; I would wish for four, and they may cut the number down the road as SD card capacities climb).
Sony needs money now more than ever. Creating this cash flow is a trade-off to reach a mass-market price for the system at launch. If the PSV finds its pace, a sane one, Sony may be happy to have made this choice. MSFT got away this gen with abusive accessory pricing and no free online gaming; fans and haters would both cry, but it's imo a nice trade-off, especially if Sony keeps the basic PSN free.
As a side note, they should significantly sweeten the price of their proprietary SD cards; $99 for 32GB is a bit crazy.
Not that they couldn't sell bundles with an SD card, but the message should be clear for the devs and the public: "there is only one SKU". The naked system has x GB of flash RAM for caching (a sane amount, decided with publishers or close enough to their wishes), and installs can't be mandatory. There would be some extra room for downloadable content and updates, but that's all.
Overall, like everybody, I've read the requests of some big studios and publishers, as well as the wishes of a lot of fans, but I hope Sony doesn't comply with any of those and that they remain reasonable (not like the PSV, which is over-specced imho and as such over-priced).
I hope that they start in the grey on the system alone and make money really fast. I hope they don't promise that the system will last ten years, and that they avoid their usual PR disasters.
They would have an x86 SoC with a known GPU architecture, a UMA and 2GB; devs would push the crap out of that thing, and for a long enough time.
EDIT
Wrt the proprietary SD cards, I believe that Sony could push their usage further, I mean on more of their products. There are more and more tablets, phones and other products that no longer offer an SD card expansion slot. I find that bothersome; it would be a pretty good compromise for Sony to still offer the option, BUT only in their proprietary format. If the pricing is sane (so a sane premium), I believe it's the lesser of two evils.
EDIT 2
Wrt the price, $199 might be pushing it, but they could go for $249 and use the same $49 rebate strategy MSFT used every (or almost every) year this gen during the Christmas season.
Thanks. I don't like that 192GB/s figure; it's exactly the bandwidth provided by the 256-bit bus in the GTX 680... sounds suspicious to me (especially as that rumor was pretty much contemporary with the GTX 680 launch). AMD parts using a 256-bit bus have less bandwidth to play with, at least in today's products.
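A quick check of where that figure comes from (6.0Gbps is the GTX 680's GDDR5 data rate; the 192-bit row is just for comparison):

```c
#include <stdio.h>

/* Peak bandwidth in GB/s = bus width in bytes * effective data rate per pin. */
static double bandwidth_gbs(int bus_bits, double gbps_per_pin)
{
    return bus_bits / 8.0 * gbps_per_pin;
}

int main(void)
{
    printf("256-bit @ 6.0Gbps GDDR5: %.0f GB/s\n", bandwidth_gbs(256, 6.0)); /* 192 */
    printf("192-bit @ 6.0Gbps GDDR5: %.0f GB/s\n", bandwidth_gbs(192, 6.0)); /* 144 */
    return 0;
}
```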
If there is truth to it, I hope that it's a misunderstanding of a 192-bit bus; that's if Sony uses Jaguar cores instead of Piledriver ones. If it's the latter, while I don't like(*) the idea, they would still have room for the 256-bit bus after a hypothetical shrink.
*I mean I would expect decently binned parts for a two-module Piledriver at a sane speed to have a TDP around 65 watts, more at peak. The HD 7850 (really close to 18 CUs at 800MHz; it's 16 CUs @ 860MHz) by itself has been measured by hardware.fr at just above 100 watts.
As we're speaking of a SoC, with a memory controller clocked even higher, that's a lot of heat; it should get close to 200 watts. Damn, that's getting close to a George Foreman grill... for real.
Either way, if the TDP is enforced by some power-control feature, the CPU and the GPU will constantly fight, if not for resources then for TDP headroom.
Then there is the die size: it would be quite a beefy SoC.
I think it's unlikely that such a set-up happens (Piledriver + Pitcairn-class GPU on a SoC).
Despite having a 256-bit memory bus, the memory controllers on Pitcairn are quite small relative to the total die, according to die photos (Tahiti, however, is another matter). There should still be space around the perimeter of a future die-shrunk GPU/APU to fit them in, especially if there is a sizeable CPU component on the same die.
I'm not an expert on these matters though; there could still be problems routing traces from the memory controllers to the board on a die-shrunk GPU/APU.
I wouldn't mind making the thread, but it seems like the specs leaked by bg's source are the only ones that matter.
Also, I agree with most of what you said but are we sure they are going to be starting off using a SoC? I always imagined they would start off with discrete chips and move to a SoC down the line for cost reduction.
It just depends on the sizes of the chips. If they are going with a Jaguar-based chip, the CPU will be small, and I doubt the GPU will be bigger than Pitcairn, so it might just be more economical to integrate everything from the get-go.
The 4-core Jaguar chip supposedly only supports one 64-bit DDR channel, and the 8-core might be able to support a 128-bit interface (I think the limit will be the physical pinout); 192-bit or more would probably be problematic.
I don't think a 128-bit bus with slow DDR3/4 RAM would provide enough bandwidth, so I think they need a bigger bus. Now, this assumes they are going with a unified memory approach.
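Rough numbers behind that bandwidth concern (DDR3-1600 is an assumed speed, just to show the order of magnitude; a Pitcairn-class card normally gets well over 100GB/s):

```c
#include <stdio.h>

int main(void)
{
    /* One 64-bit DDR3 channel: 8 bytes per transfer * transfer rate. */
    const double ddr3_1600_gt_s = 1.6;                   /* GT/s      */
    const double per_channel    = 8.0 * ddr3_1600_gt_s;  /* 12.8 GB/s */

    printf("1 channel  (64-bit)  DDR3-1600: %4.1f GB/s\n", 1 * per_channel);
    printf("2 channels (128-bit) DDR3-1600: %4.1f GB/s\n", 2 * per_channel);
    return 0;
}
```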
The CPU: The Wii U's IBM-made CPU is made up of three Power PC cores. We've been unable to ascertain the clock speed of the CPU (more on this later), but we know out of order execution is supported.
RAM in the final retail unit: 1GB of RAM is available to games.
GPU: The Wii U's graphics processing unit is a custom AMD 7 series GPU. Clock speed and pipelines were not disclosed, but we do know it supports DirectX 10 and shader 4 type features. We also know that eDRAM is embedded in the GPU custom chip in a similar way to the Wii.
You're thinking of their article on DaE, not Eurogamer itself saying anything. That is only as good as your trust in DaE, who later changed their claim to say it was an AMD GPU (but still an Intel CPU) anyway.