Thoughts on next-gen consoles' CPU: 8x 1.6 GHz Jaguar cores

AMD's first unified shader architecture came on the 360 almost a year and a half before the R600 came to the PC market. I think that if MS wanted a customized CPU and tweaks to the core, AMD would have no problem.

There is no telling how deeply AMD is willing to work with MS, especially when there are always AMD requests to be made on the Windows OS/AMD desktop and laptop side of their relationship.

Well, it's hard to tell exactly how much "extra" effort went into Xenos. After all, AMD was rumored to have a unified shader architecture already in development; however, that was shelved for whatever reason (likely due to no support in DirectX until DX10?) in favor of the traditional VLIW architecture for desktop graphics.

Regards,
SB
 
It means a 256-bit instruction is executed as two 128-bit instructions internally. As both halves usually don't have dependencies on each other, they issue over two clock cycles. The latency for the first half is identical to a 128-bit instruction; the second half has one cycle higher latency. The throughput for 256-bit instructions is half that of 128-bit instructions (i.e. peak flops do not change; it's the same for both instruction types).
AMD simply used "double pumped" with a different meaning than Intel back with the P4. It's not exactly a well-defined technical term.
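A minimal sketch of the arithmetic described above, assuming Jaguar's 128-bit physical FP datapath; the function name and constants are illustrative, not from any official documentation:

```python
FPU_WIDTH_BITS = 128  # assumed physical datapath width per issue port

def sp_flops_per_cycle(instr_bits):
    """Single-precision FLOPs per cycle sustained by one issue port.

    A 256-bit op is cracked into two 128-bit halves that occupy the
    port for two cycles, so per-cycle throughput scales down by the
    number of halves -- and peak FLOPs stays the same.
    """
    halves = instr_bits // FPU_WIDTH_BITS   # 1 for 128-bit, 2 for 256-bit
    sp_lanes = instr_bits // 32             # single-precision lanes per op
    return sp_lanes / halves

# Same peak per cycle either way, as the post says:
assert sp_flops_per_cycle(128) == sp_flops_per_cycle(256) == 4.0
```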
 
He also implied they were working on a compressed timetable due to setbacks, and that the Orbis silicon had sailed smoothly through testing at that point.

Well, Oban is supposed to be a big SoC, and it would be, with ESRAM (just look at how much space the EDRAM takes up on the Wii U die compared to the actual GPU), all these fixed-function blocks, etc. So maybe they were having trouble integrating it all or improving the yields.
 
Well, it's hard to tell exactly how much "extra" effort went into Xenos. After all, AMD was rumored to have a unified shader architecture already in development; however, that was shelved for whatever reason (likely due to no support in DirectX until DX10?) in favor of the traditional VLIW architecture for desktop graphics.
Yeah, Xenos has been said to be essentially the original R400 project that kept getting shelved for PC for some reason(s). Feature-creep or performance problems maybe? D3D9 wasn't a problem.
 
Too different, or you just want to compare paper numbers ?

Wait for Sony to release PS4 Linux and we will run some Cell benchmarks on Jaguar. ;-)
 
So how does Jaguar in Orbis compare to the Cell in PS3?

It may be more realistic to compare Jaguar with the PPE components in the previous gen, the 4 CUs with the SPE components, and the rest of the GPU components with the RSX, if you know what I mean.
 
Just joined to ask the following question regarding the CPU specs that have been leaked.

Is it reasonable to assume the quoted Gflops performance (104 Gflops) is calculated the same way as you would on a PC using LinX, for example?

I just find it very impressive that 8 cores @ 1.6 GHz deliver 13 Gflops per core, considering my Phenom II @ 3.8 GHz gives 12.5 Gflops per core.

More than twice the per-clock performance at roughly 45% of the clock speed is some going.

Or is all this nonsense because a completely different method of Gflops calculation may have been used?
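One way to sanity-check the quoted 104 Gflops figure is the standard peak formula: cores x clock x FLOPs per cycle. The sketch below assumes 8 single-precision FLOPs per cycle per Jaguar core (two 128-bit pipes, one add plus one multiply) -- an assumption, not a confirmed spec -- and note that LinX measures achieved double-precision throughput, so the Phenom II number is not directly comparable:

```python
def peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak: cores x clock x FLOPs issued per core per cycle."""
    return cores * clock_ghz * flops_per_cycle

# Jaguar, assuming two 128-bit SP pipes (add + mul) = 8 FLOPs/cycle/core:
jaguar = peak_gflops(8, 1.6, 8)   # 102.4 -- close to the quoted 104
per_core = jaguar / 8             # 12.8 Gflops per core

# A Phenom II's LinX result is *measured* double-precision throughput,
# so comparing it against a single-precision theoretical peak is indeed
# comparing two different methods of calculation.
print(round(jaguar, 1), round(per_core, 1))
```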
 
If compared to the PPE, Jaguar should win in pretty much all but a few corner cases where straight-line code with limited instruction parallelism allows the higher base clock to matter. However, the PPE's many weaknesses are such that even that might not save it for many items in that tiniest of subsets.

In terms of overall cache/LS memory capacity, Cell and two Jaguar modules are roughly equivalent.

The SPE array has higher FP throughput, and while the exact bandwidths and latencies of Jaguar's L2 and the ability of the two modules to communicate are not entirely clear, I think the EIB can probably win out significantly in bandwidth for large data transfers.
However, in terms of outstanding requests, it seems possible that Jaguar can do better, with a larger number of small transactions in flight and better snooping capability.
The communication capabilities of the interconnect are for the most part automatic on Jaguar, but require more explicit control on Cell.

There are areas where the SPEs, and 7 of them, could still do things Jaguar cannot match, but there are likely a lot of areas outside of the SPEs' comfort zone where Jaguar can win out despite lower straight-line speed and less peak FP.
A big chunk of the workloads the SPE + ring bus has been used for are not going to be relying on Jaguar in the next-gen consoles, however. Various specialized accelerators and the GPU are going to come into play.
 
So how big a jump are we talking, then, from the PS3's Cell to this Jaguar? Performance isn't multiple times better?
 
Performance isn't multiple times better?
It is, except for the (corner) cases where it isn't. :LOL:

Cell just has a single PPE (the only "real" CPU core). In general purpose integer code where the SPEs can't be used, 8 Jaguar cores run circles around Cell.
 
It's going to depend on the workload, and it's going to be hard to give anything but hand-wavey responses without information.

In anything that was throttled by the weak PPE, it's safe to assume that Jaguar is going to be much better, perhaps several times better in single-threaded code.
In multi-threaded PPE code, potentially an order of magnitude better in the best case.
For things the SPEs were used for, but not particularly well, Jaguar could match or beat them.

In terms of interconnect bandwidth and peak FLOPs, Jaguar, in isolation from the GPU and dedicated accelerators, might lag by half or more. If something uses the SPEs very well, Cell still challenges desktop processors with far more resources than Jaguar.

However, once the dedicated hardware and GPU are factored in, it's going to be a slim range of items that remain in Cell's favor, and that's not enough to change the overall picture.
In other respects, Jaguar's tolerance for less-than-ideal code, superior branch prediction, actual power management, and generally accommodating memory pipeline should make it much more consistent across a broad spectrum of software.
 
In general purpose integer code where the SPEs can't be used...
Really?! You honestly want to claim that SPEs can't do anything other than floating-point and vector work? They can run any workload at all (like any CPU), with different focuses and efficiencies, and requiring different, suitable data structures. A great many 'integer' code operations (decompression, for example, or sorting) are a perfect fit or have been re-engineered into jobs that are a good fit. I suggest you go read up on what Cell's integer capabilities really are.
 
It's certainly true that 90% of the codebase in a typical game will benefit to a greater or lesser extent from OOE. The areas of the code which we've optimised for in-order execution, SIMD, etc. are, however, the areas where the CPU spends the majority of its time: rendering, physics, animation, pathing, particles, etc.

I'm in no way against OOE units; however, on my hardware wishlist it's low down compared to achievable bandwidth sufficient to keep all cores busy. The better the core is at chewing through instructions, the better the bandwidth needs to be; there's no point in having amazing cores data-starved half the time.

x64 is a double-edged sword: pointers will take up even more valuable cache-line space than they already do. Of course, we're all trying to use indices instead of pointers already, right? ;)
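The cache-line arithmetic behind that point can be sketched with Python's struct module; the 64-byte line size is the typical x86 figure, assumed here:

```python
import struct

# Typical x86 cache-line size (assumed; not queried from the hardware).
CACHE_LINE = 64

ptr_size = struct.calcsize('P')   # native pointer: 8 bytes on a 64-bit build
idx_size = struct.calcsize('I')   # 32-bit unsigned index: 4 bytes

# On x64 a cache line holds half as many pointers as 32-bit indices,
# which is exactly why indices are friendlier to the data cache.
print(CACHE_LINE // ptr_size)     # pointers per line (8 on x64)
print(CACHE_LINE // idx_size)     # indices per line (16)
```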

OOOE helps you get better bandwidth by increasing the amount of MLP (memory-level parallelism) available. Look at how many outstanding misses Jaguar can sustain; that's probably what you want to know (as well as L2 bandwidth).

David
 
Really?! You honestly want to claim that SPE's can't do anything other than floating point and vector work?
No. Those were two separate conditions. Sorry if that wasn't clear.
But on non-vectorizable code, a great deal of the SPEs' advantages are lost in any case. Then you are left with a relatively slow in-order core which has direct access only to its 256 KB local store, which complicates things a lot.
 
Thanks for the answers. I guess one thing is that the Jaguar in Orbis won't have to assist the GPU anymore, as it did in the PS3, which will free it up.
 
No. Those were two separate conditions. Sorry if that wasn't clear.
But on non-vectorizable code, a great deal of the SPEs' advantages are lost in any case. Then you are left with a relatively slow in-order core which has direct access only to its 256 KB local store, which complicates things a lot.

Cell is designed as a media CPU. If you're not dealing with arrays or streams of data (integer or floating point), you should use a regular CPU. The original idea was to have more PPU cores for general-purpose use.

When Orbis is released (and if a 22nm Cell is truly under development, as some have claimed), it will be fun to revisit the comparison. ^_^
 
Thanks for the answers. I guess one thing is that the Jaguar in Orbis won't have to assist the GPU anymore, as it did in the PS3, which will free it up.

Again, I think you should see the Jaguar as a successor to the PPE component only, and the 4 CUs as a successor to the SPE component. The Jag has 3-4x the Gflops of the PPE and should be quite a lot more efficient in most tasks. 4 CUs should amount to almost 2x the SPEs, and will be more efficient in most tasks as well. The 14 CUs for graphics deliver about 3x the RSX Gflops but, again, should be quite a lot more efficient. Every component has access to the full 172 GB/s of bandwidth, instead of 25 GB/s for RSX and 25 GB/s for Cell. And the memory pool appears to be a unified 4 GB, which is already 8x the memory of the PS3, which also had a split memory pool that made it harder to use fully.
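A hedged back-of-envelope check of those ratios. The ~800 MHz GCN clock is an assumption (a commonly rumoured Orbis figure), and the RSX peak in particular varies with what you count, so treat these as rough sanity checks rather than authoritative numbers:

```python
GCN_CLOCK_GHZ = 0.8                   # assumed Orbis GPU clock
FLOPS_PER_CU_PER_CYCLE = 64 * 2       # 64 SP lanes per CU, FMA = 2 FLOPs

def cu_gflops(n_cus):
    """Peak single-precision Gflops for a group of GCN compute units."""
    return n_cus * FLOPS_PER_CU_PER_CYCLE * GCN_CLOCK_GHZ

spe_peak = 7 * 25.6                   # 7 usable SPEs @ 3.2 GHz, 8 FLOPs/cycle
rsx_peak = 400.0                      # an often-quoted RSX total; disputed

print(cu_gflops(4) / spe_peak)        # ~2.3x the SPE array from 4 CUs
print(cu_gflops(14) / rsx_peak)       # ~3.6x RSX from 14 CUs
```

Both come out in the same ballpark as the "almost 2x" and "about 3x" figures above, given the uncertainty in the inputs.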

So my estimate is that, in total, Orbis should be at least 8x as capable as the PS3, and because the system is far closer to a PC, I expect far more developers to get close to its maximum potential than was the case for the PS3.

But we'll find out more soon hopefully ;)
 
Cell is designed as a media CPU. If you're not dealing with arrays or streams of data (integer or floating point), you should use a regular CPU. The original idea is to have more PPU cores for general purpose use.
If you scroll up a bit, it was my point that the SPEs are not really suited to usual general purpose code. We agree. ;)
 