Is PS4/XONE CPU a bottleneck?

So now you're saying a CPU refers to whatever single die contains CPU cores, or what?
The CPU is the general-purpose processor, with one or more cores, that runs the main functional code and sends jobs to other ancillary support processors if there are any. In a mobile SoC, the CPU is a part of the die. In a typical PC, it's a standalone die. In PS3, the CPU is the Cell BBE, which is a heterogeneous multicore architecture. The PPE isn't a CPU, and neither is an SPE. Take a PPE out and put it on its own and it'd probably be a PPC CPU. Take an SPE out on its own and it likely won't be a CPU, because it needs another processor to get it going. But if it's doing the main work and some little ARM just gets it started, if it's the single major processor in the device running all the code after startup, then by definition it'd be the CPU (a definition born in a period when there was one processor and a load of ICs, maybe for controlling a beeper etc.).

This discussion wasn't about the CPU though, but about 'general purpose'. Allandor's accusation is that SPUs were limited in the jobs they could do because they weren't general-purpose processors. That assumes solutions to problems are tied to particular implementations, and that where a processor can't handle those algorithms effectively, it can't handle the problems effectively. But we know that in software pretty much any problem can be redesigned around any processing paradigm, even to the point of representing databases as textures so a GPU can process them. The fact that some algorithms didn't fit SPUs at all well doesn't mean SPUs couldn't solve those problems; they just weren't good at executing the common solutions of the day. If the evolution of microprocessors had gone from Z80 to Cell instead of from Z80 to x86, the evolution of software would be wildly different, the common algorithms would all be Cell-focussed, and we'd lament the lack of real performance economy from x86.

The problem with Cell was not that it couldn't handle some tasks, but that reinventing solutions to those tasks so they'd suit Cell just wasn't worth the effort. Let the computer science theoreticians come up with crazy vectorised polynomial representations of AI entities; game devs need to use finite state machines and Lua script or whatever, and need processors that can handle these well, with IFs aplenty.
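For a concrete (if contrived) illustration of the kind of branchy, scalar game code that fits a conventional CPU far better than an SPU, here's a minimal C++ sketch; the states and thresholds are invented for the example, not taken from any real engine:

```cpp
#include <vector>

// Hypothetical example: a branchy, scalar AI update loop of the sort game code
// tends to be written as. Every entity takes a data-dependent branch and the
// loop is serial, which suits a big core with branch prediction far better
// than an in-order SIMD core like an SPU.
enum class State { Idle, Chase, Attack, Flee };

struct Enemy {
    State state;
    float health;
    float distanceToPlayer;
};

void updateEnemy(Enemy& e) {
    switch (e.state) {                                   // unpredictable, data-dependent branch
        case State::Idle:
            if (e.distanceToPlayer < 50.0f) e.state = State::Chase;
            break;
        case State::Chase:
            if (e.distanceToPlayer < 5.0f)  e.state = State::Attack;
            break;
        case State::Attack:
            if (e.health < 10.0f)           e.state = State::Flee;
            break;
        case State::Flee:
            if (e.health > 50.0f)           e.state = State::Idle;
            break;
    }
}

void updateAll(std::vector<Enemy>& enemies) {
    for (Enemy& e : enemies) updateEnemy(e);             // serial, branch-heavy: "IFs aplenty"
}
```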

However, a processor lacking good branch prediction and convenient memory access does not stop being a central processing unit nor a general purpose processor. It just makes it a bitch to use. ;)
 
Moving forward, and assuming the next-generation Xbox and PS5 again use AMD tech and that the CPU is once again part of an APU/SoC along with the GPU and everything else, what might a near-ideal core look like in terms of capability and clock speed?
 
What's the deal with Nick Baker and his 48 operations for the CPU?

Did he use fuzzy math?

 
tunafish said:
A Jaguar core can issue 6 ops per cycle: two memory, two int ALU, and two FPU/SIMD ops. However, it can only decode two and retire two, so the average rate cannot ever go above two.

The primary reason for the 6-wide issue is that between decode and retire the separate pipelines are effectively independent, so there's no reason they shouldn't be able to issue in parallel. Being able to issue more than you can sustain in the long term does have some advantages, though: a bad data stall will halt execution until the data arrives, while the front end can happily run ahead until all the queues and buffers are full. When the data arrives, being able to issue more ops means you can drain those queues, so retire can catch up again before the next stall.

So basically, having more issue resources means that you can get closer to the max of 2 in practice.

Of course, picking the number 6 in particular is just the kind of "Xbox 360 GPU has X TFLOPS" marketing we all know and love from the past gen. There were many numbers that made varying amounts of sense; this one got picked because it's the biggest.

However, I wouldn't be surprised if you-know-who digs up dubious links, crazy logic jumps, charts, documents, graphs, images and patents showing that they redesigned Jaguar so it really is a true 6-IPC design, and then resorts to NDA explanations for why nobody can talk about the superior CPU, and/or some strategic, legal or technical reason they haven't allowed developers full access to the true 6-IPC capability.
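To put rough numbers on that, here is a tiny back-of-the-envelope sketch. The assumption that the "48 operations" figure is simply 8 cores × 6 issue ports is my reading of where the marketing number comes from, not something confirmed in the talk:

```cpp
#include <algorithm>
#include <cstdio>

// Back-of-the-envelope sketch of peak vs. sustained throughput for an
// 8-core Jaguar CPU. The "48 ops = 8 cores x 6 issue ports" reading is an
// assumption for illustration, not a quote from the architects.
int main() {
    const int cores       = 8;
    const int issuePorts  = 6;  // 2 int ALU + 2 memory + 2 FP/SIMD per core
    const int decodeWidth = 2;  // x86 instructions decoded per cycle per core
    const int retireWidth = 2;  // x86 instructions retired per cycle per core

    const int peakOps   = cores * issuePorts;                          // 48
    const int sustained = cores * std::min(decodeWidth, retireWidth);  // 16

    std::printf("peak micro-op issue : %d per clock\n", peakOps);
    std::printf("sustained x86 limit : %d instructions per clock\n", sustained);
    // The extra issue width mainly helps drain queued work after a stall;
    // long-run throughput stays capped by the 2-wide decode/retire.
    return 0;
}
```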
 
Perhaps it is, but I didn't think that was something people were actually debating. This is all semantics really; if you really want to stretch 'CPU' to cover anything Turing complete, then we can start including a whole lot of things.
But the way I see it, the sticking point here is less the "processing unit" part and more the "central" part, which entails a certain level of architectural capability to support "normal" software. The SPE doesn't fall under that category.
I would say the general lack of peer status for the slave GPU would be one of the primary reasons why it isn't considered a CPU. Its compute units might meet a general definition of being a core, but in many matters regarding a modern computing system, from interfacing with IO, interrupts, faults, privileged operations, and such, the GPU must give up and let the CPU take over.
The way the current shared memory works, the GPU is architecturally kept as a guest in the virtual memory system, which seems to be the only acceptable place for it in the eyes of system implementors.

Some elements might change someday, if the GPU or some new subunit can at least handle things like disk paging without hanging the whole system and needing the CPU to drop what it's doing in order to clean up after the GPU.

Cell's security hierarchy was stranger than that, particularly with an SPE actually able to burrow down to a privilege level that the rest of the system wasn't permitted to see.

As far as Turing completeness goes, from a more practical standpoint the lack of QoS and preemption makes the current consoles theoretically capable of running kernels with indeterminate run times, but very, very uncomfortable about it, and the GPU is liable to have its work shut down if shaders don't remain conveniently short-lived. So I suppose they could be considered Turing-complete, with the proviso that a vast range of normal-looking programs are likely to get the GPU shut down.

What's the deal with Nick Baker and his 48 operations for the CPU?
Did he use fuzzy math?
It's pretty straightforward.
A Jaguar core has 2 integer pipes, 2 memory pipes, and 2 FP pipes, each of which is capable of issuing an internal instruction per clock.
 
It's pretty straightforward.
A Jaguar core has 2 integer pipes, 2 memory pipes, and 2 FP pipes, each of which is capable of issuing an internal instruction per clock.


Too bad nobody was able to program it right. It ended up being the worst 5th-gen console in both performance and games. It's probably the only console you could say is harder to program for, next to the Cell processor, though I doubt the Cell is hard to program; it was more of a RAM issue with the separate pools. Sony should have gone with a RISC-based processor for the PS3 to emulate PS2 and PS1 games, but they probably didn't do that so PCs can't emulate them.
 
Too bad nobody was able to program it right. It ended up being the worst 5th-gen console in both performance and games. It's probably the only console you could say is harder to program for, next to the Cell processor, though I doubt the Cell is hard to program; it was more of a RAM issue with the separate pools. Sony should have gone with a RISC-based processor for the PS3 to emulate PS2 and PS1 games, but they probably didn't do that so PCs can't emulate them.

What!?
 
It's probably the only console you could say is harder to program for, next to the Cell processor, though I doubt the Cell is hard to program; it was more of a RAM issue with the separate pools.
Cell had two challenges, only one of which was technical. Prior to multi-core hardware, game loops and problem-solving code used to be linear, but with Cell you needed to rewrite code to break problems into smaller parallelisable jobs, then work out how to feed Cell those jobs and, more importantly, the data those jobs needed.
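As a minimal sketch of what that rewrite looks like in practice (the JobQueue type and chunk size below are invented for illustration; on Cell you would additionally DMA each chunk's data into SPU local store before kicking the job):

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Illustrative job decomposition of a formerly linear update loop.
// The JobQueue here is a stand-in for whatever job/worker system a real
// engine (or the SPU runtime) would provide.
struct JobQueue {
    std::vector<std::function<void()>> jobs;
    void submit(std::function<void()> job) { jobs.push_back(std::move(job)); }
    void runAll() { for (auto& j : jobs) j(); }  // real code would run these on workers/SPUs
};

void updateParticle(float& p) { p += 1.0f; }     // placeholder per-element work

void updateAllParticles(std::vector<float>& particles, JobQueue& queue) {
    const std::size_t chunk = 1024;              // sized so each job's data set stays small
    for (std::size_t begin = 0; begin < particles.size(); begin += chunk) {
        const std::size_t end = std::min(begin + chunk, particles.size());
        queue.submit([&particles, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                updateParticle(particles[i]);    // each job touches only its own slice
        });
    }
    queue.runAll();
}
```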

Sony should have gone with a RISC-based processor for the PS3 to emulate PS2 and PS1 games, but they probably didn't do that so PCs can't emulate them.

Cell is a RISC micro-architecture. So.. job done :yep2:
 
Too bad nobody was able to program it right.

By your comment, you must be a senior software engineer - am I right?

It's probably the only console you could say is harder to program for, next to the Cell processor, though I doubt the Cell is hard to program.

Here you get all the guys in the world who dealt with Cell ROFLing, as well as all the guys who used both (especially current gen) asking for a TSO (an Italian acronym...) on you.

Sony should have gone with a RISC-based processor for the PS3

...start here:
https://www-01.ibm.com/chips/techlib/techlib.nsf/products/Cell_Broadband_Engine
 
And we can see that in action in Killzone 2/3 vs Shadow Fall comparisons.

You can see that, in terms of ragdoll physics and hit detection, the Killzone games are more impressive on PS3 than on PS4.
Still... Killzone 4 is of course more of a burden on the CPU in other regards than Killzone 2/3 were on PS3...

Killzone 4??!! There's a Killzone 4?

The interviews with GG mentioned it wasn't given a number; a Killzone 4 hasn't happened, or at least hasn't been released yet.

Killzone Shadow Fall (unlike Killzone 1, 2, and 3) is a mere launch title. Despite all those "easy to dev for" marketing claims, it's still a launch title, and launch titles usually have one purpose: to be part of the launch lineup, in the hope that gamers buy it, like it, and that it gets sequels.

Assuming that either the PS4's or Xbox One's Jaguar CPU cores are bad at A.I. is foolish.

On average a proper game engine and game takes two to three or so years to make. I hope you noticed how neither Halo 3 nor Killzone 2 were launch titles and how long it took for them to materialize.

Also note that Killzone SF (initially) dropped features from previous games while adding new ones in both single-player and online MP... when a proper Killzone sequel is released we'll see a huge difference in A.I., features and perhaps even graphics.

To add to the CellBE discussion, there's the old Killzone 2 "42 minute" presentation interview where they list how they used the SPUs. Note that there actually was a graphics upgrade in Killzone 3, and on top of that they added Sony's in-house MLAA, which runs on the SPUs and has since been used by other PS3-specific games. MLAA frees up RSX to render more or higher polygon counts, or at least get closer to its practical limits.

Also keep in mind that since Nvidia's G80 and G92, overall GPU power doubled over the G70-based RSX and even Xenos.

AMD had a golden opportunity contract... they got it, and their solution, as weak as we perceive it, is still more than last gen... the gamble is that the PS4's 1.8 TF GPU, plus having more than 4 GB (8 GB), will obviously allow it to take over tasks that the PS3 overall had to do, so we just have to wait and see until next year, just like the previous gen's ramp-up.

Thinking of these consoles or last gen consoles as compared to PCs is just foolish PC gamer marketing hype mentality.

How long did it take for quad core CPUs to become the minimum requirements?

PCs may have great power, but why didn't DX10 become a standard minimum requirement, and why did it take Crytek so many years to make CE3? Or their competition?

I believe a marketing rep claimed on a late-night show that the PS4 or Xbox One was "ten times more powerful" than last gen, and that footage showed up in the Video Game movie...

Also, thinking Intel would get the contract is kinda foolish. Intel is so advanced and ahead because they supply CPUs not only for mainstream PCs and laptops but also for workstations and enterprise; Intel's main focus is to invest and produce. AMD just went a different route after the Athlon 64/Opteron era when they bought ATI, and frankly, based on budget and TDP, from the agreement's perspective they thought they were right even if we don't believe it.

Console price, timing and unreliable customers prevented another PS3-type upgrade.

At the end of the day it was probably the only realistic choice: AMD probably offered a great deal for CPU+GPU, and Jaguar makes more sense than the Bulldozer-based CPUs for size/power.

I just think it's a shame that they're stuck with it for a while. I mean, Jaguar was pretty immature; the latest refresh (Beema) uses "turbo" for higher ST performance and is more power efficient... people overclocking AM1 can hit 2.5 GHz relatively easily on Kabini Jaguar cores (when the motherboard is good), and Beema can boost to 2.4 GHz.

Now the PS4 is stuck at a fixed 1.6 GHz; if they could boost, say, a couple of cores to 2.4 GHz, it would probably be a significant help?

And I would guess a 20 nm shrink could possibly achieve more... but it's the same situation as last gen, with MS having to simulate a slow FSB. It's a shame.

A 1.6 GHz Jaguar is horrible for a gaming PC, but on the consoles they're going to get all they can out of it and the GPU, and have amazing results, I'm sure.

Just dreaming a little: it would have been cool of Sony to add a die-shrunk Cell as a co-processor to the PS4, for physics and backwards compatibility, imagine that... or maybe it would be a nightmare for the devs, useless and expensive. I'm so clueless :D

As much as clock speeds do affect PCs, AMD was releasing statements that "clock speeds don't matter" about two years ago, IIRC... not sure if that was part of something else (marketing), but the difference between the custom consoles' 1.x GHz clocks and a PC's 2.5 GHz APU CPU core is not going to make a difference, because the consoles are closed boxes with custom coding, even when using older 3D engines.

PCs will just be graced and grateful if a dev/pub decides to make a PC version, where you most likely get higher settings, yet you still had to have paid over $300 for a single graphics card to deliver decent performance... it's just not the same... sure, PCs will have more power, but the cost is higher and a port has to be decided on.

We've barely started this current gen... cool discussion, but still, just being realistic.

I would have preferred that all three had waited for 2013 and 2014 hardware solutions, but they made those decisions way back. The biggest problem is comparing TDP and wattage when choosing parts; otherwise it'd be interesting to see an evolved CellBE 2 or CellBE 3 combined with a GTX 980-class customized RSX-2 as a 2014 PS4, and a comparable AMD solution for Xbox, but the power draw might need big, bulky consoles with a big $600 price tag... if only customers could be relied on to buy, they might have made it... otherwise it's harsh...
 
Too bad nobody was able to program it right. It ended up being the worst 5th-gen console in both performance and games. It's probably the only console you could say is harder to program for, next to the Cell processor, though I doubt the Cell is hard to program; it was more of a RAM issue with the separate pools. Sony should have gone with a RISC-based processor for the PS3 to emulate PS2 and PS1 games, but they probably didn't do that so PCs can't emulate them.

Damn I forgot this one...

CellBE stopped being "hard to program for" a long time ago.
What actual PS3 games did you actually play?

Did you notice how long last gen lasted?

Do you even own or did you actually purchase a PlayStation 3?

I ask because only the first two years of PS3 models had PS2 chips, and perhaps misinformed or unreliable customers, who were probably shifting to the other consoles and fearful of the $600 price, helped cause Sony to cut costs and remove costly components.

PS3s since then are able to play PS3 and PS1 games... BTW, the Xbox 360's last BC update was in 2007. There are still many issues and incompatible games and controllers... the lack of support for that $200 Steel Battalion controller was a huge mistake that nobody complained about, even though both the original Xbox and the 360 use a USB 2.0 controller interface.

You gotta be more careful in what you say... there are actual industry people here. Even if we disagree with hardware choices, we still have to respect the hardware architecture teams and their team leads.

Finally, many PS3 games (contrary to what you imply) improved or offered more features: MLAA, better graphics, 3D, etc. However, it's up to people to buy the games.

I would stay away from most review sites, as they honestly seem to dictate what people should buy, and most of the time their opinions are way too critical when it isn't warranted, and not very informative. Try before you buy, or look at gameplay mechanics and at others who know what they're playing, instead of the negativity that's out there.
 
Since AMD Jaguar is on the table:

AMD A8-6410 (Puma/Beema), 4 cores, 1800 MHz, 28 nm, 8 GB DDR3-1600.
  • L1 Data cache = 32 KB. 64 B/line, 8-WAY. Parity Protected, Write-back. One 128-bit load and one 128-bit store per cycle. Prefetcher.
  • L1 Instruction cache = 32 KB, 64 B/line, 2-WAY, Parity Protected. 32 bytes are fetched in a cycle. On misses, the L1 requests the missed line and 1 or 2 sequential lines (prefetches).
  • L2 cache size = 2 MB. 64 B/line, 16-WAY. ECC protected, Write-back. Shared by up to 4 cores. L2 cache is inclusive of the L1 caches in the cores. The L2 to L1 data path is 16 bytes wide; critical data within a cache line is forwarded first. 4 512-Kbyte banks: bits 7:6 of the cache line address determine banks.
  • 4 KB pages DATA L1 TLB size = 40 items, full-assoc. Miss penalty = 6 cycles.
  • 2 MB pages DATA L1 TLB size = 8 items, full-assoc.
  • 4 KB pages DATA L2 TLB size = 512 items, 4-WAY. Miss penalty = 26 cycles?
  • 2 MB pages DATA L2 TLB size = 256 items, 2-WAY.
  • 4 KB pages Instruction L1 TLB size = 32 items, full-assoc.
  • 2 MB pages Instruction L1 TLB size = 8 items, full-assoc.
  • 4 KB pages Instruction L2 TLB size = 512 items, 4-WAY.
  • Page Directory Cache (PDC): 16 entries.
  • The page table walker supports 1-Gbyte pages by smashing the page into a 2-Mbyte window, and returning a 2-Mbyte TLB entry. In legacy mode, 4-Mbyte entries are also supported by returning a smashed 2-Mbyte TLB entry.
  • L1 BTB: 1024 entries. It is a sparse branch predictor and maps up to the first two branches per instruction cache line (64 bytes).
  • The L2 BTB is a dense branch predictor and contains 1024 branch entries, mapped as up to an additional 2 branches per 8 byte instruction chunk, if located in the same 64-byte aligned block.
  • return address stack (RAS): 16-entry.
  • 2 ALU, LOAD, STORE, 2 * FPU.
  • ALU scheduler: 20-entry
  • Address generation unit (AGU) scheduler: 12-entry
  • The floating-point retire queue: 44 floating-point micro-ops.
  • LS unit: 20-entry store queue
  • L1 Data Cache Latency = 3 cycles for simple access via pointer
  • L1 Data Cache Latency = 4 cycles for access with complex address calculation (size_t n, *p; n = p[n]).
  • L2 Cache Latency = 26 cycles
  • RAM Latency = 26 cycles + 100 ns
2 MB pages mode (64-bit Windows)
  • Data TLB L1 size = 8 items. full assoc. Miss penalty = 6 cycles. Parallel miss: 2 cycle per access
  • Data TLB L2 size = 256 items. 4-WAY. Miss penalty = ? cycles. Parallel miss: ? cycles per access
Size | Latency | Increase | Description
32 K | 3 | |
64 K | 15 | 12 | + 23 (L2)
128 K | 21 | 6 |
256 K | 24 | 3 |
512 K | 25 | 1 |
1 M | 25 | 0 |
2 M | 26 | 1 |
4 M | 26 + 55 ns | + 55 ns | + 100 ns (RAM)
8 M | 26 + 80 ns | + 25 ns |
16 M | 26 + 92 ns | + 12 ns |
32 M | 29 + 98 ns | 3 + 6 ns | + 6 (L1 TLB miss)
64 M | 31 + 100 ns | 2 + 2 ns |
128 M | 32 + 100 ns | 1 |
256 M | 32 + 100 ns | |
512 M | 32 + 100 ns | |

4 KB pages mode (64-bit Windows)
  • Data TLB L1 size = 40 items. full assoc. Miss penalty = 6 cycles. Parallel miss: 2 cycles per access
  • TLB L2 size = 512 items. 4-WAY. Miss penalty = 28? cycles. Parallel miss: 28? cycles per access
Size | Latency | Increase | Description
32 K | 3 | |
64 K | 15 | 12 | + 23 (L2)
128 K | 21 | 6 |
256 K | 26 | 5 | + 6 (L1 TLB miss)
512 K | 29 | 3 |
1 M | 30 | 1 |
2 M | 32 | 2 |
4 M | 49 + 55 ns | 17 + 55 ns | + 100 ns (RAM) + 28 (L2 TLB miss)
8 M | 59 + 80 ns | 10 + 25 ns |
16 M | 70 + 92 ns | 11 + 12 ns |
32 M | 76 + 98 ns | 6 + 5 ns |
64 M | 84 + 100 ns | 8 + 2 ns |

MISC
  • Branch misprediction penalty = 15-16 cycles
  • 16-bytes range cross penalty = 2 cycles
  • L1 B/W (Parallel Random Read) = 1 cycle per access
  • L2->L1 B/W (Parallel Random Read) = 8 cycles per cache line
  • L2->L1 B/W (Read, 64 bytes step) = 8 cycles per cache line
  • L2 Write (Write, 64 bytes step) = 10 cycles per write (cache line)
  • RAM Read B/W (Parallel Random Read) = 19 ns / access
  • RAM Read B/W (Read, 64 Bytes step) = 6800 MB/s
  • RAM Write B/W (Write, 4-64 Bytes step) = 3600 MB/s
Feel free to customize for PS4/XONE...
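For what it's worth, latency-vs-size tables like the one above are usually produced with a pointer-chasing microbenchmark along these lines (a minimal sketch; the working-set size and step count are arbitrary, and this is not the tool behind the quoted numbers):

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Minimal pointer-chasing sketch of how latency-vs-working-set tables are
// usually measured: each load depends on the previous one, so the time per
// step approximates the access latency at that footprint.
int main() {
    const std::size_t bytes = 8u << 20;                    // 8 MiB working set (sweep this)
    const std::size_t n     = bytes / sizeof(std::size_t);

    // Build one big random cycle so the chase visits the whole working set.
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::shuffle(order.begin(), order.end(), std::mt19937_64{42});

    std::vector<std::size_t> next(n);
    for (std::size_t i = 0; i + 1 < n; ++i) next[order[i]] = order[i + 1];
    next[order[n - 1]] = order[0];

    const std::size_t steps = 20'000'000;
    std::size_t idx = order[0];
    auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < steps; ++i)
        idx = next[idx];                                    // serialised dependent loads
    auto stop = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(stop - start).count() / steps;
    std::printf("~%.2f ns per dependent load (idx=%zu)\n", ns, idx);
    return 0;
}
```

Sweeping `bytes` from 32 KB up through hundreds of MB reproduces the shape of the table: flat inside L1, a step at L2, then RAM latency plus TLB-miss costs on top.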
 
Good read for everyone interested in the Jaguar CPU core:
http://www.realworldtech.com/jaguar/

It's pretty straightforward.
A Jaguar core has 2 integer pipes, 2 memory pipes, and 2 FP pipes, each of which is capable of issuing an internal instruction per clock.
Yes... that is the peak rate (of micro ops). However it can only decode two (fastpath single) x86 instructions per clock, meaning that the sustained x86 IPC is never more than 2.
 
They both had a power & heat budget. I think they came out pretty good with the choice they made.

Any mobile Intel CPU would have utterly trashed their "good choice" in both performance and power/heat budget, IMO.
 
So, any thoughts on a real-world performance multiplier for Jaguar at 1.6 GHz over Xenon in the 360, given OoO execution, higher IPC and the other advantages?

We know the new consoles have 10x as much memory (more if you consider OS allocations) and the PS4 has 10x the GPU performance, so it'll be interesting to see how the CPUs stack up.
 