PlayStation 4 (codename Orbis) technical hardware investigation (news and rumours)

3dilettante · Jun 26, 2013

AMD tended to increase memory latency in tandem with increases in the OoO and latency-hiding capabilities of the cores, with varying levels of success.

Jaguar's resources are more limited, which means it shouldn't be expected to compensate for the same kind of latencies that Bulldozer or Trinity are expected to deal with.
The cycle numbers for Durango's L2-to-L2 cache hits point to a memory pipeline that is likely at least somewhat longer in latency than Trinity's, though we don't know whether the remote snoop is additive to main memory latency.

It may be that the much lower clocks are what AMD is counting on to help mitigate the IPC losses due to trips to memory.

As for the comment concerning how AMD arbitrates under load: I haven't run across a review that runs a cache/memory latency benchmark while the GPU is also running at speed. The latency benchmarks for APUs tend to measure CPU memory latency when the GPU isn't fighting for access, and even those results are mediocre.

sebbbi · Jun 26, 2013

3dilettante said:
prefetching across the range of modern cores is good enough to make software prefetching a net loss in the general case.

I have come to the same conclusions in my own tests. Software prefetching only increases performance in very specific use cases on modern CPUs. But good data structure design is more important than ever. For example, using bucketed lists instead of simple pointer lists yields up to 10x gains even on Sandy/Ivy Bridge CPUs. It's still up to the programmer to keep payload per cache miss as high as possible. The more instructions you have between two potential cache misses, the better.
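To make the "payload per cache miss" point concrete, here is a minimal sketch of a bucketed (unrolled) list. The names `Bucket`, `BucketedList` and `kBucketCapacity` are made up for illustration, and the bucket size is an assumption chosen so that one node is typically 64 bytes on a 64-bit target. It is not the code from the tests mentioned above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <memory>

// Assumed bucket size: 12 four-byte ints plus a count and a next pointer is
// typically 64 bytes on a 64-bit target. Tune it for the real payload type.
constexpr std::size_t kBucketCapacity = 12;

// One node carries many items, so a single miss on the node brings in
// several payload values instead of one value and one pointer.
struct Bucket {
    std::array<int, kBucketCapacity> items{};
    std::size_t count = 0;
    std::unique_ptr<Bucket> next;
};

class BucketedList {
public:
    BucketedList() = default;
    BucketedList(const BucketedList&) = delete;
    BucketedList& operator=(const BucketedList&) = delete;

    ~BucketedList() {
        // Unlink iteratively so a long chain does not recurse in destructors.
        while (head_) head_ = std::move(head_->next);
    }

    void push_back(int value) {
        if (!tail_ || tail_->count == kBucketCapacity) {
            auto fresh = std::make_unique<Bucket>();
            Bucket* raw = fresh.get();
            if (tail_) tail_->next = std::move(fresh);
            else head_ = std::move(fresh);
            tail_ = raw;
        }
        assert(tail_->count < kBucketCapacity);
        tail_->items[tail_->count++] = value;
        ++size_;
    }

    // Pointer chasing happens once per bucket; the inner loop is linear.
    long long sum() const {
        long long total = 0;
        for (const Bucket* b = head_.get(); b; b = b->next.get())
            for (std::size_t i = 0; i < b->count; ++i)
                total += b->items[i];
        return total;
    }

    std::size_t size() const { return size_; }

    std::size_t bucket_count() const {
        std::size_t n = 0;
        for (const Bucket* b = head_.get(); b; b = b->next.get()) ++n;
        return n;
    }

private:
    std::unique_ptr<Bucket> head_;
    Bucket* tail_ = nullptr;
    std::size_t size_ = 0;
};
```

Pushing 100 values into this list allocates 9 buckets, so a full traversal follows 9 pointers where a one-item-per-node list would follow 100. Any actual speedup depends on the payload, the allocator and the access pattern.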

3dilettante said:
It may be that the much lower clocks are what AMD is counting on to help mitigate the IPC losses due to trips to memory.

Bobcat/Jaguar have roughly half the clock ceiling of BD/PD. This should definitely help with memory latency (the same latency in nanoseconds costs half as many cycles). I remember that in some old latency chart Bobcat scored memory latency results (in cycles) comparable to Sandy Bridge (Sandy of course had around 2x higher clocks).
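The clock argument is plain arithmetic, sketched below. The function name and the three constants are hypothetical, and the 100 ns figure and both clock speeds are illustrative values, not measurements of any of the chips discussed.

```cpp
#include <cassert>

// Convert a memory latency measured in nanoseconds into core clock cycles.
// A clock of N GHz ticks N times per nanosecond, so cycles = ns * GHz.
inline double latency_in_cycles(double latency_ns, double clock_ghz) {
    assert(latency_ns >= 0.0 && clock_ghz > 0.0);
    return latency_ns * clock_ghz;
}

// Illustrative values only (assumptions, not benchmark results).
constexpr double kExampleLatencyNs = 100.0;
constexpr double kLowClockGhz = 1.5;
constexpr double kHighClockGhz = 3.0;
```

With these made-up numbers, the same 100 ns trip to memory costs 150 cycles on the 1.5 GHz core and 300 cycles on the 3.0 GHz core, so the slower core loses fewer issue slots per miss even though the wall-clock wait is identical.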

Love_In_Rio · Jun 26, 2013

sebbbi said:
I have come to the same conclusions in my own tests. Software prefetching only increases performance in very specific use cases on modern CPUs. But good data structure design is more important than ever. For example, using bucketed lists instead of simple pointer lists yields up to 10x gains even on Sandy/Ivy Bridge CPUs. It's still up to the programmer to keep payload per cache miss as high as possible. The more instructions you have between two potential cache misses, the better.

Bobcat/Jaguar have roughly half the clock ceiling of BD/PD. This should definitely help with memory latency (the same latency in nanoseconds costs half as many cycles). I remember that in some old latency chart Bobcat scored memory latency results (in cycles) comparable to Sandy Bridge (Sandy of course had around 2x higher clocks).

Do you like Jaguar cores? In some of your old posts I think it looked like that. Maybe you can't even talk about this, but in case you can...

sebbbi · Jun 26, 2013

Love_In_Rio said:
Do you like Jaguar cores? In some of your old posts I think it looked like that.

Jaguar is a huge jump in performance compared to its predecessor (Bobcat) and compared to the current lineup of (in-order) Atom chips. It's a tiny modern CPU core with very good performance per watt. AMD has gotten a lot of design wins lately (lots of laptops and tablets with Kabini/Temash SoCs inside them). However, we have to wait for Intel's forthcoming (Q4?) Silvermont Atom in order to draw long-lasting conclusions about Jaguar's relative strengths (and weaknesses) in the market. Haswell is also pushing down the TDP of Intel's main line of processors (at 15W nothing can compete against it in performance). ARM is scaling up performance at a rapid pace as well. A15 is a very nice core. Its NEON SIMD hardware is four wide and it processes most vector instructions at full rate (including multiply add). And it has all the modern CPU core goodies (deep OoO execution, automatic cache prefetch, robust store forwarding, etc). My previous experience on ARM cores was Nokia N-Gage, and it didn't even have a hardware floating point unit.

After spending almost 6 years in in-order PPC programming, it feels refreshing to program all these modern out-of-order cores. I don't have full control anymore, but at least I don't have to analyze other people's code and solve their LHS stalls and cache misses. Writing efficient code is so much easier on these modern CPU cores.

At the beginning I was actually quite surprised how well poorly optimized code ran on my new Intel based workstation. With these new CPUs and new compilers you can actually use function calls (small functions without inlining, virtual functions), exceptions (not throwing), actual loops (no unrolling) and (predictable) branches with practically no performance cost. These new Xeons have 8 cores (16 threads), huge L3 caches and quad channel memory controllers, so there's plenty of raw performance under the hood. And that's very good for productivity. Optimizing your content production tools is as important as optimizing the final game product.

french toast · Jun 26, 2013

sebbbi said:
Jaguar is a huge jump in performance compared to its predecessor (Bobcat) and compared to the current lineup of (in-order) Atom chips. It's a tiny modern CPU core with very good performance per watt. AMD has gotten a lot of design wins lately (lots of laptops and tablets with Kabini/Temash SoCs inside them). However, we have to wait for Intel's forthcoming (Q4?) Silvermont Atom in order to draw long-lasting conclusions about Jaguar's relative strengths (and weaknesses) in the market. Haswell is also pushing down the TDP of Intel's main line of processors (at 15W nothing can compete against it in performance). ARM is scaling up performance at a rapid pace as well. A15 is a very nice core. Its NEON SIMD hardware is four wide and it processes most vector instructions at full rate (including multiply add). And it has all the modern CPU core goodies (deep OoO execution, automatic cache prefetch, robust store forwarding, etc). My previous experience on ARM cores was Nokia N-Gage, and it didn't even have a hardware floating point unit.

After spending almost 6 years in in-order PPC programming, it feels refreshing to program all these modern out-of-order cores. I don't have full control anymore, but at least I don't have to analyze other people's code and solve their LHS stalls and cache misses. Writing efficient code is so much easier on these modern CPU cores.

At the beginning I was actually quite surprised how well poorly optimized code ran on my new Intel based workstation. With these new CPUs and new compilers you can actually use function calls (small functions without inlining, virtual functions), exceptions (not throwing), actual loops (no unrolling) and (predictable) branches with practically no performance cost. These new Xeons have 8 cores (16 threads), huge L3 caches and quad channel memory controllers, so there's plenty of raw performance under the hood. And that's very good for productivity. Optimizing your content production tools is as important as optimizing the final game product.

Awesome description

Love_In_Rio · Jun 26, 2013

sebbbi said:
Jaguar is a huge jump in performance compared to its predecessor (Bobcat) and compared to the current lineup of (in-order) Atom chips. It's a tiny modern CPU core with very good performance per watt. AMD has gotten a lot of design wins lately (lots of laptops and tablets with Kabini/Temash SoCs inside them). However, we have to wait for Intel's forthcoming (Q4?) Silvermont Atom in order to draw long-lasting conclusions about Jaguar's relative strengths (and weaknesses) in the market. Haswell is also pushing down the TDP of Intel's main line of processors (at 15W nothing can compete against it in performance). ARM is scaling up performance at a rapid pace as well. A15 is a very nice core. Its NEON SIMD hardware is four wide and it processes most vector instructions at full rate (including multiply add). And it has all the modern CPU core goodies (deep OoO execution, automatic cache prefetch, robust store forwarding, etc). My previous experience on ARM cores was Nokia N-Gage, and it didn't even have a hardware floating point unit.

After spending almost 6 years in in-order PPC programming, it feels refreshing to program all these modern out-of-order cores. I don't have full control anymore, but at least I don't have to analyze other people's code and solve their LHS stalls and cache misses. Writing efficient code is so much easier on these modern CPU cores.

At the beginning I was actually quite surprised how well poorly optimized code ran on my new Intel based workstation. With these new CPUs and new compilers you can actually use function calls (small functions without inlining, virtual functions), exceptions (not throwing), actual loops (no unrolling) and (predictable) branches with practically no performance cost. These new Xeons have 8 cores (16 threads), huge L3 caches and quad channel memory controllers, so there's plenty of raw performance under the hood. And that's very good for productivity. Optimizing your content production tools is as important as optimizing the final game product.

Thank you for your viewpoint!

Aeoniss · Jun 27, 2013

Sebbbi, what do you think of them inside the consoles though? I know Sony/MS didn't have much of a choice really, but from a performance point of view what are your thoughts?

For a pure gaming application in an embedded system (with 8 cores), will the consoles have legs?

zupallinere · Jun 27, 2013

After spending almost 6 years in in-order PPC programming,

Love_In_Rio said:
Thank you for your viewpoint!

.... and your service :smile:

Bagel seed · Jun 27, 2013

Some slides from a presentation Cerny gave this morning at Gamelab 2013 in Barcelona.

[Slide: Development time needed to get an engine up and running, compared]

[Slide: The 'other' PS4]

He mentioned that they could have made the PS4 more powerful but felt they would be approaching PS3 levels of overbuilt territory.

patsu · Jun 27, 2013

Where are the rest of the slides?

Is this the Develop conference he's supposed to keynote?

Gradthrawn · Jun 27, 2013

patsu said:
Where are the rest of the slides?

Is this the Develop conference he's supposed to keynote?

It's Gamelab 2013, which took place earlier today. The EU PS Blog had the livestream, but that has already ended. They will, however, have the video up on YouTube "later this week".

Kb-Smoker · Jun 27, 2013

I hope that keynote gets reposted. Cerny = tech god.

This man has pretty much saved PlayStation. I heard it was a great listen.

patsu · Jun 27, 2013

Yeah, I am more interested in the developer's view of the system. So hopefully, the Develop keynote has more juicy details.

AlNom · Jun 27, 2013

Bagel seed said:
The 'other' PS4

Kind of hard to go beyond 8x GDDR5 chips with a 128-bit bus.

Love_In_Rio · Jun 27, 2013

AlNets said:
Kind of hard to go beyond 8x GDDR5 chips with a 128-bit bus.

Yeah, with that route they couldn't have gone to 8GB. They should have gone like X1 with DDR3 and a 256-bit bus.

patsu · Jun 27, 2013

Ah, but what would be the CPU, GPU and clock speed for the other PS4?

DieH@rd · Jun 27, 2013

Cerny's presentation was very good. He talked about how Yoshida has helped him constantly for over 20 years. Yoshida gave the first US-bound PS2 devkit to Cerny, allowed Cerny to establish the ICE Team [Sony ninjas!], got him a chair in the design team of PS3, and supported him when Cerny wanted to become system architect of PS4. Yosp owns.

Cerny described himself, Andrew House and Yoshida as a great team [Three Musketeers] that has worked closely together at Sony for 20 years, all with aligned goals for gaming. He expressed the wish that someday the three of them will become as successful as the core leadership team that made Nintendo so great.

Btw, interesting piece of news from Gaf insider zomgbbqftw:

I don't think this is thread worthy so I'm posting it here.

Had a conversation with a source in semi-conductors this morning, said that one of the major memory manufacturers is currently testing 8Gbit GDDR5 chips of identical specification that would be required for 8GB at 176GB/s. The source said that chips would be ready for commercial devices by the middle of 2014 and would lower the cost for Sony by around a third if they use these chips. They added that it seems like these chips are built to order for one major buyer (Sony I'm guessing).

http://www.neogaf.com/forum/showthread.php?p=66682231#post66682231
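The rumoured numbers can be sanity-checked with simple bus arithmetic. The sketch below assumes a 32-bit interface per GDDR5 chip and a 5.5 Gbps per-pin data rate. Neither figure is stated in the quoted post, and the struct and function names are hypothetical.

```cpp
#include <cassert>

// Hypothetical description of a GDDR5 memory setup.
struct MemoryConfig {
    int chips;            // number of memory chips
    int bits_per_chip;    // interface width per chip (assumed 32)
    int gbit_per_chip;    // chip density in gigabits
    double gbps_per_pin;  // data rate per pin (assumed 5.5)
};

// Total bus width is the sum of the per-chip interfaces.
inline int bus_width_bits(const MemoryConfig& c) {
    assert(c.chips > 0 && c.bits_per_chip > 0);
    return c.chips * c.bits_per_chip;
}

// Capacity in gigabytes: total gigabits divided by 8 bits per byte.
inline double capacity_gbytes(const MemoryConfig& c) {
    return c.chips * c.gbit_per_chip / 8.0;
}

// Peak bandwidth in GB/s: bus width times per-pin rate, in bytes.
inline double bandwidth_gbytes_per_s(const MemoryConfig& c) {
    return bus_width_bits(c) * c.gbps_per_pin / 8.0;
}

// Eight 8Gbit chips, under the assumptions stated above.
inline MemoryConfig rumoured_config() {
    return MemoryConfig{8, 32, 8, 5.5};
}
```

Under those assumptions, eight 8Gbit chips give a 256-bit bus, 8 GB of capacity and 176 GB/s of peak bandwidth, which matches the figures in the quoted post. The same formulas with a 128-bit bus would halve the bandwidth at the same per-pin rate.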

anexanhume · Jun 27, 2013

Wouldn't be surprised if the ok for 8GB was contingent on manufacturers signing up to making these possible. I'm guessing shrinks will be less painful too and we'll see 20nm production APUs in consoles by 2015.

Sony subsidizing enthusiast graphics card price reduction

AlNom · Jun 27, 2013

Love_In_Rio said:
They should have gone like X1 with DDR3 and a 256-bit bus.

I'd start to wonder about the cost and time to implement a 1TB/s eDRAM...

3dilettante · Jun 27, 2013

anexanhume said:
Wouldn't be surprised if the ok for 8GB was contingent on manufacturers signing up to making these possible. I'm guessing shrinks will be less painful too and we'll see 20nm production APUs in consoles by 2015.

Sony subsidizing enthusiast graphics card price reduction

Why would SoC shrinks be less painful based on whether there's going to be 8Gbit GDDR5? Those operate in different manufacturing realms.

The high-density GDDR5 could make compact form factor SoCs with decent bandwidth and good-enough capacity possible, though the economics are uncertain for SoCs beyond the PS4.
