Thoughts on next-gen consoles' CPU: 8x 1.6 GHz Jaguar cores

My question is how Durango is getting its 200 GFLOPS rating when we're rating the PS4 at 102.4 GFLOPS using simple math. I did the math too, so I understand where the 102.4 GFLOPS figure comes from, and I think it is most probably accurate.

Somehow these cores in Durango would need to do something like 8 adds + 8 multiplies per clock.
If not that, then it would have to be either 3.2 GHz (lol) or 16 cores (clearly inaccurate).

[strike]or secret sauce[/strike]
Standard Jaguar has a 128-bit ALU, which is why it does 4 adds and 4 muls per clock. AVX uses 256-bit registers. It's possible MS had them extend the ALU width to 256 bits to take full advantage of AVX properly. That would result in 8 adds and 8 muls per clock.
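Spelling out the simple math (a rough sketch in Python, assuming one 4-wide FP add plus one 4-wide FP multiply per core per clock for the stock 128-bit units, and double that for a hypothetical 256-bit extension):

[code]
# Rough peak-FLOPS arithmetic for an 8-core Jaguar CPU (assumed throughput figures).
cores = 8
clock_ghz = 1.6

# Stock Jaguar: 128-bit FP units -> 4 single-precision adds + 4 muls per clock.
flops_per_clock_128 = 4 + 4
print(cores * clock_ghz * flops_per_clock_128)   # 102.4 GFLOPS

# Hypothetical 256-bit extension: 8 adds + 8 muls per clock.
flops_per_clock_256 = 8 + 8
print(cores * clock_ghz * flops_per_clock_256)   # 204.8 GFLOPS
[/code]

204.8 is close enough to the rumored ~200 GFLOPS figure for Durango, which is where the speculation about a widened ALU comes from.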
 
With just 16 architectural registers, that would be pretty badly limited by the memory pipeline.

The Bobcat FPU has a retirement queue of 40 entries (i.e., only 40 of the 56 instructions in the ROB can effectively produce an FP result), which brings the number of physical registers to 56 (40+16). I'd expect Jaguar to be similar. That is enough to easily cover L2 hits, but not nearly enough to schedule around main memory latency.

I'm guessing they are relying on aggressive prefetching to lower the effective main memory latency. Alternatively, they could use a chunk of ESRAM as low-latency memory and orchestrate the move engines to double-buffer matrix tiles (tricky, so best left to library functions like BLAS, generic box solvers, etc.).
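As a back-of-the-envelope illustration of why ~56 physical registers cover L2 hits but not a trip to main memory (a sketch with assumed latency figures, not measured ones):

[code]
# Little's-law style estimate: FP results that must be in flight to hide latency.
# All latency figures below are assumptions for illustration, not measurements.
clock_ghz = 1.6
l2_hit_cycles = 25                    # assumed L2 hit latency
dram_cycles = 100 * clock_ghz         # assumed ~100 ns to main memory -> ~160 cycles

issue_per_cycle = 2                   # one add + one mul result per cycle (assumed)

print(l2_hit_cycles * issue_per_cycle)      # ~50 results in flight -> fits in ~56 regs
print(int(dram_cycles * issue_per_cycle))   # ~320 results in flight -> far beyond 56
[/code]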

Cheers

EDIT: Sorry, this is the Orbis thread, ignore the ESRAM bit
 
Prediction: The enhancements that were said to be happening to the Durango CPU will actually show up in the vgleaks PS4 CPU article.
---
If the CPUs are identical in both consoles, won't both machines have to deal with this same problem? And I was under the impression that DDR3 sports superior latencies to GDDR5, which should make DDR the ideal choice for this CPU.

Is there something that I'm missing? I also believe these latencies are quite an improvement over the cache hit latencies that were present on the 360 and PS3, are they not?

What should latency look like, for example, when the same CPU has to access main memory on the PS4? I would imagine that the latencies won't be better on GDDR5. In fact, they may be quite a bit worse.
 
Well, the point is that CPUs overall don't work well with any kind of RAM, lol; latency (measured in cycles) has gone up and up while CPUs were getting better and faster. (Edit: put another way, CPU speed has increased much faster than latencies have decreased.)
GDDR is worse, but that is not to say that DDR3 'works well' with the CPU. L2 misses are still really bad as far as performance is concerned.

Actually, the only devices (that I can think of, produced in quantity) that can deal with repeated cache misses are GPUs; they have many threads to hide those few hundred cycles of latency. Pretty much, if your code has no locality, or the cache is too tiny, etc., CPU performance falls off a cliff.

With regard to the GDDR in the PS4, that is off topic, but I don't expect it to make much of a difference; in both systems I expect the performance-sensitive parts of the code to hit the L2 most of the time.
I think the whole issue is massively overblown.
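To make the "many threads to hide a few hundred cycles" point concrete (purely illustrative, assumed numbers):

[code]
# How many threads/wavefronts a GPU-style design needs in flight to hide memory
# latency, if each one has a fixed amount of ALU work between loads (assumed numbers).
miss_latency_cycles = 300        # assumed DRAM latency as seen by the execution units
work_cycles_per_thread = 20      # assumed ALU work available between dependent loads

threads_needed = -(-miss_latency_cycles // work_cycles_per_thread)   # ceiling division
print(threads_needed)   # ~15 contexts in flight just to stay busy
[/code]

A CPU core with one or two hardware threads simply has nowhere near that much independent work to switch to, which is why a miss stalls it.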
 

My mistake, I didn't mean to say they work well together, only that the situation isn't likely to be any better on the flip side. Either way, developers should be able to take care of things.
 
So the importance of having (four) separate (512KB) L2 caches per Core pair is clear then.
 
Each Jaguar compute unit's L2 is subdivided into four arrays.
For an 8-core solution, that's 8 L2 quadrants. There's no pairing of cores, and there's nothing like a 2:1 ratio of cache slices to cores.

You want the speed to depend on externally influenced temperature in a console?

If it's like AMD's other implementations, it would only be thermally influenced if the CPU was encountering dangerous temps.
It might mess with measured latencies from run to run.
 
You want the speed to depend on externally influenced temperature in a console?

"Some turbo" doesn't have to mean that. The limits could be chosen purely statically based on nothing more than how many cores are active. There are platforms today that only do this. Call it something different if you really want, but it still follows what I said.
 
So the importance of having (four) separate (512KB) L2 caches per Core pair is clear then.

That's not quite how it works. Each of those slices maps a specific segment of the whole address space (likely interleaved). So regardless of which core is doing the accessing, the cache line at address 0 goes to the first slice, the cache line at address 64 goes to the second, and so on.

This is because it's much cheaper to split a large pool into two halves, each of which can support as many accesses as the original could, than it is to double the number of accesses per unit time from a single pool.

The Jaguar L2 is very similar to the SNB L3 in design and purpose. Like in SNB, a core will access all the slices evenly. The fact that there are 4 slices and 4 cores is effectively an accident -- the design might just as well have ended up with 4 cores and 2 or 8 slices. 4 just happened to be the sweet spot.
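A minimal sketch of what line-interleaved slice selection could look like (the 64-byte line size and the simple modulo hash are assumptions for illustration, not the actual Jaguar mapping):

[code]
# Hypothetical line-interleaved mapping of addresses to L2 slices.
LINE_SIZE = 64     # assumed cache-line size in bytes
NUM_SLICES = 4     # four L2 arrays per Jaguar compute unit

def slice_for_address(addr: int) -> int:
    """Pick an L2 slice from the cache-line index (simple modulo hash, assumed)."""
    line_index = addr // LINE_SIZE
    return line_index % NUM_SLICES

# Consecutive cache lines land on consecutive slices, regardless of which core asks.
for addr in (0, 64, 128, 192, 256):
    print(addr, "->", slice_for_address(addr))
[/code]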
 
GDDR is worse, but that is not to say that DDR3 'works well' with the CPU.
Actually, if you look it up, the latencies of typical DDR3 modules measured in nanoseconds are in exactly the same range as those of GDDR5. They use the same memory cells after all; just the interface is a bit different. To put a number on it, GDDR5 at 6 Gbps may run at a CAS latency of 17 cycles (the 1.5 GHz parts; I only looked it up for one series from Hynix, which at 1 GHz supports a CAS latency of 12 cycles), which equates to 11.3 ns. DDR3-2133 rarely comes at latencies below 11 cycles (on the 1066 MHz clock), which would be 10.3 ns; higher latencies for DDR3 are actually common. t_RP, t_RCD, and t_RAS are also basically the same when measured in ns (10-12 ns for t_RP and t_RCD, 28 ns for t_RAS). The Hynix GDDR5 I was looking at actually supports latencies of 11.3-10-12-28 when expressed in nanoseconds in the usual order. In cycles at 1066 MHz, the closest would be 12-11-13-30.
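The cycle-to-nanosecond conversions above are easy to check (a quick sketch using the clock rates quoted in the post):

[code]
# Convert CAS latency in command-clock cycles to nanoseconds.
def cas_ns(cas_cycles: int, command_clock_ghz: float) -> float:
    return cas_cycles / command_clock_ghz

print(round(cas_ns(17, 1.5), 1))    # GDDR5 @ 6 Gbps (1.5 GHz command clock): ~11.3 ns
print(round(cas_ns(11, 1.066), 1))  # DDR3-2133 (1066 MHz), CL11:             ~10.3 ns
[/code]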
 
Well, I would like to pose this question:

For developing games on a console, what is better: 8 cores, each with its own FPU, at 1.8 GHz, or 4 cores, each with its own FPU, at 3.6 GHz, supposing the same vector throughput?

I would like to leave Haswell out of the discussion, as its FPUs are another world. Remember, I am talking about FPUs, not about general processing threads (as then 2 Sandy Bridge cores with 4 threads at 3.6 GHz would behave similarly to 8 Jaguar cores at 1.8 GHz).

I am of the opinion that having 8 ACEs in the PS4 makes it more valuable to have 8 independent cores, each with its own FPU and 512KB of L2 cache, capable of throwing 8 independent kernels at the GPU, than having 4; but maybe I am very wrong... and you could do the same with 4 FPUs at double the speed.
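For what it's worth, the raw peak vector throughput in the two configurations really is identical (same per-core assumptions as in the GFLOPS arithmetic earlier in the thread):

[code]
# Peak vector throughput is the same in both configurations
# (assuming 8 FLOPs per clock per core, i.e. 4 adds + 4 muls on 128-bit units).
flops_per_clock_per_core = 8

print(8 * 1.8 * flops_per_clock_per_core)   # 8 cores @ 1.8 GHz -> 115.2 GFLOPS
print(4 * 3.6 * flops_per_clock_per_core)   # 4 cores @ 3.6 GHz -> 115.2 GFLOPS
[/code]

So the question is really about scheduling, latency tolerance, and single-threaded performance rather than peak FLOPS.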
 
If power consumption scaled linearly, the consoles would probably have had 4 cores at 3+ GHz, but unfortunately it doesn't, and I think power consumption was the main driver for 8 x 1.6+ GHz cores vs 4 x 3+ GHz cores.
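A rough way to see why it doesn't scale linearly (a sketch assuming the usual dynamic-power model P ~ C*V^2*f and that voltage has to rise roughly with frequency; the voltages below are made up for illustration):

[code]
# Crude dynamic-power comparison: P ~ V^2 * f per core (simplifying assumption).
def relative_power(cores: int, freq_ghz: float, volts: float) -> float:
    return cores * volts ** 2 * freq_ghz

low  = relative_power(cores=8, freq_ghz=1.6, volts=0.9)   # assumed low-clock voltage
high = relative_power(cores=4, freq_ghz=3.2, volts=1.3)   # assumed high-clock voltage

print(round(high / low, 2))   # ~2x the power for the same cores*GHz product
[/code]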
 
I'd say go for 3.6 GHz; single-threaded performance is still very important.

Well, in this case floating-point performance in Jaguar is good, integer performance not so much. That's why I wonder whether it is better to have 8 slower (in raw GHz) FPUs or 4 FPUs at double the speed.
 

John Carmack specifically addressed the question of more, slower cores vs fewer, faster cores recently on Twitter. His answer was that if total performance were equal he'd go for fewer, faster cores every time, but the question got interesting if you could get 1.5x the peak power out of the slower, more numerous cores.

I'm not sure why the number of ACEs would matter; as I understand it, they are for scheduling GPGPU work on the GPU. Having more CPU cores wouldn't really affect that as far as I can see - assuming fewer, faster cores could allocate jobs to the GPU just as quickly.
 

So 4 x Piledriver (2 modules) at something like 2.5 GHz (like the faster-binned 35W Richlands) would probably be preferable to 8 Jaguars at 1.6 GHz. And 4 Steamrollers would probably be preferable at even lower speeds.

Too bad that Piledriver didn't make it to 28nm and that Steamroller wasn't ready.

In a sense it doesn't matter because games will be designed around whatever's in the consoles. But you could say the same thing about the GPU too. It's hardly an inspiring thought even if it's not a troubling one.
 
Well, if they weren't constrained to a single APU they likely could have gone with Steamroller or Piledriver. So what are the gains in going with a single APU, and is it worth it?
 

For reference:

PCmark Vantage
4-core reference Kabini A6-5200: 5271 (17 watts)
2-core reference Brazos E2-1800: 2807 (18 watts)
2-module / 4-core AMD A10-4600M (Trinity at 2.3 GHz, 3.2 GHz Turbo): 5552 (more or less 35 watts)

Kabini being GCN vs Trinity being VLIW4 makes a difference (as does being on a smaller process node), but a 35-watt Kabini chip would perform much better than a similar-wattage Trinity chip, and would at least be very similar to a Richland one at the same wattage.

In perf/watt, Jaguar is really stellar.
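Putting the perf/watt claim into numbers using the scores quoted above (simple division, nothing more):

[code]
# PCMark Vantage score per watt, using the figures quoted above.
scores_watts = {
    "Kabini A6-5200":    (5271, 17),
    "Brazos E2-1800":    (2807, 18),
    "Trinity A10-4600M": (5552, 35),
}

for name, (score, watts) in scores_watts.items():
    print(f"{name}: {score / watts:.0f} points/W")
# Kabini comes out around 310 points/W vs roughly 159 points/W for Trinity.
[/code]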
 