PlayStation 4 (codename Orbis) technical hardware investigation (news and rumours)

Speaking of this, can any developers shed light on the impact of GDDR5 latency vs something like DDR3? Or, to bring it back to last gen, GDDR3 or RAMBUS?

The bandwidth of GDDR5 is obvious, but for smaller bits when the system isn't moving large chunks of data like textures, how much can the latency hold the system up? Would these smaller bits not fit well in the 2MB L2 cache?

I'm thinking the GPU portion of the performance equation isn't an issue. What I'm concerned with is CPU performance (on which the GPU relies to feed it data).

Correct me if I'm mistaken, but the type of memory has nothing to do with latency; it all depends on the memory controller. And isn't AMD's hUMA designed precisely so the CPU and GPU can access the same memory pool in a way that boosts the efficiency of the whole system?
 
The memory subsystem that is usually paired with GDDR5 in GPUs emphasizes high throughput and high utilization at the expense of very high latencies.

The devices themselves can be very close in latency. The big contributor is the latency of the DRAM arrays themselves, and those don't change much with the interface that differentiates DDR3 and GDDR5.
GDDR5 does have a larger minimum transfer size, which might impact bandwidth utilization depending on the access pattern. For the large wavefronts of graphics shaders and the GPU memory pipeline that aggressively aggregates accesses into big blocks, it's not a problem.
 

Do we have any data on what percentage of the CPU time is spent getting data from outside the 2MB cache for a typical modern game workload?

Are there certain engines or game types which would work better within that limited 2MB cache space to avoid cache misses?

I'm guessing this whole latency issue can be avoided for the most part by prefetching, but this would seem to require a lot of hand-holding by devs to get it working optimally.
 
Speaking of this, can any developers shed light on the impact of GDDR5 latency vs something like DDR3?
There is no impact, as the latency is virtually the same (measured in nanoseconds, not in cycles; if GDDR5 clocks twice as high as DDR3, the latency can be twice as many clocks and still be the same absolute delay).
The bandwidth of GDDR5 is obvious, but for smaller bits when the system isn't moving large chunks of data like textures, how much can the latency hold the system up? Would these smaller bits not fit well in the 2MB L2 cache?
CPUs and GPUs usually move data with cache-line granularity (64 bytes). Again, this makes no difference regarding the choice of GDDR5 or DDR3, as burst lengths are the same size as that or smaller (depending on the organization of the memory controllers).
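To put rough numbers on that (purely illustrative timings, not actual PS4 figures), the cycles-vs-nanoseconds relation looks like this:

Code:
// Absolute latency = latency in cycles / command clock frequency.
// Illustrative values only; the point is the cycles-vs-nanoseconds relation.
#include <cstdio>

int main()
{
    double ddr3_ns  = 11.0 / 0.8;   // CL11 at a 0.8 GHz command clock = 13.75 ns
    double gddr5_ns = 22.0 / 1.6;   // CL22 at twice the clock         = 13.75 ns
    printf("DDR3: %.2f ns, GDDR5: %.2f ns\n", ddr3_ns, gddr5_ns);
    return 0;
}

Twice the clock, twice the cycles, same absolute delay.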

That's basically just a rephrasing of what 3dilettante wrote already:
The memory subsystem that is usually paired with GDDR5 in GPUs emphasizes high throughput and high utilization at the expense of very high latencies.

The devices themselves can be very close in latency. The big contributor is the latency of the DRAM arrays themselves, and those don't change much with the interface that differentiates DDR3 and GDDR5.
GDDR5 does have a larger minimum transfer size, which might impact bandwidth utilization depending on the access pattern. For the large wavefronts of graphics shaders and the GPU memory pipeline that aggressively aggregates accesses into big blocks, it's not a problem.
It's actually also not a problem for CPUs as long as the memory channel widths are narrow enough that a burst is smaller than or only as large as a single cache line (which is the case).
One large contribution to the higher latency of GPUs is exactly all that coalescing and combining of large numbers of memory accesses to get higher bandwidth utilization (as GPUs tolerate high latencies better than CPUs, bandwidth matters more than latency there). That's also the reason AMD basically splits the memory controller in its APUs. The CPU and GPU parts traditionally have largely separate memory pipelines (and only the one for the GPU part does the really aggressive coalescing and combining of accesses, which increases the latency), both connected to and sharing a common DRAM controller (where an arbiter probably gives CPU requests a higher priority to keep latency down).
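To illustrate the burst-size point with typical channel organizations (generic JEDEC-style figures, nothing PS4-specific):

Code:
// Bytes per burst = channel width in bytes * burst length.
// With 64-byte cache lines, neither burst overshoots a single line.
constexpr int kCacheLineBytes = 64;
constexpr int kDdr3Burst  = 8 * 8;  // 64-bit channel, BL8 -> 64 bytes
constexpr int kGddr5Burst = 4 * 8;  // 32-bit channel, BL8 -> 32 bytes

static_assert(kDdr3Burst  <= kCacheLineBytes, "DDR3 burst fits in one cache line");
static_assert(kGddr5Burst <= kCacheLineBytes, "GDDR5 burst fits in one cache line");

int main() { return 0; }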
 
I've also read somewhere that GDDR5 latencies are "programmable". I'm not even sure what that means and I don't think low latencies would be possible at high clock rates, but maybe they went with a balanced latency for CPU and GPU?

edit: Oh, still reading Gipsel's very detailed post.
 
AFAIK, the programmable latencies mean that the absolute latency can stay largely constant when the frequency of the memory is ramped up or down (as part of the power management). I.e., half the frequency means half the latency in cycles, which is the same in nanoseconds. The absolute delay is the important one anyway if one wants to compare differently clocked memory types. I can only repeat that DDR3 and GDDR5 offer the same absolute latencies. Therefore something like your "balanced latency for CPU and GPU" is already realized with the separate memory pipelines (only sharing the actual DRAM controller and offering dedicated channels for coherent exchanges) in all of AMD's APUs, as I said.
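A sketch of how that reprogramming could look conceptually (the target delay and clock values are made up, not real GDDR5 timings):

Code:
#include <cmath>
#include <cstdio>

// Pick the CAS latency (in cycles) that keeps a fixed absolute delay
// when the memory clock is ramped up or down by power management.
int cas_cycles_for(double clock_ghz, double target_ns)
{
    return static_cast<int>(std::lround(clock_ghz * target_ns));
}

int main()
{
    printf("%d\n", cas_cycles_for(1.6, 10.0)); // 16 cycles at 1.6 GHz -> 10 ns
    printf("%d\n", cas_cycles_for(0.8, 10.0)); //  8 cycles at 0.8 GHz -> 10 ns
    return 0;
}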
 
...That's also the reason AMD basically splits the memory controller in its APUs...

Thanks to you and 3dilettante for your insightful responses.


On the above, can you give any insight into whether AMD has implemented the same dual-memory-controller scheme in the PS4 APU?
 
From the vgleaks stuff I would say yes (and the same is true for the XB1, with the further complication of the eSRAM pool).
 
Do we have any data on what percentage of the CPU time is spent getting data from outside the 2MB cache for a typical modern game workload?

Are there certain engines or game types which would work better within that limited 2MB cache space to avoid cache misses?

I'm guessing this whole latency issue can be avoided for the most part by prefetching, but this would seem to require a lot of hand-holding by devs to get it working optimally.

I haven't run across a public disclosure of miss rates for a Jaguar chip running a modern game engine.

The last public run of game code under performance-monitoring software that I saw was from 2008, with Conroe and K8.
http://www.realworldtech.com/cpu-perf-analysis
The games, architectures, and tools are so different now that I wouldn't count on drawing many conclusions.

If we go with 2 misses per thousand instructions retired (the lower of the miss figures; the L2s don't match very well), a single Jaguar thread would need roughly 500 cycles of execution for those thousand instructions (this excludes branch and issue restrictions, so it's a best case). Without knowing the latency of the memory subsystem there's no way to reach a wall-clock time, but a miss may take over a hundred cycles if Durango's remote cache hits are an indication, so the two misses, if not hidden, could take 200+ cycles to resolve.
That would put roughly a third of its time on memory.
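The arithmetic behind that estimate, spelled out (every input is an assumption carried over from above, not a measured Jaguar number):

Code:
#include <cstdio>

int main()
{
    const double instructions  = 1000.0; // per the 2-misses-per-1000 rate
    const double best_case_ipc = 2.0;    // Jaguar can retire up to 2 per cycle
    const double misses        = 2.0;    // lower bound taken from the 2008 data
    const double miss_cycles   = 125.0;  // guess: "over a hundred cycles" per miss

    const double exec  = instructions / best_case_ipc; // 500 cycles, best case
    const double stall = misses * miss_cycles;         // ~250 cycles if not hidden
    printf("share of time on memory: %.0f%%\n", 100.0 * stall / (exec + stall));
    return 0;
}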

Memory pipelines and prefetchers have changed quite a bit since then, however, and memory penalties aren't as bad if the cores are clocked lower.


There's a lot of effort and time needed to get that kind of testing right, so few third parties are going to be doing this for ad hits.
Those most likely to have code runs instrumented like this would be behind a paywall or would be developers that might be under NDA.


 
What's the difference between implementing a single channel as 64-bit (à la desktop system memory) vs implementing the channels as 32-bit (à la Tahiti/7970)?
 
The CPU and GPU parts traditionally have largely separate memory pipelines (and only the one for the GPU part does the really aggressive coalescing and combining of accesses, which increases the latency), both connected to and sharing a common DRAM controller (where an arbiter probably gives CPU requests a higher priority to keep latency down).
There's been some penalty, at least for prior designs. AMD's APUs weren't benchmarked to have particularly good latencies. However, AMD has backslid quite a bit these days even without GPU contention.
 
Do we have any data on what percentage of the CPU time is spent getting data from outside the 2MB cache for a typical modern game workload?

Are there certain engines or game types which would work better within that limited 2MB cache space to avoid cache misses?

I'm guessing this whole latency issue can be avoided for the most part by prefetching, but this would seem to require a lot of hand-holding by devs to get it working optimally.
You had to manually prefetch data on the older in-order PPC-based consoles (X360 and PS3). It required lots of manual fine-tuning to work well.

All modern x86 CPUs have data cache prefetchers. Whenever the CPU notices several memory accesses that form a linear access pattern (stride), the CPU starts to automatically prefetch data according to that stride. This means that as long as you are processing things linearly by incrementing / decrementing the memory address, your code will basically never cache miss... unless you run out of bandwidth (or get hit by cache aliasing).
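A trivial example of the kind of loop a stride prefetcher handles on its own (nothing platform-specific here):

Code:
#include <cstddef>

// Pure linear traversal: after a few iterations the hardware prefetcher
// recognizes the unit stride and streams the following cache lines in
// before the loop needs them, so misses are largely hidden.
float sum_linear(const float* data, std::size_t count)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < count; ++i)
        sum += data[i];
    return sum;
}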

It's hard to program anything more complex using only linear data accesses (only using arrays, and only batch processing them). Pointer accesses are often needed, and pointers can point to any address in the CPU's memory. It's impossible for the hardware to predict where to prefetch data from until the pointer value is loaded from memory. Pointer chains (a linked list, for example) are the worst-case offenders. Fortunately, out-of-order execution helps in these cases.

A simple example: let's assume a full cache miss takes ~150 cycles (this is typical for modern Intel x86 CPUs). We are iterating through a pointer list (nodes are at random memory addresses and thus cannot be prefetched automatically). Iterating through this list always takes at least 150 cycles * N (where N is the number of nodes in the list). If your operation takes only 10 cycles per node (*), you are wasting 140 cycles per node waiting on memory latency; you are proceeding 15x slower because of it. However, if your operation is more complex and takes for example 200 cycles, the CPU can prefetch the next node while processing the current one (thanks to out-of-order execution), as long as the next node's address calculation doesn't depend on the results of the current node's operation. You could of course also use a manual cache-line prefetch in this case (SSE has an instruction for it). If you do it like that, you want to start prefetching the next node before you start processing the current one (as soon as you can get the next node's pointer into a register).
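A minimal sketch of that case (hypothetical node type; the exact prefetch placement and distance would need tuning on real hardware):

Code:
#include <xmmintrin.h> // _mm_prefetch, the SSE prefetch instruction mentioned above

struct Node
{
    Node* next;
    int   payload;
};

int walk_list(Node* node)
{
    int total = 0;
    while (node)
    {
        // Issue the prefetch for the next node as early as possible, so its
        // ~150-cycle miss overlaps with whatever work we do on the current node.
        if (node->next)
            _mm_prefetch(reinterpret_cast<const char*>(node->next), _MM_HINT_T0);

        total += node->payload; // stand-in for the per-node operation
        node = node->next;
    }
    return total;
}

If the per-node work is only ~10 cycles the prefetch can't hide much, which is the 15x case above; it pays off when there is enough independent work to overlap with the miss.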

For best performance, you should always optimize your memory access patterns according to the caches available in your target hardware. All new data that enters the caches should be prefetched in one way or another (to prevent stalls). Old data already in the caches can be accessed in any random pattern without a performance loss. If you know (with relatively high confidence) what data already lies in the CPU caches, you know that you can efficiently access it again. For best results you should think about both your memory (spatial) layout and your (temporal) access pattern (where and when).
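One generic way to act on the spatial-layout part (an illustration only, with made-up types, not tied to any particular engine): keep the hot fields packed so every 64-byte line you pull in is fully used.

Code:
#include <string>
#include <vector>

// Array-of-structs: a position/velocity update drags the cold fields through
// the cache as well, because they share the same 64-byte lines.
struct ParticleAoS
{
    float x, y, z;
    float vx, vy, vz;
    int   id;
    char  debug_name[36]; // cold data, rarely touched
};

// Hot/cold split (struct-of-arrays for the hot data): the update loop streams
// through tightly packed floats, which also suits the stride prefetcher.
struct ParticlesSoA
{
    std::vector<float> x, y, z;
    std::vector<float> vx, vy, vz;
    std::vector<int>         id;         // cold data lives in its own arrays
    std::vector<std::string> debug_name;
};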

(*) A modern CPU core can process at least 2 uops per cycle (sustained). Most AMD cores process 2 uops per cycle, while most Intel cores process 4 uops per cycle. However, in general the actual IPC hovers around 1.0-2.0 instructions per cycle; dependencies and memory stalls prevent cores from reaching their maximum theoretical throughput. Thus a 150-cycle memory stall could (in the worst case) cause a CPU core to skip executing up to 600 instructions (doing nothing).
 
Bulldozer introduced a non-strided prefetcher, which might be using some kind of pattern-matching or history table to help predict certain accesses. Details are sparse.

I haven't seen much mention of its efficacy, but the weight of anecdotes seems to be that hardware prefetching across the range of modern cores is good enough to make software prefetching a net loss in the general case.
 
It should be noted there are repeated references to problems with the monitoring tools, and that there are anomalous counts that make some of the most extreme values suspect.
AMD's L2 has a suspiciously bad count for Far Cry, per the next sentence.
 
There's been some penalty, at least for prior designs. AMD's APUs weren't benchmarked to have particularly good latencies.
One part of that is that the L2 latency of Llano increased by 5 cycles compared to earlier CPUs, but that is unconnected to the CPU vs. GPU balancing. And of course, the additional arbitration I mentioned also costs a few cycles. But this is always the case, regardless of the DDR3 or GDDR5 question we started with; in fact the benchmarks you mention were done with DDR3. It doesn't change the fact that the CPU sees a different effective RAM latency than the GPU part does. I think you'll agree that if one doesn't have two completely separate memory pools for CPU and GPU, there has to be a small penalty somewhere because of possible contention and the arbitration. As long as it is quite small, the benefits are usually much larger.
 

The latency benchmarks at each generation have varied pretty widely, but the Athlon64 series had some variants that could pull 48-50 ns latencies. Llano could exceed 70-80 ns in some reviews, with Trinity falling in the 60-70 ns range.

I would consider AMD's ability to arbitrate with its weaker OoO cores in the consoles, particularly under load, to be a question mark.
 