PlayStation 4 (codename Orbis) technical hardware investigation (news and rumours)

The L2 sharing scenario may also be a worst-case number for Naughty Dog's presentation, as Durango has something between 100 and 120
I believe the snoop penalty over L2 could be around 100 cycles, so the PS4 figures are probably taken from 'worst' cases.

GDDR5 vs DDR3 is very much in the noise.
The small difference between GDDR5 and DDR3 quite surprises me. Are these the best cases or what? I was expecting the difference to be higher than a mere 20%, which, if we factor in the grand total (cache accesses + DRAM accesses over total accesses), should barely affect the final result at all...


The heterogeneous memory subsystem of AMD's APUs is not breaking the overall trend from Llano, Trinity, Bobcat, Kabini, Kaveri, etc. in terms of latency.
I do not think it was even supposed to. HSA was more about offering a way to (in theory) avoid memory copies by allowing fast memory remapping/sharing, and adding CPU<->GPU intercommunication/event support.
Or are you referring to the way their MCT+DCT are tuned for APUs?
 
The small difference between GDDR5 and DDR3 quite surprises me. Are these the best cases or what? I was expecting the difference to be higher than a mere 20%, which, if we factor in the grand total (cache accesses + DRAM accesses over total accesses), should barely affect the final result at all...
I believe latency numbers provided for an architecture tend to focus on best or good cases. It's almost trivial to force a hit to a closed bank or nail the bus with read/write turnaround penalties. There are a few potential spots with bank grouping that might inject a few extra cycles for GDDR5 in scenarios DDR3 wouldn't need to worry about, but in general the datasheets for modules with either tech tend to keep the wall clock times in the same band.
The common case should be that the memory subsystem moves heaven and earth to make DRAM hit the good case, but there lies the dark art of memory controller tech.
At least for this comparison, both platforms have the same technology pool to draw from.
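
As a rough illustration of the "same band" point, here is a minimal sketch converting datasheet-style timings into wall-clock access times. The CL/tRCD cycle counts and command clocks below are illustrative ballpark assumptions on my part, not figures taken from any particular module's datasheet.

```python
# Minimal sketch: closed-bank access time (tRCD + CL) in nanoseconds for
# illustrative DDR3 and GDDR5 timings. The cycle counts and command clocks
# below are ballpark assumptions, not figures from any specific datasheet.

def closed_bank_ns(trcd_cycles, cl_cycles, command_clock_mhz):
    return (trcd_cycles + cl_cycles) / command_clock_mhz * 1000.0

# DDR3-2133-ish timings (command clock ~1066 MHz, CL/tRCD around 14)
ddr3 = closed_bank_ns(trcd_cycles=14, cl_cycles=14, command_clock_mhz=1066)
# GDDR5 at ~5.5 Gbps (command clock ~1375 MHz), assumed CL/tRCD around 15-17
gddr5 = closed_bank_ns(trcd_cycles=17, cl_cycles=15, command_clock_mhz=1375)

print(f"DDR3 closed-bank access:  ~{ddr3:.1f} ns")   # ~26 ns
print(f"GDDR5 closed-bank access: ~{gddr5:.1f} ns")  # ~23 ns, same band
```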

I do not think it was even supposed to. HSA was more about offering a way to (in theory) avoid memory copies by allowing fast memory remapping/sharing, and adding CPU<->GPU intercommunication/event support.
Or are you referring to the way their MCT+DCT are tuned for APUs?

The latency numbers for APUs were a marked rise over their predecessors. In general, besides a possible small improvement from Llano to Trinity, the memory subsystem has gotten worse every chance AMD has had to improve its tech.
Looking at the PS4 numbers, we see an uncore and on-chip interconnect that could take 100 or more cycles to handle an access (meaning just the snoop, queueing, and notification of the DRAM controller).
If we assume for the sake of argument that the PS4 CPUs are 1.6GHz, we could be looking at 100ns or more devoted to the chip figuring out whether an L2 miss needs to go to memory.
The Xbox One is a little better, possibly because the uncore clocks are higher or because their tweaks there have shaved some time off. Another possibility, given that Kaveri has adopted some of the same bus elements as Orbis (and has surprisingly bad latencies), is that Durango benefited latency-wise by stopping earlier on AMD's progression of worsening architectural latency.

That's enough time for 2-3 full memory accesses if we go by some decent desktop cores.
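
As a quick sanity check on those figures, here is a minimal sketch converting cycle counts to wall-clock time at the 1.6GHz clock assumed above; the desktop memory latency used for comparison is just an assumed ballpark, not a measured number.

```python
# Sketch: convert uncore cycle counts to nanoseconds at an assumed 1.6 GHz
# clock, and compare against an assumed ~40 ns full memory access on a
# decent desktop core (ballpark assumption, not a measured figure).

def cycles_to_ns(cycles, clock_ghz=1.6):
    return cycles / clock_ghz

DESKTOP_MEMORY_ACCESS_NS = 40.0  # assumed ballpark

for cycles in (100, 160, 200):
    ns = cycles_to_ns(cycles)
    ratio = ns / DESKTOP_MEMORY_ACCESS_NS
    print(f"{cycles} cycles @ 1.6 GHz = {ns:.1f} ns "
          f"(~{ratio:.1f}x a desktop memory access)")
# 100 cycles -> 62.5 ns; roughly 160 cycles are needed to reach the 100 ns mark
```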
 
...but in general the datasheets for modules with either tech tend to keep the wall clock times in the same band.
Indeed, the GDDR5 and DDR3 datasheets report quite comparable results.

The latency numbers for APUs were a marked rise over their predecessors. In general, besides a possible small improvement from Llano to Trinity, the memory subsystem has gotten worse every chance AMD has had to improve its tech.
It mostly depends on the workload. Being an APU, you should remember that it might have quite a number of clients to serve in parallel. It would be more interesting to test the CPU part without the GPU running in parallel, to see the results.
Obviously, under a mixed load, you will have both competing for resources, and thus a delay while the internal MCT/DCT queue(s) are served.

If we assume for the sake of argument that the PS4 CPUs are 1.6GHz, we could be looking at 100ns or more devoted to the chip figuring out whether an L2 miss needs to go to memory.
I doubt it - see my remark above.

The Xbox One is a little better, possibly because the uncore clocks are higher or because their tweaks there have shaved some time off.
My 2 cents: part of the GPU's accesses go to the eSRAM, so there is less traffic on the bus.

That's enough time for 2-3 full memory accesses if we go by some decent desktop cores.
... which hints at my 2 cents, I believe.
 
It mostly depends on the workload. Being an APU, you should remember that it might have quite a number of clients to serve in parallel. It would be more interesting to test the CPU part without the GPU running in parallel, to see the results.
Obviously, under a mixed load, you will have both competing for resources, and thus a delay while the internal MCT/DCT queue(s) are served.
The common case is that CPU memory benchmarks are run without the GPU under load, if it is active at all.
I don't see anybody running Sisoft Sandra while running 3DMark or a game.

The Naughty Dog presentation barely touches on the existence of the GPU in Orbis, and the latency examples focus on contention within the CPU portion. The other latencies given are consistent with the usual sort of testing the CPU hierarchy in isolation.


My 2 cents: part of the GPU's accesses go to the eSRAM, so there is less traffic on the bus.
While the eSRAM absorbs traffic in the real world, I have doubts that the numbers given come from testing with both sides of the APU under load.
The external memory bus is significantly narrower, so the bandwidth that isn't covered by the eSRAM remains dominant. Half of the DDR3 bandwidth is really only available to the GPU, since the coherent bandwidth of the APU is a little less than half that of the bus.
The likely case in a high-GPU scenario is that if the eSRAM is busy, the GPU is going to be hitting the DDR3 bus as well.
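
To put rough numbers on that split, here is a small sketch using the commonly quoted Durango figures, treated here as assumptions: a 256-bit DDR3-2133 bus and roughly 30GB/s of coherent bandwidth.

```python
# Sketch of the bandwidth split described above. The bus width, transfer
# rate, and coherent-path figure are the commonly quoted Durango numbers,
# treated here as assumptions rather than confirmed specifications.

BUS_WIDTH_BITS = 256        # external DDR3 bus width
TRANSFER_RATE_MTS = 2133    # DDR3-2133, mega-transfers per second
COHERENT_BW_GB_S = 30.0     # assumed coherent-path bandwidth

ddr3_bw_gb_s = (BUS_WIDTH_BITS / 8) * TRANSFER_RATE_MTS / 1000.0
share = COHERENT_BW_GB_S / ddr3_bw_gb_s

print(f"DDR3 bus bandwidth:   ~{ddr3_bw_gb_s:.1f} GB/s")  # ~68.3 GB/s
print(f"Coherent-path share:  ~{share:.0%}")              # a bit under half
```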
 
16x AF support is a common element of the base architecture's feature set. Other GCN cards have that checkbox as well.

It makes sense then to say there is no technical reason why the PS4 can't support it.
Saying there is no issue with using it offline doesn't mean there aren't reasons to avoid it when running a game.
 
I can't think of a reason why the OS should care about a texture filtering setting in a game.
Other games have at least some of the lower levels enabled, so why would the OS object to the number 16?
 
Maybe their OS has problems with it (broken, very slow... whatever), so activating it is discouraged.
It's a GPU hardware feature, so the OS would not have a problem with it at all.

Most likely, if it's not being used, it's because the game developers simply don't give a damn. They know the majority of console gamers don't know about aniso filtering, nor do they really care, and they likely wouldn't notice much, if any, difference regardless of whether the feature is on or off.
 
But if low level AF is effectively free, why not include it for the few who do care, including those that read DF and get told what to care about?
 
GPUs reduce the VRAM bandwidth required for AF through high texel hit rates in their small, kB-sized local texture caches.

However, without factoring in that cache, and assuming a 4:1 texture compression ratio, you are looking at ~8GB/sec for 16x AF @ 30fps:

30fps @ 1920x1080 & 4:1 texture compression
trilinear filtering = 0.5GB/sec
4x AF = 2GB/sec
16x AF = 8GB/sec

However, developers are starting to use large uncompressed textures, so I would assume these numbers might go up and potentially quadruple.

But then we are not factoring in the all-important cache hit rate. Maybe someone with more knowledge can enlighten us about that factor.
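
For anyone who wants to check that arithmetic, here is a minimal sketch that reproduces the figures above. The 32-bit-per-texel size and eight texels per trilinear probe are my assumptions, and the texture cache is deliberately ignored, as in the numbers quoted.

```python
# Worst-case texel bandwidth sketch for the figures above. Assumptions
# (mine, not from the post): 32-bit texels before the 4:1 compression,
# eight texels per trilinear probe, and zero texture-cache hits.

def texel_bandwidth_gb_s(width, height, fps, trilinear_probes,
                         bytes_per_texel=4, compression_ratio=4):
    pixels_per_sec = width * height * fps
    texels_per_pixel = 8 * trilinear_probes   # 8 texels per trilinear probe
    bytes_per_pixel = texels_per_pixel * bytes_per_texel / compression_ratio
    return pixels_per_sec * bytes_per_pixel / 1e9

for label, probes in [("trilinear", 1), ("4x AF", 4), ("16x AF", 16)]:
    print(f"{label:10s} ~{texel_bandwidth_gb_s(1920, 1080, 30, probes):.1f} GB/sec")
# ~0.5, ~2.0 and ~8.0 GB/sec, matching the numbers quoted above
```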
 
GPUs reduce the VRAM bandwidth required for AF through high texel hit rates in their small, kB-sized local texture caches.

However, without factoring in that cache, and assuming a 4:1 texture compression ratio, you are looking at ~8GB/sec for 16x AF @ 30fps:

30fps @ 1920x1080 & 4:1 texture compression
trilinear filtering = 0.5GB/sec
4x AF = 2GB/sec
16x AF = 8GB/sec
In 16xAF, do TMUs actually use the full 16 taps all the time?

I guess it would simplify the design of the hardware a little. But there's no point using a big sample pattern when, say, looking at a wall head-on.

The only benefit I can see would be if the GPU wanted to try and load a minimal number of MIP levels into the cache; if you had a surface that was very large in screen-space, parts viewed "head-on" wouldn't need access to a higher MIP level if they took multiple samples from a lower MIP level, just like the parts of the surface viewed at oblique angles.
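
For what it's worth, the usual textbook description has the probe count adapt to the pixel footprint rather than always running the full 16. A simplified conceptual sketch of that idea follows; it is not a description of any particular GPU's actual TMU logic.

```python
# Simplified sketch of footprint-adaptive anisotropic filtering: the number
# of trilinear probes scales with how elongated the pixel footprint is in
# texture space, clamped to the selected maximum. Conceptual model only,
# not a description of any specific GPU's TMU implementation.
import math

def trilinear_probes(footprint_major, footprint_minor, max_aniso=16):
    ratio = footprint_major / footprint_minor  # anisotropy ratio
    return min(max_aniso, max(1, math.ceil(ratio)))

print(trilinear_probes(1.0, 1.0))   # wall viewed head-on -> 1 probe
print(trilinear_probes(4.0, 1.0))   # moderately oblique  -> 4 probes
print(trilinear_probes(40.0, 1.0))  # extremely oblique   -> clamped to 16
```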
 
You shouldn't apply AF to everything all the time. Only a few surfaces will be at an angle that warrants its use, and AFAIK consoles are in a position to apply AF selectively. This was raised on this board when the last gen started! We went a whole generation without decent AF even though it was standard in hardware, and we're looking at another gen where it may not be ubiquitous.
 
Thank you, yeah, so it doesn't seem like much of a factor in terms of the bandwidth used: high texel cache hit rates, and not every surface is at an oblique angle. But perhaps the move to uncompressed textures plays a factor in developers' decisions?
You shouldn't apply AF to everything all the time. Only a few surfaces will be at an angle that warrants its use, and AFAIK consoles are in a position to apply AF selectively. This was raised on this board when the last gen started! We went a whole generation without decent AF even though it was standard in hardware, and we're looking at another gen where it may not be ubiquitous.
Memory bandwidth has gone up multiple-fold from last gen, but so has bandwidth demand. So perhaps once again AF might be pushed to the side, and now, with uncompressed textures increasing its bandwidth cost, this could be another generation where many developers don't want to sacrifice that relatively small amount of memory bandwidth.
 
But perhaps the move to uncompressed textures plays a factor in developers' decisions?
What do you mean by "uncompressed"?

If you're referring to the textures not being a blurry/artifacty mess, that's all the more reason to use filtering that doesn't regard them as a mess.

On the other hand, if you're suggesting that they're trying to be nice to the cache in a world where they're no longer using hardware texture compression... why would they no longer be using hardware texture compression?
 