Digital Foundry Article Technical Discussion Archive [2013]

Richard says that the Xbox One is a more balanced system than the PS4 -

Well, like many out there, at the beginning I started to deify the PS4 as an almighty powerhouse... But analyzing all the data at our disposal, I have to say that I'm starting to agree with him.

The PS4 puts much of its weight in the GPU, but a lot of data now seems to suggest that beyond a certain level, CU usage becomes wasteful.

On the other hand, the PS4 lacks specific dedicated hardware in some areas.
So as hinted everywhere right now, in the future many heavy audio tasks could be handled by the CUs, as well as speech recognition, or some calculations for motion sensing.

As I have said elsewhere, the CUs are the new SPUs and, to me, are there to compensate for some gaps in design choices or in dedicated hardware.
 
The PS4 puts much of its weight in the GPU, but a lot of data now seems to suggest that beyond a certain level, CU usage becomes wasteful.
I don't see why the additional CUs would be considered wasteful. It's not difficult to produce a scenario where the additional CUs can be used effectively for graphics. There are benchmarks for existing products that show that it's pretty routine. There is no magic line at 14 where the performance gain drops to 0.

There is almost no part of the Orbis system that can be pointed to as a bottleneck where it isn't as good or better than similar discrete solutions that benefit from more than 14 CUs.
It is quite possible that 18 is past the knee of the curve, but while I don't dispute the authenticity of the documentation VGleaks and DF have looked at, I suspect their interpretation is not perfect.
 
The only dedicated hardware the PS4 lacks compared to the Xbox One are there to solve problems that don't exist on the PS4 (Kinect/ESRAM/Move Engines).

Richard's declaration that the PS4 is unbalanced is complete conjecture and based on assuming everything about the Xbox One is better than it seems, and everything about the PS4 is somehow worse than it seems.
 
eSRAM does mitigate a problem that Orbis is likely to face when its GDDR5 bus is heavily taxed with disparate jobs with uncoordinated read and write traffic. DRAM buses do not like switching modes, and we do not have much data on how far AMD's tech has evolved to determine what Orbis can do to reduce that problem.

Coalescing off-chip traffic would involve adding storage to hold it on-chip, and to really be effective for a broad range of loads, it would have to be big--which brings Durango to mind.
Workloads that don't need coalescing or thrash that storage will see it as a negative or needless development complication.

Both architectures are going to have weak spots they will want software to work around.
 
I don't see why the additional CUs would be considered wasteful. It's not difficult to produce a scenario where the additional CUs can be used effectively for graphics. There are benchmarks for existing products that show that it's pretty routine. There is no magic line at 14 where the performance gain drops to 0.

There is almost no part of the Orbis system that can be pointed to as a bottleneck where it isn't as good or better than similar discrete solutions that benefit from more than 14 CUs.
It is quite possible that 18 is past the knee of the curve, but while I don't dispute the authenticity of the documentation VGleaks and DF have looked at, I suspect their interpretation is not perfect.
Exactly.
I don't understand why some people try to claim the PS4 has too many ALUs. Compared with an HD7870, which is considered relatively balanced in the PC space (maybe a few ROPs too many), Orbis actually has significantly less arithmetic power (and also less pixel throughput) but higher memory bandwidth (to increase ROP efficiency and supply ample bandwidth for future loads). Where does the idea come from that the CUs can't be used efficiently?

I mean, if the framerate target stays the same, more CUs enable, for instance, the use of more (or more complicated) calculations in the shaders (better approximations, more effects, better resolve filters for AA, whatever one can come up with), as the available budget is simply higher. This increases visual quality. If a dev doesn't do anything with resolution or image quality between the XB1 and PS4, then of course the additional ALUs in Orbis will be somewhat underutilized, as they will be twiddling their thumbs for some of the time. But why shouldn't a dev use the additional resources for something useful? There is nothing which forbids this (besides maybe pressure to create a simple and fast port).
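To put rough numbers on that "higher budget" point, here is a back-of-the-envelope sketch comparing 14 vs. 18 CUs at the leaked 800 MHz clock. These are peak theoretical figures only, and the 14-CU case is just the rumored 14+4 split used for comparison, not an official spec:

```python
# Peak ALU budget per frame for a GCN-style GPU.
# Assumptions: 64 ALU lanes per CU, 2 FLOPs per lane per clock (FMA),
# 800 MHz clock, 30 fps frame target. Peak numbers, not sustained throughput.

LANES_PER_CU = 64
FLOPS_PER_LANE_PER_CLOCK = 2   # a fused multiply-add counts as 2 FLOPs
CLOCK_HZ = 800e6
FPS = 30

def peak_gflops(cus):
    return cus * LANES_PER_CU * FLOPS_PER_LANE_PER_CLOCK * CLOCK_HZ / 1e9

for cus in (14, 18):
    total = peak_gflops(cus)
    print(f"{cus} CUs: {total:6.1f} GFLOPs peak, {total / FPS:5.1f} GFLOPs per frame")

# 14 CUs: ~1433.6 GFLOPs peak -> ~47.8 GFLOPs per frame
# 18 CUs: ~1843.2 GFLOPs peak -> ~61.4 GFLOPs per frame
# The extra 4 CUs are simply ~28% more shader budget per frame at the same framerate.
```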
 
We pretty much know it is a second memory pool explicitly managed by the software, i.e. not a cache. The devs have to decide what to put where; nothing is done automatically (besides the fact that shader code is agnostic regarding the physical location, i.e. a shader program doesn't have to know where a memory location it accesses physically lives; memory accesses are routed automatically to the right memory pool using a page table [or, as the simplest version, some aperture]).
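As a purely illustrative sketch of what "agnostic regarding the physical location" means: the placement decision lives in a translation table filled by software, not in the shader. This is a toy model, not any real console API; the pool names, page size, and addresses are made up for the example:

```python
# Toy model of address routing: a shader's virtual address is translated
# through a page table that maps each page to a physical pool. The shader
# code itself never names the pool it is hitting.

PAGE_SIZE = 4096  # hypothetical page size, for illustration only

# Software (the dev / runtime) decides placement by filling the page table.
page_table = {
    0x0000: ("DDR3",  0x10000),   # virtual page 0 -> main RAM
    0x0001: ("ESRAM", 0x00000),   # virtual page 1 -> on-chip scratchpad
}

def translate(virtual_addr):
    vpn = virtual_addr // PAGE_SIZE
    offset = virtual_addr % PAGE_SIZE
    pool, phys_base = page_table[vpn]
    return pool, phys_base + offset

# A "shader" just issues loads against virtual addresses:
pool, addr = translate(0x0123)
print(pool, hex(addr))   # DDR3 0x10123 - goes out to main memory
pool, addr = translate(0x1040)
print(pool, hex(addr))   # ESRAM 0x40   - serviced by the on-chip scratchpad
```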

Other devs have said it can be used as a cache or scratchpad memory
 
The only dedicated hardware the PS4 lacks compared to the Xbox One are there to solve problems that don't exist on the PS4 (Kinect/ESRAM/Move Engines).

Richard's declaration that the PS4 is unbalanced is complete conjecture and based on assuming everything about the Xbox One is better than it seems, and everything about the PS4 is somehow worse than it seems.

I think the article is addressing the theme of "because ESRAM alleviates the bandwidth limitations of DDR3, it must have comparably worse overall performance," when the reality may actually be: yes, this DDR3/ESRAM combo is cheaper than having all GDDR5, but it also performs better overall 'pound for pound'. (edit: that doesn't mean the PS4, which has many more 'pounds', won't still outperform the XBO.)

"Why wouldnt everyone use it then?" The reasons I can think of are: R&D and testing an esoteric design, ease of developer usage, chip design complexity, fab complexity, but overall it may well turn out to be the superior performing subsystem. (relatively speaking)

On the PS4 side, Richard isn't pulling this idea of a 14+4 split out of thin air. It's been in the leaked documents, and Cerny didn't exactly squash the idea; I think he added fuel to the fire, TBH.


Exactly.
I don't understand why some people try to claim the PS4 has too many ALUs. Compared with an HD7870, which is considered relatively balanced in the PC space (maybe a few ROPs too many),
People aren't 'claiming' it, they're just trying to make sense of the leaks and the somewhat affirming comments by Cerny.
 
The only dedicated hardware the PS4 lacks compared to the Xbox One are there to solve problems that don't exist on the PS4 (Kinect/ESRAM/Move Engines).

Richard's declaration that the PS4 is unbalanced is complete conjecture and based on assuming everything about the Xbox One is better than it seems, and everything about the PS4 is somehow worse than it seems.
And maybe the fancy audio DSP that would relieve a CPU core or so. GPGPU audio is apparently not yet a doddle to implement.
 
eSRAM does mitigate a problem that Orbis is likely to face when its GDDR5 bus is heavily taxed with disparate jobs with uncoordinated read and write traffic. DRAM buses do not like switching modes, and we do not have much data on how far AMD's tech has evolved to determine what Orbis can do to reduce that problem.

Coalescing off-chip traffic would involve adding storage to hold it on-chip, and to really be effective for a broad range of loads, it would have to be big--which brings Durango to mind.
One doesn't need to add much. Typical GPUs have that on board already, as do Orbis and Durango. A transfer of a cacheline (64 bytes) takes two bursts, or four command clock/eight data clock cycles, on one 32-bit channel. Different memory controllers can of course have completely separate memory operations in flight. It basically works similarly to the banking of the eSRAM (which it quite probably uses) and already helps to avoid conflicts. For texturing, the 64-byte cacheline granularity, the usually relatively high spatial coherence of the accesses, and the usually read-only nature do the job already (together with the deep queues in the memory controllers doing additional coalescing).
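Working that cacheline figure out explicitly, assuming GDDR5's burst length of 8 on one 32-bit channel with the write clock at twice the command clock and data on both edges (i.e. 4 data beats per command clock):

```python
# GDDR5 cacheline transfer timing on one 32-bit channel (assumptions as above).

CHANNEL_WIDTH_BYTES = 4   # 32-bit channel
BURST_LENGTH = 8          # beats per burst
BEATS_PER_CK = 4          # WCK = 2 x CK, double data rate

bytes_per_burst = CHANNEL_WIDTH_BYTES * BURST_LENGTH                    # 32 bytes
bursts_per_cacheline = 64 // bytes_per_burst                            # 2 bursts
ck_per_cacheline = bursts_per_cacheline * BURST_LENGTH // BEATS_PER_CK  # 4 command clocks
wck_per_cacheline = ck_per_cacheline * 2                                # 8 data clocks

print(bytes_per_burst, bursts_per_cacheline, ck_per_cacheline, wck_per_cacheline)
# -> 32 2 4 8: a 64-byte line is two bursts, four command clocks,
#    eight data (write) clocks, matching the figures above.
```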

The ROPs, on the other hand, already include specialized caches exactly for this coalescing/write combining and to increase the access granularity seen by the memory. Render targets are only accessed in tiles of (most likely) 8x8 pixels. That's 512 bytes for the favourite 4xFP16 format (more with AA; externally the compressed tiles are read and written, internally they are processed in their uncompressed state). Each render backend contains 16kB of this color cache (+4kB cache for the Z buffer, also tiled), i.e. it can hold up to 32 tiles simultaneously (in the case of a 64-bit pixel format), absorbing the spatial locality of ROP exports and providing a larger bandwidth internally to the ROPs. That's one of the reasons one can use 95%+ of the theoretical memory bandwidth the memory interface provides in fillrate tests. It actually works reasonably well.
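The tile arithmetic in that paragraph, spelled out. The tile and cache sizes are the ones assumed in the post above, not official specs:

```python
# ROP color-cache tile arithmetic, following the numbers in the post above.

TILE_W, TILE_H = 8, 8              # assumed 8x8-pixel tiles
BYTES_PER_PIXEL_4xFP16 = 8         # 4 channels x 16-bit float
COLOR_CACHE_PER_RB = 16 * 1024     # 16 kB color cache per render backend
Z_CACHE_PER_RB = 4 * 1024          # plus 4 kB for Z, also tiled

tile_bytes = TILE_W * TILE_H * BYTES_PER_PIXEL_4xFP16
tiles_resident = COLOR_CACHE_PER_RB // tile_bytes

print(tile_bytes, "bytes per tile,", tiles_resident, "tiles resident per render backend")
# -> 512 bytes per tile, 32 tiles resident: enough to absorb the spatial
#    locality of ROP exports before the (possibly compressed) tiles go out
#    to memory in large, well-behaved chunks.
```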

Btw., as the eSRAM sits outside/behind all the GPU caches, it means that the same granularities will apply.
 
Other devs have said it can be used as a cache or scratchpad memory
Scratchpad memory it is. It's simply a software-managed SRAM memory pool where the dev explicitly has to decide what to put in it. Of course a dev is free to use it as a scratchpad to hold data loaded from the main RAM pool, which some people also refer to as a software-managed "cache" (which isn't a cache in the traditional/strict sense).
 
The only dedicated hardware the PS4 lacks compared to the Xbox One are there to solve problems that don't exist on the PS4 (Kinect/ESRAM/Move Engines).

Well, it is not actually complete.

The X1 seems to have a very powerful audio block composed of 4-5 different CPUs.
For example, one of them is called SHAPE and cannot be emulated by the entire Jaguar CPU. I have also learned on this board that the X1 audio block could in theory also allow 3D sound at ZERO expense to the CPU & GPU...
Another CPU in the audio block is for speech recognition.

The PS4 has only a compression & decompression unit (similar to the Xbox 360), so all the audio tasks will be done by the CPU.

Can you see why I say that the CUs are there to compensate for the lack of dedicated hardware?

And also regarding the eSRAM and Move Engines, I suspect they are there for specific reasons, and not only to increase the bandwidth...
 
Well, it is not actually complete.

The X1 seems to have a very powerful audio block composed of 4-5 different CPUs.
For example, one of them is called SHAPE and cannot be emulated by the entire Jaguar CPU. I have also learned on this board that the X1 audio block could in theory also allow 3D sound at ZERO expense to the CPU & GPU...
Another CPU in the audio block is for speech recognition.

The PS4 has only a compression & decompression unit (similar to the Xbox 360), so all the audio tasks will be done by the CPU.

Can you see why I say that the CUs are there to compensate for the lack of dedicated hardware?

And also regarding the eSRAM and Move Engines, I suspect they are there for specific reasons, and not only to increase the bandwidth...

Yeah, you suspect a lot of things that don't necessarily align with reality. As I'm constantly pointing out, we know the PS4 has a hardware audio chip, the capabilities of which are not fully known, save the decompression aspect. It would be nice if people stopped saying that is all it can do as if that were fact.
 
Well, it is not actually complete.

The X1 seems to have a very powerful audio block composed of 4-5 different CPUs.
For example, one of them is called SHAPE and cannot be emulated by the entire Jaguar CPU. I have also learned on this board that the X1 audio block could in theory also allow 3D sound at ZERO expense to the CPU & GPU...
Another CPU in the audio block is for speech recognition.

The PS4 has only a compression & decompression unit (similar to the Xbox 360), so all the audio tasks will be done by the CPU.

Can you see why I say that the CUs are there to compensate for the lack of dedicated hardware?

And also regarding the eSRAM and Move Engines, I suspect they are there for specific reasons, and not only to increase the bandwidth...

What is your source for this? AFAIK Cerny made a comment about the compression and decompression of audio but he did not say that was the limit/scope of the fixed function audio capability inside of PS4. Do you have another source or are you choosing to take the worst case possible scenario to make your argument?
 
And also regarding the eSRAM and Move Engines, I suspect they are there for specific reasons, and not only to increase the bandwidth...

Well, we know the Move Engines are not there to save bandwidth; the total bandwidth they share is something like 25GB/s. They are there to save compute cycles. But the ESRAM, yeah, it is there to mitigate the low bandwidth; this has been beaten to death here.
 
Yeah, you suspect a lot of things that don't necessarily align with reality. As I'm constantly pointing out, we know the PS4 has a hardware audio chip, the capabilities of which are not fully known, save the decompression aspect. It would be nice if people stopped saying that is all it can do as if that were fact.

And yet you also point to things in the Xbox One and say that is all it can do as if that was fact despite not much being known about them.

Cerny has already said more about the audio hardware in PS4 than has been said about a lot of the components in Xbox One.

The fact is, it is highly unlikely that the audio block in the PS4 is even a fraction as powerful as the audio block in the Xbox One. It's possible, certainly, but considering that Sony goes to great lengths to point out how capable its hardware is anytime someone asks, the fact that they do not do so when asked about the audio hardware is quite telling.

So if you're going to insist that the audio hardware is quite likely far more powerful than Cerny has implied when directly questioned about how powerful it is, then you'd also have to admit that the possibility exists that many of the things in the Xbox One are far more capable than you believe.

Regards,
SB
 
One doesn't need to add much. Typical GPUs have that on board already, as do Orbis and Durango. A transfer of a cacheline (64 bytes) takes two bursts, or four command clock/eight data clock cycles, on one 32-bit channel.
The read-write turnaround for the bus is a multiple of that time period. At 5.5 Gbps, the best-case latency where the bus provides nothing is 13 command clocks, worst case 20.

bus turnaround = [CLmrs + (BL/4) + 2 - WLmrs] * tCK
              = [(16..20) + (8/4) + 2 - (4, 6, 7)] * tCK

http://www.hynix.com/datasheet/pdf/graphics/H5GQ1H24AFR(Rev1.0).pdf

That would be six to ten cache transfers not utilized, with the length of time before the next transition dependent on what level is considered good enough, balanced with latency requirements for the CPU.
If it were the GPU alone running a traditional graphics workload, that looks well-handled.
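Plugging the datasheet ranges into that expression gives the 13-20 command-clock window quoted above; a minimal sketch of the arithmetic (CL and WL option sets taken from the formula line above, two command clocks per burst from the earlier cacheline post):

```python
# Bus turnaround from the formula above: [CLmrs + (BL/4) + 2 - WLmrs] * tCK
# with CLmrs in 16..20, BL = 8, WLmrs in {4, 6, 7} (datasheet option ranges).

BL = 8
cl_options = range(16, 21)
wl_options = (4, 6, 7)

turnarounds = [cl + BL // 4 + 2 - wl for cl in cl_options for wl in wl_options]
best, worst = min(turnarounds), max(turnarounds)

CK_PER_BURST = 2   # one BL8 burst occupies two command clocks on a 32-bit channel
print(best, worst)                                  # 13 20 command clocks
print(best // CK_PER_BURST, worst // CK_PER_BURST)  # ~6 to 10 bursts the bus could
                                                    # have transferred in that window
```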

The unknown in my eyes is the octo-core Jaguar part of the APU and Mark Cerny's desire to leverage asynchronous compute heavily.
This falls heavily on the CPU memory controller, onion bus, and the customizations in the L2, since this traffic does not rely on the ROPs.

I'm hoping for disclosure on this as developers have time to work on it. Existing desktop APUs aren't benchmarked for bandwidth utilization with the CPU and GPU sides under load. Latency for the CPU side is usually relatively poor or mediocre, but it's difficult to determine how well the GPU side is catered to since current APUs tend to be bandwidth-strangled.
Orbis would be a real test as to how well AMD's controller tech really handles the disparate needs of the two sides, and the compute customizations show a strong desire to keep a handle on the access patterns of the compute jobs.

Different memory controllers can of course have completely separate memory operations in flight. It basically works similar as banking of the eSRAM (which it quite probably uses) and helps already to avoid conflicts.
There are definitely optimizations that can be done here that can make the job easier on the memory interface. Making jobs more aware of what phase in the frame-generation time the GPU is in would help, as would making target buffer allocation more aware of what controllers would be used.


The ROPs, on the other hand, already include specialized caches exactly for this coalescing/write combining and to increase the access granularity seen by the memory.
They are also such a major bandwidth consumer that they are placed right next to the memory controllers, so we know their miss rates are very high (edit: misses creating demand bandwidth versus other client types).
The CPU and compute sides would actively contend for the bus, and the ROP caches cannot do more than make the color and Z traffic as well-behaved as they can in the face of an unknown.


Btw., as the eSRAM sits outside/behind all the GPU caches, it means that the same granularities will apply.
There shouldn't be a significant turnaround penalty, if any. There's probably no problem in the likely case that it's a bidirectional interface.

Reads can switch to writes and switch back to reads all day with no bandwidth loss, and much less queuing would be needed.
 
And yet you also point to things in the Xbox One and say that is all it can do as if that was fact despite not much being known about them.

Cerny has already said more about the audio hardware in PS4 than has been said about a lot of the components in Xbox One.

The fact is, it is highly unlikely that the audio block in the PS4 is even a fraction as powerful as the audio block in the Xbox One. It's possible, certainly, but considering that Sony goes to great lengths to point out how capable its hardware is anytime someone asks, the fact that they do not do so when asked about the audio hardware is quite telling.

So if you're going to insist that the audio hardware is quite likely far more powerful than Cerny has implied when directly questioned about how powerful it is, then you'd also have to admit that the possibility exists that many of the things in the Xbox One are far more capable than you believe.

Regards,
SB

Where have I insisted anything about the power of the PS4's audio unit? I grant that in all likelihood it is far less sophisticated than the Xbox One's. But we don't know everything it can do and until we have more details, I see no reason to refrain from challenging blatant misrepresentations on the topic.
 
Where have I insisted anything about the power of the PS4's audio unit? I grant that in all likelihood it is far less sophisticated than the Xbox One's. But we don't know everything it can do and until we have more details, I see no reason to refrain from challenging blatant misrepresentations on the topic.

Tell us about the PS4 audio chip.
 
Tell us about the PS4 audio chip.

This should be directed at the person who is stating conclusions, not the person saying we don't know what the PS4 audio chip can do...

I pretty much agree with Brad; the XB1 more than likely has a more robust solution, but that doesn't in any way mean we have enough info at this point to conclude that the audio functions in the PS4 have to be emulated on Jaguar or the GPU. All we know for certain at this point is that audio compression and decompression are available in the hardware and that audio ray tracing can be done on the GPU, which leaves quite a bit of vanilla processing that is either being done via a DSP or emulated elsewhere - again, we just don't know yet.
 
Where have I insisted anything about the power of the PS4's audio unit? I grant that in all likelihood it is far less sophisticated than the Xbox One's. But we don't know everything it can do and until we have more details, I see no reason to refrain from challenging blatant misrepresentations on the topic.

We went through this in the audio thread, same people, same poor logic. Basically, the absence of evidence is treated as evidence of absence. They are counting on the audio block of the XB1 to be secret sauce and won't acknowledge that it exists mainly for Kinect, as pointed out by bkilian, and that a fraction of one CPU core can do what modern games require for current audio processing. It may change in the future if things get more sophisticated.
 