Digital Foundry Article Technical Discussion Archive [2013]

... which some people also refer to as a software managed "cache" (which isn't a cache in a traditional/strict sense).

Any modern OS uses free RAM for disk cache, and I've never seen people arguing it isn't cache "in the strict sense".

As long as it's managed in large enough chunks to minimize overhead, it's a cache. Of course, when talking about 512-bit lines or the like, only hardware can make it worthwhile (which doesn't seem to be the case here).
 
As much as I'm fascinated by the audio block of the Xbox One, and I agree it's certain to be "more powerful" than the one in the PS4... processing is processing.

The big question is the real-world equivalence in processing power: what does a game actually require? It's possible that a single CU could do all the filtering and SRC on the PS4 that would be done by SHAPE on the Xbox One, and it's also likely that the other processors in the audio block are reserved for Kinect. It's also possible that whatever async compute the PS4 can do would be more than sufficient, leaving enough CPU for easy use of audio middleware. Without a good game post-mortem we don't know, and can only speculate.

I think it's crazy to add up all the fixed flops of the audio hardware as if it were the requirement for parity, because there should be a significant down-mix before applying filters and/or SRC. Why would a game apply a different SRC or filter to thousands of individual streams instead of down-mixing before applying the processing-heavy filters? Getting thousands of streams down to hundreds reduces the processing needed by an order of magnitude. If most games don't need all of these filters, the audio hardware sits idle while, OTOH, the CPU and/or GPU are busy elsewhere. So it's not apples to apples, and we don't have an answer yet.
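To put the order-of-magnitude claim in concrete terms, here's a tiny back-of-the-envelope sketch in Python. Every number in it (stream counts, filters per stream, MACs per filter) is an illustrative assumption rather than a figure from either console; only the ratio matters.

Code:
SAMPLE_RATE = 48_000          # samples per second
MACS_PER_FILTER = 5           # rough cost of one biquad-style filter per sample (assumed)
FILTERS_PER_STREAM = 4        # EQ/SRC-ish chain per stream (assumed)

def filter_cost(streams):
    """MAC operations per second to run the filter chain on `streams` streams."""
    return streams * FILTERS_PER_STREAM * MACS_PER_FILTER * SAMPLE_RATE

naive = filter_cost(2_000)    # filter/SRC every individual voice
downmixed = filter_cost(200)  # down-mix to a couple hundred submixes first

print(f"naive:     {naive / 1e6:.0f} MMAC/s")      # 1920 MMAC/s
print(f"downmixed: {downmixed / 1e6:.0f} MMAC/s")  # 192 MMAC/s
print(f"ratio:     {naive / downmixed:.0f}x")      # ~10x, i.e. an order of magnitude
# The down-mix itself costs roughly one MAC per voice per sample (~96 MMAC/s here),
# which is small next to running full filter chains on every voice.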
 
Any modern OS uses free RAM for disk cache, and I've never seen people arguing it isn't cache "in the strict sense".
Just ask some hardware guys.
An OS-managed disk cache fulfills the function of a cache. While from the hardware point of view it doesn't even exist (;)), it is transparent to applications (one of the crucial distinctions mentioned before). The eSRAM of Durango is a physical piece of hardware, which doesn't work as a cache by itself. And the software using it has to manage it, so it is also not transparent. By all reasonable accounts, it can't be called a cache.
 
And yet you also point to things in the Xbox One and say that's all they can do, as if that were fact, despite not much being known about them.

Cerny has already said more about the audio hardware in PS4 than has been said about a lot of the components in Xbox One.
MS has said nothing about the audio hardware either. The only things that are known are from the VGLeaks data (Which is the theoretical chip I comment on, in theory, if it existed.. :)) and a mention in a changelist for Wwise or FMOD, I can't remember which.
I think it's crazy to add up all the fixed flops of the audio hardware as if it were the requirement for parity, because there should be a significant down-mix before applying filters and/or SRC. Why would a game apply a different SRC or filter to thousands of individual streams instead of down-mixing before applying the processing-heavy filters? Getting thousands of streams down to hundreds reduces the processing needed by an order of magnitude. If most games don't need all of these filters, the audio hardware sits idle while, OTOH, the CPU and/or GPU are busy elsewhere. So it's not apples to apples, and we don't have an answer yet.
Which is something I've pointed out. Fixed function hardware has disadvantages. What if you would rather use those transistors for physics? Nope. Sorry. Got a "silent-movie" type game with a single audio track that could be done on a 386? Sorry, you're going to be leaving a lot of that chip dark.
 
We went through this in the audio thread, same people, same poor logic. Basically the absence of evidence is evidence of absence. They are counting on the audio block of the XB1 being secret sauce, and won't acknowledge that it exists mainly for Kinect, as pointed out by bkilian, and that a fraction of one CPU core can do what modern games require for current audio processing. It may change in the future if things get more sophisticated.
bkilian said SHAPE is 100% available for game audio, while the other processors are presently not. We'd be looking at more than a fraction of one core to match 360 & PS3 audio, depending on the game. An example IIRC was Forza using half the CPU threads on 360 for audio, and X1/PS4 CPU and Xenon being ~equivalent in this regard?
 
What is your source for this? AFAIK Cerny made a comment about the compression and decompression of audio but he did not say that was the limit/scope of the fixed function audio capability inside of PS4. Do you have another source or are you choosing to take the worst case possible scenario to make your argument?

Well, I would really love the PS4 to have a powerful audio block similar to the X1's, don't get me wrong. But I agree with Silent_Buddha on that.

And then don't forget that Cerny himself declared that in the future CUs will be used for 3D audio (something that on X1 could, in theory, be done for free by the audio block).
Same for speech recognition. The PSEye is not bundled, yet you believe the PS4 has dedicated hardware for speech recognition inside?

I'm just trying to paint a realistic picture of the PS4's capabilities.
Are there people who really believe that PS4 graphics will look 50% better than X1?
Have you seen any evidence of that, even considering the more mature development tools of the PS4?

I believe the CUs will be the PS4's SPUs. But that's just my idea; you can believe in 18 CUs for rendering, as well as 7.5GB for gaming.
 
Just ask some hardware guys.
An OS-managed disk cache fulfills the function of a cache. While from the hardware point of view it doesn't even exist (;)), it is transparent to applications (one of the crucial distinctions mentioned before). The eSRAM of Durango is a physical piece of hardware, which doesn't work as a cache by itself. And the software using it has to manage it, so it is also not transparent. By all reasonable accounts, it can't be called a cache.

At a hardware level, and unlike Haswell GT3e, it isn't a cache, no, but it's memory that can be used as one. I'm just not comfortable calling it "scratchpad", because I associate it with far smaller amounts of memory.

I've (in my youth :oops:) implemented a btree over a software cache for a college assignment, and the rest of the application that used it didn't care or notice whether it had a cache or not, except for the internal competition over which group had the fastest implementation. That was DOS, pre-memory extenders, so 640KB (2% of the XB1's eSRAM).

If it quacks like a cache... ;)
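In the same spirit, here's a minimal sketch of what a software caching layer over a fixed-size fast-memory pool can look like. The sizes, the 64KB "line" granularity and the LRU policy are illustrative assumptions only, nothing from an actual SDK, and a real implementation would move data with hardware copy engines rather than Python objects.

Code:
from collections import OrderedDict

FAST_POOL_SIZE = 32 * 1024 * 1024   # assumed fast-memory budget
LINE_SIZE = 64 * 1024               # managed in large chunks to keep overhead low
NUM_LINES = FAST_POOL_SIZE // LINE_SIZE

class SoftwareCache:
    """LRU cache mapping main-memory block ids to slots in the fast pool."""
    def __init__(self, fetch_from_main_memory):
        self.fetch = fetch_from_main_memory   # callback: block_id -> bytes
        self.lines = OrderedDict()            # block_id -> cached data

    def read(self, block_id):
        if block_id in self.lines:            # hit: refresh LRU position
            self.lines.move_to_end(block_id)
            return self.lines[block_id]
        if len(self.lines) >= NUM_LINES:      # miss with pool full: evict LRU line
            self.lines.popitem(last=False)
        data = self.fetch(block_id)           # "copy in" from the slow pool
        self.lines[block_id] = data
        return data

# Usage: the rest of the application just calls read(); whether the data came from
# the fast pool or main memory is invisible to it, which is the point made above.
cache = SoftwareCache(lambda block_id: bytes(LINE_SIZE))
chunk = cache.read(42)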
 
I believe the CUs will be the PS4's SPUs.
If the SPUs got a bad reputation, it was because of the pains devs had to go through to use them effectively.

With CUs, you get more of the standard GP[GP]U hardware, free to use for graphics, compute, audio, whatever: the same (or very similar) hardware that the XB1 has, with robust software tools. The two aren't really comparable. What do you mean by this analogy?
 
If the SPUs got a bad reputation, it was because of the pains devs had to go through to use them effectively.

With CUs, you get more of the standard GP[GP]U hardware, free to use for graphics, compute, audio, whatever: the same (or very similar) hardware that the XB1 has, with robust software tools. The two aren't really comparable. What do you mean by this analogy?

Exactly what you have described! Extra power that can be used to compensate for any potential hardware shortcoming!
 
Any modern OS uses free RAM for disk cache, and I've never seen people arguing it isn't cache "in the strict sense".

Yeah, those OSes have implemented a software caching algorithm. Devs can do that for the ESRAM if they want to go to the effort, too, but that's up to them.

Well, I would really love the PS4 to have a powerful audio block similar to the X1's, don't get me wrong. But I agree with Silent_Buddha on that.

And then don't forget that Cerny himself declared that in the future CUs will be used for 3D audio (something that on X1 could, in theory, be done for free by the audio block).
Same for speech recognition. The PSEye is not bundled, yet you believe the PS4 has dedicated hardware for speech recognition inside?

The ray-traced audio Cerny talked about is something bkilian himself has said isn't possible on the Xbox One audio chip. If you want that in your Xbox One game you'll be stuck using the GPU or CPU, just like on the PS4.
 
Well, I would really love the PS4 to have a powerful audio block similar to the X1's, don't get me wrong. But I agree with Silent_Buddha on that.

And then don't forget that Cerny himself declared that in the future CUs will be used for 3D audio (something that on X1 could, in theory, be done for free by the audio block).
Same for speech recognition. The PSEye is not bundled, yet you believe the PS4 has dedicated hardware for speech recognition inside?

I'm just trying to paint a realistic picture of the PS4's capabilities.
Are there people who really believe that PS4 graphics will look 50% better than X1?
Have you seen any evidence of that, even considering the more mature development tools of the PS4?

I believe the CUs will be the PS4's SPUs. But that's just my idea; you can believe in 18 CUs for rendering, as well as 7.5GB for gaming.

Look, I asked a very direct question because you are speaking with far more certainty than the information which is available lends itself to. If you are a developer or have access to privileged information that isn't available publicly, then simply say you can't comment on how you know; OTOH, if you are just using conjecture you might want to preface your comments with IMO, possibly, perhaps and so on.

But in the interest of the discussion I'll ask again - what are you basing your conclusion about the PS4's fixed audio function capability on?

BTW I'm not sure why you're trying to bring 7.5GB into the conversation; it has as much relevance here as the "dual APU, 3TF monster"..... That is to say, nobody said anything like that. As a matter of fact, nobody is suggesting that the PS4's audio function is as robust as what is found in the XB1; as DRJAY pointed out, it doesn't need to be.... ;)
 
We went through this in the audio thread, same people, same poor logic. Basically the absence of evidence is evidence of absence. They are counting on the audio block of the XB1 being secret sauce, and won't acknowledge that it exists mainly for Kinect, as pointed out by bkilian, and that a fraction of one CPU core can do what modern games require for current audio processing. It may change in the future if things get more sophisticated.

bkilian absolutely did NOT say that. He said he only worked on a portion of the audio block. He NEVER said it was made for Kinect. Within the audio block diagram there is a series of dedicated blocks, Kinect functionality among them.
 
The only dedicated hardware the PS4 lacks compared to the Xbox One is there to solve problems that don't exist on the PS4 (Kinect/eSRAM/Move Engines).

The Move Engines are simply DMA engines backed by compression hardware. DMA is present in every modern video card and nearly every modern real-time computing device. It's strange that Microsoft decided to rename this, but not that strange (they did it with everything else too), so in fact the PS4 does contain most of what the Move Engines do, and I also wouldn't be surprised if they worked at the same speed.
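Conceptually that's just a copy engine draining a queue of job descriptors, with optional (de)compression in flight. A purely hypothetical sketch of the idea; none of these names or fields come from either SDK.

Code:
from dataclasses import dataclass
from enum import Enum, auto

class Codec(Enum):
    NONE = auto()        # plain memory-to-memory copy
    LZ_DECODE = auto()   # decompress while copying, e.g. streaming assets in
    JPEG_DECODE = auto()

@dataclass
class CopyJob:
    src_addr: int
    dst_addr: int
    size: int
    codec: Codec = Codec.NONE

def submit(queue, job):
    """Software queues a descriptor; the copy engine drains the queue asynchronously,
    leaving the CPU cores and GPU shader units free for other work."""
    queue.append(job)

jobs = []
submit(jobs, CopyJob(src_addr=0x10000000, dst_addr=0x20000000,
                     size=2 * 1024 * 1024, codec=Codec.LZ_DECODE))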
 
The read-write turnaround for the bus is a multiple of that time period. At 5.5Gbps, the best-case latency where the bus provides nothing is 13 command clocks, worst case 20.

[CL_mrs + (BL/4) + 2 - WL_mrs] * tCK
with CL_mrs = (20, 19, 18, 17, 16), BL = 8, WL_mrs = (4, 6, 7)

http://www.hynix.com/datasheet/pdf/graphics/H5GQ1H24AFR(Rev1.0).pdf

That would be six to ten cache transfers not utilized, with the length of time before the next transition dependent on what level is considered good enough, balanced with latency requirements for the CPU.
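For anyone who wants to sanity-check those bounds, here is the formula above evaluated over the quoted CL/WL values (the value lists are the poster's reading of the Hynix datasheet, not an independent source):

Code:
BL = 8                            # GDDR5 burst length
cl_values = [20, 19, 18, 17, 16]  # CL_mrs options quoted above
wl_values = [4, 6, 7]             # WL_mrs options quoted above

turnarounds = [cl + BL // 4 + 2 - wl for cl in cl_values for wl in wl_values]
print(min(turnarounds), max(turnarounds))  # -> 13 20 command clocks (in tCK)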

If it were the GPU alone running a traditional graphics workload, that looks well-handled.

The unknown in my eyes is the octo-core Jaguar part of the APU and Mark Cerny's desire to leverage asynchronous compute heavily.
This falls heavily on the CPU memory controller, onion bus, and the customizations in the L2, since this traffic does not rely on the ROPs.
Considering that fetching a single 64-byte cacheline over one 32-bit channel takes two back-to-back bursts and hence four command/address clock cycles, one would just waste between three and five transfers in that case.
It also depends on how AMD sets up the address interleaving between the memory channels. They could actually opt to use a relatively coarse pattern, as each 32-bit channel already delivers 22GB/s; that means the CPU itself does not rely on channel interleaving to reach peak throughput, which is supposedly short of 20GB/s (but the multiple independent channels could even increase the throughput for mixed read/write random access compared to traditional CPU memory interfaces, at the cost of blocking some significant capacity of the memory interface). So while I agree that the balance between simultaneous quasi-random fine-grained accesses and more contiguous accesses demanding high bandwidth may be difficult, it may not be that bad. The CPU uses the 2x2MB L2 caches, which may give enough opportunity to catch quite a lot of the locality. Similarly for the GPU: one has either the specialized ROP caches, which provide a coarser access granularity, or the L2 cache of the GPU (probably 512kB, potentially 1MB, working as a writeback cache if not explicitly bypassed), which helps a bit.
On the upside, the potential bandwidth waste of the fine-grained memory accesses may enable aggressive prefetch strategies. As some bandwidth is probably wasted either way, one can always also load the consecutive cache line, or some other assumed/detected stride, at very little additional cost.

And the problem also applies to some extent to the XB1 as the texturing will very likely run mainly from the DDR3 interface, while the larger accesses from the ROP caches will likely go to the eSRAM.
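The per-channel figures in this post follow from simple arithmetic. A small check, assuming 5.5Gbps per pin, 32-bit channels and burst length 8 as stated above:

Code:
data_rate_gbps = 5.5      # per pin
channel_width_bits = 32
burst_length = 8
cacheline_bytes = 64

bytes_per_burst = channel_width_bits // 8 * burst_length          # 32 bytes
bursts_per_cacheline = cacheline_bytes // bytes_per_burst         # 2 back-to-back bursts
channel_bandwidth_gb_s = data_rate_gbps * channel_width_bits / 8  # 22.0 GB/s

print(bytes_per_burst, bursts_per_cacheline, channel_bandwidth_gb_s)
# -> 32 2 22.0, matching "two bursts per 64-byte line" and "22GB/s per 32-bit channel"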
 
At a hardware level, and unlike Haswell GT3e, it isn't a cache, no,
I see we agree.
but it's memory that can be used as one.
And I even wrote how. :D
I'm just not comfortable calling it "scratchpad", because I associate it with far smaller amounts of memory.
I've (in my youth :oops:) implemented a btree over a software cache [..]
That was DOS, pre-memory extenders, so 640KB (2% of the XB1's eSRAM).
Absolute size is basically irrelevant. That's just a quantitative change (and potentially growing with time and the number of transistors in an ASIC in general), not a qualitative one. And compared to the 8GB memory pool, the 32MB eSRAM is actually pretty small (less than 0.4% of the size of the main RAM). A 2.5kB scratchpad in that old machine back in your youth would be the same relative size. :LOL:
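The relative-size arithmetic checks out (plain calculation using the figures above):

Code:
esram_ratio = (32 * 2**20) / (8 * 2**30)          # 32MB out of an 8GB pool
scratchpad_ratio = (2.5 * 2**10) / (640 * 2**10)  # 2.5kB out of 640KB
print(f"{esram_ratio:.2%}  {scratchpad_ratio:.2%}")  # -> 0.39%  0.39%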
If it quacks like a cache... ;)
Must be your tinnitus, there is no quack to hear from the XB1 eSRAM pool. ;)
 
Aren't 16 ROPs adequate for 1080p? IIRC there was talk about the 32 ROPs on the PS4 being slightly overkill (but probably useful for things like 3D or 4K rendering). EDIT: It was Digital Foundry http://www.eurogamer.net/articles/df-hardware-spec-analysis-durango-vs-orbis

What's interesting is how similar the 1080p vs 1000p screens of Crysis 3 look. I bet if you did a double blind test from normal viewing distances and asked people to choose which was higher res they wouldn't be able to tell you.

Also, why is he saying it is a 17.2% reduction in resolution going from 1080p to 1776x1000?
I make it a 14.4% resolution decrease?
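A quick pixel-count check (plain arithmetic) backs the 14.4% figure:

Code:
full_hd = 1920 * 1080   # 2,073,600 pixels
reduced = 1776 * 1000   # 1,776,000 pixels
print(f"{1 - reduced / full_hd:.1%}")  # -> 14.4% fewer pixels than 1080p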

Really the article just says what ERP and others have been saying for ages: the real-world performance differential between the two is going to be more like 20-30% than 50%.

"Wildly unbalanced" is certainly too strong, but the comments made by Cerny himself seem to clearly indicate that the PS4 is ALU-heavy and that those ALUs are likely to be poorly utilized if they aren't being fed GPU compute tasks on top of the traditional rendering workload.
Yes that's what I've heard.

GPU compute is apparently difficult to do; even Guerrilla only has one compute job running on the GPU (memory defragmentation).
 
Well from the DF piece:

It may well be the case that 12 compute units was chosen as the most balanced set-up to match the Jaguar CPU architecture.

Is it a rule of thumb that you should have around 2 CUs per core, or per thread? Actually, what does "balance" mean anyway, and how is such a thing determined?
 
I guess one definition of 'balance' could be when you don't have high end hardware components that are being bottlenecked by other lower spec components in the system.

So a system with a HD7850 paired with a Core i3 would be unbalanced.

I guess with these upcoming consoles the CPU is the real weak link so the PS4 having 50% more GPU power but the exact same CPU power would automatically make it a less 'balanced' design than the XB1.
 
I guess one definition of 'balance' could be when you don't have high end hardware components that are being bottlenecked by other lower spec components in the system.

So a system with a HD7850 with a Core i3 would be unbalanced.

I guess with these upcoming consoles the CPU is the real weak link so the PS4 having 50% more GPU power but the exact same CPU power would automatically make it a less 'balanced' design than the XB1.

Of course this is a bit of an MS-centric version of balance... I mean maybe balance is 15 CUs and both systems are unbalanced. :LOL: Also 12 might be good for XB1 games but 14-16 might be useful for Sony games. I am sure time will tell one way or the other.

Since the whole 14+4 thing seems to pop up all the time, might I suggest that if we were to use 2 CUs per core as a metric, then 14 could mean 7 cores, eventually? The 6-core figure for games on the PS4 comes from the Killzone screenshot, but we actually don't know what the reserve will be. However, since Sony is matching MS in terms of memory reserve, 6 cores seems quite reasonable.

Now, since we know devs will take some time to use "compute" in their games, maybe Sony might find a use for a CU or two to offload a core from "reserve" duty, depending on a lot of things of course. I mean, say you really only need 1 core plus part of another for background processing; it might be reasonable to suspect that Sony engineers would give it a go. I do NOT suggest, however, that 1 core can be replaced by 1 or more CUs, but I would suggest that 1 core in conjunction with 1 or 2 CUs might be able to do the work of 1 and 1/2 or 1 and 1/3 cores, say... again depending on what the system will need. Then at that point devs get to play with an extra core relatively early in the machine's life, as opposed to later when compute becomes the thing to use with the game.

Pie in the sky I am sure, and full of caveats and hand-waving, but surely not deranged... I hope ;)

Added: forgot to agree that 50% GPU being unbalanced is a reasonable assumption in any case.
 
Of course this a bit of a MS centric version of balance ... I mean maybe balance is 15 CUs and both systems are unbalanced. :LOL: Also 12 might be good for XB1 games but 14-16 might be useful for Sony games. I am sure time will tell one way or the other.

I think you're getting confused between balance and efficacy.

For example, the infamous Saab 9-3 Viggen is a textbook example of an unbalanced car, with a powerful turbo engine married to a FWD chassis that simply could not cope with all the torque.

However it'd definitely be faster around a track than the regular, 'balanced' 9-3 models.

In the same way despite PS4 possibly being a less 'balanced' design, it should still outperform XB1.
 