Technical investigation into PS4 and XB1 audio solutions *spawn

As you can see, I'm still a little bitter that there is all that DSP available, and the Kinect guys reserved the lot. :)

Those bastards! ;)

So, if you got your way and game developers were given the option to "steal back" some of that hardware from Kinect, what would happen on the Kinect side of things? Assume that while playing some AAA title, the Kinect no longer has exclusive use of those DSPs. What happens to the Kinect user experience during that game?

Could an uneasy truce be hammered out? Could the Kinect get "its" resources back on-demand if you were willing to pause the game and then shout at the TV?
 
Yes, I understand the process of moving your head to localise sound, but if you wish to detect sound location by moving your head, it's your virtual head you need to move, not your real head.
Both would be best.

Yes, game sound absolutely has to react when your virtual "game head" moves. And it typically does. That's a solved problem these days. But it would be nice if it reacted when your actual head moved, too. If you are using speakers, you are already enjoying the benefit of "real head" tracking. You are no doubt using slight head movements to localize the exact positions of your speakers, as well as the audio "images" that may seem to reside in between certain well matched speaker pairs. (At least, I hope that second thing happens as well... not sure.)

If you are wearing headphones, suddenly those little head motions that humans do constantly stop giving you any additional sound localization info, and start giving you crazy, wrong, and unhelpful info. The whole world seems to move when your head moves. When you turn/tilt your head to confirm a sound's direction, the opposite actually happens, and the direction becomes less discernible. You are receiving contradictory cues. You will learn to ignore stuff like that to some degree, but it clearly harms the believability of the experience.

So yeah, it'd be nice to have both. It'll probably never be worth the expense to do head tracking just for audio, but if we end up doing it for another reason, then "Yay!" Headphone audio can get better "for free".

I guess speaker-based audio could be improved by head tracking as well, in the sense that you could tilt your head to figure out whether sounds were coming from above or below you, even though your speakers all tend to be arranged within a 2D-ish plane. (The game would pipe more helicopter noise to your right speakers if it sensed that your right ear was pointed more upwards.)
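
As an aside, the plumbing for this would be almost trivial once a tracker exists. Here is a minimal sketch, with a hypothetical tracker_get_yaw() and a hypothetical hrtf_render_source() standing in for a real tracker and a real binaural renderer (neither is an actual console API); the point is just that the compensation is a single subtraction per source:

/* Sketch: making headphone audio respect *real* head movement.
 * tracker_get_yaw() and hrtf_render_source() are placeholders, not real APIs. */
#include <stdio.h>

static const double RAD2DEG = 57.29577951308232;

/* Stand-ins: head yaw from a tracker, and a source direction from the game.
 * Convention: azimuth in radians, positive = to the listener's left. */
static double tracker_get_yaw(void)     { return 0.35; } /* head turned ~20 deg left */
static double game_source_azimuth(void) { return 1.57; } /* source 90 deg left of the virtual head */

static void hrtf_render_source(double azimuth_rad)
{
    /* Placeholder for the per-source HRTF filtering stage. */
    printf("render source at %+.1f deg relative to the ears\n", azimuth_rad * RAD2DEG);
}

int main(void)
{
    double world_az = game_source_azimuth();

    /* Subtract the real head yaw so the sound stays put in the room when the
     * listener turns, instead of turning along with the headphones. */
    double ear_az = world_az - tracker_get_yaw();

    hrtf_render_source(ear_az);
    return 0;
}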
 
The point of HRTF is you don't have to tilt your head. You want to be able to tell where a sound is coming from without moving your head, just like real life (unless you count tiny, possibly subconscious head movements).
 
I was in the audio team. I worked on WASAPI and the audio hardware.

I have no real opinion on the console as a complete package; I haven't actually seen the games. It's easy to develop for (I wrote a Scream Tracker player for it in just a few hours when I was bored), but other than that, your guess is as good as mine.

How many GFLOPs/GOPs would you say the FF decoder on SHAPE contributes to the overall value? I'm trying to get a handle on how much you'd need on the PS4 to emulate SHAPE in its entirety, but I think the overall power number includes the FF decoder, which the PS4 has a similar module for.
 
How many GFLOPs/GOPs would you say the FF decoder on SHAPE contributes to the overall value? I'm trying to get a handle on how much you'd need on the PS4 to emulate SHAPE in its entirety, but I think the overall power number includes the FF decoder, which the PS4 has a similar module for.
FF decoder? I'm afraid you lost me somewhere. You mean the XMA decoder module? The ASP also does codecs, for chat, etc. I'd be surprised if the XMA decoder had been included in that "GOPS" number given in the Hotchips presentation, since it's really just a copy of the identical chip in the 360, with a couple of bugs fixed and a higher clock rate. But I'm not really sure.
 
FF decoder? I'm afraid you lost me somewhere. You mean the XMA decoder module? The ASP also does codecs, for chat, etc. I'd be surprised if the XMA decoder had been included in that "GOPS" number given in the Hotchips presentation, since it's really just a copy of the identical chip in the 360, with a couple of bugs fixed and a higher clock rate. But I'm not really sure.

Thanks, figured as much. It just seemed to be lumped in with the rest of the FF hardware, which is what I thought the overall GFLOP value was for.
 
The point of HRTF is you don't have to tilt your head. You want to be able to tell where a sound is coming from without moving your head, just like real life (unless you count tiny, possibly subconscious head movements).
Yes, we're talking about generally subconscious head movement, although it still works if you do think about it. ;) I'm not sure what you consider to be "tiny". Five degrees or so? Little enough that it doesn't interfere with maintaining an effortless visual "lock on" to whatever you're looking at. On the other hand, if you are in a full-on, "Where the hell is that noise coming from?!" situation, you'll almost certainly be swiveling back and forth more radically than that.

HRTF-related cues and volume-based binaural cues (among others!) work together to aid us in locating sounds. Just because we have one, it doesn't mean we wouldn't benefit from another. This is particularly the case if we are being "jammed" by incorrect binaural cues. (Like the ones we receive if we move our heads while wearing headphones.) In the case where two sets of cues disagree, which should your brain choose to believe? It muddies the waters. To make an analogy to vision and depth perception: just because we have binocular vision, that doesn't mean we don't gather further (and often better) info about relative depth from other cues, like parallax effects, or focus differences, or known sizes, etc.

The fact that mammals have evolved multiple independent methods of determining where a sound is coming from is a pretty good clue that trying to rely on a single method is a good way to get eaten. (Or starve.) Heck, many animals can tilt their ears independently of each other, and of their heads. It's a genuinely useful technique.

Lastly, current HRTF solutions don't actually work well for everybody. Me, for instance. It's not a one-size-fits-all solution. What sounds like "behind you" to you might not sound that way to me. And if we tweaked the commonly used HRTFs so they did work for me, they might stop working for you. So, more work is needed on that front as well. Swiveling an ear towards a sound to verify that it is indeed coming from a particular quadrant is a pretty simple, foolproof and universal way to accomplish that task. So, it would be nice if that actually worked when we are playing games.

Not required, obviously, but nice.
 
How many GFLOPs/GOPs would you say the FF decoder on SHAPE contributes to the overall value? I'm trying to get a handle on how much you'd need on the PS4 to emulate SHAPE in its entirety, but I think the overall power number includes the FF decoder, which the PS4 has a similar module for.

As demonstrated multiple times, GFLOPs/GOPs are meaningless for determining performance in real workloads.

The FLT/VOL module does something like this:

Lowpass = Lowpass + cFREQ * Bandpass
Highpass = cVOLUME * Input - Lowpass - cQ * Bandpass
Bandpass = cFREQ * Highpass + Bandpass
Notch = Highpass + Lowpass

cVOLUME, cFREQ and cQ are precalculated coefficients. Some things you notice first:

* This is a naive implementation without any optimization based on which filter type you want.
* This is a recursive structure (it needs previously calculated results), which is not a natural fit for GPU processing, so you would need the CPU for the most efficient processing.
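
For anyone who wants to play with it, here is that recurrence as a runnable per-sample loop. This is just a sketch of the textbook Chamberlin state-variable filter; the coefficient formulas are the standard ones and an assumption on my part, not anything taken from SHAPE documentation:

/* Chamberlin state-variable filter, mirroring the pseudocode above.
 * Coefficient formulas are the textbook ones (assumed, not from SHAPE docs). */
#include <math.h>
#include <stddef.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

typedef struct {
    float low, band;            /* filter state: Lowpass and Bandpass */
    float cFreq, cQ, cVolume;   /* precalculated coefficients         */
} Svf;

static void svf_init(Svf *f, float fc, float q, float volume, float fs)
{
    f->low  = 0.0f;
    f->band = 0.0f;
    f->cFreq   = 2.0f * sinf((float)(M_PI * fc / fs));  /* cFREQ   */
    f->cQ      = 1.0f / q;                               /* cQ      */
    f->cVolume = volume;                                 /* cVOLUME */
}

/* Process one block in place, keeping the low-pass tap.
 * Write 'high', 'band' or 'notch' instead to take the other outputs. */
static void svf_process(Svf *f, float *buf, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        float low   = f->low + f->cFreq * f->band;
        float high  = f->cVolume * buf[i] - low - f->cQ * f->band;
        float band  = f->cFreq * high + f->band;
        float notch = high + low;
        (void)notch;            /* unused by this tap */

        f->low  = low;
        f->band = band;
        buf[i]  = low;
    }
}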

According to bkilian, the EQ/CMP module can process 512 streams with filtering (EQ) and volume manipulation (CMP) simultaneously, but the obvious question is: how many sounds (each with different volume and filtering) are actually played simultaneously in an audio engine?

The FLT/VOL module gives us an additional 2500 simultaneous streams which have more or less the exact same functionality as the EQ/CMP module. But if we assume maximum flexibility for the DMA in SHAPE, then we could actually extract the filtering information (an approximation) from the different HRTF IRs and use the FLT/VOL module for the filtering. The DMA part would have to manage the 'time of arrival' part (delay) of the HRTF IRs. Is the below possible?

MemoryRead (pointer) -> SVF1 -> SVF2 -> SVF3 -> SVF4 -> Mix
MemoryRead (pointer + 36) -> SVF5 -> SVF6 -> SVF7 -> SVF8 -> Mix
MemoryRead (pointer + 17) -> SVF9 -> SVF10 -> SVF11 -> SVF12 -> Mix
etc.
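
To put the routing above in software terms, one lane of that diagram amounts to: read the stream at a per-lane sample offset (the 'time of arrival'/delay part of the IR), push it through four state-variable filter stages approximating that IR's magnitude response, and sum into a mix bus. The sketch below is purely illustrative; whether SHAPE's DMA engine can actually be chained like this is exactly the open question:

/* One "MemoryRead (pointer + offset) -> SVF1 -> SVF2 -> SVF3 -> SVF4 -> Mix"
 * lane written out in plain C. The Stage struct is the same Chamberlin SVF as
 * in the sketch above; none of this reflects a real SHAPE programming interface. */
#include <stddef.h>

typedef struct { float low, band, cFreq, cQ, cVolume; } Stage;

/* One filter step, low-pass tap (the tap would be chosen per HRTF fit). */
static float stage_tick(Stage *s, float in)
{
    s->low = s->low + s->cFreq * s->band;
    float high = s->cVolume * in - s->low - s->cQ * s->band;
    s->band = s->cFreq * high + s->band;
    return s->low;
}

static void render_lane(const float *source, size_t source_len,
                        size_t offset,         /* MemoryRead offset, in samples        */
                        Stage stages[4],       /* 4-stage approximation of one HRTF IR */
                        float *mix, size_t n)  /* the Mix bus, accumulated into        */
{
    for (size_t i = 0; i < n; ++i) {
        float x = (i + offset < source_len) ? source[i + offset] : 0.0f;
        for (int k = 0; k < 4; ++k)
            x = stage_tick(&stages[k], x);     /* SVF1 -> SVF2 -> SVF3 -> SVF4 */
        mix[i] += x;                           /* Mix */
    }
}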
 
As demonstrated multiple times, GFLOPs/GOPs are meaningless for determining performance in real workloads.

The FLT/VOL module does something like this:

Lowpass = Lowpass + cFREQ * Bandpass
Highpass = cVOLUME * Input - Lowpass - cQ * Bandpass
Bandpass = cFREQ * Highpass + Bandpass
Notch = Highpass + Lowpass

cVOLUME, cFREQ and cQ are precalculated coefficients. Some things you notice first:

* This is a naive implementation without any optimization based on which filter type you want.
* This is a recursive structure (it needs previously calculated results), which is not a natural fit for GPU processing, so you would need the CPU for the most efficient processing.

According to bkilian, the EQ/CMP module can process 512 streams with filtering (EQ) and volume manipulation (CMP) simultaneously, but the obvious question is: how many sounds (each with different volume and filtering) are actually played simultaneously in an audio engine?

The FLT/VOL module gives us an additional 2500 simultaneous streams which have more or less the exact same functionality as the EQ/CMP module. But if we assume maximum flexibility for the DMA in SHAPE, then we could actually extract the filtering information (an approximation) from the different HRTF IRs and use the FLT/VOL module for the filtering. The DMA part would have to manage the 'time of arrival' part (delay) of the HRTF IRs. Is the below possible?

MemoryRead (pointer) -> SVF1 -> SVF2 -> SVF3 -> SVF4 -> Mix
MemoryRead (pointer + 36) -> SVF5 -> SVF6 -> SVF7 -> SVF8 -> Mix
MemoryRead (pointer + 17) -> SVF9 -> SVF10 -> SVF11 -> SVF12 -> Mix
etc.
You'd probably be able to emulate the EQ using the FLT, but emulating the CMP using the VOL would need some CPU processing.

I'm not sure if your example at the bottom is possible, IIRC the DMA part required the sample to be aligned to some number of bytes, but I can't remember the granularity of the alignment requirement. You could tell it to skip some number of bytes, which was designed to let you skip XMA header structures, but could possibly be used for this kind of manipulation.
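
For a sense of scale on those offsets: the interaural delay those per-lane reads would have to encode tops out around 0.6-0.7 ms for a human head, so the offsets involved are tiny, which is why the alignment granularity matters. A quick back-of-the-envelope, assuming (my assumption) 48 kHz, 16-bit mono source buffers:

/* Back-of-the-envelope for the size of the per-lane read offsets.
 * 48 kHz and 16-bit mono PCM are assumptions, not known SHAPE formats. */
#include <stdio.h>

int main(void)
{
    const double fs = 48000.0;           /* sample rate (assumed)         */
    const double itd_max = 0.00066;      /* ~0.66 ms max interaural delay */
    const int bytes_per_sample = 2;      /* 16-bit PCM (assumed)          */

    int samples = (int)(itd_max * fs + 0.5);      /* ~32 samples */
    printf("max ITD ~ %d samples ~ %d bytes\n",
           samples, samples * bytes_per_sample);  /* ~64 bytes   */
    return 0;
}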
 
I doubt it's a full core being saved. Technically, the AVPs can process more than a Jaguar core's worth of code, but I have a hard time believing the speech pipeline has gotten that heavy.

Also, if you're a developer, and you're not interested in Kinect, offloading the Kinect processing hasn't "saved" you anything...

As you can see, I'm still a little bitter that there is all that DSP available, and the Kinect guys reserved the lot. :)

Thank you, bkilian!
Regarding this topic, just to have a clear picture: to your knowledge, what could fill the overall speech pipeline (up to, let's say, a CPU core)? Recognition of two simultaneous voices (as hinted)? More complex dialogue recognition (a text parser like the old text adventures; that would be incredible)? Higher accuracy and faster recognition?

And also, at the time, were you aware of any game in development that made great use of the Audio Block, or that simply put a lot of effort into it?
 
Digital Foundry: You talk about having 15 processors. Can you break that down?

Nick Baker: On the SoC, there are many parallel engines - some of those are more like CPU cores or DSP cores. How we count to 15: [we have] eight inside the audio block, four move engines, one video encode, one video decode and one video compositor/resizer.

The audio block was completely unique. That was designed by us in-house. It's based on four Tensilica DSP cores and several programmable processing engines. We break it up as one core running control, two cores running a lot of vector code for speech and one for general purpose DSP. We couple with that sample rate conversion, filtering, mixing, equalisation, dynamic range compensation, then also the XMA audio block. The goal was to run 512 simultaneous voices for game audio as well as being able to do speech pre-processing for Kinect.

Is there anything new?

http://www.eurogamer.net/articles/digitalfoundry-the-complete-xbox-one-interview
 
That DSP seems to be SHAPE, I think.

On a different note, this article is titled "Meet the researcher who took Xbox One’s audio performance to the next level", :eek: so it's pretty self-describing (I had thought it would be bkilian) :smile:

http://blogs.technet.com/b/next/arc...rformance-to-the-next-level.aspx#.Ul7R-pBBvDc

Nice sound-proof building, btw
Haha! I'm no researcher. That DSP is not SHAPE. SHAPE is not a general purpose DSP, it's a collection of fixed function units. That DSP is probably the ASP (Audio Scalar Processor) which is used for supporting codecs the hardware doesn't natively support, and some other things.

And Ivan is awesome to work with. He spent six months sharing an office with one of my coworkers during the original Kinect timeframe. He designed a bunch of the Kinect pipeline algorithms (in Matlab, of course :)).
 
I have an interesting hypothesis about the dedicated audio hardware of the PS4.

What if the dedicated audio hardware is one of the two Jaguar cores reserved by the OS? Every previous ambiguous announcement about the audio in the PS4 would thus be logically explained.
 