Technical investigation into PS4 and XB1 audio solutions *spawn

I think it's more likely you'll just do the audio stuff on the CPU and find something else that will work better on the GPU to offload there.

RudeCurve said:
Didn't bkilian address that in the other thread? He said SHAPE works out to roughly 100 GFLOPS if done on a non-fixed-function processor like a CPU. He said it would need more than one Jaguar core, which due to IPC is almost equal to the XCPU's "100 GFLOPS".

Woah, that's a lot of fuzzy math.
 
3D audio needs to be tightly integrated with the game engine, generally. It can easily be done on CPU or GPU. The problem is that most console gamers don't game using headphones, and you need to know a lot about the speaker setup, the gamer's position, and head orientation in order to do good 3D audio positioning when headphones aren't in use. Even when they are, you still need to know head orientation if you're going to do it right. Humans, and other animals, use head movement to determine the true 3D position of a sound. Without that, you get an audio effect kinda like the head-tracking pseudo-3D graphics effect: it looks good, but since your eyes are both getting the same information, your brain isn't really fooled.

The reserved blocks on the audio chip are used for codecs and Kinect speech processing. The Kinect speech pipeline is one of the most impressive audio graphs I've seen. One of the PMs printed it out once as a block diagram, and to get the text readable it had to be on a 5 by 3 foot sheet of paper.

So in your opinion it should be easier for a game to get "3D" sound with a pair of headphones (any special requirements?) than with a properly calibrated HT 5.1 setup?

I usually play on my 5.1 setup and enjoy it, but I must admit that I kind of zoned out of sound development in games, because IMHO the current big titles seem to go for loudness instead of using sound in a subtle way.

On that note, GTA5 has a fucked-up sound balance; the characters' voices are completely off in level compared to the rest of the game's sounds. At least they use the center channel almost exclusively, so it was fairly easy to turn down the volume :)
 
Didn't bkilian address that in the other thread? He said SHAPE works out to roughly 100 GFLOPS if done on a non-fixed-function processor like a CPU. He said it would need more than one Jaguar core, which due to IPC is almost equal to the XCPU's "100 GFLOPS". How many GFLOPS can one CU process?

IIRC 1 CU is ~102 GFLOPS or thereabouts.

One Jaguar core is ~12 GFLOPS, so unless SHAPE requires the entire 8-core Jaguar cluster, it is nowhere near 100 GFLOPS. ">1 core" is a useless metric; it only gives a lower bound, and a very low lower bound at that (12 GFLOPS).
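For what it's worth, here is a rough back-of-the-envelope sketch of where those peak numbers come from. The clock speeds and FLOPS-per-cycle figures are my assumptions (a ~1.6 GHz Jaguar core issuing a 128-bit SP add and multiply per cycle, a GCN CU with 64 lanes doing fused multiply-adds at ~0.8 GHz), not confirmed specs:

```c
/* Rough peak-FLOPS arithmetic, purely illustrative.
 * Assumptions: a Jaguar core at ~1.6 GHz issuing one 128-bit SP add and
 * one 128-bit SP multiply per cycle (8 FLOPS/cycle); one GCN CU with
 * 64 lanes doing fused multiply-adds (2 FLOPS/lane/cycle) at ~0.8 GHz. */
#include <stdio.h>

int main(void)
{
    double jaguar_ghz = 1.6, jaguar_flops_per_cycle = 8.0;
    double cu_ghz = 0.8, cu_lanes = 64.0, cu_flops_per_lane_cycle = 2.0;

    double jaguar_gflops = jaguar_ghz * jaguar_flops_per_cycle;        /* ~12.8  */
    double cu_gflops = cu_ghz * cu_lanes * cu_flops_per_lane_cycle;    /* ~102.4 */

    printf("1 Jaguar core: ~%.1f GFLOPS peak\n", jaguar_gflops);
    printf("1 GCN CU     : ~%.1f GFLOPS peak\n", cu_gflops);
    printf("Jaguar cores needed to reach 100 GFLOPS: %.1f\n", 100.0 / jaguar_gflops);
    return 0;
}
```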
 
So in your opinion it should be easier for a game to get "3D" sound with a pair of headphones (any special requirements?) than with a properly calibrated HT 5.1 setup?

I usually play on my 5.1 setup and enjoy it, but I must admit that I kind of zoned out of sound development in games, because IMHO the current big titles seem to go for loudness instead of using sound in a subtle way.

On that note, GTA5 has a fucked-up sound balance; the characters' voices are completely off in level compared to the rest of the game's sounds. At least they use the center channel almost exclusively, so it was fairly easy to turn down the volume :)

3D sound is/was always ideal on headphones, because they fire directly into your ears, which vastly simplifies the problem that needs to be solved. It still has to use processing to trick you into hearing sounds behind you; 5.1 has the advantage of actually producing sounds behind you.

Ideally, with headphones the algorithm would know which model of headphones you're using so it could correct for its frequency response, it would be customized for your specific ear shape, and it would have a gyroscope so it could track your head movement. With all those pieces in place, it would be incredibly convincing.
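As a toy illustration of the headphone case, here is a minimal sketch of interaural time/level difference panning for a single mono source. Real binaural renderers use measured HRTFs per ear (ideally personalized, as described above); the head radius, the Woodworth-style ITD approximation, and the level-difference curve below are textbook approximations picked for illustration, not anything from a console SDK:

```c
/* Crude binaural positioning sketch: interaural time difference (ITD) and
 * a toy interaural level difference (ILD) only -- no measured HRTF, no
 * head tracking.  All constants are textbook approximations. */
#include <math.h>
#include <stdio.h>

#define SAMPLE_RATE    48000.0
#define HEAD_RADIUS_M  0.0875   /* average head radius in metres */
#define SPEED_OF_SOUND 343.0    /* m/s at room temperature       */

/* Woodworth-style ITD approximation for a source at 'azimuth' radians
 * (0 = straight ahead, +pi/2 = hard right). */
static double itd_seconds(double azimuth)
{
    return (HEAD_RADIUS_M / SPEED_OF_SOUND) * (azimuth + sin(azimuth));
}

int main(void)
{
    const double pi = 3.14159265358979323846;
    double azimuth = pi / 4.0;                 /* source 45 degrees to the right */
    double itd = itd_seconds(azimuth);
    double delay_samples = itd * SAMPLE_RATE;  /* extra delay applied to the far (left) ear */
    double ild_db = 6.0 * sin(azimuth);        /* toy level boost for the near (right) ear  */

    printf("ITD: %.0f us (%.1f samples), ILD: %.1f dB\n",
           itd * 1e6, delay_samples, ild_db);
    return 0;
}
```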
 
3D audio needs to be tightly integrated with the game engine, generally. It can easily be done on CPU or GPU. The problem is that most console gamers don't game using headphones, and you need to know a lot about the speaker setup, the gamer's position, and head orientation in order to do good 3D audio positioning when headphones aren't in use. Even when they are, you still need to know head orientation if you're going to do it right. Humans, and other animals, use head movement to determine the true 3D position of a sound. Without that, you get an audio effect kinda like the head-tracking pseudo-3D graphics effect: it looks good, but since your eyes are both getting the same information, your brain isn't really fooled.

I don't really buy the idea that the reason they're not doing it at all is that they can't do it perfectly yet. A3D/DS3D wasn't perfect, but it was still an improvement. It's not like the industry gave up on polygons after the PS1 just because it couldn't produce photorealistic graphics. For headphones, it's easy enough to put a menu option in the game, or better yet in the OS. Same with speaker position (GTA V even lets you customize it), and Kinect can help with fine-tuning it as well.

I'm stoked AMD is pushing back into it, I just don't see it gaining traction unless the consoles are on board.
 
I guess the issue here is that FLOPS is a pretty useless metric for lots of workloads.
Lots of musicians' tools require serious processing power (in healthy quantities), yet you won't see Line 6 or others advertising how many FLOPS their DSPs are pushing; that is pretty much what Relab explained.
I don't think those guys are about to replace the chips they use (CPU + DSP) with some mobile SoC just because such SoCs can now deliver a sane amount of (GPU) FLOPS or a CPU with strong SIMD. They were not interested in the Cell either, for example.

Relab made it clear: if either CPUs or GPUs were efficient (per watt and per mm^2) at handling various sound effects, those guys would indeed use existing hardware, or simply ship software with increased margins.
Now, I think what Relab is asking is whether the XB1 audio block can run things like the fancy simulations going on in those expensive Kemper heads (amp simulators and more). I believe the answer is no, though it doesn't have to.

EDIT: Damn, I was searching for the thread about audio and could not find it; I think it is safe to discuss that elsewhere.
 
I really hope so. I used to love buying new sound cards every two years and being amazed. I remember when I got my Sound Blaster AWE32 back in the day; that thing was amazing.

I remember that time very fondly. I worked for Creative during the SB16-SBAWE64 time-frame. A great time to be in the business & a gamer. ;)

Tommy McClain
 
The DSP numbers are based on completely arbitrary routing (including splitting of streams) and functions, like: In -> SVF(BP) -> SVF(LP) -> CMP -> SVF(BP) -> SRC -> EQ1 -> EQ2 -> SVF(HP) -> EQ3 -> Out. If SHAPE has restrictions related to routing or functions, we could optimize the DSP functions further.
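To make that kind of routing concrete, here is a minimal sketch of a serial effect chain expressed as an ordered list of per-frame processing stages. The stage names mirror the chain above, but the actual filter/compressor/SRC maths is stubbed out; this only illustrates the routing shape, not any real DSP implementation:

```c
/* Minimal serial-routing sketch: one 128-sample frame passed through an
 * ordered list of processing stages.  The DSP maths is deliberately
 * stubbed out -- only the routing structure of the chain is shown. */
#include <stddef.h>

#define FRAME 128

typedef void (*stage_fn)(float *buf, size_t n);

/* Stub stages; real versions would be state-variable filters (BP/LP/HP),
 * a compressor, a sample-rate converter and parametric EQs. */
static void svf_bp(float *b, size_t n) { (void)b; (void)n; }
static void svf_lp(float *b, size_t n) { (void)b; (void)n; }
static void svf_hp(float *b, size_t n) { (void)b; (void)n; }
static void cmp   (float *b, size_t n) { (void)b; (void)n; }
static void src   (float *b, size_t n) { (void)b; (void)n; }
static void eq    (float *b, size_t n) { (void)b; (void)n; }

int main(void)
{
    /* In -> SVF(BP) -> SVF(LP) -> CMP -> SVF(BP) -> SRC -> EQ1 -> EQ2
     *    -> SVF(HP) -> EQ3 -> Out */
    static const stage_fn chain[] = {
        svf_bp, svf_lp, cmp, svf_bp, src, eq, eq, svf_hp, eq
    };
    float frame[FRAME] = { 0.0f };

    for (size_t i = 0; i < sizeof chain / sizeof chain[0]; ++i)
        chain[i](frame, FRAME);

    return 0;
}
```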

I would have preferred that this part of the thread be moved to 'Console Technology' for a technical discussion. I'm not interested in the XB1 vs PS4 discussion, only in the audio technology side and the decisions behind the choices.

How do you define a frame? A sample?

We don't have much information other than what has been posted by Vgleaks. I would imagine that the different blocks inside SHAPE are tied up (resource sharing) in a specific combination, like: XMA Decoder -> SRC -> EQ/CMP. Do any of the modules share resources, like EQ/CMP + FLT/VOL and XMA + SRC?

The audio path is 24-bit; what about the actual processing (coefficients etc.)?



I think this is the primary reason for the dark age in game audio.
The processing is bit-equivalent to what you would get using 32-bit floats. All of the parameters are 32-bit. No shared resources between the blocks. Routing is handled by the mix buffers, and it can be an arbitrarily complex graph, including DMA out to / in from main memory in the middle of processing, so CPU effects can be blended in, although that can cause a frame of latency. A frame is 128 samples, or 2.67 ms of sound at 48 kHz.

Multiple blocks can write into a mix buffer, and multiple blocks can read out of a mix buffer. The writers don't have to worry about mixing or whether another block has already written; the mix bin does all the mixing automatically. Both the SRC and the DMA block will take stereo interleaved data and output it into two mix bins (SRC and DMA are the only ways to start a graph). They can take mono 16-bit, 24-bit and 32-bit float, and stereo 16-bit interleaved, and do the conversions for free as they move the data into the graph.
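Here is a minimal toy model of what "the mix bin does all the mixing automatically" means, as I read the description above: 128-sample frames, any number of writers accumulating into a shared bin, and readers consuming the mixed result. The struct and function names are my own for illustration, not the actual XB1 API:

```c
/* Toy model of a mix bin: several writers each contribute a 128-sample
 * frame, the bin accumulates (mixes) them, and readers consume the mixed
 * frame.  Struct and function names are illustrative only. */
#include <stdio.h>
#include <string.h>

#define FRAME 128

typedef struct {
    float acc[FRAME];   /* running mix for the current frame */
} mix_bin;

static void bin_clear(mix_bin *b) { memset(b->acc, 0, sizeof b->acc); }

/* Writers never check whether another block has already written;
 * they simply accumulate into the bin. */
static void bin_write(mix_bin *b, const float *frame)
{
    for (int i = 0; i < FRAME; ++i)
        b->acc[i] += frame[i];
}

int main(void)
{
    mix_bin bin;
    float voice_a[FRAME], voice_b[FRAME];

    for (int i = 0; i < FRAME; ++i) { voice_a[i] = 0.25f; voice_b[i] = 0.5f; }

    bin_clear(&bin);
    bin_write(&bin, voice_a);   /* e.g. output of an SRC block    */
    bin_write(&bin, voice_b);   /* e.g. output of an EQ/CMP block */

    printf("mixed sample 0 = %.2f\n", bin.acc[0]);   /* 0.75 */
    return 0;
}
```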

Now you're asking things I know about. :) I wrote a ton of code setting up and processing audio graphs on the chip.
 
We should rename this thread to "Xbox One and PS4 audio technology investigation". I've been very happy to see a minimum of "versus" chatter in this thread; it's mostly been investigations and discussions of how the two will achieve similar results, which has been quite refreshing. :)
 
Didn't bkilian address that in the other thread? He said SHAPE works out to roughly 100 GFLOPS if done on a non-fixed-function processor like a CPU. He said it would need more than one Jaguar core, which due to IPC is almost equal to the XCPU's "100 GFLOPS". How many GFLOPS can one CU process?
I think you misunderstood what I said. What I was really saying was that due to in-order execution, pipeline stalls, and other design elements of the XCPU, its supposed 100 GFLOPS is really only about 20 GFLOPS when you profile real code. The Hot Chips presentation pegged the Shape block at 18 G-ops (they can't use FLOPS, it's an integer pipeline :)). Creative would have called it an 18000 MIPS chip, compared to their X-Fi's 10000 MIPS. When you include all the housekeeping the ACP does, it adds up quite quickly.
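Just to put that 18 G-ops figure into per-sample terms, a quick hedged calculation. The 48 kHz rate and the 512-channel count come from other posts in this thread, and treating the budget as evenly spread over channels is obviously a simplification:

```c
/* Back-of-the-envelope: what 18 G ops/sec buys per sample.
 * The 48 kHz rate and 512-channel figure are taken from other posts in
 * this thread; an even split across channels is a simplification. */
#include <stdio.h>

int main(void)
{
    double ops_per_sec = 18e9;
    double sample_rate = 48000.0;
    double channels    = 512.0;

    double ops_per_sample_period = ops_per_sec / sample_rate;        /* ~375,000 */
    double ops_per_channel       = ops_per_sample_period / channels; /* ~730     */

    printf("ops available per 48 kHz sample period: %.0f\n", ops_per_sample_period);
    printf("ops per channel per sample            : %.0f\n", ops_per_channel);
    return 0;
}
```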

When we verified the functions of the chip, we compared the output of our 32-bit float reference blocks to the output of the chip using the same input. The outputs are exactly bit-equivalent. That's not really a surprise, since the blocks were designed from our reference pipeline.
 
Routing is handled by the mix buffers, and it can be an arbitrarily complex graph, including DMA out to / in from main memory in the middle of processing, so CPU effects can be blended in, although that can cause a frame of latency.

So then it's fair to say that SHAPE isn't going to arbitrarily limit devs from implementing 3D audio? That's really the root of my question; it wasn't clear to me whether using SHAPE precluded them from using software processing.

So ideally, MS will free up more of SHAPE, and devs can run HRTFs/reverb directly on the DSP. But in the worst-case scenario, they need to copy out to run it on the APU at some point in the chain, yet they won't lose all the benefits of SHAPE in the process?

The leaked docs made it sound like an all-software or all-hardware kind of situation... if that's not the case, then I'm no longer disappointed. :p
 
So SHAPE = 2x EMU20K2 DSPs?
If you just go by operations, yes. But the EMU20K2 is a lot more flexible in some ways, and limited in others. It only supports 128 input channels, for instance, compared to the 512 in Shape. It has a programmable DSP, which Shape does not (although the whole audio block does). You can't just take everything the X-fi can do and multiply it by two for Shape. This is part of why Relab, correctly, is wary of FLOPS and MIPS and GOPS. I could design a chip with over a hundred GFLOPS, but if all it does is mixing, it's not going to be that useful.
 
I remember that time very fondly. I worked for Creative during the SB16-SBAWE64 time-frame. A great time to be in the business & a gamer. ;)

Tommy McClain

Well I'm glad my Christmas present that year helped keep you employed!!!! :LOL:
 
A frame is 128 samples, or 2.67 ms of sound at 48 kHz.

I apologize, but I'm a bit confused regarding the use of 'frame' in this context:

The mix buffers will mix >4000 channels per frame, at 28 bit. The rest of the pipeline is 24 bit.

How many add instructions are performed per second?

#1: 48000 * 4000 = 192*10^6 add instructions
This operates on each sample in the 128-sample buffer; the reason for the buffer is to make use of a SIMD setup.

#2: 48000 / 128 * 4000 = 1.5*10^6 add instructions
This performs 4000 add instructions per buffer.

#3: Something completely different?
 
I apologize, but I'm a bit confused regarding the use of 'frame' in this context:



How many add instructions are performed per second?

#1: 48000 * 4000 = 192*10^6 add instructions
This operates on each sample in the 128-sample buffer; the reason for the buffer is to make use of a SIMD setup.

#2: 48000 / 128 * 4000 = 1.5*10^6 add instructions
This performs 4000 add instructions per buffer.

#3: Something completely different?
Aah, yes, sorry. I'm using "frame" as both a unit of data (128 samples) and a unit of time (2.67 ms), which is a bit imprecise of me. Basically, the chip is kinda quantized at the frame (128-sample) level: all operations input or output 128 samples at a time. In a single "frame" (2.67 ms), you can mix over 4000 channels using the 128 physical mix bins. In practice it might be lower than this, depending on the complexity of your audio graph, since the mixing happens automatically between the other units, but it's the upper bound. This can be repeated every frame (2.67 ms), so at the end of a second, if you have a convoluted enough audio graph, you have mixed 4000 voices of 48000 samples each in that second. So #1 in your example above.

They also come with free clipping detection, which is not all that useful in a final playback system, but hugely useful when you're authoring and debugging sound issues during development.
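For the record, here is the arithmetic behind option #1 spelled out, using only the figures given above (128-sample frames at 48 kHz, up to 4000 channel-mixes per frame); it's just a sanity check, not a statement about how the hardware actually schedules the work:

```c
/* Mixing throughput implied by "over 4000 channels per frame" with
 * 128-sample frames at 48 kHz -- i.e. option #1 above. */
#include <stdio.h>

int main(void)
{
    double sample_rate   = 48000.0;
    double frame_samples = 128.0;
    double channels      = 4000.0;

    double frames_per_sec = sample_rate / frame_samples;        /* 375         */
    double adds_per_frame = channels * frame_samples;           /* 512,000     */
    double adds_per_sec   = adds_per_frame * frames_per_sec;    /* 192,000,000 */

    printf("frames/sec: %.0f\n", frames_per_sec);
    printf("adds/frame: %.0f\n", adds_per_frame);
    printf("adds/sec  : %.0f  (= 48000 * 4000)\n", adds_per_sec);
    return 0;
}
```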
 