NextGen Audio: Tempest Engine, Project Acoustics, Windows Sonic, Dolby Atmos, DTS:X

Rtings does great headphone reviews. They have a number of different soundstage measurements, but you'll find that good soundstage combined with good audio quality is very expensive.

It's still cheaper than getting an equally good 5.1 or 7.x setup. To each their own; there is no free lunch. The best we can hope for is that the console provides a great audio experience, and the rest is up to individual users and whether they appreciate audio and get good equipment or not.
 

yah. I’m just pointing out that it still won’t sound real.
 

My argument is not about "real"; that would just lead to semantics and bickering. There is something between the current level of console audio and absolutely real audio. Things like having more power to process audio and simulate the environment, and also trying to do per-user HRTFs, can potentially move the needle in the right direction, possibly even quite a lot. Considering Sony is investing in VR, those HRTF functions and headphones make a lot of sense, as it's not just the traditional living-room console experience from giant speakers.
 
We also have a good idea of how well an AMD CU can perform convolution reverb.

https://steamcommunity.com/games/596420/announcements/detail/1647624403070736393
Hmm
Figure: Reserving Compute Units (CUs) with TrueAudio Next.

What TrueAudio Next Is NOT
TrueAudio Next currently does not perform ray tracing for sound. Ray tracing is one way of simulating acoustic phenomena and calculating an IR; TAN focuses on accelerating the task of filtering audio data using IR after the IR has been calculated (or manually specified).
well then
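For anyone unfamiliar with the terminology, the "filtering audio data using an IR" step that TAN accelerates is just a (long) convolution. Here is a minimal CPU sketch with numpy, using a made-up noise burst as the dry signal and a toy decaying IR; the names and sizes are illustrative only.

```python
# Minimal sketch of the filtering step TrueAudio Next accelerates:
# convolving a dry signal with an impulse response (IR). The IR itself
# comes from measurement, simulation, or hand authoring; here it is a
# made-up decaying noise burst purely for illustration.
import numpy as np

def convolve_ir(dry: np.ndarray, ir: np.ndarray) -> np.ndarray:
    """Frequency-domain convolution of a mono signal with a mono IR."""
    n = len(dry) + len(ir) - 1                 # full convolution length
    nfft = 1 << (n - 1).bit_length()           # next power of two for the FFT
    wet = np.fft.irfft(np.fft.rfft(dry, nfft) * np.fft.rfft(ir, nfft), nfft)
    return wet[:n]

if __name__ == "__main__":
    sr = 48_000
    dry = np.random.randn(sr)                            # 1 s of noise standing in for audio
    t = np.arange(sr // 2) / sr
    ir = np.random.randn(sr // 2) * np.exp(-6.0 * t)     # toy half-second "room" decay
    print(convolve_ir(dry, ir).shape)                    # (71999,)
```

Real-time engines do this in fixed-size blocks (partitioned convolution) rather than one big FFT, which is exactly the workload TAN maps onto reserved CUs.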
 

yah, I agree. Just looking at this thread expectations are ... high.
 

Yes. The ray tracing is done on the rest of the GPU or CPU for things like occlusion or reflection. It can figure out which materials were hit, etc., and then determine which static IR should be used, or calculate a time-varying IR to be used for convolution.
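As a rough illustration of that hand-off (not any engine's actual code), here is a hypothetical sketch where ray-hit tallies produced elsewhere are turned into a single blended IR that the convolution stage can consume. The material names, the hit-count weighting, and the IR shapes are all assumptions.

```python
# Hypothetical glue between the ray-tracing side and the convolution side:
# tally which materials the audio rays hit, weight each material's
# precomputed IR by its hit fraction, and hand the blended IR to the
# convolution stage. Material names, weighting, and IR shapes are all
# assumptions, not any engine's actual scheme.
import numpy as np

def blend_ir_from_hits(hit_counts: dict[str, int],
                       material_irs: dict[str, np.ndarray]) -> np.ndarray:
    """Weight each material's IR by the fraction of rays that hit it."""
    total = sum(hit_counts.values())
    blended = np.zeros(max(len(ir) for ir in material_irs.values()))
    for material, count in hit_counts.items():
        ir = material_irs[material]
        blended[:len(ir)] += (count / total) * ir
    return blended

if __name__ == "__main__":
    sr = 48_000
    t = np.arange(sr) / sr
    material_irs = {
        "concrete": np.random.randn(sr) * np.exp(-3.0 * t),                  # long, lively decay
        "carpet":   np.random.randn(sr // 4) * np.exp(-20.0 * t[:sr // 4]),  # short, damped
    }
    ir = blend_ir_from_hits({"concrete": 700, "carpet": 300}, material_irs)
    print(ir.shape)
```

A time-varying IR would just mean redoing this blend as the hit tallies change and crossfading between the old and new wet signals.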
 
Just saw a post on ResetEra where the poster thinks there might be hints of the new audio in the Godfall release trailer, and it does sound better than current-gen games. Here's the trailer.

 
I dunno about expectations. More theoretical differences between what 32 channels gets you versus 512. I wouldn't be that surprised if the actual sample count and environmental audio is way simpler than it could be because of those production issues. In a 64 player shooter, you'll have at most two feet (jumping) per nearby player plus their weapon plus maybe some 'gear' sounds as they run. Footfall + weapon + gear. Ten at most locally in a firefight, that's 30 sounds. A few more for damage and whatnot. I think once you get past 100, you're hitting diminishing returns.

I wonder if there are more 'channels' per sound based on rendering. Could some sounds use multiple audio streams for higher-quality reverb?
 

I'm thinking mainly virtual reality in this scenario. Virtual reality is where headphones, turning your head, and accurate audio matter most. So to this extent, imagine a forest. Given enough audio channels and some simulation (ray tracing) of the environment, there could be some crickets, wind rustling leaves on trees, birds singing, etc. All of these could have exact positions, and the ambient sound could be generated very nicely. Then if the player turns their head and looks around, the ambient sound changes appropriately. It would be possible to focus on a specific bird singing and indeed find that it's in a specific tree. And this would just be a convincing ambient sound reacting to the player. Layer on top of that the hero sounds, like a bear attacking the player, a group of NPCs, or whatever else there might be in the game environment. It could even be a gameplay mechanic to have to rely more on audio instead of visual cues, hunting a specific bird in a Witcher-like game or some such.

Does the above make sense? Who the hell knows. But I know I would find that amusing in a VR experience. A regular 2D game on a TV could probably work equally well with something much simpler.

Similarly, something like Gran Turismo could push audio a bit further. 16-32 cars on track. Simulate tire and engine sounds more properly for every car. Make the player's car and a couple of other close cars hero sounds. Maybe use ray tracing to propagate sound properly: the pit lane or other cars block part of the sound, etc. Put this into VR and there could be a real sense of awesomeness. Maybe I could actually look left and still know from the sound that there is a specific car on my right, because I recognize the specific sound that car makes, and the position of the car is fairly clear from the audio (is it getting closer, farther, is it next to me, behind me?).

I know I'm hyped and I'm bound to get disappointed. That said, if Sony really pushes audio, that is a great thing.
 
@Shifty Geezer Sounds like they're using an ambisonics format for most of the audio rendering, which would then be sampled with two virtual microphones to produce binaural output for headphones. They're using a different 3D format for important audio cues. They also have more channels than they did on PS4, which suggests they've gone from 1st-order B-format to 2nd order, with 9 channels instead of 4.
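To make the "two virtual microphones" idea concrete, here's a toy sketch: encode a mono source into first-order B-format and sample the field with two virtual cardioid mics pointing left and right for a crude stereo render. The FuMa convention, the cardioid pattern, and the test signal are assumptions, not Sony's pipeline; proper binaural output would convolve the decoded field with HRTFs.

```python
# Toy illustration of "ambisonics sampled with two virtual microphones":
# encode a mono source into first-order B-format (FuMa convention assumed),
# then point two virtual cardioid mics left and right. Real binaural output
# would convolve the decoded field with HRTFs instead of plain cardioids.
import numpy as np

def encode_foa(mono: np.ndarray, azimuth_deg: float, elevation_deg: float = 0.0) -> np.ndarray:
    """Encode a mono signal into first-order B-format channels W, X, Y, Z."""
    az, el = np.radians(azimuth_deg), np.radians(elevation_deg)
    return np.stack([mono / np.sqrt(2.0),
                     mono * np.cos(az) * np.cos(el),
                     mono * np.sin(az) * np.cos(el),
                     mono * np.sin(el)])

def virtual_mic(bformat: np.ndarray, point_deg: float, pattern: float = 0.5) -> np.ndarray:
    """Sample the field with a first-order virtual mic (pattern 0.5 = cardioid)."""
    w, x, y, _ = bformat
    th = np.radians(point_deg)
    return pattern * np.sqrt(2.0) * w + (1.0 - pattern) * (x * np.cos(th) + y * np.sin(th))

if __name__ == "__main__":
    sr = 48_000
    t = np.arange(sr) / sr
    bird = np.sin(2 * np.pi * 2000 * t) * np.exp(-4.0 * t)   # mono source, front-left
    field = encode_foa(bird, azimuth_deg=60.0)                # +azimuth is to the left
    left, right = virtual_mic(field, 90.0), virtual_mic(field, -90.0)
    print(np.abs(left).max(), np.abs(right).max())            # noticeably louder on the left
```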

Going back to the Steam Audio link:

Ambisonics
Source-centric convolution reverb can be directional in nature: somebody else's distant footsteps heard through an open doorway sound more directional than one's own footsteps in a large room, for example. Steam® Audio uses Ambisonics to represent the directional variation of indirect sound. The higher the Ambisonics order, the more spatial detail that can be rendered in the indirect sound. On the other hand, the higher the Ambisonics order, the more IRs that are needed in the convolution reverb effect. For example, 2nd order Ambisonics requires 9 convolutions per source. This too, increases the computational cost of indirect sound.

So it looks like you have to calculate a convolution per channel in ambisonics: 4 convolutions per audio source in 1st order, and 9 convolutions per audio source in 2nd order.

It would be really interesting to see some specific benchmarks with good data for Steam Audio using TrueAudio Next, to see how many audio sources Tempest could really handle. We would know the difference in clock speed and could make assumptions about improvements in utilization and architecture.
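In the meantime, the channel/convolution arithmetic is easy to write down: order-N ambisonics has (N + 1)^2 channels, so the back-of-the-envelope sketch below just multiplies that by an assumed source count and a naive per-sample IR cost. The source count and IR length are assumptions, and the result is an upper bound for scale, not a Tempest or TAN benchmark.

```python
# Back-of-the-envelope numbers for the discussion above: order-N ambisonics
# has (N + 1)^2 channels, and source-centric convolution reverb implies one
# convolution per channel per source. The cost figure assumes naive
# time-domain convolution (one MAC per IR tap per output sample); FFT-based
# partitioned convolution is far cheaper, so treat this as scale only.
SAMPLE_RATE = 48_000
IR_SECONDS = 1.0          # assumed reverb tail length
SOURCES = 32              # assumed simultaneous sources

for order in (1, 2, 3):
    channels = (order + 1) ** 2                 # 4, 9, 16
    convolutions = channels * SOURCES
    macs_per_s = convolutions * (IR_SECONDS * SAMPLE_RATE) * SAMPLE_RATE
    print(f"order {order}: {channels} channels/source, {convolutions} convolutions, "
          f"~{macs_per_s / 1e12:.1f} TMAC/s naive")
```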
 
I'm thinking mainly virtual reality in this scenario. Virtual reality is where headphones, turning your head, and accurate audio matter most. So to this extent, imagine a forest. Given enough audio channels and some simulation (ray tracing) of the environment, there could be some crickets, wind rustling leaves on trees, birds singing, etc. All of these could have exact positions, and the ambient sound could be generated very nicely.
Yep. But also, I doubt human audio positioning is that accurate, save maybe for the blind, such that fairly large sources could be used as long as the player can't get too close. A single tree sound would be good for a distance of so many metres. As you get closer, you'd need individual branches. There's probably an angular accuracy to placing an object, something like 5 degrees of rotation, maybe even far higher, like 15 degrees. I just tried it now listening to my speakers, and I can't sense a notable change in position when turning my head somewhat. The type of sound probably makes a significant difference. Thus any objects within a certain angular proximity and spatial locality could be rolled into one audio channel for spatial processing.
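That "roll nearby objects into one channel" idea could be as simple as a greedy grouping pass before spatialisation. A sketch, with the 10-degree threshold and the 1.5x distance band pulled out of thin air:

```python
# Greedy grouping of sources that sit within an angular threshold and a
# similar distance band, so they can share one spatialised channel. The
# 10-degree threshold and 1.5x distance ratio are made-up numbers.
def cluster_sources(sources, angle_deg=10.0, distance_ratio=1.5):
    """sources: list of (azimuth_deg, distance_m). Returns groups sharing a channel."""
    clusters = []
    for az, dist in sorted(sources):
        for cluster in clusters:
            ref_az, ref_dist = cluster[0]
            if (abs(az - ref_az) <= angle_deg and
                    max(dist, ref_dist) / min(dist, ref_dist) <= distance_ratio):
                cluster.append((az, dist))
                break
        else:
            clusters.append([(az, dist)])
    return clusters

if __name__ == "__main__":
    # six birds scattered through the trees ahead of the listener
    birds = [(-24.0, 12.0), (-20.0, 14.0), (-3.0, 8.0), (2.0, 9.0), (5.0, 30.0), (41.0, 6.0)]
    for i, group in enumerate(cluster_sources(birds)):
        print(f"channel {i}: {group}")
```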
 
Yeah, it's interesting to think about this, especially in relation to spatial audio, spatial location, and spatial recognition.

In real life...
  • If there is one sound source it's pretty easy to identify where it is and in good detail. For example, a person talking to you in a room.
  • If there are two sound sources, you can still easily locate them, but now your brain will filter out one or the other to an extent if you focus on one or the other.
  • If there are multiple sound sources in a room, say 30 people at a gathering, it becomes almost impossible to reliably locate any one sound source (person speaking).
    • Effectively at this point, additional sound sources just become "noise."
    • This goes for anything. Vehicles, gunshots, animals, footsteps, etc.
Think of the rain example. It's cool to model each raindrop. But you won't be able to spatially locate any of the raindrops. Is it a good use of resources then to model each raindrop? It's still cool, but no, it's probably not a good use of resources.

Is there a better way to simulate how enveloping the sound of rain is such that it's omnipresent around you but you can't spatially locate any of it? I don't know.

At what point does having multiple distinct audio sources go from something identifiable to something that is simply processed out as "noise" by a person's brain? After all, this was a natural evolution in order for the brain to be able to locate sounds that it identifies as important, meanwhile shoving anything not important into background noise.

Thinking about it, 32 sound sources is reasonable, but it'd be nice to have more. OTOH, over a thousand sound sources doesn't really accomplish anything that likely couldn't be done just as well with far fewer sound sources combined with audio shaping.

Being at a gathering with ~30 people, it's already just a sea of "noise", with the only identifiable sounds (voices) being the people in my immediate proximity that I'm looking at.

Likewise, out at the ranch. One cow or horse running around is easily identified and located. 10 horses running around? Very difficult to identify the sound from one horse's set of hooves. 20 horses? It's just a rolling bit of noise, almost like thunder.

But those are all examples with sounds that are very similar. What about sounds that are different? How many different sounds can you have and still have them be distinct, identifiable, and locatable? No idea.

With a music band, you can identify and likely locate each player and instrument (assuming it's not being amplified through speakers).

With a full symphony orchestra? I can identify groups of instruments and the general area of a group of instruments but there's no hope of identifying a single instrument or musician unless they are doing something different.

OK, so an instrument/musician playing differently in the orchestra is still identifiable and locatable? So what if we had everyone in the orchestra playing a bit differently than everyone else? We're back to not being able to locate any specific sound source or the whole thing coming across as noise because there are too many differing sources of sound that are all distinctly unique from each other.

So, today I sat near a major thoroughfare (road) in my city. If there are a lot of cars passing by (10s), the sound of each vehicle being its own sound source would sound effectively identical to just having one sound source representing the volume of vehicles passing by.

There are exceptions, of course. A dump truck going by was identifiable through the sea of vehicle noise. A car honk stood out. So in this case, modeling with a generic traffic sound source plus individual sources for the standout sounds would work just as well as using an individual source for each object.

All of this just to say that more than 32 would be good, but 1000 is more than needed, IMO.

I'm certainly interested in hearing what a game sounds like if a developer attempts to implement 100's of simultaneous sources of sound. I don't think it would noticeably stand out versus something using say 50-100 simultaneous sources of sound.

What I would prefer WRT to hardware accelerated sound in the next generation of consoles isn't MORE sounds, but better audio modeling and processing. That includes spatial location, occlusion, reflection, doppler, reverberation, material modeling, etc.

Regards,
SB
 
There are two similar but different cases you highlight. With rain, it's omnipresent, all around you, but it doesn't change as you move through it. Thus an ambisonic recording would suffice to provide a sense of rain all around. With the crowd, you have the same omnipresent background sound, but its quality changes as you move through it, with voices and conversation becoming more pronounced as you near someone, and then changing as you approach someone else. Here's a good example:


You can hear all the voices of a crowd, and spot effects when close to someone, but the crowd noise never resolves into conversations among the people as you approach them. If every conversation was modelled, the crowd scene wouldn't be a generic crowd noise but the collective sounds of all the conversations, each becoming clearer as you approached and then becoming absorbed into the general hubbub as you moved away.

This is, of course, impossible to implement. A crowd of 1000 people would need 1000 conversations to be recorded. Devs would have to develop new audio LOD that can resolve. I wonder if, as geometry is increased in complexity, you could do similar with clever audio tracks, using degrees of clarity? Have some 'conversations' that aren't real words, but approximations of vowels and consonant sounds with varying degrees of definition that can be blended into true speech when closing?
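One way to picture that audio LOD: per conversation, crossfade between the shared hubbub bed, a mumbled approximation, and the real recorded line as the player closes in. The distance bands and linear ramps below are invented purely for illustration.

```python
# Per-NPC "conversation LOD": crossfade between the shared hubbub bed, a
# mumbled approximation, and the real recorded line as the player closes in.
# The distance bands and linear ramps are invented for illustration.
def conversation_lod_gains(distance_m, full_speech_at=2.0, mumble_at=8.0, hubbub_only_at=20.0):
    """Return (hubbub, mumble, speech) gains in [0, 1] that always sum to 1."""
    def ramp(x, near, far):            # 1 at/inside `near`, falling to 0 at `far`
        return min(1.0, max(0.0, (far - x) / (far - near)))

    speech = ramp(distance_m, full_speech_at, mumble_at)
    mumble = ramp(distance_m, mumble_at, hubbub_only_at) * (1.0 - speech)
    return 1.0 - speech - mumble, mumble, speech

if __name__ == "__main__":
    for d in (1.0, 5.0, 12.0, 30.0):
        h, m, s = conversation_lod_gains(d)
        print(f"{d:5.1f} m  hubbub={h:.2f}  mumble={m:.2f}  speech={s:.2f}")
```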
 

Yes, absolutely. But in the case of moving through the crowd, again, you don't need to have a sound source for each person in the crowd. You just need a general sound source for the crowd (enveloping the player) and then individual sources for the noticeably different sounds (the voices you can resolve that are in close proximity to you). And even then, in the real world, unless you are focusing on that person, your brain is likely to just push it into the general background sea of noise anyway.

That would reduce the need to assign a distinct source to each person within say 5 feet of you, to just those within a cone of vision representing where the player is "focused" and then perhaps a slightly different blended sound representing the raised perceptible audio level of those in close proximity. Less accurate, but would likely have the same or similar perceived impact on the person playing.
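A sketch of that prioritisation: score each voice by proximity and by how close it sits to the player's focus direction, give the top few their own spatialised source, and fold the rest into the crowd bed. The weights, thresholds, and the budget of dedicated voices are arbitrary assumptions.

```python
# Score every crowd voice by proximity and by how close it sits to the
# player's focus direction; the top few get dedicated spatialised sources
# and the rest are folded into the shared crowd bed. Weights, thresholds
# and the voice budget are arbitrary assumptions.
def pick_focused_voices(voices, view_azimuth_deg, budget=8,
                        focus_half_angle_deg=30.0, max_distance_m=10.0):
    """voices: list of (name, azimuth_deg, distance_m). Returns (dedicated, bed)."""
    scored = []
    for name, az, dist in voices:
        proximity = max(0.0, 1.0 - dist / max_distance_m)
        off_axis = abs((az - view_azimuth_deg + 180.0) % 360.0 - 180.0)
        focus = max(0.0, 1.0 - off_axis / focus_half_angle_deg)
        scored.append((proximity + focus, name))
    scored.sort(reverse=True)
    dedicated = [name for score, name in scored[:budget] if score > 0.0]
    return dedicated, [name for _, name in scored if name not in dedicated]

if __name__ == "__main__":
    crowd = [("vendor", -5.0, 1.5), ("npc_a", 20.0, 3.0), ("npc_b", 170.0, 2.0),
             ("npc_c", -60.0, 9.0), ("npc_d", 10.0, 25.0)]
    dedicated, bed = pick_focused_voices(crowd, view_azimuth_deg=0.0, budget=3)
    print("dedicated voices:", dedicated)
    print("folded into crowd bed:", bed)
```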

In some ways, the 1000 sound sources thing kind of reminds me of how Sony were keen to point out how many polygons the PS2 could process. Really neat, but not necessary for various reasons, including other aspects of the PS2 being a bottleneck.

So, I'm personally more interested in the audio processing that the modified CU will be able to do.

Regards,
SB
 
This is, of course, impossible to implement. A crowd of 1000 people would need 1000 conversations to be recorded. Devs would have to develop new audio LOD that can resolve. I wonder if, as geometry is increased in complexity, you could do similar with clever audio tracks, using degrees of clarity? Have some 'conversations' that aren't real words, but approximations of vowels and consonant sounds with varying degrees of definition that can be blended into true speech when closing?

This would be an interesting tech demo to tackle. It might be possible in a demo context today: provide audio samples from many people and then train a neural network to produce high-quality speech with different voices. Then use either conventional text-to-speech or neural text-to-speech for the background discussions. I believe this would work, but it would be rather heavy to compute. Maybe still use real recordings for hero discussions.

Neural networks can generate pretty convincing speech and even visuals, as seen in the various deepfakes floating around. Probably not mature enough for games yet, but within the next 5-10 years it could be completely feasible. Another example is The Irishman, but that of course is next-level quality and would apply more to things like expensive, offline-generated hero sounds. Perhaps in the future the role of the human actor in production will be different from today. We live in very interesting times.
 
Yes, absolutely. But in the case of moving through the crowd, again, you don't need to have a sound source for each person in the crowd. You just need a general sound source for the crowd (enveloping the player) and then individual sources for the noticeably different sounds (the voices you can resolve that are in close proximity to you). And even then, in the real world, unless you are focusing on that person, your brain is likely to just push it into the general background sea of noise anyway.
That's what Uncharted already has, but it doesn't work completely. A general ambience sound will sit at distance infinity, but within a crowd, even if you can't resolve what someone 5 metres from you is saying, the soundscape should still carry a sense of sound coming from that position.

That would reduce the need to assign a distinct source to each person within say 5 feet of you, to just those within a cone of vision representing where the player is "focused" and then perhaps a slightly different blended sound representing the raised perceptible audio level of those in close proximity.
I think LOD will need to be via proximity, not related to where one's looking. It's not just voices, but environmental sounds too. As you walk past a street vendor frying something, that frying sound needs to be resolved from 'somewhere in the market' to 'in front somewhere' to 'right next to me' to 'somewhere behind me' to 'somewhere in the market'. I don't think anyone knows how sensitive we are to audio cues, but if we think about lip-sync or contact shadows, I think audio will feel a little 'empty' and 'off' rather than truly immersive if objects that should be making sounds aren't making them properly. Or maybe not. There's a bit of discovery to come, I think.
 
What I would prefer WRT to hardware accelerated sound in the next generation of consoles isn't MORE sounds, but better audio modeling and processing. That includes spatial location, occlusion, reflection, doppler, reverberation, material modeling, etc.
This seems to be what Microsoft's Project Acoustics tech is focused on, by modelling the wave propagation physics around static scene geometry and baking the perceptual parameters generated by the simulation out to probe locations that can be sampled and interpolated at runtime (kinda like an audio equivalent to light-probe based GI solutions).
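Conceptually, the runtime side of that probe approach can be pictured as nothing more than interpolating a few baked numbers per probe. The parameter names and the inverse-distance blend below are assumptions for illustration, not Project Acoustics' actual data model.

```python
# Sketch of the probe idea: the offline wave simulation bakes a handful of
# perceptual parameters at fixed probe positions, and the runtime blends the
# nearest probes for the current position. Parameter names and the
# inverse-distance blend are assumptions, not Project Acoustics' data model.
import math

BAKED_PROBES = [
    # (x, y, z), baked perceptual parameters at that probe
    ((0.0, 0.0, 0.0), {"obstruction_db": -1.0, "wetness": 0.20, "decay_s": 0.6}),
    ((8.0, 0.0, 0.0), {"obstruction_db": -9.0, "wetness": 0.55, "decay_s": 1.4}),
    ((0.0, 0.0, 8.0), {"obstruction_db": -3.0, "wetness": 0.35, "decay_s": 0.9}),
]

def sample_acoustics(pos, probes=BAKED_PROBES, k=3):
    """Inverse-distance-weighted blend of the k nearest probes' baked parameters."""
    nearest = sorted(probes, key=lambda p: math.dist(pos, p[0]))[:k]
    weights = [1.0 / (math.dist(pos, p[0]) + 1e-6) for p in nearest]
    total = sum(weights)
    return {key: sum(w * p[1][key] for w, p in zip(weights, nearest)) / total
            for key in nearest[0][1]}

if __name__ == "__main__":
    print(sample_acoustics((2.0, 0.0, 1.0)))   # closest to the "dry" probe at the origin
```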
 
I think LOD will need to be via proximity, not related to where one's looking. It's not just voices, but environmental sounds too. As you walk past a street vendor frying something, that frying sound needs to be resolved from 'somewhere in the market' to 'in front somewhere' to 'right next to me' to 'somewhere behind me' to 'somewhere in the market'. I don't think anyone knows how sensitive we are to audio cues, but if we think about lip-sync or contact shadows, I think audio will feel a little 'empty' and 'off' rather than truly immersive if objects that should be making sounds aren't making them properly. Or maybe not. There's a bit of discovery to come, I think.

Yes, for specific sound cues you would need a specific sound source. So in the case of a vendor frying food, you'd want that. But if you were walking down a street with vendors lining both sides, you'd likely be able to get a similar perception of sound by having a blended audio source for each side, potentially combined with a specific source per side to represent the frying sound closest to the player.

But yes, I hope for lots of experimentation among game sound engineers on the next generation of consoles. I really do hope that both consoles have very robust audio processing capabilities. I miss the days of audio experimentation on the PC, when hardware-accelerated 3D audio processing was being investigated.

Regards,
SB
 
You can hear all the voices of a crowd, and spot effects when close to someone, but the crowd noise never resolves into conversations among the people as you approach them. If every conversation was modelled, the crowd scene wouldn't be a generic crowd noise but the collective sounds of all the conversations, each becoming clearer as you approached and then becoming absorbed into the general hubbub as you moved away.

This is, of course, impossible to implement. A crowd of 1000 people would need 1000 conversations to be recorded. Devs would have to develop new audio LOD that can resolve. I wonder if, as geometry is increased in complexity, you could do similar with clever audio tracks, using degrees of clarity? Have some 'conversations' that aren't real words, but approximations of vowels and consonant sounds with varying degrees of definition that can be blended into true speech when closing?

Crowd? I'm expecting ND to make each individual chest hair on Drake be a sound source. Those guys know where to place their priorities.
 