AMD Kaveri APU features the Onion + bus like the PlayStation 4

liolio · Oct 10, 2013

Somebody posted the whole presentation in the PC forum:
http://share.csdn.net/uploads/5232b691522ba/5232b691522ba.pdf

Bitter as it was, it seems that the comment from some MSFT exec (I don't remember which) about Sony bragging about off-the-shelf hardware was not completely misguided.
To me it looks almost crystal clear at this point that the PS4 is built completely from off-the-shelf parts, and some of the talk about Sony's merits and additions to the design was a tad out of place.
Now, the off-the-shelf parts being pretty great, I don't state that as if it were a negative.

onQ · Oct 10, 2013

Rangers said:
Huh? Wouldn't that be quite the shocker? Why hasn't Sony spoken about this?

Seems like the newest incarnation of secret sauce.

Funny that you would say that.

"However, there's a fair amount of "secret sauce" in Orbis and we can disclose details on one of the more interesting additions. Paired up with the eight AMD cores, we find a bespoke GPU-like "Compute" module, designed to ease the burden on certain operations - physics calculations are a good example of traditional CPU work that are often hived off to GPU cores. We're assured that this is bespoke hardware that is not a part of the main graphics pipeline but we remain rather mystified by its standalone inclusion, bearing in mind Compute functions could be run off the main graphics cores and that devs could have the option to utilise that power for additional graphical grunt, if they so chose."

And also this.

"So, a couple of random things I've learned:

-It's not stock x86; there are eight very wide vector engines and some other changes. It's not going to be completely trivial to retarget to it, but it should shut up the morons who were hyperventilating at "OMG! 1.6 JIGGAHURTZ!".

-The memory structure is unified, but weird; it's not like the GPU can just grab arbitrary memory like some people were thinking (rather, it can, but it's slow). They're incorporating another type of shader that can basically read from a ring buffer (supplied in a streaming fashion by the CPU) and write to an output buffer. I don't have all the details, but it seems interesting.

-As near as I'm aware, there's no OpenGL or GLES support on it at all; it's a lower-level library at present. I expect (no proof) this will change because I expect that they'll be trying to make a play for indie games, much as I'm pretty sure Microsoft will be, and trying to get indie developers to go turbo-nerd on low-level GPU programming does not strike me as a winner."

Shifty Geezer said:
When your enemy is down, that's when you deal the killing blow. You don't give them chance to recover. Sony took the performance advantage line and ran with it. MS has dealt a couple of come-back blows. If Sony have something to retaliate with, it behoves them to use that. there's something to be gained from telling the world that your hardware has a capable DSP that'll add to the experience. There's nothing to be gained from withholding that info.

I guess the Image Signal Processor is simply the two-layer blend and scale. The audio processor is the audio decode+mix block. The DSP doesn't fit anything I know of, but it can't be anything substantial. If it is, that's very weird that none of the leaks have spoken of it and none of the Sony engineers have spoken of it.

dumbo11 said:
AFAIR the zlib block is supposed to be on the secondary chip/southbridge (which is a very interesting idea... transparent disk/blu-ray compression or compression of network traffic?).

AMD did indicate that they integrated some form of 'Sony IP' into the Liverpool APU, but the most logical assumption is that this is something related to audio/video record/playback.

Anyway, at this point "the proof of the pudding is in the eating". If either console had game-related 'secret sauce' then we should be seeing it in demos, and there is no obvious sign of anything major for either console.

Sony hasn't said anything about their CPU, and the rumor was that the vector co-processor was connected to the CPU.

Even Eurogamer saw it and had no idea what it was for.

Also, here are some more things to think about, @Shifty Geezer: remember when your dev friend told you that the older devkits were 10% slower than the new kits?

Maybe the older devkits didn't have the vector co-processor on them and they were using only the Jaguar cores with no vector units.

And say they are 200 GFLOPS: that would make the full chip 2.14 TFLOPS, and the older devkits at 1.94 TFLOPS would be about 10% slower.

Also, we had reports of devs saying that the PS4 was 50% faster than the Xbox One. If this was being said after the Xbox One clock boost, which put the Xbox One at around 1.419 TFLOPS, 50% faster would also be around 2.14 TFLOPS.
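
For anyone who wants to check the arithmetic in this post, here is a minimal sketch. All inputs are the figures quoted above; the 200 GFLOPS co-processor is speculation, not a confirmed spec, and the variable and function names are made up for illustration.

```python
# Figures exactly as quoted in the post; none of them are confirmed specs.
full_chip_tflops = 2.14       # speculated total, including the rumored vector unit
coprocessor_tflops = 0.20     # the guessed 200 GFLOPS co-processor
xbox_one_tflops = 1.419       # Xbox One figure quoted for after the clock boost


def percent_slower(slow, fast):
    """How far `slow` falls short of `fast`, as a percentage of `fast`."""
    return (fast - slow) / fast * 100.0


def percent_faster(fast, slow):
    """How far `fast` exceeds `slow`, as a percentage of `slow`."""
    return (fast - slow) / slow * 100.0


old_devkit_tflops = full_chip_tflops - coprocessor_tflops
old_kit_deficit = percent_slower(old_devkit_tflops, full_chip_tflops)
ps4_lead = percent_faster(full_chip_tflops, xbox_one_tflops)
target_for_50_percent = xbox_one_tflops * 1.5

print(f"old devkit: {old_devkit_tflops:.2f} TFLOPS, {old_kit_deficit:.1f}% slower")
print(f"2.14 TFLOPS vs Xbox One: {ps4_lead:.1f}% faster")
print(f"exactly 50% faster than Xbox One: {target_for_50_percent:.4f} TFLOPS")
```

With these inputs the old devkit comes out a little over 9% slower rather than exactly 10%, and the 50% target lands just under 2.13 TFLOPS, so the numbers only line up approximately. That shows the sums are internally consistent, not that the co-processor exists.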

liolio said:
Somebody posted the whole presentation in the PC forum:
http://share.csdn.net/uploads/5232b691522ba/5232b691522ba.pdf

Bitter as it was, it seems that the comment from some MSFT exec (I don't remember which) about Sony bragging about off-the-shelf hardware was not completely misguided.
To me it looks almost crystal clear at this point that the PS4 is built completely from off-the-shelf parts, and some of the talk about Sony's merits and additions to the design was a tad out of place.
Now, the off-the-shelf parts being pretty great, I don't state that as if it were a negative.

I posted that in the OP of this thread. And if the PS4 is made of off-the-shelf parts, how come I don't see the PS4 APU/SoC in stores?

patsu · Oct 10, 2013

As I recall, the rumor was that AMD didn't have the resources to move the Orbis project forward. Sony went ahead and customized their GPU to add the HSA features that were missing or unimplemented at that time.

Maybe they are two different implementations of the same or a similar block diagram? Cerny mentioned that they customized the GPU heavily.

Given that Sony usually wants its PlayStation to last 10 or so years, I am curious to see if they threw in some hacks to support fuller HSA features.

patsu · Oct 10, 2013

dumbo11 said:
AFAIR the zlib block is supposed to be on the secondary chip/southbridge (which is a very interesting idea... transparent disk/blu-ray compression or compression of network traffic?).

AMD did indicate that they integrated some form of 'Sony IP' into the Liverpool APU, but the most logical assumption is that this is something related to audio/video record/playback.

Anyway, at this point "the proof of the pudding is in the eating". If either console had game-related 'secret sauce' then we should be seeing it in demos, and there is no obvious sign of anything major for either console.

Yep, but we may need to wait 3-4 years for developers to tap the hard-to-use features.

zupallinere · Oct 10, 2013

onQ said:
I really think that's the PS4 SoC

Fixed Function Accelerator = Vector Co-processor

DSP = Speech Recognition Processor ?

Image signal processing = Head & Hand tracking processor for PS4 Eye

& the rest CPU , GPU & Video \ Audio encoding & decoding.

Maybe the DSP is for TrueAudio on Kaveri? I don't think this has anything to do with the PS4. The PS4 might use a subset of this stuff, but there isn't a reason to think it has all of it. Of course, Secret Sauce is good for both Goose and Gander.

onQ · Oct 10, 2013

zupallinere said:
Maybe the DSP is for TrueAudio on Kaveri? I don't think this has anything to do with the PS4. The PS4 might use a subset of this stuff, but there isn't a reason to think it has all of it. Of course, Secret Sauce is good for both Goose and Gander.

The SoC design shown in that PDF isn't Kaveri.

patsu · Oct 10, 2013

That Sony CTO diagram may be outdated. It doesn't talk about the secondary chip.

If the zlib block is attached to the secondary chip, then perhaps it will need some security logic too to deal with low power HDD saving/reading, and PSN communication.

The other thing is the audio stack. I heard developers don't have direct access to the audio hardware. They have to go through an API implemented by Sony. I have read somewhere that Vita uses the same approach.

MrFox · Oct 10, 2013

Maybe I'm crazy, but the slide with DSPs and Audio and whatnot, doesn't look like it's an existing SoC design. In the context of the presentation it's probably just an example of what kind of modules can take advantage of HSA, so they put everything they could think of.

onQ · Oct 10, 2013

patsu said:
That Sony CTO diagram may be outdated. It doesn't talk about the secondary chip.

If the zlib block is attached to the secondary chip, then perhaps it will need some security logic too to deal with low power HDD saving/reading, and PSN communication.

The other thing is the audio stack. I heard developers don't have direct access to the audio hardware. They have to go through an API implemented by Sony. I have read somewhere that Vita uses the same approach.

The diagram isn't from the Sony CTO; it's from the AMD PDF:

'GET READY FOR NEXT GENERATION APU
ARCHITECTURE
HSA AND GCN FROM OPENCL DEVELOPER’S PERSPECTIVE'

MrFox said:
Maybe I'm crazy, but the slide with DSPs and Audio and whatnot, doesn't look like it's an existing SoC design. In the context of the presentation it's probably just an example of what kind of modules can take advantage of HSA, so they put everything they could think of.

Yes, it's an example of the HSA architecture.

patsu · Oct 10, 2013

onQ said:
The diagram isn't from the Sony CTO; it's from the AMD PDF:

'GET READY FOR NEXT GENERATION APU
ARCHITECTURE
HSA AND GCN FROM OPENCL DEVELOPER’S PERSPECTIVE'

Yes, it's an example of the HSA architecture.

I meant this one:
http://beyond3d.com/showpost.php?p=1793379&postcount=111

Is it from Sony or AMD ?

3dilettante · Oct 10, 2013

pMax said:
Would you mind commenting more on this? Why are you referring to a 2MB window? I am missing something...

Both the interface and the CPU cores can blow through all the on-chip storage in very little time. The common case is that the data is going to be evicted by the CPUs, and the GPU cannot write to the CPU L2. If it did, the GPU would trash any CPU's cache in short order.
Each module has 2MB of cache, which is the last bit of storage on the chip where data can stay resident long enough for there to be a chance that the GPU might access it.

Once the data leaves the L2, there's possibly tens of cycles before it's off to main memory.
For a GPU that could stream millions of accesses in a second over that bus, the odds are not that good that a specific address is going to be hit before the write to memory.

onQ · Oct 10, 2013

patsu said:
I meant this one:
http://beyond3d.com/showpost.php?p=1793379&postcount=111

Is it from Sony or AMD ?

It's from the AMD PDF.

patsu · Oct 10, 2013

I see. No wonder it doesn't talk about the secondary chip.

3dilettante · Oct 10, 2013

The HSA marketing slide is so generic you can pick a subset of blocks and point excitedly at almost any existing SoC.
I suggest not leaping to conclusions so quickly, or so eagerly dredging up old forum posts that were iffy then and have not been borne out in the months since.

What's more specific about the HSA slide is the shared coherent memory and user mode queues, but that is something Sony has not indicated for the secondary processor or the decode logic and system monitoring it does.
Sony, other than in a now very old interview that kind of touches on the subject, has not promised every block in that generic slide.
It hasn't proven the negative, but nothing in the hardware disclosures or leaks has supported the affirmative.

zupallinere · Oct 10, 2013

onQ said:
It's from the AMD PDF.

Well, the "System Integration" part lines up with what Cerny discusses with context switching ("volatile bit" and such) and some form of hardware arbitration that could be used for graphics preemption, or at least some preemptive compute alongside whatever is in the graphics queue.

Thirdly, said Cerny, "The original AMD GCN architecture allowed for one source of graphics commands, and two sources of compute commands. For PS4, we’ve worked with AMD to increase the limit to 64 sources of compute commands -- the idea is if you have some asynchronous compute you want to perform, you put commands in one of these 64 queues, and then there are multiple levels of arbitration in the hardware to determine what runs, how it runs, and when it runs, alongside the graphics that's in the system."

It would be interesting to find out about those arbitration levels at some point.

3dilettante · Oct 10, 2013

The presentation made note a bit earlier that we are still at the Architectural stage.
Preemption is not the same as arbitrating kernel launch, and has more import than what Cerny promised.

For preemption, the system needs to be able to interrupt wavefronts, swap them out, do something else, then swap them back in.
The lack of specific mention of the implications of having that ability doesn't prove the negative, but in terms of GPU tech and compute it would be a massive thing to not want to brag about.

dumbo11 · Oct 10, 2013

3dilettante said:
The HSA marketing slide is so generic you can pick a subset of blocks and point excitedly at almost any existing SoC.

The only thing hinting at a PS4 influence is the faint box outline that groups the "CPU/GPU" separately from the miscellaneous processors on the right.

3dilettante · Oct 10, 2013

dumbo11 said:
The only thing hinting at a PS4 influence is the faint box outline that groups the "CPU/GPU" separately from the miscellaneous processors on the right.

Even that's not PS4-related.
AMD is giving its GPU more attention and optimization than some random custom IP block that hasn't had years to co-evolve with the CPU.
The same goes with the different way the TrueAudio block can work with memory on the GPUs that feature it.

There's an advantage to incumbency in this regard, and AMD can't readily promise the same level of integration since HSA needs to have a broad enough umbrella to include hardware that is even less compatible than the GPU is.

I'd imagine getting into that special zone is what AMD wants custom APU cash for.

taisui · Oct 10, 2013

wouldn't be the secret sauce (tm) if it's known publicly....

onQ · Oct 10, 2013

3dilettante said:
The presentation made note a bit earlier that we are still at the Architectural stage.
Preemption is not the same as arbitrating kernel launch, and has more import than what Cerny promised.

For preemption, the system needs to be able to interrupt wavefronts, swap them out, do something else, then swap them back in.
The lack of specific mention of the implications of having that ability doesn't prove the negative, but in terms of GPU tech and compute it would be a massive thing to not want to brag about.

Sounds a lot like what Cerny was saying.

http://www.gamasutra.com/view/feature/191007/inside_the_playstation_4_with_mark_.php?print=1

"The eight Jaguar cores, the GPU and a large number of other units are all on the same die,"

Familiar Architecture, Future-Proofed

So what does Cerny really think the console will gain from this design approach? Longevity.

Cerny is convinced that in the coming years, developers will want to use the GPU for more than pushing graphics -- and believes he has determined a flexible and powerful solution to giving that to them. "The vision is using the GPU for graphics and compute simultaneously," he said. "Our belief is that by the middle of the PlayStation 4 console lifetime, asynchronous compute is a very large and important part of games technology."

Cerny envisions "a dozen programs running simultaneously on that GPU" -- using it to "perform physics computations, to perform collision calculations, to do ray tracing for audio."

But that vision created a major challenge: "Once we have this vision of asynchronous compute in the middle of the console lifecycle, the question then becomes, 'How do we create hardware to support it?'"

One barrier to this in a traditional PC hardware environment, he said, is communication between the CPU, GPU, and RAM. The PS4 architecture is designed to address that problem.

"A typical PC GPU has two buses," said Cerny. "There’s a bus the GPU uses to access VRAM, and there is a second bus that goes over the PCI Express that the GPU uses to access system memory. But whichever bus is used, the internal caches of the GPU become a significant barrier to CPU/GPU communication -- any time the GPU wants to read information the CPU wrote, or the GPU wants to write information so that the CPU can see it, time-consuming flushes of the GPU internal caches are required."

Enabling the Vision: How Sony Modified the Hardware

The three "major modifications" Sony did to the architecture to support this vision are as follows, in Cerny's words:

"First, we added another bus to the GPU that allows it to read directly from system memory or write directly to system memory, bypassing its own L1 and L2 caches. As a result, if the data that's being passed back and forth between CPU and GPU is small, you don't have issues with synchronization between them anymore. And by small, I just mean small in next-gen terms. We can pass almost 20 gigabytes a second down that bus. That's not very small in today’s terms -- it’s larger than the PCIe on most PCs!

"Next, to support the case where you want to use the GPU L2 cache simultaneously for both graphics processing and asynchronous compute, we have added a bit in the tags of the cache lines, we call it the 'volatile' bit. You can then selectively mark all accesses by compute as 'volatile,' and when it's time for compute to read from system memory, it can invalidate, selectively, the lines it uses in the L2. When it comes time to write back the results, it can write back selectively the lines that it uses. This innovation allows compute to use the GPU L2 cache and perform the required operations without significantly impacting the graphics operations going on at the same time -- in other words, it radically reduces the overhead of running compute and graphics together on the GPU."

Thirdly, said Cerny, "The original AMD GCN architecture allowed for one source of graphics commands, and two sources of compute commands. For PS4, we’ve worked with AMD to increase the limit to 64 sources of compute commands -- the idea is if you have some asynchronous compute you want to perform, you put commands in one of these 64 queues, and then there are multiple levels of arbitration in the hardware to determine what runs, how it runs, and when it runs, alongside the graphics that's in the system."

"The reason so many sources of compute work are needed is that it isn’t just game systems that will be using compute -- middleware will have a need for compute as well. And the middleware requests for work on the GPU will need to be properly blended with game requests, and then finally properly prioritized relative to the graphics on a moment-by-moment basis."

This concept grew out of the software Sony created, called SPURS, to help programmers juggle tasks on the CELL's SPUs -- but on the PS4, it's being accomplished in hardware.
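
To make the 'volatile' bit from the second modification a bit more concrete, here is a toy software model of the idea as Cerny describes it: compute accesses get tagged, and only the tagged lines are invalidated or written back, so graphics data stays resident. This is only a sketch under loud assumptions. The class and method names (`ToyL2`, `invalidate_volatile`, `writeback_volatile`) are hypothetical, the cache has no capacity limit or eviction, and nothing here reflects how the real hardware is built.

```python
from dataclasses import dataclass


@dataclass
class CacheLine:
    value: int
    volatile: bool = False  # hypothetical tag: line was brought in by compute
    dirty: bool = False     # line holds data not yet written to memory


class ToyL2:
    """Toy cache keyed by address; `memory` is a dict standing in for system RAM."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}

    def read(self, addr, volatile=False):
        if addr not in self.lines:  # miss: fetch from memory and tag the line
            self.lines[addr] = CacheLine(self.memory[addr], volatile)
        return self.lines[addr].value

    def write(self, addr, value, volatile=False):
        self.lines[addr] = CacheLine(value, volatile, dirty=True)

    def invalidate_volatile(self):
        """Drop clean compute-tagged lines; graphics lines stay resident."""
        stale = [a for a, line in self.lines.items()
                 if line.volatile and not line.dirty]
        for addr in stale:
            del self.lines[addr]

    def writeback_volatile(self):
        """Flush only dirty compute-tagged lines to memory."""
        for addr, line in self.lines.items():
            if line.volatile and line.dirty:
                self.memory[addr] = line.value
                line.dirty = False


memory = {0x10: 1, 0x20: 2}
l2 = ToyL2(memory)
l2.read(0x10)                          # graphics access, untagged
l2.read(0x20, volatile=True)           # compute access, tagged volatile
memory[0x20] = 99                      # the CPU updates memory behind the cache
l2.invalidate_volatile()               # only the compute line is dropped
fresh = l2.read(0x20, volatile=True)   # refetched, so compute sees the CPU's write
graphics_line_kept = 0x10 in l2.lines

l2.write(0x30, 7, volatile=True)       # compute result sits dirty in the cache
l2.writeback_volatile()                # now visible to the CPU in memory
```

The point of the sketch is the contrast with a full flush: a conventional flush would also have thrown out the graphics line at 0x10, while the selective version leaves it alone.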

The team, to put it mildly, had to think ahead. "The time frame when we were designing these features was 2009, 2010. And the timeframe in which people will use these features fully is 2015? 2017?" said Cerny.

"Our overall approach was to put in a very large number of controls about how to mix compute and graphics, and let the development community figure out which ones they want to use when they get around to the point where they're doing a lot of asynchronous compute."

Cerny expects developers to run middleware -- such as physics, for example -- on the GPU. Using the system he describes above, you can run at peak efficiency, he said.

"If you look at the portion of the GPU available to compute throughout the frame, it varies dramatically from instant to instant. For example, something like opaque shadow map rendering doesn't even use a pixel shader, it’s entirely done by vertex shaders and the rasterization hardware -- so graphics aren't using most of the 1.8 teraflops of ALU available in the CUs. Times like that during the game frame are an opportunity to say, 'Okay, all that compute you wanted to do, turn it up to 11 now.'"

Sounds great -- but how do you handle doing that? "There are some very simple controls where on the graphics side, from the graphics command buffer, you can crank up or down the compute," Cerny said. "The question becomes, looking at each phase of rendering and the load it places on the various GPU units, what amount and style of compute can be run efficiently during that phase?"

Launch and Beyond

The benefits of this powerful hardware will be seen in the PlayStation 4's launch games. But Cerny maintains that, in the future, they'll shine through in totally different ways.

"The launch lineup for PlayStation 4 -- though I unfortunately can’t give the title count -- is going to be stronger than any prior PlayStation hardware. And that's a result of that familiarity," Cerny said. But "if your timeframe is 2015, by another way of thinking, you really need to be doing that customization, because your competition will be doing that customization."

So while it takes "weeks, not months" to port a game engine from the PC to the PlayStation 4 according to Cerny, down the road, dedicated console developers can grasp the capabilities of the PlayStation 4, customize their technology, and really reap the benefits.

"There are many, many ways to control how the resources within the GPU are allocated between graphics and compute. Of course, what you can do, and what most launch titles will do, is allocate all of the resources to graphics. And that’s perfectly fine, that's great. It's just that the vision is that by the middle of the console lifecycle, that there's a bit more going on with compute."

Freeing Up Resources: The PS4's Dedicated Units

Another thing the PlayStation 4 team did to increase the flexibility of the console is to put many of its basic functions on dedicated units on the board -- that way, you don't have to allocate resources to handling these things.

"The reason we use dedicated units is it means the overhead as far as games are concerned is very low," said Cerny. "It also establishes a baseline that we can use in our user experience."

"For example, by having the hardware dedicated unit for audio, that means we can support audio chat without the games needing to dedicate any significant resources to them. The same thing for compression and decompression of video." The audio unit also handles decompression of "a very large number" of MP3 streams for in-game audio, Cerny added.
