AMD Carrizo / Toronto

You guys might want to check out the ongoing discussion around the HSA patches for the Linux kernel. Some Carrizo details being shared there.
 
By "as-is" I wanted to mean the chip and its on-board southbridge. And no need for the FM2+ really. I think Toronto is intended to provide an 8x PCIe 3.0 slot, which gives you something at least.
 
You guys might want to check out the ongoing discussion around the HSA patches for the Linux kernel. Some Carrizo details being shared there.
Is it on LKML.org?

Intra-wavefront preemption seems to be on the table for Carrizo.
There's some significant back and forth on rough edges for Kaveri, including its lack of preemption and some concern about how it tries to synchronize some things for performance monitoring.

The lack of preemption is also likely one of the big reasons why Kaveri is marketed as only having HSA features.

If Kaveri is that lacking in preemption, however, it is of uncertain portent for how well the compute functionality can be used for the consoles. Long-running compute was pointed to as a sore spot by the Infamous SS devs, and while I made allowance for a lack of software infrastructure to explain that and the writing-off of GPU compute for low-latency audio, it's another thing if the hardware simply doesn't have it.

I empathize with the position concerning the level of control the kernel should have, particularly when it's Kaveri's shortfalls that endanger the system. However, long-term I think there is going to be some give on the level of awareness the kernel has of the behavior that lies below the queue command layer. There are other areas of CPU and GPU behavior that take general guidance and failsafe controls, but do not otherwise interact with the OS.
 
If Kaveri is that lacking in preemption, however, it is of uncertain portent for how well the compute functionality can be used for the consoles. Long-running compute was pointed to as a sore spot by the Infamous SS devs, and while I made allowance for a lack of software infrastructure to explain that and the writing-off of GPU compute for low-latency audio, it's another thing if the hardware simply doesn't have it.
TDR the old way may still be applicable, but it would require developers to break everything into smaller tasks to fit the time limit, and that does not sound like a good solution in terms of cross-generation compatibility. It now sounds to me like Kaveri is just an HSA prototype for developers to start working with.
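To make that trade-off concrete, here's a minimal sketch of the "split it to fit the watchdog" approach; dispatch_slice() is just a stand-in for a real GPU dispatch, and the budget and per-item cost are invented numbers for illustration:

```cpp
// Minimal sketch of splitting one long compute job into dispatches that each
// fit an assumed watchdog budget. dispatch_slice() stands in for a real GPU
// dispatch; all numbers here are made up.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <thread>

// Hypothetical stand-in for enqueueing one bounded GPU dispatch.
static void dispatch_slice(size_t first, size_t count) {
    // Pretend each work item costs ~1 microsecond of GPU time.
    std::this_thread::sleep_for(std::chrono::microseconds(count));
    std::printf("dispatched items [%zu, %zu)\n", first, first + count);
}

int main() {
    const size_t total_items   = 10000;    // the whole job
    const double budget_ms     = 2.0;      // assumed per-dispatch watchdog budget
    const double cost_per_item = 0.001;    // assumed measured cost, ms per item

    // Largest slice expected to finish inside the budget.
    const size_t slice =
        std::max<size_t>(1, static_cast<size_t>(budget_ms / cost_per_item));

    for (size_t i = 0; i < total_items; i += slice)
        dispatch_slice(i, std::min(slice, total_items - i));
    return 0;
}
```

The downside is exactly the one above: more dispatches, plus whatever state has to be carried between slices.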

I suspect consoles will hardly suffer from the lack of mid-wave preemption, as console GPUs are likely running only the game process (compute) and the graphics pipeline in parallel, both of which are controllable by the application and should behave like adding more compute shader work into the mix, perhaps?
 
I suspect consoles will hardly suffer from the lack of mid-wave preemption, as console GPUs are likely running only the game process (compute) and the graphics pipeline in parallel, both of which are controllable by the application and should behave like adding more compute shader work into the mix, perhaps?
The lack of preemption means there's no way to get new work onto the GPU if there are long-running wavefronts hogging resources.
It's why Sony's audio engineer stated only audio effects capable of tolerating multiple frames of latency should be done on the GPU.

Without better controls, the software can only control how work is put on the queues and their relative issue priority, but it can't do anything if there's a nasty surprise like already started wavefronts getting in the way of any further launches.
 
Could developers use profiling to allow them to avoid any such nasty surprises?

I understand that on PC a thick API and other software running concurrently might make this ineffective, but wouldn't consoles be different?

The comment about several frames of latency is surprising, given that anything requiring several frames of latency (graphics, sound, physics etc) would be a bad fit for a video game.
 
Could developers use profiling to allow them to avoid any such nasty surprises?
The profile seems to be whether the GPU is busy.
There is the individual behavior of each kernel, and their collective effects on the GPU by how they happen to get their resource allocations set.
The GPU was profiled as taking over 30 ms to get to the point where it can launch a kernel the application submitted. The submission process is fast, and the execution of launched kernels is generally fast, but the time it takes for the whole set of required buffers, register allocations, and data share to be available when there are hundreds or thousands of other competitors is what takes so long.
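For anyone who wants to see that split on PC hardware, here's a small sketch using OpenCL event timestamps. It is not how the console profiling was done, just one way to separate "time spent waiting to launch" from "time spent executing"; the no-op kernel and missing error checks are for brevity.

```cpp
// Rough sketch: measure how long a kernel waits between being queued and
// actually starting, versus how long it runs. Error checking omitted.
#include <cstdio>
#include <CL/cl.h>

static const char *src =
    "__kernel void noop(__global int *out) { out[get_global_id(0)] = 1; }\n";

int main() {
    cl_platform_id plat; cl_device_id dev;
    clGetPlatformIDs(1, &plat, nullptr);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, nullptr);
    cl_context ctx = clCreateContext(nullptr, 1, &dev, nullptr, nullptr, nullptr);
    // Profiling-enabled queue so the runtime timestamps each command.
    cl_command_queue q = clCreateCommandQueue(ctx, dev, CL_QUEUE_PROFILING_ENABLE, nullptr);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, nullptr);
    clBuildProgram(prog, 1, &dev, nullptr, nullptr, nullptr);
    cl_kernel k = clCreateKernel(prog, "noop", nullptr);
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, 1024 * sizeof(int), nullptr, nullptr);
    clSetKernelArg(k, 0, sizeof(buf), &buf);

    size_t gsz = 1024;
    cl_event ev;
    clEnqueueNDRangeKernel(q, k, 1, nullptr, &gsz, nullptr, 0, nullptr, &ev);
    clFinish(q);

    cl_ulong queued, start, end;   // timestamps in nanoseconds
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_QUEUED, sizeof(queued), &queued, nullptr);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,  sizeof(start),  &start,  nullptr);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,    sizeof(end),    &end,    nullptr);

    // queued -> start is the launch latency being discussed here;
    // start -> end is the (usually short) execution time of the kernel itself.
    std::printf("queued -> start: %.3f ms\n", (start - queued) / 1e6);
    std::printf("start  -> end  : %.3f ms\n", (end - start) / 1e6);
    return 0;
}
```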

I understand that on PC a thick API and other software running concurrently might make this ineffective, but wouldn't consoles be different?
They are, or can be, different. The PC is worse.

The comment about several frames of latency is surprising, given that anything requiring several frames of latency (graphics, sound, physics etc) would be a bad fit for a video game.
He basically said latency-tolerant things like reverb might be done on the GPU.
If the GPU is not unduly stressed, it can manage better latencies, but once it reaches high utilization you can't count on it.
The audio DSP logic the PS4 has is faster, but it is frequently used for other functions and it is hidden behind a secure API that does add some latency.
The CPU provides the lowest and most dependable latency.
 
The GPU was profiled as taking over 30 ms to get to the point where it can launch a kernel the application submitted. The submission process is fast, and the execution of launched kernels is generally fast, but the time it takes for the whole set of required buffers, register allocations, and data share to be available when there are hundreds or thousands of other competitors is what takes so long.
I don't know the circumstances of this profiling you're talking about, but I doubt being able to preempt compute work already executing is going to make much of a difference. You even said the execution of launched kernels is generally fast so outside of some bad cases I'm unsure why the lack of wave preemption is an issue for audio processing.

By using async compute queues the audio work can skip any graphics work that's queued up so it only needs to wait a short while for a compute unit to be free. If there's too much latency it has to be in the queue process and getting data from the CPU to the GPU and back.
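A toy model of that argument, just to be explicit about what "wait a short while for a compute unit to be free" means when nothing already running can be preempted; every number below is invented for illustration:

```cpp
// Toy model: an audio dispatch on its own async compute queue bypasses the
// commands queued behind graphics, but without preemption it still starts
// only when some CU's resources drain. All numbers are invented.
#include <algorithm>
#include <cstdio>
#include <vector>

static void run_case(const char *label, double cus_busy_until_ms) {
    const int num_cus = 4;                                   // assumed CU count
    std::vector<double> cu_free_at(num_cus, cus_busy_until_ms);

    const double audio_ready_ms = 1.0;                       // audio kernel queued at t = 1 ms
    double first_free = *std::min_element(cu_free_at.begin(), cu_free_at.end());
    double start = std::max(audio_ready_ms, first_free);

    std::printf("%-12s audio queued at %.1f ms, starts at %.1f ms (waited %.1f ms)\n",
                label, audio_ready_ms, start, start - audio_ready_ms);
}

int main() {
    run_case("light load:", 2.0);    // resident waves drain quickly
    run_case("heavy load:", 12.0);   // long-running waves hold every CU
    return 0;
}
```

If the resident waves drain quickly the wait really is short; if they don't, the separate queue by itself doesn't help.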
 
I don't know the circumstances of this profiling you're talking about, but I doubt being able to preempt compute work already executing is going to make much of a difference. You even said the execution of launched kernels is generally fast so outside of some bad cases I'm unsure why the lack of wave preemption is an issue for audio processing.
It's from Laurent Betbeder's presentation at APU13.
He does mention that there's a sort of preemption, but goes into no detail concerning its use.
The use of the GPU for compute for his purposes behaves badly when the rest of the game's GPU load gets in the way of audio kernel launches.
More likely, the lack of graphics preemption is the stronger detractor there.

Sucker Punch Studio's post-mortem on their engine mentioned that the opposite use case for long-running GPU compute is also currently very problematic.
The Linux HSA Kaveri and Carrizo discussion shows that long-running compute runs into the problem where it's possible to DoS the graphics portion.
Preemption can give better guarantees for the low end of the latency spectrum, and a safety net for long-running kernels.
The theory is that at least some form of it is present in Sea Islands, but either the state of the hardware or the current version of the software has left the supposed cheerleaders at APU13 and Sony's first party developer aiming a bit lower than what a general compute peer should be capable of doing.

I was surprised at just how long the launch latency under load was in Betbeder's presentation.

By using async compute queues the audio work can skip any graphics work that's queued up so it only needs to wait a short while for a compute unit to be free.
Apparently that short while was not short enough for his purposes, but he was aiming for <5ms for the stuff that needs audio sync.

If there's too much latency it has to be in the queue process and getting data from the CPU to the GPU and back.

http://www.slideshare.net/DevCentralAMD/mm-4085-laurentbetbeder

I'm still able to track down the slides, but not the video. The video wasn't the easiest thing to pick tidbits out of, but he added remarks in it about some rough audio frame and submission latencies that don't show up in the slides.

For the purposes of audio in particular, the latency for compute kernel launch when the GPU is under load was about 10x too high.

The characterization of the tools he was able to work with at that early date did not show that he had many levers to pull to compensate, although he made a brief mention of wanting to experiment with unspecified functions at a later time.

The following has some rough text of a few things in the video:
http://mandetech.com/2013/11/24/designing-a-game-audio-system-for-hsa/

The text there puts the ACP as being acceptable for things with a 20ms latency. My recollection of the presentation is that it's somewhat better than that, with a latency of 2 or so audio frames, which from my fuzzy memory are something less than 10ms each.

The last line of that link with 1-3 frames of latency is a rough characterization of the latency numbers that did pop up for the GPU.
So the CPU was for the low single-digits, the ACP was a somewhat too slow solution somewhere a little north of that, while the GPU chucked an extra 10-20ms on top of that.
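Rough numbers, since the audio frame length is what these ballparks are measured against; the sample rate and buffer sizes below are my assumptions, not figures from the presentation:

```cpp
// Back-of-the-envelope on the latency ballparks above. Sample rate and buffer
// sizes are assumptions; the path latencies are the thread's rough figures.
#include <cstdio>

int main() {
    const double sample_rate = 48000.0;          // assumed
    const int    buffer_sizes[] = {256, 512};    // assumed audio frame sizes

    for (int n : buffer_sizes)
        std::printf("%d samples @ 48 kHz = %.2f ms per audio frame\n",
                    n, 1000.0 * n / sample_rate);

    // Approximate ballparks from the discussion above:
    std::printf("CPU  : low single-digit ms\n");
    std::printf("ACP  : ~2 audio frames, i.e. somewhere under ~20 ms\n");
    std::printf("GPU  : adds another ~10-20 ms on top of that under load\n");
    std::printf("Target for synced audio effects: < 5 ms\n");
    return 0;
}
```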
 
I suppose if a console developer really wants to isolate the queuing latency they can reserve a CU or two for compute and virtually guarantee queued work will launch as fast as possible. The result would be academic though, since if it's not fast enough it doesn't really matter which part is too slow.
 
That was something I hypothesized could be done prior to the PS4 launch.
There's no indication that this was done, so I guess the more conventional way of allocating work remains in play thus far.

Something like Carrizo, I think, would have made things more interesting.
 
I don't know the circumstances of this profiling you're talking about, but I doubt being able to preempt compute work already executing is going to make much of a difference. You even said the execution of launched kernels is generally fast so outside of some bad cases I'm unsure why the lack of wave preemption is an issue for audio processing.
The whole problem seems to be having no guaranteed access within a given time interval, so that long-running shaders (I suspect almost all long-running stuff is shader-only so far) can block the progress of later, latency-critical tasks. Though I used to think the application has all the control, one wouldn't want to break the shaders into smaller pieces either, perhaps just because they already work as-is. More importantly, splitting shaders means burning extra bandwidth on saving state or data, and sometimes also the performance of the graphics fixed-function hardware, e.g. by doing multiple PS passes. That said, hardware preemption burns bandwidth too, but at least it is transparent and comes with a set of guarantees.
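Back-of-the-envelope for that bandwidth point: splitting a pass means writing the intermediate state out and reading it back in. The resolution, format, and bandwidth below are example assumptions, not tied to any specific chip:

```cpp
// Rough cost of checkpointing intermediate state at one split point.
// All inputs are assumed example values.
#include <cstdio>

int main() {
    const double width = 1920, height = 1080;   // assumed render target
    const double bytes_per_px = 8.0;            // e.g. an RGBA16F intermediate
    const double bandwidth_gbs = 176.0;         // assumed memory bandwidth, GB/s

    const double buffer_mb  = width * height * bytes_per_px / (1024 * 1024);
    const double traffic_mb = 2.0 * buffer_mb;  // write at the split, read it back
    const double cost_ms    = traffic_mb / 1024.0 / bandwidth_gbs * 1000.0;

    std::printf("intermediate buffer : %.1f MB\n", buffer_mb);
    std::printf("extra traffic/split : %.1f MB\n", traffic_mb);
    std::printf("raw bandwidth cost  : %.3f ms per split point\n", cost_ms);
    return 0;
}
```

Per split it's small, but it scales with the number of split points and competes with everything else for the same bandwidth.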

On the PC this means a huge potential for GPU DoS, but you can't kill it the way a graphics timeout does (e.g. TDR on Windows) either, unless HSA eventually defines a weaker profile for closed systems that allows this. My two cents is that the upstream KFD will probably support neither Kaveri nor any future HSA device without hard preemption, unless AMD eventually finds a solution.
 
[image: EwohP98.png]
 
Looks like they have 3 modules with less L2 cache now and a beefier GPU to feed; rumors are they're still sticking to a dual-channel DDR3 interface. It would be nice if they moved on to HBM, but it doesn't look like that's going to happen before they reveal K12. It would save them money and boost performance to sell an APU with HBM on-package.
 
Seems like Carrizo was just announced officially.

- "brand new graphics architecture"
- "biggest leap ever from an energy efficiency standpoint for AMD"

says John Byrne in an overly enthusiastically cut video

From the AMD press release:
AMD Mobile “Carrizo” Family of APUs Designed to Deliver Significant Leap in Performance, Energy Efficiency in 2015

─ 2015 AMD Mobile Roadmap adds “Carrizo” and “Carrizo-L” SoCs to APU lineup ─


SINGAPORE — Nov. 20, 2014 — AMD (NYSE: AMD) today at its Future of Compute event announced the addition of its first high performance system-on-a-chip (SoC), codenamed “Carrizo”, and a mainstream SoC codenamed “Carrizo-L” as part of the company’s 2015 AMD Mobile APU family roadmap. In collaboration with hardware and software partners, these new 2015 AMD Mobile APUs are designed as complete solutions for gaming, productivity applications, and ultra high-definition 4K experiences. With support for Microsoft® DirectX® 12, OpenCL® 2.0, AMD’s Mantle API, AMD FreeSync and support for Microsoft’s upcoming Windows® 10 operating system, the 2015 AMD Mobile APU family enables the experiences consumers expect.

“We continue to innovate and build upon our existing IP to deliver great products for our customers,” said John Byrne, senior vice president and general manager, Computing and Graphics business group, AMD. “AMD’s commitment to graphics and compute performance, as expressed by our goal to improve APU energy efficiency 25x by 2020, combines with the latest industry standards and fresh innovation to drive the design of the 2015 AMD Mobile APU family. We are excited about the experiences these new APUs will bring and look forward to sharing more details in the first half of next year.”

The flagship “Carrizo” processor will integrate the new x86 CPU core codenamed “Excavator” with next generation AMD Radeon™ graphics in the world’s first Heterogeneous Systems Architecture (HSA) 1.0 compliant SoC. The “Carrizo-L” SoC integrates the CPU codenamed “Puma+” with AMD Radeon™ R-Series GCN GPUs and is intended for mainstream configurations. In addition, an AMD Secure Processor will be integrated into the “Carrizo” and “Carrizo-L” APUs, enabling ARM® TrustZone® across the entire family for the security commercial customers and consumers expect. Utilizing a single package infrastructure for “Carrizo” and “Carrizo-L,” the 2015 AMD Mobile APU family simplifies partner designs across a broad range of commercial and consumer mobile systems.

“Carrizo” and “Carrizo-L” are scheduled to ship in 1H 2015, with laptop and All-in-One systems based on the 2015 AMD Mobile APU family expected in market by mid-year 2015.


Supporting Resources

· View video of AMD’s John Byrne introducing the “Carrizo” codenamed APU

· More information on AMD Investor Relations

· Become a fan of AMD on Facebook

· Follow AMD on Twitter

· Join AMD on Google+
 