Is it on LKML.org?
You guys might want to check out the ongoing discussion around the HSA patches for the Linux kernel. Some Carrizo details being shared there.
TDR the old way may still be applicable, but it would require the developers to break everything into smaller tasks to fit the time limit, and it does not sound like a good solution in terms of cross-generation compatibility. It now sounds to me that Kaveri is just an HSA prototype for developers to start working on.
If Kaveri is that lacking in preemption, however, it is of uncertain portent for how well the compute functionality can be used for the consoles. Long-running compute was pointed out as a sore spot by the Infamous SS devs, and while I made allowance for a lack of software infrastructure to explain that and the writing off of GPU compute for low-latency audio, it's another thing if the hardware simply doesn't have it.
The lack of preemption means there's no way to get new work onto the GPU if there are long-running wavefronts hogging resources.
I suspect consoles will hardly suffer from the lack of mid-wave pre-emption, as GPUs in consoles are likely running only the game process (compute) and the graphics pipeline in parallel, both of which are controllable by the application and should behave like adding more CS uses into the mix, perhaps?
The profile seems to be whether the GPU is busy.
Could developers use profiling to allow them to avoid any such nasty surprises?
They are, or can be, different. The PC is worse.
I understand that on PC a thick API and other software running concurrently might make this ineffective, but wouldn't consoles be different?
He basically said latency-tolerant things like reverb might be done on the GPU.
The comment about several frames of latency is surprising, given that anything requiring several frames of latency (graphics, sound, physics etc) would be a bad fit for a video game.
I don't know the circumstances of this profiling you're talking about, but I doubt being able to preempt compute work already executing is going to make much of a difference. You even said the execution of launched kernels is generally fast, so outside of some bad cases I'm unsure why the lack of wave preemption is an issue for audio processing.
The GPU was profiled as taking over 30 ms to get to the point where it can launch a kernel the application submitted. The submission process is fast, the execution of launched kernels is generally fast, but the time it takes for the whole set of required buffers, register allocations, and data share to be available when there are hundreds or thousands of other competitors is what takes so long.
It's from Laurent Betbeder's presentation at APU13.
I don't know the circumstances of this profiling you're talking about, but I doubt being able to preempt compute work already executing is going to make much of a difference. You even said the execution of launched kernels is generally fast so outside of some bad cases I'm unsure why the lack of wave preemption is an issue for audio processing.
Apparently that short while was not short enough for his purposes, but he was aiming for <5ms for the stuff that needs audio sync.
By using async compute queues the audio work can skip any graphics work that's queued up so it only needs to wait a short while for a compute unit to be free.
If there's too much latency it has to be in the queue process and getting data from the CPU to the GPU and back.
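To put rough numbers on the queue-skipping argument above, here is a toy Python sketch. It is not a model of any real GPU front end: the single compute unit, the function name `audio_wait_ms` and all the durations are assumptions made up for illustration. It only shows that a separate queue lets audio skip queued graphics work, while without preemption it still has to wait for whatever is already executing.

```python
def audio_wait_ms(running_remaining_ms, queued_graphics_ms, use_async_queue):
    """Time until a newly submitted audio job starts on one non-preemptible CU.

    running_remaining_ms: time left on the job currently executing
                          (hypothetical value; it cannot be interrupted)
    queued_graphics_ms:   durations of graphics jobs already waiting
    use_async_queue:      True if audio sits in its own queue that is
                          served before the queued graphics work
    """
    # The job already on the CU always has to finish first.
    wait = running_remaining_ms
    if not use_async_queue:
        # In a single shared queue, audio also waits behind every queued job.
        wait += sum(queued_graphics_ms)
    return wait


# Made-up example: 4 ms left on the running job, three graphics jobs queued.
shared = audio_wait_ms(4.0, [3.0, 6.0, 2.0], use_async_queue=False)
separate = audio_wait_ms(4.0, [3.0, 6.0, 2.0], use_async_queue=True)
print(shared, separate)
```

With these made-up numbers the separate queue brings the wait down to the remaining time of the running job, which is exactly the part that only preemption (or shorter jobs) could shrink further.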
The whole problem seems to be having no guaranteed access within a given time interval, so that long-running shaders (I suspect almost all long-running stuff would be shader-only so far) would block the progress of later tasks that are latency-critical. Though I used to think the application has all the control, one wouldn't want to break the shaders into smaller pieces either, because it just works, perhaps. More importantly, splitting shaders means you need to burn extra bandwidth on saving state or data, and sometimes it also costs performance in the graphics fixed-function hardware, like doing multiple PS passes. That said, hardware preemption burns bandwidth too, but at least it is transparent and comes with a set of guarantees.
I don't know the circumstances of this profiling you're talking about, but I doubt being able to preempt compute work already executing is going to make much of a difference. You even said the execution of launched kernels is generally fast so outside of some bad cases I'm unsure why the lack of wave preemption is an issue for audio processing.
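The trade-off between splitting shaders and hardware preemption can be sketched with some back-of-the-envelope Python. Everything here is an assumption for illustration: one busy compute unit, a fixed save/restore cost, and hypothetical function names and numbers. It is not a description of how GCN hardware actually schedules work.

```python
import math


def worst_case_wait_ms(long_job_ms, chunk_ms=None, preempt_cost_ms=None):
    """Worst-case delay before a latency-critical job can start.

    Assumes a single compute unit occupied by one long-running job.
    """
    if preempt_cost_ms is not None:
        # Hardware preemption: wait only for the state save.
        return preempt_cost_ms
    if chunk_ms is not None:
        # Manually split job: wait for the current chunk to finish.
        return min(chunk_ms, long_job_ms)
    # No preemption, no splitting: wait for the whole job.
    return long_job_ms


def split_overhead_ms(long_job_ms, chunk_ms, save_restore_ms):
    """Extra time spent saving and restoring state between chunks."""
    chunks = math.ceil(long_job_ms / chunk_ms)
    return (chunks - 1) * save_restore_ms


# Made-up numbers: a 30 ms shader, 4 ms chunks, 0.25 ms per save/restore.
print(worst_case_wait_ms(30.0))                       # unsplit
print(worst_case_wait_ms(30.0, chunk_ms=4.0))         # split by the developer
print(worst_case_wait_ms(30.0, preempt_cost_ms=0.5))  # hardware preemption
print(split_overhead_ms(30.0, 4.0, 0.25))             # cost paid for splitting
```

The point of the sketch is only that splitting bounds the wait by the chunk length but pays the save/restore cost on every chunk boundary, whether or not anything latency-critical ever shows up, whereas preemption pays it only when a switch actually happens.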
AMD Mobile “Carrizo” Family of APUs Designed to Deliver Significant Leap in Performance, Energy Efficiency in 2015
─ 2015 AMD Mobile Roadmap adds “Carrizo” and “Carrizo-L” SoCs to APU lineup ─
SINGAPORE — Nov. 20, 2014 — AMD (NYSE: AMD) today at its Future of Compute event announced the addition of its first high performance system-on-a-chip (SoC), codenamed “Carrizo”, and a mainstream SoC codenamed “Carrizo-L” as part of the company’s 2015 AMD Mobile APU family roadmap. In collaboration with hardware and software partners, these new 2015 AMD Mobile APUs are designed as complete solutions for gaming, productivity applications, and ultra high-definition 4K experiences. With support for Microsoft® DirectX® 12, OpenCL® 2.0, AMD’s Mantle API, AMD FreeSync and support for Microsoft’s upcoming Windows® 10 operating system, the 2015 AMD Mobile APU family enables the experiences consumers expect.
“We continue to innovate and build upon our existing IP to deliver great products for our customers,” said John Byrne, senior vice president and general manager, Computing and Graphics business group, AMD. “AMD’s commitment to graphics and compute performance, as expressed by our goal to improve APU energy efficiency 25x by 2020, combines with the latest industry standards and fresh innovation to drive the design of the 2015 AMD Mobile APU family. We are excited about the experiences these new APUs will bring and look forward to sharing more details in the first half of next year.”
The flagship “Carrizo” processor will integrate the new x86 CPU core codenamed “Excavator” with next generation AMD Radeon™ graphics in the world’s first Heterogeneous Systems Architecture (HSA) 1.0 compliant SoC. The “Carrizo-L” SoC integrates the CPU codenamed “Puma+” with AMD Radeon™ R-Series GCN GPUs and is intended for mainstream configurations. In addition, an AMD Secure Processor will be integrated into the “Carrizo” and “Carrizo-L” APUs, enabling ARM® TrustZone® across the entire family for the security commercial customers and consumers expect. Utilizing a single package infrastructure for “Carrizo” and “Carrizo-L,” the 2015 AMD Mobile APU family simplifies partner designs across a broad range of commercial and consumer mobile systems.
“Carrizo” and “Carrizo-L” are scheduled to ship in 1H 2015, with laptop and All-in-One systems based on the 2015 AMD Mobile APU family expected in market by mid-year 2015.
http://www.eetimes.com/document.asp?_mc=RSS_EET_EDT&doc_id=1324643&page_number=2
AMD will disclose Carrizo, an integrated processor with its latest x86 core. The 28 nm chip measures 244.62 mm² and packs more than 3.1 billion transistors. Its new Excavator core is 23% smaller and uses 40% less power than AMD’s previous x86 core.