Perhaps it has more bearing, then, on the use of general compute or side functions that involve paging or updating the IOMMU, since the GPU can only operate as a guest, and something like updating or invalidating the IOMMU's TLB or page data can lead to stalls with the early implementations of shared virtual memory.
Some of that is CPU-driven when it comes to privileged operations, with other things like paging dominated by HDD latency and bandwidth.
Yes, it's very unfortunate that current GPUs cannot do things like update virtual memory page mappings from compute shaders. If this were possible, it would make many things faster and easier to implement (no need for software indirection in virtual texturing or custom mesh data fetching). It's awkward that you need to fire CPU-side interrupts to change your mappings. The latency of a GPU->CPU->GPU round trip in the middle of a frame is just too long for it to be practical (especially on PC).
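To make the "software indirection" concrete: since the shader can't touch real page tables, virtual texturing translates addresses itself through an indirection table. A rough sketch of that translation in plain C++ (the names, the 128-texel page size and the table layout are only illustrative assumptions; in practice this runs per sample in the pixel shader):

```cpp
#include <cstdint>
#include <utility>

// Illustrative page table entry: maps a virtual page to a page slot in the
// physical texture cache (atlas). Field names and sizes are assumptions.
struct PageEntry {
    uint16_t physPageX;   // page coordinates inside the physical texture cache
    uint16_t physPageY;
};

constexpr int kPageSize      = 128;   // texels per page side (assumption)
constexpr int kVirtualPagesX = 1024;  // virtual address space width in pages

// Translate a virtual texel coordinate to a physical texel coordinate by
// reading a software-maintained indirection table instead of real page tables.
std::pair<int, int> translate(const PageEntry* indirection, int virtX, int virtY)
{
    const PageEntry& e = indirection[(virtY / kPageSize) * kVirtualPagesX
                                     + (virtX / kPageSize)];
    int physX = e.physPageX * kPageSize + (virtX % kPageSize);
    int physY = e.physPageY * kPageSize + (virtY % kPageSize);
    return { physX, physY };
}
```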
I suspect we'll find that the consoles were a few years too early and certain architectural and manufacturing advances a few years too late.
I would have at least liked to have transactional memory (TSX instructions or similar). AMD was ahead of both Intel and IBM in announcing transactional memory extensions (x86 ASF instructions in 2009), but they still haven't said anything about the timeline for getting these instructions into real products (source: http://en.wikipedia.org/wiki/Transactional_memory).
GPU compute in the next-gen consoles has really spawned lots of interesting research. GPU compute in consumer devices has taken a big jump forward. Tools have also improved a lot (for example the Visual Studio 2013 DirectCompute debugger/profiler is miles ahead of the previous ones). It would have been nice to see how TSX would have changed things regarding multithreaded programming, but maybe that's what we'll see in the generation after this one. I would have also liked to see even more CPU cores, just to force everyone to use actual parallel programming methods instead of continuing to run six serial programs simultaneously on separate cores and synchronizing them between frames (= additional frames of input latency and poor CPU utilization).
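For reference, this is roughly what Intel's RTM/TSX interface looks like from C++ (the counter and the spinlock fallback are purely illustrative; it needs TSX-capable hardware and -mrtm or equivalent, and a production version would be more careful about retries and abort codes):

```cpp
#include <immintrin.h>   // _xbegin / _xend / _xabort (Intel RTM intrinsics)
#include <atomic>

std::atomic<bool> fallbackLocked{false};  // simple fallback spinlock (illustrative)
long long counter = 0;

void incrementCounter()
{
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        // Subscribe to the fallback lock so the transaction aborts if another
        // thread is currently on the fallback path.
        if (fallbackLocked.load(std::memory_order_relaxed))
            _xabort(0xff);
        ++counter;          // executed as a hardware transaction
        _xend();
    } else {
        // Transaction aborted (conflict, capacity, no TSX): take the lock instead.
        while (fallbackLocked.exchange(true, std::memory_order_acquire))
            ;               // spin
        ++counter;
        fallbackLocked.store(false, std::memory_order_release);
    }
}
```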
Do you have one in mind that already exists and can be evolved to meet your needs? Or does it need to be a radically new language?
C++? Or something more high-level/safe/managed?
There are multiple nice new parallel languages in development, but I haven't had time to properly experiment with any of them.
Personally I like C++11 a lot. It improved things nicely over the old C++ standard, and C++14 seems to be continuing the trend. C++ is moving forward nicely. You can even finally create variable-parameter functions that are type safe (thanks to variadic templates).
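For example, a small type-safe variable-argument helper built on variadic templates might look like this (the logging-style use case is just an illustration):

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Type-safe "printf-style" helper using C++11 variadic templates: every
// argument goes through operator<<, so there is no format string to mismatch.
inline void appendAll(std::ostringstream&) {}  // recursion terminator

template <typename First, typename... Rest>
void appendAll(std::ostringstream& out, const First& first, const Rest&... rest)
{
    out << first;
    appendAll(out, rest...);
}

template <typename... Args>
std::string makeMessage(const Args&... args)
{
    std::ostringstream out;
    appendAll(out, args...);
    return out.str();
}

// Usage: std::cout << makeMessage("frame ", 42, " took ", 16.6, " ms") << "\n";
```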
There are many nice parallel programming libraries developed on top of modern C++. There are CPU-based ones such as Intel TBB and Microsoft PPL. Both can be used to implement modern task/job based systems, but unfortunately neither is cross-platform compatible (Windows, Linux, Mac, XB1, PS4, iOS, Android, etc.), so we game developers still need to write our own systems. Nothing bad about that either, since C++ is a very flexible language, so you can usually add things that the core language doesn't support. One thing that is not possible to implement on top of straight C++ (without super excessive expression template hackery that takes years to compile) is writing SPMD-style programs that compile to optimized SoA AVX2 vector code (and are automatically multithreaded). For this you need to use something like the Intel SPMD Program Compiler (ISPC), and that's not cross-platform compatible either.
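To illustrate the kind of primitive TBB and PPL provide, here is a toy cross-platform parallel_for built on nothing but std::thread (a real engine job system would keep a persistent worker pool and use work stealing rather than spawning threads per call):

```cpp
#include <thread>
#include <vector>
#include <functional>
#include <algorithm>

// Toy parallel_for: splits [begin, end) into one chunk per hardware thread.
// Purely illustrative; it shows the shape of the API, not production quality.
void parallelFor(int begin, int end, const std::function<void(int)>& body)
{
    const int workers = static_cast<int>(
        std::max(1u, std::thread::hardware_concurrency()));
    const int chunk = (end - begin + workers - 1) / workers;

    std::vector<std::thread> threads;
    for (int w = 0; w < workers; ++w) {
        const int chunkBegin = begin + w * chunk;
        const int chunkEnd   = std::min(end, chunkBegin + chunk);
        if (chunkBegin >= chunkEnd) break;
        threads.emplace_back([=] {
            for (int i = chunkBegin; i < chunkEnd; ++i)
                body(i);
        });
    }
    for (auto& t : threads) t.join();
}
```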
C++AMP is nice for writing GPU code inside your CPU code. It uses standard C++11 primitives (lambdas) to express kernels, and it's fully C++11 compatible in syntax, with templates and all (one extra language keyword added). Auto-completion and all the other modern IDE productivity improvements work properly with it. You could also theoretically compile C++AMP code to standard CPU code by implementing a small wrapper library; however, this wouldn't automatically SoA-vectorize your code (though you could multithread it easily). C++AMP debugging is also fully integrated into Visual Studio: you can step inside the code easily, inspect memory and variables, and go through call stacks. I would prefer writing all my GPU code in C++AMP instead of HLSL, but unfortunately C++AMP is only available on Windows, and that's a big showstopper.
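A minimal C++AMP kernel looks roughly like this (the saxpy-style math is just an illustration); everything is ordinary C++11 apart from the restrict(amp) keyword:

```cpp
#include <amp.h>
#include <vector>

// Minimal C++AMP example: the lambda marked restrict(amp) is the GPU kernel,
// and array_view handles data transfer between host and device.
void scaleAndAdd(std::vector<float>& y, const std::vector<float>& x, float a)
{
    concurrency::array_view<float, 1>       yView(static_cast<int>(y.size()), y);
    concurrency::array_view<const float, 1> xView(static_cast<int>(x.size()), x);

    concurrency::parallel_for_each(yView.extent,
        [=](concurrency::index<1> idx) restrict(amp)
        {
            yView[idx] = a * xView[idx] + yView[idx];
        });

    yView.synchronize();  // copy results back to the std::vector
}
```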
Compared to an ideal solution, C++AMP sorely lacks the ability to call other kernels from inside a kernel. Kepler now has that feature in the latest CUDA version (Dynamic Parallelism). This feature is badly needed in all GPU hardware and all GPU APIs. The inability to spawn other shaders (proper function calls) from inside a shader is one of the things that has limited the ability to run general purpose code on GPUs.
Obviously even NVIDIA's cards and APIs have limitations. You can't use Dynamic Parallelism to submit draw calls, for example, which would be awesome for graphics code. Multidraw indirect is nice, as it allows the GPU to write an array of draw call parameters and the draw call count, but it's not as flexible as directly issuing draw calls from a shader.
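For reference, the per-draw record that multidraw indirect consumes from a GPU-writable buffer looks like this (OpenGL's DrawArraysIndirectCommand layout; Direct3D's indirect draw arguments are similar). A compute shader can fill an array of these, but it still can't change shaders or state, or issue the draws itself:

```cpp
#include <cstdint>

// Per-draw record for multidraw indirect (OpenGL DrawArraysIndirectCommand).
// A GPU kernel writes an array of these plus a draw count; the CPU (or an
// indirect-parameters extension) then kicks the actual draws.
struct DrawArraysIndirectCommand {
    uint32_t count;          // vertices per draw
    uint32_t instanceCount;  // instances per draw
    uint32_t first;          // first vertex
    uint32_t baseInstance;   // first instance
};
```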
Frame latency comes into play when we realize that parallelization without improving sequential performance could only continue if we were fine with computing a hundred frames simultaneously, each with a full second of latency. GPUs already resort to executing commands from multiple frames to keep all their cores busy, but it's causing problems with interactivity. For virtual reality we've already exceeded what's acceptable, and touch interfaces also prefer very low input-to-response latencies. APIs and drivers are being rewritten for low latency as we speak, mobile CPUs have adopted out-of-order execution, and GPUs will have to lower their latencies further as well.
There are people saying that GPUs need to be more like CPUs to survive: GPUs need to utilize ILP better, and GPUs need to run less parallel programs (fewer threads) better. But if you look at both NVIDIA's and AMD's latest architectures, you see a change in the opposite direction.
AMD's new GCN architecture needs at least three times as many threads as their old VLIW architecture to run efficiently. VLIW exploited ILP (through static analysis), while the new architecture is a pure DLP architecture: it doesn't extract any ILP. AMD has made the GPUs even wider since (you need a considerably bigger problem to fill an R9 290X compared to a 7970).
NVIDIA made similar changes with Kepler. Fermi had double-clocked shader cores with dynamic hardware instruction scheduling (a limited form of OoO). The new Kepler architecture instead relies on static (compile time) scheduling and halved the shader core clocks. The new architecture needs over twice as many threads (more parallelism) as the old one to be fully utilized.
So it seems that GPUs are still not wide enough to hit the scaling wall. Both companies are adding more and more compute units and at the same time making architectural choices that need even wider problems to fill even the existing cores efficiently.
And it's not hard to see why. 720p = 0.9M pixels. 1080p = 2M pixels (2.25x). 2560x1600 (a popular Android tablet and high-end PC monitor resolution) = 4M pixels (4.44x). 4096x2160 (the 4K video standard, coming to all future TV sets) = 8.8M pixels (9.6x). A year ago all console games still rendered at 720p (or lower). Now most games render at 1080p on consoles, and tablet games are starting to sport Retina resolutions (our Trials Frontier, for example, runs at native Retina resolution on iPad). Next-gen consoles will surely support 4K, and many PC players are already enjoying 2560x1600. Almost 10x growth in pixel count compared to the base 720p resolution over three GPU generations should be enough to provide the needed extra parallelism. What happens after that, nobody knows. Some people are already talking about 8K. It has benefits for movie theaters already, but my personal opinion is that 4K is enough for all current home use as long as the content is properly antialiased. But who knows whether displays and tables will be fully covered with display ink in the future. Devices like that would need obscene resolutions to look good at close distance.
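The ratios above, spelled out as a quick sanity check (plain C++, using 1280x720 as the baseline):

```cpp
#include <cstdio>

int main()
{
    const double base = 1280.0 * 720.0;                          // ~0.92M pixels
    std::printf("1080p:     %.2fx\n", 1920.0 * 1080.0 / base);   // 2.25x
    std::printf("2560x1600: %.2fx\n", 2560.0 * 1600.0 / base);   // 4.44x
    std::printf("4K:        %.2fx\n", 4096.0 * 2160.0 / base);   // 9.60x
}
```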
Ideas have also been thrown out (in B3D threads) that the CPU and GPU could share a big pool of vector execution units. I don't see this as a realistic future direction, since a greater distance between units means more electricity used to transfer the data and the commands, and it also adds latency. One thing NVIDIA did with Kepler was to replicate some hardware in multiple cores instead of sharing it. This cost some die area, but according to NVIDIA it both improved performance and reduced power usage. This is also one of the problems in AMD's Bulldozer vector unit design: the shared unit (between two cores) has much higher latencies than Intel's units (one unit per core). I also expect the data transfer cost to be higher compared to Intel's design.
I'm actually observing some of that happening already. CPUs are very competitive at generic computing, and will become even more competitive with AVX-512. GPU manufacturers are thus not able to monetize their GPGPU efforts, and therefore they focus more on pure graphics again. Ironically, games are doing ever more generic computing, which will require greater versatility again. So CPUs and GPUs are inevitably on a collision course, which can only result in unification.
Ironically, right now I would be perfectly happy if GPU rasterizers reverted to fixed-function pixel processing. It's the vertex processing and compute shaders that I need.