The console future in 3 words: HP The Machine. Google it... ;-) /boom
If there's a market for gamers who upgrade their PCs annually or semi-annually, then there's probably a market for such consoles as well. They would just need to be forward and backward compatible.
I don't think anybody needs any Kool-Aid to understand that ARM-based chips are currently selling more than all other chips combined, especially in consumer-oriented devices. Software development nowadays takes more time than hardware development.
That may be possible. The Anandtech diagram showed a bus leading off from the L2.
I don't know. The only 64-bit model currently announced (I6400) can be configured with up to 6 cores and up to 4 threads per core. Maybe there's a client scaling limit for the coherency protocols or for the L2 cache. AMD provided Sony and Microsoft with CPUs that have two 4-core Jaguar modules (with no shared cache). A similar configuration would yield 12 cores + 24 threads (assuming 2 threads per core).
Without more data, it wouldn't be clear how far you could take optimizing the ARM code before it became a mis-optimization. The ARM front-end is 2-wide, so having optimal ARM instructions for a wider standard target would not yield the expected results in non-optimized mode. The optimized code comes from an optimizer whose exact invocation and output would probably not be deterministic, and the output may not be readily visible.
I didn't actually mean that Denver provides lower-level access. I meant that Sony developers are used to optimizing code at quite a low level. Denver's wide in-order design definitely benefits from code that is specially optimized for that CPU (and generally from code that minimizes cache misses). Not having full control over the actual processor-level code might obviously cause grey hairs for some console programmers.
As the A8, Denver, and the ARM server chips show, a non-custom core is bordering on inadequate already.
Denver is currently NVIDIA's only 64-bit CPU. The Tegra K1 model with four Cortex-A15 cores is 32-bit. Obviously they could integrate a 64-bit core from ARM, but I believe NVIDIA would prefer selling their own CPU IP instead.
Intel's desktop chips, with the exception of graphics, have handily beaten the current console space for years. Their cores are unquestionably far more advanced, and while the GPUs themselves are not as impressive, the level of cache subsystem integration is actually far better than what AMD has managed. AMD is in turn still leading ARM in this space. In terms of memory controller design and on-die connections, they are vastly better than AMD, and Intel is on the verge of switching to a new generation of interconnect just as ARM and friends start talking up a ring bus.
Intel's focus doesn't match console designs that well.
Anything latency-sensitive or which might be starved of resources if graphics demand spikes, anything that does not vectorize to batches of 64, 256, or more.
Some developers are also moving physics simulation to the GPU. The question becomes: which high-performance game tasks remain on the CPU? If Intel sees a threat coming from the GPU direction, they might act.
The core count could drop by a factor of 2-4 and it still would probably handily defeat vanilla ARM cores, which beat MIPS.
But I doubt Intel would be willing to sell their big EDRAM dies and 16-core CPUs at a price point suitable for consumer boxes.
Low-level (instruction-level) optimizations (except for vector instructions and some important intrinsics) are no longer done. Optimizing the memory layout is the most important thing for performance. Game engines tend to use programming patterns that utilize most of the data in the fetched cache lines, make access patterns more linear or easier to prefetch, and reduce hard-to-predict branches (example: an SoA-style component/entity model). Data is often processed in larger batches, reducing instruction cache misses. Heavy OoO machinery doesn't help code like this nearly as much as it helps pointer-indirection- and branch-heavy productivity software (office software, web browsers, etc.). A simpler in-order core (such as Denver) would be better suited for optimized code (higher performance per watt, since OoO machinery eats a lot of power). Obviously nobody wants to go back to the in-order PPC era. Denver seems to be a good compromise between full-blown OoO and pure in-order designs (for running game program code).
Without more data, it wouldn't be clear how far you could take optimizing the ARM code before it became a mis-optimization. The ARM front-end is 2-wide, so having optimal ARM instructions for a wider standard target would not yield the expected results in non-optimized mode. The optimized code comes from an optimizer whose exact invocation and output would probably not be deterministic, and the output may not be readily visible.
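To make the memory-layout point quoted above concrete, here is a minimal sketch (hypothetical particle data, not from any shipping engine) contrasting an array-of-structures layout with the SoA-style layout described: the SoA hot loop touches only the fields it needs, so nearly every byte of each fetched cache line is useful and the access pattern is linear and easy to prefetch.

```cpp
#include <cstddef>
#include <vector>

// AoS: updating positions drags whole ~48-byte records through the cache,
// even though only position and velocity are needed.
struct ParticleAoS {
    float px, py, pz;
    float vx, vy, vz;
    float color[4];
    float lifetime, size;
};

void update_aos(std::vector<ParticleAoS>& ps, float dt) {
    for (auto& p : ps) {
        p.px += p.vx * dt;
        p.py += p.vy * dt;
        p.pz += p.vz * dt;
    }
}

// SoA: positions and velocities sit in dense linear arrays, so the hot loop
// uses every byte it fetches and vectorizes easily; cold data (color,
// lifetime, size) never pollutes the cache during this pass.
struct ParticlesSoA {
    std::vector<float> px, py, pz;
    std::vector<float> vx, vy, vz;
    std::vector<float> lifetime, size;
};

void update_soa(ParticlesSoA& ps, float dt) {
    const std::size_t n = ps.px.size();
    for (std::size_t i = 0; i < n; ++i) {
        ps.px[i] += ps.vx[i] * dt;
        ps.py[i] += ps.vy[i] * dt;
        ps.pz[i] += ps.vz[i] * dt;
    }
}
```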
I fully agree with you that Intel has all the technology needed to power a next-generation console. I don't doubt that. Their GPUs were slightly under-powered (and lacked some features) when the current generation launched, but they have already been shown to match AMD in both performance AND performance/watt on laptops. Intel is advancing rapidly, and their GPUs will be among the best when the next gen launches. Intel doubles the performance of their GPUs almost every generation, while keeping the power consumption practically the same (shrinks of course help them).
Intel's desktop chips, with the exception of graphics, have handily beaten the current console space for years. Their cores are unquestionably far more advanced, and while the GPUs themselves are not as impressive, the level of cache subsystem integration is actually far better than what AMD has managed. AMD is in turn still leading ARM in this space. In terms of memory controller design and on-die connections, they are vastly better than AMD, and Intel is on the verge of switching to a new generation of interconnect just as ARM and friends start talking up a ring bus.
The big x86 cores may have trouble fully reaching the power range of the best vanilla ARM cores, but that is not a problem in this space.
Current gen consoles already have similar bandwidth to a Xeon Phi board. Next gen needs at least double the bandwidth to distinguish itself from current gen. I haven't seen a console yet that had enough bandwidth.
This may be higher cost now, but a console would be able to go with a fraction of the bandwidth and capacity Intel is putting into Phi.
Yes obviously. But the time-consuming CPU tasks in current games tend to be: graphics setup, animation, physics simulation, collision detection, particle simulation, particle animation, cloth simulation (in games that do not do this on the GPU), pathfinding and AI. Basically everything other than AI is embarrassingly parallel. Crowd AI can be made parallel quite easily. And developers tend to hate too good AI (as the player should feel superior, not inferior to the AI).
Anything latency-sensitive or which might be starved of resources if graphics demand spikes, anything that does not vectorize to batches of 64, 256, or more.
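As a minimal sketch of the "embarrassingly parallel" point above (a hypothetical Agent type and standard C++17 parallel algorithms, not any particular engine's job system), a crowd update where every agent reads and writes only its own state splits cleanly across however many cores are available:

```cpp
#include <algorithm>
#include <cmath>
#include <execution>
#include <vector>

struct Agent { float x, y, tx, ty, speed; };  // position, target, movement speed

// Each agent steers toward its own target and touches only its own state,
// so the loop parallelizes with no locking or shared writes.
void step_agents(std::vector<Agent>& agents, float dt) {
    std::for_each(std::execution::par_unseq, agents.begin(), agents.end(),
                  [dt](Agent& a) {
                      const float dx = a.tx - a.x;
                      const float dy = a.ty - a.y;
                      const float len = std::sqrt(dx * dx + dy * dy);
                      if (len > 1e-6f) {
                          a.x += (dx / len) * a.speed * dt;
                          a.y += (dy / len) * a.speed * dt;
                      }
                  });
}
```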
Yes... I hate this already. I wish someday I could write to the page mapping tables using compute shaders.
Paging goes to the CPU, and modifying the page tables is currently a global stall.
A Haswell core has around 1.5x the IPC of the best ARM cores (Cyclone v2). It's fair to say that a hyperthreaded Haswell core equals two ARM cores at the same clock speed. If you are running the Haswell at 3 GHz and the ARM at 2 GHz, a single Haswell core is equal to 3 ARM cores. If generation 8.5 devices (a new Nintendo console + a new Apple console) are equipped with 8 ARM cores (or equivalent MIPS cores) at 2 GHz, a 6-core 3 GHz Skylake would definitely be enough for PS5 (as it would provide roughly 3x the CPU performance).
The core count could drop by a factor of 2-4 and it still would probably handily defeat vanilla ARM cores, which beat MIPS.
Intel also has much higher R&D costs than most companies. They need to ask for higher margins to pay for the R&D on both their chips and their foundries. It's not cheap to be the leader. I just don't believe they would be willing to sell a chip worth $600 (a super-high-end laptop model with a big integrated GPU and L4 cache) cheap enough to fit it into a $400 gaming console.
Intel's cost structure is very competitive because of its manufacturing integration, so it becomes a question of whether it wants to dip into the lower-margin space, not whether it can turn a profit. By dint of its manufacturing prowess alone, it can gain major efficiencies, and if the foundries continue to have trouble delivering on schedule, it won't matter if Intel's prices are poorer on a per-mm2 basis if it's able to transition cleanly so that chips are smaller overall. It's not a guarantee of success, but a lot of questions for Intel are not about whether it can do it (it probably can, and far better in many ways) but whether it can be compelled to go cheap.
The sentence I had after what you quoted went into that. Like you said, it wouldn't matter if Denver were OoO or not, as you would still do it. This leaves all the other code optimization techniques that Denver will actively apply itself, so I feel it gives less opportunity for low-level optimization because it would get in the way of the intermediary software layer. If there's a specific end result desired, the art will be found in goading a black box into spitting out the code.
Low-level (instruction-level) optimizations (except for vector instructions and some important intrinsics) are no longer done. Optimizing the memory layout is the most important thing for performance.
A compiler/optimizer could benefit, since gcc is one of the SPEC benchmarks that has been considered unbroken the longest.
Heavy OoO machinery doesn't help code like this nearly as much as it helps pointer-indirection- and branch-heavy productivity software (office software, web browsers, etc.).
I do not think OoO and software optimization are mutually exclusive. The pioneering work on Dynamo was fine with an OoO core roughly as complex as Jaguar, despite being from an era of single cores and very poor caches.
A simpler in-order core (such as Denver) would be better suited for optimized code (higher performance per watt, since OoO machinery eats a lot of power). Obviously nobody wants to go back to the in-order PPC era. Denver seems to be a good compromise between full-blown OoO and pure in-order designs (for running game program code).
Xeon Phi supports 8-16 GB of on-package HMC, with 500 GB/s of bandwidth. It demonstrates Intel's comfort with package-level integration of stacked memory, and future HMC short-reach connections should double or triple the bandwidth, should an interposer-based standard like HBM not come into use. Pair that with Intel's memory subsystem, and going by the leaked latency numbers for a miss to Durango's ESRAM, a fraction of a Xeon Phi successor's HMC system could yield an L3/L4 cache with latency similar to or lower than the Xbox One's ESRAM, more bandwidth, and gigabytes of capacity.
Current gen consoles already have similar bandwidth to a Xeon Phi board.
I read this several times and kept missing the "too" qualifier there. I've seen enough games where it's true either way.
And developers tend to hate too good AI (as the player should feel superior, not inferior to the AI).
I've made a distinction between custom ARM cores and vanilla ones that are ARM's own cores.
A Haswell core has around 1.5x the IPC of the best ARM cores (Cyclone v2).
I see it mostly as the other way around. They have greater revenues to pour into R&D, and they can do so across a vast array of disciplines.
Intel also has much higher R&D costs than most companies. They need to ask for higher margins to pay for the R&D on both their chips and their foundries.
Even when it can cut things by 3-4x and still come out ahead? What if the competition falls an extra node behind, and is more expensive?
I just don't believe they would be willing to sell a chip worth $600 (a super-high-end laptop model with a big integrated GPU and L4 cache) cheap enough to fit it into a $400 gaming console.
If this were true, Microsoft wouldn't have discussed the pervasive clock gating for Durango.
This is not perfectly well suited for a console, since console games are taxing the CPU and the GPU all the time, every frame, and often from the beginning to the end of the frame.
Core- or chip-level management is just one component.
There is no idling when waiting for the next user command (like in browsers and productivity/professional software), and there is no idling at the end of the frames.
When the Jaguar was launched, some hardware review site compared a (4-core) Jaguar-based laptop against an Ultrabook. The Ultrabook had a 1.7 GHz Intel dual-core (+HT) ULV CPU, while the Jaguar was running at 1.6 GHz. Based on many single-threaded benchmarks they concluded that the Intel CPU had almost 4x higher IPC. The problem with this conclusion was that the Intel ULV part turbos to 3.0 GHz and is able to reach maximum turbo clocks most of the time in GPU-light single-threaded applications. It's absolutely true that the Intel CPU is 4x faster in these applications, but it certainly does not have 4x higher IPC (1.5x+ is a more realistic number when HT is not used).
It'd be interesting to see how Intel sans turbo clocks would fare against AMD on the same process.
An optimized console compiler could directly produce Denver microcode (or some higher-level intermediate byte code). If I remember correctly, the Transmeta CPU could run multiple different ISAs simultaneously (they demonstrated Java bytecode running natively, intermixed with x86). The same CPU would be able to run all Android software (and the OS) using the ARM instruction set while simultaneously running games targeting it directly (or an optimized intermediate byte code).
Denver really should have, and I think probably does have, methods for facilitating loop unrolling, software pipelining, memory speculation, trace scheduling, complex branch removal and hinting, transactional memory, predication, and accelerated rollback.
The ARMv8 architecture, which an intrepid low-level developer would use in order to nudge the optimizer, does not.
Obviously the CPU should support advanced power-saving features such as shutting down the front end when a loop is detected (Jaguar does this also). Most time-critical code is always inside loops, so this optimization is always profitable. I do agree with you that some lightweight OoO machinery would help, since it's impossible to predict the memory subsystem latencies on the software side (without knowing the data). But something as complex as Haswell is not needed, unless the software is badly optimized for the CPU, or the CPU is generally running software containing too many pointer indirections. Haswell also has hyperthreading to hide memory-based stalls. It has lots of tools to run bad code. I am just questioning whether all these tools are necessary for a cost-optimized consumer device that runs tightly quality-controlled software specially designed for it. Nobody is willing to go back to the old in-order era, but there are alternatives to massive OoO designs that could provide a good balance of power efficiency and cost efficiency (and raw processing power, of course).
Why require that OoO be heavyweight? A tight loop can hit a loop detector or fit in a loop buffer, then rely on the renamer and reordering logic to unroll as much as the hardware can support.
The highest bandwidth I could find in current products was 352 GB/s (GDDR5) in the $4235.00 flagship model (7120A). HMC will obviously change things radically in the future, but will it be cheap enough for mainstream chips that must be sold at less than $200 in the next 3 years? I fully expect Intel to be among the first to introduce HMC in their consumer chips. However, I expect them to introduce it first in the enthusiast segment, not the mainstream segment.
Xeon Phi supports 8-16 GB of on-package HMC, with 500 GB/s of bandwidth.
The competition is already fierce in the high-end mobile market (flagship phones selling for $600-$700). People care about the performance of these devices. Vanilla ARM cores are not enough if you want to compete in this market. Qualcomm, Nvidia and Apple already know this. If you don't improve your chips constantly, the big players will order their chips elsewhere for their flagship models, and soon you are only fighting for the scraps (cheap $100-$200 phone models that provide very little profit). There is a lot of money in this business. I am sure these companies understand that they need to spend enough of it on R&D to maintain their market share in the future as well.
If you go with Apple's customized core, it shows how being competitive requires more thoroughly engineered implementations, but it also only applies to a hypothetical Apple console.
Apple has always customized their SoCs heavily. They have traditionally had at least twice as much memory bandwidth as their competition. But the situation is rapidly improving. There are now several Android phones on the market that are not starved to death by a lack of memory bandwidth.
There are extremely limited IO and interface choices for the constrained mobile platforms. Even the consoles are terrible with this, and Intel has good tech here, too.
Yes, Intel could easily cut the CPU core count by 3x-4x, if we are talking about their current 18-core, $2000+ server processors (18/3 = 6). A six-core Intel CPU would be enough for a high-end generation 9 console. However, at the same time Intel would need to scale up their GPUs drastically. Around a 2x scale-up is needed just to match the current generation consoles. You'd want around an 8x-10x GPU performance difference between generations. A true generation 9 console would thus need a lot of scaling up.
Even when it can cut things by 3-4x and still come out ahead? What if the competition falls an extra node behind, and is more expensive?
I am not talking about utilizing every part of the chip. I am talking about actually running something on all the CPU cores (even bad code that stalls because of LLC misses). PC productivity software writers seem to be happy to drop tens of frames every time I do something simple (click a UI button, type something in Word) that shouldn't have taken more than a few milliseconds to compute.
It is physically impossible to exercise every unit in a core, every portion of a cache, every bus unit, and every IO driver. Even so, using a significant fraction of the chip unnecessarily can blow the power budget.
Solid 100% utilization is obviously impossible, since the game must present frames at (preferably) constant intervals. The tasks done in a single frame need to be planned according to the frame budget. Obviously it's better to slightly underutilize the CPU than to drop frames every time something unexpected happens in the game. For example, in our latest game the frame rate was analyzed by automated tools, and level designers had to remove enough stuff to make every area a constant 60 fps. In areas where there was extra CPU/GPU time available, they could add extra decorations to make it look better. This kind of optimization process guarantees that at least a few CPU cores (or the GPU) are fully maxed in every area. Obviously this kind of level optimization doesn't guarantee that all the CPU cores are fully utilized, since no engine has 100% perfect CPU load balancing (and the GPU can be the bottleneck as well).
I've seen enough dev profiling slides from Sucker Punch, Guerrilla, and others for CPU and GPU usage where they weren't solid bars of utilization.
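A minimal sketch of the kind of frame-budget check described above (hypothetical thresholds and reporting, not the automated tooling the post actually refers to): measure the CPU time spent per frame in each area and flag where the 60 fps budget is exceeded, or where there is headroom left for extra decoration.

```cpp
#include <chrono>
#include <cstdio>

// 60 fps leaves ~16.67 ms per frame; keep a small safety margin so an
// unexpected spike doesn't immediately drop a frame.
constexpr double kFrameBudgetMs  = 1000.0 / 60.0;
constexpr double kSafetyMarginMs = 1.5;

struct FrameTimer {
    std::chrono::steady_clock::time_point start;
    void begin() { start = std::chrono::steady_clock::now(); }
    double elapsed_ms() const {
        return std::chrono::duration<double, std::milli>(
                   std::chrono::steady_clock::now() - start).count();
    }
};

// Called once per frame in a profiling build: reports whether the current
// area is over budget (designers trim content) or has headroom (designers
// can add decoration).
void report_frame(const char* area, double cpu_ms) {
    if (cpu_ms > kFrameBudgetMs - kSafetyMarginMs)
        std::printf("[%s] OVER BUDGET: %.2f ms of %.2f ms\n",
                    area, cpu_ms, kFrameBudgetMs);
    else
        std::printf("[%s] %.2f ms, %.2f ms of headroom\n",
                    area, cpu_ms, kFrameBudgetMs - cpu_ms);
}
```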
I know. Intel has done great work in this area for Ivy Bridge and Haswell. I do agree that power-saving technology also helps code that is using all the cores. However, it helps more with code that is filled with longer execution holes (such as waiting for IO or mutexes) and/or cache misses causing memory stalls. These give the CPU more opportunities to save power (to drop to a deeper power-saving state).
Managing power can be done at nanosecond scales with pervasive clock gating. Modern clock and voltage scaling can shift in microseconds at a module or core level.
Do you mean having a user-space compiler that gets completely optimized, and when that happens its output goes into the optimization cache? The cache has no permanence, and the code would write to locations outside of the optimization cache.
An optimized console compiler could directly produce Denver microcode (or some higher-level intermediate byte code). If I remember correctly, the Transmeta CPU could run multiple different ISAs simultaneously (they demonstrated Java bytecode running natively, intermixed with x86).
My apologies, I got muddled with the marketing names and forgot about the GDDR5 models already out there. It would be the upcoming Xeon Phi II or whatever name it gets: Knights Landing, to use the codename.
The highest bandwidth I could find in current products was 352 GB/s (GDDR5) in the $4235.00 flagship model (7120A).
If people are discussing HBM, which is stacked DRAM plus an interposer as a cost adder, HMC would also be in a similar cost range. It may not be enough for the full memory pool, as it isn't for Knights Landing, which is why Intel has made it possible to treat it as a separate memory area or as a last-level cache.
HMC will obviously change things radically in the future, but will it be cheap enough for mainstream chips that must be sold at less than $200 in the next 3 years?
The phones do, but not their components. Intel sells CPUs priced higher than their phones, which is why it can funnel so much money back into developing them. I see this as a significant long-term challenge, since ARM chips are beyond the performance levels where low-hanging fruit is left to pick.
The competition is already fierce in the high-end mobile market (flagship phones selling for $600-$700).
If they have the incoming revenue to spend, sure.
There is a lot of money in this business. I am sure these companies understand that they need to spend enough of it on R&D to maintain their market share in the future as well.
There is a decent level of overlap between HPC machines that like GPUs and consoles.
There has even been talk about generation 8 being the last console generation. Is it wise to put your money and research into a market that might die, if you are already participating in markets that are growing and bring much larger profits per device?
That would be why Intel provides base clocks along with its turbo grades. The only time that changes is when it is trying to sell tablet/mobile-level chips against the mobile chip makers, who give max clocks that the phone or tablet OSes throttle almost perpetually below.
Having a program (a game) actually run something continuously on all the CPU cores stresses the CPU a lot more than running this kind of (wake -> sleep) software.
There are elements below that level that allow for power management. Sections of SIMDs don't need to be used all the time, such as when there are ADDs but no MULs, unneeded full-width data paths, the lookup tables for transcendentals, or when lanes are predicated off, banking conflicts, etc.
With multiple kernels running simultaneously, it is also much easier to utilize most of the GPU bandwidth and most of the excess ALU capacity when the rendering is bound by the fixed-function units (such as ROPs, depth buffering, the front end and triangle setup).