The console future in 3 words: HP The Machine. Google it... ;-) /boom
If there's a market for gamers who upgrade their PCs annually or semi-annually, then there's probably a market for such consoles as well. They would just need to be forward and backward compatible.
I don't think anybody needs any Kool-Aid to understand that ARM-based chips are currently selling more than all other chips combined, especially in consumer-oriented devices. Software development nowadays takes more time than hardware development.
That may be possible. The Anandtech diagram showed a bus leading off from the L2.
I don't know. The only 64-bit model currently announced (I6400) can be configured with up to 6 cores and up to 4 threads per core. Maybe there's a client scaling limit for the coherency protocols or for the L2 cache. AMD provided Sony and Microsoft with CPUs that have two 4-core Jaguar modules (with no shared cache). A similar configuration would yield 12 cores + 24 threads (assuming 2 threads per core).
Without more data, it wouldn't be clear how far you could take optimizing the ARM code before it became a mis-optimization. The ARM front-end is 2-wide, so having optimal ARM instructions for a wider standard target would not yield the expected results in non-optimized mode. The optimized code comes from an optimizer whose exact invocation and output would probably not be deterministic, and the output may not be readily visible.
I didn't actually mean that Denver provides lower-level access. I meant that Sony developers are used to optimizing code at quite a low level. Denver's wide in-order design definitely benefits from code that is specially optimized for that CPU (and generally from code that minimizes cache misses). Not having full control over the actual processor-level code might obviously cause grey hairs for some console programmers.
As the A8, Denver, and the ARM server chips show, a non-custom core is bordering on inadequate already.
Denver is currently NVIDIA's only 64-bit CPU. The Tegra K1 model with four Cortex-A15 cores is 32-bit. Obviously they could integrate a 64-bit core from ARM, but I believe NVIDIA would prefer selling their own CPU IP instead.
Intel's desktop chips, with the exception of graphics, have handily beaten the current console space for years. Their cores are unquestionably far more advanced, and while the GPUs themselves are not as impressive, the level of cache subsystem integration is actually far better than what AMD has managed. AMD is in turn still leading ARM in this space. In terms of memory controller design and on-die connections, they are vastly better than AMD, and Intel is on the verge of switching to a new generation of interconnect just as ARM and friends start talking up a ring bus.
Intel's focus doesn't match console designs that well.
Anything latency-sensitive or which might be starved of resources if graphics demand spikes, anything that does not vectorize to batches of 64, 256, or more.
Some developers are also moving physics simulation to the GPU. The question becomes: which high-performance game tasks remain on the CPU? If Intel sees a threat coming from the GPU direction, they might act.
The core count could drop by a factor of 2-4 and it still would probably handily defeat vanilla ARM cores, which beat MIPS.
But I doubt Intel would be willing to sell their big EDRAM dies and 16-core CPUs at a price point suitable for consumer boxes.
Low-level (instruction-level) optimizations (except for vector instructions and some important intrinsics) are no longer done. Optimizing the memory layout is the most important thing for performance. Game engines tend to use programming patterns that utilize most of the data in the fetched cache lines, make access patterns more linear or easier to prefetch, and reduce hard-to-predict branches (example: an SoA-style component/entity model). Data is often processed in larger batches, reducing instruction cache misses. Heavy OoO machinery doesn't help code like this nearly as much as it helps pointer-indirection- and branch-heavy productivity software (office software, web browsers, etc.). A simpler in-order core (such as Denver) would be better suited for optimized code (higher performance per watt, since OoO machinery eats a lot of power). Obviously nobody wants to go back to the in-order PPC era. Denver seems to be a good compromise between full-blown OoO and pure in-order designs (for running game program code).
Without more data, it wouldn't be clear how far you could take optimizing the ARM code before it became a mis-optimization. The ARM front-end is 2-wide, so having optimal ARM instructions for a wider standard target would not yield the expected results in non-optimized mode. The optimized code comes from an optimizer whose exact invocation and output would probably not be deterministic, and the output may not be readily visible.
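To make the memory-layout point quoted above concrete, here is a minimal sketch (hypothetical particle data, not from any shipping engine) contrasting an array-of-structures layout with the SoA-style layout described: the SoA hot loop touches only the fields it needs, so nearly every byte of each fetched cache line is useful and the access pattern is linear and easy to prefetch.

```cpp
#include <cstddef>
#include <vector>

// AoS: updating positions drags whole ~48-byte records through the cache,
// even though only position and velocity are needed.
struct ParticleAoS {
    float px, py, pz;
    float vx, vy, vz;
    float color[4];
    float lifetime, size;
};

void update_aos(std::vector<ParticleAoS>& ps, float dt) {
    for (auto& p : ps) {
        p.px += p.vx * dt;
        p.py += p.vy * dt;
        p.pz += p.vz * dt;
    }
}

// SoA: positions and velocities sit in dense linear arrays, so the hot loop
// uses every byte it fetches and vectorizes easily; cold data (color,
// lifetime, size) never pollutes the cache during this pass.
struct ParticlesSoA {
    std::vector<float> px, py, pz;
    std::vector<float> vx, vy, vz;
    std::vector<float> lifetime, size;
};

void update_soa(ParticlesSoA& ps, float dt) {
    const std::size_t n = ps.px.size();
    for (std::size_t i = 0; i < n; ++i) {
        ps.px[i] += ps.vx[i] * dt;
        ps.py[i] += ps.vy[i] * dt;
        ps.pz[i] += ps.vz[i] * dt;
    }
}
```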
I fully agree with you that Intel has all the technology needed to power a next-generation console. I don't doubt that. Their GPUs were slightly under-powered (and lacked some features) when the current generation launched, but they have already been shown to match AMD in both performance AND performance/watt on laptops. Intel is advancing rapidly, and their GPUs will be among the best when the next gen launches. Intel doubles the performance of their GPUs almost every generation, while keeping the power consumption practically the same (shrinks of course help them).
Intel's desktop chips, with the exception of graphics, have handily beaten the current console space for years. Their cores are unquestionably far more advanced, and while the GPUs themselves are not as impressive, the level of cache subsystem integration is actually far better than what AMD has managed. AMD is in turn still leading ARM in this space. In terms of memory controller design and on-die connections, they are vastly better than AMD, and Intel is on the verge of switching to a new generation of interconnect just as ARM and friends start talking up a ring bus.
The big x86 cores may have trouble fully reaching the power range of the best vanilla ARM cores, but that is not a problem in this space.
Current gen consoles already have similar bandwidth to a Xeon Phi board. Next gen needs at least double the bandwidth to distinguish itself from current gen. I haven't seen a console yet that had enough bandwidth.
This may be higher cost now, but a console would be able to go with a fraction of the bandwidth and capacity Intel is putting into Phi.
Yes obviously. But the time-consuming CPU tasks in current games tend to be: graphics setup, animation, physics simulation, collision detection, particle simulation, particle animation, cloth simulation (in games that do not do this on the GPU), pathfinding and AI. Basically everything other than AI is embarrassingly parallel. Crowd AI can be made parallel quite easily. And developers tend to hate too good AI (as the player should feel superior, not inferior to the AI).
Anything latency-sensitive or which might be starved of resources if graphics demand spikes, anything that does not vectorize to batches of 64, 256, or more.
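As a minimal sketch of the "embarrassingly parallel" point above (a hypothetical Agent type and standard C++17 parallel algorithms, not any particular engine's job system), a crowd update where every agent reads and writes only its own state splits cleanly across however many cores are available:

```cpp
#include <algorithm>
#include <cmath>
#include <execution>
#include <vector>

struct Agent { float x, y, tx, ty, speed; };  // position, target, movement speed

// Each agent steers toward its own target and touches only its own state,
// so the loop parallelizes with no locking or shared writes.
void step_agents(std::vector<Agent>& agents, float dt) {
    std::for_each(std::execution::par_unseq, agents.begin(), agents.end(),
                  [dt](Agent& a) {
                      const float dx = a.tx - a.x;
                      const float dy = a.ty - a.y;
                      const float len = std::sqrt(dx * dx + dy * dy);
                      if (len > 1e-6f) {
                          a.x += (dx / len) * a.speed * dt;
                          a.y += (dy / len) * a.speed * dt;
                      }
                  });
}
```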
Yes... I hate this already. I wish someday I could write to the page mapping tables using compute shaders.
Paging goes to the CPU, and modifying the page tables is currently a global stall.
A Haswell core has around 1.5x the IPC of the best ARM cores (Cyclone v2). It's fair to say that a hyperthreaded Haswell core equals two ARM cores at the same clock speed. If you are running the Haswell at 3 GHz and the ARM at 2 GHz, a single Haswell core is equal to 3 ARM cores. If generation 8.5 devices (a new Nintendo console + a new Apple console) are equipped with 8 ARM cores (or equivalent MIPS cores) at 2 GHz, a 6-core 3 GHz Skylake would definitely be enough for PS5 (as it would provide roughly 3x the CPU performance).
The core count could drop by a factor of 2-4 and it still would probably handily defeat vanilla ARM cores, which beat MIPS.
Intel also has much higher R&D costs than most companies. They need to ask for higher margins to pay for the R&D on both their chips and their foundries. It's not cheap to be the leader. I just don't believe they would be willing to sell a chip worth $600 (a super-high-end laptop model with a big integrated GPU and L4 cache) cheap enough to fit it into a $400 gaming console.
Intel's cost structure is very competitive because of its manufacturing integration, so it becomes a question of whether it wants to dip into the lower-margin space, not whether it can turn a profit. By dint of its manufacturing prowess alone, it can gain major efficiencies, and if the foundries continue to have trouble delivering on schedule, it won't matter if Intel's prices are poorer on a per-mm2 basis if it's able to transition cleanly so that chips are smaller overall. It's not a guarantee of success, but a lot of questions for Intel are not about whether it can do it (it probably can, and far better in many ways) but whether it can be compelled to go cheap.
The sentence I had after what you quoted went into that. Like you said, it wouldn't matter if Denver were OoO or not, as you would still do it. This leaves all the other code optimization techniques that Denver will actively apply itself, so I feel it gives less opportunity for low-level optimization because it would get in the way of the intermediary software layer. If there's a specific end result desired, the art will be found in goading a black box into spitting out the code.
Low-level (instruction-level) optimizations (except for vector instructions and some important intrinsics) are no longer done. Optimizing the memory layout is the most important thing for performance.
A compiler/optimizer could benefit, since gcc is one of the SPEC benchmarks that has been considered unbroken the longest.
Heavy OoO machinery doesn't help code like this nearly as much as it helps pointer-indirection- and branch-heavy productivity software (office software, web browsers, etc.).
I do not think OoO and software optimization are mutually exclusive. The pioneering work on Dynamo was fine with an OoO core roughly as complex as Jaguar, despite being from an era of single cores and very poor caches.
A simpler in-order core (such as Denver) would be better suited for optimized code (higher performance per watt, since OoO machinery eats a lot of power). Obviously nobody wants to go back to the in-order PPC era. Denver seems to be a good compromise between full-blown OoO and pure in-order designs (for running game program code).
Xeon Phi supports 8-16 GB of on-package HMC, with 500 GB/s of bandwidth. It demonstrates Intel's comfort with package-level integration of stacked memory, and future HMC short-reach connections should double or triple the bandwidth, should an interposer-based standard like HBM not come into use. Pair that with Intel's memory subsystem, and going by the leaked latency numbers for a miss to Durango's ESRAM, a fraction of a Xeon Phi successor's HMC system could yield an L3/L4 cache with latency similar to or lower than the Xbox One's ESRAM, more bandwidth, and gigabytes of capacity.
Current gen consoles already have similar bandwidth to a Xeon Phi board.
I read this several times and kept missing the "too" qualifier there. I've seen enough games where it's true either way.
And developers tend to hate too good AI (as the player should feel superior, not inferior to the AI).
I've made a distinction between custom ARM cores and vanilla ones that are ARM's own cores.
A Haswell core has around 1.5x the IPC of the best ARM cores (Cyclone v2).
I see it mostly as the other way around. They have greater revenues to pour into R&D, and they can do so across a vast array of disciplines.
Intel also has much higher R&D costs than most companies. They need to ask for higher margins to pay for the R&D on both their chips and their foundries.
Even when it can cut things by 3-4x and still come out ahead? What if the competition falls an extra node behind, and is more expensive?
I just don't believe they would be willing to sell a chip worth $600 (a super-high-end laptop model with a big integrated GPU and L4 cache) cheap enough to fit it into a $400 gaming console.
If this were true, Microsoft wouldn't have discussed the pervasive clock gating for Durango.
This is not perfectly well suited for a console, since console games are taxing the CPU and the GPU all the time, every frame, and often from the beginning to the end of the frame.
Core- or chip-level management is just one component.
There is no idling when waiting for the next user command (like in browsers and productivity/professional software), and there is no idling at the end of the frames.
When the Jaguar was launched, some hardware review site compared a (4-core) Jaguar-based laptop against an Ultrabook. The Ultrabook had a 1.7 GHz Intel dual-core (+HT) ULV CPU, while the Jaguar was running at 1.6 GHz. Based on many single-threaded benchmarks they concluded that the Intel CPU had almost 4x higher IPC. The problem with this conclusion was that the Intel ULV part turbos to 3.0 GHz and is able to reach maximum turbo clocks most of the time in GPU-light single-threaded applications. It's absolutely true that the Intel CPU is 4x faster in these applications, but it certainly does not have 4x higher IPC (1.5x+ is a more realistic number when HT is not used).
It'd be interesting to see how Intel sans turbo clocks would fare against AMD on the same process.
An optimized console compiler could directly produce Denver microcode (or some higher-level intermediate byte code). If I remember correctly, the Transmeta CPU could run multiple different ISAs simultaneously (they demonstrated Java bytecode running natively, intermixed with x86). The same CPU would be able to run all Android software (and the OS) using the ARM instruction set while simultaneously running games targeting it directly (or an optimized intermediate byte code).
Denver really should have, and I think probably does have, methods for facilitating loop unrolling, software pipelining, memory speculation, trace scheduling, complex branch removal and hinting, transactional memory, predication, and accelerated rollback.
The ARMv8 architecture, which an intrepid low-level developer would use in order to nudge the optimizer, does not.
Obviously the CPU should support advanced power-saving features such as shutting down the front end when a loop is detected (Jaguar does this also). Most time-critical code is always inside loops, so this optimization is always profitable. I do agree with you that some lightweight OoO machinery would help, since it's impossible to predict the memory subsystem latencies on the software side (without knowing the data). But something as complex as Haswell is not needed, unless the software is badly optimized for the CPU, or the CPU is generally running software containing too many pointer indirections. Haswell also has hyperthreading to hide memory-based stalls. It has lots of tools to run bad code. I am just questioning whether all these tools are necessary for a cost-optimized consumer device that runs tightly quality-controlled software specially designed for it. Nobody is willing to go back to the old in-order era, but there are alternatives to massive OoO designs that could provide a good balance of power efficiency and cost efficiency (and raw processing power, of course).
Why require that OoO be heavyweight? A tight loop can hit a loop detector or fit in a loop buffer, then rely on the renamer and reordering logic to unroll as much as the hardware can support.
The highest bandwidth I could find in current products was 352 GB/s (GDDR5) in the $4235.00 flagship model (7120A). HMC will obviously change things radically in the future, but will it be cheap enough for mainstream chips that must be sold at less than $200 in the next 3 years? I fully expect Intel to be among the first to introduce HMC in their consumer chips. However, I expect them to introduce it first in the enthusiast segment, not the mainstream segment.
Xeon Phi supports 8-16 GB of on-package HMC, with 500 GB/s of bandwidth.
The competition is already fierce in the high-end mobile market (flagship phones selling for $600-$700). People care about the performance of these devices. Vanilla ARM cores are not enough if you want to compete in this market. Qualcomm, Nvidia and Apple already know this. If you don't improve your chips constantly, the big players will order their chips elsewhere for their flagship models, and soon you are only fighting for the scraps (cheap $100-$200 phone models that provide very little profit). There is a lot of money in this business. I am sure these companies understand that they need to spend enough of it on R&D to maintain their market share in the future as well.
If you go with Apple's customized core, it shows how being competitive requires more thoroughly engineered implementations, but it also only applies to a hypothetical Apple console.
Apple has always customized their SoCs heavily. They have traditionally had at least twice as much memory bandwidth as their competition. But the situation is rapidly improving. There are now several Android phones on the market that are not starved to death by a lack of memory bandwidth.
There are extremely limited IO and interface choices for the constrained mobile platforms. Even the consoles are terrible with this, and Intel has good tech here, too.
Yes, Intel could easily cut the CPU core count by 3x-4x, if we are talking about their current 18-core, $2000+ server processors (18/3 = 6). A six-core Intel CPU would be enough for a high-end generation 9 console. However, at the same time Intel would need to scale up their GPUs drastically. Around a 2x scale-up is needed just to match the current generation consoles. You'd want around an 8x-10x GPU performance difference between generations. A true generation 9 console would thus need a lot of scaling up.
Even when it can cut things by 3-4x and still come out ahead? What if the competition falls an extra node behind, and is more expensive?
I am not talking about utilizing every part of the chip. I am talking about actually running something on all the CPU cores (even bad code that stalls because of LLC misses). PC productivity software writers seem to be happy to drop tens of frames every time I do something simple (click a UI button, type something in Word) that shouldn't have taken more than a few milliseconds to compute.
It is physically impossible to exercise every unit in a core, every portion of a cache, every bus unit, and every IO driver. Even so, using a significant fraction of the chip unnecessarily can blow the power budget.
Solid 100% utilization is obviously impossible, since the game must present frames at (preferably) constant intervals. The tasks done in a single frame need to be planned according to the frame budget. Obviously it's better to slightly underutilize the CPU than to drop frames every time something unexpected happens in the game. For example, in our latest game the frame rate was analyzed by automated tools, and level designers had to remove enough stuff to make every area a constant 60 fps. In areas where there was extra CPU/GPU time available, they could add extra decorations to make it look better. This kind of optimization process guarantees that at least a few CPU cores (or the GPU) are fully maxed in every area. Obviously this kind of level optimization doesn't guarantee that all the CPU cores are fully utilized, since no engine has 100% perfect CPU load balancing (and the GPU can be the bottleneck as well).
I've seen enough dev profiling slides from Sucker Punch, Guerrilla, and others for CPU and GPU usage where they weren't solid bars of utilization.
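A minimal sketch of the kind of frame-budget check described above (hypothetical thresholds and reporting, not the automated tooling the post actually refers to): measure the CPU time spent per frame in each area and flag where the 60 fps budget is exceeded, or where there is headroom left for extra decoration.

```cpp
#include <chrono>
#include <cstdio>

// 60 fps leaves ~16.67 ms per frame; keep a small safety margin so an
// unexpected spike doesn't immediately drop a frame.
constexpr double kFrameBudgetMs  = 1000.0 / 60.0;
constexpr double kSafetyMarginMs = 1.5;

struct FrameTimer {
    std::chrono::steady_clock::time_point start;
    void begin() { start = std::chrono::steady_clock::now(); }
    double elapsed_ms() const {
        return std::chrono::duration<double, std::milli>(
                   std::chrono::steady_clock::now() - start).count();
    }
};

// Called once per frame in a profiling build: reports whether the current
// area is over budget (designers trim content) or has headroom (designers
// can add decoration).
void report_frame(const char* area, double cpu_ms) {
    if (cpu_ms > kFrameBudgetMs - kSafetyMarginMs)
        std::printf("[%s] OVER BUDGET: %.2f ms of %.2f ms\n",
                    area, cpu_ms, kFrameBudgetMs);
    else
        std::printf("[%s] %.2f ms, %.2f ms of headroom\n",
                    area, cpu_ms, kFrameBudgetMs - cpu_ms);
}
```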
I know. Intel has done great work in this area for Ivy Bridge and Haswell. I do agree that power-saving technology also helps code that is using all the cores. However, it helps more with code that is filled with longer execution holes (such as waiting for IO or mutexes) and/or cache misses causing memory stalls. These give the CPU more opportunities to save power (to drop to a deeper power-saving state).
Managing power can be done at nanosecond scales with pervasive clock gating. Modern clock and voltage scaling can shift in microseconds at a module or core level.
Do you mean having a user-space compiler that gets completely optimized, and when that happens its output goes into the optimization cache? The cache has no permanence, and the code would write to locations outside of the optimization cache.
An optimized console compiler could directly produce Denver microcode (or some higher-level intermediate byte code). If I remember correctly, the Transmeta CPU could run multiple different ISAs simultaneously (they demonstrated Java bytecode running natively, intermixed with x86).
My apologies, I got muddled with the marketing names and forgot about the GDDR5 models already out there. It would be the upcoming Xeon Phi II or whatever name it gets: Knights Landing, to use the codename.
The highest bandwidth I could find in current products was 352 GB/s (GDDR5) in the $4235.00 flagship model (7120A).
If people are discussing HBM, which is stacked DRAM plus an interposer as a cost adder, HMC would also be in a similar cost range. It may not be enough for the full memory pool, as it isn't for Knights Landing, which is why Intel has made it possible to treat it as a separate memory area or as a last-level cache.
HMC will obviously change things radically in the future, but will it be cheap enough for mainstream chips that must be sold at less than $200 in the next 3 years?
The phones do, but not their components. Intel sells CPUs priced higher than their phones, which is why it can funnel so much money back into developing them. I see this as a significant long-term challenge, since ARM chips are beyond the performance levels where low-hanging fruit is left to pick.
The competition is already fierce in the high-end mobile market (flagship phones selling for $600-$700).
If they have the incoming revenue to spend, sure.
There is a lot of money in this business. I am sure these companies understand that they need to spend enough of it on R&D to maintain their market share in the future as well.
There is a decent level of overlap between HPC machines that like GPUs and consoles.
There has even been talk about generation 8 being the last console generation. Is it wise to put your money and research into a market that might die, if you are already participating in markets that are growing and bring much larger profits per device?
That would be why Intel provides base clocks along with its turbo grades. The only time that changes is when it is trying to sell tablet/mobile-level chips against the mobile chip makers, who give max clocks that the phone or tablet OSes throttle almost perpetually below.
Having a program (a game) actually run something continuously on all the CPU cores stresses the CPU a lot more than running this kind of (wake -> sleep) software.
There are elements below that level that allow for power management. Sections of SIMDs don't need to be used all the time, such as when there are ADDs but no MULs, unneeded full-width data paths, the lookup tables for transcendentals, or when lanes are predicated off, banking conflicts, etc.
With multiple kernels running simultaneously, it is also much easier to utilize most of the GPU bandwidth and most of the excess ALU capacity when the rendering is bound by the fixed-function units (such as ROPs, depth buffering, the front end and triangle setup).