Predict: Next gen console tech (9th iteration and 10th iteration edition) [2014 - 2017]

What about getting rid of the CPU entirely? :runaway:

Could they use GPU compute for all CPU tasks instead? Would that be doable and reasonably efficient?
I know that this has already been answered very well days ago, but I would like to add a no to the chorus for emphasis.

Maybe the GPU is less efficient than a CPU for CPU tasks, but async compute (and other GPU unified-pipeline features) could automatically optimize the use of all available resources.
It is also not just a matter of efficiency. In terms of latency, responsiveness, and consistency there are tasks that GPUs currently are simply unacceptable for.
For an actual modern system, with all the layers of management, security, permissions, protection, reliability, and any number of metrics that are important for data that is not discardable every 1/30 or 1/60th of a second, or where errors/faults are literal system killers, GPUs are flat out not allowed the opportunity to louse things up.
This has come up in some of the discussion over Kaveri and Carrizo on the Linux kernel mailing list, where apparently Kaveri's GPU could at least theoretically write to kernel memory if the driver/firmware allowed it (it does not), whereas Carrizo is physically hardwired to forbid it.
With the current console setup, and the foundational work for HSA and AMD's compute, the accelerators are distinct from and subject to the Host processor, although the host is able to do the work of the accelerator as it sees fit.

In that regard, there is a very important category of work that the GPU cannot do, and may never be suited for. While it may be tempting to think such infrastructural issues can be elided from the problem space of compute, HPC has a history of revealing system overheads that eat into performance, so you want some kind of compute client that exists in the Host space that is strong in its own right.
At the same time, the complexity and unforgiving nature of that space may make it too constraining to apply to every compute client.

There is also, I think, an interesting gap in philosophy when it comes to who or what should be in charge of these compute resources.
AMD is of the mind that the compute side's scheduling hardware can autobalance and manage things on its own (at least eventually), without the OS kernel and by extension the Host domain being involved except at a higher policy-setting level. The other side is that the kernel should have the necessary visibility and authority over that work, in order to protect the system from accelerator-side code doing something stupid or dangerous.
AMD's more heavily abstracted philosophy, if it has any longevity (with "it" being either the idea or AMD), may keep the two sides more distinct than the more exposed approach would.
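To make the contrast concrete, here is a toy C++ sketch of the two submission paths; none of this corresponds to a real driver or HSA API, it just illustrates the difference between an application writing straight into a user-mode queue that the accelerator's own scheduler drains, and every command crossing into the OS for validation.

```cpp
#include <cstddef>
#include <cstdio>
#include <stdexcept>

// Toy illustration only; none of this corresponds to a real driver API.
struct Command { int type; std::size_t dst; std::size_t bytes; };

// AMD/HSA-leaning model: the application enqueues directly into a user-mode
// queue and the accelerator's scheduling hardware picks the work up. The OS
// kernel is only involved up front (creating the queue, setting limits).
void usermode_submit(const Command& c, Command* queue, std::size_t& write_index) {
    queue[write_index++ % 16] = c;   // no per-command kernel involvement
}

// Kernel-mediated model: every command is validated by the host OS, which can
// refuse work that would touch memory the accelerator shouldn't see.
void kernel_submit(const Command& c, std::size_t allowed_limit) {
    if (c.dst + c.bytes > allowed_limit)
        throw std::runtime_error("rejected: command reaches past the allowed region");
    std::printf("kernel validated and queued command of type %d\n", c.type);
}

int main() {
    Command ring[16];
    std::size_t wr = 0;
    usermode_submit({0, 0x1000, 256}, ring, wr);   // fast path, trusted hardware scheduler
    kernel_submit({0, 0x1000, 256}, 0x100000);     // slower path, host keeps authority
}
```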

GPUs just aren't that good for serialized code, and not everything can be parallelized. But AMD is trying to paint a simpler future by already describing their APUs as having "12 compute units", of which 4 are CPU cores and 8 are GPU/GCN "cores".
That's a simpler present for AMD's marketing people. It's a bigger number on the ads for a product that has a dwindling number of superlatives that can legitimately be given to it.
The APU is not unified enough that they can exist in the same categorization. HSA might promise a hypothetical future where they could, and even then I think it could only be done with some significant caveats, but no AMD product that is for sale even has that.

As GPU compute moves forward and improves, beyond the current (and perhaps even near-future) GPU compute capability, will GPU designs ever reach a point where they can efficiently cover the serial and complex code-processing demands of current CPUs? Essentially, will GPUs and CPUs ever converge into a sort of hybrid design, and if so, how far away are we from that?
I think one problem with this dichotomy is that a CPU core is a more distinct entity than a GPU. Modern SOCs are complex beasts, but a CPU can at least be pointed to as a single thing for critical functionality. There is a physical core you say is a CPU, where your game, compute, OS, or secure code can run.
Modern GPUs, and future GPUs with better compute and QoS, present at a high level a somewhat unified entity, but they are composed of dozens of processors and sub-processors that handle different portions of the GPU's operation. This excludes explicit acceleration blocks like UVD or TrueAudio.
A CPU core handles much of this on its own, or at least can if it must.

For AMD, the command processor is at least 2 simple processors, each ACE is at least one, and there are multiple DMA engines, each being at least one.
AMD patents for things like context switching or preemption indicate there being some kind of management unit for that, which will likely involve microcode for the compute units and possibly another processor. Then there are the CUs, which I am assuming are close enough to be cores in their own right.
It might come down to what happens with the previously mentioned rift in philosophy as to whether this can someday boil down to a more defined set of self-supporting processors. Any one of them could be done by a host CPU, but just replacing one of them in isolation does not seem to yield a genuine benefit.

I think much of what is the "GPU" in our minds that might be unified with a CPU is the result of a confluence of economics, legacy, engineering, and physical considerations that don't quite figure into the conceptual idea of unification. Once things reached the level of being Turing-complete on the GPU side, the conceptual unification was already there.
In terms of specialization and the practical realities of developing complex parallel architectures, I see a need for some level of specialization at a physical and implementation level for probably the lifespan of silicon VLSI as the dominant basis of our devices.

The idea with HSA is that at some point the GPU units will be so close to the CPU that they become some kind of decoupled floating-point unit
If that's all the GPU becomes, I am not sure HSA is appropriate. A floating point coprocessor model has come before, in the form of x87. HSA includes so much in terms of a virtual ISA, discoverability, queue language, memory, and programming model that you don't need if the accelerator drops to a subsidiary unit of the host.
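As a rough illustration of how much machinery that involves, here is the kind of information an HSA-style dispatch packet has to carry before the "coprocessor" can even start; the field names below are paraphrased from the public HSA material for illustration, not the exact spec layout.

```cpp
#include <cstdint>

// Rough sketch of what an HSA-style kernel dispatch packet carries; the names
// are paraphrased for illustration and are not the exact HSA spec layout.
struct DispatchPacketSketch {
    std::uint16_t header;                // packet type, barrier bit, acquire/release fence scopes
    std::uint16_t dimensions;            // 1D/2D/3D launch
    std::uint16_t workgroup_size[3];     // work-items per group in each dimension
    std::uint32_t grid_size[3];          // total work-items in each dimension
    std::uint32_t private_segment_size;  // per-work-item scratch memory
    std::uint32_t group_segment_size;    // shared (LDS) memory per group
    std::uint64_t kernel_object;         // handle to code finalized from the virtual ISA
    std::uint64_t kernarg_address;       // kernel arguments in shared virtual memory
    std::uint64_t completion_signal;     // signal the packet processor fires when done
};

// A plain x87-style coprocessor needs none of this; it just consumes
// instructions from the host's own stream.
static_assert(sizeof(DispatchPacketSketch) > 0, "illustrative struct");

int main() { return 0; }
```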
 
Why can't CPU instructions be pipelined into a wider design like a GPU? Instead of 2 to 4 integer ALUs and floating-point units, you use a section of a compute unit in the GPU for it? There then wouldn't need to be a CPU core, just a decoder to pipeline CPU uops to a GPU compute unit. Would this be possible in the future?
 

That could be done, physically, but the instructions themselves are not really what I see as the problem, although there are a lot of CPU instructions that would not appear in the GPU ISA used for compute, and the command model used by the GPU front end and its ancillary units is not the same.
The GPU ISA is, however, very different, so it would be an extensive effort in itself to support raw CPU instruction decode and issue.

The actual semantics of the CPU ISA would not magically work just because the GPU can decode the instructions, and the necessary behaviors for the system in terms of memory model and system protection would not fall out of having a versatile decoder.
At which point, you could send the code to the GPU with no guarantee that the outcome would match the CPU's, or that a failure of the protection model would not provide a vector for malicious code or a loss of system integrity. The level of forgiveness for errors in things like the virtual memory system, the OS, and privilege levels is extremely low.
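A small C++11 example of what "semantics do not fall out of the decoder" means in practice: the producer/consumer pattern below is only correct because the release/acquire pair imposes the ordering the programmer relies on. On a strongly ordered CPU the naive version happens to work; on a device with a weaker memory model the ordering has to be stated explicitly, which is exactly the kind of behavior that does not come for free with decode.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Illustration only: correctness here depends on the memory model, not on
// whether the instructions can be decoded. The release/acquire pair is what
// guarantees the consumer sees data == 42 once it sees ready == true.
std::atomic<int>  data{0};
std::atomic<bool> ready{false};

void producer() {
    data.store(42, std::memory_order_relaxed);
    ready.store(true, std::memory_order_release);   // publish
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) { /* spin until published */ }
    assert(data.load(std::memory_order_relaxed) == 42);  // guaranteed by acquire/release
}

int main() {
    std::thread p(producer), c(consumer);
    p.join();
    c.join();
}
```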

Something could be hacked into the compute domain, but the GPUs we have would still not be permitted to do all of the actions.
Things like paging in data from disk can happen with the latest partially-resident resource features, but arbitrary access to the system at large does not get done by the GPU.
 


If I could abstract this out a bit for my own benefit: could we see the heart of the GPU as being/becoming a compute array that could become almost a commodity of sorts, with lots of it pushed onto the die?

How about a situation where the GPU could have a "front-end", so to speak, that allows the compute array to function with better security and performance, ARM-based or whatever? Something that can deal with the most valuable subset of a CPU's functionality and span the data needs of a GPU? When general CPU behavior is needed it can act to enable that behavior from the compute array; otherwise it stays out of the way. Lots of die space and complexity of course, and it needs a ring bus?
 

Conceptually, that is close to how GPUs can handle data that needs to be paged into memory. The wavefront signals to the control hardware what non-resident data it needs, stalls, and a subroutine gets run on the CPU. Parts of the signalling would be handled by the existing control hardware, which is a collection of somewhat more conventional cores. The kind of processor it presents to the outside world is that of a slave device, which HSA doesn't seem to change.
The idea of transferring specific functions to disparate units is something that has come up for some proposals for self-professed manycore designs that try to differentiate themselves more from multicores.
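A toy C++ model of that flow (nothing here is real driver or HSA code, just the shape of the hand-off): the "wavefront" posts what it needs and stalls, and a host-side service thread pages the data in and wakes it.

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <unordered_set>

// Toy model of the flow described above, not real driver code: a "wavefront"
// that touches non-resident data posts a request and stalls, and a host-side
// service routine pages the data in and wakes it up.
struct PageRequest { int page; };

std::mutex m;
std::condition_variable cv;
std::queue<PageRequest> requests;          // wavefront -> host signalling
std::unordered_set<int> resident_pages;    // what the "GPU" may currently touch
bool done = false;

void host_service_thread() {               // stand-in for the CPU subroutine
    std::unique_lock<std::mutex> lk(m);
    while (!done) {
        cv.wait(lk, [] { return done || !requests.empty(); });
        while (!requests.empty()) {
            PageRequest r = requests.front();
            requests.pop();
            resident_pages.insert(r.page);  // "page in" the data
            std::printf("host: paged in page %d\n", r.page);
        }
        cv.notify_all();                    // wake any stalled wavefronts
    }
}

void wavefront(int page) {                  // stand-in for a stalled wavefront
    std::unique_lock<std::mutex> lk(m);
    if (resident_pages.count(page) == 0) {
        requests.push({page});              // signal what it needs
        cv.notify_all();
        cv.wait(lk, [&] { return resident_pages.count(page) > 0; });  // stall
    }
    std::printf("wavefront: touched page %d\n", page);
}

int main() {
    std::thread host(host_service_thread);
    std::thread w1(wavefront, 3), w2(wavefront, 7);
    w1.join();
    w2.join();
    { std::lock_guard<std::mutex> lk(m); done = true; }
    cv.notify_all();
    host.join();
}
```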

Maybe there are complications with bumping that up to appearing to be a host. The simple cores would be very cheap and validated designs, with a focus on being simple and consistent in performance. Large cores could be cost-prohibitive, and may be high-performance with the possibility of stretches of time where they perform more poorly, which may not be acceptable for a processor serving the GPU at large.

There may be concerns that the compute side, if it is allowed to more directly tie to a host core, would be a way to leak data that shouldn't leave the host domain.
 
I see quite a few people predicting 10 TFLOPs performance. I would love that, but I feel it will be 6 - 8 TFLOPs. I feel the total power draw of the console won't go over 130 watts.
If VR becomes as big as it could, then I think 10+ TFLOPs becomes possible, with a TDP of 200 watts, maybe more.

A generational step in the console industry is generally a 10x difference in raw FLOPs (probably a gross oversimplification). However, the step from the 7th gen to this one was less than that, i.e. more like 7-8x, due to the reduction in silicon and TDP budgets overall with the move to an APU-based design. The intention was clearly to ensure profitability of the HW (or close to it) at launch, instead of the usual loss-leading on HW for the first 2+ years.

The PS4 boasts what? 1.8 TFLOPs... So if the PS5 doesn't boast at least 18 TFLOPs, then Sony and MS should just wait longer until the technology is available to enable it. The longer they wait, the more likely they will be able to move to a d/l-only console model that will ensure the death of the second-hand games market... they absolutely have to throw publishers a bone if they don't want pubs to abandon consoles wholesale for greener pastures.
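For what it's worth, the arithmetic behind those figures, taking the commonly cited ~0.24 TFLOP/s for the Xbox 360 GPU and ~1.84 TFLOP/s for the PS4 as rough numbers rather than exact spec values:

```cpp
#include <cstdio>

// Back-of-the-envelope numbers for the generational-step argument above.
// 0.24 TFLOP/s (Xbox 360 GPU) and 1.84 TFLOP/s (PS4 GPU) are the commonly
// cited figures, used here as approximations rather than exact spec values.
int main() {
    const double gen7_tflops = 0.24;
    const double gen8_tflops = 1.84;

    std::printf("7th -> 8th gen step: about %.1fx\n", gen8_tflops / gen7_tflops);        // ~7.7x
    std::printf("A 10x step from the PS4 would be about %.1f TFLOP/s\n", gen8_tflops * 10.0);
}
```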
 

I think at 12-15 TFLOP/s it will be good, but the biggest innovation next generation will be stacked memory. It will be interesting because I think memory will be more and more the bottleneck, not the ALUs...
 

What do you mean by bottleneck? In terms of manufacturing and/or silicon budget, or do you mean in terms of TDP?

Do stacked mem solutions consume more power?

Honestly though, I'm not sure increased memory bandwidth alone will bring with it the significant step changes in game innovation that the console industry needs. I'd be keen for something that can provide a much more radical shift in graphics rendering or existing game technology paradigms. Other than something like a radically new CPU-GPU design, and just plain ol' more of the same ALU performance, what is there that can enable new rendering techniques and/or algorithms?
 
If the CPU and the GPU are going to unify, I believe a good next step would be to remove the GPU command processor and let the CPU cores directly spawn the waves/warps/etc to the array of compute units. Obviously this needs shared caches between the CPU and GPU, full coherence, and efficient fine-grained synchronization. Intel is almost there already with Broadwell.

Intel was (a long time ago) performing vertex shaders on the CPU. The CPU would be more suited to do the command processor's tasks. This would obviously allow us to do crazy stuff that is not possible with the current designs. And it would at the same time sidestep all the IO/security problems.
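A minimal single-producer/single-consumer sketch of that idea in C++ (a toy, not any real hardware interface): the "CPU" writes wave descriptors straight into a shared queue that a "CU" polls, which only works because of exactly the shared, coherent memory and fine-grained ordering mentioned above.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdio>
#include <thread>

// Toy sketch, not any real hardware or driver interface: the CPU writes wave
// descriptors straight into a shared ring that a compute unit polls, with no
// command processor in between. Coherent shared memory plus acquire/release
// ordering is what makes this kind of fine-grained hand-off work.
struct WaveDescriptor { int kernel_id; int workgroup; };

constexpr std::size_t kSlots = 64;
std::array<WaveDescriptor, kSlots> ring;
std::atomic<std::size_t> head{0};   // advanced by the producer (CPU)
std::atomic<std::size_t> tail{0};   // advanced by the consumer ("CU")

bool cpu_spawn_wave(const WaveDescriptor& d) {            // producer side
    std::size_t h = head.load(std::memory_order_relaxed);
    if (h - tail.load(std::memory_order_acquire) == kSlots)
        return false;                                     // ring is full
    ring[h % kSlots] = d;
    head.store(h + 1, std::memory_order_release);         // publish the descriptor
    return true;
}

void cu_worker(int waves_expected) {                      // consumer side
    int executed = 0;
    while (executed < waves_expected) {
        std::size_t t = tail.load(std::memory_order_relaxed);
        if (t == head.load(std::memory_order_acquire))
            continue;                                     // nothing queued yet, spin
        WaveDescriptor d = ring[t % kSlots];
        tail.store(t + 1, std::memory_order_release);     // slot may be reused
        std::printf("CU executed kernel %d, workgroup %d\n", d.kernel_id, d.workgroup);
        ++executed;
    }
}

int main() {
    std::thread cu(cu_worker, 8);
    for (int wg = 0; wg < 8; ++wg)
        while (!cpu_spawn_wave({1, wg})) { /* ring full, retry */ }
    cu.join();
}
```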
 

I am sure if you ask devs what they want more of for PS4/XB1, it is bandwidth... Read for example the Ubisoft presentation: the compute shader bottleneck is most of the time bandwidth... The ROPs are not fully exploited because the bandwidth bottleneck comes first...

I never said you don't need more ALU power, but HBM will be great progress.
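A quick illustration of why bandwidth is the pressure point. The PS4 figures (~1.84 TFLOP/s, ~176 GB/s) are the usual launch specs; the "next gen" numbers are assumptions purely for the sake of the comparison.

```cpp
#include <cstdio>

// Bytes available per FLOP, which is what the bandwidth argument comes down to.
// The PS4 figures (~1.84 TFLOP/s, ~176 GB/s GDDR5) are the commonly quoted
// specs; the next-gen numbers are assumptions purely for illustration.
int main() {
    const double ps4_flops = 1.84e12, ps4_bw = 176e9;
    const double next_flops = 12e12,  next_bw = 1e12;    // hypothetical 12 TFLOP/s with ~1 TB/s HBM

    std::printf("PS4:               %.3f bytes per FLOP\n", ps4_bw / ps4_flops);
    std::printf("Hypothetical next: %.3f bytes per FLOP\n", next_bw / next_flops);
    // Even with stacked memory, the bytes-per-FLOP ratio does not really improve,
    // which is why compute-heavy passes keep hitting the memory wall first.
}
```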
 

:mrgreen: i like the sounds of that!!

Are we talking, say, eating meatballs in Chilli crazy? Or serious "snakes on a motha f'in plane" crazy? :runaway::D
 
I can only see them hitting 10 TFLOPs if 10nm is ready, so if it's another 7-year gen then yes, but I think Sony is looking at a 6-year cycle, with the hardware starting manufacturing in year 5.
 

I wouldn't be surprised if this was an 8-10 year gen tbh. A 10 TFLOP console wouldn't be good enough 6 years from now, imho. Perhaps 14-18 TFLOPs with a high-capacity / high-bandwidth stacked memory solution after 7 years would be a good enough step change. But I'm personally more intrigued by the idea of a radical design change, honestly.

If next-gen sees another resolution change to 2 or 4k, then say goodbye to consoles, as games will literally look exactly the same but at a higher res.

All I care about is that we finally get enough processing power to achieve this fabled "next-gen gameplay" that seems to have remained so elusive this gen.
 
I don't think next-gen gameplay is a hardware issue anymore. In fact, I think the only reason for those powerful machines is VR. I feel once we get to 5 TFLOPs+, the quality of the game studios and the size of the budgets start having more of an impact than the hardware power. I could be talking crap though, probably am.
 
If you measure from platform sunrise to sunset, previous PlayStations were 8-10 year ecosystems, but if Sony's console project history is any indication, it's fair to assume that there is already a group inside Sony deep into planning PlayStation 5.

I think the key difference this time is that two key pieces of technology are shifting drastically, rather than just getting smaller and faster, so nobody really knows what technology will be viable for wide-scale production because the economics are up in the air. New processes are risky. We have stacked RAM on the horizon, and Intel is predicting that going below 10nm (their next target is 7nm) will require a shift from silicon to another material [Ars Technica], which they've not worked out yet, so that's another unknown.
 
Samsung says it is on track for 7nm and sees no problem for 5nm.
That can happen in the timeframe of the next console generation, but what comes after is still up in the air. So what happens when you build your console on the cutting edge, but the new process remains too unstable for years to be economical?
 