Predict: The Next Generation Console Tech

I still think they are beating a dead horse with their cluster/CMT approach; the premise was wrong, and big cores are getting nowhere.

What do you mean by the bold part? I think the problem with Piledriver is mostly single-thread performance*. Intel seems to be doing just fine with cores bigger than the ones in Piledriver. In my opinion, the problem with CMT is that, for a fixed die area, fat cores with SMT are, for most workloads, better than smaller cores with CMT.
- Single-thread high-IPC workloads can use all the core resources with the SMT approach, but they can't with the CMT approach.
- In multi-threaded workloads, it's going to be a toss-up: CMT requires more die space but provides a higher increase in throughput than SMT (rough numbers in the sketch below). Some workloads currently suffer on Piledriver because only 4 decoders are available per module, but this is going to change with Steamroller. Also, the instruction cache is quite small, but it is going to be larger in Steamroller.

Once shared resources are split (instruction decoders) or enlarged (instruction cache), one has to wonder whether the die savings of CMT versus separate cores, even for multi-thread workloads, are worth the trouble. Especially since SMT is much cheaper in terms of die area and still provides decent improvements for many workloads. Maybe AMD will eventually pull off a "Pentium M" and go back to a K10-derived architecture in the future. :)

*There are other issues with Piledriver, especially the power consumption, but I don't think they are in any way related to CMT.
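
To put rough numbers on the multi-threaded side of that trade-off, here is a quick back-of-the-envelope comparison. The extra-area and extra-throughput figures are only the ballpark numbers AMD and Intel have quoted in the past, used here purely as assumptions:

```python
# Back-of-the-envelope: multi-threaded throughput per unit of die area,
# normalized to a single plain core (no SMT, no CMT).
# The extra-area and extra-throughput figures are rough ballpark numbers
# quoted by AMD and Intel in the past; treat them as assumptions.

def throughput_per_area(extra_area, extra_throughput):
    """Throughput gained vs. area spent, relative to one plain core."""
    return (1.0 + extra_throughput) / (1.0 + extra_area)

smt_core   = throughput_per_area(extra_area=0.05, extra_throughput=0.25)  # SMT: ~5% area, ~25% gain
cmt_module = throughput_per_area(extra_area=0.12, extra_throughput=0.80)  # CMT: ~12% area, ~80% gain
two_cores  = throughput_per_area(extra_area=1.00, extra_throughput=1.00)  # a full second core

print(f"SMT core:   {smt_core:.2f}")    # ~1.19
print(f"CMT module: {cmt_module:.2f}")  # ~1.61
print(f"Two cores:  {two_cores:.2f}")   # 1.00
```

On those assumptions CMT does win on multi-threaded throughput per unit of area; the catch, as noted above, is that an SMT core can throw all of its resources at a single thread while a CMT core cannot.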
 
Well, one could argue that Cell (i.e., an ambitious CPU) could have been used to improve physics, AI and the scale of gameplay. However, it ended up being used mostly to aid RSX, even by first party developers. So I guess spending more of the transistor budget on the GPU makes sense, especially considering that GPGPU is now a much more viable option that it was back in 2005/2006.
Well, it did. I remember multiple DF pieces where the PS3 did not drop frames whereas the 360 did, and an increased graphics workload doesn't seem to be the reason, as a matter of fact.

As for GPGPU, I would not bet much on it; even Nvidia has made compromises on Kepler and has gone as far as stating this with regard to their embedded devices:
"Today's mobile apps do not take advantage of OCL (OpenCL), CUDA or Advanced OGL (OpenGL), nor are these APIs exposed in any OS. Tegra 4's GPU is very powerful and dedicates its resources toward improving real end user experiences."
Ok, you can do more on a closed platform, but that's it.

For the Cell... well, I guess Sony's hardware capabilities have weakened further since KK's comments on the matter. They gave up on it.
To some extent it is sad:
BC is thrown out of the window.
I guess a lot of tools / engines are too.

In a console, the Cell had space to shine. It could have gone into the PSV (not the PPE, but the SPEs are really low power, so at lower speed on a low-power process they could have been an option).
Sony could have had a solid baseline for its current products (PS3 and PSV) and grown it further with the PS4.
They could also have stuck with Nvidia, which still works on hardware that might share a lot of genes with the RSX for their Tegra line; the PSV could have consisted of a SoC with 1 or 2 low-power PPC cores, a few SPUs and an Nvidia GPU. Depending on the SPU power consumption, they could actually have used a conservative GPU, pretty much in charge of filling a g-buffer, with the bulk of the computations done on the SPUs.
The SPUs could have gone through a slight revision (which would have been used in both the Vita and the PS4), something "relatively simple" like making them wider. I'm thinking of making the SPUs 8-wide for FP while keeping them 4-wide for integer. Looking at the AVX units, which followed that path, it doesn't look like a massive hardware investment.

On IBM's 32nm process you could pack quite a few SPUs, 1 or 2 more capable CPUs (than a PPU) and an upgraded RSX (no need for a massive investment, as the SPUs would handle a lot of the computation).
They could have packed quite some power (nowhere near enthusiast expectations, though, from those who dream about next gen matching their SLI/CrossFire set-ups) into a reasonably sized SoC.

They would have had 3 products where code could have been shared, BC would be a given, etc.

It seems that they could not; it might have been above their capabilities (so you have to pay somebody else to do the job $$$$$), hence why they may have turned to AMD, which may have offered them a good deal to take care of both the CPU and GPU design. It is also something "standard" with ready-to-use tools, etc.
 
Well, one could argue that Cell (i.e., an ambitious CPU) could have been used to improve physics, AI and the scale of gameplay. However, it ended up being used mostly to aid RSX, even by first party developers. So I guess spending more of the transistor budget on the GPU makes sense, especially considering that GPGPU is now a much more viable option that it was back in 2005/2006.

It did improve physics and the scale of gameplay. Uncharted 3 had a ship on top of a water simulation, with objects inside the ship reacting accordingly. Motorstorm: Apocalypse had large-scale gameplay with large amounts of physics at 600 Hz (Cell did not help with graphics). Killzone 3 had very large-scale gameplay; some people even complained about the environment being too large. The A.I. was also improved over the Killzone 2 A.I. (it won an award for dynamic A.I.). This is just from memory; there are other examples. Are there any 3rd party examples of these things?

It seems it was doing both effectively, when used.

If the discrete GPU rumor is true, I could see them using the APU for a lot of those tasks.
 
Thanks. Isn't that a bit low then, since Xenon apparently has 115 GFLOPS? Is that wrong?

The graph provided by IBM to Forbes said 77 GFLOPs. And it makes sense, since the Xenon cores were almost identical to the PPU in the Cell processor. Xenon had a larger vector register file but less L2 cache per core.

"The 115.2 figure is the theoretical peak if you include non-arithmetic instructions such as permute. These are not normally included in any measure of FLOPs."
 
128 bits wide, which is what Jaguar has per core. Unless it's a Jaguar core customized to that extent, which would be pretty impressive but probably unlikely.
Why unlikely? MS got IBM to add VMX128 to the Xenon.
Aah, the wonder of Future Crew. We had long conversations on IRC with PSY and Trug about how they did some of those effects. About half that code is written in Turbo Pascal (well, technically it's just wrappers around "asm" statements). A number of those guys went on to form Remedy Entertainment, responsible for Max Payne and Alan Wake.
The flops capacitor is a bit of a strange thing...
Dude, if I were still at MS, I would totally make that its official name.
 
Why unlikely? MS got IBM to add VMX128 to the Xenon.

It seems quite costly, and to what extent are floating-point operations important in game code? On the PC it seems that it's mostly improvements in integer code execution on the CPU that improve performance in games. Now, I'm not sure how comparable game code on consoles and PCs is, but there has got to be something to that relation.
 
Built in gesture tech with AMD APUs

Upcoming AMD APUs codenamed “Richland” and “Temash” offer gesture control with accelerated performance in comparison to traditional CPUs
“Gesture recognition is becoming an important feature to be included in digital devices, the integration of our solution into AMD’s APUs brings this game-changing technology to the masses,” comments Gideon Shmuel, CEO, eyeSight

TEL AVIV, Israel--(BUSINESS WIRE)--Following hot on the heels of AMD’s product news from CES, eyeSight today announces that its leading gesture control technology has been integrated into AMD’s upcoming Accelerated Processing Unit (APU) platforms, “Richland” and “Temash”; intended primarily for desktop, laptop and tablet PCs. With eyeSight’s gesture recognition closely optimized and integrated into AMD Gesture Control, AMD’s “Richland” and “Temash” APU solutions are able to process gestures with optimal speed, accuracy and efficiency. With eyeSight’s gesture technology, AMD is providing an ideal solution to the snowballing demand for intuitive gesture-capabilities in both business and consumer devices.

http://www.businesswire.com/news/ho...ing-AMD-APUs-Feature-Built-in-Gesture-Control
 
That sounds more like AMD is providing optimized software along with its APUs. The chips don't have physical changes corresponding to the bundled software.
 
Yes it seems it's just optimized for the APUs.

AMD Gesture Control is designed to enable gesture recognition as a tool for controlling certain applications on your PC. Only available on upcoming AMD A10 and A8 APUs codenamed "Richland" and upcoming AMD A6 and A4 APUs codenamed "Temash.” Requires a web camera, and will only operate on PCs running Windows 7 or Windows 8 operating system. Supported Windows desktop apps include: Windows Media Player, Windows Photo Viewer, Microsoft PowerPoint and Adobe Acrobat Reader. Supported Windows Store apps include: Microsoft Photos, Microsoft Music, Microsoft Reader and Kindle.

Performance may be degraded in low lighting or intensely-focused lighting environments.

Not rave party game compatible with Illumiroom confirmed? :(
 
It did improve physics and the scale of gameplay. Uncharted 3 had a ship on top of a water simulation, with objects inside the ship reacting accordingly. Motorstorm: Apocalypse had large-scale gameplay with large amounts of physics at 600 Hz (Cell did not help with graphics). Killzone 3 had very large-scale gameplay; some people even complained about the environment being too large. The A.I. was also improved over the Killzone 2 A.I. (it won an award for dynamic A.I.). This is just from memory; there are other examples. Are there any 3rd party examples of these things?

It seems it was doing both effectively, when used.

If the discrete GPU rumor is true, I could see them using the APU for a lot of those tasks.

Well, I don't disagree with that. If you spend your transistor budget on the CPU, developers can do some cool things. However, most developers (not all, of course) might end up using the extra CPU performance to move tasks from the GPU to the CPU, like they did with Cell. Moreover, GPUs have become a lot more flexible in the last 8 years, and the trend seems to be the exact opposite, i.e. moving CPU tasks to the GPU. For some tasks this might not be desirable, but as long as the CPU is good enough, I guess spending a larger portion of the transistor budget on the GPU makes sense. If the 8 Jaguar + 10 CU rumor is true, about 20-25% of the die area would be spent on the CPU.
 
Why unlikely? MS got IBM to add VMX128 to the Xenon.

What customizations directly to the Jaguar core would be reasonable to expect and beneficial for the intended workloads? Might one of the alleged 3 customizations be in the Jaguar cores themselves? Is AVX2 at all reasonable, or would it make more sense to devote that silicon to a bigger GPU?

edit: Given the share of die size the current FP unit has, I highly doubt AVX2 support, but it's worth asking: http://www.xbitlabs.com/picture/?src=/images/news/2012-09/amd_jaguar_5.jpg
 
For the sake of completeness, the fastest AMD chip is the FX-8350, with a peak of 256 GFLOPs. I wonder if Steamroller or more likely Excavator will double the width of the FMA units and support AVX2.

With the sheer amount of power and heat AMD's FX CPUs kick out, you can count on the clock speed being at least halved for console use.
 
I guess there's the obvious Kinect2

As you say, the equally obvious eSRAM

Third, idk, HSA maybe. It makes sense in that it's new/different and will likely bring a nice performance boost.

Supposedly the eSRAM doesn't count as one...

So the only one I know "for sure" is the audio chip.

The talk of Jaguar optimizations might be something...
 
edit: Given the share of die size the current FP unit has, I highly doubt AVX2 support, but it's worth asking: http://www.xbitlabs.com/picture/?src=/images/news/2012-09/amd_jaguar_5.jpg

Keep in mind the whole core is just 3.1 mm^2. Also, I suppose the green area named "FP" actually contains:
- scheduler
- vector register file
- 2 x VALU
- 1 x VIMUL
- 1 x St. Conv.
- 1 x FPAdd
- 1 x FPMul
Replacing the FPAdd and the FPMul with two FMA units (similar to the ones in Piledriver) shouldn't add too much to the die area.
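
As a rough sense of what such a change would buy, here is a quick peak-rate sketch. It assumes the rumoured configuration from earlier in the thread (8 cores at 1.6 GHz, 128-bit units, single precision), none of which is confirmed:

```python
# Peak single-precision rate for a hypothetical 8-core Jaguar at 1.6 GHz
# (the rumoured configuration from earlier in the thread, not confirmed specs).

cores, clock_ghz, lanes = 8, 1.6, 4        # 128-bit units = 4 x FP32 lanes

# Stock Jaguar FP block: one 4-wide FPAdd + one 4-wide FPMul issued per cycle
stock_gflops = cores * clock_ghz * (lanes + lanes)       # 102.4 GFLOPs

# With the FPAdd/FPMul pair replaced by two 4-wide FMA units (2 flops per lane)
fma_gflops   = cores * clock_ghz * (2 * lanes * 2)       # 204.8 GFLOPs

print(f"Add + Mul: {stock_gflops:.1f} GFLOPs")
print(f"2 x FMA:   {fma_gflops:.1f} GFLOPs")
```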
 
With the sheer amount of power and heat AMD's FX CPUs kick out, you can count on the clock speed being at least halved for console use.

Well, according to most rumors, the CPUs of both consoles are most likely based on Jaguar, not on Piledriver/Steamroller. The performance per watt and die-area seem to make Jaguar a better choice, unless single-thread performance is considered very important (unlikely).

What customizations directly to the Jaguar core would be reasonable to expect and beneficial for the intended workloads?

TurboCore, if not present in Jaguar by default, probably makes sense to speed up lightly-threaded portions of the code. If the base clock is 1.6 GHz, Jaguar supposedly can be clocked quite a bit higher (2.0 GHz?).
 
With the sheer amount of power and heat AMD's FX CPUs kick out, you can count on the clock speed being at least halved for console use.

Depends on your power budget, but I wouldn't expect 1/2. Hell, given my 8350 runs at 4.6 GHz undervolted to 1.275 V and will run stock at 1.2 V (all IBT stable), there seems to be a fair bit of room to minimize heat just via binning.
 
The graph provided by IBM to Forbes said 77 GFLOPs. And it makes sense, since the Xenon cores were almost identical to the PPU in the Cell processor. Xenon had a larger vector register file but less L2 cache per core.

"The 115.2 figure is the theoretical peak if you include non-arithmetic instructions such as permute. These are not normally included in any measure of FLOPs."
If memory serves right, that figure comes from the fact that the Xenon core can do a dot product, which accounts for I don't know how many FLOPS over I don't know how many cycles.
Anyway, FMA is usually accepted as the measure of FLOPS, as it is a really common instruction.
 
Wait. That SEGA and Sony rumor regarding the same hardware... I believe that is entirely plausible. SEGA has used console hardware in the past, and I don't see a reason not to use Sony's machine for the arcade. It will be cheap enough (in terms of arcade pricing) and have powerful enough hardware to make awesome visuals. It allows SEGA to have direct access to Sony's new console hardware and makes for effortless porting. And the knowledge that comes with both arcade and home console markets will allow the company to learn the hardware faster. Then again, the same advantages would be had if they used MS's console. But I would think Sony's machine would be the basis for the new arcade hardware, given this being the 3rd time I've heard this rumor.

While this scenario sounds plausible, this "3rd" time was actually initiated, by the looks of it, by you...

Your post:

That rumor is bogus. I'm led to believe one new piece of arcade hardware from SEGA will be based off one of the two next-gen consoles, and AMD is providing the CPU & GPU. That, and it has no "ring" in its name.

prompted Texan=dikazz to create the "world exclusive report" about it two days later. To be honest, I don't care that much anymore what Texan does, or whether he is allowed to post etc., which is why I didn't immediately call dikazz out like I normally do :). I'll probably just ignore his doings from now on, but I'll still know it's him every time he shows up.
 
Keep in mind the whole core is just 3.1 mm^2. Also, I suppose the green area named "FP" actually contains:
- scheduler
- vector register file
- 2 x VALU
- 1 x VIMUL
- 1 x St. Conv.
- 1 x FPAdd
- 1 x FPMul
Replacing the FPAdd and the FPMul with two FMA units (similar to the ones in Piledriver) shouldn't add too much to the die area.

Still, if that portion grows, the core loses its rectangular shape. I don't know enough about VLSI design to know how much of a ripple effect that has on other blocks, routing and timing rework. Is it a big deal?

TurboCore, if not present in Jaguar by default, probably makes sense to speed up lightly-threaded portions of the code. If the base clock is 1.6 GHz, Jaguar supposedly can be clocked quite a bit higher (2.0 GHz?).

Is it too much to ask for the CPU and GPU to throttle relative to one another as needed?
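
In principle that is just a shared power/thermal budget with the boost logic arbitrating between the two sides. A toy illustration of the idea (purely hypothetical, not a description of AMD's actual power management):

```python
# Toy model of a shared CPU/GPU power budget; purely illustrative,
# not a description of any real AMD power-management scheme.

PACKAGE_BUDGET_W = 100.0   # hypothetical SoC power budget

def split_budget(cpu_demand_w, gpu_demand_w, budget_w=PACKAGE_BUDGET_W):
    """Grant both sides what they ask for if it fits; otherwise scale both down."""
    total = cpu_demand_w + gpu_demand_w
    if total <= budget_w:
        return cpu_demand_w, gpu_demand_w
    scale = budget_w / total
    return cpu_demand_w * scale, gpu_demand_w * scale

# GPU-heavy frame: the CPU effectively gives up clocks so the GPU keeps its boost
print(split_budget(cpu_demand_w=30.0, gpu_demand_w=90.0))   # (25.0, 75.0)
# CPU-heavy frame: the opposite
print(split_budget(cpu_demand_w=70.0, gpu_demand_w=50.0))   # (~58.3, ~41.7)
```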
 