Predict: The Next Generation Console Tech

tunafish · Mar 14, 2012

hoho said:
tunafish and upnorthsox, are you sure you are both talking about flops per module or flops per int-core?

Well, I have been talking of per module since the first post, and qualified it pretty clearly on the second reply.

Also, he quoted a throughput number of 51.2 gflops for full BD, which is just wrong. (well, it would be correct for doubles...). So I don't think the number of cores is the issue.

Gubbi · Mar 14, 2012

upnorthsox said:
Just to be clear, I was responding to your 8 madd ops per cycle when I said half that. If you want to do that funky int ops = FP ops like they did with the consoles then be my guest and double it up (however I believe it would be 7 not 8 for BD). Actually, it'd probably be more accurate to call it a spu instead of an fpu but then MS worked too hard to turn Cell into a four letter word for them to consider that.

You're confused on multiple levels. Just pay attention to Tunafish.

Cheers

McHuj · Mar 14, 2012

Rangers said:
Yeah they are, and this seems a little ominous
...
Also implies the next console specs still aren't settled. Even 2013 might be too early.

I don't know about release date, but it seems to me that RAM may be the disputed spec still under review.

Other than Epic urging for more powerful consoles, we've had both Crytek and Dice ask for 8 GB. I don't want to get into the feasibility of 8 GB; the point here is that we've had multiple devs now publicly state that they would like a really high amount of RAM. My guess is that they may not be happy with the current projected amount of RAM.

V3 · Mar 14, 2012

Rangers said:
Also implies the next console specs still aren't settled. Even 2013 might be too early.

Yeah, it'll be funny if iPad becomes the lead platform for game development.

I really think next-gen systems need at least dual top-end GPUs and lots of RAM to last the ten-year life cycle. If not, those smartphones and iPads might become good enough to make consoles irrelevant.

I think Wii U might be ok with lower specs and small form factor since Nintendo will likely put out another machine in six years or they'll go third party earlier. But for Sony and MS, put the console in an amp or receiver type case and put as much processing power as technologically possible and start with higher price and skim down.

upnorthsox · Mar 14, 2012

tunafish said:
Yes, and?

The point still stands -- each module can do 8 SP FMADDs per cycle, or, a 4-module, 8-core chip can do 32 FMADD each cycle, not 16.

Umm, still no. I am not counting the integer ops.

And what? The FPU is four wide, capable of issue, execution and completion of four ops each cycle.

If this is wrong then show me how/where?

I don't have a problem being wrong but I want to know how I am because that reads pretty straightforward.

tunafish · Mar 14, 2012

upnorthsox said:
And what? The FPU is four wide, capable of issue, execution and completion of four ops each cycle.

If this is wrong then show me how/where?

I don't have a problem being wrong but I want to know how I am because that reads pretty straightforward.

Each fp "op" there is a 128-bit SIMD bundle, consisting of 2 64-bit doubles, 4 32-bit floats, or one of various integer formats.

So the FPU can issue 2 SIMD bundles to FMA units per clock, the total issue rate per FPU being 8 FMA per clock. In addition to this, it can issue 2 non-fma fpu ops -- including one store and one transformation. Also, two loads can be issued on the integer side.

Although to keep that up, you have to use 256-bit AVX, because the frontend limits you to 4 ops per clock partitioned between two threads.
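The counting above can be sketched as a back-of-envelope calculation. This only assumes what is stated in this thread (two 128-bit FMA pipes per module); the function names and the clock parameter are made up for illustration, and no particular clock speed is implied.

```python
# Back-of-envelope peak FMA count for a Bulldozer-style module.
# Assumptions (taken from the posts above): 2 FMA pipes per module,
# each working on one 128-bit SIMD bundle per clock.
# Function and parameter names are illustrative only.

SIMD_BITS = 128
FMA_PIPES_PER_MODULE = 2


def fma_per_clock(modules: int, element_bits: int = 32) -> int:
    """Peak scalar FMAs issued per clock across `modules` modules."""
    lanes = SIMD_BITS // element_bits  # 4 floats or 2 doubles per bundle
    return modules * FMA_PIPES_PER_MODULE * lanes


def peak_gflops(modules: int, clock_ghz: float, element_bits: int = 32) -> float:
    """Peak GFLOPS, counting each FMA as 2 floating-point operations."""
    return fma_per_clock(modules, element_bits) * 2 * clock_ghz


if __name__ == "__main__":
    print(fma_per_clock(1))      # single precision, one module
    print(fma_per_clock(4))      # single precision, four modules
    print(fma_per_clock(4, 64))  # double precision, four modules
```

Under these assumptions one module gives 8 SP FMAs per clock and a 4-module chip gives 32, while doubles halve that to 16.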

JardeL · Mar 14, 2012

McHuj said:
I don't know about release date, but it seems to me that RAM may be the disputed spec still under review.

This.

3dilettante · Mar 14, 2012

The front end is 4 instructions per clock, alternating between threads.
AVX-256 is not necessary to provide enough instructions to hit peak FP throughput. According to Agner Fog, it is counterproductive.
The FPU wastes resources cracking the ops, and the front end apparently can only handle one double instruction (which 256-bit ops are) at a time.

Chalk that latter nugget up on the list of unexpectedly weak things about Bulldozer.
Still, this is relative to Xenon, which is a bar BD should be able to clear.

upnorthsox · Mar 14, 2012

tunafish said:
Each fp "op" there is a 128-bit SIMD bundle, consisting of 2 64-bit doubles, 4 32-bit floats, or one of various integer formats.

This part was understood.

So the FPU can issue 2 SIMD bundles to FMA units per clock, the total issue rate per FPU being 8 FMA per clock.

This is where you had lost me before, or maybe I should say the documentation did, because if it's actually 8 per cycle per FPU then they should say that.

In addition to this, it can issue 2 non-fma fpu ops -- including one store and one transformation. Also, two loads can be issued on the integer side.

This part again was understood

Although to keep that up, you have to use 256-bit AVX, because the frontend limits you to 4 ops per clock partitioned between two threads.

So you are saying that both 128bit threads are run per cycle from the partitioned 256bit load? If so, then this would explain our difference. Though again the documentation is not clear:

"Only 1 256-bit operation can issue per cycle, however an extra cycle can be incurred as in the case of a FastPath Double if both micro ops cannot issue together."

tunafish · Mar 14, 2012

upnorthsox said:
This is where you had lost me before, or maybe I should say the documentation did, because if it's actually 8 per cycle per FPU then they should say that.

Oh, I'm sorry. I can see how my answer was less than helpful.

This is because in x86 nomenclature, a single SSE "bundle" is a single op working on a 128-bit quantity. The fact that some ops treat their arguments as 4 distinct floats is just a detail about the op. So when you posted the quote, I frankly did not understand what you were talking about.

So you are saying that both 128bit threads are run per cycle from the partitioned 256bit load? If so, then this would explain our difference. Though again the documentation is not clear:

"Only 1 256-bit operation can issue per cycle, however an extra cycle can be incurred as in the case of a FastPath Double if both micro ops cannot issue together."

Any 256-bit ops are cracked in the decoder before issue into two separate 128-bit ops, which then issue to the 128-bit units whenever they can.
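A toy model of that cracking step, for illustration only. It assumes each instruction is described by nothing but a name and an operand width; the `crack` helper and the tuple representation are hypothetical and ignore everything else the decoder does.

```python
# Toy model: split 256-bit ops into two 128-bit micro-ops before issue.
# The (name, width_bits) representation is an illustrative assumption.
from typing import List, Tuple

Op = Tuple[str, int]

UNIT_BITS = 128  # width of the FP execution units


def crack(instructions: List[Op]) -> List[Op]:
    """Return the micro-op stream the 128-bit units would see."""
    micro_ops: List[Op] = []
    for name, width in instructions:
        if width <= UNIT_BITS:
            micro_ops.append((name, width))
        else:
            # A 256-bit op becomes a low and a high 128-bit half.
            for half in ("lo", "hi"):
                micro_ops.append((f"{name}.{half}", UNIT_BITS))
    return micro_ops


if __name__ == "__main__":
    stream = [("vfmadd", 256), ("addps", 128)]
    for op in crack(stream):
        print(op)
```

Either way the FMA pipes only ever receive 128-bit work; a 256-bit op simply turns into two of them.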

3dilettante said:
The front end is 4 instructions per clock, alternating between threads.
AVX-256 is not necessary to provide enough instructions to hit peak FP throughput. According to Agner Fog, it is counterproductive.

Agner Fog's guide for BD is out? How did I miss that?

You are, of course, correct. I was referring to the theoretical case where you want to use both FMA pipes, store, and loads simultaneously. Turns out you can't actually do that anyway -- according to Fog, BD can do either 2 loads or load+store, not 2 loads + store like SNB. (SNB has just 2 AGUs, but when loading 256-bit quantities, or doing unaliased loads, it can use one AGU to drive both load pipes.)

This is somewhat mystifying to me. K8 used a similar scheme to crack long FPU ops in half, and there you could decode as many double instructions per clock as you felt like. I frankly just assumed that BD was the same way.

Chalk that latter nugget up on the list of unexpectedly weak things about Bulldozer.

The list is getting rather long.

Still, this is relative to Xenon, which is a bar BD should be able to clear.

Of course. AMD might be playing b series compared to Intel, but Xenon (and most other options) are enrolled in the special olympics. What store-to-load hazards? It's not like anyone would ever like to read data nearby to data they've just stored. I mean, the C stack is just so passé.

babybumb · Mar 14, 2012

Rangers said:
Yeah they are, and this seems a little ominous

Posted 3/14

Also implies the next console specs still aren't settled. Even 2013 might be too early.

He implies only hope is Sony/MS. What does that tell about Wii U?

I think an iPad next year will ship with Cortex A15 and a new next-gen PowerVR Rogue chip. It will be close enough to current-gen. How will Nintendo like that? Much superior tablet connected to Apple TV.

steviep · Mar 14, 2012

babybumb said:
He implies only hope is Sony/MS. What does that tell about Wii U?

I think an iPad next year will ship with Cortex A15 and a new next-gen PowerVR Rogue chip. It will be close enough to current-gen. How will Nintendo like that? Much superior tablet connected to Apple TV.

It tells me that Rein doesn't care to support it with any of his software, which is true.

It also tells me that he doesn't like what he sees in the current iterations of the MS/Sony dev kits.

But your correlation to Apple is particularly numbing. A $600 tablet and a $150 Apple TV is not going to replace a dedicated gaming device with sticks and buttons sold for less than half the price.

TheChefO · Mar 14, 2012

babybumb said:
I think an iPad next year will ship with Cortex A15 and a new next-gen PowerVR Rogue chip. It will be close enough to current-gen. How will Nintendo like that? Much superior tablet connected to Apple TV.

I'll go out on a limb here and say that nextgen ipad will bury the WiiU, connected to AppleTV or not.

It will dwarf the sales of WiiU and seemingly, parents that once thought of games consoles as "too expensive" are now more than happy to drop $500 for their kids to have the latest iphone or ipad or ipod touch.

We're living in lala land full of Apple-consumer zombies, so the only thing to do is differentiate from them, or join them.

Nintendo missed on both counts (by their currently floated spec rumors).

Shifty Geezer · Mar 14, 2012

Wrong thread.

TheChefO · Mar 14, 2012

Shifty Geezer said:
Wrong thread.

That line is getting rather blurred.

Apple seems to be pushing further and further into the gaming realm and as has been pointed out numerous times prior, Games are the #1 app on iPad. And though ipad is not tied down to a wire, it isn't what most would consider a "portable/handheld" device ... More of a tablet-console than anything else. Especially if nextgen we get Rogue and A15.

It is already pushing more resolution and ram than xbox360 and ps3 and with additional horsepower, it should clear them on performance as well.

The use-case is majority at home, on the couch, right next to (or hooked up to) the tv.

Add to all of this, the push by more and more developers into ipad development and it seems inevitable to include ipad in the discussion of future consoles.

MrFox · Mar 14, 2012

I know it'll be much faster with branchy code and easier to target for, and I'm not denying the advantages, but I'm not sure I understand how a 100gflops x86 cpu is a good choice to replace the 200gflops cell from 2006.

If first party studios claimed to be hitting close to the peak with some games, it can't magically be 10 times faster. So, are they side-stepping the cpu and concentrating on the gpgpu for crunching numbers? In other words, the PPE beefed up into a nice x86 core, and the SPEs replaced by gpgpu cores?

Will there be some first party code from Sony which will fall between the cracks, with bad performance on the normal CPU, and not applicable to OpenCL or CUDA?

ERP · Mar 14, 2012

MrFox said:
I know it'll be much faster with branchy code and easier to target for, and I'm not denying the advantages, but I'm not sure I understand how a 100gflops x86 cpu is a good choice to replace the 200gflops cell from 2006.

If first party studios claimed to be hitting close to the peak with some games, it can't magically be 10 times faster. So, are they side-stepping the cpu and concentrating on the gpgpu for crunching numbers? In other words, the PPE beefed up into a nice x86 core, and the SPEs replaced by gpgpu cores?

Will there be some first party code from Sony which will fall between the cracks, with bad performance on the normal CPU, and not applicable to OpenCL or CUDA?

You really have to look at what Cell was being used for, and I very much doubt anybody was sustaining anything close to peak rates across a frame. For specific sub tasks perhaps, but are those the same tasks that now make more sense on a GPU?

Flops is a horrible measure of CPU performance, it made sense when memory latency wasn't an issue, but now performance is all about cache and specifically misses. This was true even when 360/PS3 were released, but flops are easy to market.
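One rough way to see why misses dominate is the textbook average-memory-access-time formula (hit time plus miss rate times miss penalty). The sketch below is just that formula; the function name is illustrative and any numbers plugged in are the reader's own, not measurements of any console.

```python
# Textbook average memory access time (AMAT), in cycles.
# Name and parameters are illustrative; no real hardware figures are implied.


def avg_access_cycles(
    hit_cycles: float, miss_rate: float, miss_penalty_cycles: float
) -> float:
    """AMAT = hit time + miss rate * miss penalty."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_cycles + miss_rate * miss_penalty_cycles


if __name__ == "__main__":
    # Hypothetical inputs, purely to show how quickly the miss term takes over.
    for rate in (0.0, 0.25, 0.5):
        print(rate, avg_access_cycles(4, rate, 100))
```

When the miss penalty is hundreds of cycles, even a small miss rate swamps whatever the ALUs could do in the same time, which is why peak flops says so little.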

MrFox · Mar 14, 2012

ERP said:
Flops is a horrible measure of CPU performance, it made sense when memory latency wasn't an issue, but now performance is all about cache and specifically misses. This was true even when 360/PS3 were released, but flops are easy to market.

Wasn't that the idea of the Cell's local store and DMA ops, hiding memory latency? i.e. the tasks arranged so that large blocks are loaded, processed and written back, with the DMA ops moving it both ways "for free" while one block is processed?
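The pattern described here is usually called double buffering. A minimal sketch of its shape, assuming a worker thread standing in for the DMA engine: `process_stream`, `fetch` and `compute` are hypothetical names, the write-back leg is left out, and ordinary Python threads will not give real overlap for CPU-bound work, so this only shows the structure.

```python
# Double-buffering sketch: fetch block i+1 while block i is processed.
# A worker thread stands in for the DMA engine; all names are illustrative.
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def process_stream(
    blocks: Sequence[Sequence[float]],
    fetch: Callable[[Sequence[float]], List[float]],
    compute: Callable[[List[float]], float],
) -> List[float]:
    results: List[float] = []
    if not blocks:
        return results
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(fetch, blocks[0])  # prefetch the first block
        for i in range(len(blocks)):
            current = pending.result()  # wait for the transfer to land
            if i + 1 < len(blocks):
                # Kick off the next transfer before starting to compute.
                pending = dma.submit(fetch, blocks[i + 1])
            results.append(compute(current))  # runs while the fetch is in flight
    return results


if __name__ == "__main__":
    data = [[1.0, 2.0], [3.0, 4.0], [5.0]]
    print(process_stream(data, fetch=list, compute=sum))
```

The point of the arrangement is that the only time spent waiting on memory is the very first fetch, provided each transfer finishes before the current block's compute does.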

It boggles the mind that NONE of the rumored "middle ground" solutions went through. Cell, Larrabee or A2. We have PC instead.

Elan Tedronai · Mar 15, 2012

The longer this generation goes for Sony, the more I predict Sony will retain its own core Cell processor and make some tweaks. Beef up the RAM, get some new GPU and call it a day.

RedVi · Mar 15, 2012

MrFox said:
I know it'll be much faster with branchy code and easier to target for, and I'm not denying the advantages, but I'm not sure I understand how a 100gflops x86 cpu is a good choice to replace the 200gflops cell from 2006.

If first party studios claimed to be hitting close to the peak with some games, it can't magically be 10 times faster. So, are they side-stepping the cpu and concentrating on the gpgpu for crunching numbers? In other words, the PPE beefed up into a nice x86 core, and the SPEs replaced by gpgpu cores?

Cell has been used to pick up RSX's slack for vector processing by first party devs. Doing this means that far less of the chip is free for general CPU tasks like AI. For purely CPU tasks, 100 GFLOPS is plenty, especially when that vector processing is now being done on a GPU with 2-4 TFLOPS. Taking the burden off the weaker CPU/Cell makes perfect sense.
