Predict: The Next Generation Console Tech

That's memory itself, not the interface to it. You'd basically need to double the FlexIO lanes to main memory in PS3 to achieve that kind of theoretical pure read or write performance. But as you note, it would then have very little left over for talking to RSX.
FlexIO supposedly goes over 60GB/s according to that old figure, iirc.

It would be very odd if XDR can supply 25GB/s doing read only, but the chip which was designed to work with it can only attain half of that bandwidth.

Yes, but the point is you quoted something like 170 GFLOPS sustained performance for the 3960X, and my question was simply whether it's SP or DP, because you're comparing with SP on Cell. Not that it would be a comparable test anyway, due to likely wild differences in workload and optimisation levels. For the record, a 5GHz 3960X has a theoretical SP throughput of 480 GFLOPS and 240 DP GFLOPS. So if it's achieving 170 then it's either DP, or code/workload that's completely unoptimised for Sandy Bridge.
Or it can't approach its theoretical peak in sustained use for some reason. Though we can't tell either way; we'd need more information to know which is the case.

The benchmark is called Intel Burn Test, so at the least it's designed with Intel in mind. The developers could be lousy coders, who knows.
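For reference, the theoretical peaks quoted above fall straight out of per-cycle throughput. A rough sketch (assuming Sandy Bridge can issue one 256-bit AVX add plus one 256-bit AVX multiply per core per cycle, and that each Cell SPE issues a 4-wide SP fused multiply-add per cycle; the Cell figure is my own addition, not from the post above):

```python
# Theoretical peak = cores * clock (GHz) * FLOPs per core per cycle.
# Assumptions: Sandy Bridge issues one 256-bit AVX add + one 256-bit AVX mul
# per core per cycle; each Cell SPE issues one 4-wide SP madd per cycle.

def peak_gflops(cores, ghz, flops_per_core_per_cycle):
    return cores * ghz * flops_per_core_per_cycle

sp_3960x = peak_gflops(6, 5.0, 8 + 8)  # 8-wide SP add + 8-wide SP mul
dp_3960x = peak_gflops(6, 5.0, 4 + 4)  # 4-wide DP add + 4-wide DP mul
sp_cell  = peak_gflops(6, 3.2, 4 * 2)  # 6 game-usable SPEs, 4-wide SP madd

print(sp_3960x, dp_3960x, sp_cell)     # 480.0 240.0 153.6
```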

PS
In any case, getting back on the topic of next-gen: over at NeoGAF I read that the Vita might be using a SoC designed by Sony and Toshiba. So it seems likely that the next PlayStation could have internal designs competing for viability with what vendors may offer. A juicy custom PowerVR-licensed design would be interesting to see.
 
FlexIO supposedly goes over 60GB/s according to that old figure, iirc.

It would be very odd if XDR can supply 25GB/s doing read only, but the chip which was designed to work with it can only attain half of that bandwidth.

You may be correct on this actually. FlexIO definitely uses unidirectional lanes; however, as I now understand it, that's only used for communicating with RSX (at around 35GB/s aggregate bandwidth) as opposed to the main memory. There is a separate memory controller for communicating with XDR which may well allow all read or all write. In which case the peak numbers that should be compared are indeed 25.6GB/s for the PS3 and 51.2GB/s for the 3960X.
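For completeness, both peak figures are just bus width times effective transfer rate. A quick sketch (assuming PS3's XDR presents a 64-bit interface at 3.2 GT/s and the 3960X runs quad-channel DDR3-1600):

```python
# Peak bandwidth (GB/s) = bus width in bytes * effective transfer rate (GT/s).
def peak_bandwidth_gbs(bus_bits, gt_per_s):
    return (bus_bits / 8) * gt_per_s

ps3_xdr  = peak_bandwidth_gbs(64, 3.2)      # 25.6 GB/s
x79_quad = peak_bandwidth_gbs(4 * 64, 1.6)  # 51.2 GB/s (quad-channel DDR3-1600)
print(ps3_xdr, x79_quad)
```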

Or it can't approach its theoretical peak in sustained use for some reason. Though we can't tell either way; we'd need more information to know which is the case.

The benchmark is called Intel Burn Test, so at the least it's designed with Intel in mind. The developers could be lousy coders, who knows.

Exactly, there is too little information available about these tests to consider them comparable. For the record though, the Intel Burn Test does use AVX on Sandy Bridge in its latest versions, which the tester was no doubt using; however, I'm also fairly sure it's measuring DP flops. This is the best link I could find on that, though:

http://forums.overclockers.co.uk/showthread.php?t=18293284
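For context on how a Linpack-style figure like that 170 GFLOPS is produced: the test solves a dense n×n system by LU factorisation and divides the nominal flop count by wall-clock time, so the same code reports DP or SP GFLOPS depending purely on which precision the binary uses. A rough sketch (the problem size and time below are made up purely for illustration):

```python
# Linpack-style GFLOPS = nominal flop count of an n x n LU solve / elapsed time.
# Conventional flop count for LU factorisation + triangular solves: 2/3*n^3 + 2*n^2.
def linpack_gflops(n, seconds):
    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    return flops / seconds / 1e9

# Hypothetical run: a 30,000 x 30,000 problem finishing in ~106 s reports ~170 GFLOPS.
print(round(linpack_gflops(30_000, 106.0), 1))
```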
 
The Intel Burn Test suite is made BY Intel and doesn't even run on non-Intel CPUs.

I believe it does run on AMD CPUs too.

From the changelog:

v2.53
- Linpack binaries updated (12-18) + patched for AMD CPUs
- Added "All" threads option to allow full utilization of SMT threads (e.g. HyperThreading units)
- Default times to run is now set to 10
 
Makoi, that's an apples-to-oranges comparison. You're comparing the performance of products built on different processes. To avoid this pitfall I base my opinion on performance per transistor.
------------------

EDIT

Speaking of misreading... I did really well with your post: I was at work, read it super fast, etc., and ended up posting BS.

Sorry about that. I've now written a proper answer.
 
I believe it does run on AMD CPUs too.

From the changelog:

v2.53
- Linpack binaries updated (12-18) + patched for AMD CPUs
- Added "All" threads option to allow full utilization of SMT threads (e.g. HyperThreading units)
- Default times to run is now set to 10

Does it? I tried it not too long ago and it didn't want to run, saying "this software is only compatible with Intel CPUs" and I obviously have an AMD.
 
The Intel Burn Test suite is made BY Intel and doesn't even run on non-Intel CPUs.

:rolleyes:
No, it's NOT! And yes, it does!


steampoweredgod:

Linpack SP GFLOPS (which is where you are getting the Cell numbers from) are not the be all and end all of performance.

A game is far more than just basic SP floating point loads: not only do integers have a large role to play, but DP FP is wanted in game physics, and Cell will stall a lot more due to its lack of OOOE.
 
A game is far more than just basic SP floating point loads: not only do integers have a large role to play, but DP FP is wanted in game physics...
Really? I think that'd be a complete waste in games which can use a range of values well within SP's tolerances.
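To put a number on that: single precision keeps about 7 significant decimal digits, so for coordinates within a few kilometres of the origin the representable spacing is still around a millimetre or better. A quick check (assuming numpy is available):

```python
# How fine is single precision at typical game-world coordinate magnitudes?
# numpy.spacing(x) returns the gap to the next representable value of x's dtype.
import numpy as np

for metres in (1.0, 100.0, 1_000.0, 10_000.0):
    ulp_m = np.spacing(np.float32(metres))
    print(f"position {metres:>8.0f} m -> SP resolution {ulp_m * 1000:.4f} mm")

# Even 10 km from the origin the spacing is still under 1 mm, which is why SP
# is generally sufficient unless the world is huge or badly centred.
```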
 
In response to the claim that AMD is in a position of strength, this timely article would dispute that...
http://www.pcpro.co.uk/features/372859/amd-what-went-wrong

That's probably one of the most biased and inaccurate articles I've seen in a while.
4 pages telling everything AMD did wrong, not a single word about the good performance of the graphics division, and only a sentence saying "oh, and there's Fusion, and that's cool".
Ridiculous.

That, and the claim that store staff could only get 2 Llano APUs. That's a blatant lie.
 
Linpack SP GFLOPS (which is where you are getting the Cell numbers from) are not the be all and end all of performance.
From what I understand SPE utilization and sustained performance in the real world can be quite high, it is not just in synthetic benchmarks. Others with more experience can clarify this issue.

Cell will stall a lot more due to its lack of OOOE.
It depends on the task being worked on; some tasks are more predictable and are amenable to manual optimization. I'm not sure whether this holds for game physics code; someone with more experience could tell.

From what I've heard physics is amenable to high performance on stream processor architectures, so it seems likely to be suited for such designs.

To maintain a fluid visual experience, applications typically provide at least a 30 frames per second (FPS) display rate (33ms per frame). This is the total time allotted to all game components, including physics-based simulation, graphics display, AI, and game engine code. Conventional cores cannot meet the computational demands of these applications. For example, in Section 6 we show a realistic example where a single-core desktop processor achieves only 2.3 FPS.
- ParallAX: An Architecture for Real-Time Physics
One of the reasons to pursue development of the Cell B./E. processor was that the speed of physics simulation in video games was lagging far behind the graphics performance of 3D rendering. The realism level of 3D games was no longer limited by the capability of the graphics processing unit being able to display the dynamically changing environment, but instead the CPUs were not able to compute the movement of objects or particles with realistic physics simulations as fast as the graphics could render it....

With the introduction of the Cell B./E. the situation is reversed

Cell B./E. simulating a chicken farm. Each individual chicken has its own behavior model interacting with other birds. The simulator was demonstrated to provide realtime (30fps) performance with several thousand chickens. In fact, when the number of chickens was increased to a total of 15,000 birds the Cell B./E. processor was still able to perform the simulation with interactive speed, but the graphics rendering was not able to keep pace, even on a state of the art NVidia GPU, and started dropping 2 out of 3 frames, resulting in 10fps "sluggish" video output.
- eHiTS® on the Cell B./E.™: Revolutionary Hardware Technology Opens New Frontiers in Molecular Modeling, 2007-2008

Mercury Computer Systems reported a 15x to 30x speedup of Fast Fourier Transform (FFT) algorithms on the Cell B./E. compared to the best commercially available substitute processors. It is projected that they will produce up to a 100x improvement for actual customer applications.
- eHiTS® on the Cell B./E.™: Revolutionary Hardware Technology Opens New Frontiers in Molecular Modeling, 2007-2008

That paper appears to be at most four years old. Real-world performance was substantial.
 
IBM marketing material isn't really the most reliable source of information for performing a fair comparison of the two architectures. I'm sure the same could be done by Intel or AMD using similarly hand picked workloads.

Incidentally, what is the underlined section of the quote above supposed to prove? That "state of the art" NV GPU wouldn't be able to render 1 chicken if it was sufficiently detailed, or it could render a million if they were sufficiently simple. So the fact that Cell could handle the physics for more chickens than the GPU could render means precisely nothing. The CPU in my phone could do the same depending on how the test was set up.
 
The Cell BE Programming Manual itself identifies what kinds of algorithms are strengths and weaknesses for the system. Seems pretty candid. Folding@Home also had interesting input about the strengths and weaknesses of x86, Cell and GPGPU. I think we'll find out soon enough which aspects of Cell come back in next-gen IBM designs. Personally I'm thinking the EIB ring bus was a good idea. The local store also seems efficient, but may have been too specialised for most applications.

Cell was too far out compared to the competition and not easy enough to work with to compensate. In the end, if you make an engine that allows 80% of cars to run 150mph with ease and has a max of around 180mph, or you make one that allows 80% to do 120mph with ease but has a max of 250mph, the first may still be better.
 
IBM marketing material isn't really the most reliable source of information for performing a fair comparison of the two architectures. I'm sure the same could be done by Intel or AMD using similarly hand picked workloads.

Incidentally, what is the underlined section of the quote above supposed to prove? That "state of the art" NV GPU wouldn't be able to render 1 chicken if it was sufficiently detailed, or it could render a million if they were sufficiently simple. So the fact that Cell could handle the physics for more chickens than the GPU could render means precisely nothing. The CPU in my phone could do the same depending on how the test was set up.

The software for the sim came from RapidMind, not IBM; the paper merely quotes the results. RapidMind was later acquired by Intel.

The chickens were quite simple judging from the images, and it is likely LOD was also implemented on the GPU. I do wonder what could possibly have occurred there. It's definitely not the number of objects, as GPUs could handle that number with ease; is there something about the objects behaving in a more complex way that causes a bottleneck somewhere?
Imagine the biggest flock of virtual fowl ever assembled. Each chicken is controlled by a simple artificial intelligence program, operating according to a handful of rules. Each chicken wants to move toward the rooster but must avoid collisions with other chickens, fences, and the barn. To do so, each one must constantly check the position of its nearest neighbors and other objects in its environment and then decide how to move.
- link
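That description is essentially a boids-style update: each bird steers toward an attractor (the rooster) while separating from nearby neighbours and obstacles. A minimal sketch of the per-frame rule, purely to illustrate the kind of data-parallel workload involved (all names and constants are made up, not taken from the RapidMind demo):

```python
# Minimal boids-style update loosely matching the description above: every
# chicken steers toward the rooster while pushing away from close neighbours.
# Constants and structure are illustrative only, not from the actual demo.
import numpy as np

def step(positions, velocities, rooster, dt=1 / 30, attract=0.5, separate=2.0, radius=1.0):
    n = len(positions)
    to_rooster = rooster - positions                         # (n, 2) pull toward attractor
    offsets = positions[:, None, :] - positions[None, :, :]  # (n, n, 2) pairwise offsets
    dist = np.linalg.norm(offsets, axis=-1) + np.eye(n)      # avoid zero self-distance
    near = dist < radius                                     # neighbours to separate from
    push = (offsets / dist[..., None] * near[..., None]).sum(axis=1)
    velocities += (attract * to_rooster + separate * push) * dt
    positions += velocities * dt
    return positions, velocities

# Usage: a small flock; the real demo scaled to 15,000 birds, which would need
# a spatial grid rather than this O(n^2) all-pairs pass.
pos = np.random.rand(2_000, 2).astype(np.float32) * 100
vel = np.zeros_like(pos)
pos, vel = step(pos, vel, rooster=np.array([50.0, 50.0], dtype=np.float32))
```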
 
The software for the sim came from RapidMind, not IBM; the paper merely quotes the results. RapidMind was later acquired by Intel.

Yes but the demo itself was still created specifically for Cell at the direction of IBM for the purpose of marketing.

It would actually be a great comparison point if Intel were to release the same demo for Sandy Bridge (specifically optimised, of course) using the same engine.

But as such a comparison point doesn't exist there's not much value to be taken from it other than to show that Cell can calculate a simple AI routine for 16K objects simultaneously. I'm sure it looked great though.
 
GCN has less raw performance/mm^2 in comparison to the VLIW architecture, that's a fact. A 28nm VLIW chip the size of Tahiti could probably have packed 30% more ALUs.
We agree, not on the 30% though
But performance is much more consistent than before, with a much higher minimum framerate, and overall I don't think that real-world perf/mm^2 has declined, considering the diminishing returns they were getting as they increased the number of ALUs.
Indeed there is a clear increase in minimum framerate.
In the future, real perf/mm^2 for GCN will greatly surpass the old architecture as more and more engines move to compute shader technology. Look at the performance gains AMD achieved in Civ 5, a game that uses compute shaders heavily. It's more than 65% faster than the older architecture.
Well, in the last benches of the Anand review the card does indeed do very well. But in the other ones the difference is not that consistent (if you compare the HD 6850 and the HD 7770: ~200 million more transistors on the 6850, including disabled SIMDs and tex units, against a 200MHz clock-speed advantage for the 7770; or against the full-blown HD 6870).
I do agree that it does more with less raw power.
Even if in the console space architectures are exploited in a different way, so a VLIW chip may reach more sustained performance closer to its peak, compute is not only bound by FLOPs but also by cache architecture, internal bandwidth and so on. And GCN (or Kepler/Fermi) offers much more on this side.
Microsoft knows where graphics is heading. DirectX 11 introduced DirectCompute, they are working on C++ AMP, an extension for high-performance computing, and many other things. I'm going as far as to say that they won't have GCN in their console, but probably an even more compute-oriented architecture, for example Sea Islands or, if they launch in 2013, something similar to that year's architecture.
A fully HSA console would greatly benefit the entire AMD business.

I don't dispute that either. The chip does really well given the limitations it faces, mostly the 128-bit bus versus the HD 68xx. It does significantly better than the HD 5770 (hardware.fr gives it 121 points with the HD 5770 at 100 points @1080p; the HD 6850 and 6870 achieve 132 and 157 points): it's 121% of the performance for 150% of the transistors.
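Taking the perf-per-transistor metric literally with just those ratios, the comparison works out like this (a quick sketch using only the figures quoted in this post):

```python
# Perf-per-transistor using only the ratios quoted above:
# HD 7770 ~121% of the HD 5770's performance for ~150% of its transistors.
perf_ratio = 1.21
transistor_ratio = 1.50

ratio = perf_ratio / transistor_ratio
print(f"HD 7770 perf per transistor vs HD 5770: {ratio:.2f}x "
      f"(~{100 * (1 - ratio):.0f}% lower, clock-speed advantage not factored out)")
# -> about 0.81x, i.e. roughly 19% less performance per transistor.
```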
It's tough to make a fair comparison with the HD 68xx, as in some games the extra bandwidth just shows. The real comparison would be how an HD 6850 performs on a 128-bit bus @ 1GHz, which we will never know.

So the only, let's say, fully qualified comparison is the HD 5770 vs the HD 7770 (and I go back to the first point, the 30%). When all is said and done, no matter how promising some benchmarks are, if I take away the clock-speed advantage, while not forgetting that the new generation does more with less bandwidth, and looking at the transistor count increase, I'm still not sold on GCN being the way to go versus 1.5x Juniper or a tweaked Juniper.
In fact I'm also iffy about something that would be based on Barts, which could be the least "fat" of the AMD architectures (i.e. the Northern Islands family according to Anand; I remembered it wrong earlier). It sounds like a really good starting point too. It would have been interesting to see a successor to the 57xx using this architecture. It may have ended up performing as well as or better than the HD 57xx while being tinier. In both cases most likely not by enough for AMD to judge it worth the effort.

Anyway, now that GCN is out it's pretty hard to make a clear point (one way or the other) that it's the way to go. And from an architecture point of view the choice of something based on Northern Islands doesn't sound that foolish.
In fact, if you want to tweak things a bit, it's possibly a really good choice as there is little "fat" in the architecture; adding cache here and there depending on what you want is somewhat doable. GCN is "fat" to begin with.
 
Hehe, good try. I believe IBM has been a good partner, but that's just anecdotal, I have no actual knowledge of the current relationship.

Assuming AMD is in an unassailable position simply because their tech currently is the best is a bad assumption. As a hypothetical, if NVidia offered a high performance ARM CPU at a deep discount if the console used an NVidia GPU (a Tegra 4, say), then the technical superiority of the AMD GPU would suddenly not be nearly as important.
Well, in the CPU realm Nvidia is anything but proven; they are just using ARM reference designs. In the GPU realm, whether mobile or desktop, they do not lead. Actually the lead AMD has is big, like +33% (and more) chips per wafer at a given level of performance.
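For a sense of where a "+33% chips per wafer" figure can come from: gross die per wafer scales roughly with the inverse of die area, minus an edge-loss term. A sketch using the usual approximation (the die areas below are hypothetical, not actual GPU figures):

```python
# Rough gross-die-per-wafer estimate (ignores yield and scribe lines):
#   dies ~= pi * (d/2)^2 / area  -  pi * d / sqrt(2 * area)
# The die areas below are hypothetical, chosen only to illustrate the scaling.
import math

def gross_dies(die_area_mm2, wafer_diameter_mm=300):
    r = wafer_diameter_mm / 2
    return (math.pi * r**2 / die_area_mm2
            - math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2))

bigger, smaller = 300.0, 225.0  # mm^2, hypothetical competing dies
print(gross_dies(smaller) / gross_dies(bigger))  # ~1.37x as many candidate dies
```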
Then, I could see MS having an incentive to move to ARM, but I'm not sure they will want to pass on BC. And IBM does, as you say, pretty well, etc. Then there is Nvidia; I don't know if MS will license or buy an IP from AMD, but what Nvidia is willing to do is another matter.

Put it another way: say MS comes to see AMD, tries to lowball them and gets a no; how do you expect Nvidia to act? "Wow, that's so sad, I've got a really good deal for you"? Or "we know the situation, it won't be any cheaper here"? It's pretty clear what Nvidia would do; they are not in the charity business. And then you get what you get tech-wise: a lesser offering. MS would have to go back to AMD looking really bad, but big companies are clever enough to avoid that.

In response to the claim that AMD is in a position of strength, this timely article would dispute that...
http://www.pcpro.co.uk/features/372859/amd-what-went-wrong
Well, especially in tough times, if you have a leading product and don't get the most out of it, you are as good as dead. I don't know how it went with MS; maybe they (AMD) were paid a more than fair amount of money and AMD undersold nothing.

To illustrate: say you're an investor and you see a company that has by far the best product for a big project, and they undersell it without putting up the toughest fight they could because they need money. What do you do? It's pretty clear to me that they won't get my money, so whatever they win is in fact a loss. There is no point in funding losers...

Anyway, I disagree; to me, MSFT needs AMD more than you think. There is no good alternative in the high-performance business, they are good partners, etc. etc.
The whole thing has a cost; for their sake I hope AMD got every dollar they could out of the deal and pushed MSFT as hard as they could.

I won't post further, but indeed AMD may have seen MSFT come with X hundred million and said OK to everything without questioning anything other than "we're saved for 2 quarters". That's why ultimately they are set to disappear sooner rather than later, and maybe why their brightest people are leaving the company, etc.
 
Isn't RSX theoretically more powerful than Xenos, yet the opposite is true in reality? So why assume that, while older generation chips have more performance for current titles, that trend will continue for future titles? Arguably greater flexibility has been more useful for developers than simply higher outright performance, given the right tradeoffs.
 