Predict: The Next Generation Console Tech

I posted really late yesterday. I would like to add that even in case 2 (MS offers a set amount of money from the start and leaves AMD no choice), if AMD doesn't try to negotiate they are as good as dead.
The same applies as in case one: there is no real alternative to them for MS, so they have to get a sweet deal for themselves.

To some extent I could see AMD doing that kind of thing, and that's why they are set to fail sooner rather than later. In the meantime Nvidia makes money. Business is business.
 
This is only half true. The 7770 matches the 6950 in tessellation benchmarks AFAIK, and the 6950 has 2.6 billion transistors (1.1 billion more than CV).

I don't expect full GCN in either the Wii U or the next Xbox, but I would be disappointed if it didn't use some of its tech.
The HD 6850 is made of 1.7 billion transistors like the HD 6870, with 2 SIMD arrays & tex units disabled. Anand got confused in his last review and gave the transistor count of Cypress (2.15 billion; I don't know where you got your 2.6 figure). Somebody with an account (or a tweet?) should inform him (they usually do a good job, and one of them posts here once in a while, etc.). See here and here. It changes perceptions quite a bit.

Tessellation scales up pretty well with clock speed (i.e. the clock speed of the dedicated hardware).
Anyway, this isn't to dispute great improvements; if you want a good example, look at the fluid simulation in the AnandTech review: it's a 100% increase in performance (vs the 5770) with 50% more transistors. Overall, especially focusing on games, the story is not the same.

Tessellation is overrated anyway; I don't expect it to have anything but a minor impact. Most GPUs can already push out more than one vertex per pixel / an insane amount of vertices.
 
Tessellation is overrated anyway; I don't expect it to have anything but a minor impact. Most GPUs can already push out more than one vertex per pixel / an insane amount of vertices.
Which is exactly why tessellation is important! We can render more than one vertex per pixel, but we can't model and store at that level of detail. A displacement map allows the art to be captured more accurately in an efficient manner, with tessellation enabling the detail with LOD. After devs have got to grips with it, the improvement should be noticeable. Hopefully, gone will be the texture boxes that current crumbling walls are made of, and in their place, actual crumbling blocks!
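As a rough back-of-envelope (all numbers are illustrative assumptions, not real asset sizes), a minimal sketch of the storage side of that trade-off:

```python
# Rough back-of-envelope: a fully modelled dense mesh vs. a coarse control
# mesh plus an 8-bit displacement map. All numbers are illustrative.

def dense_mesh_bytes(verts, bytes_per_vert=32):
    # e.g. position (12) + normal (12) + UV (8) = 32 bytes per vertex
    return verts * bytes_per_vert

def displaced_mesh_bytes(coarse_verts, map_res, bytes_per_texel=1,
                         bytes_per_vert=32):
    # coarse control mesh + displacement map; the fine detail is regenerated
    # on the GPU by the tessellator every frame instead of being stored
    return coarse_verts * bytes_per_vert + map_res * map_res * bytes_per_texel

dense = dense_mesh_bytes(1_000_000)          # ~1M-vertex "CGI-style" asset
cheap = displaced_mesh_bytes(10_000, 1024)   # 10k control cage + 1024^2 map

print(f"dense mesh:       {dense / 2**20:.1f} MiB")
print(f"coarse + displ.:  {cheap / 2**20:.1f} MiB")
```

Same silhouette-level detail on screen, but only the coarse cage and the map ever sit in RAM or on disc.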
 
Which is exactly why tessellation is important! We can render more than one vertex per pixel, but we can't model and store at that level of detail. A displacement map allows the art to be captured more accurately in an efficient manner, with tessellation enabling the detail with LOD. After devs have got to grips with it, the improvement should be noticeable. Hopefully, gone will be the texture boxes that current crumbling walls are made of and in their place, actual crumbling blocks!
I listen to Laa-Yosh on this one. Even a high level of tessellation won't compete with the precision baked into a normal map. I think he stated that you may not want tessellation + displacement to mess too much with your normal maps.
Not pushed too hard, displacement is better and cheaper than POM. Thing is, if I look at the benchmarks, every GPU since the AMD 5xxx series pushes enough triangles. Unigine 1 & 2 were really good marketing that worked perfectly and messed up people's perception of tessellation.

A more proper use is the DirectX 11 SDK sample with adaptive tessellation on + displacement on. AMD GPUs do great, and have been doing great for a while.
 
A more proper use is the DirectX 11 SDK sample with adaptive tessellation on + displacement on. AMD GPUs do great, and have been doing great for a while.
:???: Doesn't that prove the point? Fully accurate models achieved with a simple mesh and a texture, rather than faked and with clear faults observable. You can still include normal maps as appropriate, but we can lose the regular, triangulated lines that still plague games.
 
:???: Doesn't that prove the point? Fully accurate models achieved with a simple mesh and a texture, rather than faked and with clear faults observable. You can still include normal maps as appropriate, but we can lose the regular, triangulated lines that still plague games.
Well, that's what I call a "light use". Having sub-pixel vertices, on the other hand, I'm not sure does well against the detail baked into normal maps, which are computed at way higher precision using REYES or that kind of thing.
To some extent the GPUs are good enough to push detailed enough models without tessellation; I guess the problem is more the weight of those models.
 
To some extent the GPUs are good enough to push detailed enough models without tessellation; I guess the problem is more the weight of those models.
Yes. Even if you have the GPU power to render CGI-quality assets in realtime, you won't be able to fit them into RAM or stream them fast enough. Tessellation is an excellent compromise, shifting storage requirements to processing requirements. It'd also work well with procedural damage and stuff. I don't know if it could work well with megatexturing, but if it could, it'd be excellent for adding depth to game detail.
 
Notice it says unidirectional. That version's numbers are for the previous i7 processor, IIRC; the new one has double the bandwidth, but unless they changed it from 2x unidirectional, it will only have half the bandwidth available for reading: an aggregate of 50ish, with 25ish for reads.

You're looking at the QPI link on the i7, which is used for connecting to other CPUs in multi-CPU systems, not the memory interface, which I still believe is bi-directional.

Regarding Cell I do not know, but I read that the ~25GB/s is achieved while doing either all reads or all writes; intermingling reads and writes reduces it to ~20GB/s, a similar drop to that of the i7 in the real world.

Your own link from your previous post specifically said the FlexIO was composed of unidirectional lanes. That means in the PS3, with 25.6GB/s total system memory bandwidth, the most likely split is 12.8GB/s read and 12.8GB/s write. Obviously if the entire FlexIO interface were given over to main memory communication this would be significantly higher, but unless you want a console with no GPU that's not the best option.
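A minimal sketch of the two interface models being argued about here, using only the 25.6GB/s figure quoted in the thread (the 75% read mix is made up for illustration):

```python
# Minimal sketch: unidirectional lanes vs. a shared bi-directional interface.
# 25.6 GB/s is the figure quoted in the thread; the 75% read mix is made up.

TOTAL_BW = 25.6  # GB/s total system memory bandwidth quoted for PS3

# Unidirectional lanes: read and write capacity are fixed halves.
uni_read = uni_write = TOTAL_BW / 2          # 12.8 / 12.8 GB/s

# Shared interface: the whole budget can go either way depending on traffic.
def shared_split(read_fraction, total=TOTAL_BW):
    return total * read_fraction, total * (1 - read_fraction)

print(f"unidirectional: {uni_read:.1f} GB/s read, {uni_write:.1f} GB/s write")
r, w = shared_split(0.75)
print(f"shared, 75% reads: {r:.1f} GB/s read, {w:.1f} GB/s write")
```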


Well, given the paltry L1 cache size, data to be computed will most likely be housed in L2. (ed: thanks to the update, it appears this is per core; as there are six cores, it may suffice)
Sure, that number is big, but if we used the Cell SPU local store bandwidth, which is what Cell uses for computing from local memory, it would seemingly far eclipse the computational bandwidth available to the i7 at L2. (ed: or it might not, given this latest update)

Well, it depends on the size of the data being worked on. On small data sizes the L1 will be used, from there up to 256K (per core) the L2 will be used, and from there up to 2.5MB per core or 15MB in total the L3 will be used. The 3960X has over 633GB/s of bandwidth to its last level cache. How does that compare to the bandwidth available to Cell for workloads greater than 256K per SPU? Or less than that, for that matter?

http://techreport.com/articles.x/19670
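To make the "256K per core / 2.5MB per core" point concrete, a tiny sketch that maps a per-core working set to the cache level named above (the 32 KiB L1D figure is my assumption; the rest are the sizes quoted in the post):

```python
# Which cache level serves a given per-core working set, using the sizes
# quoted in the post (32 KiB L1D is assumed; 256 KiB L2 and ~2.5 MiB of
# shared L3 per core are the figures given above).

KIB, MIB = 1024, 1024 * 1024
LEVELS = [("L1D", 32 * KIB), ("L2", 256 * KIB), ("L3 share", int(2.5 * MIB))]

def serving_level(working_set_bytes):
    for name, size in LEVELS:
        if working_set_bytes <= size:
            return name
    return "main memory"

for ws in (16 * KIB, 200 * KIB, 2 * MIB, 8 * MIB):
    print(f"{ws / KIB:7.0f} KiB working set -> {serving_level(ws)}")
```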

Interesting, but there is a reason the local store is bigger than that, and if it affects performance it may explain the difficulties the i7 faces even in synthetic benchmarks, even overclocked.

To what difficulties are you referring? Do you have some evidence to suggest any other architecture performs better in similarly optimised code?

Realtime physics does not seem to necessitate double precision, AFAIK; if it does then you've a point there.

I never said it did. My question was whether the figure you quoted for measured floating point performance was in double precision or single precision flops. If it was DP then it's clearly not comparable to your figures for Cell at all.

I'd assumed local store access was similar to L2 access in terms of computational costs. If L3 access is as efficient, or can serve to stream to L2 and L2's performance is sufficient for approaching theoretical performance, then you have a point.

The i7 benchmark is called Intel Burn Test; let us not assume it is some lousy unoptimized code written for some obscure ancient CPU. It's been around for years, and was recoded from the ground up not so long ago.

While running that benchmark it is said the i7 may see a 20+°C rise in temperature and the processor may even fail to work. Even with this, it cannot achieve what the Cell sustains without burning up.

Where is your evidence for any of this? You have absolutely no idea how the tests compare in terms of optimisation for the architecture they are running on, but it doesn't take a lot of common sense to realise that code designed specifically for Cell purely to achieve maximum FP throughput is going to achieve better results than code designed for benchmarking different x86 architectures. As I asked before, is the benchmark you are comparing with even using AVX?

As far as I can see it is very likely that Cell needs both the local store size and the ~25GB/s bandwidth to attain the sustained 200 GFLOPS single precision performance.

If the new i7's QuickPath is unidirectional like the old one, the available read bandwidth is 25GB/s, the same as Cell. But the memory is likely higher latency.

As I said above, QPI has nothing to do with main memory access. In fact, if I'm correct about main memory on the 3960X being bi-directional then the opposite of your argument is true. It's actually a potential 52.6 GB/s for either read or write vs a maximum of 12.8 GB/s for Cell in PS3 for either. Not that this matters at all, since the workload is always going to be split between both reads and writes, and it's a vast oversimplification of what's actually happening in any case.

If the above is true, it is unlikely the latest i7 will be able to sustain substantially higher performance than Cell at tasks Cell was designed for, assuming bandwidth plays a critical part in Cell sustaining such performance.

Rgd bandwidth - see above. Rgd latency, DDR3 and XDR do not measure latency in the same way. Even DDR2 and DDR3 do not. Latency is roughly equivalent between DDR2 and DDR3 despite the different ratings applied to each. I'm sure we've been over this before?
 
Yes. Even if you have the GPU power to render CGI-quality assets in realtime, you won't be able to fit them into RAM or stream them fast enough. Tessellation is an excellent compromise, shifting storage requirements to processing requirements. It'd also work well with procedural damage and stuff. I don't know if it could work well with megatexturing, but if it could, it'd be excellent for adding depth to game detail.
I guess it's like any tool: you have to use it where it makes sense. Developers are starting to have quite a healthy selection of tools :)

I don't believe ** they can render CGI in realtime, but without knowing I would say 3D data weighs quite a lot. You divide the sides of a square in 2 and you end up with 4 squares. I guess it would add up pretty fast. I think that's why devs are looking at octree structures (or something like that), because as stored now, geometry would weigh too much.
Then you need LODs of everything, including the ground, walls, etc. Tessellation is a good solution to improve detail here and there.
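A quick sketch of how fast that "divide in 2, get 4" growth adds up if you tried to store the subdivided result instead of regenerating it (base mesh size and bytes per vertex are made-up illustrative numbers):

```python
# Each subdivision step splits every quad into four, so storing the result
# explicitly grows geometrically; a tessellator regenerates it from the
# coarse mesh each frame instead. Base mesh and vertex size are made up.

BASE_QUADS = 10_000
BYTES_PER_VERT = 32   # position + normal + UV, illustrative

for level in range(5):
    quads = BASE_QUADS * 4 ** level
    verts = quads * 4                      # loose upper bound, ignores sharing
    mib = verts * BYTES_PER_VERT / 2**20
    print(f"level {level}: {quads:>12,} quads, ~{mib:8.1f} MiB if stored")
```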

But I think the same applies to things like a touchscreen/pad or Sixaxis (everything, in fact): it has to be used where it makes sense, or else you end up with a cheap gimmick (marketing guys, yes I'M pointing my finger at you :LOL: ).

EDIT
** I misread your statement ("If you have the GPU power, etc.") so the answer is a bit off. I'll leave it as it is.
 
The DRAM interface uses the same lines for reads and writes; only one direction is possible at a time. ~21 GB/s is the total max possible for read+write on a 2600; double that for the quad-channel SB-E.
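For reference, that's just the usual channel arithmetic, assuming DDR3-1333 on both parts (the memory speed is my assumption):

```python
# Peak DRAM bandwidth = transfer rate (MT/s) x 8-byte channel x channel count.
# DDR3-1333 is assumed here for both parts.

def ddr3_peak_gbs(mt_per_s, channels, channel_bytes=8):
    return mt_per_s * channel_bytes * channels / 1000

print(f"i7-2600, dual channel:  {ddr3_peak_gbs(1333, 2):.1f} GB/s")  # ~21.3
print(f"SB-E,    quad channel:  {ddr3_peak_gbs(1333, 4):.1f} GB/s")  # ~42.7
```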
 
Keep in mind that for game development purposes one of the SPUs is locked on the current PS3 (for OS purposes), and if this is like other PS3 games the remaining Cell is not fully dedicated to physics but is likely aiding the frail GPU.

If the above quotes are true we can assume that a fraction of the Cell is enough to outpower a 4-core i7 when it comes to physics, in game code optimized by a professional development team.

The question is, could the latest i7 outpower the full Cell at its own game? And if so, by how much and at what cost? (burning up 100+W with a high-speed fan attached?)

The level of optimisation specific to an i7 that went into the PC version is likely embarrassing compared to what went into the PS3 version for Cell, so the comparison isn't valid. We simply have no idea how that i7 would have fared in a console compared to being used in a PC.

For example, you mention the Cell diverting power to help out RSX (of which there's no evidence in this case), but what of the well known and often discussed necessity for PC CPUs to deal with Windows and the API?

And which i7 was he talking about? At what clock speed? Assuming it was a Nehalem, you can add 50% straight away for core count and then a large amount again for AVX, probably double or more if the implementation were in a console vs the PC implementation, and then a little extra again for the likely higher clock speed of the 3960X.
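Stacking those factors up as rough peak numbers (clock speeds here are my assumptions, and peak says nothing about sustained):

```python
# Rough peak SP estimate for the scaling argument above. Clocks are assumed;
# SSE does 8 SP FLOPs/cycle/core (4-wide mul + add), AVX doubles that to 16.

nehalem = 4 * 3.2 * 8     # 4 cores, 3.2 GHz, SSE  -> ~102 GFLOPS peak
sbe     = 6 * 3.3 * 16    # 6 cores, 3.3 GHz, AVX  -> ~317 GFLOPS peak

print(f"Nehalem quad (SSE): {nehalem:.0f} GFLOPS SP peak")
print(f"3960X (AVX):        {sbe:.0f} GFLOPS SP peak ({sbe / nehalem:.1f}x)")
```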

It's a silly comparison for judging the relative overall performance of the 2 architectures.
 
The DRAM interface uses the same lines for reads and writes; only one direction is possible at a time. ~21 GB/s is the total max possible for read+write on a 2600; double that for the quad-channel SB-E.

Cheers, that's what I thought. So *in theory* you could use all of your bandwidth on reads, or all on writes, or some combination in between, i.e. you're not limited to half the theoretical maximum for reads and the other half for writes.
 
Did MS sour on IBM after OoO did not make it into Xenon? Were they bad business partners? Are they going to pass on them this time around? (trying to pull info out of your nose shamelessly :LOL: )
Hehe, good try. I believe IBM has been a good partner, but that's just anecdotal, I have no actual knowledge of the current relationship.

Assuming AMD is in an unassailable position simply because their tech currently is the best is a bad assumption. As a hypothetical, if NVidia offered a high performance ARM CPU at a deep discount if the console used an NVidia GPU (a Tegra 4, say), then the technical superiority of the AMD GPU would suddenly not be nearly as important.
 
Your own link from your previous post specifically said the FlexIO was composed of unidirectional lanes. That means in the PS3, with 25.6GB/s total system memory bandwidth, the most likely split is 12.8GB/s read and 12.8GB/s write. Obviously if the entire FlexIO interface were given over to main memory communication this would be significantly higher, but unless you want a console with no GPU that's not the best option.
The XDR info indicates 25.6GB/sec can be attained while doing pure reads or pure writes, and mixed reads/writes lead to ~21GB/s performance, IIRC.

Aggregate bandwidth with FlexIO is higher than 25.6GB/s, but as that involves GPU interaction, not main memory, we're neglecting it.

I never said it did. My question was whether the figure you quoted for measured floating point performance was in double precision or single precision flops. If it was DP then it's clearly not comparable to your figures for Cell at all.
If we asked about other things, such as general purpose performance, the i7 would have a clear edge. If DP is not relevant for present purposes (realtime physics performance), it won't matter in terms of the application discussed.


The DRAM interface uses the same lines for reads and writes; only one direction is possible at a time. ~21 GB/s is the total max possible for read+write on a 2600; double that for the quad-channel SB-E.
Thanks for the clarification

It's a silly comparison for judging the relative overall performance of the 2 architectures.
The idea is performance in realtime physics; each architecture shines at the tasks it was best designed for. The i7 is using Intel's best fab tech; the Cell is a six-year-old 90nm design.

Regarding dedication, we have the following:
For example, we’ve reserved one whole processor on the PS3 for Dolby Digital 5.1 sound
So together with the Sony-reserved core, that's two cores that aren't being used for physics.

With regards to Windows background consumption, it was my understanding that the latest Windows has been more optimized for game performance.
 
Cheers, that's what I thought. So *in theory* you could use all of your bandwidth on reads, or all on writes, or some combination in between, i.e. you're not limited to half the theoretical maximum for reads and the other half for writes.

It's much easier to get full bandwidth with traffic that has long sequential bursts in one direction.
DRAM incurs turnaround penalties when switching between read and write, and since it has a wide parallel interface and is organized into banks and arrays internally, it really likes moving data in big chunks.
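A toy model of that effect (the turnaround penalty and burst sizes are illustrative, not datasheet values):

```python
# Toy model: every read<->write direction switch costs some dead time, so
# long one-way bursts get close to peak while fine-grained interleaving
# does not. Peak and penalty values are illustrative only.

PEAK_GBS = 21.3        # dual-channel DDR3-1333 peak, i.e. 21.3 bytes/ns
TURNAROUND_NS = 10.0   # assumed dead time per direction switch

def effective_gbs(burst_bytes):
    transfer_ns = burst_bytes / PEAK_GBS           # time spent moving data
    return PEAK_GBS * transfer_ns / (transfer_ns + TURNAROUND_NS)

for burst in (64, 512, 4096, 65536):
    print(f"{burst:>6} B per direction switch: {effective_gbs(burst):5.1f} GB/s effective")
```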
 
Assuming AMD is in an unassailable position simply because their tech currently is the best is a bad assumption. As a hypothetical, if NVidia offered a high performance ARM CPU at a deep discount if the console used an NVidia GPU (a Tegra 4, say), then the technical superiority of the AMD GPU would suddenly not be nearly as important.

I guess also if they (you guys) want to use GPGPU, the best bet would likely be NVidia as well. One thing I wondered about was using an off-the-shelf GPU and having them keep it 'current' through 2-3 product cycles like they did with the G92 (8800, 9800, GTS 250, etc.), getting great yields, the ability to make use of the duds in other products, and economies of scale. The same could also apply to AMD, though it doesn't seem to be nearly their style.
 
Eastman, you are purposefully misreading what I wrote; it's becoming Below3D here and I am seriously considering actively reporting it, which I never usually do. What you said is that IP has no intrinsic value (like pretty much everything; that's not how price is actually determined by the market / in a negotiation) and so one should not try to make the most of one's IP. That's not how it works, end of line, for anybody selling IP or anything else for that matter.
Maybe I am misreading you, I dunno, I'm just reading what you wrote.

Look at it this way: AMD isn't going to fab the chips. They are going to take an existing design and modify it for MS. What does it matter to AMD which design MS wants? They will just modify it how MS wants it, collect an up-front fee and a fee for each chip produced. It may actually cost AMD more to go back to an older design and retool it for a fab process it wasn't designed for.

Are you purposefully ignoring that I said GCN could still be used, but that clearly there will be a negotiation which has only one end purpose, price, whether you buy the IP or license it?

I'm sure MS would have bought the best IP that AMD has and paid to keep it exclusive to them. Why would MS want older GPU tech? Was Xenos based on the X1x00 series or the X8x0 series?

Neither. Consider what I just wrote again to Blackowitz: GCN is more efficient, etc., but it comes at a cost, 50% more transistors if you compare the 5770 to the 7770.
The 7770 also has support for newer APIs like DX 11.1.

Why would Nintendo supposedly use an RV700 design in your world? I'll tell you: it's what N considers the best bang for the buck in terms of both licensing costs / buying the IP and bang per transistor, which also translates into production costs.

There are many reasons why.

For a Holiday 2012 launch, the only chip that could be tuned for Nintendo was an RV700 (if that is what they are using).

Other licensing agreements prohibit Nintendo from getting newer IP.

Rumors out there point to AMD having all 3 console contracts. MS might have struck first and gotten an exclusive on GCN, Sony may have tied up something based on VLIW4, and Nintendo may have been left with the oldest of AMD's current designs.

Who knows what happened in the background.
 
Yep, I see that as a huge advantage for AMD. Having the new architecture in the next console would give them quite a nice advantage as the effort of most developers trying to get the most out of that design would also spill over to the PC realm...

hey someone read my post !

Yup, this is why I believe AMD is trying its best to get GCN into the next Xbox.

Hell, I bet AMD is trying to get GCN into the next Nintendo system too.

The benefits far outweigh any upfront monetary gains they would get by trying to force either company into paying more for GCN over the older tech.

Look at it this way


Right now there are 65 million Xbox 360s. Unless MS messes up royally, the next Xbox will do at least as well.

So AMD could get GCN into 65 million Xboxes.

If they get GCN into the next Nintendo system... well, the Wii has sold 95m. Perhaps the Wii U won't sell as well, but I think you guys get the point.

Last gen the majority of games were designed to lead on the Xbox 360. Next gen, if the Xbox has GCN then every game will be optimised for GCN.

Look at all the problems AMD is currently having getting games to run well on their cards. This could largely be solved by having GCN in one or two next-gen consoles.

To top it off, AMD could have 3 years or so of just refining GCN, adding more shader units, ROPs and whatnot as needed, while still keeping the base that is in the consoles. It would change the PC gaming landscape drastically as well, and people would stop saying AMD drivers suck, because all games would be developed for AMD's hardware.
 
Either way, I can't see how GCN should be a given versus either an older architecture that has proved brilliant or something really custom (changing a proven and well-performing design too much sounds risky, depending on how many resources you throw at the design).

GCN has less raw performance/mm² in comparison to the VLIW architecture, that's a fact. A 28nm VLIW chip the size of Tahiti could probably have packed 30% more ALUs.
But performance is much more consistent than before, with a much higher minimum framerate, and overall I don't think that real-world perf/mm² has declined, considering the diminishing returns they were getting as they increased the number of ALUs.
In the future, real perf/mm² for GCN will greatly surpass the old architecture, as more and more engines move to compute shader technology. Look at the performance gains AMD achieved in Civ 5, a game that uses compute shaders heavily. It's more than 65% faster than the older architecture.
Even if in the console space architectures are exploited in a different way, so a VLIW chip may reach more sustained performance closer to its peak, compute is not only bound by FLOPs but also by cache architecture, internal bandwidth and so on. And GCN (or Kepler/Fermi) offers much more on this side.
Microsoft knows where graphics is heading. DirectX 11 introduced DirectCompute, they are working on C++ AMP, an extension for high-performance computing, and many other things. I'm going as far as to say that they won't just have GCN in their console, but probably an even more compute-oriented architecture, for example Sea Islands or, if they launch in 2013, something similar to that year's architecture.
A fully HSA console would greatly benefit the entire AMD business.
 
The XDR info indicates 25.6GB/sec can be attained while doing pure reads or pure writes, and mixed reads/writes lead to ~21GB/s performance, IIRC.

That's the memory itself, not the interface to it. You'd basically need to double the FlexIO lanes to main memory in the PS3 to achieve that kind of theoretical pure read or write performance. But as you note, it would then have very little left over for talking to RSX.

If we asked about other things, such as general purpose performance, the i7 would have a clear edge. If DP is not relevant for present purposes (realtime physics performance), it won't matter in terms of the application discussed.

Yes, but the point is you quoted something like 170 GFLOPS sustained performance for the 3960X and my question was simply: is it SP or DP? Because you're comparing with SP on Cell. Not that it would be a comparable test anyway, due to likely wild differences in workload and optimisation levels. For the record, a 5GHz 3960X has a theoretical SP throughput of 480 GFLOPS and 240 DP GFLOPS. So if it's achieving 170 then it's either DP, or code/workload that's completely unoptimised for Sandy Bridge.
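For reference, the 480/240 figures are straight peak arithmetic, and the commonly quoted per-SPE Cell numbers work out the same way (how many SPEs a game actually gets is my assumption):

```python
# Peak (not sustained) arithmetic behind the figures above.
# Sandy Bridge AVX: 8-wide SP mul + add = 16 SP (or 8 DP) FLOPs/cycle/core.

cores, clock_ghz = 6, 5.0                     # overclocked 3960X as discussed
sp_peak = cores * clock_ghz * 16              # 480 GFLOPS SP
dp_peak = cores * clock_ghz * 8               # 240 GFLOPS DP
print(f"3960X @ {clock_ghz:.1f} GHz: {sp_peak:.0f} SP / {dp_peak:.0f} DP GFLOPS peak")

# Cell SPE: 4-wide SP FMA = 8 FLOPs/cycle at 3.2 GHz = 25.6 GFLOPS per SPE.
game_spes = 6                                 # assumes one SPE reserved, one disabled
print(f"PS3 Cell ({game_spes} SPEs for games): {game_spes * 25.6:.1f} GFLOPS SP peak")
```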

With regards to Windows background consumption, it was my understanding that the latest Windows has been more optimized for game performance.

While I'd like that to be true, it's debatable at best. W7 performs roughly the same as XP in games. But it's not just running games within Windows that saps performance, it's the API. DX11 helps a lot, but I'm guessing the game you're referring to is DX9.
 