PlayStation 4 (codename Orbis) technical hardware investigation (news and rumours)

It does have more bandwidth than the reference HD 7870 card. <<...>> A 7790 has the same GFLOPS rating as a 7850 and the same GTexels/s (even higher triangle setup), but it's way behind in 1080p benchmarks; that's because of the ROPs (and bandwidth).

Not sure where you're heading here, I'm missing the punchline. BTW, the depth-check bandwidth I was referring to could be decreased by using hierarchical Z instead of per-pixel Z checks, which only requires 26 GB/s if I'm not mistaken. So it doesn't need to be that bad. For alpha blending I don't know of any alternative, so that will still cut the fillrate to half its maximum.
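
For what it's worth, here's the shape of that calculation as a quick Python sketch. The 32 ROPs at 800 MHz match the calculation used later in the thread; the 8x8 HiZ tile size and byte counts are my own illustrative assumptions, so the hierarchical-Z line will not land exactly on the 26 GB/s quoted above.

```python
# Back-of-envelope sketch of the ROP traffic discussed above. Constants
# are assumptions for illustration, not confirmed Liverpool behaviour.
ROPS, CLOCK_HZ = 32, 800e6
PIX_PER_S = ROPS * CLOCK_HZ                  # 25.6 Gpixels/s peak

B_COLOR = B_DEPTH = 4                        # bytes per pixel

z_per_pixel = PIX_PER_S * B_DEPTH            # every pixel touches Z
z_hier = PIX_PER_S / (8 * 8) * B_DEPTH       # one coarse value per tile,
                                             # ignoring refinement traffic

color_opaque = PIX_PER_S * B_COLOR           # write-only color
color_blend = PIX_PER_S * 2 * B_COLOR        # read-modify-write: at a fixed
                                             # budget the fillrate halves

for name, bw in [("Z, per-pixel", z_per_pixel), ("Z, hierarchical", z_hier),
                 ("color, opaque", color_opaque), ("color, blended", color_blend)]:
    print(f"{name:>16}: {bw / 1e9:6.1f} GB/s")
```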
 
I wasn't saying you were wrong, just that, based on your same calculation, Liverpool has a bandwidth advantage even compared to an HD 7870 (143 GB/s vs 176 GB/s). And even if you can't fully exploit the raster pipeline's capabilities, it's a better situation. DF is wrong in stating that 16 ROPs are enough for 1080p; every 7790 benchmark proves otherwise.

(It does make a lot less sense without half of the post)
 
I wasn't saying you were wrong, just that, based on your same calculation, Liverpool has a bandwidth advantage even compared to an HD 7870 (143 GB/s vs 176 GB/s). And even if you can't fully exploit the raster pipeline's capabilities, it's a better situation. DF is wrong in stating that 16 ROPs are enough for 1080p; every 7790 benchmark proves otherwise.

How do we know it's the ROPs, and not the bandwidth, or the 1GB framebuffer of the 7790 vs 2GB in the 7850, crippling the 7790 vs the 7850?

A 660 Ti with 24 ROPs outperforms a GTX 580 with 48 ROPs at 1080p. Where's the sweet spot? Dunno offhand. We'd need to bench similar cards with 16 and 32, which I can't see happening (they would at least be from different vendors).
 
That would be possible if the GPU had context switching, but that doesn't seem to be the case; it's something AMD is reportedly cooking up for GCN 3. As of today a CU can run graphics threads or compute threads, but not both. So if you have a CU running rendering threads at 70% efficiency, you can't run compute threads to fill the stalls. What Sony is looking for with the 64 compute queues and Onion+, as far as I understand, is keeping the CUs that work purely on compute busy with many threads, reducing latency in data feeding (that's the Onion+ part), and getting that 70% efficiency on the compute-running CUs as well.
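
As a toy illustration of that argument (nothing here is confirmed hardware behaviour: the 14+4 split is only the rumoured graphics/compute balance, and the efficiency numbers are made up):

```python
# Toy utilisation model of the dedicated-CU idea. "Efficiency" here just
# means the fraction of cycles a CU's SIMDs actually issue work.
def busy_cus(gfx_cus, gfx_eff, comp_cus, comp_eff):
    """CU-equivalents of ALU work done per cycle across all 18 CUs."""
    return gfx_cus * gfx_eff + comp_cus * comp_eff

# Without graphics/compute context switching on a single CU, the ~30% of
# stall cycles on the graphics CUs is simply lost. The question is how
# well the dedicated compute CUs can be fed:
print(busy_cus(14, 0.70, 4, 0.40))   # compute starved              -> 11.4
print(busy_cus(14, 0.70, 4, 0.70))   # 64 queues + Onion+ keep fed  -> 12.6
```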


Sony applied for a context switching patent; I don't know if that is relevant to my theory, or if the patent is for another purpose and can or can't be used for that.
 
How do we know it's the ROPs, and not the bandwidth, or the 1GB framebuffer of the 7790 vs 2GB in the 7850, crippling the 7790 vs the 7850?

A 660 Ti with 24 ROPs outperforms a GTX 580 with 48 ROPs at 1080p. Where's the sweet spot? Dunno offhand. We'd need to bench similar cards with 16 and 32, which I can't see happening (they would at least be from different vendors).


A 1GB framebuffer limit should be easy to spot if the problem is there: in most games there will be no difference, except in those that require a lot of RAM, if I'm not mistaken.

http://www.anandtech.com/show/6359/the-nvidia-geforce-gtx-650-ti-review/6

Look at the review of the 650 Ti: the 1GB version runs all games like the 2GB version, but when it comes to Skyrim the 2GB version almost doubles the 1GB version's performance.

The same goes for the 7850 1GB vs 2GB models.
 
How do we know it's the ROPs, and not the bandwidth, or the 1GB framebuffer of the 7790 vs 2GB in the 7850, crippling the 7790 vs the 7850?

A 660 Ti with 24 ROPs outperforms a GTX 580 with 48 ROPs at 1080p. Where's the sweet spot? Dunno offhand. We'd need to bench similar cards with 16 and 32, which I can't see happening (they would at least be from different vendors).

That's a different architecture (and GPU generation), and the 660 Ti has better pixel (higher clock) and texel capabilities.

As for the memory limitation, we can check this month once the upcoming 2GB versions are out, but I have been trying to OC the memory up to 1650 MHz on a 7790 and you don't get any boost at all in non-VRAM-limited games (less than 1024 MB in Afterburner).

Core clock OC does work, though. I have little doubt that GPU is crippled by its pixel fillrate: still only 17.2 GPixels/s even with a 1075 MHz core OC, while a reference 7850 has 27.5 at 860 MHz. It's a really big difference.
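
A quick sanity check of those numbers in Python (reference specs assumed from public reviews; GDDR5 transfers data at 4x the clock Afterburner reports):

```python
# Fillrate: ROPs x core clock.
def gpixels_per_s(rops, core_mhz):
    return rops * core_mhz / 1000.0

print(gpixels_per_s(16, 1075))   # 7790 @ 1075 MHz OC -> 17.2
print(gpixels_per_s(32,  860))   # 7850 reference     -> 27.5

# Memory bandwidth: GDDR5 clock x 4 (effective rate) x bus width.
def mem_gb_per_s(mem_mhz, bus_bits):
    return mem_mhz * 4 * bus_bits / 8 / 1000.0

print(mem_gb_per_s(1500, 128))   # 7790 reference -> 96.0 GB/s
print(mem_gb_per_s(1650, 128))   # 1650 MHz OC    -> 105.6 GB/s, i.e. ~+10 GB/s
```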
 
That's a different architecture (and GPU generation), and the 660 Ti has better pixel (higher clock) and texel capabilities.

As for the memory limitation, we can check this month once the upcoming 2GB versions are out, but I have been trying to OC the memory up to 1650 MHz on a 7790 and you don't get any boost at all in non-VRAM-limited games (less than 1024 MB in Afterburner).

I think that's sort of the point, lol. Of course there's no difference in non-VRAM-limited games!

I did look into the 7850 1GB vs 2GB a little bit, and there doesn't seem to be a difference on most titles, but on some there is.

It's impossible to narrow down completely, but for some reason I suspect memory bandwidth is the biggest issue. It should mostly come down to that or the ROPs (unknown which), with some effect from the 1GB framebuffer.

It also comes into play, IMO, that the PC is an arena where hardware bends around software, and not vice versa, yet there are limits to that.

If you suddenly doubled the shaders on a 7850, leaving everything else the same so that it's now 3.6 TFLOPS, PC titles would not see much speedup, only the cases that were shader-limited to some extent. But if you did the same in a console, programmers would use up all those new shaders, bending the software to the hardware.
 
I think that's sort of the point, lol. Of course there's no difference in non-VRAM-limited games!

What I meant was that in either case there's no difference, VRAM-limited or not; in this particular case (no VRAM limitation) it's not about the amount of memory, which just leaves ROPs or bandwidth (again, a 0.0% increase with +10 GB/s from a memory-only OC, no core clock change, though maybe at +20/+30 it makes a sudden magical jump..). The core clock does have a direct influence on fillrate, as you're certainly aware; the boost in performance is there, slight, but it's there.

Of course it's a PC environment, so maybe the conclusions aren't as clear, or can't be fully applied to a console environment, but in the end it gives a good idea of the obvious bottlenecks, I think.
 
What I meant was that in either case there's no difference, VRAM-limited or not; in this particular case (no VRAM limitation) it's not about the amount of memory, which just leaves ROPs or bandwidth (again, a 0.0% increase with +10 GB/s from a memory-only OC, no core clock change, though maybe at +20/+30 it makes a sudden magical jump..). The core clock does have a direct influence on fillrate, as you're certainly aware; the boost in performance is there, slight, but it's there.

Of course it's a PC environment, so maybe the conclusions aren't as clear, or can't be fully applied to a console environment, but in the end it gives a good idea of the obvious bottlenecks, I think.

But the speedup with core clock could come from the shaders being the bottleneck, or anything else too. Hard to isolate, as you say.

I guess you could sort of start to isolate memory bandwidth if you could get those benches, bandwidth OC vs core OC, on titles you know aren't VRAM-limited. Starting to get pretty complex though.

It seems that without fail, when I see an OC test comparing a memory OC vs a core OC, the core OC is more effective. Dunno why that is, as it doesn't seem logical. The 650 Ti BOOST, for example, seems to get most of its performance bump from its extra bandwidth.
 
But the speedup with core clock could come from the shaders being the bottleneck, or anything else too. Hard to isolate, as you say.

I guess you could sort of start to isolate memory bandwidth if you could get those benches, bandwidth OC vs core OC, on titles you know aren't VRAM-limited. Starting to get pretty complex though.

It seems that without fail, when I see an OC test comparing a memory OC vs a core OC, the core OC is more effective. Dunno why that is, as it doesn't seem logical. The 650 Ti BOOST, for example, seems to get most of its performance bump from its extra bandwidth.

To me, shader performance is mostly related to the normalized GFLOPS value (true enough if you compare within the same unified architecture), along with practical scheduling efficiency.

The 7790 lacks 2 CUs compared to a 7850, but the theoretical peak is the same at reference clocks: it's 1792 GFLOPS vs 1761 (the 7790 is even higher, also in MTriangles/s and slightly in Texels/s).
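
The arithmetic behind those figures, for reference (each GCN CU has 64 ALU lanes doing one fused multiply-add, i.e. 2 flops, per cycle):

```python
# Peak GFLOPS = CUs x 64 lanes x 2 flops x clock.
def gflops(cus, core_mhz):
    return cus * 64 * 2 * core_mhz / 1000.0

print(gflops(14, 1000))   # HD 7790: 14 CUs @ 1000 MHz -> 1792.0
print(gflops(16,  860))   # HD 7850: 16 CUs @  860 MHz -> 1761.3
```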
 
How do we know it's the ROPs, and not the bandwidth, or the 1GB framebuffer of the 7790 vs 2GB in the 7850, crippling the 7790 vs the 7850?

A 660 Ti with 24 ROPs outperforms a GTX 580 with 48 ROPs at 1080p. Where's the sweet spot? Dunno offhand. We'd need to bench similar cards with 16 and 32, which I can't see happening (they would at least be from different vendors).

I would guess AMD knows the sweet spot. We can try to reason it out, but ultimately it is their GCN tech, and they make 16 and 32 ROP parts. It is safe to assume they know the bottlenecks and would advise Sony accordingly. I'd love to reverse-engineer the technical reasons, but like you said, you would need the same part with 16 and 32 ROPs, and they don't make it. They have a line in the sand: the low end has 16 and the high end 32, and other things also change when moving up and down the performance scale.
 
Surely a hybrid drive would be a very bad choice, as it would limit drive upgrade options severely.

What happens if they die off in 3 years?

I think integrating solid state memory much closer to the processors would be a much better option, both in terms of performance and flexibility.

Jeff Rigby PM'ed me on GAF. I will just relay a short message.

He thinks the flash RAM will come in handy in low-power mode. In that case, the ARM CPU will remain awake and do its thing without spinning up the HDD. Apparently there will be strict EU guidelines on low-power mode.

I haven't done any reading in this area.
 
Jeff Rigby PM'ed me on GAF. I will just relay a short message.

He thinks the flash RAM will come in handy in low-power mode. In that case, the ARM CPU will remain awake and do its thing without spinning up the HDD. Apparently there will be strict EU guidelines on low-power mode.

I haven't done any reading in this area.

I think the EU guideline is 0.5 watts.
But it allows preprogrammed wake-ups and a higher power draw when doing something like background downloading, for example.
 
I'm still not sure I understand it completely, but it looks like the directive can almost be ignored for both the PS4 and the 720. As long as the console has a reason to be up, it's not considered standby. Even the GDDR5 refresh qualifies.
http://ec.europa.eu/energy/efficien...tion/guidelines_for_smes_1275_2008_okt_09.pdf
If any function beyond the "standby" functions is provided, then the corresponding operating condition is not considered "standby" any more. Examples include:

  • network communication functions through network interfaces such as LAN, USB, RS-232C, Wi-Fi, HDMI and infrared communications other than that of remote control.
  • network reactivation functions such as Wake on LAN (e.g. PC’s may have Wake On LAN activated in ACPI S4 and S5 modes, see below).
  • volatile memory preservation functions enabling instant reactivation without booting (e.g. ACPI S3 for PCs)
  • sleep mode as defined in ENERGY STAR for those conditions which, e.g., maintain network connectivity, or conditions providing enhanced reactivation functions as those defined under "reactivation function" in the Regulation.
  • the quick restart functions with OS active status such as present in equipment with hard-disk (e.g. DVD recorder, mini compo with HD)
  • security alarm activation power supplying functions supporting other equipment (e.g. TV sets supplying power to antennas, video/DVD recorders supplying power to the RF signal from antenna towards
  • TV set battery presence and power level detector after completion of battery charging
  • an active network download mode, such as present in e.g. DVD recorders that receive updates of the Electronic Program Guide at some pre-programmed moments in time.
 
I think the EU guideline is 0.5 watts.
But it allows preprogrammed wake-ups and a higher power draw when doing something like background downloading, for example.

I think he is referring to the new law for gaming consoles, which is something like 45 watts peak when the console is not being used for gaming; I believe the 0.5 watt limit you mentioned is only for standby mode.

/speculation/ It's also quite possibly part of the reason that Nintendo went low-power with the Wii U: they jumped the gun in conforming to the new law before it was finalised, and then Sony and Microsoft got the law changed to apply only when the consoles are not gaming, allowing them to keep their ~200 watt consoles.
 
Possible, I don't keep up with this stuff. :)

Anyway, 800 MHz * 32 ROPs * (4B color + 4B depth) = 204.8 GB/s, while the reported rate is only 176 GB/s, right? So in that case it can have 32 ROPs that just write or read color without Z, and therefore won't feature free alpha, depth checks etc. That only requires half the bandwidth and leaves some room for texture reads; correct me if I'm wrong.
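
Spelling that calculation out (the worst case where every pixel writes 4 B of color and touches 4 B of depth every ROP cycle):

```python
# Peak ROP traffic vs the reported GDDR5 bandwidth.
rops, clock_hz = 32, 800e6
print(rops * clock_hz * (4 + 4) / 1e9)   # 204.8 GB/s > 176 GB/s of GDDR5
print(rops * clock_hz * 4 / 1e9)         # color-only: 102.4 GB/s
```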

Actually, since modern ROPs typically optimize for 64-bit color and more than one sample, they should be capable of way more throughput than that, by 2-4 times.

However, they also have the GPU L2 caches between them and the memory interface. These often provide more than 2x amplification of the memory bandwidth, depending a lot on the load, of course.
 
Actually, since modern ROPs typically optimize for 64-bit color and more than one sample, they should be capable of way more throughput than that, by 2-4 times.

I understand, but the memory isn't capable of that amount of throughput anyway. This thread did make me realize that a fine-granularity ROP setup isn't that bad at all: blending or multisampling isn't needed all the time, so keeping resources idle for those cases is a waste of silicon.

However, they also have the GPU L2 caches between them and the memory interface. These often provide more than 2x amplification of the memory bandwidth, depending a lot on the load, of course.

Isn't that amplification only applicable to local (and more or less discrete) reads/writes? I mean, these cache sizes are far from framebuffer sizes, right? So I can see it helping when doing DOF, but not when multipassing the way DOOM 3 did (separately blending each light into the framebuffer).
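
Rough sizes behind that point; the L2 figure is an assumption for a Pitcairn-class GCN part (4 x 128 KB slices, at the top of the whitepaper's stated range), not a confirmed number:

```python
# 1080p render targets vs a plausible GCN L2 size.
color_bytes = 1920 * 1080 * 4                # 32-bit color target
depth_bytes = 1920 * 1080 * 4                # 32-bit depth/stencil
print((color_bytes + depth_bytes) / 2**20)   # ~15.8 MB of render targets
print(512 / 1024)                            # vs ~0.5 MB of L2 (assumed)
```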

The GCN diagrams don't clearly show the L2 being an intermediate step for ROP output.

http://www.amd.com/us/Documents/GCN_Architecture_whitepaper.pdf

I didn't read through the entire paper, but aren't Z$ and C$ cache blocks?
 
Michiel Van Der Leeuw, technical director at Guerrilla Games, gives his own take on the system's efficiency.
http://www.videogamer.com/ps4/killz...ce_bottlenecks_claims_killzone_developer.html
"The fact that the best pieces of hardware are also devised from, or optimised versions of, the stuff we find in PCs doesn't make it any less a console," Van Der Leeuw explained. "A PC is a number of parts that also [have] bridges in-between, where there are inefficiencies that may [come in if they're not] exactly the right match.
"We've got the right amount of memory, video card; everything's balanced out. It was a very conscious effort to make sure that – with the speed of the memory, the amount of compute units, the speed of the hard drive – there would not be any bottlenecks.

"I think it was for more than a year that we knew the main ingredients and there was just discussion after discussion trying to find a bottleneck. Take a look at this design; try to find the bottleneck."
So the 8GB of GDDR5 RAM is not excessive after all. I wonder what crazy things they will do with it that 4GB couldn't allow.
 
GG, Santa Monica, Naughty Dog, and Sony's studios in general amazed me with just 512MB.
With 8GB they will scare me.

I wonder if, with an "easier to use/code" console, more devs will be able to achieve the quality that, on PS3, was only achieved by Sony's studios.
 