WTH IGN?! They posted the spec analysis from Major Nelson!!

Jawed said:
Nite_Hawk said:
Just the raw output pixels will be 7.68MB/frame at 1600pixels*1200pixels*4bytes/pixel.
You have to count Z/stencil, which is normally another 4 bytes per pixel.

Of course as you say, there is much more to it than that (overdraw, aa samples, etc), but how did you arrive at 59MB/frame?
Instead of 15MB per frame you have to account for the 4xAA samples, which in uncompressed form are 4x (colour + Z/stencil = 8 bytes) per pixel, i.e. 32 bytes per pixel, or 59MB.
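If anyone wants to sanity-check that arithmetic, here's a quick back-of-envelope using the figures above (a sketch only - colour, Z/stencil and sample counts as stated, no compression):

#include <cstdio>

int main() {
    const double width = 1600, height = 1200;
    const double colour_bytes = 4;     // 8:8:8:8 colour
    const double z_stencil_bytes = 4;  // 24-bit Z + 8-bit stencil
    const double aa_samples = 4;       // 4xAA, uncompressed

    double per_pixel = (colour_bytes + z_stencil_bytes) * aa_samples;  // 32 bytes
    double per_frame = width * height * per_pixel;

    printf("%.0f bytes/pixel, %.1f MiB/frame\n", per_pixel, per_frame / (1024.0 * 1024.0));
    // prints: 32 bytes/pixel, 58.6 MiB/frame - i.e. the ~59MB quoted
    return 0;
}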

Check the back-buffer calculation here:

http://www.beyond3d.com/reviews/sapphire/512/index.php?p=01#why

Jawed

Ok, I just wanted to know what you were including to get to 59MB/frame. What additional expense would AA filtering add besides the 4x that was included? Also (I know you said you didn't include this), Z compression seems like a pretty basic feature these days for anything but the lowest-end cards.

Nite_Hawk
 
DemoCoder said:
And you're forgetting that the AA downfilter step doesn't occur until the final frame is done, requires minimal bandwidth, and can be handled on "scanout".
But every time a new fragment enters the ROP you have to re-evaluate the geometry coverage of all the AA samples, i.e. z-sort to determine the resulting colour of the 4 AA samples. The maths isn't difficult but the bandwidth hit is there alright.

Moreover, z-testing and overdraw are not as simple as you make out. Z-tests are accelerated by hierarchical-Z,
Agreed.

and overdraw is helped by Early-Z rejection.
Agreed. But 3x-5x overdraw seems to be the commonly quoted range.

The only real serious bottleneck for the RSX I'll grant you is alpha blending, but that totally depends on the workload, and all it does is influence the types of workloads developers throw at the GPU. For example, the PS2 excelled at alpha blending, hence alpha blend effects were used a lot more.
I really know sod all about the PS2 so I can't relate to it, I'm afraid.

The idea that an RSX level GPU can't handle 1280x720 is sheer nonsense.
[I'm not sure which thread we're in as I type this:] My argument is that 1080p fp16 blended HDR with 4xAA is prolly beyond the limits of RSX. If RSX is 2x as fast as NV40 and can do 2xAA with HDR, then that means that 720p fp16 blended HDR with 2xAA should be playable. FP16 blended HDR is a killer on NV40.
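To put some rough numbers on why that looks so heavy (a sketch only - it assumes 8 bytes/pixel for FP16 colour, 4 bytes Z/stencil and uncompressed AA samples; none of this is RSX-specific):

#include <cstdio>

// Back-buffer size under the assumptions above (FP16 colour + Z/stencil,
// uncompressed multisamples).
static double frame_mib(double w, double h, int samples) {
    const double bytes_per_sample = 8.0 + 4.0;
    return w * h * bytes_per_sample * samples / (1024.0 * 1024.0);
}

int main() {
    printf("1080p, FP16, 4xAA: ~%.0f MiB per frame\n", frame_mib(1920, 1080, 4)); // ~95 MiB
    printf(" 720p, FP16, 2xAA: ~%.0f MiB per frame\n", frame_mib(1280,  720, 2)); // ~21 MiB
    return 0;
}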

Without HDR I don't doubt RSX will be very playable at 720p 4xAA - hell 6800 Ultra is pretty much unbustable at that res already.

Like I said, previous generation GPUs have already demonstrated nearly 3Megapixels/s @ HD resolutions and 2xFSAA. Have you been playing all your PC titles at 640x480 these years? Battlefield 2 is coming out soon, has next-gen level polygon counts, and I assure you it will run at 720p @ 60fps on top-end GPUs.
When my recently purchased 9800Pro's fan stopped spinning I was relegated back to my trusty, original, Radeon 32MB SDR, so yes I'm back to gaming at low res in OFP (but with all the options on high, except I have to keep the draw distance to around 500m) and HL-2 (all the options on minimum!). I haven't even tried the other games (not many...).

I have to say I'm deeply impressed by next-gen console games. I think PC has about 2 years in the wilderness, primarily because the CPUs are so shit - we should have been running dual-core processors for about 2 years by now...

Anyway, it's with much relief that devs are going to be forced to take multi-threaded game apps seriously now.

Jawed
 
Nite_Hawk said:
Ok, I just wanted to know what you were including to get to 59MB/frame. What additional expense would AA filtering add besides the 4x that was included?
Sadly that's the $64,000 question. 4x overdraw, 2,000,000 triangles on screen, 1280x720p - erm, I dunno.

Also (I know you said you didn't include this), Z compression seems like a pretty basic feature these days for anything but the lowest-end cards.
There's compression all over the place in GPUs. It's really hard to find any data on meaningful averages in various scenarios.

This is why the whole game of predicting how useful the EDRAM on R500 will be is so tough - and why trying to determine whether RSX is bandwidth limited in non-HDR rendering is so tough...

Here's a screenshot from The Project by Crytek. Lots of high-end shaders, soft shadowing, multiple lights, lots of triangles, 4xAA:

http://www.cupidity.f9.co.uk/Demo0.jpg

My 9800 Pro (running on an Athlon 64 3500+) was struggling somewhat. If RSX is 5x as fast, that'll be playable :D Is that next-gen graphics?...

Jawed
 
Well, the 9800 Pro has half the bandwidth, since texture lookups and the framebuffer contend with each other. 4xFSAA also cuts fillrate in half, so you're down to 1500Mpix/s. Finally, shader-wise, the RSX is probably 3x+ faster (3x as many ALUs plus a higher clock).
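Rough numbers for that fillrate figure (assuming the 9800 Pro's 8 pixel pipes at a 380MHz core clock, with 4xFSAA halving the colour write rate):

#include <cstdio>

int main() {
    const double pipes = 8, core_mhz = 380;   // 9800 Pro assumptions
    double peak = pipes * core_mhz;           // ~3040 Mpix/s
    printf("peak %.0f Mpix/s, with 4xFSAA ~%.0f Mpix/s\n", peak, peak / 2);
    return 0;
}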

Hopefully, 4xFSAA on RSX isn't a fillrate hit.
 
The number of FSAA samples usually scales with bandwidth; given there isn't any leap in bandwidth (the opposite for RSX), I have my doubts this element would have changed.
 
Dave, I concur. However I have heard there have been some ROP changes related to HDR efficiency. On the other hand, the # of ROPs or # of samples a ROP can write haven't always been balanced with bandwidth on these GPUs. :0

I would say that the RSX's bandwidth has to be viewed carefully. It is a step back from the 6800 Ultra, being 128-bit GDDR3; on the other hand, it has a segmented bus with a second 20+gb/s rate, so in totality, it has more than a 6800 Ultra and a less contended bus. It is more like the Voodoo2 :) In reality it has a "256-bit-like" bus that probably has superior latency than a single 256-bit GDDR3 bus. Given that the CPU won't use anywhere near the system's RAM bandwidth, I can see XDR being dedicated for textures and geometry, and GDDR3 being dedicated for framebuffer and render-to-texture textures.

In other words, I think the RSX should be faster, easily, than a 6800 Ultra, not just because of shader ALUs, but because the memory layout is more efficient.
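Rough totals, using the publicly quoted figures (assumptions, not measurements - 128-bit GDDR3 at 700MHz for RSX, the 20GB/s + 15GB/s link numbers from the diagram, and a 6800 Ultra's 256-bit GDDR3 at 550MHz for comparison):

#include <cstdio>

int main() {
    double rsx_gddr3  = 128 / 8.0 * 0.7 * 2;     // 22.4 GB/s (128-bit, 700MHz DDR)
    double link_read  = 20.0, link_write = 15.0; // Cell<->RSX figures as quoted
    double nv40_ultra = 256 / 8.0 * 0.55 * 2;    // 35.2 GB/s (256-bit, 550MHz DDR)

    printf("RSX: %.1f GB/s local + %.1f/%.1f GB/s link = %.1f GB/s aggregate\n",
           rsx_gddr3, link_read, link_write, rsx_gddr3 + link_read + link_write);
    printf("6800 Ultra: %.1f GB/s on a single contended bus\n", nv40_ultra);
    return 0;
}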
 
DemoCoder said:
I would say that the RSX's bandwidth has to be viewed carefully. It is a step back from the 6800 Ultra, being 128-bit GDDR3; on the other hand, it has a segmented bus with a second 20+gb/s rate,
Sadly, it's only 15GB/s.

[Would you please stick to using B for byte, not b (which is for bit)?]

so in totality, it has more than a 6800 Ultra and a less contended bus. It is more like the Voodoo2 :) In reality it has a "256-bit-like" bus that probably has superior latency than a single 256-bit GDDR3 bus. Given that the CPU won't use anywhere near the system's RAM bandwidth, I can see XDR being dedicated for textures and geometry, and GDDR3 being dedicated for framebuffer and render-to-texture textures.
I agree, all this should work.

Jawed
 
DemoCoder said:
Dave, I concur. However I have heard there have been some ROP changes related to HDR efficiency. On the other hand, the # of ROPs or # of samples a ROP can write haven't always been balanced with bandwidth on these GPUs. :0

I wouldn't really say that it follows that because there have been HDR changes there will be FSAA changes (in terms of the number of samples written per cycle).

Given that the CPU won't use anywhere near the system's RAM bandwidth, I can see XDR being dedicated for textures and geometry, and GDDR3 being dedicated for framebuffer and render-to-texture textures.

It might not use the bandwidth, but it's more likely to use the space. Segmenting things like that means you'll be eating your system memory whilst wasting the graphics RAM - I'd say that this would be the exception rather than the norm.
 
One can use GDDR3 bandwidth just for framebuffer read/write... but then what about the remaining 200MB of memory? ;)
The best solution, imho, is to evenly split frame buffer bandwidth between XDR and GDDR RAM: color buffer in one memory pool and Z-buffer in the other, or 2 channels of the color buffer in one memory pool and the remaining 2 channels + Z-buffer in the other memory pool.
Low-bandwidth data would then be stored in VRAM.
Obviously I don't know if RSX can read/write a split color/Z-buffer or even multiple render targets, but from what we know this should be possible.
Developers will have to distribute bandwidth requirements between memory pools to achieve maximum performance and to use all the available memory.
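As a purely hypothetical illustration of that kind of split (made-up workload: 720p, 4xAA, 4x overdraw, 60fps, colour and Z each read and written per sample):

#include <cstdio>

int main() {
    const double pixels = 1280.0 * 720.0;
    const double samples = 4, overdraw = 4, fps = 60;
    const double bytes = 4, read_write = 2;   // 4-byte colour or Z, read + write

    double colour_gbs = pixels * samples * overdraw * bytes * read_write * fps / 1e9;
    double z_gbs      = colour_gbs;           // same per-sample cost assumed for Z

    printf("colour ~%.1f GB/s -> one pool (e.g. GDDR3)\n", colour_gbs);
    printf("Z      ~%.1f GB/s -> the other pool (e.g. XDR)\n", z_gbs);
    printf("vs ~%.1f GB/s all hitting a single pool\n", colour_gbs + z_gbs);
    return 0;
}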
 
I'd say that if you look at the average game engine, most of the system memory is taken up by textures and geometry, even on PC, where textures get duplicated in both system RAM and graphics RAM. If you just count up the space used by auxiliary data structures (e.g. BSP, AI, game logic) it's not that much. A typical unpacked BSP in HL2, for example, eats up about 10MB.
 
Mmmm, well, talking to "A N Other" developer (who is now working on the PS3) he mentioned that they would generally budget RAM allocation at 50% for system and 50% for graphics previously; not sure that is representative of all developers.
 
50% of XDR, or 50% of total? I don't profess to know as much as that developer, but I think it will be a while before people figure out the right balance, which is one of the chief things I think the XB360 has going for it: easier development. I simply don't believe Sony's tales about easy PS3 development.

I honestly don't see the main engine eating up that much heap space that isn't related to 3D media, except for sounds. Maybe some PS2 developers can comment, but I see textures and geometry eating up the most amount of space.
 
DemoCoder said:
which is one of the chief things I think the XB360 has going for it: easier development. I simply don't believe Sony's tales about easy PS3 development.

Why? Do you think chopping up your game loop into multiple threads is easier than keeping it a single thread and batching out your geometry calculations?
 
It's easier to program 6 granular threads on general purpose HW than to program 2 general purpose threads, and 7 specialty threads in a different instruction set, with different RAM, different memory constraints, etc

Extracting max performance from CELL and RSX will be more difficult than from the XBCPU IMHO. A CELL SPE doesn't even have a dot product instruction, so achieving 1 dot/cycle on an SPE requires a lot of workarounds.
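To show what that workaround typically looks like (a plain C++ sketch of the data layout only, not SPU code - the point is that a structure-of-arrays layout lets 4-wide multiply-adds produce four dot products at once instead of needing a horizontal add per vector):

#include <cstdio>

struct Vec3AoS { float x, y, z; };          // one vector per element (array-of-structures)
struct Vec3SoA { float x[4], y[4], z[4]; }; // four vectors per element (structure-of-arrays)

// AoS: each dot product ends with a horizontal add - the step an SPE has no
// single instruction for.
float dot_aos(const Vec3AoS& a, const Vec3AoS& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// SoA: per lane it's just multiply-adds, so on a 4-wide FMA unit three
// instructions yield four independent dot products - roughly 1 dot/cycle.
void dot_soa(const Vec3SoA& a, const Vec3SoA& b, float out[4]) {
    for (int i = 0; i < 4; ++i)
        out[i] = a.x[i] * b.x[i] + a.y[i] * b.y[i] + a.z[i] * b.z[i];
}

int main() {
    Vec3SoA a = {{1, 0, 0, 1}, {0, 1, 0, 1}, {0, 0, 1, 1}};
    Vec3SoA b = {{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}};
    float d[4];
    dot_soa(a, b, d);
    printf("%g %g %g %g\n", d[0], d[1], d[2], d[3]); // 1 6 11 24
    return 0;
}

The restructuring is the workaround: your data has to arrive in SoA form (or be swizzled into it), which is exactly the kind of rework existing engines weren't written for.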
 
Jawed said:
DemoCoder said:
I would say that the RSX's bandwidth has to be viewed carefully. It is a step back from the 6800 Ultra, being 128-bit GDDR3; on the other hand, it has a segmented bus with a second 20+gb/s rate,
Sadly, it's only 15GB/s.

I think he's referring to read bandwidth, since the Cell<->RSX interlink is 35GB/s total - that's 20GB/s read & 15GB/s write, unless I'm mistaken.

DemoCoder: That's an interesting idea about splitting up storage to optimize use of the aggregate bandwidth. Cell can read from VRAM all the same, too. BTW, what did you hear about the ROPs being optimized for HDR? In what way? RSX is still such a mystery. Only a week has passed, but I'm impatient. :devilish:

Oh yeah, any idea what the rest of that FlexIO bw is gonna be used for? It's supposed to be a 75GB/s bus, but the diagram Sony showed only looks to use 40GB/s of it. :? PEACE.
 
DemoCoder said:
It's easier to program 6 granular threads on general purpose HW than to program 2 general purpose threads, and 7 specialty threads in a different instruction set, with different RAM, different memory constraints, etc

While the memory constraints are important, everything else should just be C++ functions and objects. And many long concurrent threads are a bitch to manage.

Extracting max performance from CELL and RSX will be more difficult than from the XBCPU IMHO. A CELL SPE doesn't even have a dot product instruction, so achieving 1 dot/cycle on an SPE requires a lot of workarounds.

Does the XBCPU have a one-cycle dot product? But that doesn't even matter, as you have more than twice as many SPEs as you have XBCPU cores, disregarding the program and OS threads.
 
MechanizedDeath said:
Oh yeah, any idea what the rest of that FlexIO bw is gonna be used for? It's supposed to be a 75GB/s bus, but the diagram Sony showed only looks to use 40GB/s of it. :? PEACE.

They just didn't use it all, I presume; 70GB/s was overkill anyway for a one-Cell system.
 
C++ isn't magic and generating optimal code for an SPE is much harder than generating it for a modern desktop processor. Moreover, writing multithreaded code and dealing with concurrency has always been a hard problem, even in languages designed to deal with it (like Erlang, Occam, Concurrent Clean, etc)

For example, if you think you can just recompile the Havok, NovodeX, or ODE physics engines for the SPEs because they are C++, you are in for a rude awakening.
 
DemoCoder said:
C++ isn't magic and generating optimal code for an SPE is much harder than generating it for a modern desktop processor.

Would that be much harder than generating code for the custom PowerPC variant? And don't you think that using compilers nowadays gives you much more bang for your buck than having your programmers learn and use assembler?

Moreover, writing multithreaded code and dealing with concurrency has always been a hard problem, even in languages designed to deal with it (like Erlang, Occam, Concurrent Clean, etc)

Exactly. So try to stay away from that. Just use a single, serial thread to manage everything and do the calculations by uploading data blocks to the other units. That saves you a lot of headaches.

For example, if you think you can just recompile the Havok, NovodeX, or ODE physics engines for the SPEs because they are C++, you are in for a rude awakening.

Sure, you need to split, rewrite and batch the functions that execute the calculations. So what if that makes it run seven times as fast with less overhead?
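For what it's worth, that "single manager thread, batch out data blocks" pattern looks roughly like this (an illustrative C++ sketch only - std::thread stands in for a worker unit here; a real PS3 version would go through the SPE runtime and DMA instead):

#include <cstddef>
#include <cstdio>
#include <thread>
#include <vector>

struct Block { const float* in; float* out; size_t count; };

// The batched calculation each unit runs on its block (placeholder work here).
void process_block(Block b) {
    for (size_t i = 0; i < b.count; ++i)
        b.out[i] = b.in[i] * 2.0f;
}

int main() {
    const size_t N = 1 << 16, kUnits = 6;
    std::vector<float> in(N, 1.0f), out(N);

    // One serial manager: slice the data and hand a block to each unit.
    std::vector<std::thread> units;
    const size_t per = N / kUnits;
    for (size_t u = 0; u < kUnits; ++u) {
        size_t count = (u == kUnits - 1) ? N - u * per : per;
        units.emplace_back(process_block,
                           Block{in.data() + u * per, out.data() + u * per, count});
    }
    for (auto& t : units) t.join();   // manager waits, then carries on serially

    printf("out[0]=%g, out[last]=%g\n", out[0], out[N - 1]);
    return 0;
}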
 