Observations, thoughts and questions about X360 and PS3
scificube
30-Sep-2005, 20:18
I have made many observations concerning the PS3 and X360 and have come to form some opinions about what I see. I would like to put out what I’m seeing and find out whether I am at least justified in my thinking and whether I need to consider other things. I will warn those who would get upset about it: yes, I do see a lot of advantages for the PS3. I expect this may rub some people the wrong way, so I invite you all to give me your perspective on things. It’s not like I’m going to disagree with facts or have my mind made up. The primary purpose of this thread is to see if I have framed things correctly, and secondarily to see if any interesting things pop up during the discussion, if there is any…I hope so, I really do.
Oh…I am forced to apologize for how long this post is. Most of it is stuff one may already know, so again I apologize in advance for that. I felt it better, though, to explain well where I was coming from rather than simply spouting off a string of declarative statements. For one, I am not so brash as to think I am an authority on anything, so making declarations just seems wrong for me to do. The other issue is that if my thinking is indeed off, I need to have my thinking out there so people can see where exactly it went bad and correct me at the right spot, so that I can better understand what they’re saying.
Next-Gen CPUs:
What I would like to note is that Xenon has 3 cores while Cell has 8 active cores. What I have begun to focus on is that this means Xenon can handle 3 HW threads while Cell can handle 8 HW threads. I think this is important when thinking about execution resources. Logical threads on a core end up fighting for execution resources while HW threads do not. Xenon’s cores can each handle 2 logical threads, allowing Xenon to handle 6 threads at a time. Cell’s PPE supports 2 logical threads but its seven SPUs support only one HW thread each. As I don’t know the PPC equivalent, I am forced to describe Xenon’s cores and Cell’s PPE as being hyper-threaded. The classification would seem to fit in my mind as well. This seems a very significant thing to me when comparing Xenon and Cell. What I gather is that all 6 threads on Xenon, if used, have to fight for execution resources: if a resource is not free, that particular thread must wait or be swapped out (unlikely). However, 7 of the 9 threads Cell can handle (the threads on the SPUs) have free rein over the resources on a core. I feel this may be the most, or at least one of the most, significant observations we can draw about these two CPUs. Threads on Cell have more resources of various types available to them. There are two things I gather from this. Threads with more resources available to them can get more work done. Threads that don’t have to wait for resources to free up can get more work done.
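To make the counting above concrete, here is a minimal Python sketch that tallies thread contexts from the figures in this post only (3 Xenon cores with 2 contexts each; 1 PPE with 2 contexts plus 7 SPUs with 1 each). The table layout and the names `CHIPS` and `thread_summary` are made up for illustration, and the sketch says nothing about how much work each thread actually gets done.

```python
# Thread-context tally using only the counts quoted in the post.
# The data layout and names are illustrative assumptions, not an official spec.
CHIPS = {
    "Xenon": [
        {"unit": "PPC core", "count": 3, "contexts_per_unit": 2},
    ],
    "Cell (PS3)": [
        {"unit": "PPE", "count": 1, "contexts_per_unit": 2},
        {"unit": "SPU", "count": 7, "contexts_per_unit": 1},
    ],
}


def thread_summary(units):
    """Return (total thread contexts, contexts that have a unit to themselves)."""
    total = sum(u["count"] * u["contexts_per_unit"] for u in units)
    exclusive = sum(u["count"] for u in units if u["contexts_per_unit"] == 1)
    return total, exclusive


for name, units in CHIPS.items():
    total, exclusive = thread_summary(units)
    print(f"{name}: {total} thread contexts, {exclusive} with a unit to themselves")
```

By this count Xenon has 6 contexts, none exclusive, and Cell has 9, of which 7 are exclusive. That is the distinction this section is trying to draw, whatever the threads end up being called.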
Critical thoughts:
Threads on Cell can better leverage the power of the silicon available because of the structure of the HW. There is also more silicon to take advantage of.
Pitfalls:
1. One would note that if 2 threads are executing on a core but demand different resources in the pipelines, this increases efficiency rather than decreasing it.
2. SPUs in Cell can only task on execution element at a time putting them at a disadvantage.
I agree. In this situation the logical threads can be considered to be just as efficient as HW threads doing the same thing. However, one must realize that the situation will arise quite often when this is not the case.
SPUs don’t need to do more than one thing at a time. Threads on SPUs have the whole core to themselves, so while one is doing its thing there are six others doing their thing. With Xenon one can only guarantee 3 threads are executing, where 3 more can be executing if resources are available to do work. (Work != threads still being present…the threads aren’t going anywhere)
Real world basis:
X2 vs. P4 with HT. The X2 routinely wins out because it has 2 actual cores that can give 2 threads more execution units to work with, whereas the P4 with HT does not and thus cannot. Something to note is that HT on average only provides a 10-20% bump in speed, where it would not seem uncommon for the X2 to have a 100% speed bump.
Bandwidth:
There is a surprising disparity here!
Before I go on I need to caution that I am unsure as to whether RSX has access to the PS3’s XDR. I know I’ve read it a bunch of times but I can’t recall where I saw it officially or at least in solid form such as a technical document. If this is not the case then most of what I say next is made null and void. I am confident however this is the case as it makes good sense.
Xenos serves as the memory controller in the X360. Xenos can read/write 22.4GB/s from the GDDR3 in the system and can read/write 10.8GB/s from Xenon. I’ve seen no mention yet of a south bridge so I will assume for now that I/O for system components eats into the bandwidth the GDDR3 provides. 2GB/s or less seems reasonable.
Cell must have its own memory controller to access the XDR ram it uses for main ram. RSX must have its own memory controller to access its pool of GDDR3. What is interesting is that RSX also has access to the XDR ram in the system. This allows RSX to access up to 512 MB of ram minus whatever Cell consumes, which makes perfect sense. (no different than what Xenon consumes of the GDDR3 in X360) RSX has 22.4GB/s bandwidth to its pool of GDDR3. RSX can write 20GB/s to Cell and read 15GB/s from Cell. Cell does not have access to the GDDR3 in the system. What is most significant is that via Cell’s memory controller RSX has an additional 25.6GB/s read/write bandwidth with the XDR ram in the system. This provides RSX with 48GB/s to read/write from memory in the system on top of the read/write bandwidth between it and Cell.
It is not difficult to see how this is possible when looking at how two Cell chips would communicate via a crossbar. The crossbar makes possible the communication between Cells while Flexio handles communicating with the XDR Cell uses. I suspect RSX merely sits on the other side of one of the crossbars in the PS3…and now we know just how much data Cell chips can communicate between each other in such a setup by looking at the link between Cell and RSX. Also there appears to be a south bridge in the PS3 that will keep I/O from other system components from stealing bandwidth from Cell and RSX.
Critical thoughts:
Where Xenos would appear to be headed for bandwidth limited days in having less bandwidth available to it than a PC GPU, RSX on the other hand has more bandwidth available to it than any GPU part seen to date…at least by me. The only reason I don’t classify this as too much bandwidth is that RSX is clocked at 550MHz, so in the end that extra bandwidth should come in handy in trying to feed RSX. (and I've heard there's no such thing as too much bandwidth) Looks like a good fit situation where RSX isn’t cruising towards being bandwidth limited but doesn’t have bandwidth to spare either.
Pitfalls:
MS says its DirectX compression will give them 50% more bandwidth to work with. I see this statement as a PR move until proven otherwise. 50% more bandwidth due to what being compressed? When and where will the uncompressed data be stored, and won’t storing that eat up bandwidth just the same? Is this compression HW accelerated? The overhead of DX calls eats up CPU time, not bandwidth. Seems bogus until someone explains to me how this could be possible.
The intelligent memory of Xenos removes a good amount of the bandwidth load from the main ram by containing the frame buffer in it and not there. However Xenos still needs to get textures from main ram and “tiles” from/for the frame buffer when rendering to 1080i. Textures should be pretty large and larger if things are rendered to 1080i and not scaled to 1080i. These thoughts are what make me think Xenos “could” be bandwidth limited. Xenos is the X360’s memory controller, forcing Xenos to share the bandwidth from the GDDR3 at all times. This moves me from “could” be bandwidth limited to thinking that it is probably the case, especially when rendering to 1080i.
RSX will not have constant access to 48GB/s of bandwidth. In using Cell’s memory controller to get to the XDR pool it is actually consuming bandwidth Cell could use. There is an interesting possibility here though. One idea is to feed Cell with data from RSX (its pipes) while RSX is accessing XDR ram. This would ensure Cell doesn’t completely starve for data to work on during this interval. Just an idea.
Something to keep in mind is that RSX has 22.4GB/s bandwidth guaranteed and can variably pull from another 25.6GB/s as needed or when possible. Xenos is always sharing its 22.4GB/s bandwidth to memory with Xenon despite how the daughter die eases things.
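The arithmetic behind the 48GB/s figure and the sharing caveat can be sketched in a few lines of Python. The bandwidth numbers are the ones quoted in this post. The linear sharing model, the `xdr_share` and `cpu_share` parameters and the function names are simplifying assumptions for illustration, not a description of how either memory controller actually arbitrates.

```python
# Figures in GB/s as quoted in the post; the sharing model is a simplifying assumption.
RSX_GDDR3 = 22.4    # RSX's own GDDR3 pool
CELL_XDR = 25.6     # XDR pool reached through Cell's memory controller
XENOS_GDDR3 = 22.4  # X360's single GDDR3 pool, shared by Xenos and Xenon


def rsx_memory_bandwidth(xdr_share):
    """xdr_share: hypothetical fraction (0..1) of the XDR bandwidth RSX takes from Cell.

    Returns (bandwidth available to RSX, XDR bandwidth left for Cell).
    """
    if not 0.0 <= xdr_share <= 1.0:
        raise ValueError("xdr_share must be between 0 and 1")
    rsx_total = RSX_GDDR3 + CELL_XDR * xdr_share
    cell_left = CELL_XDR * (1.0 - xdr_share)
    return round(rsx_total, 1), round(cell_left, 1)


def xenos_memory_bandwidth(cpu_share):
    """cpu_share: hypothetical fraction (0..1) of GDDR3 bandwidth consumed by Xenon."""
    if not 0.0 <= cpu_share <= 1.0:
        raise ValueError("cpu_share must be between 0 and 1")
    return round(XENOS_GDDR3 * (1.0 - cpu_share), 1)


for share in (0.0, 0.5, 1.0):
    rsx, cell = rsx_memory_bandwidth(share)
    print(f"RSX takes {share:.0%} of XDR: RSX {rsx} GB/s, Cell {cell} GB/s")
```

The 48GB/s peak only appears at `xdr_share = 1.0`, which is exactly the case where Cell is left with nothing, so the realistic figure sits somewhere between 22.4 and 48.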
CPU-GPU relations:
Xenos can send/receive data at 10.8GB/s to/from Xenon. They will communicate primarily to aid each other in the task of rendering.
RSX can send data to Cell at 20GB/s and receive data from Cell at 15GB/s. They will also work together primarily on the task of rendering.
Here is the way I’m looking at it. From my perspective one of the parts is working at the task at hand…rendering while the other is supplying it data to work on…a VERY intelligent memory if you will. One could then look at the bandwidth between the two parts as that from the “graphics processor” to its memory and go from there.
When thinking like this these are some observations I have. When looking at the bandwidth between Xenos and Xenon it appears to be about 1/3 of that “graphics processors” enjoy now. When looking at Cell and RSX the bandwidth appears to be about 1/2 to 2/3 rds what “graphic processors” utilize today. (PC graphics parts get approx. 30GB/s bandwidth)
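The fractions above are simple ratios against the roughly 30GB/s I am using as the PC reference. Here is a small Python sketch of that arithmetic, using only numbers from this post; the dictionary labels and the `fraction_of_pc` helper are illustrative names.

```python
# Link bandwidths in GB/s as quoted in the post, compared with the
# ~30 GB/s the post uses as a typical PC graphics card figure.
PC_GPU_GBPS = 30.0

LINKS = {
    "Xenos <-> Xenon": 10.8,
    "RSX reads from Cell": 15.0,
    "RSX writes to Cell": 20.0,
}


def fraction_of_pc(gbps, reference=PC_GPU_GBPS):
    """Return a link's bandwidth as a fraction of the reference bandwidth."""
    return gbps / reference


for name, gbps in LINKS.items():
    print(f"{name}: {gbps} GB/s is {fraction_of_pc(gbps):.0%} of {PC_GPU_GBPS} GB/s")
```

That works out to roughly a third for Xenos/Xenon and between a half and two thirds for Cell/RSX. A different reference figure for PC cards would shift all three ratios by the same factor.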
This is not so bad when you go back and look at “how” each part will contribute to rendering. Xenos and RSX are to do the heavy lifting while Xenon and Cell are to provide flexibility these parts don’t have and then do what they can to aid in the task(s) at hand. These parts won’t need to communicate THAT much data but there are some interesting differences I think about just what can be done here.
I expect neither Xenon nor Cell to be particularly good at rasterization. I expect both Xenon and Cell to aid in vertex processing, particles, and post processing effects. I am thinking along the lines of tessellation, disp mapping, integrating particles into the physics simulation, etc. What can be done and to what extent is the question.
As things look now I would say that Xenon and Xenos are at a disadvantage to Cell and RSX on these special tasks. Cell has more threads it can dedicate to these tasks where each thread has more execution resources available to getting the tasks done. The second issue is that Cell and RSX can communicate more data between one another. The last issue has to do with ease of use but I feel it is significant. I cannot find where I saw it (so if you wish to dismiss this I understand) but it would appear programmers can use Nvidia’s Cg to program in vertex work for Cell’s SPUs. This does nothing for making the HW more powerful in relative terms but it goes a long way in turning the potential into the kinetic. To my knowledge there is no equivalent for this in working with Xenon’s cores to the same end.
Pitfalls:
I have not ignored that Xenos can do a heck of a lot of vertex processing on its own. Xenos is not a CPU, however, so it lacks flexibility in what it can do with its entire vertex processing power. Xenos should be able to throw up INSANE amounts of geometry but it will not be applying physics etc to it…these tasks still best fall on Xenon and I’ve already spoken to the situation there. (the MEMEXPORT capability is a factor here though and shouldn’t be ignored) The other thing to consider is that when Xenos is in INSANE vertex worker mode it is also in “not doing pixel work” mode and Xenon won’t be able to pick up the slack there…Cell wouldn’t be able to do it either.
Random thoughts:
I am curious about how well Xenos, its tessellator and Xenon can work together to make displacement-mapping work. There is a Z-brush demo of this I think. Could we see stuff like this in real time? Yes/no/maybe…what would it take of the tessellator etc?
I think it might be a good thing to place RSX’s frame/z buffers in the XDR ram. This would give Cell direct access to them so that some interesting things may be possible. What could this allow to be done? This would guarantee RSX consumes some of Cell’s bandwidth to main ram, but could it also free up bandwidth between Cell and RSX? Would the trade off be worth it?
Is there any news as to what changed with respect to Cell’s PPE VSU? Any clue into what Crytek means by saying that Cell’s PPE has slightly/somewhat better “hyper threading” than a core in Xenon?
Is it true the dynamic scheduler used with Cell will find work for the SPUs to do if they aren’t tasked explicitly to something?
I don’t understand the issue with Xenos being triangle setup limited. Could someone explain this to me? Is it an issue where all the arrays couldn’t be tasked to vertex processing or is this merely a limit to the amount of vertex processing each array could do?
What would be some interesting things Nvidia could do with RSX’s feature set? Just want to hear some interesting ideas.
MOST IMPORTANT THOUGHTS:
What do you think of anything I said?
What other factors may play a role when thinking about the overall performance of these machines and would these factors be more significant than these? (I’d prefer HW related issues, but if they tie into ease of use etc that’s ok)
Don’t worry about going over my head…learn me something! That’s what I’m here for.
Titanio
30-Sep-2005, 20:31
I'm sure there'll be much more said about this, but I'll just raise two small informational points for you:
Cell does not have access to the GDDR3 in the system.
I don't think this is true, as far as I was aware, both cell and rsx have access to each others memory pools.
The issue of bandwidth is an interesting one though. I'm also curious about how things will materially pan out on PS3, especially if HDR and MSAA can't be done together. That could significantly cut down on the framebuffer bw requirements and leave more left over for non-framebuffer tasks (texturing etc.) than X360, after you factor out CPU consumption.
Also on another point you made, some PC GPUs have ~50GB/s of bandwidth at the moment, and of course they don't have to share with the CPU.
I cannot find where I saw it (so if you wish to dismiss this I understand) but it would appear programmers can use Nvidia’s Cg to program in vertex work for Cell’s SPUs.
This is by no means confirmed, AFAIK. A lot of people have speculated that it would make a lot of sense, but nothing certain yet.
edit - a third point, X360 does have a southbridge, two PCI Express lanes, 500MB/s up, 500MB/s down.
Lysander
30-Sep-2005, 20:47
Scifi, you are very vague between logic and hardware thread difference. I insist that X2cpu is 6 hardware threaded. Can someone in short define basic hardware difference between spe and ppe(dd2)?
Titanio
30-Sep-2005, 20:53
Scifi, you are very vague between logic and hardware thread difference. I insist that X2cpu is 6 hardware threaded. Can someone in short define basic hardware difference between spe and ppe(dd2)?
Xenon's threads are hardware threads, but they share the core. The SPU has one thread at a time that has the SPU to itself.
What I would like to note is that Xenon has 3 cores while Cell has 8 active cores. What I have begun to focus on is that this means Xenon can handle 3 HW threads while Cell can handle 8 HW threads.
Cell can handle 9 HW threads and Xenon can handle 6 HW threads though context switching may be more costly in Xenon than Cell PPE when a core has more than 2 threads.
scificube
30-Sep-2005, 20:55
What cards have 50GB/s? I'm just curious. That's a HECK of a lot of bandwidth. I guess I haven't kept up well enough.
If what you say about Cell being able to access the GDDR3 is correct then that certainly eases the bandwidth load all around in the system. I am curious as to how this could be possible though. Flexio talks to only XDR no? So could the crossbar be where the link is...that would suggest some customization there right? Actually I have RSX sitting on the other side of the crossbar where another Cell chip normally would so there definitely would have to be some further customization.
Perhaps it's best to ask where you saw Cell would be able to access the GDDR3 and maybe I could figure out the answers for myself. I need to find that pdf that detailed how the crossbar setup worked as well.
I was just throwing the Cg thing out there of course. If it's not the case I still see it as an ease of use issue. The potential will still be there but may be lost in the effort to pull it out.
It is interesting to think about MSAA + HDR. If they can be done together bandwidth is saved in not doing both at the same time right? The penalty is it taking more time on the processing end to get the job done right? Any ideas as to which is more desirable?
I also wonder if Nvidia has paid attention to what Valve has done with its Lost Coast expansion. If their solution works with AA, performs well and gives pretty good results, perhaps it would be best if both Nvidia and ATI went in the direction of promoting HDR usage in this fashion. It will of course require Nvidia to swallow its pride and relent on one of its prized marketing lines.
scificube
30-Sep-2005, 21:01
Scifi, you are very vague between logic and hardware thread difference. I insist that X2cpu is 6 hardware threaded. Can someone in short define basic hardware difference between spe and ppe(dd2)?
I only wanted to highlight the difference between threads that do and do not have to share resources. I've seen multiple threads running on a core simultaneously referred to being "logical" threads presented to the OS. I consider all threads to be HW threads but some distinction must be made to describe things better so I used that convention. The naming convention really isn't what is important to me. It is the resources available to each thread that I make note of and I think is significant.
Did that help? I couldn't explain myself out of a paper bag ya know :(
scificube
30-Sep-2005, 21:10
Cell can handle 9 HW threads and Xenon can handle 6 HW threads though context switching may be more costly in Xenon than Cell PPE when a core has more than 2 threads.
Again I re-iterate that what the threads are called are not what I make note of. If you ignore the naming convention you'll see that I did not steal anything from Xenon in that I did make it clear it could handle 6 threads.
I really did not address context switching in the manner that you describe. I have no idea as to the penalty for doing this with either Cell or Xenon. For 2 threads on a core in Xenon switches should be very fast between them. Switches still need to be done though and cannot take place until execution resources free up. Switching one of these 2 threads out with another from elsewhere will be more costly for sure but I've no idea as to whether it is more or less costly on a Xenon core than on Cell's PPE. Given the similarities between these two cores I would imagine the cost of doing this is close to being equal. (perhaps cache issues could affect this though to provide some separation)
Titanio
30-Sep-2005, 21:13
What cards have 50GB/s? I'm just curious. That's a HECK of allot of bandwith. I guess I haven't kept up well enough.
Sorry, that should have been more like ~40GB/s. The 7800 GTX starts at 38.4GB/s, different implementations might have more if the clockspeed has been bumped up. And of course, that's not shared with the CPU.
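For anyone wondering where a number like 38.4GB/s comes from, peak memory bandwidth is just bus width times effective data rate. Here is a minimal Python sketch. The bus widths and data rates below are my recollection of the publicly quoted specs (256-bit at 1.2 GT/s for the 7800 GTX, 128-bit at 1.4 GT/s for the consoles' GDDR3, 64-bit at 3.2 GT/s for the XDR) and are assumptions that do not appear anywhere in this thread; only the resulting GB/s figures do.

```python
def peak_bandwidth_gbps(bus_width_bits, data_rate_gtps):
    """Peak bandwidth in GB/s = (bus width in bytes) * (transfers per second in GT/s)."""
    return bus_width_bits / 8 * data_rate_gtps


# (bus width in bits, effective data rate in GT/s)
# ASSUMED values recalled from published specs, not taken from this thread.
MEMORY_CONFIGS = {
    "GeForce 7800 GTX GDDR3": (256, 1.2),
    "X360 / RSX GDDR3": (128, 1.4),
    "PS3 XDR": (64, 3.2),
}

for name, (bits, rate) in MEMORY_CONFIGS.items():
    print(f"{name}: {peak_bandwidth_gbps(bits, rate):.1f} GB/s")
```

These are theoretical peaks; sustained bandwidth in a real workload will be lower.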
If what you say about Cell being able to access the GDDR3 is correct then that certainly eases the bandwith load all around in the system. I am curious as to how this could be possible though. Flexio talks to only XDR no? So could the crossbar be where the link is...that would suggest some customization there right? Perhaps it's best to ask where you saw Cell would be able to access the GDDR3 and maybe I could figure the answers for myself.
I'll try look it up to be sure, I'm second guessing that now ;) I thought it was mentioned in some interview somewhere..
edit - kutaragi did confirm cell can access gddr3, see the end of this post.
It is interesting to think about MSAA + HDR. If they can be done together bandwith is saved in not doing both at the same time right? The penalty is it taking more time on the processing end to get the job done right? Any ideas as to which is more desirable?
Not quite sure what you mean, but I meant that if it was not physically possible for RSX to do MSAA and HDR together, then framebuffer bandwidth requirements, vs say Xenos, will be a lot lower. Also if you factor in colour compression. So it still has to use main memory bandwidth for that, but the requirement should be a lot lower than it otherwise would be if RSX could do MSAA and HDR together, and you were using both (I think, at least!?).
I also wonder if Nvidia has paid attention to what Valve has done with it Lost Coasts Expansion. If their solution works with AA, performs well and give pretty good results perhaps it would be best if both Nvidia and ATI went in the direction of promoting HDR usage in this fashion. It will of course require Nvidia to swallow it's pride and relent on one of it's prized marketing lines.
Hadn't heard what Valve were up to - any links to more info?
edit - Here's a Kutaragi quote re. Cell accessing GDDR3:
"CELL and RSX have close relationship and both can access the main memory and the VRAM transparently. CELL can access the VRAM just like the main memory, and RSX can use the main memory as a frame buffer. They are just separated for the main usage, and do not really have distinction."
Every time there is debate about cell vs. xenon I am reminded about the article that Deano linked to (http://www.gotw.ca/publications/concurrency-ddj.htm). If I understand it correctly, the thrust of the article is that even though you can take a problem or task and break it down to utilize MP, the fact remains that the final execution speed will be determined by the slowest process. So while Cell's 9 processing elements may be 50% more than Xenon's 6, in the real world the difference may be of little consequence. Or, on the other hand I could be talking out of my ass. :lol:
scificube
30-Sep-2005, 21:31
Sorry, that should have been more like ~40GB/s. The 7800 GTX starts at 38.4GB/s, different implementations might have more if the clockspeed has been bumped up. And of course, that's not shared with the CPU.
Thanks. Well now I know, and knowing...god that show sucked.
Not quite sure what you mean, but I meant that if it was not physically possible for RSX to do MSAA and HDR together, then framebuffer bandwidth requirements, vs say Xenos, will be a lot lower. Also if you factor in colour compression. So it still has to use main memory bandwidth for that, but the requirement should be a lot lower than it otherwise would be if RSX could do MSAA and HDR together, and you were using both (I think, at least!?).
It is known HDR is possible on the PS3. It's also a given MSAA is as well. I was thinking if the tasks could not be done simultaneously then they must be done apart from one another. I doubt devs are going to give up on HDR and AA being done together (perhaps not at the same time) with their PS3 games. I was thinking, I think in error, that this would save bandwidth and only cost extra processing time. Bandwidth should still be consumed by HDR and MSAA just the same if they are done separately...I'm thinking logically on this as I don't know for sure.
Hadn't heard what Valve were up to - any links to more info?
I'll try to find you a quote. Apparently Valve has found a way for SM2.0 HW to do HDR and MSAA at the same time and, to boot, with good performance. They supposedly tested four different methods and came up with the one they are going to use for the Lost Coast level. If SM2.0 HW can handle it I've little doubt both Xenos and RSX will blast through HDR+MSAA using Valve's method...if Valve feels like sharing how they got it done. I think Humus is onto them anyway with his own experiments. Give me a moment and I'll try to find you a link.
edit - Here's a Kutaragi quote re. Cell accessing GDDR3:
"CELL and RSX have close relationship and both can access the main memory and the VRAM transparently. CELL can access the VRAM just like the main memory, and RSX can use the main memory as a frame buffer. They are just separated for the main usage, and do not really have distinction."
KK said it? Well...it's better than nothing. I kid. I doubt he'd say something like this when devs will tear him a new one for lying to them about something so significant.
Titanio
30-Sep-2005, 21:34
Every time there is debate about cell vs. xenon I am reminded about the article that Deano linked (http://www.gotw.ca/publications/concurrency-ddj.htm)to. If I understand it correctly, the thrust of the article is about how even though you can take a problem or task and break it down to utilize MP the fact remains that the final execution speed will be determined by the slowest process. So while cells 9 processing elements may be 50% more that xenon's 6 in the real world the difference may be of little consequence. Or, on the other hand I could be talking out of my ass. :lol:
If you require things to be finished at the same time, you'd break your "slow" task up further or get your other tasks to do more. For example, if a particular task was the "slow" task on chip X, you might have the opportunity to split it up in a number of different ways and execute those parts concurrently on chip Y, thus ridding yourself of that bottleneck. Or if while you're waiting for a "slow" task to finish on one core, you could spend your other core's "free" time doing a multi-frame task for example, or move them on to wholly independent tasks of the next frame perhaps, if you couldn't simply increase the amount they were doing on tasks for the current frame (but heh, i'm sure you could).
In other words, I wouldn't worry about the chip getting used in the instance of having a relatively slow task/thread.
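A toy Python sketch of the point above: with one long task the finish time is pinned to that task, and splitting it lets the other cores absorb the pieces. The task durations, the greedy longest-first scheduler and the names `makespan` and `split` are all illustrative assumptions; real splits have overhead and dependencies that this ignores.

```python
import heapq


def makespan(task_ms, workers):
    """Greedy longest-task-first schedule; returns when the last worker finishes."""
    loads = [0.0] * workers
    heapq.heapify(loads)
    for t in sorted(task_ms, reverse=True):
        lightest = heapq.heappop(loads)       # worker with the least work so far
        heapq.heappush(loads, lightest + t)   # give it the next-longest task
    return max(loads)


def split(task_ms, index, parts):
    """Replace task `index` with `parts` equal pieces (assumes a clean, free split)."""
    piece = task_ms[index] / parts
    return task_ms[:index] + [piece] * parts + task_ms[index + 1:]


tasks = [8.0, 2.0, 2.0, 2.0]        # hypothetical per-frame task times in ms
print(makespan(tasks, workers=4))   # bounded by the single 8 ms task
print(makespan(split(tasks, 0, 4), workers=4))
```

With these made-up numbers, four workers finish in 8 ms with the unsplit task and in 4 ms once it is cut into four pieces, which is the "break your slow task up further" case.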
rendezvous
30-Sep-2005, 21:34
I only wanted to highlight the difference between threads that do and do not have to share resources. I've seen multiple threads running on a core simultaneously referred to being "logical" threads presented to the OS. I consider all threads to be HW threads but some distinction must be made to describe things better so I used that convention. The naming convention really isn't what is important to me. It is the resources available to each thread that I make note of and I think is significant.
Did that help? I couldn't explain myself out of a paper bad ya know :(
What resources?
All threads on a system are sharing the memory, which is a system resource.
They are also sharing the execution units on the CPU, even on non multithreaded processors by time slicing.
One way of saying what I think you want to say is that the PPE and the cores in the Xenon have multiple (two) hardware contexts for multithreading.
And the equivalent to the capabilities in the PPE (and i presume the Xenon cores) in the PC world is not Hyper Threading or SMT (Simultaneous Multi Threading).
The difference is that in the case of the PPE you have fine grained multithreading, where only instructions from one thread can be issued each clock cycle, whereas in the case of SMT you are able to issue and execute from both threads simultaneously.
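Taking that description at face value, the difference can be shown with a toy issue model in Python. Everything here is an assumption made for illustration: the two-wide issue limit, the per-cycle "ready instruction" counts, the alternate-and-skip-stalled policy, and the function names. It is not a model of the real PPE, Xenon or Pentium 4 pipelines.

```python
def issued_fine_grained(ready_a, ready_b, width=2):
    """One thread issues per cycle: alternate between threads, skipping a stalled one."""
    total = 0
    for cycle, (a, b) in enumerate(zip(ready_a, ready_b)):
        first, second = (a, b) if cycle % 2 == 0 else (b, a)
        chosen = first if first > 0 else second   # fall back if the preferred thread is stalled
        total += min(width, chosen)
    return total


def issued_smt(ready_a, ready_b, width=2):
    """Both threads may issue in the same cycle, up to the shared issue width."""
    return sum(min(width, a + b) for a, b in zip(ready_a, ready_b))


# Hypothetical number of instructions each thread has ready on each cycle (0 = stalled).
thread_a = [1, 1, 0, 1]
thread_b = [1, 0, 1, 1]

print(issued_fine_grained(thread_a, thread_b))
print(issued_smt(thread_a, thread_b))
```

In this toy trace the fine-grained scheme issues 4 instructions and the SMT scheme issues 6. The gap comes entirely from the cycles where both threads had something ready, which only SMT can exploit in this model.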
The second pitfall in the first section seems to have a typo so please forgive me if i misinterpreted it.
2. SPUs in Cell can only task on execution element at a time putting them at a disadvantage.
The SPU is able to execute on two execution units at a time.
To be blunt:
Wrong in lots of ways. But any more details would get me in trouble.
scificube
30-Sep-2005, 21:40
http://www.bit-tech.net/gaming/2005/09/14/lost_coast_screens/2.html
Here you go Titanio :)
The statement that they have HDR+MSAA working with SM2.0 HW is at the bottom of the page.
The whole article is a good read if you're interested, but be warned...it's full of Lost Coast spoilers in the form of HDR/no HDR comparison shots.
Titanio
30-Sep-2005, 21:41
It is known HDR is possible on the PS3. It's also a given MSAA is as well. I was thinking if the tasks could not be done simultaneously than they must be apart from one another. I doubt devs are going to give up on HDR and AA being done together (perhaps not at the same time) with their PS3 games. I was thinking I think in error this would save bandwith and only cost extra processing time. Bandwith should still be consumed by HDR and MSAA just the same if they are done seperately...I'm thinking logically on this as I don't know for sure.
I don't think you can just do them separately. There might be other ways as per your Valve example, but I'm sure there are catches. I'd be very interested to hear more about Valve's work on that though.
edit - thanks for the link. Sounds interesting, pity there isn't more detail!
scificube
30-Sep-2005, 21:41
To be blunt:
Wrong in lots of ways. But any more details would get me in trouble.
That's not fair. Just use a word. Threading, bandwith...whatever.
rendezvous
30-Sep-2005, 21:42
To be blunt:
Wrong in lots of ways. But any more details would get me in trouble.
I knew i shouldn't trust MPR. :(
BlueTsunami
30-Sep-2005, 21:59
That's not fair. Just use a word. Threading, bandwith...whatever.
:lol: It sucks. I know. I also hate that I can read through your post...some or a lot may not be right (as DeanoC stated, a lot) and not be corrected. :(
scificube
30-Sep-2005, 22:02
What resources?
All threads on a system is sharing the memory which is a system resource.
They are also sharing the execution units on the CPU, even on non multithreaded processors by time slicing.
I am only talking about execution units...FPU, VMX, etc.
One way of saying what I think you want to say is that the PPE and the cores in the Xenon have multiple (two) hardware contexts for multithreading.
Sounds good to me.
And the equivalent to the capabilities in the PPE (and i presume the Xenon cores) in the PC world is not Hyper Threading or SMT (Simultaneous Multi Threading).
The difference is that you in the case of the PPE have fine grained multithreading where only instructions from one core can be issued each clock cycle whereas in the case of SMT you are able to issue and execute from both threads simulataneously.
I've seen discussions to this effect. Real SMT required doubling the resources...might as well toss in another core is what I gathered. From the perspective of the hardware contexts I am looking at it like this. One thread is executing and if it hangs up on something a quick switch can be performed so that the other thread can execute, and the threads flip flop back and forth at a really fast rate. This is what I thought but someone suggested the other approach that I laid out above was the more correct way of looking at this so I went with what they said.
edit:
I think I get it...my friend is right about how HT works. Rendezvous I missed it the first time. HT is not the correct way to look at the PPE and Xenon cores. Well...that was painful.
end edit:
I actually agree with your thinking but the guy is often so sharp I was scared to go against what he said.
In truth, a lot of what I said is nullified if in fact I am correct.
The second pitfall in the first section seems to have a typo so please forgive me if i misinterpreted it.
The SPU is able to execute on two execution units at a time.
Oops...I knew that. Actually that was a bad description of a different idea I was trying to convey. When talking again with friends they have noted that SPUs only work on one thread at a time, which puts them at a disadvantage. I had too many execution units running through my head.
I should add to the thinking about that pitfall --- Considering things again. The SPUs may in fact be at a disadvantage. When they hang there is no other context available to switch to and they must wait for the PPE to task them to something else. Xenon's cores however have another hw context to switch to so in this respect they would have a good advantage.
-------------------------------------------
Did I find it DeanoC? anybody...c'mon someone tell me...
scificube
30-Sep-2005, 22:04
I knew i shouldn't trust MPR. :(
What's MPR? ...microsoft PR? Actually I may have done MS a disservice not the other way around.
Titanio
30-Sep-2005, 22:07
:lol: It sucks. I know. I also hate that I can read through your post...some or a lot may not be right (as DeanoC stated, a lot) and not be corrected. :(
Can DeanoC be a little more specific about what he was referring to? The post immediately previous, or some posts, or all? :p
MPR = Microprocessor Report, I think
I knew i shouldn't trust MPR. :sad:
No to be fair, the bits from MPR are largely right...
But really you don't know what version of Cell is being used, what RSX is and how Cell <-> RSX works. How likely are you to be right?
Trying to estimate things like bandwidth, with the information you have currently is futile.
scificube
30-Sep-2005, 22:16
No to be fair, the bits from MPR are largely right...
But really you don't know what version of Cell is being used, what RSX is and how Cell <-> RSX works. How likely are you to be right?
Trying to estimate things like bandwidth, with the information you have currently is futile.
This means...give up for now right? ...I'm so impatient though. I didn't mean any harm. Just wanted to talk about things a little. Bummer...I tried hard too.
I gather Cell has changed since DD2 was done maybe? I know you can't say.
Thanks for even bothering to look in the first place though :)
edit:
Can't seem to find a way to lock this so if a mod sees it just delete it or something.
overclocked
30-Sep-2005, 22:19
Cell must have its own memory controller to access the XDR RAM it uses for main RAM. RSX must have its own memory controller to access its pool of GDDR3. What is interesting is that RSX also has access to the XDR RAM in the system. This allows RSX to access up to 512 MB of RAM minus whatever Cell consumes, which makes perfect sense (no different than what Xenon consumes of the GDDR3 in X360). RSX has 22.4GB/s bandwidth to its pool of GDDR3. RSX can write 20GB/s to Cell and read 15GB/s from Cell. Cell does not have access to the GDDR3 in the system. What is most significant is that via Cell’s memory controller RSX has an additional 25.6GB/s read/write bandwidth with the XDR RAM in the system. This provides RSX with 48GB/s to read/write from memory in the system on top of the read/write bandwidth between it and Cell.
I don't understand how you can calculate the bandwidth for RSX as 48GB/s by adding the VRAM and XDR figures together.
As I see it, since the VRAM bandwidth is underwhelming, you could "lock" a certain portion of the XDR memory and use it as your framebuffer, but as I understand it you can't get more than the write bandwidth (15GB/s) that FlexIO supports, and you would still need to share this bandwidth with the CPU (if my understanding of the bus is right, of course).
So in a game where the dev needs around 30GB/s of bandwidth, you still have around 5-7GB/s left for Cell "info" to be sent to the RSX. That's how I've understood the hardware at least; if someone knows more, well, then I'm wrong.
edit - Here's a Kutaragi quote re. Cell accessing GDDR3:
"CELL and RSX have close relationship and both can access the main memory and the VRAM transparently. CELL can access the VRAM just like the main memory, and RSX can use the main memory as a frame buffer. They are just separated for the main usage, and do not really have distinction."
Let's hope Kutaragi is right :cool:
scificube
30-Sep-2005, 22:31
I don't understand how you can calculate the bandwidth for RSX as 48GB/s by adding the VRAM and XDR figures together.
As I see it, since the VRAM bandwidth is underwhelming, you could "lock" a certain portion of the XDR memory and use it as your framebuffer, but as I understand it you can't get more than the write bandwidth (15GB/s) that FlexIO supports, and you would still need to share this bandwidth with the CPU (if my understanding of the bus is right, of course).
So in a game where the dev needs around 30GB/s of bandwidth, you still have around 5-7GB/s left for Cell "info" to be sent to the RSX. That's how I've understood the hardware at least; if someone knows more, well, then I'm wrong.
I made a bad assumption. I think you're right. I was thinking about how fast Cell could access XDR when I should have been thinking about how fast the mem controller could pipe data to RSX. It has also been pointed out that a 7800 can gobble up 38GB/s of bandwidth. In retrospect it would appear RSX has a good chance of being hungry for bandwidth too.
I seriously want to scrap this thread and never show my face again. I'm disgusted with myself.
Titanio
30-Sep-2005, 22:33
I seriously want to scrap this thread and never show my face again. I'm disgusted with myself.
Geez scifi, relax, it's been a good thread. I've liked it anyway :)
Let's hope Kutaragi is right
I guess this ain't how it's working in dev kits right now, but..yeah, I hope he's right when it comes to the final system!
edit - about cell<->rsx, doesn't rsx write to cell at 15GB/s and read from it at 20GB/s? That's enough to allow it to consume XDR's bandwidth entirely if it wanted (XDR is 12.8GB/s read, 12.8GB/s write), and still leave some cell<->rsx bandwidth for cell to read and write directly from chip to chip (7.2GB/s to RSX and 2.2GB/s to read from it) , but of course, you'd have no xdr bandwidth left over for cell ;) So assuming no Cell bandwidth usage, you could say RSX had 48GB/s to use. In reality it will be "up to" that figure depending on CPU usage and how much cpu<->gpu bandwidth you require for things other than direct memory transactions....?
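As a rough check of the arithmetic above, here is a small Python sketch. It only uses the figures quoted in this thread (22.4GB/s GDDR3, 25.6GB/s XDR assumed to be split evenly, 20GB/s and 15GB/s over FlexIO). None of these are confirmed final specs, and the function name is made up for illustration.

```python
# Figures quoted in the thread, in GB/s. These are assumptions, not final specs.
GDDR3_BW = 22.4         # RSX <-> GDDR3
XDR_BW = 25.6           # Cell memory controller <-> XDR, total
XDR_READ = 12.8         # assumes an even read/write split of XDR_BW
XDR_WRITE = 12.8
FLEXIO_TO_RSX = 20.0    # RSX reading from Cell
FLEXIO_FROM_RSX = 15.0  # RSX writing to Cell


def rsx_bandwidth_budget():
    """Return (peak memory bandwidth for RSX, FlexIO left to RSX, FlexIO left from RSX).

    Models the case where RSX consumes all XDR bandwidth and Cell uses none.
    """
    peak = GDDR3_BW + XDR_BW
    left_to_rsx = FLEXIO_TO_RSX - XDR_READ
    left_from_rsx = FLEXIO_FROM_RSX - XDR_WRITE
    return round(peak, 1), round(left_to_rsx, 1), round(left_from_rsx, 1)


if __name__ == "__main__":
    print(rsx_bandwidth_budget())
```

The 48GB/s figure is therefore an upper bound that only holds while Cell touches no XDR at all; any CPU traffic comes straight out of it.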
rendezvous
30-Sep-2005, 22:36
No to be fair, the bits from MPR are largely right...
But really you don't know what version of Cell is being used, what RSX is and how Cell <-> RSX works. How likely are you to be right?
Trying to estimate things like bandwidth, with the information you have currently is futile.
So MPR is largely right.
It is somewhat in line with what you wrote in your presentation, and both ways seem to make sense.
I just hope you would be more clear ;) but I respect your NDA and time will tell eventually anyway.
And no, I don't really know what version of Cell is being used, for all I know it could be a totally revamped design with PPEs featuring out of order execution, advanced tournament branch predictors and an espresso maker.
What we mortals have to go on is what is publicly available on the current generations of Cell which isn't much on the PPE.
As for the RSX I think we have even less info, which is why I refrain from speculating on it and how it works together with CELL.
I appreciate that you say if something is wrong. I want nothing more than to (see a) decrease the amount of wrong information.
scificube
30-Sep-2005, 22:44
Thanks Titanio. I'll be fine. It's all a learning experience :) ...and much to learn I have.
I seriously want to scrap this thread and never show my face again. I'm disgusted with myself.
There's nothing wrong with the thread or the speculation that goes with it, it's an interesting read. As long as everybody knows the limits. Some people unfortunately assume that they know more than they do; just don't fall into that trap and you'll be fine...
Deano
scificube
30-Sep-2005, 22:51
Gotta hand it to you fellas at B3D. You're a good bunch.
Titanio:
Just checked IGN (again) and they have it 20GB/s read from Cell and 15GB/s write to Cell for RSX.
Titanio
30-Sep-2005, 23:03
Just checked IGN (again) and they have it 20GB/s read from Cell and 15GB/s write to Cell for RSX.
Yeah, I checked against the conference vid. It's biased toward the GPU getting data, which seems logical.
http://img184.imageshack.us/img184/9713/ps35xx.jpg
I could be wrong, but RSX could saturate the XDR bandwidth if it really wanted (i.e. 48GB/s to itself), and still leave some flexio over for Cell. Of course, that leaves it with no main memory bandwidth of its own.
One question I have: can the SPUs load data and/or SPU code directly off the southbridge i.e. a disc or HD or camera or whatever, without that data/code having to pass through XDR? Can you treat them as their own little computers with their own RAM, and a connection to the southbridge? Of course, latency would be through the roof. That could save a little memory bandwidth in certain instances (starting up, where all data has to come off the southbridge anyway, perhaps?), theoretically, although I'm not sure how practical it'd be.
edit - thinking about it further, the savings would be MEASLY. I guess it's just a theoretical point ;)
Shifty Geezer
30-Sep-2005, 23:08
Cell can handle 9 HW threads and Xenon can handle 6 HW threads
To be clear on what this really means though, TTBOMK XeCPU can have 3 threads actually executing at any one time and Cell can have 8. The hardware threads on XeCPU and PPE are an optimization for context switches on stalls or task-switching.
This is on the understanding the functional units on a core cannot be shared between threads concurrently where resources are left going unused, such as thread 2 using the VMX while thread 1 is using the core's integer ALU, which I think is the case but am hazy on. Even if it's the case XeCPU can share such resources, that'll only be part time when they're free. In terms of program threads in execution at a given clock cycle then, it's 3 for XeCPU, 8 for Cell.
In terms of program threads in execution at a given clock cycle then, it's 3 for XeCPU, 8 for Cell.
Thanks, I posted without reading most of what scificube wrote so it seems my reckless post was out of context. Anyway the superscalar PPE has some hardware resources to support 2-way fine-grained MT, if not the full execution resources for true SMT, therefore they can't be "logical" or software threads, as described in the MPR article (http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/D9439D04EA9B080B87256FC00075CC2D/$file/MPR-Cell-details-article-021405.pdf) rendezvous probably indicated (sorry if it's a different article)
The Cell Power core has hardware fine-grain multithreading. The multithreading design supports fine-grained multithreading with round-robin thread scheduling. If both threads are active, the processor will fetch an instruction from each thread in turn. When one thread cannot issue a new instruction or is not active, the other active thread will be allowed to issue an instruction every cycle. Threading does add some burden to the die size (around 7% in this case), as there must be duplicated register files, program counters, and parallel instruction buffers (before the decode stage).
But I'm still wondering about what Crytek guy said about the difference between PPE threading and Xbox 360 CPU threading.
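To make the round-robin behaviour in that MPR quote a bit more concrete, here is a toy Python model. It is a deliberately simplified sketch: it assumes one issue slot per cycle and two threads, and it ignores the real pipeline entirely. The function name and the stall sets are hypothetical, not anything from IBM's documentation.

```python
def simulate_issue(cycles, stalled_a, stalled_b):
    """Count issue slots per thread on a toy 2-way fine-grained MT core.

    stalled_a / stalled_b are sets of cycle numbers in which that thread
    cannot issue. At most one instruction issues per cycle (simplification).
    """
    issued = {"A": 0, "B": 0}
    turn = "A"
    for cycle in range(cycles):
        can_issue = {"A": cycle not in stalled_a, "B": cycle not in stalled_b}
        other = "B" if turn == "A" else "A"
        if can_issue[turn]:
            issued[turn] += 1
            turn = other            # alternate after the preferred thread issues
        elif can_issue[other]:
            issued[other] += 1      # the other thread takes the free slot
        # if both threads are stalled, the slot is simply wasted
    return issued


if __name__ == "__main__":
    # Both threads always ready: the slots are shared evenly.
    print(simulate_issue(1000, set(), set()))
    # Thread B permanently blocked: thread A gets every slot.
    print(simulate_issue(1000, set(), set(range(1000))))
```

In this model two runnable threads each get half of the issue slots, while a single runnable thread gets all of them, which is one way to read the "dual 1.6GHz core" description.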
Gholbine
01-Oct-2005, 02:03
When you guys say that current PC GPUs have ~40GB/s of bandwidth, what does that mean? Bandwidth to what? I was under the impression that the CPU<=>GPU bandwidth in the next-generation consoles was streets ahead of current PCs.
overclocked
01-Oct-2005, 04:29
Geez scifi, relax, it's been a good thread. I've liked it anyway :)
edit - about cell<->rsx, doesn't rsx write to cell at 15GB/s and read from it at 20GB/s? That's enough to allow it to consume XDR's bandwidth entirely if it wanted (XDR is 12.8GB/s read, 12.8GB/s write), and still leave some cell<->rsx bandwidth for cell to read and write directly from chip to chip (7.2GB/s to RSX and 2.2GB/s to read from it) , but of course, you'd have no xdr bandwidth left over for cell ;) So assuming no Cell bandwidth usage, you could say RSX had 48GB/s to use. In reality it will be "up to" that figure depending on CPU usage and how much cpu<->gpu bandwidth you require for things other than direct memory transactions....?
I was also wrong with that one, it's 15GB/s write and 20GB/s read for RSX.
Heh, I thought the XDR was 25.6GB/s read; that changes the picture I guess. In a good case you could have 15GB/s for the framebuffer and 5GB/s left for Cell to send over the 20GB/s FlexIO, but that's wrong then (IF 25.6GB/s).
Edit
Are you sure about the 12.8GB/s read/write for XDR? On the diagram you showed, the bandwidth to/from RSX is on two arrows while the XDR interface shows 25.6GB/s both ways..
overclocked
01-Oct-2005, 04:33
When you guys say that current PC GPUs have ~40GB/s of bandwidth, what does that mean? Bandwidth to what? I was under the impression that the CPU<=>GPU bandwidth in the next-generation consoles was streets ahead of current PCs.
Bandwidth between GPU-GDDR3 alone.
Shifty Geezer
01-Oct-2005, 10:01
...therefore they can't be "logical" or software threads, as described in MPR article (http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/D9439D04EA9B080B87256FC00075CC2D/$file/MPR-Cell-details-article-021405.pdf) rendezvous probably indicated...
If both threads are active, the processor will fetch an instruction from each thread in turn.
Wow, I ought to read that article. This explains Deano's comments on PPE being used as a dual 1.6 GHz core, as if you have both threads active then they're interleaved. Have I missed the same sort of details on XeCPU's cores as well? I checked the Ars explanation that only seemed to confirm the second thread remains inactive unless the first becomes inactive, but I'm not absolutely sure that's the case.
Titanio
01-Oct-2005, 10:17
Are you sure about the 12.8GB/s read/write for XDR? On the diagram you showed, the bandwidth to/from RSX is on two arrows while the XDR interface shows 25.6GB/s both ways..
I think it only shows the XDR bandwidth on one, bi-directional arrow, because that figure is split evenly going up and down. If it was 25.6GB/s both ways, we'd be talking about 51.2GB/s of bandwidth to XDR ;)
And that is an interesting quote from MPR, I had missed that too. So basically if a thread is blocked, it's like a 3.2Ghz PPE, but if both threads are not blocked, it's like 2 1.6GHz PPEs? Interesting..
In reality, the split would be arbitrary I guess, depending on the blocking behaviour of the threads. One might get 2.5 billion cycles, the other 0.7billion etc. etc.
That might also explain the Crytek guy's comments if Xenon is different in this respect?
overclocked
01-Oct-2005, 10:36
I think it only shows the XDR bandwidth on one, bi-directional arrow, because that figure is split evenly going up and down. If it was 25.6GB/s both ways, we'd be talking about 51.2GB/s of bandwidth to XDR ;)
And that is an interesting quote from MPR, I had missed that too. So basically if a thread is blocked, it's like a 3.2Ghz PPE, but if both threads are not blocked, it's like 2 1.6GHz PPEs? Interesting..
In reality, the split would be arbitrary I guess, depending on the blocking behaviour of the threads. One might get 2.5 billion cycles, the other 0.7billion etc. etc.
That might also explain the Crytek guy's comments if Xenon is different in this respect?
I think I misunderstood what you're saying, or rather what I'm thinking; I thought of the main RAM as a regular FSB as in a PC. The RSX is going to be really bandwidth starved; I mean it has as much bandwidth as my 6800 (256bit-350DDR), which is starved. Add what(?) 4-5 times the fillrate and it's not looking good.
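For anyone wondering where the 6800 comparison comes from, a minimal Python sketch of the usual bus-width-times-data-rate calculation follows. It assumes "256bit-350DDR" means a 256-bit bus with a 350MHz memory clock and two transfers per clock; the function name is only illustrative.

```python
def memory_bandwidth_gb_s(bus_width_bits, clock_mhz, transfers_per_clock=2):
    """Peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9


if __name__ == "__main__":
    # 6800 with a 256-bit bus at 350MHz DDR, as described in the post.
    print(memory_bandwidth_gb_s(256, 350))
```

Under those assumptions the result lines up with the 22.4GB/s quoted earlier in the thread for RSX's GDDR3 pool.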
Titanio
01-Oct-2005, 11:00
I think I misunderstood what you're saying, or rather what I'm thinking; I thought of the main RAM as a regular FSB as in a PC. The RSX is going to be really bandwidth starved; I mean it has as much bandwidth as my 6800 (256bit-350DDR), which is starved. Add what(?) 4-5 times the fillrate and it's not looking good.
I'm not really sure what you mean, I wasn't really commenting on RSX's bandwidth situation in the post you quoted..?
Also, vs a 6800 I thought RSX's fillrate might be more like 2x? (6.4Gigapixels to 13.2 - extrapolated from G70 figures? Although that'd depend on what 6800 you were talking about). But we don't really know.
I'm not sure if you're going to be consuming your fillrate though, it wouldn't even be possible to use it fully without compression at least (?) I think it may be a case of doing more per pixel than firing out more pixels too.
You can probably consider RSX's bandwidth situation to be 48GB/s - CPU consumption - framebuffer consumption. The latter has been a point of debate, I don't think we ever really figured out what it would look like, but of course, it's gonna vary from game to game anyway. You can try thinking about it yourself - a 720p FP16 frame is 7.03MB, so I guess your variables after that would be how many times you're reading it in and writing it out per sec (your overdraw, your framerate, how many times on average you'll be reading pixels back). You might want to consider the z-buffer too, and you may want to consider color and z compression (which NVidia claims can be 4:1, or maybe better now, I don't know, but they don't go into a lot of detail about it). Take that out, take out CPU usage, and that's how much you'll have left for texture/vertex reads etc. But this is still all speculative until we have final detail on RSX.
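The frame size figure can be reproduced with a few lines of Python. This sketch assumes 720p means 1280x720 and that an FP16 frame uses four 16-bit channels (8 bytes per pixel); the helper name is made up for illustration.

```python
def frame_bytes(width=1280, height=720, bytes_per_pixel=8):
    """Size of one uncompressed frame in bytes (FP16 x 4 channels = 8 bytes/pixel)."""
    return width * height * bytes_per_pixel


if __name__ == "__main__":
    size = frame_bytes()
    print(size, "bytes")
    print(size / 2**20, "MB when 1 MB = 2**20 bytes")
    print(size / 1e6, "MB when 1 MB = 10**6 bytes")
```

The 7.03 figure corresponds to binary megabytes; counted in decimal megabytes the same frame comes out at about 7.37.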
I hope he's right when it comes to the final system!
me too :wink:
edit - about cell<->rsx, doesn't rsx write to cell at 15GB/s and read from it at 20GB/s? That's enough to allow it to consume XDR's bandwidth entirely if it wanted (XDR is 12.8GB/s read, 12.8GB/s write),
I'm not 100% sure about that but I don't think XDR bandwidth is split, you can have full bw when you write to mem and when you read from mem, obviously not at the same time.
In reality it will be "up to" that figure depending on CPU usage and how much cpu<->gpu bandwidth you require for things other than direct memory transactions....?
:???::oops::twisted:
Titanio
01-Oct-2005, 14:36
I'm not 100% sure about that but I don't think XDR bandwidth is split, you can have full bw when you write to mem and when you read from mem, obviously not at the same time.
Ahh..interesting. I didn't know that at all, I just assumed it was an even split. So basically you can read and write any amount as long as combined you don't exceed 25.6GB/s? That's more flexible than I thought.
??? oops twisted
:p i don't know what this means! But I guess I won't be finding out today..:)
overclocked
01-Oct-2005, 14:46
I'm not really sure what you mean, I wasn't really commenting on RSX's bandwidth situation in the post you quoted..?
Also, vs a 6800 I thought RSX's fillrate might be more like 2x? (6.4Gigapixels to 13.2 - extrapolated from G70 figures? Although that'd depend on what 6800 you were talking about). But we don't really know.
I'm not sure if you're going to be consuming your fillrate though, it wouldn't even be possible to use it fully without compression at least (?) I think it may be a case of doing more per pixel than firing out more pixels too.
You can probably consider RSX's bandwidth situation to be 48GB/s - CPU consumption - framebuffer consumption. The latter has been a point of debate, I don't think we ever really figured out what it would look like, but of course, it's gonna vary from game to game anyway. You can try thinking about it yourself - a 720p FP16 frame is 7.37MB, so I guess your variables after that would be how many times you're reading it in and writing it out per sec (your overdraw, your framerate, how many times on average you'll be reading pixels back). You might want to consider the z-buffer too, and you may want to consider color and z compression (which NVidia claims can be 4:1, or maybe better now, I don't know, but they don't go into a lot of detail about it). Take that out, take out CPU usage, and that's how much you'll have left for texture/vertex reads etc. But this is still all speculative until we have final detail on RSX.
You're right, but I was rather spinning on the bandwidth issue; I should have made that clearer.
As for the cards, I just made a simple comparison against a vanilla 6800 (325MHz core), again for a bandwidth comparison. A stock 6800NU has 3900 Mtex/Mpixel and assuming RSX has 13200, plus the more advanced ALU config in G70, you should have maybe 5x the shader fillrate at least and still the same bandwidth as the former. All I'm saying is that based on what we "know" this seems like a pretty big bottleneck.
I haven't read anything about XDR so I'm not qualified to say, but if it is what I thought at first, you have 25.6GB/s that's available in any direction you want, either read 20GB/s and write 5GB/s or however you like it best, and that would help with bandwidth restrictions instead of having this "fixed" 12.8 up and 12.8 down.
I agree with the other things you're saying that will help bandwidth but it would be good to know for certain how the XDR works, so go and find a quote/link now Titanio :-)
Edit
It was actually the split you mentioned about the XDR that made my whole understanding of PS3 go down...Hehe
Titanio
01-Oct-2005, 15:04
As for the cards, I just made a simple comparison against a vanilla 6800 (325MHz core), again for a bandwidth comparison. A stock 6800NU has 3900 Mtex/Mpixel and assuming RSX has 13200, plus the more advanced ALU config in G70, you should have maybe 5x the shader fillrate at least and still the same bandwidth as the former. All I'm saying is that based on what we "know" this seems like a pretty big bottleneck.
It has the same amount of BW on the VRAM side alone, but if it's pulling from and/or pushing to XDR, that increases available BW. I doubt your fillrate would be going that high though to be honest.
I haven't read anything about XDR so I'm not qualified to say, but if it is what I thought at first, you have 25.6GB/s that's available in any direction you want, either read 20GB/s and write 5GB/s or however you like it best, and that would help with bandwidth restrictions instead of having this "fixed" 12.8 up and 12.8 down.
Yeah, my bad, it's 25.6GB/s in an arbitrary split seemingly. Just to be clear, I assume GDDR3 is the same?
overclocked
01-Oct-2005, 15:28
It has the same amount of BW on the VRAM side alone, but if it's pulling from and/or pushing to XDR, that increases available BW.
Yes, I know that, so we are actually on the same side here about using a portion of the XDR as video memory.
But really you don't know what version of Cell is being used, what RSX is and how Cell <-> RSX works
Well, is that good or bad? :wink: Most of us think it's just a higher clocked G70 (although transistor count is higher...)
Titanio
01-Oct-2005, 15:44
Yes, I know that, so we are actually on the same side here about using a portion of the XDR as video memory.
Yeah, absolutely. I was actually just thinking of one theoretical setup, placing the framebuffer entirely in XDR. My understanding may fall down here, but I figured if you threw 6GB/s at the CPU, you could then access (read or write) an entire 720p 64-bit frame + 32bit zbuffer ~2000 times a sec (or about 80 times per frame at 30fps, 40 times per frame at 60fps), and leave the GDDR3 bandwidth completely untouched for texture/vertex access. Aggregate chip-to-chip bandwidth would drop significantly, to about 15GB/s, but under the same model with X360, it'd be no worse off from that perspective (6GB/s for the CPU would bring X360's chip-to-chip total bw down around 15GB/s too).
It really comes down to two things as far as I can tell though - how Nvidia's color/z compression works (which I'm not attempting to factor in yet, and which could make a decent difference), and how much you can do to reduce the depth complexity of your scene. It could be very worthwhile spending some cycles on Cell on good occlusion culling.
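A back-of-the-envelope version of that framebuffer-in-XDR estimate can be sketched in Python. The inputs are assumptions taken from this thread (25.6GB/s of XDR bandwidth, 6GB/s reserved for the CPU, a 1280x720 buffer with 64-bit color and 32-bit z), with 1 GB counted as 1e9 bytes. Compression and the FlexIO read/write limits are not modelled, and the function names are hypothetical.

```python
WIDTH, HEIGHT = 1280, 720
COLOR_BYTES = 8   # 64-bit color (FP16 x 4 channels)
Z_BYTES = 4       # 32-bit z-buffer


def buffer_bytes():
    """Bytes touched by one full pass over color + z at 720p."""
    return WIDTH * HEIGHT * (COLOR_BYTES + Z_BYTES)


def full_buffer_accesses_per_second(xdr_bw_gb_s=25.6, cpu_share_gb_s=6.0):
    """How many full color+z passes fit in the XDR bandwidth left after the CPU's share."""
    available_bytes_per_s = (xdr_bw_gb_s - cpu_share_gb_s) * 1e9
    return available_bytes_per_s / buffer_bytes()


if __name__ == "__main__":
    per_second = full_buffer_accesses_per_second()
    print(round(per_second), "accesses per second")
    print(round(per_second / 30), "per frame at 30fps")
    print(round(per_second / 60), "per frame at 60fps")
```

With these particular inputs the sketch lands at roughly 1,770 accesses per second, or about 59 per frame at 30fps, so a bit under the rounded figures above. The exact numbers shift with whatever buffer format and CPU share are assumed.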
scificube
01-Oct-2005, 16:19
I wish I had seen this MPR before I said anything...and had that diagram in front of me. Can someone post a link to the MPR? I'd certainly like to read it. It would be to everyone's benefit as it's obvious what happens when you're flying blind LOL! anyway...
As far as RSX's bandwidth goes...is this how RSX looks to have access to the XDR pool now:
RSX requests->FLEXIO->EIB->XDRAM controller->XDR Ram->XDRAM controller->EIB->FLEXIO->RSX
2 things would be true if this is right:
1. Even though the XDRAM controller can read/write 25.6GB/s (not at the same time?), having to move data back through FLEXIO effectively sets the max read at 20GB/s and the max write at 15GB/s for RSX to and from the XDR.
2. Unless there's a separate link explicitly for the purpose (unlikely, right?), data must flow over Cell's EIB in these situations, which could have a bad effect if there is so much traffic on the EIB that the cores cannot communicate as freely as one would desire. I can't remember how fast the EIB is...it may be fast enough that this is not likely.
The idea is that RSX requests can piggyback on Cell requests to XDR, and when those requests return they eat into the bandwidth Cell has to communicate data it was working on back to RSX (the 20GB/s out to RSX from FLEXIO). Requests to XDR wouldn't really eat into RSX's bandwidth to write to Cell because requests are trivially small, so RSX could still write to Cell with 99.99...99% of its bandwidth.
Does this make sense, guys? If so, would RSX accessing the XDR RAM be a problem for Cell's EIB?
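The first point, that the slowest hop sets the limit, can be written down as a tiny Python sketch. The link speeds are the ones quoted in this thread; the EIB value is a placeholder assumption, since its real speed is not established here, and the function name is made up.

```python
def path_bandwidth(*link_limits_gb_s):
    """Peak bandwidth of a chain of links is capped by its slowest link."""
    return min(link_limits_gb_s)


# Figures quoted in the thread (GB/s). EIB_BW is a placeholder, not a known spec.
XDR_BW = 25.6
FLEXIO_TO_RSX = 20.0    # Cell -> RSX
FLEXIO_FROM_RSX = 15.0  # RSX -> Cell
EIB_BW = 200.0

# RSX reading from XDR: XDR -> memory controller -> EIB -> FlexIO -> RSX
rsx_read_from_xdr = path_bandwidth(XDR_BW, EIB_BW, FLEXIO_TO_RSX)
# RSX writing to XDR: RSX -> FlexIO -> EIB -> memory controller -> XDR
rsx_write_to_xdr = path_bandwidth(FLEXIO_FROM_RSX, EIB_BW, XDR_BW)

if __name__ == "__main__":
    print(rsx_read_from_xdr, rsx_write_to_xdr)
```

As long as the EIB is faster than FlexIO, it never becomes the limiting hop in this simple model; contention from other traffic on the ring is a separate question that the sketch does not capture.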
As far as how threads are handled...
I think like Shifty does, and thread execution being interleaved finally makes DeanoC's comments that the PPE is like 2 1.6GHz processors make perfect sense. It also makes clear the comments to the effect that Xenon is more efficient than Cell while Cell is more powerful. If I understand things correctly, when all threads are being put to use the situation is like this:
Xenon: like six processors running at 1.6GHz due to quick switching between HW contexts, and because of the HW contexts Xenon is apt to efficiently maintain this situation when a thread blocks for whatever reason.
Cell: The PPE is like a core on Xenon: it's like 2 1.6GHz processors when 2 threads are on the PPE. The SPUs should be looked at as 7 3.2GHz processors. However, since there is no support for a second HW context, a thread blocking is much more costly than on the PPE or a core on Xenon. I suppose this is where comments to the effect of there being no way to hide latency etc. come from.
Ok...I'm still not in la la land am I?
scificube
01-Oct-2005, 16:29
Yeah, absolutely. I was actually just thinking of one theoretical setup, placing the framebuffer entirely in XDR. My understanding may fall down here, but I figured if you threw 6GB/s at the CPU, you could then access (read or write) an entire 720p 64-bit frame + 32bit zbuffer ~2000 times a sec (or about 80 times per frame at 30fps, 40 times per frame at 60fps), and leave the GDDR3 bandwidth completely untouched for texture/vertex access. Aggregate chip-to-chip bandwidth would drop significantly, to about 15GB/s, but under the same model with X360, it'd be no worse off from that perspective.
It really comes down to two things as far as I can tell though - how Nvidia's color/z compression works (which I'm not attempting to factor in yet, and which could make a decent difference), and how much you can do to reduce the depth complexity of your scene. It could be very worthwhile spending some cycles on Cell on good occlusion culling.
Stuff like this is what I was talking about in my original post...though I lack the insight to go as far as you have with this. This I find very interesting. Things like this would be difficult on the X360 because Xenos's framebuffer is in the e-Dram, unless the intelligent memory is already set up to handle this kind of stuff.
rendezvous
01-Oct-2005, 17:32
Can someone post a link to the MPR? I'd certainly like to read it.
I can post a link to the MPR article I referred to earlier.
location where MPR Cell article can be downloaded from (http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/D9439D04EA9B080B87256FC00075CC2D)
Titanio
01-Oct-2005, 17:54
Things like this would be difficult on the X360 because Xenos's framebuffer is in the e-Dram, unless the intelligent memory is already set up to handle this kind of stuff.
But you wouldn't want the framebuffer to be anywhere else on X360, that's what the eDram is for, and it has plenty of bandwidth for it. PS3 has more "general" bandwidth, so to speak, but you have to accommodate the framebuffer with it too. My point is that depending on your CPU and framebuffer needs, you may still end up having more BW for texture/vertex reads in PS3 anyway.
It's hard to compare, though. CPU consumption is probably going to be different on both systems, and that presents some interesting issues too (for example, I think Cell could burn through data a lot quicker than Xenon because it has much more execution logic to feed, but on the other hand, Xenon's cache may be pretty "busy" swapping things in and out of memory).
2. Unless there's a separate link explicitly for the purpose (unlikely, right?), data must flow over Cell's EIB in these situations, which could have a bad effect if there is so much traffic on the EIB that the cores cannot communicate as freely as one would desire. I can't remember how fast the EIB is...it may be fast enough that this is not likely.
I don't have exact figures, but IIRC it's roughly in the 200GB/s to 300GB/s range. There's plenty of BW there, I don't think there'd be much of a problem shuttling data from FlexIO to the XDR interface.
scificube
01-Oct-2005, 18:00
Titanio:
What I meant to say is that it would be difficult for Xenon to act in the same manner on the frame buffer in the e-Dram as Cell could in main memory.
I certainly do think the frame buffer is in an excellent place in Xenos's e-Dram. I realize the daughter die will blast through tasks traditionally attributed to working with the frame buffer.
It was a flexibility thing not a knock or anything.
edit:
If bandwidth around the EIB is indeed in the (200-300)GB/s range I am of course inclined to agree with you that shuttling data from XDR through it to RSX wouldn't seem a problem.
scificube
01-Oct-2005, 18:00
I can post a link to the MPR article i refered to earlier.
location where MPR Cell article can be downloaded from (http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/D9439D04EA9B080B87256FC00075CC2D)
Thank you! :)
Titanio
01-Oct-2005, 18:23
Titanio:
What I meant to say is that it would be difficult for Xenon to act in the same manner on the frame buffer in the e-Dram as Cell could in main memory.
I certainly do think the frame buffer is in an excellent place in Xenos's e-Dram. I realize the daughter die will blast through tasks traditionally attributed to working with the frame buffer.
It was a flexibility thing not a knock or anything.
Gotcha, I wasn't really thinking in those terms, but putting the framebuffer in XDR would indeed leave it very close to Cell if you wanted to do something with it there..
If bandwidth around the EIB is indeed in the (200-300)GB/s range I am of course inclined to agree with you that shuttling data from XDR through it to RSX wouldn't seem a problem.
Trying to figure out something more exact - It's 96 bytes per cycle, so if that's every cycle, it'd be about 286GB/s I think. But I thought I might have read somewhere that the EIB was clocked at half the rate of the chip, so in that case it'd be ~143GB/s. Either way, I think it should be enough to keep everything fed.
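Those EIB numbers can be reproduced with a short Python sketch. It assumes 96 bytes per cycle and a 3.2GHz chip clock as mentioned in this thread, treats the half-rate bus clock as an unconfirmed possibility, and reports GB as 2**30 bytes, which is what the ~286 figure appears to use. The function name is illustrative only.

```python
def eib_bandwidth_bytes_per_s(bytes_per_cycle=96, chip_clock_hz=3.2e9, clock_divider=1):
    """Peak EIB bandwidth in bytes/s; clock_divider=2 models a half-rate bus."""
    return bytes_per_cycle * chip_clock_hz / clock_divider


if __name__ == "__main__":
    for divider in (1, 2):
        bw = eib_bandwidth_bytes_per_s(clock_divider=divider)
        print(f"divider {divider}: {bw / 1e9:.1f} GB/s decimal, {bw / 2**30:.1f} GB/s binary")
```

In decimal units the full-rate case is 307.2GB/s, so the 286 and 143 figures only differ from that by the choice of unit.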
scificube
01-Oct-2005, 18:40
Trying to figure out something more exact - It's 96 bytes per cycle, so if that's every cycle, it'd be about 286GB/s I think. But I thought I might have read somewhere that the EIB was clocked at half the rate of the chip, so in that case it'd be ~143GB/s. Either way, I think it should be enough to keep everything fed.
Understood and agreed.
Understood and agreed.
Not saying this is or isn't possible but.....
You really have to look beyond bandwidth for these things. It's a requirement obviously, but for it to be practical to put the framebuffer in XDRam the destination blender in RSX would have to be able to absorb the additional latency. And that's simply not a known quantity.
As Deano said earlier, the speculation in these threads is interesting, but you're speculating with far from complete knowledge, and there is a tendency to get fixated on one or two technical numbers, coupled with quotes taken out of context.
scificube
01-Oct-2005, 19:25
I understood and value what DeanoC said. I'm only looking at possibilities. I don't think anyone is making predictions...just working out what they can understand and sharing ideas.
I expect to be wrong. Maybe not so wrong as I was yesterday, but I expect to be wrong, and talking about it with B3D members is the best way I can think of to check myself.
I certainly don't know where else to go to talk about this stuff where green guys like me can at least be given the chance to speak.
With that said I want to toss another idea out there...
Maybe RSX shouldn't put its frame buffer in the XDR; instead, a sort of shadow copy of it, holding only the data Cell could use, would be there. The real frame buffer would reside in the GDDR3 mem pool. This way Cell and RSX could work concurrently on unrelated tasks associated with the frame buffer, and Cell could deliver its results to RSX in a sort of "just in time" manner for the final blend/tasks. Cell could be doing something else while RSX is doing z-tests or something else. This may be a way to combat the latency.
And that's simply not a known quantity..
Unfortunately there are too many unknown quantities/features at this time..
I don't think we have the full picture yet.
Unfortunately there are too many unknown quantities/features at this time..
I don't think we have the full picture yet.
Sounds to me that even the developers are being left in the dark on a lot until they get their final devkits in December.
Titanio
01-Oct-2005, 19:35
Not saying this is or isn't possible but.....
You really have to look beyond bandwidth for these things. It's a requirement obviously, but for it to be practical to put the framebuffer in XDRam the destination blender in RSX would have to be able to absorb the additional latency. And that's simply not a known quantity.
As Deano said earlier, the speculation in these threads is interesting, but you're speculating with far from complete knowledge, and there is a tendency to get fixated on one or two technical numbers, coupled with quotes taken out of context.
True, I was tending to ignore latency. Although it did cross my mind that if XDR's behaviour in this regard is a little better than GDDR3, it might help balance out a little.
You could flip it over and have the framebuffer in GDDR3, but I was putting it in XDR because of the greater amount of bandwidth there. With my same figures above, there'd still be some GDDR3 BW left over for other things, but it seems a little odd to me to have so little going into relatively so much memory. That also assumes then texture/vertex reads etc. are less bandwidth sensitive and could sustain any greater penalty going over to XDR..
But yeah, it is just playing with numbers for now. There are a number of factors that could render all this moot (hopefully in a good way ;)).
MechanizedDeath
01-Oct-2005, 20:24
What's more latency sensitive, vertex/texture fetches or framebuffer read/writes? I mean, is it even necessary to read the framebuffer data that often unless you're doing lots of RTT or other environment-mapping ops? There was some prior speculation on how the memory should be partitioned in PS3, and I thought I remembered reading that GDDR3 would be better for the frame buffer, with XDR for texture/vertex data. PEACE.
Titanio
01-Oct-2005, 20:37
I mean, is it even necessary to read the framebuffer data that often unless you're doing lots of RTT or other environment-mapping ops?
The most consistent offender in terms of readbacks that crossed my mind was the z-buffer: every time you make a comparison, it'll need to read the current value in the z-buffer. Which is why I was saying earlier that reducing depth complexity could be very worthwhile, via the CPU and/or GPU (not too familiar with what the GPU offers here?).
There are probably other ops that require a decent amount of reading back, depending on what you're doing.
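To put the z-buffer point in concrete terms, here is a toy software depth test that just counts reads and writes. It is only a sketch: the function names are hypothetical, every fragment is assumed to cost exactly one z read, and it ignores anything real hardware might do to cut that traffic (early rejection, compression and so on), none of which is known for RSX here.

```python
# Toy depth test that counts z-buffer traffic. Purely illustrative:
# one read per fragment, one write per fragment that passes the test.
def depth_test_traffic(fragments, width, height):
    """fragments: iterable of (x, y, z); a smaller z is closer to the camera."""
    zbuffer = [[float("inf")] * width for _ in range(height)]
    reads = 0
    writes = 0
    for x, y, z in fragments:
        reads += 1                  # the comparison needs the stored depth
        if z < zbuffer[y][x]:
            zbuffer[y][x] = z       # fragment is visible so far
            writes += 1
    return reads, writes


def full_screen_layers(width, height, depths):
    """One full-screen layer per entry in depths, drawn in the given order."""
    return [(x, y, z) for z in depths
            for y in range(height) for x in range(width)]


W, H = 4, 4
one_layer = depth_test_traffic(full_screen_layers(W, H, [0.5]), W, H)
back_to_front = depth_test_traffic(full_screen_layers(W, H, [0.9, 0.5, 0.1]), W, H)
front_to_back = depth_test_traffic(full_screen_layers(W, H, [0.1, 0.5, 0.9]), W, H)

print("one layer      (reads, writes):", one_layer)
print("back to front  (reads, writes):", back_to_front)
print("front to back  (reads, writes):", front_to_back)
```

In this model the reads scale directly with depth complexity no matter what order things are drawn in, while drawing front to back only saves the writes. So cutting overdraw before it ever reaches the depth test is what reduces the readback side.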
overclocked
02-Oct-2005, 02:21
True, I was tending to ignore latency. Although it did cross my mind that if XDR's behaviour in this regard is a little better than GDDR3, it might help balance out a little.
You could flip it over and have the framebuffer in GDDR3, but I was putting it in XDR because of the greater amount of bandwidth there. With my same figures above, there'd still be some GDDR3 BW left over for other things, but it seems a little odd to me to have so little going into relatively so much memory. That also assumes then texture/vertex reads etc. are less bandwidth sensitive and could sustain any greater penalty going over to XDR..
But yeah, it is just playing with numbers for now. There are a number of factors that could render all this moot (hopefully in a good way ;)).
I was thinking more along the lines of having the framebuffer in the VRAM and using a "locked" region of, say, 64MB (for example) of the XDR for normal maps and other texture data. There could be a possibility to send parts of the framebuffer to Cell for postprocessing; that's what I see as the advantage of NUMA, though UMA is easier to work with. How much of that is hype and what can actually be done is of course up in the air, but it's interesting to guess and speculate nonetheless.