I have made many observations concerning the PS3 and X360 and have come to form some opinions about what I see. I would like to put out what I’m seeing and see if I am at least justified in my thinking and whether or not I need to consider other things. I will warn those who would get upset about it. Yes, I do see a lot of advantages for the PS3. I expect this may rub some people the wrong way so I invite you all to give me your perspective on things. It’s not like I’m going to disagree with facts or have my mind made up. It is the primary purpose of this thread to see if I have framed things correctly and secondarily to see if any interesting things pop up during the discussion, if there is any…I hope so, I really do.
Oh…I am forced to apologize for how long this post is. Most of it is stuff one may already know so again I apologize in advance for that. I felt it better though to explain well where I was coming from rather than simply spouting off a string of declarative statements. For one, I am not so brash as to think I am an authority on anything, so making declarations just seems wrong for me to do. The other issue is that if my thinking is indeed off, I need to have my thinking out there so people can see exactly where it went bad and correct me at the right spot so that I can better understand what they’re saying.
Next-Gen CPUs:
What I would like to note is that Xenon has 3 cores while Cell has 8 active cores (1 PPE plus 7 SPUs). What I have begun to focus on is that this means Xenon can handle 3 HW threads while Cell can handle 8 HW threads. I think this is important when thinking about execution resources. Logical threads on a core end up fighting for execution resources while HW threads do not. Xenon’s cores can each handle 2 logical threads, allowing Xenon to handle 6 threads at a time. Cell’s PPE supports 2 logical threads but its seven SPUs support only one HW thread each. As I don’t know the PPC equivalent term I am forced to describe Xenon’s cores and Cell’s PPE as being hyper-threaded. The classification would seem to fit in my mind as well. This seems a very significant thing to me when comparing Xenon and Cell. What I gather is that all 6 threads on Xenon, if used, have to fight for execution resources, in that if a resource is not free that particular thread must wait or be swapped out (unlikely). However 7 of the 9 threads Cell can handle (the threads on the SPUs) have free rein over the resources on a core. I feel this may be the most, or at least one of the most, significant observations we can draw about these two CPUs. Threads on Cell have more resources of various types available to them. There are two things I gather from this. Threads with more resources available to them can get more work done. Threads that don’t have to wait for resources to free up can get more work done.
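To keep the counting straight, here is a small sketch that just tallies the figures given above. The `Core` class and `summarize` function are names made up for illustration; the only inputs are the per-core thread counts stated in this post, and nothing here says anything about how fast any of those cores actually are.

```python
from dataclasses import dataclass


@dataclass
class Core:
    name: str
    logical_threads: int  # threads that share this one core's execution units


def summarize(cores):
    """Count cores, total threads, and threads that have a core to themselves."""
    total_threads = sum(c.logical_threads for c in cores)
    uncontended = sum(1 for c in cores if c.logical_threads == 1)
    return {
        "cores": len(cores),
        "threads": total_threads,
        "uncontended_threads": uncontended,
    }


# Figures as stated in the post: 3 dual-threaded cores vs. 1 dual-threaded PPE + 7 SPUs.
xenon = [Core(f"core{i}", 2) for i in range(3)]
cell = [Core("ppe", 2)] + [Core(f"spu{i}", 1) for i in range(7)]

print("Xenon:", summarize(xenon))
print("Cell: ", summarize(cell))
```

The "uncontended_threads" count is the 7-of-9 figure for Cell and zero for Xenon, which is the distinction the paragraph above is drawing.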
Critical thoughts:
Threads on Cell can better leverage the power of the silicon available because of the structure of the HW. There is also more silicon to take advantage of.
Pitfalls:
1. One would note that if 2 threads are executing on a core but demand different resources in the pipelines, this increases efficiency rather than decreasing it.
2. SPUs in Cell can only task one execution element at a time, putting them at a disadvantage.
On the first point, I agree. In this situation the logical threads can be considered to be just as efficient as HW threads doing the same thing. However, one must realize that the situation where this is not the case will arise quite often.
On the second point, SPUs don’t need to do more than one thing at a time. Threads on SPUs have the whole core to themselves, so while one is doing its thing there are six others doing their thing. With Xenon one can only guarantee 3 threads are executing, where 3 more can be executing if resources are available to do work. (Work != threads still being present…the threads aren’t going anywhere)
Real world basis:
X2 vs. P4 with HT. The X2 routinely wins out because it has 2 actual cores that can give 2 threads more execution units to work with, whereas the P4 with HT does not and thus cannot. Something to note is that HT on average only provides a 10-20% bump in speed, whereas it would not seem uncommon for the X2 to have a 100% speed bump.
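As a rough way to carry the HT numbers over to the consoles, the sketch below counts "single-thread equivalents": a core running two logical threads counts as 1 plus the HT gain, and a single-threaded core counts as 1. This is a toy model with a big assumption baked in, namely that every core is worth the same, which is certainly not established for an SPU versus a PPC core. The function name `effective_threads` is made up for illustration.

```python
def effective_threads(cores_with_smt, cores_without_smt, smt_gain):
    """Toy estimate of throughput in units of one thread owning one core.

    smt_gain is the fractional speed bump from the second logical thread,
    e.g. 0.10 to 0.20 for the HT figures quoted above.
    ASSUMPTION: all cores are treated as equally capable.
    """
    return cores_with_smt * (1.0 + smt_gain) + cores_without_smt


for gain in (0.10, 0.20):
    xenon = effective_threads(cores_with_smt=3, cores_without_smt=0, smt_gain=gain)
    cell = effective_threads(cores_with_smt=1, cores_without_smt=7, smt_gain=gain)
    print(f"HT gain {gain:.0%}: Xenon ~{xenon:.1f}, Cell ~{cell:.1f}")
```

Under those assumptions Xenon's 6 threads come out to roughly 3.3 to 3.6 equivalents and Cell's 9 threads to roughly 8.1 to 8.2, which is the gap the X2 vs. P4 comparison hints at.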
Bandwidth:
There is a surprising disparity here!
Before I go on I need to caution that I am unsure as to whether RSX has access to the PS3’s XDR. I know I’ve read it a bunch of times but I can’t recall where I saw it officially or at least in solid form such as a technical document. If this is not the case then most of what I say next is made null and void. I am confident however that this is the case, as it makes good sense.
Xenos serves as the memory controller in the X360. Xenos can read/write 22.4GB/s from the GDDR3 in the system and can read/write 10.8GB/s from Xenon. I’ve seen no mention yet of a south bridge so I will assume for now that I/O for system components eats into the bandwidth the GDDR3 provides. 2GB/s or less seems reasonable.
Cell must have its own memory controller to access the XDR ram it uses for main ram. RSX must have its own memory controller to access its pool of GDDR3. What is interesting is that RSX also has access to the XDR ram in the system. This allows RSX to access up to 512 MB of ram minus whatever Cell consumes, which makes perfect sense. (no different than what Xenon consumes of the GDDR3 in X360) RSX has 22.4GB/s bandwidth to its pool of GDDR3. RSX can write 20GB/s to Cell and read 15GB/s from Cell. Cell does not have access to the GDDR3 in the system. What is most significant is that via Cell’s memory controller RSX has an additional 25.6GB/s read/write bandwidth with the XDR ram in the system. This provides RSX with 48GB/s to read/write from memory in the system on top of the read/write bandwidth between it and Cell.
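The 48GB/s figure is just the two pools added together. A quick sketch of the arithmetic, using only the numbers quoted in this post (the variable names are mine, and the XDR figure rests on the unconfirmed assumption that RSX can reach XDR at all):

```python
# All figures in GB/s, as quoted in the post.
RSX_TO_GDDR3 = 22.4      # RSX's own GDDR3 pool
RSX_TO_XDR = 25.6        # via Cell's memory controller (unconfirmed, see caution above)
XENOS_TO_GDDR3 = 22.4    # X360's single shared pool

rsx_peak = RSX_TO_GDDR3 + RSX_TO_XDR
print(f"RSX peak to memory:   {rsx_peak:.1f} GB/s")
print(f"Xenos peak to memory: {XENOS_TO_GDDR3:.1f} GB/s")
print(f"Ratio: {rsx_peak / XENOS_TO_GDDR3:.2f}x")
```

This is a peak-to-peak comparison only; it does not account for the daughter die on Xenos or for Cell using its own share of the XDR bandwidth.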
It is not difficult to see how this is possible when looking at how two Cell chips would communicate via a crossbar. The crossbar makes possible the communication between Cells while FlexIO handles communicating with the XDR Cell uses. I suspect RSX merely sits on the other side of one of the crossbars in the PS3…and now we know just how much data Cell chips can communicate between each other in such a setup by looking at the link between Cell and RSX. Also there appears to be a south bridge in the PS3 that will keep I/O from other system components from stealing bandwidth from Cell and RSX.
Critical thoughts:
Where Xenos would appear to be headed for bandwidth limited days in having less bandwidth available to it than a PC GPU, RSX on the other hand has more bandwidth available to it than any GPU part seen to date…at least by me. The only reason I don’t classify this as too much bandwidth is that RSX is clocked at 550MHz, so in the end that extra bandwidth should come in handy in trying to feed RSX. (and I've heard there's no such thing as too much bandwidth) Looks like a good fit situation where RSX isn’t cruising towards being bandwidth limited but doesn’t have bandwidth to spare either.
Pitfalls:
MS says its DirectX compression will give them 50% more bandwidth to work with. I see this statement as a PR move until proven otherwise. 50% more bandwidth due to what being compressed, and when? Where will the un-compressed data be stored, and won’t storing that eat up bandwidth just the same? Is this compression HW accelerated? The overhead of DX calls eats up CPU time, not bandwidth. Seems bogus until someone explains to me how this could be possible.
The intelligent memory of Xenos removes a good amount of the bandwidth load from the main ram by containing the frame buffer in it and not there. However Xenos still needs to get textures from main ram and “tiles” from/for the frame buffer when rendering to 1080i. Textures should be pretty large, and larger if things are rendered to 1080i and not scaled to 1080i. These thoughts are what make me think Xenos “could” be bandwidth limited. Xenos is the X360’s memory controller, forcing Xenos to share the bandwidth from the GDDR3 at all times. This moves me from “could” be bandwidth limited to thinking that it is probably the case, especially when rendering to 1080i.
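To put a rough number on the tiling concern, the sketch below estimates how many tiles a frame would need to fit in the daughter die. Two things here are not from this post and are assumptions: a 10 MB capacity for the embedded memory, and 4 bytes of colour plus 4 bytes of depth/stencil per sample. The function name `tiles_needed` is made up for illustration.

```python
import math

EDRAM_BYTES = 10 * 1024 * 1024  # ASSUMED daughter-die capacity (not stated in the post)
BYTES_PER_SAMPLE = 4 + 4        # ASSUMED 32-bit colour + 32-bit depth/stencil


def tiles_needed(width, height, aa_samples=1):
    """Number of passes needed if the whole colour+Z buffer must fit in eDRAM."""
    frame_bytes = width * height * aa_samples * BYTES_PER_SAMPLE
    return math.ceil(frame_bytes / EDRAM_BYTES)


for label, w, h, aa in [
    ("720p, no AA", 1280, 720, 1),
    ("720p, 4x AA", 1280, 720, 4),
    ("1080-line, no AA", 1920, 1080, 1),
]:
    print(f"{label}: {tiles_needed(w, h, aa)} tile(s)")
```

If those assumptions hold, a full 1920x1080 target does not fit in one go, so tiles would have to move to and from main ram as described above. How much bandwidth that costs in practice is not something this arithmetic can answer.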
RSX will not have constant access to 48GB/s of bandwidth. In using Cell’s memory controller to get to the XDR pool it is actually consuming bandwidth Cell could use. There is an interesting possibility here though. One idea is to feed Cell with data from RSX (its pipes) while RSX is accessing XDR ram. This would ensure Cell doesn’t completely starve for data to work on during this interval. Just an idea.
Something to keep in mind is that RSX has 22.4GB/s bandwidth guaranteed and can variably pull from another 25.6GB/s as needed or when possible. Xenos is always sharing its 22.4GB/s bandwidth to memory with Xenon despite how the daughter die eases things.
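The guaranteed-versus-variable point can be sketched as a simple subtraction model. The pool sizes and the 2GB/s I/O guess come from this post; the CPU usage figures passed in at the bottom are purely hypothetical placeholders, since I have no idea what real workloads consume. Function names are mine.

```python
GDDR3 = 22.4  # GB/s, each console's GDDR3 pool
XDR = 25.6    # GB/s, PS3 main ram via Cell's memory controller


def rsx_available(cell_xdr_use):
    """RSX keeps all of its GDDR3 and gets whatever XDR bandwidth Cell leaves over."""
    return GDDR3 + max(0.0, XDR - cell_xdr_use)


def xenos_available(xenon_use, io_use=2.0):
    """Xenos shares its single pool with Xenon and (per the guess above) system I/O."""
    return max(0.0, GDDR3 - xenon_use - io_use)


# HYPOTHETICAL CPU loads, only to show the shape of the trade-off.
for cpu_use in (0.0, 5.0, 10.0):
    print(
        f"CPU using {cpu_use:4.1f} GB/s -> "
        f"RSX {rsx_available(cpu_use):.1f} GB/s, "
        f"Xenos {xenos_available(cpu_use):.1f} GB/s"
    )
```

Whatever the CPU load, RSX never drops below 22.4GB/s in this model, while Xenos starts below that figure and goes down from there. The daughter die is left out entirely, so this overstates the pressure on Xenos by some unknown amount.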
CPU-GPU relations:
Xenos can send/receive data at 10.8GB/s to/from Xenon. They will communicate primarily to aid each other in the task of rendering.
RSX can send data to Cell at 20GB/s and receive data from Cell at 15GB/s. They will also work together primarily on the task of rendering.
Here is the way I’m looking at it. From my perspective one of the parts is working at the task at hand…rendering while the other is supplying it data to work on…a VERY intelligent memory if you will. One could then look at the bandwidth between the two parts as that from the “graphics processor” to its memory and go from there.
When thinking like this these are some observations I have. When looking at the bandwidth between Xenos and Xenon it appears to be about 1/3 of what “graphics processors” enjoy now. When looking at Cell and RSX the bandwidth appears to be about 1/2 to 2/3 of what “graphics processors” utilize today. (PC graphics parts get approx. 30GB/s bandwidth)
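For anyone who wants to check the fractions, this is the arithmetic behind them, using the link speeds quoted above and the approximate 30GB/s figure for PC parts. The names are mine.

```python
PC_GPU_BANDWIDTH = 30.0  # GB/s, approximate figure quoted for PC graphics parts

# CPU-GPU link speeds in GB/s, as quoted in the post.
links = {
    "Xenos <-> Xenon": 10.8,
    "Cell -> RSX": 15.0,
    "RSX -> Cell": 20.0,
}


def fraction_of_pc(link_gbps):
    """Link bandwidth as a fraction of a PC GPU's memory bandwidth."""
    return link_gbps / PC_GPU_BANDWIDTH


for name, gbps in links.items():
    print(f"{name}: {gbps:.1f} GB/s = {fraction_of_pc(gbps):.2f} of the PC figure")
```

That works out to 0.36 for the X360 link and 0.50 to 0.67 for the PS3 link, matching the "about 1/3" and "1/2 to 2/3" estimates.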
This is not so bad when you go back and look at “how” each part will contribute to rendering. Xenos and RSX are to do the heavy lifting while Xenon and Cell are to provide flexibility these parts don’t have and then do what they can to aid in the task(s) at hand. These parts won’t need to communicate THAT much data, but I think there are some interesting differences in just what can be done here.
I expect neither Xenon nor Cell to be particularly good at rasterization. I expect both Xenon and Cell to aid in vertex processing, particles, and post-processing effects. I am thinking along the lines of tessellation, displacement mapping, integrating particles into the physics simulation, etc. What can be done and to what extent is the question.
As things look now I would say that Xenon and Xenos are at a disadvantage to Cell and RSX on these special tasks. Cell has more threads it can dedicate to these tasks, where each thread has more execution resources available for getting the tasks done. The second issue is that Cell and RSX can communicate more data between one another. The last issue has to do with ease of use but I feel it is significant. I cannot find where I saw it (so if you wish to dismiss this I understand) but it would appear programmers can use Nvidia’s Cg to program in vertex work for Cell’s SPUs. This does nothing for making the HW more powerful in relative terms but it goes a long way in turning the potential into the kinetic. To my knowledge there is no equivalent for this in working with Xenon’s cores to the same end.
Pitfalls:
I have not ignored that Xenos can do a heck of a lot of vertex processing on its own. Xenos is not a CPU however, so it lacks flexibility in what it can do with its entire vertex processing power. Xenos should be able to throw up INSANE amounts of geometry but it will not be applying physics etc to it…these tasks still best fall on Xenon and I’ve already spoken to the situation there. (the MEMEXPORT capability is a factor here though and shouldn’t be ignored) The other thing to consider is that when Xenos is in INSANE vertex worker mode it is also in “not doing pixel work” mode and Xenon won’t be able to pick up the slack there…Cell wouldn’t be able to do it either.
Random thoughts:
I am curious about how well Xenos, its tessellator and Xenon can work together to make displacement-mapping work. There is a Z-brush demo of this I think. Could we see stuff like this in real time? Yes/no/maybe…what would it take of the tessellator etc?
I think it might be a good thing to place RSX’s frame/z buffers in the XDR ram. This would give Cell direct access to them so that some interesting things may be possible. What could this allow to be done? This would guarantee RSX consumes some of Cell’s bandwidth to main ram, but could it also free up bandwidth between Cell and RSX? Would the trade-off be worth it?
Is there any news as to what changed with respect to Cell’s PPE VSU? Any clue as to what Crytek means by saying that Cell’s PPE has slightly/somewhat better “hyper threading” than a core in Xenon?
Is it true the dynamic scheduler used with Cell will find work for the SPUs to do if they aren’t tasked explicitly to something?
I don’t understand the issue with Xenos being triangle setup limited. Could someone explain this to me? Is it an issue where all the arrays couldn’t be tasked to vertex processing or is this merely a limit to the amount of vertex processing each array could do?
What would be some interesting things Nvidia could do with RSX’s feature set? Just want to hear some interesting ideas.
MOST IMPORTANT THOUGHTS:
What do you think of anything I said?
What other factors may play a role when thinking about the overall performance of these machines and would these factors be more significant than these? (I’d prefer HW related issues, but if they tie into ease of use etc that’s ok)
Don’t worry about going over my head…learn me something! That’s what I’m here for.