HotChips 17 - More info upcoming on CELL & PS3

The obvious disadvantage is that you're using up precious die area to add redundancy instead of more processing power.
 
Overclocking increases performance without the need for increasing the cache size, end of discussion.

You hit an obvious wall there: soon enough, the disparity between the processor's clock rate and the memory's clock rate will be so high that each cache miss hits your performance too hard for the overclock to be worth your time and the heat generated (if the chip manages to clock that high, that is ;)).

Sometimes linear increases in clock rate produce a less-than-linear increase in performance, and that gain shrinks further as the clock rate grows.
 
Panajev2001a said:
You hit an obvious wall there: soon enough, the disparity between the processor's clock rate and the memory's clock rate will be so high that each cache miss hits your performance too hard for the overclock to be worth your time and the heat generated (if the chip manages to clock that high, that is ;)).

Sometimes linear increases in clock rate produce a less-than-linear increase in performance, and that gain shrinks further as the clock rate grows.

I totally agree with you. Yes eventually you'll hit a wall. :smile:
 
nondescript said:
No, not true...I've personally made FPGA-based logic clocked at over 100MHz, and the new generations can go much higher.
Ah, well, I sort of was talking about the bigass FPGA arrays that are capable of simulating large processors with tens of millions of transistors (and these do run very slowly). I should have realized there were other types of FPGAs too, heh. :)
 
PC-Engine said:
I totally agree with you. Yes eventually you'll hit a wall. :smile:

"Objects may be closer than they do appear" ;). (yes, I know that it is dependent on many factors including CPU architecture, Cache Hierarchy, Memory Interface, software workload, etc...)
 
pc-Engine...that A64 post was in response to me...and how did I appear to be hung up on large cache sizes being a requirement for good performance...I mean that is exactly the opposite of the intent of my post.

I recognized larger caches do help but when it comes to real performance gains it's about more execution elements etc in a processor.

Just some other comments...

The redundancy in Cell is not a waste; it is a savings for the PS3. Now chips with defects that take out only one SPE can still be used, while Cells with no defects can go into the PS3 or be used for other purposes. If die size is fixed (by fab restrictions, or just for the sake of argument) and you can increase or decrease the number of cores on that die, then the more cores you use, combined with how lax your requirements are for a useful chip, the more usable chips you net compared to using fewer, larger cores. If you instead increase die size and keep the number of cores per die fixed, well... the larger your die, the more likely defects are to occur, meaning that as die size goes up, the number of usable chips goes down.

This only makes logical sense.

This is why going to a smaller fab process saves one money no?

I don't think many engineers would agree that adding more, larger cores to a chip is a way to increase the number of useful chips one can obtain per wafer. You are reducing the number of dies per wafer and at the same time increasing the probability of defects per core. Sounds bad to me.
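The yield argument above can be put in first-order numbers with the usual Poisson defect model. Everything below (the core area, the defect density, the `die_yield` helper itself) is an illustrative assumption, not Cell data; it only shows why tolerating one dead SPE raises the fraction of usable dies.

```python
from math import comb, exp

def die_yield(total_cores, needed_cores, core_area_mm2, defect_density_per_mm2):
    """P(at least `needed_cores` of `total_cores` cores are defect-free),
    assuming point defects land on each core independently (Poisson model)."""
    # Probability that a single core catches zero defects.
    p_good = exp(-core_area_mm2 * defect_density_per_mm2)
    # Sum the binomial tail: enough cores survived.
    return sum(comb(total_cores, k) * p_good**k * (1 - p_good)**(total_cores - k)
               for k in range(needed_cores, total_cores + 1))

# Made-up numbers: ~15 mm^2 per SPE-sized core, 0.02 defects per mm^2.
strict  = die_yield(8, 8, 15, 0.02)  # all 8 SPEs must work
relaxed = die_yield(8, 7, 15, 0.02)  # one spare tolerated, like PS3's 7-of-8 spec
```

With these made-up inputs the relaxed spec yields several times as many usable dies as the strict one, which is exactly the "lax requirements net you more usable chips" point.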
 
Will aggregate/total/overall performance increase??? YES or NO??
.
.
Sure it could, just like I could win the lottery. Aggregate/total/overall performance is more likely to INCREASE by adding a 4th PPE
You have no ability to read whatsoever, do you? I said pretty clearly that the chances of a drop are pretty high. Oh, I'm sorry... that was in the middle of a paragraph, so it must have completely lost you, what with having to read sentences one after another. Don't ever confuse the theoretical gain with real-world gain. Expecting more is the most effective way to be disappointed.

Even if it has a 75% chance of increasing in real-world usage, that's really bad -- and that makes it not worth it. Engineering is not just about how well something works, but how well it fails. Now, on the other hand, if 360 were using a significantly lower-latency memory subsystem, there would be fewer concerns. It's been shown more than a thousand times that this factor is one of the biggest bottlenecks, if not the biggest. Will there be workloads within gaming that pretty much stay entirely within cache? Absolutely... in those cases, yeah, the 4th core will help. Those are trivial. Are there going to be cases where the 4th core WILL hurt real-world performance? Absolutely. There cannot be a doubt about that. What it comes down to is whether those cases are slowed down enough to outweigh any point where the core helps or doesn't make a difference. If the miss rate increases by even 5%, that pretty much closes the case.
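The miss-rate arithmetic in that paragraph can be sketched with the textbook CPI model. All numbers below (base CPI, memory references per instruction, miss rates, miss penalty) are assumptions picked to illustrate the crossover, not Xenon measurements.

```python
def effective_cpi(base_cpi, mem_refs_per_instr, miss_rate, miss_penalty_cycles):
    """Textbook model: CPI = base CPI + memory stall cycles per instruction."""
    return base_cpi + mem_refs_per_instr * miss_rate * miss_penalty_cycles

# Assumed: in-order core at base CPI 1.0, 0.3 memory refs/instruction,
# ~500-cycle miss penalty; the 4th core pushes the shared-cache miss rate up.
cpi_3 = effective_cpi(1.0, 0.3, 0.020, 500)   # per-core CPI with 3 cores sharing
cpi_4 = effective_cpi(1.0, 0.3, 0.030, 500)   # per-core CPI with 4 cores sharing

throughput_3 = 3 / cpi_3   # aggregate instructions per cycle
throughput_4 = 4 / cpi_4
```

With these assumed inputs the 4-core aggregate (4/5.5 IPC) comes out below the 3-core aggregate (3/4.0 IPC): the extra core is swamped by the extra stalls, which is the "can hurt real-world performance" case.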

Yes, you can partition the cache, but that's no different from making the cache too small. Your little dreamed-up figure for "minimum usable cache" is many times higher than you think. As I've said at least twice before, and you've never paid attention to (surprise, surprise): it's not just the number of threads, not just the number of cores... You keep talking about "usable cache for 3/4 cores," and the figure for that depends on several variables, which is why arguing by pointing to Niagara's cache size has absolutely no place. There isn't any similarity in the workloads or platform environments the chips are placed into. Factor in hardware threads, clock speed, memory speed, cost of cache misses, theoretical throughput, motherboard chipset, target workloads. That's why POWER5 has its 144 MB of L3 cache, or why dual-core Athlons have their 1 MB per core... those matters were factored in for their specific cases.

You're talking about adding more execution resources vying for the same route to memory. It WILL increase the miss rate, and one quirk of that is that the impact of each added core steadily drops, up until a point where it basically explodes. Without actually having a 4-core chip, the only thing I can really do is look at how badly things perform as they are. Jumping to memory is already a bottleneck on single-threaded performance as it is, and in those cases there aren't other cores or other threads vying for that same path. When you're in a position like that, you're already set up to suffer badly for each extra thread. Given what has been seen thus far, I'm inclined to believe MS when they say that 3 cores is a "sweet spot," although not for the same reasons they speak of.

Overclocking increases performance without the need for increasing the cache size, end of discussion.
You have no clue what you're talking about. Have you increased the number of resources vying for cache? You've sped up the core logic and sped up the cache, and depending on how you're approaching it, you've probably also sped up memory and every other bus in your machine... wow, and that's faster... I would never have dreamt it. You are aware, of course, that PC processors are conservatively clocked as they are, and that a family of PC CPUs is more than likely intended to scale much farther than the speed you've bought it at... After all, if someone manufactured cache amounts specific to each and every speed grade, it would cost too much to produce.
 
I would listen to ShootMyMonkey, PC-Engine. He at least has access to the X360 devkit and knows how it handles better than any of us.

not only that, but he speaks the truth... say, the Northwood architecture handles well from a certain cache amount upwards, but there is a critical point (aka Celeron) where the cache is small enough to cripple the pipelines so much that the performance loss is enormous.
The same cache drop affects the Sempron, but not as dramatically... CPUs behave differently. Nothing new here, you know this.

that being said, there is nothing you can say now about whether a 4th core would bottleneck the entire chip, because you and I have no idea how those cores behave. Being in the PPC family means nothing here; they are modified.

So, a 3-core design is probably the design that keeps performance good without the cache being a bottleneck. A 4th core may be beyond the cache limits. Maybe not... ShootMyMonkey is the one working with the X360, not me...
 
PC-Engine said:
The obvious disadvantage is that you're using up precious die area to add redundancy instead of more processing power.

Sure, but that doesn't seem to have much to do with the number of cores like you were suggesting.

If anything, it seems that as the # of cores goes up, so does yield (in terms of the PS3 since they are specced for 7).
 
scificube said:
pc-Engine...that A64 post was in response to me...and how did I appear to be hung up on large cache sizes being a requirement for good performance...I mean that is exactly the opposite of the intent of my post.

I recognized larger caches do help but when it comes to real performance gains it's about more execution elements etc in a processor.

Just some other comments...

The redundancy in Cell is not a waste; it is a savings for the PS3. Now chips with defects that take out only one SPE can still be used, while Cells with no defects can go into the PS3 or be used for other purposes. If die size is fixed (by fab restrictions, or just for the sake of argument) and you can increase or decrease the number of cores on that die, then the more cores you use, combined with how lax your requirements are for a useful chip, the more usable chips you net compared to using fewer, larger cores. If you instead increase die size and keep the number of cores per die fixed, well... the larger your die, the more likely defects are to occur, meaning that as die size goes up, the number of usable chips goes down.

This only makes logical sense.

This is why going to a smaller fab process saves one money no?

I don't think many engineers would agree that adding more, larger cores to a chip is a way to increase the number of useful chips one can obtain per wafer. You are reducing the number of dies per wafer and at the same time increasing the probability of defects per core. Sounds bad to me.

First of all I wasn't responding to you. Second I was agreeing with A64. ;)

Adding a 4th core will decrease the yields relative to a 3-core, yes. The question is how much it decreases the yields and whether that is an acceptable tradeoff for the performance gained by having 4 cores instead of 3. Obviously MS went with the 3-core because it's smaller and cheaper than a 4-core and is enough for their needs. Finally, redundancy CAN be a waste. It just depends on how much die area the redundant blocks take up. For example, does it make sense to have 50% of your die area be redundant? 40%? 30%? 20%? Where do you draw the line? Higher redundancy can make your chip bigger than it needs to be. In fact, the bigger the chip, the more defects you have, so you need even more redundancy. It becomes a domino effect.
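The "where do you draw the line" question is usually framed as cost per good die rather than raw yield. The sketch below uses invented wafer costs, die areas, and yield fractions purely to show the shape of the tradeoff; none of these are real Cell or XCPU figures.

```python
from math import floor, pi

def cost_per_good_die(die_area_mm2, wafer_cost, wafer_diameter_mm, yield_frac):
    """Wafer cost spread over its usable dies. Gross die count is approximated
    by an area ratio (ignores edge loss and scribe lines)."""
    wafer_area = pi * (wafer_diameter_mm / 2) ** 2
    dies_per_wafer = floor(wafer_area / die_area_mm2)
    return wafer_cost / (dies_per_wafer * yield_frac)

# Invented comparison on a 300 mm wafer costing $5000:
no_spare   = cost_per_good_die(200, 5000, 300, 0.25)  # smaller die, lower yield
with_spare = cost_per_good_die(220, 5000, 300, 0.40)  # bigger die, spare core
```

With these invented numbers the redundant design is cheaper per good die despite fewer dies per wafer; flip the assumed yields and the smaller die wins, which is exactly the "it depends on how much area redundancy eats" point.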
 
PC-Engine said:
First of all I wasn't responding to you. Second I was agreeing with A64. ;)

Adding a 4th core will decrease the yields relative to a 3-core, yes. The question is how much it decreases the yields and whether that is an acceptable tradeoff for the performance gained by having 4 cores instead of 3. Obviously MS went with the 3-core because it's smaller and cheaper than a 4-core and is enough for their needs. Finally, redundancy CAN be a waste. It just depends on how much die area the redundant blocks take up. For example, does it make sense to have 50% of your die area be redundant? 40%? 30%? 20%? Where do you draw the line? Higher redundancy can make your chip bigger than it needs to be. In fact, the bigger the chip, the more defects you have, so you need even more redundancy. It becomes a domino effect.

It's all good man... just thought you misunderstood me. That's all.

I certainly understand that there is a limit to the usefulness of having redundant cores and that this argument is, well... relative to the situation. That's the best I can describe it. Redundancy makes sense in the case of the PS3. Sony will not be making money off Cell here. Sony can't charge a premium to cover the costs of a bunch of tossed chips... so Sony moved to have fewer chips get tossed. If we were talking about the X2, well, in this case why should AMD go this route? No good reason I'd imagine. They can charge a premium for their chips. In a way they save as well, since less silicon is used in providing for redundancy. AMD really doesn't have anywhere to put chips with defective cores either. Sony can stick them in TVs or whatever the heck they plan to use them in.

Redundancy isn't good or bad. It's a tool that either does or doesn't get you the results you want, those results being subject to die size, core complexity, types of defects... it either works for you or it doesn't. It works for Sony... not so much for MS. Different approaches -> different results -> different advantages/disadvantages. IMO (<-teh insulator :p)
 
Given what has been seen thus far, I'm inclined to believe MS when they say that 3 cores is a "sweet spot," although not for the same reasons they speak of.

And what reasons did MS use to conclude it is a sweet spot?

You have no clue what you're talking about. Have you increased the number of resources vying for cache? You've sped up the core logic and sped up the cache, and depending on how you're approaching it, you've probably also sped up memory and every other bus in your machine... wow, and that's faster... I would never have dreamt it. You are aware, of course, that PC processors are conservatively clocked as they are, and that a family of PC CPUs is more than likely intended to scale much farther than the speed you've bought it at... After all, if someone manufactured cache amounts specific to each and every speed grade, it would cost too much to produce.

My statement was replying to your "higher clock equals missed cache cycles" statement. It has nothing to do with adding more cores and keeping the clock the same. As I already said before, adding a 4th core without increasing the cache size is a tradeoff, yes; I never said it was a win-win situation. Adding a 4th core AND clocking it higher is a totally different situation.

If anything, it seems that as the # of cores goes up, so does yield (in terms of the PS3 since they are specced for 7).

That's because they downgraded it to 7. Originally they wanted 8. MS can downgrade their CPU to 2 cores and claim better yields too, but what's the point? What I see is you assuming the redundant SPE automatically solves the yield issue, making it equal to XCPU yields or better. What if the actual yields, even with the redundant SPE, are worse than the XCPU's even with 4 cores? Heck, we don't even know the die size of a hypothetical 4-core XCPU.
 
PC-Engine said:
That's because they downgraded it to 7. Originally they wanted 8.

That very well might be because they were worried about yields, but that doesn't necessarily have anything to do with the # of cores, unlike die size.

PC-Engine said:
MS can downgrade their CPU to 2 cores and claim better yields too, but what's the point?

Yes, they certainly could increase yields by doing this, but then they would have crippled the performance of the CPU. I suppose we can take this to mean that they're not worried about yields, or are willing to suffer whatever defect rate they expect.

PC-Engine said:
What I see is you assuming the redundant SPE automatically solves the yield issue, making it equal to XCPU yields or better.

Nope, not at all. In fact I took pains to clearly point this out. I'm not claiming yields will be better or worse for CELL. I'm just saying that having more cores (with some acting as redundant) improves yields, not hurts them.

Ty said:
If anything, it seems that as the # of cores goes up, so does yield (in terms of the PS3 since they are specced for 7).

PC-Engine said:
What if the actual yields, even with the redundant SPE, are worse than the XCPU's even with 4 cores? Heck, we don't even know the die size of a hypothetical 4-core XCPU.

Exactly. None of us know which will have better yields.

My statements are simply borne out of your previous ones which seemed to fly in the face of elementary engineering.

The more cores on a die, the higher the chance of a core being bad. 8 is two times 4. If you take out the SPEs and replace them with 3 PPEs, you would get better yields and would not need a 5th PPE for redundancy.

It's not the # of cores that impacts the defect rate. So having a redundant core for CELL is good for their yields. It doesn't imply that CELL will be better or worse than the XCPU in terms of yields.
 
PC-Engine said:
MS can downgrade their CPU to 2 cores and claim better yields too, but what's the point?
I've never heard that it has such an architectural feature built in. It has a tight crossbar instead of the ring bus in Cell so I assume they couldn't.
 
Nope, not at all. In fact I took pains to clearly point this out. I'm not claiming yields will be better or worse for CELL. I'm just saying that having more cores (with some acting as redundant) improves yields, not hurts them.

It doesn't exactly improve yields in an absolute sense. Sure, adding more redundancy will keep your chip "alive", but if you need a bigger die to do it, then it becomes self-defeating, because a bigger die has a higher chance of defects, not to mention higher cost. It might just be cheaper to have a slightly smaller die with no redundancy.

It's not the # of cores that impacts the defect rate. So having a redundant core for CELL is good for their yields. It doesn't imply that CELL will be better or worse than the XCPU in terms of yields.

Of course it's related to the number of cores, since more cores take up more space and more space equals a bigger die. It seems elementary engineering is not your forte.

one said:
I've never heard that it has such an architectural feature built in. It has a tight crossbar instead of the ring bus in Cell so I assume they couldn't.

Not really sure what a crossbar has to do with using fewer cores. The XCPU is just like an ARM MPCore: you can add or remove cores as necessary, from 1-4 or even higher.
 
one said:
I've never heard that it has such an architectural feature built in. It has a tight crossbar instead of the ring bus in Cell so I assume they couldn't.
AFAIK it's not a true ring in Cell, rather a bidirectional set of two buses.

Would have been cool if it had been an actual ring, and you could have had multiple transfers going between many SPEs (and the PPE) at the same time... :)
 
My statement was replying to your higher clock equal missed cache cycles statement.
I didn't say it equals missed cache cycles... I said that each cache miss costs more CPU cycles when the clock speed has changed (assuming memory latency stays the same in absolute time) -- the cache miss rate doesn't really change when you overclock. 100 ns of latency at 2 GHz is 200 cycles, but at 3.2 GHz it's 320 cycles. That's 120 more cycles that could have been time spent working. You'll see this if you overclock a CPU using only the multiplier, as opposed to also raising the memory bus clock.
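The cycle arithmetic here is just latency times clock; a two-line helper reproduces the 200-vs-320 figures from the post (the helper name is mine, the numbers are from the text above).

```python
def miss_penalty_cycles(latency_ns, clock_ghz):
    """Cycles burned per cache miss when memory latency is fixed in wall-clock
    time: cycles = latency (ns) * clock (GHz)."""
    return latency_ns * clock_ghz

at_2ghz   = miss_penalty_cycles(100, 2.0)  # 200 cycles, as in the post
at_3_2ghz = miss_penalty_cycles(100, 3.2)  # 320 cycles, as in the post
```

The gap (120 cycles here) grows linearly with core clock as long as memory latency stays put, which is why multiplier-only overclocks show diminishing returns.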

Adding more cores IS a different situation as it DOES affect the cache miss rate. It's a balancing game at that point. Personally, I'd have to say that the one and only reason why the cache is 1 MB is simply cost. They set certain design goals within certain die size/cost bounds, and 1 MB cache was the largest they could get.
 
PC-Engine said:
It doesn't exactly improve yields in an absolute sense. Sure adding more redundancy will keep your chip "alive", but if you need a bigger die to do it then it becomes hipocritical because a bigger die has a higher chance of defects not to mention higher cost. It might just be cheaper to have a slighly smaller die with no redundancy.

Yes, I can see this. Especially if the defect rate is not linear in the area of the die. At some point, adding more cores could negatively impact the defect rate beyond the ability of the additional cores to increase yields.

PC-Engine said:
Of course it's related to the number of cores since more cores take up more space and more space equal bigger die.

This argument is valid when comparing two examples of the same CPU, one with X cores and one with Y cores, e.g. Cell with 6 cores and Cell with 8 cores. It is not a valid argument between two different CPUs like Cell and XeCPU, because the sizes of the cores are what impact the defect rate as die size goes up.

PC-Engine said:
It seems elementary engineering is not your forte.
Nope. Which is why I don't argue with engineers. :) Wait, didn't you tell me you were an engineer?
 