Interesting, so you're thinking of vector processors like the ones described in that Wikipedia entry? By those standards even GPUs are pretty narrow vector machines. For some reason, when I think of modern vector processors I picture something closer in width to the SPUs or the SIMD units in today's CPUs.
GPUs are rather narrow compared to the long vector machines.
I'm not sure there is a hard line for vector machines, but GPUs have a few of the other facets of full vector machines that x86 does not yet have.
Am I right to think that the wider the vectors the more complicated it becomes for the hardware to support those read/write scatter/gather operations?
It can become more complicated. Part of the complication stems from how many of these individual accesses are done simultaneously, what happens when lanes access the same locations, whether the operation is interruptible, and whether the architecture even cares.
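To make the "same locations" issue concrete, here's a minimal sketch in plain C (not any particular ISA) of a 4-lane scatter; the lane ordering and the "who wins on a collision" rule are exactly the corner cases the hardware has to pin down:

[code]
/* Minimal sketch of why vector scatter needs its corner cases defined:
 * if two lanes target the same index, the result depends on which lane
 * "wins". The sequential lane order here is just one possible choice. */
#include <stdio.h>

#define LANES 4

static void scatter(float *mem, const int idx[LANES], const float val[LANES])
{
    /* A real ISA must specify whether lanes are ordered, unordered,
     * or whether overlapping indices are simply undefined. */
    for (int lane = 0; lane < LANES; ++lane)
        mem[idx[lane]] = val[lane];
}

int main(void)
{
    float mem[8] = {0};
    int   idx[LANES] = {1, 5, 1, 3};          /* lanes 0 and 2 collide */
    float val[LANES] = {10.f, 20.f, 30.f, 40.f};

    scatter(mem, idx, val);
    printf("mem[1] = %g\n", mem[1]);          /* 30 with last-lane-wins order */
    return 0;
}
[/code]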
I'm not sure I understand that properly. As I read it, it's a blend of "this only becomes an issue once you go to really wide vector machines" and "some instructions/functionalities just don't make sense" (a bit like how an SPU can't do everything a PPU does, even if you were willing to "waste" 3 of its 4 "lanes").
It can go either way. There are things like synchronization operations that involve performing a read/modify/write operation as a single atomic event.
For a scalar operation, there isn't the possibility that it could be interrupted or contend with itself. For a vector version, without carefully defining the corner cases the instruction may not work, and even if it does, it could involve enough overhead to make it prohibitive (see the sketch below).
There was a paper on atomic vector operations linked in the forum previously, I'm not sure where at the moment.
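For illustration only (this is not taken from that paper), here's a small C sketch of the classic histogram case: a naive gather/add/scatter silently drops counts when lanes hit the same bin, while per-lane scalar read/modify/write does not, and that per-lane guarantee is what a proper vector atomic would have to preserve:

[code]
/* Hypothetical example of why "vector atomics" need careful definition:
 * three lanes increment the same histogram bin. */
#include <stdio.h>

#define LANES 4

int main(void)
{
    int bins_naive[4]  = {0};
    int bins_scalar[4] = {0};
    int idx[LANES] = {2, 2, 1, 2};            /* three lanes hit bin 2 */

    /* Naive "vector" version: gather, add, scatter as separate steps. */
    int gathered[LANES], summed[LANES];
    for (int lane = 0; lane < LANES; ++lane) gathered[lane] = bins_naive[idx[lane]];
    for (int lane = 0; lane < LANES; ++lane) summed[lane]   = gathered[lane] + 1;
    for (int lane = 0; lane < LANES; ++lane) bins_naive[idx[lane]] = summed[lane];

    /* Scalar read/modify/write per lane: every increment lands. */
    for (int lane = 0; lane < LANES; ++lane) bins_scalar[idx[lane]] += 1;

    printf("bin 2: naive=%d, scalar RMW=%d\n", bins_naive[2], bins_scalar[2]);
    return 0;   /* prints naive=1, scalar RMW=3 */
}
[/code]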
I get this part better, or so I hope: utilization should not be spoken of in isolation, without regard to efficiency.
That is what has been argued. Care should be taken that everyone is using the same definitions of utilization and efficiency; sometimes we can all use the same words but be talking about very different things.
What I had in mind when I posted earlier, after reading your discussion with Nick, Rpg.314, JohnH and the others, is this: maybe what Nick calls for (something like Larrabee, I guess) won't happen, either because of real technical problems or because such a shift lacks utility, but could a "plain", "narrow" design (narrow by historic vector-processor standards) fit the bill?
That topic isn't just about vector instructions, but overall design philosophy.
The focus on vectors in my earlier comment is that there are assumptions made in their functionality that do not apply universally, so they are not as generic as scalar ops can be.
Other parts of the argument don't necessarily concern themselves with vector versus scalar. The discussion on texturing also covers the layout of the cache and how it can hinder full scatter/gather throughput in the form Larrabee most likely implemented it.
This actually has something to do with trying to shoehorn vector capability onto a memory pipeline that has a scalar design as its basis. I'll expound on this more in a bit.
I don't know what the next iterations of Larrabee will look like, but the more I read comments on the matter (and on GPUs too), the more I feel it looks way too much like a "standard" CPU (in regard to the caches especially); on the other hand, I feel it also tried too hard to look like a GPU (pretty wide, but not by historical vector standards).
This is actually a criticism leveled at SIMD extensions in general, and x86 in particular (because it tended to be the worst offender).
The short vectors, the inflexible memory operations, the clunky permute capability are "vector" extensions for a design that emphasizes low-latency scalar performance.
Scatter/gather is not simple to perform at speed on the very same memory ports that the scalar ops use, and it is not simple to make it a first-class operation when there are some pretty hefty requirements imposed by the rules of operating on the scalar side (coherence, consistency, atomicity etc.) Very frequently, the vector operations tend to be more lax, but this also means they are not as generic as the scalar side.
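As a rough illustration of the first point, here's what a 4-wide gather looked like on SSE-era x86, which had no gather instruction at all: four independent scalar loads plus a pack step, all going through the same load ports as ordinary scalar code (gather4 is just a helper name I made up):

[code]
/* Minimal sketch, assuming SSE-class x86: a "gather" built from
 * per-lane scalar loads packed into one register. */
#include <xmmintrin.h>
#include <stdio.h>

static __m128 gather4(const float *base, const int idx[4])
{
    /* Four independent scalar loads, then packed; element 0 is idx[0]. */
    return _mm_set_ps(base[idx[3]], base[idx[2]], base[idx[1]], base[idx[0]]);
}

int main(void)
{
    float table[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    int   idx[4]   = {7, 2, 5, 0};

    float out[4];
    _mm_storeu_ps(out, gather4(table, idx));
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);  /* 7 2 5 0 */
    return 0;
}
[/code]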
Could Intel, in its goal of pushing x86, have passed on an opportunity to create a de facto standard (with a matching licensable ISA) for vector processors?
Intel has had ~15 years to do it, but x86 is not a vector ISA. Its extensions were SIMD on the cheap, and they were roundly criticized for their lack of flexibility and restrictions in their use.
Each iteration has improved certain aspects, as transistor budgets expanded, but there are some strong constraints imposed by the scalar side.
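One concrete example of the sort of restriction that drew criticism, assuming an SSE-class target and GCC/Clang alignment syntax: the fast packed load demands 16-byte alignment, so code either over-aligns its data or falls back to the traditionally slower unaligned form:

[code]
/* Sketch of one SSE-era restriction: aligned vs. unaligned packed loads. */
#include <xmmintrin.h>
#include <stdio.h>

int main(void)
{
    /* Over-aligned buffer so the aligned load is legal (GCC/Clang syntax). */
    __attribute__((aligned(16))) float data[4] = {1.f, 2.f, 3.f, 4.f};

    __m128 a = _mm_load_ps(data);    /* faults if data were not 16B aligned */
    __m128 u = _mm_loadu_ps(data);   /* always legal, traditionally slower  */

    float out[4];
    _mm_storeu_ps(out, _mm_add_ps(a, u));
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);  /* 2 4 6 8 */
    return 0;
}
[/code]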
This is also where there is a difference of opinion.
There is the position that there can be a core that can do everything as well as a throughput-oriented design while still being focused on latency-sensitive scalar performance, all while not blowing up the transistor and power budget.
This goes beyond vector versus scalar and concerns the overarching question of generality versus specialization.
I have a vague feeling that Intel approached the problem from the wrong angle; could they have forgotten that sometimes more is less? Instead of taking on the challenge of competing against high-end GPUs with an x86 front end and fairly wide (yet still narrow) SIMD, with all the hardware headaches that implies, would they have been better off entering the market from below rather than trying to enter it from above?
The market below is far more power-conscious, so it would take even longer to compete there. There are integrated designs that don't have unified shaders or full programmability because the power usage was not acceptable.
Even for the high end, the chip was massive, and no real numbers showed up to indicate it would be competitive at the time of release.
Part of the problem is that Intel needed a compelling advantage and a unified message. The design was too delayed to be compelling, and Intel's message was never coherent (and there were signs that various divisions were not trying too hard to help it).