Return of Cell for RayTracing? *split*

And now strong parallel CPU compute power seems good... many small CPUs, a wide RAM bus, special CPU functions for RT... Yes :)
 
Why were 2 of my posts moved here? They're commenting on a rumor that has nothing to do with Cell...

Because they were directly part of the natural flow of how the discussion turned to Cell.
 
Nope. More efficient to just throw hardware in there to do just that and only that. My belief is that as long as you have to have a traditional rendering pipeline and you need it to have maximum performance, you need a GPU; and if you need a GPU anyway, you're going to be better off making that bigger. Cell is not specialized enough to be an efficient GPU (or an even more specialized processor like an RT processor) and too specialized to be a good CPU. There's no more place for it today than there was 5 years ago, and there wouldn't be a place for it in the next machines either. Maybe something like it would work in what comes after that, if we see a transition completely away from rasterization.

I was not suggesting Cell should play the part of a GPU in the next PS5. I was proposing to use an individual Cell as an addition, not as part of the APU. I know that Cell is not good enough as a GPU, but as a dedicated RT chip it could do wonders.


From that paper, it appears a big benefit is in the SIMD ISA and generous register set.
They made some decisions in implementation to work around pain points in the architecture: running a ray tracer in each SPE rather than running separate stages of the ray tracer's pipeline on different SPEs and passing along the results; software caching; workarounds for serious branch penalties; and a favorable evaluation of software multithreading for the single-context SPEs.
The architecture they were trying to write onto the hardware was a SIMD architecture (possibly with scalar ISA elements), with different forms of branch handling, better threading, and caches.
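To make the branch-penalty workaround concrete, here is a minimal, generic sketch (my own illustration, not code from the paper) of a ray/AABB slab test written without data-dependent branches, the style SPE code tended to favor; the scalar fminf/fmaxf calls stand in for the compare-and-select work the SPE's SIMD units would do.

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical types, for illustration only. */
typedef struct { float x, y, z; } vec3;
typedef struct { vec3 lo, hi; } aabb;

/* Branchless slab test: every ray does the same min/max work, so there is no
   data-dependent branch for a long in-order pipeline to stall on. On an SPE
   the same pattern maps onto compare + select intrinsics. */
static bool ray_hits_aabb(vec3 org, vec3 inv_dir, aabb box, float t_max)
{
    float tx1 = (box.lo.x - org.x) * inv_dir.x;
    float tx2 = (box.hi.x - org.x) * inv_dir.x;
    float tmin = fminf(tx1, tx2);
    float tmax = fmaxf(tx1, tx2);

    float ty1 = (box.lo.y - org.y) * inv_dir.y;
    float ty2 = (box.hi.y - org.y) * inv_dir.y;
    tmin = fmaxf(tmin, fminf(ty1, ty2));
    tmax = fminf(tmax, fmaxf(ty1, ty2));

    float tz1 = (box.lo.z - org.z) * inv_dir.z;
    float tz2 = (box.hi.z - org.z) * inv_dir.z;
    tmin = fmaxf(tmin, fminf(tz1, tz2));
    tmax = fminf(tmax, fmaxf(tz1, tz2));

    /* One comparison at the end instead of an early-out branch per axis. */
    return tmax >= fmaxf(tmin, 0.0f) && tmin <= t_max;
}
```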

But would that prevent a more modern Cell from being part of a next-gen console? The paper itself suggests that there is a need for other rasterisation solutions.
 
I was not suggesting Cell should play the part of a GPU in the next PS5. I was proposing to use an individual Cell as an addition, not as part of the APU. I know that Cell is not good enough as a GPU, but as a dedicated RT chip it could do wonders.

I know what you were saying. And no, it couldn't. Hardware designed to do RT and only RT would absolutely destroy it in performance and power usage, and would take up much less space on the die. You are effectively arguing that since Cell was better at video decoding back in 2005 than a typical CPU at the time, adding Cell to modern GPUs to act as the VPU would do wonders for their ability to decode video. The dedicated fixed-function blocks in GPUs responsible for video decoding are much more performant and use up much less space and power than any evolution of Cell ever could, and it's exactly the same thing with dedicated RT hardware.
 
I know what you were saying. And no, it couldn't. Hardware designed to do RT and only RT would absolutely destroy it in performance and power usage, and would take up much less space on the die. You are effectively arguing that since Cell was better at video decoding back in 2005 than a typical CPU at the time, adding Cell to modern GPUs to act as the VPU would do wonders for their ability to decode video. The dedicated fixed-function blocks in GPUs responsible for video decoding are much more performant and use up much less space and power than any evolution of Cell ever could, and it's exactly the same thing with dedicated RT hardware.
As we don't know how the blocks in the GPU are facilitating raytracing, it's hard to compare. The flexibility of a raytracing processor, rather than a memory-access block, may also be better in supporting different algorithms, although GPUs are now very versatile in compute. Cell could also see an RT acceleration structure added in place of an SPE - the original vision allowed for specialised heterogeneous blocks to be added.

At this point, all we can do for Cell, and more so for a Cell2 with suitable enhancements, is speculate about performance, because it hasn't seen the investment that GPU-based RT has; all we have are existing examples like the IBM demo. AFAICS, the bottleneck in RT is mostly memory search and access, although complex shaders can add significant per-ray computing requirements. I definitely think there's potential in a MIMD versus SIMD solution though. The code would be quite different and operate differently, making it very hard to compare without a test case.
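As a rough picture of why memory search and access dominates: each ray walks an acceleration structure by chasing node indices scattered through memory, with only a handful of compares per node. A hypothetical, simplified BVH traversal skeleton (the node layout and names are mine, not from any particular tracer):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical flattened BVH node, roughly one cache line's worth of data. */
typedef struct {
    float    lo[3], hi[3];   /* bounds */
    uint32_t left, right;    /* child indices (or a primitive range if leaf) */
    uint32_t is_leaf;
} bvh_node;

/* Per-ray traversal: each iteration is a dependent load of a node that can sit
   anywhere in the array, so the cost is dominated by memory latency rather
   than by the few compares done once the node arrives. */
void traverse(const bvh_node *nodes, uint32_t root,
              int  (*hit_box)(const bvh_node *),    /* e.g. a slab test */
              void (*shade_leaf)(const bvh_node *))
{
    uint32_t stack[64];      /* small fixed stack; a real tracer bounds depth */
    size_t   sp = 0;
    stack[sp++] = root;

    while (sp > 0) {
        const bvh_node *n = &nodes[stack[--sp]];    /* incoherent load */
        if (!hit_box(n))
            continue;
        if (n->is_leaf) {
            shade_leaf(n);                          /* per-ray shading cost */
        } else {
            stack[sp++] = n->left;                  /* more pointer chasing */
            stack[sp++] = n->right;
        }
    }
}
```

Whether that loop runs as one ray per SIMD lane or one ray per small scalar core is exactly the MIMD versus SIMD question, and it changes the code a lot, which is part of why the comparison is hard without a test case.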

It'd be nice if @AlexV could weigh in, given a little experience in such matters. ;)

Doing some research on PVR, all this noise nVidia is getting is ridiculous. PowerVR are/were soooo far ahead, but they've been overlooked because they are a mobile chipset used in a limited number of devices.

I wonder what a monster-sized PowerVR GPU with raytracing would look like?
 
Well, the reason PVR keep getting overlooked is that Tile Based Deferred Rendering is really good at certain things but would be rubbish at most of the things traditional raster-based cards are good at. Or at least that was the reason posited back when I was lamenting the lack of coloured lighting in Quake 2 on my PVR1. I'm sure TBDR has come a long way since then, but I have to assume there is a good reason no-one is licensing PVR tech to make a GPU today, even if it is just a lack of Windows driver support these days. I thought I read a while back that recent NV cards do some TBDR in part of their pipeline anyway?
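For anyone who hasn't looked at how a tiler works: the rough idea behind TBDR is to bin screen-space geometry into small tiles first, then resolve visibility and shade one tile at a time out of on-chip memory. A toy binning pass, purely illustrative and nothing like an actual PVR implementation, might look like this:

```c
#include <stdlib.h>

#define TILE     32                    /* tile edge in pixels */
#define SCREEN_W 1280
#define SCREEN_H 720
#define TILES_X  (SCREEN_W / TILE)
#define TILES_Y  (SCREEN_H / TILE)

typedef struct { float x[3], y[3]; int id; } tri2d;       /* screen-space tri */
typedef struct { int *ids; int count, cap; } tile_bin;    /* triangles per tile */

static void bin_push(tile_bin *b, int id)
{
    if (b->count == b->cap) {          /* grow the list (no error handling here) */
        b->cap = b->cap ? b->cap * 2 : 8;
        b->ids = realloc(b->ids, (size_t)b->cap * sizeof *b->ids);
    }
    b->ids[b->count++] = id;
}

/* Pass 1 of a tiler: record, for every tile, which triangles might touch it.
   Visibility and shading then run one tile at a time from on-chip memory,
   which is where the bandwidth savings come from. */
void bin_triangles(const tri2d *tris, int n, tile_bin bins[TILES_Y][TILES_X])
{
    for (int i = 0; i < n; ++i) {
        float minx = tris[i].x[0], maxx = tris[i].x[0];
        float miny = tris[i].y[0], maxy = tris[i].y[0];
        for (int v = 1; v < 3; ++v) {
            if (tris[i].x[v] < minx) minx = tris[i].x[v];
            if (tris[i].x[v] > maxx) maxx = tris[i].x[v];
            if (tris[i].y[v] < miny) miny = tris[i].y[v];
            if (tris[i].y[v] > maxy) maxy = tris[i].y[v];
        }
        int tx0 = (int)(minx / TILE), tx1 = (int)(maxx / TILE);
        int ty0 = (int)(miny / TILE), ty1 = (int)(maxy / TILE);
        if (tx0 < 0) tx0 = 0;
        if (ty0 < 0) ty0 = 0;
        if (tx1 > TILES_X - 1) tx1 = TILES_X - 1;
        if (ty1 > TILES_Y - 1) ty1 = TILES_Y - 1;
        for (int ty = ty0; ty <= ty1; ++ty)
            for (int tx = tx0; tx <= tx1; ++tx)
                bin_push(&bins[ty][tx], tris[i].id);
    }
}
```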
 
Doing some research on PVR, all this noise nVidia is getting is ridiculous. PowerVR are/were soooo far ahead, but they've been overlooked because they are a mobile chipset used in a limited number of devices.

I wonder what a monster-sized PowerVR GPU with raytracing would look like?

Yes, unfortunately PVR was out of the desktop space at that time and had been for many years. Had they still been involved in desktop PC computing, they may have been able to make some headway. But being as they were basically all-in on mobile at the time, what mobile developer is going to risk resources on RT for a chip that may or may not get picked up for a mobile SoC? A discrete PC solution isn't reliant on an SoC manufacturer picking it to use in a commercial product.

It's just really unfortunate that they were basically in the wrong hardware space to get something like that pushed into real use.

While the pick-up in the desktop space (not just gamers, but research and science as well) may still have been slower than what is happening with NV's Turing chip, due to them being far smaller than NV in terms of size and especially marketing and developer-relations dollars, it would have seen far greater adoption than it did with them being seen as a mobile graphics provider.

Regards,
SB
 
As we don't know how the blocks in the GPU are facilitating raytracing, it's hard to compare. The flexibility of a raytracing processor, rather than a memory-access block, may also be better in supporting different algorithms, although GPUs are now very versatile in compute. Cell could also see an RT acceleration structure added in place of an SPE - the original vision allowed for specialised heterogeneous blocks to be added.

Essentially I'm looking at the precedent set by every compute workload I'm aware of. You would expect that when designing hardware to accomplish a specific task (and only that task), that hardware would be set up to do that task in the optimal way. I just don't see how you could ever expect to get better performance per watt/area/$$$ from a more flexible design. I'm not arguing that you couldn't come up with some evolution or configuration of Cell that would exceed the raytracing performance of the RT cores + CUDA cores + Tensor cores setup in the RTX cards given similar budgets for area and power. I'm arguing that using Cell for raytracing would use up more of the overall available power/area allocated for the chip than the RT + Tensor cores do in Turing, for example, and the only way to accommodate this would be to cut back on the CUDA cores and compromise your rasterization performance. That's not tenable in the near future. You could build a design that was better at raytracing than Turing (or something similar), but I think it's very questionable that you could build a design that was both better at raytracing and not also deficient in rasterization performance.
 
Yes, unfortunately PVR was out of the desktop space at that time and had been for many years. Had they still been involved in desktop PC computing, they may have been able to make some headway. But being as they were basically all-in on mobile at the time, what mobile developer is going to risk resources on RT for a chip that may or may not get picked up for a mobile SoC? A discrete PC solution isn't reliant on an SoC manufacturer picking it to use in a commercial product.

It's just really unfortunate that they were basically in the wrong hardware space to get something like that pushed into real use.

While the pick-up in the desktop space (not just gamers, but research and science as well) may still have been slower than what is happening with NV's Turing chip, due to them being far smaller than NV in terms of size and especially marketing and developer-relations dollars, it would have seen far greater adoption than it did with them being seen as a mobile graphics provider.

Regards,
SB

Well it's too bad Apple has ditched them for their own graphics IP. There could've been some legit merit in Apple using PVR's embedded RT tech in specialized workstations and even in the iPad Pro for design professionals doing their work and showing it off to clients in realtime.

Now that Imagination is de facto owned by the Chinese, there is a new inroad for the company to produce GPUs for China's massive market across all segments. I sincerely believe dedicated GPUs could be part of that, even if they are in the lower-end range. Affordable RT hardware would be a boon to a market that, at the lower end, can't drop oodles of cash on RTX-based Quadros for real-time ray tracing.

Seems like a chance for PVR to really cut into segments RTX probably won't cover for two to three generations.
 
I did see a reference to a DSP architecture that presaged this, the TI TMS320C80 MVP.
Single generalist master core and 4 DSPs. It didn't seem to catch on.

There were a bunch of attempts to kick-start parallel solutions to computing problems back in the 1980s. A few bespoke applications found uses for small arrays of DSPs, and Atari tried to sell the world on the Atari Transputer Workstation. Supercomputers, mainframes and minicomputers were already heavily parallel.

As far as multithreading went, such techniques were needed for other hardware in the 8 years after Cell, and so reusing them would make sense generally without tacking on the complexity of the SPEs, a master core/bottleneck, and lack of caching.

Yup, but multi-core/multi-threaded solutions likely wouldn't have found adoption had Intel been able to keep cranking up clock speeds. The fact they couldn't forced the issue. PowerPC was already working towards multi-core before Intel really began to hit a ceiling - IBM knew the technical barrier was coming from their big-iron business.
 
Well it's too bad Apple has ditched them for their own graphics IP. There could've been some legit merit in Apple using PVR's embedded RT tech in specialized workstations and even in the iPad Pro for design professionals doing their work and showing it off to clients in realtime.

Now that Imagination is de facto owned by the Chinese, there is a new inroad for the company to produce GPUs for China's massive market across all segments. I sincerely believe dedicated GPUs could be part of that, even if they are in the lower-end range. Affordable RT hardware would be a boon to a market that, at the lower end, can't drop oodles of cash on RTX-based Quadros for real-time ray tracing.

Seems like a chance for PVR to really cut into segments RTX probably won't cover for two to three generations.
In part the story of PowerVR Ray Tracing is about Imagination being in the wrong market to see it adopted, but from a different perspective, it's about Caustic Graphics, the company that actually developed that tech, having been acquired by Imagination and not by AMD or Nvidia.
 
Which mirrors what's happened in the RT space. There's been so much investment in rasterisation, RT is at a technological disadvantage. Makes it very hard to compare the two in terms of absolute potential.
Rasterization is still very well suited to the wide, burst-friendly cache and DRAM architectures we have, which Nvidia's research still seems to encourage using for the majority of what winds up on-screen. Cell's big assertion was that much of what had been put into CPU architecture in the preceding decade or so could be discarded, so in that regard wouldn't requiring a lot of investment be a sort of defeat? The reasons why the individual elements Cell removed were so popular were very compelling ones, and the bet that standard architectures had run out of steam was proven wrong.

I was not suggesting Cell should play the part of a GPU in the next PS5. I was proposing to use an individual Cell as an addition, not as part of the APU. I know that Cell is not good enough as a GPU, but as a dedicated RT chip it could do wonders.

But would that prevent a more modern Cell from being part of a next-gen console? The paper itself suggests that there is a need for other rasterisation solutions.

The paper indicated they got the best results using Cell in a way that they believed its design didn't necessarily intend.
The base idea of a master PPE controlling the SPEs was thrown out, and most of the task-based work distributors in the PS3 era came to similar conclusions.
The idea of having SPEs adopt individual kernels and pass data between themselves over the EIB was thrown out in favor of running the same overall ray-tracing pipeline on each one--which was something the small LS was not optimal for.
Various parts of the process benefited from having caches, so much so that a sub-optimal software cache (a sketch of such a cache follows at the end of this post) still did better than hewing to the SPE's target workflow.
The long pipeline and bad branch penalties required extra work to get near the utilization that either a branch predictor (like a CPU's) or hold-until-resolve handling (like a GPU's) could achieve--although in either case that means more hardware in the CPU case or more context in the GPU case.
The paper noted the SPE's SIMD ISA was generous for the time, although poorer on scalar elements.
Multi-threading of the SPE in software was mooted, something the SPE's oversized context and heavy context-switch penalties were not targeting.

To top it off, the realities of getting to 4 GHz and above were such that perhaps the long pipeline as it was could have shrunk on future nodes with limited impact on realized clocks.

Is an architecture with threading, branch handling, caches, more modest pipeline, and other generalist features really Cell?
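To make the software-cache point concrete, here is an assumed, minimal sketch of the sort of direct-mapped software cache an SPE tracer could use for node fetches; memcpy stands in for the DMA get into local store, and none of the names or sizes come from the IBM paper.

```c
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 128                 /* one DMA-friendly line */
#define NUM_LINES  64                  /* small enough to live in local store */

typedef struct {
    uint64_t tag[NUM_LINES];           /* which main-memory line each slot holds */
    uint8_t  data[NUM_LINES][LINE_BYTES];
    int      valid[NUM_LINES];
} sw_cache;

/* Return a local-store pointer to the bytes at 'addr'. On a real SPE the memcpy
   would be a DMA get the program waits on; the win comes from re-used nodes
   being served out of local store instead of being fetched again. */
static const void *cache_fetch(sw_cache *c, const uint8_t *main_mem, uint64_t addr)
{
    uint64_t line = addr / LINE_BYTES;
    unsigned slot = (unsigned)(line % NUM_LINES);      /* direct-mapped */

    if (!c->valid[slot] || c->tag[slot] != line) {     /* miss: pull the line in */
        memcpy(c->data[slot], main_mem + line * LINE_BYTES, LINE_BYTES);
        c->tag[slot]   = line;
        c->valid[slot] = 1;
    }
    return c->data[slot] + (addr % LINE_BYTES);
}
```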

Yup, but multi-core/multi-threaded solutions likely wouldn't have found adoption had Intel been able to keep cranking up clock speeds.
If Intel could, then so would everyone else and the 10-15 GHz processors would have been as good or better than the dual and quad cores that we got instead. Multi-core is an inferior method for scaling performance, but one that remained physically possible.
 
Is an architecture with threading, branch handling, caches, more modest pipeline, and other generalist features really Cell?
Yes. The concept was small cores, lots of 'em, and heterogeneous. The reason useful stuff was stripped out for Cell 1 was to make them small enough to get more on a die. At smaller lithographies, more can be invested per core to make it better based on how people actually need to use it, while still providing a huge CPU core count of 60+ on a die.
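For a concrete picture of the "many small cores" model (and of the task-based distributors mentioned earlier in the thread), a bare-bones shared work queue could look like the sketch below; this is a generic C11 illustration, not anything Cell- or console-specific.

```c
#include <stdatomic.h>
#include <stddef.h>

/* A fixed list of independent jobs (screen tiles to trace, say) plus one shared
   counter. Every worker core grabs the next unclaimed job; fast cores simply
   take more jobs, so no master core has to hand out work and become the
   bottleneck. */
typedef struct {
    void (*run)(void *arg);
    void  *arg;
} job;

typedef struct {
    job          *jobs;     /* filled in before workers start */
    size_t        count;
    atomic_size_t next;     /* initialised to 0 */
} job_queue;

/* Each worker (SPE, hardware thread, small core...) runs this until drained. */
void worker_loop(job_queue *q)
{
    for (;;) {
        size_t i = atomic_fetch_add(&q->next, 1);   /* claim the next job index */
        if (i >= q->count)
            break;
        q->jobs[i].run(q->jobs[i].arg);
    }
}
```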
 
Yes. The concept was small cores, lots of 'em, and heterogeneous. The reason useful stuff was stripped out for Cell 1 was to make them small enough to get more on a die. At smaller lithographies, more can be invested per core to make it better based on how people actually need to use it, while still providing a huge CPU core count of 60+ on a die.

The track record for success producing things that are halfway between a traditional multi-core CPU architecture and the GPU paradigm is spotty at best. Xeon Phi has its niche, but it's not exactly setting the world on fire.
 
If Intel could, then so would everyone else and the 10-15 GHz processors would have been as good or better than the dual and quad cores that we got instead. Multi-core is an inferior method for scaling performance, but one that remained physically possible.

Intel can't because 80x86 is heavily laden with legacy requirements. Nobody researching quantum computer applications is working with hardware anywhere near as slow [in terms of clock frequencies] as anything Intel sell commercially. :nope: Cooling remains a challenge, but not impossibly so. :nope:
 
Intel can't because 80x86 is heavily laden with legacy requirements. Nobody researching quantum computer applications is working with hardware anywhere near as slow [in terms of clock frequencies] as anything Intel sell commercially. :nope: Cooling remains a challenge, but not impossibly so. :nope:
Isn't that Larrabee? (Failed, scrapped or not.)
Lots of x86 cores but stripped back, dropping legacy baggage where possible, as it's not a general processor having to run code from decades ago.

Interestingly, that was shown ray tracing back when it was actually something being talked about.
 
80x86 is not 80 times x86 cores? Ohh, it's the 8086 et al. chip family...

I'll bow out now as it's already over my head, but all this Cell talk got me thinking about what Intel might bring to their GPU, given it's a clean-slate design in the DX12 compute era.
This must have been a possibility, and something they experimented with before.
 
Ah, the talk of high clock speeds reminds me of the good old days (1992+) of Digital Equipment Corporation's (DEC) Alpha CPU using RISC (https://www.extremetech.com/computing/73096-64bit-cpus-alpha-sparc-mips-and-power and https://en.wikipedia.org/wiki/DEC_Alpha ). The king of high clock speed that, outside of special cases, failed to deliver performance comparable to much slower Intel-architected CISC CPUs. I.e., it was really good at very specific things but rubbish as a general-purpose CPU.

I wanted one quite badly back then but couldn't justify the cost even after Microsoft incorporated support for it into a version of Windows NT. :(

Regards,
SB
 
The track record for success producing things that are halfway between a traditional multi-core CPU architecture and the GPU paradigm is spotty at best. Xeon Phi has its niche, but it's not exactly setting the world on fire.
That's true, but that doesn't prove it'll never be the best move in future.
 