With all this talk of multi-core...

Big T_CpE

Newcomer
I'm just about to start working on my doctorate in computer engineering, focusing on reconfigurable computing, this coming fall...

Before this recent trend of multi-core processing that seems to be prevalent in the computer segment with AMD and Intel, and in the videogame segment with Microsoft and Sony, I often wondered about the feasibility of a multi-core environment...

There's a design I've been toying around with that I'm wondering would be feasible, so I have to ask this question of all those with more expertise in the field than I have:

I had always thought the way to "better" computing was through massively parallelized systems. In fact, one notion I'm still toying around with is the incorporation of many simple cores into one chip: one master and many, many slaves. For example, take a simple 8- or 16-bit core, use an advanced memory-paging technique to let it access memory beyond its addressable "theoretical limits", and place many of them onto one chip. With each core being so basic in nature, it could be possible to put hundreds onto one die and clock them very fast. While I know that current chips far exceed the capabilities of past designs, why couldn't multiple instances of simpler chips exceed the newer ones?

I know, I know, many of you may be laughing at what I have to say, but couldn't it be possible to take these hundreds of 8- or 16-bit cores and distribute the workload amongst them?

In my mind it's akin to the principle of classes in object-oriented design, where you take a very complicated problem and split it up into many simpler, more manageable parts.
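To make the master/slave idea concrete, here's a minimal C sketch with pthreads standing in for hardware cores; the worker count, names, and the array-sum workload are all invented for illustration:

/* Hypothetical sketch: a "master" splits a big array sum across many
   simple "slave" cores, modelled here as pthreads. All names and the
   worker count are illustrative, not any real chip's API. */
#include <pthread.h>
#include <stdio.h>

#define NUM_WORKERS 8          /* imagine hundreds on a real die */
#define N (1 << 20)

static int data[N];
static long long partial[NUM_WORKERS];

static void *worker(void *arg)
{
    int id = (int)(long)arg;
    int chunk = N / NUM_WORKERS;
    long long sum = 0;
    for (int i = id * chunk; i < (id + 1) * chunk; i++)
        sum += data[i];
    partial[id] = sum;          /* each slave reports back to the master */
    return NULL;
}

int main(void)
{
    pthread_t t[NUM_WORKERS];
    for (int i = 0; i < N; i++) data[i] = 1;

    /* master: hand each slave its slice, then gather the results */
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_create(&t[i], NULL, worker, (void *)(long)i);

    long long total = 0;
    for (int i = 0; i < NUM_WORKERS; i++) {
        pthread_join(t[i], NULL);
        total += partial[i];
    }
    printf("total = %lld\n", total);   /* expect 1048576 */
    return 0;
}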

Just a thought....
 
Surely you know you can't do a lot of useful work with only 8/16 bits. How are you going to handle 64- or 128-bit floating-point calcs? I'm sure that, though you might fit maybe 100 Z80s in the space of one SPE, the SPE would still demolish them in performance.

For a while the fastest supercomputer in the world was made of 12,000 1-bit processors running in parallel (I read this back in my college days), but they were custom processors. I don't think you could assemble any number of old processors of a given transistor count and match a modern processor of the same transistor count; modern designs are just so much smarter than the old tech. It'd be terrifying to think that over all these years the performance-per-transistor-per-clock ratio had never improved and a 3 GHz Z80 array could match a Cell :oops:

Also, why ask this in a console forum of a 3D graphics site, and not in an academic newsgroup?
 
Damn, and here I thought I had some kind of new idea :p

I thought about it a bit more, and I've realized that this could be coupled even better with graphics processing work. For some reason my mind is on the Dreamcast, the PowerVR and its efficient tile-based rendering scheme...

What about a GPU with hundreds of small cores running in parallel, where each is only concerned with rendering a small tile of pixels? Where each of them is unaware of the "entire" image, and only concentrates on what it perceives to be the entire "screen"...

Development in this area could lead to literally thousands of tiny cores...
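As a rough sketch of how that tile idea might look in C (the tile size, resolution, and shading rule here are all invented; each loop iteration stands in for one core that only ever sees its own local buffer):

/* Hedged sketch: each "core" renders a 32x32 tile into its own local
   buffer as if that tile were the whole screen, then the result is
   copied out into the shared framebuffer. */
#include <stdint.h>
#include <string.h>

#define SCREEN_W 640
#define SCREEN_H 480
#define TILE     32

static uint32_t framebuffer[SCREEN_W * SCREEN_H];

/* A core's view of the world: local coordinates only. */
static void render_tile(uint32_t *local, int tile_x, int tile_y)
{
    for (int y = 0; y < TILE; y++)
        for (int x = 0; x < TILE; x++)
            /* shade derived from global position, but the core only
               ever writes its 32x32 local buffer */
            local[y * TILE + x] = (uint32_t)((tile_x * TILE + x) ^
                                             (tile_y * TILE + y));
}

int main(void)
{
    uint32_t local[TILE * TILE];
    for (int ty = 0; ty < SCREEN_H / TILE; ty++)
        for (int tx = 0; tx < SCREEN_W / TILE; tx++) {
            render_tile(local, tx, ty);             /* one core's job */
            for (int y = 0; y < TILE; y++)          /* copy out tile  */
                memcpy(&framebuffer[(ty * TILE + y) * SCREEN_W + tx * TILE],
                       &local[y * TILE], TILE * sizeof(uint32_t));
        }
    return 0;
}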

And yeah, I have to agree, writing a compiler for this would be the downside (but when have real programmers ever had it EASY :p), and debugging... well, there'd have to be a paradigm shift in development to allow for that, but really now, think of the possibilities...

EDIT: I ask it here because I thought this is as good a place as any to talk about "computing" as such... and I also understand that 8/16-bit operations cannot directly handle 64-bit floating-point operations, yet I've always wondered whether it would be feasible to "split" the operation up into smaller chunks, allowing my idea to work... like I said, just a thought...
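For the integer case, at least, that chunking idea is standard multi-precision arithmetic. A minimal sketch in C, doing a 64-bit add as four 16-bit adds with carry, the way a 16-bit core would have to (true 64-bit floating point would additionally need exponent alignment, normalisation, and rounding):

/* Sketch of "split it into smaller chunks": a 64-bit add carried out
   as four 16-bit limb adds with carry propagation. */
#include <stdint.h>
#include <stdio.h>

static uint64_t add64_via_16(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    unsigned carry = 0;
    for (int limb = 0; limb < 4; limb++) {
        uint16_t la = (uint16_t)(a >> (16 * limb));
        uint16_t lb = (uint16_t)(b >> (16 * limb));
        uint32_t s  = (uint32_t)la + lb + carry;  /* 16-bit ALU + carry */
        carry = s >> 16;
        result |= (uint64_t)(uint16_t)s << (16 * limb);
    }
    return result;
}

int main(void)
{
    uint64_t a = 0x123456789ABCDEF0ULL, b = 0x0FEDCBA987654321ULL;
    printf("%d\n", add64_via_16(a, b) == a + b);  /* prints 1 */
    return 0;
}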
 
Big T_CpE said:
Damn, and here I thought I had some kind of new idea :p

We've all been there!

Big T_CpE said:
I thought about it a bit more, and I've realized that this could be coupled even better with graphics processing work. For some reason my mind is on the Dreamcast, the PowerVR and its efficient tile-based rendering scheme...

What about a GPU with hundreds of small cores running in parallel, where each is only concerned with rendering a small tile of pixels? Where each of them is unaware of the "entire" image, and only concentrates on what it perceives to be the entire "screen"...

Development in this area could lead to literally thousands of tiny cores...

And yeah, I have to agree, writing a compiler for this would be the downside (but when have real programmers ever had it EASY :p), and debugging... well, there'd have to be a paradigm shift in development to allow for that, but really now, think of the possibilities...

EDIT: I ask it here because I thought this is as good a place as any to talk about "computing" as such... and I also understand that 8/16-bit operations cannot directly handle 64-bit floating-point operations, yet I've always wondered whether it would be feasible to "split" the operation up into smaller chunks, allowing my idea to work... like I said, just a thought...

You know you're pretty much describing current GPUs, which are already highly multi-core 16/32-bit SIMD/MIMD processors! ;)
 
There's a balance between the size of a core and its functionality. I think what we see now is the result of years of investigation by the GPU makers into what ratio of size (number of cores) to functionality works best. Both ATi and nVidia have arrived at similar systems; neither has more, simpler cores, so I guess this can safely be accepted as the optimum.

The CPU arena is still rather new in this area. Cell is the first crack at it. It'll be interesting to see whether AMD's and Intel's efforts add smaller, larger, or similarly sized 'SPU's. We might see something more akin to pure FP coprocessors added to a central core, each smaller than an SPE so there's more of them, but more limited in what they can do.

And as regards new ideas, it's amazing how old modern ideas are! So much 'new stuff' turns out to have first been investigated in the 50s. I think the point to recognise is that people were just as smart in days of yore as they are now, and no-one overlooked a theory. The only way to hit on a new idea is to find something that was technologically impossible before (such as OLEDs), for the theories themselves have all been considered.
 
Yeah, the area I intend to study is reconfigurable computing, like I said above, so it would make sense to have a centralized core doing all the housekeeping duties, and hundreds or even thousands of smaller cores that are "reconfigurable" for the task at hand...

Also, there was a mention of the ratio of size versus functionality in terms of the number of cores... what is that ratio, and how is it computed?

Hmmm... yeah, ideas from the 50s... reminds me of watching the old Star Trek episodes where all these grand ideas were introduced, and yet it took decades to actually bring them to fruition (example: the "communicators", a precursor to modern-day cell phones :) )
 
Big T_CpE said:
Yeah, the area I intend to study is reconfigurable computing, like I said above, so it would make sense to have a centralized core doing all the housekeeping duties, and hundreds or even thousands of smaller cores that are "reconfigurable" for the task at hand...

You should read that Sony patent; it describes working on data varying between 1 and 32 bits using thousands of single-bit execution units that seem to be reconfigurable. It doesn't seem to describe a master core, though it probably needs one anyway...
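A rough software model of what such single-bit units might do, assuming a plain bit-serial adder that is "reconfigured" simply by choosing how many clocks to run it; this is an illustration of the general technique, not the patent's actual mechanism:

/* One 1-bit full adder, clocked 'width' times to add two values of
   whatever width has been configured (1-32 bits). */
#include <stdint.h>
#include <stdio.h>

static uint32_t bit_serial_add(uint32_t a, uint32_t b, int width)
{
    uint32_t sum = 0;
    unsigned carry = 0;
    for (int i = 0; i < width; i++) {         /* one "clock" per bit */
        unsigned ai = (a >> i) & 1, bi = (b >> i) & 1;
        sum |= (uint32_t)(ai ^ bi ^ carry) << i;
        carry = (ai & bi) | (carry & (ai ^ bi));
    }
    return sum & (width == 32 ? 0xFFFFFFFFu : ((1u << width) - 1));
}

int main(void)
{
    printf("%u\n", (unsigned)bit_serial_add(200, 100, 9));  /* 300 in 9-bit mode  */
    printf("%u\n", (unsigned)bit_serial_add(200, 100, 8));  /* 44: wraps at 8 bits */
    return 0;
}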

Big T_CpE said:
Also, there was a mention of the ratio of size versus functionality in terms of the number of cores... what is that ratio, and how is it computed?

I think Shifty was describing the ratio of vertex to fragment shader units on current GPUs. However, they're unified on Xenos, where you have 48 ALUs, each comprising a vector4 and a scalar unit... though no master core. The master/slave setup you're describing sounds more like CELL...
 
I'm reading through the Sony patent (wow, the sun's coming up... ahh, the life of an insomniac :p), and you're right, there's no mention of a master core to handle data contention between the slaves, etc. I must be missing something; how could this be handled without a master?

Regarding the master/slave relationship: yes, this is a lot like CELL, but what I was thinking wasn't exactly along those terms. CELL is supposed to be a general-purpose in-order CPU surrounded by 7 (or 8?) smaller cores that are very specialized in their processing abilities. From what I understand, they are very limited in their reconfigurability and in their efficiency at handling different types of processing...

The idea I was proposing was more along the lines of identical general cores throughout the die, with one (or more) dedicated to housekeeping duties for the rest...
 
Big T_CpE said:
I'm reading through the Sony patent (wow, the sun's coming up... ahh, the life of an insomniac :p), and you're right, there's no mention of a master core to handle data contention between the slaves, etc. I must be missing something; how could this be handled without a master?

Not sure, I have a feeling it needs one but the patent isn't complete...

Big T_CpE said:
The idea I was proposing was more along the lines of identical general cores throughout the die, with one (or more) dedicated to housekeeping duties for the rest...

I'm not sure what your target applications are here, or your target manufacturing process. Why not have full SMP and use a spare core for the housekeeping? Asymmetric/heterogeneous multi-cores with OOOe? SIMD? MIMD? Not sure what your ultimate market is! :p
 
Jaws said:
I'm not sure what your target applications are here, or your target manufacturing process. Why not have full SMP and use a spare core for the housekeeping? Asymmetric/heterogeneous multi-cores with OOOe? SIMD? MIMD? Not sure what your ultimate market is!

So true, so true! If we're only talking about graphics work, with each core working on only a subset (tile) of the overall image, then obviously we'd want a SIMD-based system, where each core does the same graphics processing to different data sets across the screen... or, for fun, an MISD (multiple instruction, single data)-based system, where each core only handles a certain aspect of rendering...
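A toy C sketch of that MISD-ish arrangement, with each made-up stage function standing in for a core that applies one rendering aspect to the same datum (the SIMD case is the earlier tile sketch: one kernel, many tiles):

/* MISD-style pipeline: the same pixel flows through per-aspect stages,
   each stage imagined as its own core. Stage functions are invented. */
#include <stdio.h>

typedef struct { float r, g, b; } Pixel;

static Pixel stage_texture(Pixel p) { p.r *= 0.9f; return p; }  /* "core" 0 */
static Pixel stage_light(Pixel p)   { p.g *= 0.8f; return p; }  /* "core" 1 */
static Pixel stage_fog(Pixel p)     { p.b *= 0.7f; return p; }  /* "core" 2 */

int main(void)
{
    Pixel p = { 1.0f, 1.0f, 1.0f };
    /* one datum, a different instruction stream at each step */
    p = stage_texture(p);
    p = stage_light(p);
    p = stage_fog(p);
    printf("%.2f %.2f %.2f\n", p.r, p.g, p.b);   /* 0.90 0.80 0.70 */
    return 0;
}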

But that's really getting into the nitty-gritty of it... The target application? Hmm, good question... my idea was to develop a system in my doctorate work to prove the feasibility of these cheap, simple cores over the more complex (and sometimes over-specialized, in my opinion) variants we have today. I believe the majority of tasks we rely on a computer for today can be massively multithreaded to allow for massive parallelization, although that may not be easy to see just yet. So once again: the target application? Just about anything requiring a high degree of computation that can be parallelized...

EDIT: Yes, an SMP system is what I had in mind, but instead of the few complex cores we're seeing today, hundreds or thousands of small, identical, simple cores, with a few set aside to handle housekeeping for the rest.
 
Big T_CpE said:
...
EDIT: Yes, an SMP system is what I had in mind, but instead of the few complex cores we're seeing today, hundreds or thousands of small, identical, simple cores, with a few set aside to handle housekeeping for the rest.

With your goal of a general purpose massively parallel system, I think the main issue here is your transistor budget and target manufacturing process are not feasible! Otherwise Xenos or CELL would be doing it! ;)
 
RISC vs. CISC in the multi-core chip world. If you assume no overhead troubles, you could take a few different cores and work out how many transistors you'd need for such-and-such features, then work out their peak rates on certain workloads, then work out how much you'd get out of a given die space.

You could take existing cores (Z80, 286, 68000, ARM) and work out their peak performance on different calculations and see what the returns are. E.g. how many single-precision float adds and multiplies could one Z80 do per clock? Scale that up to die space. Repeat with other, larger, more complex cores.
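Something like this back-of-envelope calculation, where every per-core number is an invented placeholder and only the method (ops/clock x clock x cores-per-die) is the point:

/* Rough comparison framework. All figures below are made up for
   illustration; plug in measured values to do this for real. */
#include <stdio.h>

struct core {
    const char *name;
    double transistors;     /* per core */
    double flops_per_clock; /* single-precision, assumed */
    double clock_ghz;       /* assumed achievable on a modern process */
};

int main(void)
{
    const double die_budget = 150e6;  /* transistors, hypothetical die */
    struct core cores[] = {
        { "tiny-8bit", 10e3,  0.002, 3.0 },  /* floats done in software */
        { "486-like",  1.2e6, 0.5,   2.0 },
        { "modern",    50e6,  8.0,   2.4 },
    };
    for (int i = 0; i < 3; i++) {
        double n = die_budget / cores[i].transistors;
        double gflops = n * cores[i].flops_per_clock * cores[i].clock_ghz;
        printf("%-10s x %8.0f cores -> %8.1f GFLOPS peak\n",
               cores[i].name, n, gflops);
    }
    return 0;
}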

This of course doesn't consider processors designed for such operation. I'm sure if you go digging you'll find plenty of ideas on this matter. As I say, look at supercomputers. They've gone from multi-core to mega-core, to massive populations of tiny cores, and back to multi-core. There are real companies vying to make the biggest, baddest, cheapest computers already. They spend loads of research capital investigating chip designs and system designs. There are reasons why they settle on any particular system, and those reasons are probably available to the public if you go looking.
 
Big T_CpE said:
What about a GPU with hundreds of small cores running in parallel, where each is only concerned with rendering a small tile of pixels?
This was actually developed by some company that later went bankrupt and got bought out before they could finish the product. They did have prototypes though, and the first generation was, I think, 1024 simple VLIW processors, each with a few kilobytes of SRAM (around 16 KB, I believe), arranged in an array.

The GPU rendered the scene entirely in software, using program applets distributed to each core in sequences to do various special effects, including antialiasing, depth of focus, etc. In theory you could program the cores to do any effect.

This was a couple of years ago now and I can't remember the name of the company... "Cyber Something", I suppose. Or "Something Cyber". :D They also had some sort of algorithm to do hidden surface removal on-chip, by the way, so they didn't have to struggle with overdraw. It looked really cool on paper, but I dunno if it could have competed performance-wise with the more traditional GPUs of yesterday (and today) that have tons of hardwired logic in them and stricter limits on programmability... Their plan was to keep increasing the number of cores to scale performance along with CMOS process improvements, bringing the count to 2k, 4k, etc. I guess all the headaches we have today with shrinking transistors further and further might have been a big stumbling block for these guys if they'd survived.
 
1000 cores and 16 megs of SRAM?! If it were the same size as a GPU of yesteryear, say 150 million transistors, that's only 150,000 transistors a core... and at 6 transistors per SRAM bit, each core's 16 KB of SRAM alone would be several times that. Not a lot left for logic at all. I guess they had no branch prediction :p
 
Sorry for a dumb Q, but at what point does a "processor" become a "core"?

The eDRAM in Xenos has 192 logic processors; what's the diff?
 
Jaws said:
With your goal of a general purpose massively parallel system, I think the main issue here is your transistor budget and target manufacturing process are not feasible! Otherwise Xenos or CELL would be doing it! ;)

Maybe I'm not following you on this, but why exactly do you project that the transistor budget and manufacturing process aren't feasible?

Also, the argument that Xenos or CELL would be doing this if it were possible is a bit skewed, in my opinion. Xenos as a design takes a complex core (even if somewhat stripped down from the original POWER designs) and implants three of them on a die. What I had in mind was more simplistic cores, and many more than 3.

CELL is also not what I had in mind, because the SPEs are too specialized, like DSPs; my idea was for reconfigurable cores.

Guden Oden said:
This was actually developed by some company that later went bankrupt and got bought out before they could finish the product. They did have prototypes though, and the first generation was, I think, 1024 simple VLIW processors, each with a few kilobytes of SRAM (around 16 KB, I believe), arranged in an array.

The "Cyber Something" complany seems like they were really onto something here (my idea of course! :)). It's always great to find out your bright idea is already well into the development stages by someone else :p This company went bankrupt, so which wise GPU designer bought them out? Now I'm curious.
 
Over the next 10 years, we are going to go from multi-core (a few cores), which is now becoming the current technology (PCs, X360, PS3), to MANY-core CPU computing (PCs, PS4). Look at Intel's Platform 2015 roadmap, and also watch out for future developments of CELL and others (like Sun Microsystems' Niagara).

BTW, I do agree with Big T_CpE and what he said about a GPU with lots of little cores, and also about PowerVR's tile-based rendering. It would be something if a processor could be designed that is massively parallel and takes over both traditional CPU and traditional GPU tasks.


I am expecting the announcement (within the next 5 years) of some kind of very ambitious PROCESSOR (taking over the role of both CPU and GPU) that has dozens if not hundreds of small unified cores that can be re-tasked in real time to work on CPU-type things or graphics-related things, any way the programmer wants. In the present, in 2005, we see that ATI is moving to a unified shader architecture where functional units can do either vertex processing or pixel processing. Well, the next step, IMO, is a unified *processor* architecture that can do all the things that a CPU and a GPU would do. Of course, on the back end of the rendering pipeline, and in other places, you'd still have some specialised cores / functional units, but only where it makes sense. Much of the rest of the transistor budget goes into unified computing cores and caches/eDRAM (hopefully SRAM or a new type of ultra-low-latency memory). What I am describing sounds a lot like Intel's Platform 2015, and is the next step beyond the current CELL architecture.
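A loose sketch of that unified-core idea: a pool of identical cores pulling tagged tasks from one queue, so the CPU/GPU split becomes a scheduling decision. Every type and number here is hypothetical, not any announced design:

/* Hypothetical unified processor: any core runs any task kind. */
#include <stdio.h>

typedef enum { TASK_GENERAL, TASK_GRAPHICS } TaskKind;
typedef struct { TaskKind kind; int payload; } Task;

static void run_on_core(int core_id, Task t)
{
    if (t.kind == TASK_GRAPHICS)
        printf("core %d: shading tile %d\n", core_id, t.payload);
    else
        printf("core %d: physics step %d\n", core_id, t.payload);
}

int main(void)
{
    Task queue[] = { { TASK_GRAPHICS, 0 }, { TASK_GENERAL, 0 },
                     { TASK_GRAPHICS, 1 }, { TASK_GENERAL, 1 } };
    /* any core can take any task; here two cores round-robin the queue */
    for (int i = 0; i < 4; i++)
        run_on_core(i % 2, queue[i]);
    return 0;
}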

Then on the system / console level, you might have between 3 and 6 of these new PROCESSORS (for lack of a better word), depending on the performance needed: 3 for a console or computer, 6 for an expensive workstation, for instance.

I expect to hear about such a processor within the next 3 to 5 years, and I expect products based around such a processor to appear on the market within the next 5 to 8 years.
 
Big T_CpE said:
Jaws said:
With your goal of a general purpose massively parallel system, I think the main issue here is your transistor budget and target manufacturing process are not feasible! Otherwise Xenos or CELL would be doing it! ;)

Maybe I'm not following you on this, but why exactly do you project that the transistor budget and manufacturing process aren't feasible?

Also, the argument that Xenos or CELL would be doing this if it were possible is a bit skewed, in my opinion. Xenos as a design takes a complex core (even if somewhat stripped down from the original POWER designs) and implants three of them on a die. What I had in mind was more simplistic cores, and many more than 3.

CELL is also not what I had in mind, because the SPEs are too specialized, like DSPs; my idea was for reconfigurable cores.

Well, Xenos is not the X360 CPU; it's the GPU! ;)

It has 48 vec4+scalar execution units. It has equivalent programmable 32-bit GFLOPS compared to CELL. They have similar die sizes and transistor counts. Both were aiming for general-purpose yet highly parallel execution units. Both spent hundreds of millions on R&D.

My point is that they seem to have had similar goals to yours, and they ended up with these ICs. Of course there may be other viable solutions, but they may not be as general-purpose as you seem to require... and being general-purpose does have a transistor-count cost...
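For the record, the usual back-of-envelope arithmetic behind that GFLOPS comparison (counting a multiply-add as 2 flops and using the commonly quoted clocks; treat these as rough public figures, not official specs):

Xenos: 48 ALUs x (4 vector + 1 scalar) lanes x 2 flops x 0.5 GHz = 240 GFLOPS
CELL SPEs: 8 SPEs x 4 SP lanes x 2 flops x 3.2 GHz = 204.8 GFLOPS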

EDIT:

On the subject of general purpose, would this IC need to be general purpose enough to run an OS?
 
There was a post from an AMD architect almost a year ago on comp.arch that proposed a theoretical CMP chip made up of 486-like cores (not counting cache) in a die about the same size as an Opteron core. This was back around the time Sun's Niagara was brand new, so that was the hot topic.
Mitch Alsup said:
Let us postulate a fair comparison, and since I happen to be composing this, let's use data I am familiar with. Disclaimer: all data herein is illustrative.

The core size of an Athlon or Opteron is about 12 times the size of the data cache (or instruction cache) of Athlon or Opteron. I happen to know that one can build a 486-like processor* in less area than the data cache of Athlon, and that this 486-like core could run at between 75% and 85% of the frequency of Opteron.

[*] 7 stage pipeline, 1-wide, in-order, x86 to SSE3 instruction set.

Let us pretend Opteron is a 1.0 IPC machine, and that the 486-like processor is a 0.5 IPC machine. (At this point you see that we have spent the last 15 years in microprocessor development getting that last factor of 2, and it has cost us around 12X in silicon real estate...)

              CPUs   IPC/CPU   Frequency   IPC*Freq   IPC*Freq*CPUs
Opteron          1       1.0     2.4 GHz        2.4             2.4
486-like        12       0.5     2.0 GHz        1.0            12.0

If you really want to get into the game of large-thread-count MPs, smaller, slower, less complicated in-order blocking cores deliver more performance per area and more performance per watt than any of the current SMT/CMP hype.

Let's look at why:

Uniprocessor            Best Case   Typical Case   Worst Case
DRAM access time*:          42 ns          58 ns      120+ ns
CPU cycles @ 2.0 GHz           84            116          240

Multiprocessor
DRAM access time*:        103 ns**         103 ns       500 ns
CPU cycles @ 2.0 GHz          206            206         1000

[*] as seen in the execution pipeline
[**] best case is coherence bound not memory access time bound.

One needs a very large L2 cache to usefully ameliorate these kinds of main memory latencies; miss rates on the order of a fraction of 1% are needed. L2 cache miss rates on commercial workloads (64 GBytes of main memory, 1 TByte commercial database, thousands of disks in multiple RAID channels, current database software):

L2 size   Miss Rate   L2-miss CPI cost
1 MB        5%+            10.3
2 MB        4%-ish          8.2
4 MB        3%-ish          6.2
8 MB        2%-ish          4.1

So the fancy OoO core goes limping along at 0.2 IPC while the itty-bitty 486-like core goes limping along at 0.17 IPC. And you get 12 of them! So the measly 5X advantage above becomes a 10X advantage in the face of bad cache behavior.

Now if I were to postulate sharing the FP/MMX/SSE units between two
486-like cores, I can get 15 of them in the same footprint as the
Opteron core.

I can also postulate what the modern instruction set additions have done to processor area: leave out MMX/SSE and the 486-like size drops to 1/18 of an Opteron core.

The problem at this instant in time is that very few benchmarks have
enough thread level parallelism to enable a company such as Intel or
AMD to embark on such a (radical) path.

Mitch
#include <std.disclaimer>
I'm not so sure about the empirical figure of 0.5 IPC for a 486, but other than that, everything seems to make perfect sense.
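Mitch's numbers do check out if you redo the arithmetic: the CPI-cost column of his cache table is just miss rate times the ~206-cycle multiprocessor memory latency he quotes. A few lines of C to reproduce it (only restating figures from his post):

/* Reproducing the arithmetic in the quoted post. */
#include <stdio.h>

int main(void)
{
    /* aggregate throughput: cores x IPC x GHz */
    printf("Opteron : %4.1f\n", 1 * 1.0 * 2.4);   /*  2.4 */
    printf("486-like: %4.1f\n", 12 * 0.5 * 2.0);  /* 12.0 */

    /* L2-miss CPI cost at a 206-cycle memory latency */
    double miss_rate[] = { 0.05, 0.04, 0.03, 0.02 };  /* 1, 2, 4, 8 MB L2 */
    for (int i = 0; i < 4; i++)
        printf("miss %.0f%% -> %.1f CPI of stalls\n",
               miss_rate[i] * 100, miss_rate[i] * 206);
    return 0;
}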
 