NEC claims auto-multithreading compiler placed on FPGA

Carl B

This is one of those things where you almost have to wonder whether it belongs under console technology or some other sub-section of the forum, but still... with all the programming model discussions we have around here - of late centered on the problems of parallelizing code - I figured console technology would be the place.

Anyway, essentially what NEC claims is not only that they have developed a means of automatically compiling and threading software, but that their method is orders of magnitude more efficient than doing the work by hand, and that it can be executed in hardware via placement on an FPGA. The architectures on which they have tested it haven't (to my knowledge) been revealed yet, but it's still compelling.

It seems almost *too* good to be plausible for real-world usage, but NEC isn't a company whose statements I would dismiss outright. Unfortunately, as with most developments in Japan, I doubt we'll have comprehensive insight into the situation until the translation crew gets on the scene, aka One.

Anyway, from Xbitlabs in the meantime....

...Parallelization with conventional multi-processor technology requires the manual modification of application source programs. Manual labor increases the development and verification cost for software development, which is in turn made more complex by the growing size and complexity of the software itself. Therefore, multi-processor technology, which can automatically parallelize application programs without manual modification, has been long sought after in this field.

The new technology of NEC is a compiler that can be implemented on a field-programmable gate array (FPGA) and that handles parallelization better than software compilers designed for the same type of work. NEC claims that an application that was manually tailored for execution on a 4-processor machine over 4 months runs only 95% faster compared to the same application without optimizations on a computer with 1 processor, whereas the same application parallelized automatically in less than 3 minutes gives a 183% performance gain over the single-processor machine.

The distinctive feature of the new technology is the ability of the automatic parallelizing compiler that utilizes profile (execution history) information to aggressively exploit parallelization patterns, which are effective for accelerating the speed of application programs. In addition, although the parallelization is speculative, the speculation is almost always completely accurate, according to NEC. The speculation hardware works as a safety net by handling any rare misses, guaranteeing the correctness of the execution. This ensures that the compiler is not conservative in decisions concerned with these cases, resulting in an increase in the amount of parallelism exploited. The parallelism exploitation is supported by the speculative execution hardware that realizes efficient handling of detection of incorrect execution orders caused by the parallel execution of the program parts, cancellation of the incorrectly executed part, and re-execution of it...

Article
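For anyone trying to picture the mechanism, here's my rough mental model of the detect / cancel / re-execute loop the article describes, as a software-only toy in C++. All the names and numbers are made up by me; NEC does the dependence checking in hardware, not in code like this.

```cpp
// Toy thread-level-speculation sketch in plain C++ (my own made-up example,
// not NEC's code). Each chunk of the loop depends on a "carry" produced by
// the previous chunk. The "profile" says the carry is almost always 0, so we
// predict 0, run all chunks in parallel, then validate in commit order and
// re-execute any chunk whose prediction turned out wrong: the same
// detect / cancel / re-execute idea, minus the hardware safety net.
#include <cstdio>
#include <functional>
#include <future>
#include <vector>

struct ChunkOut { long sum; int carryOut; };

// The real sequential semantics for one chunk.
ChunkOut runChunk(const std::vector<int>& v, size_t lo, size_t hi, int carryIn) {
    long sum = 0;
    int carry = carryIn;
    for (size_t i = lo; i < hi; ++i) {
        sum += v[i] + carry;
        carry = (v[i] > 250) ? 1 : 0;   // rare case: the profile says this almost never fires
    }
    return {sum, carry};
}

int main() {
    std::vector<int> v(1000000, 7);     // small values only, so speculation always wins here
    const size_t kChunks = 4, n = v.size(), step = n / kChunks;

    // 1. Speculative parallel phase: assume carryIn == 0 for every chunk.
    std::vector<std::future<ChunkOut>> spec;
    for (size_t c = 0; c < kChunks; ++c) {
        size_t hi = (c + 1 == kChunks) ? n : (c + 1) * step;
        spec.push_back(std::async(std::launch::async, runChunk,
                                  std::cref(v), c * step, hi, 0));
    }

    // 2. Commit phase in program order: validate each speculation and
    //    re-execute the chunk sequentially on a mispredict.
    long total = 0;
    int carry = 0;
    for (size_t c = 0; c < kChunks; ++c) {
        ChunkOut out = spec[c].get();
        if (carry != 0) {               // misspeculation: cancel and redo with the real carry
            size_t hi = (c + 1 == kChunks) ? n : (c + 1) * step;
            out = runChunk(v, c * step, hi, carry);
        }
        total += out.sum;
        carry = out.carryOut;
    }
    std::printf("total = %ld\n", total);
}
```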
 
Just to confirm, they're not programming an FPGA, but using an FPGA as a compiler to create parallelized source code for conventional multiprocessor systems, right? It would be akin to having an FPGA in a PS3 or XB360 devkit to create parallelized code from the (presumably single-threaded) source.
 
Shifty Geezer said:
Just to confirm, they're not programming an FPGA, but using an FPGA as a compiler to create parallelized source code for conventional multiprocessor systems, right? It would be akin to having an FPGA in a PS3 or XB360 devkit to create parallelized code from the (presumably single-threaded) source.

Yeah, that's the impression I'm getting. It seems similar to Intel's Mitosis project in a sense - that's what comes to mind when I try to think of an analogue. Definitely interested in finding out more about what NEC's up to here.
 
If this is true...then all MS and Sony need do is add an FPGA to their devkits and license NEC's technique(s) for highly optimized parallel code? Sounds way too good to be true...so quickly into the parallel paradigm shift.

If MS and Sony developers could continue to produce serial code and this solution cleaned up after them so efficiently, it really would be the ultimate...what's the word...panacea.

It would be nice to know what chips this solution has been tested on. I also wonder if this solution can handle heterogeneous cores such as what Cell would present to it. It wouldn't seem impossible, but it still seems a good question to me.

Maybe a better question to ask is IF this is for real...how much is NEC gonna charge for it...such a solution would seem valuable to just about...everyone going parallel in the future. Then again...if using HW like an FPGA and execution info to aggressively compile is what you need to auto-parallelize...what's to stop the other major players from doing the same thing on their own...at least beyond the near term.
 
I think the thing to keep competitors at bay in terms of copy-catting is the method of implementation, going beyond the FPGA itself. I mean, this can't be easy to duplicate. And I would have to think that until now NEC's testing has been primarily on their own line of chips; it just seems that would make sense. We definitely need more info, that's for sure.

But indeed, as I mentioned above, NEC isn't the only one working on a compiler implementation in hardware; Intel is as well for their own future chips with Mitosis, essentially an on-die hardware compiler.

It makes an interesting read if nothing else: Mitosis

But NEC's possible FPGA format made me sit up in my chair today, even knowing about Mitosis.
 
Like to hear some of the resident devs comment on this. Sounds sort of like something Deano was talking about in his blog a month ago or so.
 
Synergy34 said:
Like to hear some of the resident devs comment on this. Sounds sort of like something Deano was talking about in his blog a month ago or so.

I remember the discussion but not the blog entry itself - wasn't it along the lines of skepticism that such a thing could ever be achieved? And that stance might yet be borne out. Obviously, though, the man himself can weigh in at his own choosing; here I am too lazy to even look up the blog entry. :)

On the topic in general though, I don't think we can expect devs to provide insight on this just because they're devs - or at least not with authority per se - everyone is more or less equally in the dark here. It's just going to have to be a sort of slow puzzle-piece-gathering exercise to see exactly what this is all about, and how 'real' it is.

I'm *hoping* some of the Japanese tech sites write something up on this and then One or someone else can give us some more to work with, because otherwise we're just operating on the purely theoretical.
 
Unless someone beats me to it, I'll look it up, but I don't have his blog addy here at work. When I go home for lunch, I'll do it then.

From what I recall, I think he said something like the biggest improvement in developing for multi-threaded CPUs would come from someone creating a better compiler. Just off the top of my head.
 
Here's the English press release.
http://www.nec.co.jp/press/en/0512/1901.html

That FPGA is merely a prototype of NEC's new multicore microprocessor architecture called 'Pinot'.
http://portal.acm.org/citation.cfm?id=1099547.1100541

You need both the binary translator software and the newly developed Pinot architecture multicore processor, though Pinot is just the addition of 4 circuits and 3 instructions to an existing processor architecture. In short, this auto parallelization technology is not applicable to existing processors unless you add the hardware extension to them.
 
NEC believes that its automatic parallelization technology is the first to be brought to a stage of practical use. This is supported by the fact that NEC has succeeded in operating this technology on a field-programmable gate array (FPGA). Moreover, its implementation has confirmed that only a marginal hardware extension is required and that application program speed is actually accelerated.
So in summary, it's compiler tech combined with hardware features, and the hardware needs to be present in the CPUs. So Cell and XeCPU won't benefit. Well, Cell could if it were reworked to include the Pinot featureset, but that's unlikely for PS3. At the end of the day it sounds very interesting, but it won't have any relevance to this gen of consoles (next-gen isn't next-gen now it's released, right? It's current-gen and past-gen now, right? :???: )
 
So something similar to Mitosis then for certain NEC (or other) processors featuring the specified hardware and instruction set.

Ah well, thanks though One! :)
 
Fundamentally it's bloody hard to program for next-gen consoles; on PS3 you have lots of tasks because of the hardware design, with all the headaches of concurrent programming (synchronisation, separate memory pools, etc.). X360 is a little easier, partly because it's got fewer hardware threads and they are symmetrical, but also because it's got MS Visual Studio compilers for it; OpenMP makes it easier to create worker tasks to distribute across hardware threads. Now for this generation, Cell isn't too complex, so doing all the threading in a manner similar to Herb's Windows API example (near the end of that presentation) is okay (obviously PS3 doesn't use the Windows API, it uses CellOS API calls that do a similar thing), but what happens when we get hardware threads in double figures?
If a future console were to have 30+ hardware threads, the winner of that generation won't be whoever has the most flops but whoever can write a compiler that helps you out. I don't believe it will ever be 'automatic' (at least for the foreseeable future), but being able to add 'active' to a class etc. would be a godsend.

Here's what Deano wrote.
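For anyone who hasn't touched it, the OpenMP bit he mentions is roughly this kind of thing - the runtime farms loop iterations out across however many hardware threads you have. A trivial made-up example, not anything from his blog:

```cpp
// Trivial OpenMP illustration (my own made-up example): the compiler and
// runtime distribute the loop iterations across the available hardware
// threads for you, which is the "worker tasks" convenience Deano refers to.
#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
    std::vector<float> positions(100000, 0.0f);
    std::vector<float> velocities(100000, 1.0f);
    const float dt = 1.0f / 60.0f;

    #pragma omp parallel for            // each hardware thread gets a slice of the loop
    for (int i = 0; i < (int)positions.size(); ++i)
        positions[i] += velocities[i] * dt;

    std::printf("updated %zu objects on up to %d threads\n",
                positions.size(), omp_get_max_threads());
}
```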
 
Maybe it's still possible for Nintendo; since they're already working with Nintendo, it would really help small devs and make theirs the easiest of the 3 to develop for, I suppose.
 
one said:
this auto parallelization technology
Reading the English translation, this seems like a much more accurate way to characterize what is going on than multithreading. It's doing some kind of sampling of the program execution history and using that as the basis for speculative execution that it spreads among the cores present. It seems like having lots of execution units and really good OOOE, except the execution units don't have to share resources the way they would on a single core. The big speedup number they quote was for one specific task and not a general improvement; you can come up with big gains for auto-vectorization too, but usually that's not so great. I wonder how it'll work in the real world on a system running lots of threads; maybe for a system that's doing mainly one thing (sequencing DNA or something) it could be good.
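Purely guessing at how the sampled history might feed the decision: count how often the dependence that would break parallel execution actually fires, and only speculate where it's rare enough that rollbacks stay cheap. My own hypothetical sketch, not NEC's heuristic:

```cpp
// Hypothetical sketch of a profile-driven "is this region worth speculating
// on?" check. None of this is from NEC; it's just the shape of the decision
// that sampled execution history could drive.
#include <cstdio>

struct RegionProfile {
    long iterations;        // how many times the candidate region's loop body ran
    long dependenceHits;    // how often a later iteration actually needed an earlier one's result
};

// Speculate only if the expected rollback work is a small fraction of the region.
bool worthSpeculating(const RegionProfile& p, double rollbackCost /* relative to one iteration */) {
    if (p.iterations == 0) return false;
    double missRate = static_cast<double>(p.dependenceHits) / static_cast<double>(p.iterations);
    return missRate * rollbackCost < 0.05;
}

int main() {
    RegionProfile hotLoop{1000000, 120};    // dependence fired 120 times in a million iterations
    std::printf("parallelize hot loop: %s\n",
                worthSpeculating(hotLoop, 50.0) ? "yes" : "no");
}
```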
 
I seriously doubt this means that manual multi-threading is unneeded though. I bet that if you ran a manually multi-threaded program through their compiler, the results would be much better than the compiler alone. Additionally, this doesn't address situations where you want to use multi-threading for responsiveness reasons, like when you want to do a long, heavy calculation but maintain interactivity and program responsiveness as well. Rather, it'll likely do for coding what OoO execution did: speed up general code, while finely tuned code still keeps a significant advantage.
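The responsiveness case is the kind of thing you still have to structure by hand; roughly like this (a minimal made-up sketch, not anyone's real code):

```cpp
// The responsiveness case: kick the long calculation off to a worker and
// keep servicing the UI / game loop while it runs. An auto-parallelizer
// won't restructure your program this way for you. Minimal made-up sketch.
#include <chrono>
#include <cstdio>
#include <future>

long long expensiveCalculation() {
    long long acc = 0;
    for (long long i = 0; i < 200000000LL; ++i) acc += i % 7;
    return acc;
}

int main() {
    auto result = std::async(std::launch::async, expensiveCalculation);

    // "UI loop": stays interactive while the heavy work happens on the worker.
    while (result.wait_for(std::chrono::milliseconds(16)) != std::future_status::ready) {
        std::printf("still responsive, rendering a frame...\n");
    }
    std::printf("answer = %lld\n", result.get());
}
```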
 
Sounds like a very interesting piece of tech, but I do wonder just how practical it is. 'Small modifications' to existing designs is something of a contradiction when dealing with 100-million-transistor chips.

I would also wonder how well it would work.

There are a lot of cases in a program where a piece of code relies on a computed value, but that value does not rely on previous state. That would be a simple case of pre-computing the value on another thread, but I wonder how fine-grained this could get. Go too far and you probably end up with something more comparable to out-of-order execution logic, I guess.
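The simple case I mean would look roughly like this (all the names are made up just to illustrate):

```cpp
// The simple case: a value (here a spatial hash) depends only on last
// frame's data, not on anything computed this frame, so it can be built on
// another thread while the main thread does unrelated work. All the names
// are made up for illustration.
#include <cstdio>
#include <functional>
#include <future>
#include <vector>

std::vector<int> buildSpatialHash(const std::vector<float>& positions) {
    std::vector<int> cells(positions.size());
    for (size_t i = 0; i < positions.size(); ++i)
        cells[i] = static_cast<int>(positions[i]) / 16;  // bucket into 16-unit cells
    return cells;
}

void doUnrelatedFrameWork() { /* animation, audio, anything that doesn't need the hash */ }

int main() {
    std::vector<float> lastFramePositions(10000, 3.0f);

    // Pre-compute the value on another thread...
    auto hash = std::async(std::launch::async, buildSpatialHash, std::cref(lastFramePositions));

    doUnrelatedFrameWork();               // ...while this thread carries on.

    std::vector<int> cells = hash.get();  // join only where the value is actually needed
    std::printf("first cell: %d\n", cells.empty() ? -1 : cells[0]);
}
```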

The quote "but being able to add 'active' to a class etc. would be a godsend" is interesting. I'm currently in the process of working on various ways to multi-thread my current project (nothing special... just for fun), and this gives me a few ideas :) I just recently got a few things like scene graph recursion working well over multiple threads; it's just that object->object locking and state sharing is the real tricky bit. I do have code that does non-multi-threaded fixed-frequency updating... that could be an interesting place to start... :)

Maybe something along the lines of an interaction tree of objects that might potentially interact with each other, both passively and actively... get all the unconnected sets and bang, you have your thread groups. Tricky to do it dynamically though.
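Something like this for the unconnected-sets part: union-find over the 'might interact' pairs, with every disjoint set becoming a thread group. Just a static sketch with made-up data; the dynamic bit is the part I haven't cracked.

```cpp
// Sketch of the interaction-tree idea: union-find over "might interact"
// pairs, then every disjoint set of objects becomes one thread group that
// can be updated without locking against the others. Static example only;
// redoing this cheaply every frame is the hard part.
#include <cstdio>
#include <map>
#include <utility>
#include <vector>

struct DisjointSets {
    std::vector<int> parent;
    explicit DisjointSets(int n) : parent(n) { for (int i = 0; i < n; ++i) parent[i] = i; }
    int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }
    void unite(int a, int b) { parent[find(a)] = find(b); }
};

int main() {
    const int numObjects = 6;
    // Pairs that might interact this frame, passively or actively.
    std::vector<std::pair<int, int>> mightInteract = {{0, 1}, {1, 2}, {4, 5}};

    DisjointSets sets(numObjects);
    for (const auto& p : mightInteract) sets.unite(p.first, p.second);

    // Every unconnected set becomes its own thread group.
    std::map<int, std::vector<int>> threadGroups;
    for (int obj = 0; obj < numObjects; ++obj)
        threadGroups[sets.find(obj)].push_back(obj);

    int g = 0;
    for (const auto& kv : threadGroups) {
        std::printf("thread group %d:", g++);
        for (int m : kv.second) std::printf(" obj%d", m);
        std::printf("\n");
    }
}
```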
 