How long before we see a Tensilica CPU plus PowerVR together?

Brimstone

B3D Shockwave Rider
Veteran
ARM CPU cores have PowerVR integrated, but why not Tensilica? From what I've read about Tensilica's Xtensa CPU, it's more powerful and energy efficient than any other comparable CPU. A recent press release stated:

Nearly 9X Performance Advantage per MHz Over ARM1020E; Over 5X as Fast as SuperH 64-bit Core



http://www.tensilica.com/html/pr_2004_05_18a.html

[Image: EEMBC_optimized.gif]



An Xtensa CPU combined with a PowerVR core would seem to be the ultimate technology combo for a portable gaming device. Even Fortune magazine declared Tensilica one of the cool companies of 2004.

SANTA CLARA, Calif. – May 17, 2004 – Today’s issue of Fortune Magazine lists Tensilica, Inc. among the “14 High-Tech Companies We Love” in their “Cool Companies 2004” series. According to Fortune,

“Tensilica…offers a solution: an innovative architecture around which specialized chips can be designed, plus tools to design them (By the way, that sounds a lot like what the Sony, IBM, and Toshiba CELL is supposed to be: a modular design with tools to help create a chip for a specific task. Tensilica already offers that today, though.). Companies incorporate only the features they need, squeezing years from design time and making the resulting chips ten to 100 times faster, smaller, or less power-hungry than standard chips.”

Fortune continued,

“Renowned innovation professor Clayton Christensen at Harvard Business School regularly hails Tensilica as a ‘disruptive technology,’ along the lines of Linux, eBay, and Amazon.”

http://www.tensilica.com/html/pr-2004_05_17.html
 
First, I must say that I didn't take a lot of time to read their white papers, but all that stuff sounds like yet another Transmeta.

At least Tensilica has been awarded one of the cool companies of 2004 by Fortune magazine <keanu>Whoa</keanu>
 
Those graphs look impressive, but we don't know what task the CPUs are actually DOING to produce those results. The workloads could be hand-picked to show off their product in its best possible light.
 
More praise here

No company pushes the multiple-CPU idea harder than Tensilica. One of its customers has over 150 processors on a single ASIC, and the average is six per chip.

Tensilica's Xtensa processor is a bit of intellectual property (read: you pay your money, you get a disk) that represents a 32-bit RISC core. It has an extensible instruction set. Clever tools analyze your C code and automatically generate instructions to improve, often drastically, system performance.

The company previewed its Xtensa LX core, which adds a number of speed-enhancing features. One of the most eye-popping is a configurable I/O channel that supports transfers of up to 1,024 channels, each 1,024 bits wide, in a single clock. At 350MHz that's some 350 terabits per second. Another addition draws on the processor's extensible architecture to essentially eliminate I/O ports; data is inherently transferred via queues as part of an instruction's operation. A+B could automatically go to a device or location, without the usual intermediate store operation. That's pretty cool.

Even cooler, though, was the SozBot demo at Tensilica's booth. These one-pound-or-less robots competed in an orgy of destruction reminiscent of the Roman Colosseum. Robots deployed flame-throwers, scoopers, and circular saws that tore through each other like a demolition derby. The crowd went wild; it's always fun to see someone else's technology destroyed in a shower of sparks.


http://www.embedded.com/shared/printableArticle.jhtml?articleID=20000085
Tensilica Tackles Bottlenecks
Tom R. Halfhill - Senior Editor {05/31/2004}

Rarely has the ascendance of embedded processors been more evident than at the recent Embedded Processor Forum, where several companies announced products and features that must seem like tantalizing fantasies to the architects of staid PC processors. One company in particular, Tensilica, is continuing to pursue a farsighted corporate vision of architectural flexibility and automated design.

At EPF 2004, Tensilica announced new versions of its configurable microprocessor core and optional DSP engine, which are licensed as soft intellectual property (IP). When combined with the company’s previously announced VLIW-like instruction extensions and next-generation development tools, they will redefine the possibilities for embedded processors.

The new Xtensa LX is a major upgrade of Tensilica’s existing configurable processor core, the Xtensa V. Xtensa LX tackles three challenges vexing today’s CPU architects: the architectural limitations on compute efficiency, the bottlenecks on I/O bandwidth, and rising power consumption. For SoC developers, Xtensa LX preserves the advantages of a customizable CPU architecture while laying the groundwork for future development tools that will further automate the task of creating an optimized SoC design.

Tensilica also announced at EPF a new configurable DSP engine called Vectra LX. Designed specifically for the Xtensa LX processor—Tensilica already offers a DSP engine for earlier Xtensa cores—Vectra LX uses 64-bit instruction words containing three issue slots for ALU, multiply-accumulate, and load/store operations. In all, Vectra LX supports about 200 instructions for 16-bit fixed-point signal processing. Vectra LX is included with Xtensa LX and adds a level of DSP performance unprecedented in a synthesizable RISC processor.

All this probably seems too good to be true. However, Tensilica can back up its claims with independently certified benchmark results. Xtensa LX clobbers every other benchmarked processor in its class—and even some processors out of its class. For instance, in the EEMBC consumer suite, Xtensa LX achieved the highest out-of-the-box ConsumerMark score ever recorded by a licensable CPU core: 171.6 when simulated at 330MHz. That’s more than three times higher than the previous out-of-the-box champ, the Philips TriMedia TM5250, which scored a ConsumerMark of 51.3 when simulated at 500MHz.
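The two ConsumerMark scores quoted above were simulated at different clock speeds, so it can help to normalize them per MHz. The snippet below is only a rough sketch of that arithmetic using the figures from the article; the function and variable names are made up for illustration, and per-MHz scaling is just one way to read the numbers, not an EEMBC-defined metric.

```python
# Figures are the ones quoted in the article above; names are illustrative only.

def score_per_mhz(score: float, clock_mhz: float) -> float:
    """Benchmark score divided by the simulated clock speed in MHz."""
    return score / clock_mhz

xtensa_lx = score_per_mhz(171.6, 330)   # Xtensa LX, simulated at 330MHz
tm5250 = score_per_mhz(51.3, 500)       # Philips TriMedia TM5250, simulated at 500MHz

raw_ratio = 171.6 / 51.3                # ratio of the raw scores
per_mhz_ratio = xtensa_lx / tm5250      # ratio after normalizing by clock

print(f"raw score ratio:     {raw_ratio:.2f}x")
print(f"per-MHz score ratio: {per_mhz_ratio:.2f}x")
```

With the quoted numbers the raw ratio works out to about 3.35x, and about 5.07x once the lower simulated clock is taken into account. Neither figure says anything about how representative the workloads are.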

Tensilica also submitted Xtensa LX to Berkeley Design Technology Inc. for DSP benchmarking. Result: an optimized Xtensa LX core and Vectra LX DSP engine, simulated at 370MHz, easily outran every other licensable DSP or CPU core ever tested by BDTI. Xtensa LX scored a BDTIsimMark2000 of 6,150—about 70% higher than the previous champ, the CEVA-X1620 DSP, which was simulated at 450MHz.

To achieve these extraordinary benchmark results with a small RISC processor, Tensilica has introduced some groundbreaking new technology and development tools. We believe it’s only a matter of time before Tensilica’s approach to configurability and design automation exerts more influence over the whole industry.

Microprocessor Report readers can access the full story (7 pages; 3 graphics) here: www.mdronline.com/mpr/h/2004/0531/182201.html. To find out more about Microprocessor Report, please visit: www.mdronline.com.

http://www.mdronline.com/watch/watch_Issue.asp?Volname=Issue+#174&on=1
 
data is inherently transferred via queues as part of an instruction's operation. A+B could automatically go to a device or location, without the usual intermediate store operation. That's pretty cool.
Sounds like a data flow machine derivative.
 
Simon F said:
data is inherently transferred via queues as part of an instruction's operation. A+B could automatically go to a device or location, without the usual intermediate store operation. That's pretty cool.
Sounds like a data flow machine derivative.

Is that good or bad in your opinion? :)
 
PC-Engine said:
Very impressive technology. :oops:

BTW PC-Engine, the company that built the PC Engine, Hudson Soft, licensed this technology a while back. The design is called POEMS (Portable Entertainment Mixed Solution).

Launching POEMS Chip-Based Products

After completing development on Hudson's 32-bit single chip POEMS semiconductor, Hudson has been working hard to streamline the production line and lay the groundwork for the launch of POEMS products. We expect our partners to ship a number of POEMS products in time for the year-end sales showdown.

http://www.hudson.co.jp/corp/eng/management/direct2.html


Here is the original press release announcing that Hudson licensed Xtensa back in 2002. NEC uses these CPUs as well.

http://www.tensilica.com/html/pr_2002_10_15.html
 
NEC makes me think of GRID. So NEC licensed this tech and then came up with GRID?

The fact that "other people" are getting good results out of new technology can only be good for Cell.
 
PC-Engine said:
Simon F said:
data is inherently transferred via queues as part of an instruction's operation. A+B could automatically go to a device or location, without the usual intermediate store operation. That's pretty cool.
Sounds like a data flow machine derivative.

Is that good or bad in your opinion? :)
Oh, there's the potential for quite a lot of positive features - it should make register contention less of an issue but I suspect that you'd need quite a different flavour of instruction set.
 
Simon F said:
data is inherently transferred via queues as part of an instruction's operation. A+B could automatically go to a device or location, without the usual intermediate store operation. That's pretty cool.
Sounds like a data flow machine derivative.

To me dataflow machines always suggest a certain rigidity in the flow of data; I'd think of something more like TTA.
 
MfA said:
Simon F said:
data is inherently transferred via queues as part of an instruction's operation. A+B could automatically go to a device or location, without the usual intermediate store operation. That's pretty cool.
Sounds like a data flow machine derivative.

To me dataflow machines always suggest a certain rigidity in the flow of data; I'd think of something more like TTA.
TTA? I hadn't seen that acronym before but Google came to the rescue.

In a sense that's what I meant. The only "practical" Dataflow type systems that I'd seen mentioned seemed to use something like that.
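For anyone else who hadn't met the acronym: in a transport triggered architecture (TTA) the program is a list of data moves, and an operation starts as a side effect of moving a value into a function unit's trigger port. Below is a minimal toy sketch of that idea, not a model of Xtensa LX or of any real TTA design; the `AdderUnit` class, the port names, and the move format are all invented for illustration.

```python
class AdderUnit:
    """Toy function unit with an operand port, a trigger port and a result port."""

    def __init__(self):
        self.operand = 0
        self.result = 0

    def write(self, port: str, value: int) -> None:
        if port == "operand":
            self.operand = value          # just latch the value
        elif port == "trigger":
            # Writing the trigger port is what starts the operation.
            self.result = self.operand + value
        else:
            raise ValueError(f"unknown port: {port}")


def run(moves, registers, unit):
    """Execute a list of (source, destination) moves; there are no opcodes."""
    for src, dst in moves:
        value = unit.result if src == "add.result" else registers[src]
        if dst.startswith("add."):
            unit.write(dst.split(".", 1)[1], value)
        else:
            registers[dst] = value
    return registers


regs = {"r1": 3, "r2": 4, "r3": 0}
program = [
    ("r1", "add.operand"),   # move r1 into the operand port
    ("r2", "add.trigger"),   # move r2 into the trigger port, which fires the add
    ("add.result", "r3"),    # move the result back to a register
]
run(program, regs, AdderUnit())
print(regs["r3"])  # 7
```

The "A+B goes straight to a device" behaviour described in the article would correspond to the last move targeting an output port instead of `r3`.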
 
Two major innovations improve I/O throughput in Xtensa LX processors: an option for a second load/store unit and designer-defined ports and queues.

Designers using the Xtensa LX processor can choose one or two 128-bit wide load/store units. Most standard embedded processors have only a single narrow (32- or 64-bit) load/store unit. However, many applications benefit from two load/store units for data-intensive inner loops -- a standard feature of many high-end DSP processors. The Xtensa LX processor's optional second load/store unit provides greater sustained general-purpose I/O bandwidth and an XY-style memory access for DSP applications. Additionally, at 128 bits, it's much wider and can accommodate much more data than standard load/store units.

The true breakthrough in I/O is the capability to add designer-defined ports and queues, which allow the Xtensa LX processor to communicate as fast and as flexibly as RTL blocks. Ports are wires that directly connect two Xtensa LX processors or an Xtensa LX processor to external RTL. Port connections can be arbitrarily wide, allowing wide data types to be transferred easily without the need for multiple load/store operations. As many as one million signals (1024 1024-bit-wide ports) can be used, and while this is an outrageous number, far exceeding the performance demands of real systems today (providing 350 terabits/sec of direct data flow per processor in a 130 nm CMOS process), this clearly demonstrates that old notions of the I/O bottlenecks inherent in a processor-based solution are now obsolete.
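The "one million signals" and "350 terabits/sec" figures can be checked with simple arithmetic. This is a back-of-the-envelope sketch only: the 350MHz clock comes from the Embedded.com excerpt earlier in the thread, the function name is made up, and it assumes every port moves its full width on every clock.

```python
# Peak-rate arithmetic for the port figures quoted above.
# Assumption: all ports transfer their full width on every clock cycle.

def aggregate_bits_per_second(num_ports: int, bits_per_port: int, clock_hz: float) -> float:
    """Peak bits per second across all ports."""
    return num_ports * bits_per_port * clock_hz

signals = 1024 * 1024                                  # total wires across all ports
peak = aggregate_bits_per_second(1024, 1024, 350e6)    # 350MHz clock

print(f"{signals} signals, {peak / 1e12:.0f} Tbit/s peak")
# prints "1048576 signals, 367 Tbit/s peak"
```

So the quoted 350 terabits/sec is a slightly rounded-down theoretical peak, consistent with 1,024 ports of 1,024 bits each at roughly 350MHz.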

While ports are ideal to quickly convey control and status information, queues provide a high-speed mechanism to transfer streaming data. Input queues and output queues operate to the programmer's viewpoint like traditional processor registers -- with the notable exception that data is always available without the need to load or store the data before and after computation. Queues can sustain data rates as high as one transfer every clock cycle or over 350 Gbits/sec for each queue added to an Xtensa LX processor. Custom instructions can perform multiple queue operations per cycle, perhaps combining inputs from two input queues with local data and sending the computed values to two output queues. The high bandwidth and low control overhead of queues allows the Xtensa LX processor to be used in applications with extreme data rates.
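The behaviour described above, where one custom instruction reads two input queues, combines them with local data, and writes two output queues, can be pictured with a small software model. This is a toy sketch only: `QueueCore`, `local_gain`, and the sum/difference operation are invented for illustration and say nothing about how Tensilica actually defines queue instructions.

```python
from collections import deque


class QueueCore:
    """Toy model in which each step() call stands for one custom instruction."""

    def __init__(self, in_a, in_b, local_gain):
        self.in_a = deque(in_a)        # first input queue
        self.in_b = deque(in_b)        # second input queue
        self.local_gain = local_gain   # stands in for "local data"
        self.out_sum = deque()         # first output queue
        self.out_diff = deque()        # second output queue

    def step(self) -> bool:
        """Pop one item from each input queue and push to both output queues.

        Returns False (a stall) if either input queue is empty.
        No explicit load or store is involved.
        """
        if not self.in_a or not self.in_b:
            return False
        a = self.in_a.popleft()
        b = self.in_b.popleft()
        self.out_sum.append((a + b) * self.local_gain)
        self.out_diff.append((a - b) * self.local_gain)
        return True


core = QueueCore([1, 2, 3], [10, 20, 30], local_gain=2)
cycles = 0
while core.step():
    cycles += 1

print(cycles, list(core.out_sum), list(core.out_diff))
# prints "3 [22, 44, 66] [-18, -36, -54]"
```

Three input pairs are consumed in three steps, which is the "one transfer every clock cycle" idea in miniature.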

Ports and queues specified by the designer are automatically added to the Xtensa LX processor and are 100% fully modeled by Tensilica's Xtensa Processor Generator. The full behavior of the port or queue, just like any other modification made to the Xtensa LX processor, is automatically reflected in the custom software development tools, instruction set simulator, bus functional model and EDA scripts -- within about an hour. And because it's automated using Tensilica's patented technology, it's pre-verified and correct by construction -- no need to re-verify the processor.

http://www10.edacafe.com/nbc/articles/view_article.php?section=CorpNews&articleid=125002
 
Seems the processor itself is traditional, apart from its more intimate relation with I/O and custom execution units.

BTW, what is very important in this segment is the performance per Watt ... mention of which seems absent.
 
Lower Power Consumption

Tensilica has automated the insertion of fine-grain clock gating for every functional element of the Xtensa LX processor including functions conceived of and created by the designer. Clock gating is a very effective power reduction technique that shuts down the power to parts of the logic that are not in use on a particular clock cycle. Because automatic insertion of clock gating is only available for restricted RTL design coding styles, manual, error-prone post-layout tuning of clock circuits is often required for standard RTL design.
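A crude way to see why fine-grain clock gating helps is to count how many functional units receive a clock on each cycle. The sketch below is a toy model under heavy assumptions: the unit names and the four-cycle schedule are invented, each clocked unit is treated as costing the same, and leakage and the overhead of the gating logic itself are ignored.

```python
# Toy model: count "clocked unit-cycles" as a rough proxy for dynamic power.

def clocked_unit_cycles(schedule, units, gated: bool) -> int:
    """Total number of (unit, cycle) pairs that receive a clock edge."""
    total = 0
    for active in schedule:
        # With gating, only units doing work this cycle are clocked;
        # without it, every unit is clocked on every cycle.
        total += len(active) if gated else len(units)
    return total


units = ["alu", "mul", "load_store", "custom"]   # hypothetical functional units
schedule = [                                     # hypothetical per-cycle activity
    {"alu"},
    {"alu", "load_store"},
    {"mul"},
    {"custom", "load_store"},
]

ungated = clocked_unit_cycles(schedule, units, gated=False)
gated = clocked_unit_cycles(schedule, units, gated=True)
print(ungated, gated)  # 16 6
```

In this made-up schedule gating cuts the clocked unit-cycles from 16 to 6; real savings depend entirely on the design and the workload.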

The Xtensa LX processor's new architecture dramatically lowers power consumption in large configurations with many designer-defined functions. But even without designer modification, the Xtensa LX processor is designed to use power very efficiently. The minimum configuration of the Xtensa LX processor dissipates a miserly 0.05 mW/MHz in a representative 130 nm process technology. By comparison, the smallest member of the ARM synthesizable processor family, the ARM7TDMI-S, burns 0.11 mW/MHz in 130 nm technology -- twice the power consumption of the Xtensa LX.
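To put the mW/MHz figures into absolute terms, multiply by a clock speed. The sketch below does that for a common hypothetical clock of 100MHz, which is an assumption chosen only to make the comparison concrete; the names are illustrative. Note that mW/MHz is not the same as performance per watt, since it ignores how much work each core gets done per clock.

```python
# Power arithmetic for the two minimum-configuration figures quoted above.

def power_mw(mw_per_mhz: float, clock_mhz: float) -> float:
    """Power in mW at a given clock, assuming power scales linearly with MHz."""
    return mw_per_mhz * clock_mhz


CLOCK_MHZ = 100  # hypothetical common clock, not taken from the press release

xtensa_lx = power_mw(0.05, CLOCK_MHZ)    # Xtensa LX minimum configuration
arm7tdmi_s = power_mw(0.11, CLOCK_MHZ)   # ARM7TDMI-S
ratio = arm7tdmi_s / xtensa_lx

print(f"Xtensa LX: {xtensa_lx:.1f} mW, ARM7TDMI-S: {arm7tdmi_s:.1f} mW, ratio {ratio:.1f}x")
# prints "Xtensa LX: 5.0 mW, ARM7TDMI-S: 11.0 mW, ratio 2.2x"
```

The ratio is 2.2x, so "twice the power consumption" in the press release is a slight rounding down.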

I think the "LX" in the new Xtensa core refers to the Lx VLIW design by HP and STMicroelectronics. They see the VLIW Lx as "a convergence of DSP and microcontroller".

http://www.embedded.com/2000/0010/0010feat6.htm

What Tensilica seems to be doing is taking the HP VLIW Lx design for an embedded processor and putting their own spin on it.

While STMicro started investigating VLIW, Hewlett-Packard's R&D division, HP Labs, was working on its own VLIW technology in several forms. One of those was the LX project started in 1994 by Josh Fisher and Paolo Faraboschi of HP Labs in Cambridge, Mass. The HP team, however, was uncertain about how it was going to get the technology to market.


"We came in with a technology that we had been developing for some time and the aspects were reasonably far along, but we didn't have anything above the instruction set," said HP fellow Josh Fisher, who coined the term VLIW in the 1980s.
...

STMicro's job was to take the instruction set and compiler technology from HP Labs and craft a microarchitecture that would be true to the spirit of VLIW. To do that, the company focused its design activities at a central location in Cambridge, Mass., and drew from five other design teams in Europe and the United States.


The task was not trivial. To keep the microarchitecture from bogging down the compiler, STMicro would have to refrain from using hardware-assist features, something other companies espousing VLIW had used to meet performance requirements of certain target applications.


But the STMicro/HP Labs team did not want to use this as a crutch because doing so would hamper the compiler's ability to schedule code and result in more conservative performance estimates, said Faraboschi, HP Labs' LX project manager and principal research scientist.


Working closely with HP Labs, STMicro crafted the microarchitecture so that it would be malleable enough to let a designer add or subtract basic architectural functional units like adders, multipliers, registers and register ports. In this way, a designer can analyze the three key parameters to any microprocessor — performance, power dissipation and die area — before the final architecture is frozen.


...



This kind of performance is comparable to a digital signal processor but with the advantage of being able to program it in C. "We did an internal comparison between DSP and this technology, and we found that they were probably equivalent, though the ST200 [test chip] had a slight edge in MPEG decode. The difference is that one is programmed in assembler, and one in C," Bramley said.


This is because DSP architectures usually aren't compiler-friendly. "DSPs and compilers traditionally don't like each other very much," Faraboschi said. "The way they express performance is through handcrafting assembler files. Here, we're talking about close to 100 percent performance at the C level."


http://www.embedded.com/showArticle.jhtml?articleID=9900461
 
Hmm..interesting. I was aware of POEMS awhile back, but I didn't know it was related to this technology. Thanks for the info.
 