Software radios via GPGPU - is it viable?
This is only a curiosity question, having read a little bit about software radios in the last few hours. It seems to me that this kind of algorithm would be quite adaptable for GPGPU processing - as far as I can see, two of the most important operations for it are FIRs, which are basically FMACs, and FFTs.
A key datapoint I cannot seem to find, however, is whether 3G standards and beyond can be properly and efficiently implemented with integer units. If they can, that would clearly put GPUs at a disadvantage and give an advantage to traditional 24-bit DSPs.
I guess the other point is whether fixed-function hardware could be used to further accelerate processing. One of the most promising solutions for software radio as recently showcased by IMEC, http://www.eetimes.com/rss/showArticle.jhtml?articleID=197005462&cid=RSSfeed_eetimes_newsRSS - 7.7mm² on 90nm, with support for: "802.11a/b/g/n, Bluetooth, ZigBee, UMTS, HSDPA, digital audio broadcast (DAB), DVB-H, and Korea's emerging digital multimedia broadcast (DMB) protocol." - I'd be very curious how much, if any, of that implementation is actually truly fixed-function...
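To make the "FIRs are basically FMACs" remark concrete, here is a minimal Python sketch. It is plain reference code, not tuned for any DSP or GPU, and the names `fir` and `dft` are made up for illustration. The naive DFT stands in for an FFT only to show that the same multiply-accumulate pattern is underneath; a real FFT reorganizes the work to need far fewer operations.

```python
import cmath

def fir(samples, taps):
    """Direct-form FIR filter: every output is a chain of multiply-accumulates."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * samples[n - k]  # one multiply-accumulate per tap
        out.append(acc)
    return out

def dft(x):
    """Naive O(N^2) DFT (an FFT computes the same result with fewer MACs)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]

# Illustrative data: a 4-tap moving average applied to a step input,
# and the spectrum of a unit impulse.
moving_average = [0.25, 0.25, 0.25, 0.25]
step = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
filtered = fir(step, moving_average)
spectrum = dft([1.0, 0.0, 0.0, 0.0])
print(filtered)
print(spectrum)
```

Each output sample depends only on a window of inputs, so the outer loop is independent per sample, which is the property that makes this kind of workload look attractive for wide parallel hardware.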
silent_guy
18-Feb-2007, 18:10
This is only a curiosity question, having read a little bit about software radios in the last few hours. It seems to me that this kind of algorithm would be quite adaptable for GPGPU processing - as far as I can see, two of the most important operations for it are FIRs, which are basically FMACs, and FFTs.
A key datapoint I cannot seem to find, however, is whether 3G standards and beyond can be properly and efficiently implemented with integer units. If they can, that would clearly put GPUs at a disadvantage and give an advantage to traditional 24-bit DSPs.
UWB WLAN, WiMax and their wired friends can be and are all implemented with fixed point, but 24 bits is often not enough. I don't see why it wouldn't be possible to use floating point, though.
I guess the other point is whether fixed-function hardware could be used to further accelerate processing.
It's too early to say for newer standards, but most 802.11a/b/g architectures have migrated from being somewhat DSP based to almost entirely fixed function + an ARM for the MAC. The cost pressure there is incredible and a DSP is just a little bit too big. Some have even dropped the ARM and gone for a fixed function MAC.
If you were referring to non-calculation fixed function hardware: that is much needed too, especially to get decent error coding performance. TI DSPs have special Reed-Solomon instructions. Others do it entirely with fixed function logic. It's complex stuff that would be extremely slow with a regular instruction set.
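As a rough illustration of why this is slow on a regular instruction set: Reed-Solomon arithmetic works on Galois field elements, and even a single GF(2^8) multiply takes a loop of tests, shifts and XORs when there is no dedicated instruction. The sketch below is a generic textbook shift-and-XOR multiply in Python, not TI's instruction or any vendor's implementation. The field polynomial 0x11D is an assumption picked for the example; each standard specifies its own.

```python
def gf256_mul(a, b, poly=0x11D):
    """Multiply two GF(2^8) elements using shift-and-XOR steps.

    `poly` is the field's reduction polynomial (0x11D is an assumed example).
    """
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a      # addition in GF(2^8) is XOR
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= poly        # reduce modulo the field polynomial
    return result

print(hex(gf256_mul(0x02, 0x80)))
print(hex(gf256_mul(0x03, 0x07)))
```

A decoder needs a large number of these multiplies per codeword, which is why a single-cycle field multiplier in hardware or a lookup-table trick makes such a difference.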
Edit: Also don't forget dedicated hardware for encryption/decryption.
One of the most promising solutions for software radio as recently showcased by IMEC, http://www.eetimes.com/rss/showArticle.jhtml?articleID=197005462&cid=RSSfeed_eetimes_newsRSS - 7.7mm² on 90nm, with support for: "802.11a/b/g/n, Bluetooth, ZigBee, UMTS, HSDPA, digital audio broadcast (DAB), DVB-H, and Korea's emerging digital multimedia broadcast (DMB) protocol." - I'd be very curious how much, if any, of that implementation is actually truly fixed-function...
Maybe I'm missing something, but it looks like all they were presenting at ISSCC was an RF transceiver. Still an impressive feat and a necessary component for software defined radio, but they're not even half way there.
It seems like IMEC is targeting the mobile device. Others are flipping that around and targeting the base station. IEEE Spectrum magazine recently published their 10 winner and loser technologies of 2007 and software defined radio was a winner, specifically a company targeting cellular base stations. Off the shelf servers are used to simultaneously process CDMA and GSM.
http://spectrum.ieee.org/jan07/4833
Interestingly, another winning technology was a software based fuel control system for cars. A Brazilian company has software that allows a car to operate on any mixture of gasoline and ethanol.
http://spectrum.ieee.org/jan07/4834
Programmability and software controlled technology seems to be picking up steam.
Since I haven't addressed the question of the thread, I'll mention that another winning technology is GPGPU related (RapidMind (http://spectrum.ieee.org/jan07/4837)), so if you blend the two maybe you can get software defined radio via GPGPU.
UWB WLAN, WiMax and their wired friends can be and are all implemented with fixed point, but 24 bits is often not enough. I don't see why it wouldn't be possible to use floating point, though.
I guess that depends a little bit on the why 24-bits isn't enough. Would the extra 8 bits of exponent help here? I'd assume they would, but I don't know enough about this kind of workload to say so reliably.
It's too early to say for newer standards, but most 802.11a/b/g architectures have migrated from being somewhat DSP based to almost entirely fixed function + an ARM for the MAC. The cost pressure there is incredible and a DSP is just a little bit too big. Some have even dropped the ARM and gone for a fixed function MAC.
Hmm. Interesting. So there are WiFi implementations out there which are nearly 100% fixed-function? It's fairly unsurprising that they switched from an ARM to something more specialized if they used it nearly exclusively for the MAC, I guess.
TI DSPs have special Reed-Solomon instructions. Others do it entirely with fixed function logic. It's complex stuff that would be extremely slow with a regular instruction set.
Edit: Also don't forget dedicated hardware for encryption/decryption.
Ah yeah, I see the point there. Just briefly looking at how Reed-Solomon works makes me want to pity any non-customized DSP that would be forced to implement that... :)
Maybe I'm missing something, but it looks like all they were presenting at ISSCC was an RF transceiver. Still an impressive feat and a necessary component for software defined radio, but they're not even half way there.
Gah, indeed, I misread that. It indeed looks like that's only the transceiver.
My initial reasoning was that you don't need as much 3D or video processing while doing some of that workload, so area efficiency would improve because you'd simply be reusing some functional blocks for different purposes - but clearly, that doesn't make sense for a variety of workloads going forward, so I'd better at least partially scrap that idea, hehe.
I guess if I look at Reed-Solomon or encryption, the standards used there are unlikely to change much in the coming years - or am I horribly wrong if so? So I guess what I'm wondering is the proportion of fixed-function blocks that are shared between the standards, and what an optimal programmable architecture for the non-shared blocks would look like. That's a pretty big question though, so I doubt anyone in the industry has an ideal answer - or, if they do, they'd be keeping it to themselves... :)
Since I haven't addressed the question of the thread, I'll mention that another winning technology is GPGPU related (RapidMind (http://spectrum.ieee.org/jan07/4837)), so if you blend the two maybe you can get software defined radio via GPGPU.
Yeah, my original train of thought was that if you can get software defined radio to work on an off-the-shelf server, then surely, given the kind of workload, you could get some extra acceleration for it on a GPU.
The handheld side of things is interesting too, definitely - although there, perf/area and perf/watt are going to matter even more, since the budgets (power and dollar-wise) are much more limited than for a cellular base station...
One area where I'd really like software radios to pick up is for southbridges. That would certainly make that domain slightly more interesting again, and it'd help a lot to make Wireless USB mainstream. At the rate southbridges are evolving, they risk becoming a commodity eventually - their die size is becoming quite negligible on the latest processes, really.
silent_guy
18-Feb-2007, 22:27
I guess that depends a little bit on the why 24-bits isn't enough. Would the extra 8 bits of exponent help here? I'd assume they would, but I don't know enough about this kind of workload to say so reliably.
You really need the mantissa bits. Exponent is not sufficient. When you're trying to squeeze 100Mbps out of a very noisy environment, you want to extract all the pieces of real information in there you can get.
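A small sketch of the mantissa-versus-exponent distinction, assuming IEEE 754 single precision (FP32), which carries a 24-bit significand and an 8-bit exponent. The helper `to_f32` is made up for this example and just rounds a Python float through the 4-byte format. The exponent buys range, not resolution: huge values are fine, but a 25-bit integer no longer survives.

```python
import struct

def to_f32(x):
    """Round a Python float (FP64) to the nearest FP32 value and back."""
    return struct.unpack("f", struct.pack("f", x))[0]

fits = float(2 ** 24 - 1)      # needs 24 significant bits
too_wide = float(2 ** 24 + 1)  # needs 25 significant bits
huge = 2.0 ** 100              # large magnitude, but only 1 significant bit

print(to_f32(fits) == fits)          # representable in FP32
print(to_f32(too_wide) == too_wide)  # low bit is rounded away
print(to_f32(huge) == huge)          # the exponent handles the range
```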
We had customized multiplier widths linked to customized accumulate. Say 17 bits x 25 bits with 30 bits accumulate. (Though those weren't the real numbers.) Reducing those widths by 1 bit resulted in measurable theoretical performance loss.
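The effect of word width can be sketched with a toy experiment. This is a deliberately simplified Python model with made-up data: it only quantizes the filter coefficients to a given number of fractional bits and compares the output against the unquantized filter. It does not model the 17 x 25 multiplier or the 30-bit accumulator mentioned above, and the filter, the test signal and the helper names are all assumptions for illustration.

```python
import math

def quantize(value, frac_bits):
    """Round a value to a fixed-point grid with `frac_bits` fractional bits."""
    scale = 1 << frac_bits
    return round(value * scale) / scale

def fir_output(samples, taps):
    """Direct-form FIR filter evaluated in full (FP64) precision."""
    return [sum(h * samples[n - k] for k, h in enumerate(taps) if n - k >= 0)
            for n in range(len(samples))]

def snr_db(reference, approx):
    """Signal-to-error ratio of `approx` relative to `reference`, in dB."""
    signal = sum(r * r for r in reference)
    noise = sum((r - a) ** 2 for r, a in zip(reference, approx))
    return float("inf") if noise == 0 else 10 * math.log10(signal / noise)

# Made-up test data: a 16-tap windowed low-pass filter and a two-tone input.
taps = [0.1 * math.sin(0.3 * (k - 7.5)) / (0.3 * (k - 7.5))
        * (0.5 - 0.5 * math.cos(2 * math.pi * (k + 0.5) / 16))
        for k in range(16)]
samples = [math.sin(0.05 * n) + 0.5 * math.sin(0.9 * n) for n in range(256)]
reference = fir_output(samples, taps)

results = {}
for bits in (8, 12, 16, 20):
    q_taps = [quantize(h, bits) for h in taps]
    results[bits] = snr_db(reference, fir_output(samples, q_taps))
    print(bits, "fractional bits:", round(results[bits], 1), "dB")
```

A real design study would sweep multiplier and accumulator widths together and measure link-level performance such as bit error rate, not just filter error.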
Hmm. Interesting. So there are WiFi implementations out there which are nearly 100% fixed-function?
Yes. A few years ago, some Asian design houses developed their own fixed function IP and started selling that. Performance wasn't the best, but it was more than good enough to tunnel an 8 MBit DSL feed and it was small and cheap. This is what matters most for major parts of the world. Suddenly WLAN wasn't so exclusively high tech anymore and prices crashed.
I guess if I look at Reed-Solomon or encryption, the standards used there are unlikely to change much in the coming years - or am I horribly wrong if so? So I guess what I'm wondering is the proportion of fixed-function blocks that are shared between the standards, and what an optimal programmable architecture for the non-shared blocks would look like. That's a pretty big question though, so I doubt anyone in the industry has an ideal answer - or, if they do, they'd be keeping it to themselves... :)
RS is indeed fairly standard. Also, a fixed function block doesn't necessarily mean that it's not programmable. These days, you can design a microcoded 3-stage engine with, say, 256 instruction words and 128 words of RAM in a fraction of a mm². If you design the right super-specialized instruction set for that, it'd be hard not to be able to support all the required variants.
We have used programmable engines just to replace a complete fixed function DMA engine. At the cost of very little area, there's suddenly so much more you can do. Very often, the customer has no idea that a particular block is, in reality, highly programmable.
You really need the mantissa bits. Exponent is not sufficient. When you're trying to squeeze 100Mbps out of a very noisy environment, you want to extract all the pieces of real information in there you can get.
We had customized multiplier widths linked to customized accumulate. Say 17 bits x 25 bits with 30 bits accumulate. (Though those weren't the real numbers.) Reducing those widths by 1 bit resulted in measurable theoretical performance loss.
Interesting. And beyond that number of bits, I'd guess it's diminishing returns? So FP64 to take an extreme case would help, but it wouldn't help much at all?
Yes. A few years ago, some Asian design houses developed their own fixed function IP and started selling that. Performance wasn't the best, but it was more than good enough to tunnel an 8 MBit DSL feed and it was small and cheap.
Hmm! :) That's a fun little bit of industry insight indeed!
RS is indeed fairly standard. Also, a fixed function block doesn't necessarily mean that it's not programmable. These days, you can design a microcoded 3-stage engine with, say, 256 instruction words and 128 words of RAM in a fraction of a mm². If you design the right super-specialized instruction set for that, it'd be hard not to be able to support all the required variants.
The question for such a thing, though, is just how much of an additional overhead that programmability is. One unit taking 0.25mm² instead of 0.05mm² on a 50mm² die isn't going to matter much. But if you need an array of 40 such units, suddenly that might change a bit.
One interesting point I've seen some people mention is that one of the key drivers behind programmability for GPUs, besides developers asking for it, is that its relative overhead is lower if your execution units are bigger in the first place. The argument being, for an INT8 pipeline, programmability costs you a lot; but for a FP32 pipeline, it's relatively speaking less costly.
I'd imagine the dynamics at work here are fairly similar; it's obviously not viable to add even basic programmability to a minuscule amount of silicon. On the other hand, for a larger unit, it might make a lot of sense...
silent_guy
19-Feb-2007, 19:11
Interesting. And beyond that number of bits, I'd guess it's diminishing returns? So FP64 to take an extreme case would help, but it wouldn't help much at all?
Very often, the numbers are ultimately dictated by the precision of the A/D converter and the algorithms that work on the numbers. So above a certain number of bits, you get zero increase in return. This is basically the spot you design for.
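For a feel of where that ceiling sits, the usual textbook rule of thumb for an ideal N-bit converter driven by a full-scale sine is a signal-to-quantization-noise ratio of about 6.02 N + 1.76 dB. The Python sketch below compares that formula with a simple simulated quantizer. Both function names are made up for the example, and real converters fall short of the ideal figure.

```python
import math

def ideal_adc_sqnr_db(bits):
    """Textbook SQNR of an ideal converter with a full-scale sine input."""
    return 6.02 * bits + 1.76

def measured_sqnr_db(bits, count=4096):
    """Quantize a full-scale sine to `bits` bits and measure the SQNR."""
    scale = 1 << (bits - 1)
    signal = noise = 0.0
    for n in range(count):
        x = math.sin(0.7 * n)
        q = round(x * scale) / scale   # idealized rounding quantizer
        signal += x * x
        noise += (x - q) ** 2
    return 10 * math.log10(signal / noise)

for bits in (8, 10, 12):
    print(bits, "bits: formula", round(ideal_adc_sqnr_db(bits), 2),
          "dB, simulated", round(measured_sqnr_db(bits), 2), "dB")
```

Once the quantization noise is in the samples, wider arithmetic downstream cannot remove it; extra width only helps as far as it keeps the processing from adding more noise of its own.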
The question for such a thing, though, is just how much of an additional overhead that programmability is. One unit taking 0.25mm² instead of 0.05mm² on a 50mm² die isn't going to matter much. But if you need an array of 40 such units, suddenly that might change a bit.
These days, most functional blocks require a lot of memory anyway, even if they are not programmable at all, so the increase isn't as extreme as your example. :wink:
Say your RS encoder needs a RAM of 4096 8-bit words plus a whole bunch of datapath logic. In that case, adding a 256x16 instruction RAM isn't the end of the world.
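Spelling out the arithmetic for those example sizes in Python (raw bit counts only; actual area also depends on the RAM macro overhead and the extra decode logic, which this does not model):

```python
data_ram_bits = 4096 * 8     # the RS encoder's working RAM from the example
instr_ram_bits = 256 * 16    # the added instruction RAM

overhead = instr_ram_bits / data_ram_bits
print(data_ram_bits, instr_ram_bits, f"{overhead:.1%}")
```

That is 4,096 extra bits on top of 32,768, or 12.5% more memory bits, before counting the datapath logic the block needs either way.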
One interesting point I've seen some people mention is that one of the key drivers behind programmability for GPUs, besides developers asking for it, is that its relative overhead is lower if your execution units are bigger in the first place. The argument being, for an INT8 pipeline, programmability costs you a lot; but for a FP32 pipeline, it's relatively speaking less costly.
That was probably true in the early days, although I don't really know how big the pure calculation part of a shader really is. (Only the CELL papers provide a hint at the size of an FP32 MADD, for example.)
I'd imagine the dynamics at work here are fairly similar; it's obviously not viable to add even basic programmability to a minuscule amount of silicon. On the other hand, for a larger unit, it might make a lot of sense...
Well, my point is actually that it is viable to add programmability to very small pieces of logic. In some cases, we have an execution machine with an instruction RAM of just 24 instructions. It doesn't have a stack or your typical register store or even jumps (it uses predication instead) and the pipeline is entirely exposed, so you have to manually account for pipeline delays. But there is an assembler and you can actually do pretty complicated stuff with it.
It's basically a programmable statemachine. This is very common nowadays: a hard-wired statemachine is much more prone to corner case bugs, harder to verify and harder to ECO when things go wrong.
But it's still orders of magnitude more efficient than a big iron CPU or DSP that has to be flexible enough to do anything you throw at it.
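To give a flavor of what programming without jumps looks like, here is a toy Python model of a predicated engine. Everything in it is hypothetical: the instruction format, the opcode names and the `run` helper are invented for the example and do not describe the actual hardware discussed above. It also ignores the exposed pipeline delays. Every instruction slot is visited in order; a predicate decides whether the slot has any effect.

```python
def run(program, regs):
    """Execute each instruction once, in order; no jumps, only predication."""
    regs = dict(regs)
    flag = False
    for pred, op, dst, a, b in program:
        if (pred == "if" and not flag) or (pred == "unless" and flag):
            continue  # predicated off: the slot is spent, nothing is written
        if op == "cmp_lt":
            flag = regs[a] < regs[b]   # sets the predicate flag
        elif op == "mov":
            regs[dst] = regs[a]
        else:
            raise ValueError(f"unknown op {op!r}")
    return regs

# Clamp x into [lo, hi] in four straight-line instruction slots.
CLAMP = [
    ("always", "cmp_lt", None, "x", "lo"),
    ("if",     "mov",    "x",  "lo", None),
    ("always", "cmp_lt", None, "hi", "x"),
    ("if",     "mov",    "x",  "hi", None),
]

for value in (-5, 7, 15):
    print(value, "->", run(CLAMP, {"x": value, "lo": 0, "hi": 10})["x"])
```

Because the program always takes the same number of slots, timing is fixed and known in advance, which suits a block that sits inside a hardware pipeline.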
silent_guy
19-Feb-2007, 19:24
Ah yeah, I see the point there. Just briefly looking at how Reed-Solomon works makes me want to pity any non-customized DSP that would be forced to implement that... :)
Drifting off topic, here (http://staging.spectrum.ieee.org/mar04/3957) is a great article about Turbo Coding. The interesting part is that it was invented by a couple of engineers trying to solve a problem instead of some high theoretician throwing around math.
They basically stumbled into the holy grail of error coding: the ability to get arbitrarily close to the Shannon limit of a channel, if you're just willing to throw enough hardware at it.
I remember reading the article on Turbo Codes. Very interesting. I think I might still have that issue around somewhere.