Packet shader

pascal

Veteran
New use for shaders :)

http://www.technologyreview.com/communications/26096/

Virtual Router Smashes Speed Records
Software-driven networking will enable new internet protocols.

By Christopher Mims
Monday, August 23, 2010


Researchers in South Korea have built a networking router that transmits data at record speeds from components found in most high-end desktop computers. A team from the Korea Advanced Institute of Science and Technology created the router, which transmits data at nearly 40 gigabytes per second--many times faster than the previous record for such a device.

The techniques used by the researchers could lead to a number of breakthroughs, including the use of cheaper commodity chips, such as those made by Intel and Nvidia, in high-performance routers, in place of custom-made hardware. The software developed by the researchers could also serve as a testbed for novel networking protocols that might eventually replace the decades-old ones on which the Internet currently runs.

Most routers use custom hardware to route data as it passes between computer networks. Software routers perform the same tasks using commodity hardware--by mimicking the behavior of a hardware router in software. Commercial software routers from companies such as Vyatta can typically only transfer data at speeds of up to three gigabytes per second. That isn't fast enough to take advantage of the full speed of a typical network card, which operates at 10 gigabytes per second.

"We started with the humble goal of being the first to get a PC router to 10 [gigabytes per second], but we pushed it to 40," says Sue Moon, leader of the lab in which the research was conducted. Her students Sangjin Han and Keon Jang developed software called PacketShader that made this possible. PacketShader uses a computer's graphics processing unit (GPU) to help process packets of data sent across a network.

Modern routers are rarely dumb switches anymore. They are often called upon to manipulate packets in a number of different ways as they pass through. GPUs are ideal for this purpose because they can process data in parallel, which means they can handle several packets of data at once. According to Moon, a GPU is much faster at handling some packet-processing tasks, such as authenticating or encrypting all of the packets in a stream. When the GPU takes over these tasks, it gives the central processing unit (CPU) breathing room to handle other things that are more serial in nature, such as processing several packets in turn to detect attempts to break into a network.
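To make the parallelism concrete: the GPU side of such a design essentially launches one thread per packet, with every thread doing the same small piece of header work independently. A rough CUDA-style sketch of that pattern (purely illustrative, not PacketShader's actual code; the buffer layout and names are invented here):

// One thread per packet: each thread independently rewrites one IPv4 header.
// Illustrative sketch only; real code must validate the header, handle TTL=0, etc.
__global__ void process_packets(unsigned char *pkts,   // headers copied to GPU memory
                                int pkt_stride,        // bytes between consecutive headers
                                int num_pkts)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;      // global packet index
    if (i >= num_pkts)
        return;

    unsigned char *ip = pkts + i * pkt_stride;           // this thread's IPv4 header

    ip[8] = ip[8] - 1;                                    // decrement TTL (header byte 8)

    // Patch the header checksum (bytes 10-11) incrementally for the TTL change,
    // folding the carry back in (RFC 1141 style).
    unsigned int sum = ((unsigned int)ip[10] << 8) | ip[11];
    sum += 0x0100;
    sum = (sum & 0xffff) + (sum >> 16);
    ip[10] = (unsigned char)(sum >> 8);
    ip[11] = (unsigned char)(sum & 0xff);
}

Thousands of such threads run at once, which is why per-packet work maps so well to a GPU, while tasks that need to look across a sequence of packets stay on the CPU.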

Mark Handley, a professor of networked systems at University College London, points out that for basic packet forwarding, which isn't likely to overwhelm a computer's CPU, there is no advantage to strapping the GPU onto the system. However, he agrees that the GPU is very well suited to encrypting or authenticating packets.

Gianluca Iannaccone, an engineer at Intel Labs Berkeley who is familiar with PacketShader, says it could slash the number of physical machines needed to build a terabit-per-second software router to one-third of what his research has previously indicated would be required.

"One terabit is the entry point for enterprise-grade routers--the routers in the core of the Internet," says Iannaccone. His work on a system called RouteBricks points to a future in which routers aren't the specialized hardware they are now, but instead function as software running on pools of servers. Lash enough software routers together that run at 40 gigabytes per second, and you get what is essentially a single-terabit router. Using such a system, routers might some day run completely in software.

"We can expect killer apps out of this," says KyoungSoo Park, another professor at the Korea Advanced Institute of Science and Technology who was involved with the project. "You can build an interesting packet- or network-management system on top of a PC-based software router that can't be implemented with a hardware router. Ultimately, you can experiment with new protocols that are not used in today's Internet."
 
What is capable of sending data at that rate, a 40-drive striped array?

This is meant to be placed at exchange points and similar places where you connect to the routers you peer with and to the ASes that carry your transit traffic. If you are an ISP, that means all the traffic from/to all your customers in that region goes through these routers. If you are a tier-2 or tier-1 ISP, then 40Gb/s is nothing.

There's a typo in the text: there's no network interface operating at 10GB/s yet; there are some prototypes, but nothing finished. Most carrier-grade routers have 10GigE (which you can group together if you want) and SONET interfaces. It's not uncommon to have groups of 4 x 10GigE interfaces, and the router's ASICs can carry this kind of traffic without problems.

It would be interesting to see how well it can handle MPLS and QoS. I take it all the routing decisions (most likely BGP) are made by the host CPU, but there's a lot more to routing than forwarding packets.
 
I wonder how they reach 40GB/s by using GPUs to process packets when PCIe only manages a tiny fraction of that speed. Are authentication and encryption so processing intensive that those packets fit within the 16GB/s (theoretical) capacity of PCIe, leaving the CPU to deal with the remaining 24GB/s?

Sounds like you'd need several CPUs for that kind of speed. Even a socket 13xx i7 doesn't have much more theoretical bandwidth than that, and of course much less in practice.

Custom graphics cards with extra I/O ports might be one answer, but you'd still have to get the data to the GPU and its RAM somehow, and the only large-capacity interface in there is PCIe, which isn't very heavy-duty really. Rambus' Redwood interface in PS3 weighs in at 60GB/s IIRC just as a comparison, and PS3 launched years ago now so it's not exactly cutting-edge stuff.

Multiple GPUs per system then? Some chipsets allow for 2 full x16 interfaces and another x8 port. Maybe enough to cover the needs, if the system's internal I/O doesn't bottleneck when running all PCIe lanes at full tilt (I would suspect that's a distinct possibility)...
 
I wonder how they reach 40GB/s by using GPUs to process packets when PCIe only manages a tiny fraction of that speed. Are authentication and encryption so processing intensive that those packets fit within the 16GB/s (theoretical) capacity of PCIe, leaving the CPU to deal with the remaining 24GB/s?

The article was wrong (which has been corrected). The correct value is 40Gbps, instead of 40GB/s.
You can read the paper here: http://www.ndsl.kaist.edu/papers/packetshader.pdf

It seems that they use a single GTX 480 to perform forwarding, OpenFlow switching, and IPsec. IPsec is obviously a very good fit for the GPU. Forwarding, on the other hand, would depend on the packet size (when packets are larger, less work needs to be done per byte, so it's more likely to be bandwidth limited).
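As a back-of-the-envelope check on that point (my arithmetic, not the paper's): at 40 Gbps, minimum-size 64-byte Ethernet frames (84 bytes on the wire once you add preamble and inter-frame gap) arrive at roughly 40e9 / 672 bits = ~59.5 million packets per second, so the per-packet lookup cost dominates; with 1518-byte frames it's only about 3.3 Mpps, and the per-byte costs (PCIe and memory copies) become the bigger factor.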
 
The article was wrong (which has been corrected). The correct value is 40Gbps, instead of 40GB/s.
That does make a lot more sense yes.

A GTX480 for routing... Surely dedicated logic would be a lot more power efficient for that, even at 40gbit/s...?

Probably not cheaper though, but if you have the need for such bandwidth I'd think you'd rather eat the higher up-front cost for lower running cost and (much!) greater reliability. Placing all your communication load on an ASIC which readily reaches 90-100C during use in a large data center doesn't seem like a good long-term strategy. ;)
 
That does make a lot more sense yes.

A GTX480 for routing... Surely dedicated logic would be a lot more power efficient for that, even at 40gbit/s...?

Probably not cheaper though, but if you have the need for such bandwidth I'd think you'd rather eat the higher up-front cost for lower running cost and (much!) greater reliability. Placing all your communication load on an ASIC which readily reaches 90-100C during use in a large data center doesn't seem like a good long-term strategy. ;)

Not to mention the cpu+chipset+ram+hdd needed for this which are prolly not needed in routers.

Could make routers go sw only in the future though.
 
Not to mention the cpu+chipset+ram+hdd needed for this which are prolly not needed in routers.

High-end Juniper routers use a PC (two, actually) for the REs (Routing Engines), which run a heavily modified version of FreeBSD, make the routing decisions, and configure the hardware ASICs that do the actual heavy lifting.

Could make routers go sw only in the future though.

If you mean hardware accelerated by the GPU then yes, maybe.
 
In high-end packet-moving devices (CRS-1, Nexus 7k, Juniper T series) the issue isn't interface or ASIC forwarding rates, it's backplane throughput and being able to manage load across the box (from ingress to egress). Take a Nexus 7k (only because I know that product really well).

A Nexus 7k has 10 slots: two are for the "brain" (supervisor modules) and 8 are for interfaces.
There are up to 5 backplane modules; each module is a full mesh between the 10 slots and allows 40Gbit/s (usable) between each pair of slots, so a max throughput of up to 200Gbit/s between any two slots.

If we take the total aggregate, that's:

10 x 9 / 2 = 45 slot pairs
45 x 200 Gbit/s = 9 Tbit/s of traffic that can traverse the backplane

If we look at it line-card centric:
9 x 200 Gbit/s = 1.8 Tbit/s of backplane bandwidth to just one line card.

Now think about that in regard to a single 10-gig link on that line card: that's an oversubscription of only 180:1. If that ever actually happened, then in one second 1.79 Tbit of traffic would be tail dropped. Tail dropping traffic is bad; tail dropping that much traffic is network-destroying stuff. So the big part of all this is actually gracefully dropping (weighted random early detection or something else along those lines) or queueing packets on the ingress line cards before transmitting, when the egress port is full and already queueing packets. This all has to be done at line rate, otherwise the problem just snowballs until massive tail drop is reached again.
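(For reference, the RED idea is roughly: below a minimum average queue depth nothing is dropped; between the minimum and maximum thresholds packets are dropped with a probability that ramps up linearly, roughly p = p_max x (avg_queue - min_th) / (max_th - min_th); above the maximum threshold everything is dropped, which is just tail drop again. The "weighted" part means different thresholds and probabilities per traffic class.)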

The cost and complexity in a large aggregation device isn't the 10G/40G/100G forwarding rate of a port; it's handling the massive oversubscription that comes with a fully distributed, meshed backplane/fabric in relation to individual forwarding ports' line rates.

Then add the fact that most service-provider aggregation switches/routers contain ASICs with 1 million+ forwarding entries per port that need to be looked up at line rate, and I personally don't see this "breakthrough" solving anything.

rpg.314: routers used to be software only; we are actually at the point now where routers will only do something in hardware, and if they can't, they drop the packet. Up until very recently such packets would be "process switched", but in a low-latency, line-rate world there just isn't the time to send packets to a software engine to be forwarded.
 
Damnation,
How much power do these routers you speak of draw? A single GTX 480 can pull numbers approaching 400W max load just for the graphics card itself - although I doubt this packet shader would actually succeed in engaging the whole chip in quite the same way that say, Furmark for example, would...
 
Damnation,
How much power do these routers you speak of draw? A single GTX 480 can pull numbers approaching 400W max load just for the graphics card itself - although I doubt this packet shader would actually succeed in engaging the whole chip in quite the same way that say, Furmark for example, would...

Heh, the 10-slot Nexus 7k chassis that he is talking about can host up to 3 power supplies, each of which is specced for up to 7500W load. The chassis itself requires 2400W just to power up. And then between 200 and 600W for each slot depending on the type of module.

But that GPU is just a packet engine, you'll need a lot more than that to get your packets in and out of the box quickly.

Still it's an interesting concept and I would get a kick out of seeing the Nexus 1000v virtual switch ported over to CUDA.
 
Damnation,
How much power do these routers you speak of draw? A single GTX 480 can pull numbers approaching 400W max load just for the graphics card itself - although I doubt this packet shader would actually succeed in engaging the whole chip in quite the same way that say, Furmark for example, would...

A lot of this comes down to what interface types you use. For example, a 1000BASE-SX SFP is about 7 watts and an SFP+ uses about 1; this is because a whole bunch of stuff in SFP+ has been moved and centralised onto the line card instead of sitting on each and every module.

A Nexus 7010 has different power supply redundancy modes; most people run grid redundancy with power supply redundancy, which puts max load at around 9kW.

In terms of actual power usage (these are max numbers, so consider this under full load, which is likely anyway):

With redundant supervisor engines and 3 fabric modules (the current standard deployment):

fan usage: 300 watts
fabric: 165 watts
32 x 10gig line card: 750 watts
48 x 1gig line card: 400 watts
2 x supervisor module: 420 watts (210 each)
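As a rough worked total for that kind of config (my arithmetic, and assuming the fabric figure is per module; take an arbitrary mix of, say, four 10-gig cards and two 1-gig cards): 300 + 3x165 + 4x750 + 2x400 + 420 = roughly 5.0kW, which sits comfortably inside the ~9kW grid-redundant budget mentioned above.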

The Nexus platform is actually the first Cisco switching platform to provide real-time power usage reporting, but I work for an integrator and at the moment don't have access to any production customer gear to go and look at exact figures.

The two supervisor modules each run a dual-core C2D and have 4 gigs of RAM. The supervisor does all the "control plane" and "management plane" work: routing protocols, spanning tree, building of the routing tables (called the RIB), etc.

The line cards get a condensed copy of the RIB, called the FIB, and that is used to decide where to forward a frame to.

Back in the software routing days, things like MPLS could have a big impact on power usage, as the MPLS tag lookup is much smaller and much simpler than a full layer-3 lookup, but now things like that don't matter.

It's really hard to determine how much power a port uses, because a frame takes the following path through a switch:

ingress -> ingress ACL -> ingress QoS -> NAT -> forwarding lookup (for dest interface) -> virtual output queue -> fabric traversal -> forwarding lookup (for dest MAC) -> egress ACL -> egress QoS -> egress

cheers
 
Some seriously heavy-duty hardware then, in other words; a 300W power budget just for fans, holy smokes. :D Water cooling would probably be a lot more efficient at those levels, although it would of course require more complicated infrastructure with many separate waterblocks, piping, redundant pumps, etc... Perhaps just burning 300W constantly on fans would be cheaper in the long run, lol.

I wonder if the 40Gbit/s routing claimed in the OP's link really is comparable to the routing capability of these boxes. I suppose not, seeing as pixel shaders are only fast and efficient at relatively straightforward computing tasks...
 
With high-throughput routers, they don't actually do any real computational work; it's all just quick lookups on pre-"compiled" (for lack of a better word) tables. This can actually be very limiting compared to older routers, because you can't make a forwarding decision on something unless the hardware supports it.
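To picture what one of those "compiled" tables looks like when it lives on a GPU instead of an ASIC: the simplest version is a flat next-hop array indexed by the top bits of the destination address, with one thread per packet. The PacketShader paper uses a DIR-24-8-style table; the sketch below is a simplified, illustrative variant with made-up names, not their code:

// Flat next-hop table indexed by the top 24 bits of the destination IP.
// One thread per packet, one table read each; integer-only work throughout.
__global__ void lookup_next_hop(const unsigned int *dst_addrs,   // one IPv4 dest per packet
                                const unsigned short *tbl24,     // 2^24-entry pre-built FIB
                                unsigned short *out_ports,
                                int num_pkts)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_pkts)
        return;

    out_ports[i] = tbl24[dst_addrs[i] >> 8];   // >> 8 keeps the /24 prefix as the index
}

Everything in it is integer adds and memory reads; the limit is how many independent table lookups per second the GPU's memory system can sustain.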

It's possible that a throughput co-processor based on something like a GPU could be quite useful for certain layers in a network design (edge routers & the distribution layer), but that said, the most deployed router in the world (the Cisco 7200) doesn't even support floating point :oops: so I guess GPU uarchs actually carry quite a lot of overhead for routing, as it's all int-based.


cheers
 
The two supervisor modules each run a dual-core C2D and have 4 gigs of RAM. The supervisor does all the "control plane" and "management plane" work: routing protocols, spanning tree, building of the routing tables (called the RIB), etc.

I had seen Cisco using x86 crap on some SAN switches but didn't know they were using it on core routers now. Talk about disappointing. :cry:
 