h323decoder Primer

This article was the one I posted to another site way back in late 1990 when I was at Ascend Communications. I just found this article at an archive site today. As I don’t know how long this archive site continues its service, I decided to restore it on my personal site. Another reason for doing this was that I wanted to express my great appreciation to Yuzo Yamashita, one of my ex-colleagues at Ascend Communications who contributed a wrapper for my H.323 decoder and I would like to offer my sincere condolences for his loss in 2012.

h323decoder Primer

Motonori Shindo (mshindo@ascned.co.jp)

What is h323decoder?

The best way to understand how H.323 protocol works would be to examine the actual packets between a gatekeeper, gateways, and/or terminals. H.323-related protocols such as H.225.0 and H.245 use PER (Packed Encoding Rule) for encoding packets. While this encoding scheme generates very compact message, it’s very hard to decode for human (I tried it once and finally gave it up:-)). h323decoder is a tool to decode H.323 packets and show you the result in more human-friendly way. This tool will alleviate your pain to decode H.323 packets by hand and hopefully help you understand the protocols themselves better.

Generally speaking, making an H.323 decoder is NOT a trivial task. It basically requires an ASN.1 compiler and a PER decoder. Fortunately there’s an ambitious project called ‘OpenH323’ and this project has an open-source ASN.1 compiler and PER encoder/decoder. I made use of OpenH323’s code and wrote a simple H.323 decoder. I could not have been able to write this h323decoder without OpenH323 Project. I would like to say thank you all who contributed to this project!

How do I get the h323decoder?

You probably don’t need to get an h323decoder. There’s a CGI version of this decoder, so all you have to have is a browser (e.g. Netscape, Internet Explorer, etc.). [Sorry, this is under construction now.]

For those who want to have this program as a standalone program so that you can run it on your UNIX machine, please get this. This is a patch to openH323 code (0.8-alpha1). Please consult README file under h323decoder directory for more detail. To compile openH323 source code, you need to have a corresponding PWLib library. For the latest openH323 source code and PWLib library, please obtain them from here. To compile pwlib_min_1.10 under FreeBSD, it requres a patch which you can download from here. h323decoder may also compile and work under Windows, but I haven’t tried it myself. If someone succeeded, please let me know.

How do I use the h323decoder?

h323decoder takes a hex dump of the payload you want to decode as an input. For example, hex dump that is generated by the command ‘tcpdump -x’ can be an input, but other forms of a hex dump would also be accepted.

Honestly, h323decoder is not sophisticated at all, especially in terms of a human interface. Therefore, users of this decoder must be intelligent on behalf it:-) For example, since this decoder takes only H.323 payload and don’t strip any IP, TCP/UDP, TPKT header, you have to delineate these headers and strip them accordingly by yourself.

The decoder can handle three types of H.323 messages:

  • RAS/H.225.0 message
  • Q.931/H.225.0 message
  • H.245 Media System Control message

The decoder doesn’t automatically detect which type of message you’re trying to decode. Instead, you must specify it explicitly.

Once you extracted the payload that h323decoder can process and determined the type of a message you want to decode, the rest of the job is easy. If you’re using the CGI version of this decoder via a browser, select the type of the message from a pulldown menu, and enter the hex dump into the text area, then click the “submit” button. If you’re using a standalone version of h323decoder, you must specify the type of the message by the command line argument (‘ras’, ‘q931’, or ‘msc’), and then pass an hex dump to the decoder through the standard input. Let’s suppose you have an hex dump of RAS message in hexdump.txt file, then

will print out the result to the standard output.

Please note that current h323decoder can handle only one message at a time. You can not let it decode multiple messages in a sequence. There’s a perl script that can be used with h323decoder. Please refer to the end part of this document.

How do I extract an H.323 payload?

As I said before, determine the type of message you are about to decode. The first message you will come across might be a RAS/H.225.0 message.

Since RAS/H.225.0 uses UDP and port number 1719, you can easily find RAS/H.2250.0 messages in the packet trace. Knowing that IP header is (usually) 20 bytes long, and UDP header is 8 bytes long, it’s easy to extract a RAS message out of the IP packet. If you had a bad luck, it could be fragmented. In that case, you’ll have to reassemble them accordingly.

In case of Q.931/H.225.0 message, things get more complicated. Q.931/H.225.0 is a TCP message and usually uses a port number 1720, so finding a Q.931/H.225 message in the packet trace is relatively easy. Since TCP provides us a “stream” of packet and there is no logical boundaries in the stream, there must be a way to delineate the Q.931/H.225.0 messages on top of TCP. TPKT header is used for this purpose. Q.931/H.225.0 message is wrapped by TPKT header, which consists of 2 octets (‘0003’) and another 2 octets that represents the total length of the message (including the length of TPKT header itself).

Following to the TPKT header, you’ll see some Q.931 Information Elements (IEs). While this part contains some important information, H.323-specific data is stored as a Q.931’s ‘User-User Information Element’. This IE starts with ‘0x7e’ followed by 2-octet length field and finally a protocol discriminator which is always ‘0x05’. Forgetting all these details of Q.931 packet format, there’s a rule of thumb that will work most of the time; “Find ‘0x7e’ and get rid of the next 3 octets. The rest is (most likely) a Q.931/H.225.0 payload that can be fed into the h323decoder.”

In case of H.245 message, it’s hard to tell if it’s an H.245 message or not by just looking at the packet, because it doesn’t use a fixed port number. To know what port number is used for H.245 message, you need to examine the content of the previous Q.931/H225.0 messages exhcanged between the gateways and/or the terminals. But, practically, you may assume that the packet is H.245 if it’s not a RAS/H.225.0 nor a Q.931/H.225.0 message. H.245 uses TCP as the transport and encapsulate the packet using TPKT header in the same way as in Q.931/H.225.0 as I described before.

Since both UDP (RAS/H.225.0) and TCP (Q.931/H.225.0 and H.245) run on IP, you must take the possibility of a fragmentation and an out-of-order delivery of the packet into account when you try to decode the H.323 packets. In addition to that, please be aware that TCP doesn’t guarantee that a PDU (Q.931/H.225.0 or H.245 message) is put onto the network as a single packet (i.e. it can be chopped), therefore you may need to concatenate them until they makes sense. Similary, there may be a case where two PDUs are sent to the network as a single TCP packet (in this case, you’ll have to chop it into two or more H.323 PDUs by hand). Furthermore, there may be chance that TCP timeout occurs and a retransmission takes place. In this case, you have to throw away the duplicated packets, again, by hand.

Though the task I described above may look too hard, it isn’t that bad as you might imagine. If you get used to it, you just glance at the hex dump and can extract the H.323 payload relatively easily! Please give it a shot!

Could I have some examples?

Sure! Here’s a packet snooped by tcpdump. The following examples are all taken using Lucent’s (formally Ascend) MultiVoice system.

max2000.ascend.co.jp.1719 > navis.ascend.co.jp.1719: udp 80
4500 006c 6a02 0000 4011 f0ca c0a8 8911
caf6 0b04 06b7 06b7 0058 0000 2590 ffe2
0860 006d 0061 0078 0032 0030 0030 0030
0201 0044 a010 44a0 0240 0600 6d00 6100
7800 3200 3000 3000 3004 8036 8658 a33a
00c0 a889 1100 0000 a033 3200 c07b 601f
7e03 002b 211d d481 4734 a818

Since this is an UDP packet with port number 1719, this is a RAS message. First 20 octets are IP header (colored as light green) and next 8 octets UDP header (colored as purple). The rests (colored as red) are actual RAS/H.225.0 message and can be fed into h323decoder. The result of decoding is:

Next example is as follows:

max2000.ascend.co.jp.15003 > max2000-2.ascend.co.jp.1720: P 1:5(4) ack 1 win 4380
4500 002c 6c02 0000 4006 7b52 c0a8 8911
c0a8 8915 3a9b 06b8 0289 4f8c c3b8 4ed0
5018 111c 61a5 0000 0300 009e 0000
max2000.ascend.co.jp.15003 > max2000-2.ascend.co.jp.1720: P 5:159(154) ack 1 win 4380
4500 00c2 6d02 0000 4006 79bc c0a8 8911
c0a8 8915 3a9b 06b8 0289 4f90 c3b8 4ed0
5018 111c 1690 0000 0802 3332 0504 0388
90a5 7004 a131 3137 7e00 8705 40f8 0600
0891 4a00 0100 c0a8 8911 3a9c 0240 0600
6d00 6100 7800 3200 3000 3000 3004 8036
8658 a33a 28c0 b500 0014 224d 756c 7469
566f 6963 6520 4761 7465 7761 7920 666f
7220 7468 6520 4d41 5820 3630 3030 0437
2e30 2e30 0001 0100 44a0 c0a8 8915 06b8
0000 c07b 601f 7e03 002b 211d d481 4734
a800 0c07 00c0 a889 1100 0000 0503 4e55

Since this is a TCP packet with a port number 1720, this is a Q.931/H.225.0 message. Note that these are two IP packets but these form a single Q.931/H.225.0 message. Similar to the first example, the first 20 octet is an IP header (colored as light green) and next 20 octets is a TCP header (colored as purple). Next 4 octets ‘0300 009e’ in the first IP packet is a TPKT header (colored as orange). Next ‘0000’ is a garbage (See the TCP sequence number of this packet) and you should ignore them. (These garbage octets are printed out because the minimum MTU size of Ethernet is 46 octets, and tcpdump prints out these octets anywa). Preceding octets up to ‘7e00 8705′ are Q.931 IEs. The rest part starting from ’40f8 …’ (colored as red) are the Q.931/H.225.0 message. Only the part that is colored as red should be fed into the h323decoder. The result of decoding this hex dump is:

Final example is an H.245 message.

max2000-2.ascend.co.jp.15009 > max2000.ascend.co.jp.15004: P 1:90(89) ack 1 win 4380
4500 0081 e14d 0000 4006 05b2 c0a8 8915
c0a8 8911 3aa1 3a9c c3ba 2276 028b 29e9
5018 111c 20d7 0000 0300 004e 0270 0106
0008 8175 0002 800d 0000 3c00 0100 0001
0000 0100 0003 8000 0020 c03b 8000 0108
a817 6f40 0002 2200 0740 0003 09f8 0def
404a 3700 5040 0100 0080 0001 0100 0000
0201 0001 0003 0300 000b 0100 4680 824b

This is a TCP packet and uses ephemeral port number (1024 or greater) at both ends. H.245 messages usually look this way. IP header and TCP header are colored as light green and purple respectively. The next 4 octets are a TPKT header (colored as orange). But carefully examining the length field (‘004e’) in the TPKT header, this IP packets actually contains more than one H.245 messages (in fact, two)! This is yet another TCP difficulty as I described before. These two messages are a terminal capability set request and a master/slave determination request defined in H.245. Here’s the result of decoding of these messages:

6. What is ipdecode for?

ipdecode is a perl script that can be used with h323decoder. This script can intelligently examine the output of tcpdump, strip the unnecessary headers, and then pass it to h323decoder appropriately. Using this script, you don’t have to worry about the tedious procedure I described above. You can simply do as follows:

This will probably give you a desired result.

ipdecode is written and contributed by Yuzo Yamashita <yyamashita@ascend.co.jp> who is one of my coworkers.

Motonori Shindo <mshindo@ascend.co.jp>
Tuesday, September 28, 1999 11:30:58 AM

NAT still can be tricky

The following discussion applies to both NAPT and NAT, hence I refer to it simply as NAT hereafter.

Viptela provides a way to configure the authentication method for IPsec as you can see below:

ah-sha1-hmac and sha1-hmac must be self-explanatory but what the hell is ah-no-id?

In general, IPsec AH won’t work if there is a NAT somewhere between the endpoints as the fields that NAT would overwrite are protected by AH. That said, AH can still be used in Viptela’s solution even if NAT is present because the architecture allows to determine the IP address and port number before and after the translation by NAT. This works as long as the fields that are overwritten by NAT are limited to IP address and port number, but it will break if NAT overwrites other fields such as Identification field (I simply call it “ID field” hereafter) in IP header. “ah-no-id” parameter is available to work around this situation by ignoring ID field when computing AH.

In the beginning, I was skeptical about such a NAT implementation that overwrites the ID field really exists but it turned out that there are a few NAT implementations out there that specifically do this, notably Apple AirMac Extreme and AirMac Express. Luckily, I had AirMac Express handy, so I did some test by

and compare the packet before and after NAT translation (please note that SYN flag by -S option is set here because most NAT implementation is “stateful” and simply drops the packet if SYN packet doesn’t proceed).

Before NAT Translation (AirPort Express):

After NAT Translation (AirPort Express):

As you can see, ID field is overwritten from 12345 to 14428.

Let’s compare this with other NAT implementation. I took VyOS as an example.

Before NAT Translation (VyOS):

After NAT Translation (VyOS):

In this case, ID field (12345) is kept intact before and after the NAT translation.

Up until recently, I’ve believed that no intervening device should modify ID field as this field is used for reassembling which will only happen at the ultimate destination of IP datagram and hence NAT implementation that rewrites ID field is consider to be “misbehaving” if not “violating”. This has changed after I read RFC6864 “Updated Specification of the IPv4 ID Field”.

This RFC classifies datagrams into two kinds:

Datagrams that are not yet fragmented and for which further fragmentation has been inhibited
Datagrams that are either already been fragmented or for which fragmentation remains possible

Historically, ID field had other usages than reassembling (e.g. de-duplication). This RFC clearly limits the usage of ID field only for fragmentation and reassembling and relaxing some requirements imposed on ID field as it has no meaning for atomic datagrams. However, for non-atomic datagram, this RFC still mandates (just like RFC791 does) that the value in ID field maintains the uniqueness within MDL (Max Datagram Lifetime, typically 2 minutes).

This requirement puts a challenge on middle-box like NAT. Specifically, this RFC says:

NATs/ASMs/rewriters present a particularly challenging situation for
fragmentation. Because they overwrite portions of the reassembly
tuple in both directions, they can destroy tuple uniqueness and
result in a reassembly hazard. Whenever IPv4 source address,
destination address, or protocol fields are modified, a
NAT/ASM/rewriter needs to ensure that the ID field is generated
appropriately, rather than simply copied from the incoming datagram.

>> Address-sharing or rewriting devices MUST ensure that the IPv4 ID
field of datagrams whose addresses or protocols are translated
comply with these requirements as if the datagram were sourced by
that device.

Now I fully agree with this RFC; in order to guarantee the uniqueness of the value of ID field, whenever the device like NAT that rewrites IP address, it should generate a unique ID as if the packet was sent from that device and should not simply copy the ID field from the original packet. Although majority of NAT implementations today would behave like VyOS based on my past experience, I think Apple AirMac Extreme/Express is more compliant to the specification and well-behaving.

Sorry Apple, I’ve been treating you bad..

OVN (Open Virtual Network) Introduction

OVN (Open Virtual Network) is an open source software that provides virtual networking for features like L2, L3 and ACL, etc. OVN works with OVS (Open vSwitch) which has been adopted widely. OVN has been developed by the same community as that of OVS and it is treated almost like a sub-project of OVS. Just like OVS, the development of OVN is 100% open and discussion is being made on a public mailing list and IRC. While VMware and RedHat are the primary contributors today, it is open to everybody who wishes to contribute to OVN.

The target platform for OVN is Linux-derived hypervisors such as KVM and Xen and containers. DPDK is also supported. As of this writing, there is no ESXi support. Because OVS has been ported to Hyper-V (still in progress, though), OVN may support Hyper-V in the future.

It is important to note that OVN is CMS (Cloud Management System) agnostic and takes OpenStack, Docker and Mesos into the scope. Among these CMPs, OpenStack would have the highest significance for OVN. A better integration with OpenStack (compared to Neutron OVS plugin as of today) is probably one of the driving factors of OVN development.

OVN share the same goal as OVS, that is, supporting large scale deployment consisting of thousands of hypervisors with production quality.

Please see the diagram depicted below showing the high level architecture of OVN.

OVN Architecture

OVN Architecture

As you can see, OVN has two new components (processes). One is “ovn-northd” and another is “ovn-controller”. As its name implies, ovn-northd provides a northbound interface to CMS. As of this writing, only one ovn-northd exists in one OVN deployment. However, this part will be enhanced in the future to support some sort of clustering for redundancy and the scale-out.

One of the most unique architectures of OVN is probably the fact that the integration point with CMS is a database (DB). Although many people would expect that RESTful API is the integration point as a northbound interface, ovn-northd is designed to interact with CMS via DB.

ovn-northd uses two databases. One is called “Northbound DB”, which holds a “desired state” of virtual network. More specifically, it holds information about logical switch, logical port, logical router, logical router port, and ACL. It is worth noting that Northbound DB doesn’t hold any “physical” information. An entity sitting north side of ovn-northd (typically CMS) writes information to this Northbound DB to interact with ovn-northd. Another database that ovn-northd takes care of is “Southbound DB” which holds runtime state such as physical, logical and binding information. Specifically, Southbound DB includes information related to chassis, datapath binding, encapsulation port binding, and logical flows. One of the important roles of ovn-northd is to read Northbound DB, translates it to logical flows and write them to Southbound DB.

On the other hand, ovn-controller is a distributed controller, which will be installed on every single hypervisor. ovn-controller reads information from Southbound DB and configures ovsdb-server and ovs-vswitchd running on the same host accordingly. ovn-controller translates logical flow information populated in Southbound DB to physical flow information and then install it to OVS.

Today OVN uses OVSDB as a database system. Inherently database can be anything and doesn’t have to be OVSDB. However, as OVSDB is an intrinsic part of OVS, on which OVN always depends, and developers of OVN know the characteristics of OVSDB very well, they decided to use OVSDB as a database system for OVN for the time being.

Basically OVN provides three features; L2, L3 and ACL (Security Group).

As an L2 feature, OVN provides a logical switch. Specifically, OVN creates an L2-over-L3 overlay tunnel between hosts automatically. OVN uses Geneve as a default encapsulation. Considering the necessity of metadata support, multipath capability and hardware acceleration, Geneve would be the most desired encapsulation of choice. In case where hardware acceleration on the NIC for Geneve is not available, OVN allows to use STT as the second choice. In general, HW-VTEP doesn’t support Geneve/STT today, OVN uses VXLAN when talking to HW-VTEPs.

In terms of L3 feature, OVN provides so called a “distributed logical routing”. L3 features provided by OVN are not centralized meaning each host executes L3 function autonomously. L3 topology that OVN supports today is very simple. It routes traffic between logical switches directly connected to it and to the default gateway. As of this writing it is not possible to configure a static route other than the default route. It is simple but sufficient to support basic L3 function that OpenStack Neutron requires. NAT will be supported soon.

OVS plugin in OpenStack Neutron today implements Security Group by applying iptables to tap interface (vnet) on Linux Bridge. Having both OVS and Linux Bridge at the same time makes the architecture somewhat complex.


Conventional OpenStack Neutron OVS plugin Architecture (An excerpt from http://docs.ocselected.org/openstack-manuals/kilo/networking-guide/content/figures/6/a/a/common/figures/under-the-hood-scenario-1-ovs-compute.png)

Since OVS 2.4, OVS has been integrated with “conntrack” feature available on Linux, so it is possible to implement stateful ACL by OVS without relying on iptables. OVN takes advantages of this OVS & conntrack integration to implement the ACL.

Since conntrack integration is an OVS feature, one can use OVS+conntrack without OVN. However, OVN allows you to use stateful ACL without explicit awareness of conntrack because OVN compiles logical ACL rules to conntrack-based rules automatically, which would be appreciated by many people.

I will go into a bit more detail about L2, L3 and ACL features of OVN in the subsequent posts.

NetFlow on Open vSwitch

Open vSwitch (OVS) has been supporting NetFlow for a long time (since 2009). To enable NetFlow on OVS, you can use the following command for example:

When you want to disable NetFlow, you can do it in the following way:

NetFlow has several versions. V5 and V9 are the most commonly used versions today. OVS supports NetFlow V5 only. NetFlow V9 is not supported as of this writing (and very unlikely to be supported because OVS already supports IPFIX, a direct successor of NetFlow V9).

NetFlow V5 packet consists of a header and flow records (up to 30) following the header (see below).

NetFlow V5 header format NetFlow V5 header format

NetFlow V5 flow record format NetFlow V5 flow record format

NetFlow V5 cannot handle IPv6 flow records. If you need to monitor IPv6 traffic, you must use sFlow or IPFIX.

Comparing to NetFlow implementation on typical routers and switches, the one on OVS has a couple of unique points that you should keep in mind. I will describe such points below.

Most NetFlow-capable switches and routers support so called a “sampling”, where only subset of packets are processed for NetFlow (there are couple of ways how to sample the packet but it is beyond the scope of this blog post). NetFlow on OVS doesn’t support sampling. If you need to sample the traffic, you need to use sFlow or IPFIX instead.

Somewhat related to the fact that NetFlow on OVS doesn’t do sampling, it is worth noting that “byte count (dOctets)” and “packet count (dPkts)” fields in a NetFlow flow record (both are 32bit fields) may wrap around in case of elephant flows. In order to circumvent this issue, OVS sends multiple flow records when bytes or packets count exceeds 32bit maximum so that it can tell the collector with the accurate bytes and packets count.

Typically, NetFlow-capable switches and routers have a per-interface configuration to enable/disable NetFlow in addition to a global NetFlow configuration. OVS on the other hand doesn’t have a per-interface configuration. Instead it is a per-bridge configuration that allows you to enable/disable NetFlow on a per-bridge basis.

Most router/switch-based NetFlow exporters allow to configure the source IP address of NetFlow packet to be exported (and if it is the case, loopback address is a reasonable choice for this address). OVS, however, doesn’t have this capability. The source IP address of NetFlow packet is determined by the IP stack of host operating system and it is usually an IP address associated with the outgoing interface. Since NetFlow V5 doesn’t have a concept like “agent address” in sFlow, most collectors distinguish the exporters by the source IP address of NetFlow packets. Because OVS doesn’t allow us to configure the source IP address of NetFlow packets explicitly, we’d better be aware that there is a chance for the source IP address of NetFlow packets to change when the outgoing interface changes.

Although it is not clearly described in the document, OVS, in fact, supports multiple collectors as shown in the example below. This configuration provides redundancy of collectors.

When flow-based network management is adopted, In/Out interface number included in a flow record has a significant importance because it is often used to fitter the traffic of interest. Most commercial collector products have such a sophisticated filtering capability. Router/switch-based NetFlow exporter uses SNMP’s IfIndex to report In/Out interface number. NetFlow on OVS on the other hand uses OpenFlow port number instead to report In/Out interface number in the flow record. OpenFlow port number can be found by ovs-ofctl command as follows:

In this example, OpenFlow port number of “eth1” is 1. Some In/Out interface number have a special meaning in OVS. An interface local to the host (interfaces that is labeled as “LOCAL” in the example above) is represented as 65534. Output interface number 65535 is used for broadcast/multicast packets whereas most router/switch-based NetFlow exporters use “0” for both these two cases.

Those who are aware that IfIndex information was added to “Interface” table in OVS relatively recently may think that using IfIndex instead of OpenFlow port number is the right thing to do. May be true, but it is not that simple. For example, tunnel interface created by OVS doesn’t have an IfIndex, so it is not possible to export a flow record for traffic that traverses over tunnel interfaces if we simply chose IfIndex as NetFlow In/Out interface number.

NetFlow V5 header has a field called “Engine ID” and “Engine Type”. How these fields are set in OVS by default depends on the type of datapath. If OVS is run in a user space using netdev, Engine Type and Engine ID are derived from a hash value based on the datapath name and Engine ID and Engine Type are the most significant and least significant 8 bit of the hash value respectively. On the other hand, in the case of Linux kernel datapath using netlink, IfIndex of datapath is set to both Engine ID and Engine Type. You can find the IfIndex of OVS datapath by the following command:

If you don’t like the default value of these fields, you can configure them explicitly as shown below:

The example above shows the case where Engine Type and Engine ID are set at the same time when NetFlow configuration is created for the first time. You can also change NetFlow-related parameters after NetFlow configuration is created by doing like:

In general, typical use case of Engine Type and Engine ID is to distinguish logically separated but physically a single exporter. Good example of such a case is Cisco 6500, which has MSFC and PFC in a single chassis but having separate NetFlow export engines. In OVS case, it can be used to distinguish two or more bridges that is generating NetFlow flow records. As I mentioned earlier, the source IP address of NetFlow packet that OVS exports is determined by the standard IP stack (and it is usually not the IP address associated to the bridge interface in OVS that is NetFlow-enabled.) Therefore it is not possible to use the source IP address of NetFlow packets to tell which bridge is exporting the flow records. By setting a distinct Engine Type and Engine ID on each bridge, then you can tell it to the collector. To my knowledge, not many collectors can use Engine Type and/or Engine ID to distinguish multiple logical exporters however.

There is another use case for Engine ID. As I already explained that OVS uses OpenFlow port number as an In/Out interface number in NetFlow flow record. Because OpenFlow port number is a per bridge unique number, there is a chance for these numbers to collide across bridges. To get around this problem, you can set “add_to_interface” to true.

When this parameter is set to true, the 7 most significant bits of In/Out interface number is replaced with the 7 least significant bits of Engine ID. This will help interface number collision happen less likely.

Similar to typical router/switch-based NetFlow exporters, OVS also has a concept of active and inactive timeout. You can explicitly configure active timeout (in seconds) using the following command:

If it is not explicitly specified, it defaults to 600 seconds. If it is specified as -1, then active timeout will be disabled.

While OVS has an inactive timeout mechanism for NetFlow, it doesn’t have an explicit configuration knob for it. When flow information that OVS maintains is removed from the datapath, information about those flows will also be exported via NetFlow. This timeout is dynamic; it varies depending on many factors like the version of OVS, CPU and memory load etc., but typically it is 1 to 2 seconds in recent OVS. This duration is considerably shorter than that of typical router/switch-based NetFlow exporters which is 15 seconds in most cases.

As with most router/switch-based NetFlow exporters, OVS exports flow records for ICMP packet by filling the number calculated by “ICMP Type * 256 + Code” to source port number in a flow record. Destination port number for ICMP packets is always set to 0.

NextHop, source and destination of AS number, source and destination of Netmask are always set to 0. This is an expected behavior as OVS is inherently a “switch”.

While there are some caveats as I described above, NetFlow on OVS is a very useful tool if you want to monitor traffic handled by OVS. One advantage of NetFlow over sFlow or IPFIX is that there are so many open source or commercial collectors available today. Whatever flow collector you choose most likely supports NetFlow V5. Please give it a try. It will give you a great visibility about the traffic of your network.

BGP4 library implementation by Ruby

A while back, I posted a simple BGP4 library implemented by Ruby to GitHub. I wrote this code when I was working at Fivefront, which was selling GenieATM, a NetFlow/sFlow/IPFIX collector.


Using this library, you can easily write a code that speaks BGP4 as shown below:

I wrote this code primarily because we were so “poor” at that time (we were a startup with just 5 people) and couldn’t afford to buy a decent physical router such as Cisco or Juniper 🙂 Another reason was that we needed to inject large number of routes (e.g. 1,000,000 routes) to GenieATM, in which case Zebra did do a good job as it has to keep all routes in memory. These situations motivated me to write a stateless BGP code for route injection.

Over time, I added more features as needed and the code became more generic. Sometimes it gave me a great help, for example, when I needed to debug an issue introduced by a subtle difference of how BGP4 Capability Announcement is made by Juniper and Force10 (now DELL). Juniper announces capability by listing multiple Capability Advertisement that includes only one capability. On the other hand, Force10 announces single Capability Advertisement that includes multiple capabilities. This BGP4 library was useful to simulate these two different behaviors.

Because of the intended use case of this library code, the abstraction level of BGP was set low intentionally. It could have been abstracted more, but then it would in turn loose the flexibility of crafting a BGP message arbitrarily.

Please note that there are some limitations and constrains because this code was primarily developed to test GenieATM, a flow collector. First, while BGP session can be established from either peer in general, this code will never initiate BGP session from itself. Instead, it simply expects the peer to establish a session (a.k.a. passive mode). Second, no matter what messages are sent by the peer, it doesn’t do anything with it. In essence, this BGP4 library aims to experiment how the peer BGP implementation behaves when various kinds of BGP messages are sent.

Not to mention, this code is not a complete implementation of BGP4. Compared to other complete BGP4 implementations like Zebra/Quagga, this code is just a toy. That said, I decided to make this code available to public hoping that someone who has a similar use case as mine may find this code useful. I would like to say thanks to Tajima-san, who suggested me to publish this code to Github.

A book about NSX

It has been my desire that I would like to contribute something “visible” to the world and my dream came true (at least to some extent). I am pleased to announce that I wrote a book about VMware NSX with a couple of my colleagues, which can be found here.

VMware NSX Illustrated

VMware NSX Illustrated

This is not a translation of an existing book, instead it is written from scratch. We apology that this books is available only in Japanese for the time being, while we are getting several requests outside Japan for the translation of this book into other languages. We are not certain if such a work will take place, but we do hope it will. That said,  you can purchase it today  from www.amazon.co.jp (sorry it is not available on www.amazon.com) if you’re interested.

The table of contents looks like as follows:

  • Chapter 01 Technical Background and Definitions
  • Chapter 02 Standardization
  • Chapter 03 Some challenges in networking today
  • Chapter 04 Network Virtualization API
  • Chapter 05 Technical Details about NSX
  • Chapter 06 OpenStack and Neutron
  • Chapter 07 SDDC and NSX

among which I was responsible for the first half of Chapter 05 and some columns. While I have contributed many articles to magazines and books in the past, it was the first time for me that my name was explicitly listed as authors, so it was a profound experience.

What drove us to publish this book was the experiences in our day to day activities. Because network virtualization is a relatively new concept, we’ve often come across a situation where people don’t understand it very well or even misunderstand it sometimes. In a sense it was our fault because we haven’t been able to provide enough information about NSX publicly particularly in Japan market. To change this situation the best way we thought was to write a book about NSX and we made it!

It was not an easy journey though. It was almost an year ago when we first came up with an idea to write this book and we had many chances to give it up. I would like to say thank you for all co-authors (i.e. Mizumoto-san, Tanaka-san, Yokoi-san, Takata-san and Ogura-san) for their efforts that made this book available at vForum Tokyo 2014, which was tough I know. I am also very thankful to Mr. Maruyama (Hecula Inc.) and Mr. Hatanaka (Impress Inc.) for their night and day devotion for editing this book.

Publishing the book is not our goal. We hope that this book helps people understand network virtualization better and get more traction on NSX.

Geneve on Open vSwitch

A few weeks back, I posted a blog (sorry it was done only in Japanese) about a new encapsulation called “Geneve” which is being proposed to IETF as an Internet-Draft. Recently the first implementation of Geneve became available for Open vSwitch (OVS) contributed by Jesse Gross, a main author of Geneve Internet Draft, and the patch was upstream to a master branch on github where the latest OVS code resides. I pulled the latest source code of OVS from github and played with Geneve encapsulation. The following part of this post explains how I tested it. Since this effort is purely for testing Geneve and nothing else, I didn’t use KVM this time. Instead I used two Ubuntu 14.04 VM instances (host-1 and host-2) running on VMware Fusion with the latest OVS installed. In terms of VMware Fusion configuration, I assigned 1 Ethernet NIC on each VM which obtains an IP address from DHCP provided by Fusion by default. In the following example, let’s assume that host-1 and host-2 obtained an IP address and respectively. Next, two bridges are created (called br0 and br1), where br0 connecting to the network via eth0 while br1 on each VM talking with each other using Geneve encapsulation.

Geneve Test with Open vSwitch Geneve Test with Open vSwitch

OVS configuration for host-1 and host-2 are shown below:

Once this configuration has been done, now ping should work between br1 on each VM and those ping packets are encapsulated by Geneve.

Let’s take a closer look how Genve encapsulated packets look like using Wireshark. A Geneve dissector for Wireshark became available recently (this is also a contribution from Jesse, thanks again!) and merged into the latest master branch. Using this latest Wireshark, we can see how Geneve packet looks like as follows:

Geneve Frame by Wireshark

Geneve Frame by Wireshark

As you can see, Geneve uses 6081/udp as its port number. This is a port number officially assigned by IANA on Mar.27, 2014. Just to connect two bridges together by Geneve tunnel, there’s no need to specify a VNI (Virtual Network Identifier) specifically. If VNI is not specified, VNI=0 will be used as you can see in this Wireshark capture.

On the other hand if you need to multiplex more than 1 virtual networks over a single Geneve tunnel, VNI needs to be specified. In such a case, you can designate VNI using a parameter called “key” as an option to ovs-vsctl command as shown below:

The following is a Wireshark capture when VNI was specified as 5000 (0x1388):

Geneve Frame with VNI 5000 by Wireshark

Geneve Frame with VNI 5000 by Wireshark

Geneve is capable of encapsulating not only Ethernet frame but also arbitrary frame types. For this purpose Geneve header has a field called “Protocol Type”. In this example, Ethernet frames are encapsulated so this filed is specified as 0x6558 meaning “Transparent Ethernet Bdiging”.

As of this writing, Geneve Options are not supported (more specifically, there is no way which Geneve Options to be added to Geneve header). Please note that Geneve Options are yet to be defined in Geneve’s Internet Draft. Most likely a separate Internet Draft will be submitted to define Geneve Options sooner or later. As such a standardization process progresses, Geneve implementation in OVS will also evolve for sure.

Although Geneve-aware NIC which can perform TSO against Geneve encapsulated packets is not available on the market yet, OVS is at least “Geneve Ready” now. Geneve code is only included in the latest master branch of OVS at this point, but it will be included in the subsequent official release of OVS (hopefully 2.2). When that happens you can play with it more easily. Enjoy!