Overlay technologies and me

The other day, someone told me about this article (the article is in Japanese; a related article in English is here). It was a great surprise to learn that Pat Gelsinger, CTO of Intel at the time that article was written and CEO of VMware today, had envisioned the overlay network concept back in 2004. Considering that he was at Intel then, I imagine the environment he envisioned was desktop and notebook computers connecting to multiple overlay networks to access various services. Overlay networks haven't reached our desktop environment to the extent Pat may have envisioned, but the Cloud and the datacenter today have come very close to his vision. It is remarkable that he had such foresight in 2004.

Admittedly, I don't have insight like Pat's, but I have worked with many overlay network technologies to date. Let me look back at the overlay network technologies I have been involved in.

I joined Ascend Communications in 1997. Ascend dominated the remote access server market, and most ISPs were using its products. I joined Ascend as an SE without any experience as a network engineer. For the first year, I kept asking myself how much I was actually contributing to the company. I was struggling among very talented people who could debug PPP from a hex dump with ease. Then I realized that I should do something different from what everyone else was doing. As a start, I chose the Ascend Tunneling Management Protocol (ATMP), which was my first encounter with overlay network technology. Dial-up tunneling protocols were in chaos at that time because of competing vendor-led protocols such as ATMP, PPTP, and L2F. To settle this chaotic situation, the IETF decided to create a vendor-neutral tunneling protocol called L2TP. Since Ascend was one of the companies leading this standardization effort, I was able to access Ascend's early implementation of L2TP. At roughly the same time, a volunteer-based group called "vpnops," which was testing various VPN technologies and sharing operational insights, called for interoperability testing of L2TP. I was lucky enough to be invited. To debug the L2TP protocol during the interoperability test, I wrote an L2TP dissector for tcpdump, which was later merged into the trunk of tcpdump thanks to itojun-san.

Through these experiences, I ended up with a decent knowledge of L2TP. In the meantime, NTT was preparing the first flat-rate internet access service in Japan, called "FLETS ISDN," which could leverage L2TP as the basis of its architecture. I was seemingly the person with the most L2TP experience at Ascend Japan, and as a consequence I was engaged in this project. The project was indeed a pivot point in my life as a network engineer, and I can't thank the people who helped me in it enough. By the end, I could recite every control message the L2TP protocol defines from memory (LoL).

One thing I regretted at Ascend was that I couldn't get involved with IPsec as a technology. Ascend had a prototype, but it never shipped as a product. Because I was so interested in IPsec, I decided to leave Ascend and join CoSine Communications. IPsec was not the only reason, but it was undoubtedly one of the biggest reasons I joined CoSine. What CoSine was trying to do was create an overlay-type VPN by implementing pure virtual routers and connecting them with IPsec tunnels. Considering that this was the time when VRF was just emerging as part of BGP/MPLS VPN, what CoSine was attempting was very advanced. Nevertheless, the quality of CoSine's software was not very good, which troubled many of the people involved. Please accept my apologies.

In 2000, VPNs were categorized into two models: the "peer model" and the "overlay model." The most representative form of peer-model VPN was the so-called BGP/MPLS VPN (RFC2547/RFC4364), whereas the VPN model CoSine adopted was an overlay model. It is worth noting that "overlay" here doesn't refer to the data plane (i.e., the presence or absence of tunneling/encapsulation). Instead, it refers to how routing information from customer sites is handled. In the peer model, routing information from customer sites and from the service provider's backbone is treated equally, hence the name "peer" model. In the overlay model, the service provider is oblivious to the routing information from customer sites; that information is typically exchanged between customer edge devices by running a routing protocol over a tunnel.

We sometimes had flame wars about the two types of VPN on IETF mailing lists. Most of the time, the peer-model advocates predominated over the overlay-model advocates. The typical claim from the peer-model camp was "the overlay model won't scale," while the overlay-model camp claimed it would. Putting aside which side was right, BGP/MPLS VPN was clearly the winner from a commercial perspective.

In 2011, I came across Nicira (please read this article about how it all happened; sorry, it's only available in Japanese). Coincidentally, Nicira was also a company built on overlay technology. The main idea of NVP, the software being developed by Nicira, was to decouple network functions from the underlying hardware. NVP realized virtual networks by connecting virtual switches with a tunneling protocol called STT (Stateless Transport Tunneling). VMware acquired Nicira in 2012. The core of NVP was incorporated into NSX, one of VMware's mainstream products, which has become a billion-dollar run-rate business today.

After I left VMware, I joined Viptela, an SD-WAN startup, in 2016. SD-WAN is also based on overlay technology; most SD-WAN products use IPsec tunnels as the data plane. The flexibility and ubiquity of SD-WAN come not only from the presence of a "controller" but also from the overlay network architecture. In the context of the peer model vs. overlay model described above, SD-WAN falls into the overlay-model category, because customer routing information is exchanged by the CPEs; the service provider is not involved. Recalling the harsh "bashing" of the overlay-model VPN, I can't help feeling somewhat delighted to see an overlay technology like SD-WAN deployed at large scale and working just fine 15 years after the flame war.

Performance overhead always comes with overlay technologies as an inevitable tax, and people sometimes criticize overlay technologies for that overhead. I think such criticism is short-sighted. A new overlay technology often comes with a new encapsulation that existing hardware does not handle well, which leads to performance degradation. However, such performance issues are usually resolved over time; encapsulation, in particular, is a relatively easy problem for hardware to solve. In the long run, the benefit of the "abstraction" realized by overlay technologies far outweighs the tax we pay for it.

Let's take a few examples (though not from the area of networking). Modern computers and operating systems use virtual memory, mapping memory from a virtual address space to a physical address space. This mapping is possible thanks to mechanisms in the CPU such as the MMU and TLB. No one today doubts the benefit of virtual memory merely because of the potential performance penalty; it is obvious that the advantages far outweigh the disadvantages. Consider another example. In the early days, it was believed that the x86 CPU could not be virtualized without sacrificing significant performance due to its inherent architecture. (BTW, VMware was the first company to prove it could be done with reasonable performance.) However, with the advent of virtualization-assist features introduced on CPUs, such as Intel VT and AMD-V, it became possible to virtualize the x86 processor without significant performance degradation. The same thing is very likely to happen to the overlay network: abstraction brings innovation first, and the associated overhead is remedied with the help of hardware later.

I have to confess, however, that I have not been choosing technologies and companies for the last 20 years according to the clear insight or philosophy described above. It was more of a coincidence. Or perhaps I simply like the abstraction that overlay technology provides, and the sense of excitement that comes with it.

Until recently, networks were built and managed by "touching" the networking gear, which was always physically accessible. Today, however, we are in the Cloud Era. There are, of course, switches and routers in the Cloud, but they are no longer accessible to us. How, then, can we build and manage a network end-to-end in an environment where not all network gear is accessible? It seems very natural to me that in the Cloud Era we should build networks in software, as overlays, and manage them end-to-end.

With all that being said, I want to stay actively involved in overlay technologies.

h323decoder Primer

This article is one I posted to another site back in the late 1990s, when I was at Ascend Communications. I found it on an archive site today, and since I don't know how long that archive site will continue its service, I decided to restore it on my personal site. Another reason is that I wanted to express my great appreciation to Yuzo Yamashita, an ex-colleague at Ascend Communications who contributed a wrapper for my H.323 decoder, and to offer my sincere condolences on his passing in 2012.

h323decoder Primer

Motonori Shindo (mshindo@ascend.co.jp)

What is h323decoder?

The best way to understand how the H.323 protocol works is to examine actual packets exchanged between a gatekeeper, gateways, and/or terminals. H.323-related protocols such as H.225.0 and H.245 use PER (Packed Encoding Rules) for encoding packets. While this encoding scheme generates very compact messages, it is very hard for a human to decode (I tried it once and finally gave up :-)). h323decoder is a tool that decodes H.323 packets and shows the result in a more human-friendly way. It will alleviate the pain of decoding H.323 packets by hand and hopefully help you understand the protocols themselves better.

Generally speaking, making an H.323 decoder is NOT a trivial task; it basically requires an ASN.1 compiler and a PER decoder. Fortunately, there is an ambitious project called 'OpenH323,' which has an open-source ASN.1 compiler and PER encoder/decoder. I made use of OpenH323's code and wrote a simple H.323 decoder. I could not have written h323decoder without the OpenH323 Project. I would like to thank everyone who has contributed to it!

How do I get the h323decoder?

You probably don't need to get h323decoder itself. There is a CGI version of this decoder, so all you need is a browser (e.g. Netscape, Internet Explorer, etc.). [Sorry, this is under construction now.]

For those who want to run this program standalone on a UNIX machine, please get this. It is a patch against the openH323 code (0.8-alpha1); consult the README file under the h323decoder directory for more details. To compile the openH323 source code, you need the corresponding PWLib library. The latest openH323 source code and PWLib library can be obtained from here. To compile pwlib_min_1.10 under FreeBSD, it requires a patch which you can download from here. h323decoder may also compile and work under Windows, but I haven't tried it myself. If someone succeeds, please let me know.

How do I use the h323decoder?

h323decoder takes as input a hex dump of the payload you want to decode. For example, the hex dump generated by the command 'tcpdump -x' can be used as input, but other forms of hex dump are also accepted.

Honestly, h323decoder is not sophisticated at all, especially in terms of its human interface; users of this decoder must be intelligent on its behalf :-) For example, since the decoder takes only the H.323 payload and doesn't strip any IP, TCP/UDP, or TPKT headers, you have to delineate and strip these headers yourself.

The decoder can handle three types of H.323 messages:

  • RAS/H.225.0 message
  • Q.931/H.225.0 message
  • H.245 Media System Control message

The decoder doesn't automatically detect which type of message you're trying to decode; you must specify it explicitly.

Once you have extracted the payload that h323decoder can process and determined the type of message you want to decode, the rest of the job is easy. If you're using the CGI version of the decoder via a browser, select the type of message from the pulldown menu, enter the hex dump into the text area, and click the "submit" button. If you're using the standalone version of h323decoder, specify the type of message as a command-line argument ('ras', 'q931', or 'msc') and pass the hex dump to the decoder through standard input. Suppose you have a hex dump of a RAS message in a file called hexdump.txt; then

% cat hexdump.txt | h323decoder ras

will print out the result to the standard output.

Please note that the current h323decoder can handle only one message at a time; it cannot decode multiple messages in sequence. There is a Perl script that can be used with h323decoder; please refer to the last section of this document.

How do I extract an H.323 payload?

As I said before, first determine the type of message you are about to decode. The first message you come across will likely be a RAS/H.225.0 message.

Since RAS/H.225.0 uses UDP with port number 1719, you can easily find RAS/H.225.0 messages in a packet trace. Knowing that the IP header is (usually) 20 bytes long and the UDP header is 8 bytes long, it is easy to extract a RAS message from the IP packet. If you are unlucky, the packet may be fragmented; in that case, you will have to reassemble the fragments accordingly.
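
The header arithmetic above can be sketched in a few lines of Python. This is my own illustration, not part of h323decoder; it strips the IP and UDP headers from a 'tcpdump -x' style hex dump, reading the real IP header length from the IHL nibble so IP options are also handled. The sample dump is just the first two lines of the RAS example shown later in this document.

```python
# Sketch (not part of h323decoder): strip the IP and UDP headers from a
# 'tcpdump -x' style hex dump to recover the RAS/H.225.0 payload.

def extract_ras_payload(hexdump: str) -> str:
    """Return the UDP payload as a hex string, headers stripped."""
    digits = "".join(hexdump.split())   # drop whitespace and newlines
    ihl = int(digits[1], 16)            # IP header length in 32-bit words
    ip_len = ihl * 4 * 2                # IP header length in hex digits
    udp_len = 8 * 2                     # fixed 8-byte UDP header
    return digits[ip_len + udp_len:]

# First two lines of the RAS example below:
dump = """
4500 006c 6a02 0000 4011 f0ca c0a8 8911
caf6 0b04 06b7 06b7 0058 0000 2590 ffe2
"""
print(extract_ras_payload(dump))  # -> '2590ffe2' (start of the RAS message)
```

With the full dump, this returns the entire RAS/H.225.0 portion that can be fed into h323decoder.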

In the case of a Q.931/H.225.0 message, things get more complicated. Q.931/H.225.0 runs over TCP and usually uses port number 1720, so finding a Q.931/H.225.0 message in a packet trace is relatively easy. However, since TCP provides a "stream" of bytes with no logical boundaries, there must be a way to delineate Q.931/H.225.0 messages on top of TCP. The TPKT header is used for this purpose: each Q.931/H.225.0 message is wrapped in a TPKT header, which consists of 2 fixed octets ('0300') followed by 2 octets representing the total length of the message (including the length of the TPKT header itself).

Following the TPKT header, you will see some Q.931 Information Elements (IEs). While this part contains some important information, the H.323-specific data is stored in Q.931's 'User-User Information Element.' This IE starts with '0x7e', followed by a 2-octet length field and finally a protocol discriminator, which is always '0x05'. Forgetting all these details of the Q.931 packet format, there is a rule of thumb that works most of the time: find '0x7e' and discard it together with the next 3 octets; the rest is (most likely) a Q.931/H.225.0 payload that can be fed into h323decoder.
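
The rule of thumb can be written down directly. This little sketch (mine, not part of the original tool) assumes the TPKT header has already been removed and that 0x7e does not occur earlier in the IEs, which is exactly why it is only a rule of thumb:

```python
# Rule of thumb in code: find the User-User IE marker 0x7e and discard it
# together with the next 3 octets (2-octet length + discriminator 0x05).

def extract_q931_payload(octets: bytes) -> bytes:
    """octets: a Q.931/H.225.0 message with the TPKT header already stripped."""
    i = octets.index(0x7e)   # first occurrence of the User-User IE marker
    return octets[i + 4:]    # skip 0x7e, the 2-octet length, and 0x05

# Q.931 IE bytes taken from the second example later in this document:
msg = bytes.fromhex("080233320504038890a57004a13131377e00870540f8")
print(extract_q931_payload(msg).hex())  # -> '40f8' (start of the H.225.0 payload)
```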

In the case of an H.245 message, it is hard to tell whether a packet is H.245 just by looking at it, because H.245 does not use a fixed port number. To find out which port is used for H.245, you need to examine the contents of the preceding Q.931/H.225.0 messages exchanged between the gateways and/or terminals. In practice, though, you may assume that a packet is H.245 if it is neither a RAS/H.225.0 nor a Q.931/H.225.0 message. H.245 uses TCP as its transport and encapsulates each message in a TPKT header, in the same way as Q.931/H.225.0 described above.

Since both UDP (RAS/H.225.0) and TCP (Q.931/H.225.0 and H.245) run on IP, you must take the possibility of fragmentation and out-of-order delivery into account when decoding H.323 packets. In addition, be aware that TCP does not guarantee that a PDU (a Q.931/H.225.0 or H.245 message) is put onto the network as a single packet (i.e., it can be chopped up), so you may need to concatenate segments until they make sense. Similarly, two PDUs may be sent as a single TCP packet, in which case you will have to split it into two or more H.323 PDUs by hand. Furthermore, a TCP timeout may occur and trigger a retransmission; in that case, you have to throw away the duplicated packets, again by hand.
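
Splitting a reassembled TCP byte stream into its TPKT-framed PDUs can be sketched like this (my illustration; remember that the TPKT length field includes the 4-octet header itself):

```python
# Walk a reassembled TCP byte stream and split it into TPKT-framed PDUs.

def split_tpkt(stream: bytes) -> list[bytes]:
    pdus = []
    while stream:
        assert stream[0] == 0x03, "not a TPKT header"
        length = int.from_bytes(stream[2:4], "big")  # includes the 4-octet header
        pdus.append(stream[4:length])                # PDU with header stripped
        stream = stream[length:]
    return pdus

# Two hypothetical PDUs packed into a single TCP segment:
stream = bytes.fromhex("0300000511" + "030000064142")
print([p.hex() for p in split_tpkt(stream)])  # -> ['11', '4142']
```

The H.245 example near the end of this document, where one TCP packet carries two messages, is exactly this situation.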

Although the task described above may look daunting, it isn't as bad as you might imagine. Once you get used to it, you can glance at a hex dump and extract the H.323 payload relatively easily. Please give it a shot!

Could I have some examples?

Sure! Here is a packet captured with tcpdump. The following examples were all taken using Lucent's (formerly Ascend's) MultiVoice system.

max2000.ascend.co.jp.1719 > navis.ascend.co.jp.1719: udp 80
4500 006c 6a02 0000 4011 f0ca c0a8 8911
caf6 0b04 06b7 06b7 0058 0000 2590 ffe2
0860 006d 0061 0078 0032 0030 0030 0030
0201 0044 a010 44a0 0240 0600 6d00 6100
7800 3200 3000 3000 3004 8036 8658 a33a
00c0 a889 1100 0000 a033 3200 c07b 601f
7e03 002b 211d d481 4734 a818

Since this is a UDP packet with port number 1719, it is a RAS message. The first 20 octets are the IP header (colored light green) and the next 8 octets are the UDP header (colored purple). The rest (colored red) is the actual RAS/H.225.0 message and can be fed into h323decoder. The result of decoding is:

admissionRequest {
        requestSeqNum = 65507
        callType = pointToPoint <>
        callModel = gatekeeperRouted <>
        endpointIdentifier =  7 characters {
          006d 0061 0078 0032 0030 0030 0030        max2000
        }
        destinationInfo = 2 entries {
          [0]=e164 "117"
          [1]=e164 "117"
        }
        srcInfo = 2 entries {
          [0]=h323_ID  7 characters {
            006d 0061 0078 0032 0030 0030 0030        max2000
          }
          [1]=e164 "0353257007"
        }
        srcCallSignalAddress = ipAddress {
          ip =  4 octets {
            c0 a8 89 11
          }
          port = 0
        }
        bandWidth = 160
        callReferenceValue = 13106
        conferenceID =  16 octets {
          00 c0 7b 60 1f 7e 03 00 2b 21 1d d4 81 47 34 a8     {` ~  +!   G4
        }
        activeMC = FALSE
        answerCall = FALSE
        canMapAlias = FALSE
        callIdentifier = {
          guid =  16 octets {
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
          }
        }
        willSupplyUUIEs = FALSE
      }

The next example is as follows:

max2000.ascend.co.jp.15003 > max2000-2.ascend.co.jp.1720: P 1:5(4) ack 1 win 4380
4500 002c 6c02 0000 4006 7b52 c0a8 8911
c0a8 8915 3a9b 06b8 0289 4f8c c3b8 4ed0
5018 111c 61a5 0000 0300 009e 0000
max2000.ascend.co.jp.15003 > max2000-2.ascend.co.jp.1720: P 5:159(154) ack 1 win 4380
4500 00c2 6d02 0000 4006 79bc c0a8 8911
c0a8 8915 3a9b 06b8 0289 4f90 c3b8 4ed0
5018 111c 1690 0000 0802 3332 0504 0388
90a5 7004 a131 3137 7e00 8705 40f8 0600
0891 4a00 0100 c0a8 8911 3a9c 0240 0600
6d00 6100 7800 3200 3000 3000 3004 8036
8658 a33a 28c0 b500 0014 224d 756c 7469
566f 6963 6520 4761 7465 7761 7920 666f
7220 7468 6520 4d41 5820 3630 3030 0437
2e30 2e30 0001 0100 44a0 c0a8 8915 06b8
0000 c07b 601f 7e03 002b 211d d481 4734
a800 0c07 00c0 a889 1100 0000 0503 4e55
4c4c

Since this is a TCP packet with port number 1720, it is a Q.931/H.225.0 message. Note that these two IP packets together form a single Q.931/H.225.0 message. As in the first example, the first 20 octets are the IP header (colored light green) and the next 20 octets are the TCP header (colored purple). The next 4 octets in the first IP packet, '0300 009e', are the TPKT header (colored orange). The '0000' that follows is garbage (see the TCP sequence numbers of these packets) and should be ignored. (These garbage octets appear because the minimum payload of an Ethernet frame is 46 octets, and tcpdump prints them out anyway.) The subsequent octets up to '7e00 8705' are Q.931 IEs. The remaining part starting from '40f8 ...' (colored red) is the Q.931/H.225.0 message; only this part should be fed into h323decoder. The result of decoding this hex dump is:

{
        h323_uu_pdu = {
          h323_message_body = setup {
            protocolIdentifier = 0.0.8.2250.0.1
            h245Address = ipAddress {
              ip =  4 octets {
                c0 a8 89 11
              }
              port = 15004
            }
            sourceAddress = 2 entries {
              [0]=h323_ID  7 characters {
                006d 0061 0078 0032 0030 0030 0030        max2000
              }
              [1]=e164 "0353257007"
            }
            sourceInfo = {
              vendor = {
                vendor = {
                  t35CountryCode = 181
                  t35Extension = 0
                  manufacturerCode = 20
                }
                productId =  35 octets {
                  4d 75 6c 74 69 56 6f 69 63 65 20 47 61 74 65 77   MultiVoice Gatew
                  61 79 20 66 6f 72 20 74 68 65 20 4d 41 58 20 36   ay for the MAX 6
                  30 30 30                                          000
                }
                versionId =  5 octets {
                  37 2e 30 2e 30                                    7.0.0
                }
              }
              gateway = {
              }
              mc = FALSE
              undefinedNode = FALSE
            }
            destinationAddress = 1 entries {
              [0]=e164 "117"
            }
            destCallSignalAddress = ipAddress {
              ip =  4 octets {
                c0 a8 89 15
              }
              port = 1720
            }
            activeMC = FALSE
            conferenceID =  16 octets {
              00 c0 7b 60 1f 7e 03 00 2b 21 1d d4 81 47 34 a8     {` ~  +!   G4
            }
            conferenceGoal = create <>
            callType = pointToPoint <>
            sourceCallSignalAddress = ipAddress {
              ip =  4 octets {
                c0 a8 89 11
              }
              port = 0
            }
          }
        }
        user_data = {
          protocol_discriminator = 5
          user_information =  4 octets {
            4e 55 4c 4c                                       NULL
          }
        }
      }

The final example is an H.245 message.

max2000-2.ascend.co.jp.15009 > max2000.ascend.co.jp.15004: P 1:90(89) ack 1 win 4380
4500 0081 e14d 0000 4006 05b2 c0a8 8915
c0a8 8911 3aa1 3a9c c3ba 2276 028b 29e9
5018 111c 20d7 0000 0300 004e 0270 0106
0008 8175 0002 800d 0000 3c00 0100 0001
0000 0100 0003 8000 0020 c03b 8000 0108
a817 6f40 0002 2200 0740 0003 09f8 0def
404a 3700 5040 0100 0080 0001 0100 0000
0201 0001 0003 0300 000b 0100 4680 824b
69

This is a TCP packet using an ephemeral port number (1024 or greater) at both ends; H.245 messages usually look this way. The IP header and TCP header are colored light green and purple, respectively. The next 4 octets are the TPKT header (colored orange). If you carefully examine the length field ('004e') in the TPKT header, you will see that this IP packet actually contains more than one H.245 message (in fact, two)! This is yet another TCP difficulty, as described before. The two messages are a terminal capability set request and a master/slave determination request, as defined in H.245. Here is the result of decoding them:

request terminalCapabilitySet {
        sequenceNumber = 1
        protocolIdentifier = 0.0.8.245.0.2
        multiplexCapability = h2250Capability {
          maximumAudioDelayJitter = 60
          receiveMultipointCapability = {
            multicastCapability = FALSE
            multiUniCastConference = FALSE
            mediaDistributionCapability = 1 entries {
              [0]={
                centralizedControl = FALSE
                distributedControl = FALSE
                centralizedAudio = FALSE
                distributedAudio = FALSE
                centralizedVideo = FALSE
                distributedVideo = FALSE
              }
            }
          }
          transmitMultipointCapability = {
            multicastCapability = FALSE
            multiUniCastConference = FALSE
            mediaDistributionCapability = 1 entries {
              [0]={
                centralizedControl = FALSE
                distributedControl = FALSE
                centralizedAudio = FALSE
                distributedAudio = FALSE
                centralizedVideo = FALSE
                distributedVideo = FALSE
              }
            }
          }
          receiveAndTransmitMultipointCapability = {
            multicastCapability = FALSE
            multiUniCastConference = FALSE
            mediaDistributionCapability = 1 entries {
              [0]={
                centralizedControl = FALSE
                distributedControl = FALSE
                centralizedAudio = FALSE
                distributedAudio = FALSE
                centralizedVideo = FALSE
                distributedVideo = FALSE
              }
            }
          }
          mcCapability = {
            centralizedConferenceMC = FALSE
            decentralizedConferenceMC = FALSE
          }
          rtcpVideoControlCapability = FALSE
          mediaPacketizationCapability = {
            h261aVideoPacketization = FALSE
          }
        }
        capabilityTable = 4 entries {
          [0]={
            capabilityTableEntryNumber = 1
            capability = receiveAudioCapability g711Ulaw64k 60
          }
          [1]={
            capabilityTableEntryNumber = 2
            capability = receiveVideoCapability h261VideoCapability {
              qcifMPI = 3
              temporalSpatialTradeOffCapability = FALSE
              maxBitRate = 6000
              stillImageTransmission = FALSE
            }
          }
          [2]={
            capabilityTableEntryNumber = 3
            capability = receiveAudioCapability g7231 {
              maxAl_sduAudioFrames = 8
              silenceSuppression = FALSE
            }
          }
          [3]={
            capabilityTableEntryNumber = 4
            capability = receiveVideoCapability h263VideoCapability {
              sqcifMPI = 4
              qcifMPI = 16
              cifMPI = 16
              maxBitRate = 19000
              unrestrictedVector = FALSE
              arithmeticCoding = FALSE
              advancedPrediction = FALSE
              pbFrames = FALSE
              temporalSpatialTradeOffCapability = FALSE
              errorCompensation = FALSE
            }
          }
        }
        capabilityDescriptors = 1 entries {
          [0]={
            capabilityDescriptorNumber = 0
            simultaneousCapabilities = 2 entries {
              [0]=2 entries {
                [0]=1
                [1]=3
              }
              [1]=2 entries {
                [0]=2
                [1]=4
              }
            }
          }
        }
      }

request masterSlaveDetermination {
        terminalType = 70
        statusDeterminationNumber = 8538985
      }

What is ipdecode for?

ipdecode is a Perl script that can be used with h323decoder. It intelligently examines the output of tcpdump, strips the unnecessary headers, and then passes the result to h323decoder appropriately. With this script, you don't have to worry about the tedious procedure described above. You can simply run:

# tcpdump -s 1500 -x | ipdecode -Ph323

This will probably give you the desired result.

ipdecode was written and contributed by Yuzo Yamashita <yyamashita@ascend.co.jp>, one of my coworkers.

Motonori Shindo <mshindo@ascend.co.jp>
Tuesday, September 28, 1999 11:30:58 AM

NAT still can be tricky

The following discussion applies to both NAPT and NAT, so I refer to both simply as NAT hereafter.

Viptela provides a way to configure the authentication method for IPsec as you can see below:

security
 ipsec
  authentication-type ah-sha1-hmac sha1-hmac ah-no-id
 !
!

ah-sha1-hmac and sha1-hmac should be self-explanatory, but what the hell is ah-no-id?

In general, IPsec AH doesn't work if there is a NAT somewhere between the endpoints, because the fields that NAT overwrites are protected by AH. That said, AH can still be used in Viptela's solution even when NAT is present, because the architecture allows each endpoint to determine the IP address and port number before and after NAT translation. This works as long as the fields overwritten by NAT are limited to the IP address and port number, but it breaks if NAT overwrites other fields, such as the Identification field in the IP header (I'll simply call it the "ID field" hereafter). The "ah-no-id" parameter is available to work around this situation by ignoring the ID field when computing the AH integrity check.
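
To see why the ID field matters to AH, here is a deliberately simplified sketch. It is my illustration, not Viptela's implementation: real AH (RFC4302) covers the whole packet plus the AH header and zeroes more mutable fields (TOS, flags, offset), and a real NAT would also rewrite the address and port; the example changes only the ID field to isolate its effect. The key point is that the ID field is normally *included* in the integrity check, so a NAT that rewrites it invalidates the check unless the ID field is ignored as well.

```python
# Simplified illustration of AH's integrity check over the IPv4 header.
# Mutable fields (TTL, checksum) are zeroed before hashing; the ID field
# is normally included, which is what an "ah-no-id"-style option relaxes.
import hmac
import hashlib

def ah_icv(ip_header: bytes, key: bytes, ignore_id: bool = False) -> bytes:
    h = bytearray(ip_header)
    h[8] = 0                    # TTL: mutable, zeroed
    h[10:12] = b"\x00\x00"      # header checksum: mutable, zeroed
    if ignore_id:
        h[4:6] = b"\x00\x00"    # ID field: zeroed only in "no-id" mode
    return hmac.new(key, bytes(h), hashlib.sha1).digest()

# A header mimicking the hping3 test: ID 12345 before NAT, 14428 after.
before = bytearray(bytes.fromhex("450000280000000040060000" "0a000103" "08080808"))
before[4:6] = (12345).to_bytes(2, "big")
after = bytearray(before)
after[4:6] = (14428).to_bytes(2, "big")

key = b"secret"
print(ah_icv(bytes(before), key) == ah_icv(bytes(after), key))              # False: AH breaks
print(ah_icv(bytes(before), key, True) == ah_icv(bytes(after), key, True))  # True: check survives
```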

In the beginning, I was skeptical that a NAT implementation that overwrites the ID field really existed, but it turned out that there are a few NAT implementations out there that do exactly this, notably the Apple AirMac Extreme and AirMac Express. Luckily, I had an AirMac Express handy, so I ran a quick test with
[shell]$ sudo hping3 -c 1 -N 12345 -S -s 11111 -p 22222 8.8.8.8[/shell]
and compared the packet before and after NAT translation (note that the SYN flag is set via the -S option here because most NAT implementations are "stateful" and simply drop a packet that is not preceded by a SYN).

Before NAT Translation (AirPort Express):
[shell]06:21:06.670952 IP (tos 0x0, ttl 64, id 12345, offset 0, flags [none], proto TCP (6), length 40)
10.0.1.3.11111 > 8.8.8.8.22222: Flags [S], cksum 0xa76d (correct), seq 868543788, win 512, length 0[/shell]
After NAT Translation (AirPort Express):
[shell]15:20:52.361257 IP (tos 0x0, ttl 63, id 14428, offset 0, flags [none], proto TCP (6), length 40)
10.156.250.80.44210 > 8.8.8.8.22222: Flags [S], cksum 0x2c38 (correct), seq 868543788, win 512, length 0[/shell]
As you can see, the ID field is rewritten from 12345 to 14428.

Let's compare this with another NAT implementation; I took VyOS as an example.

Before NAT Translation (VyOS):
[shell]08:37:57.157399 IP (tos 0x0, ttl 64, id 12345, offset 0, flags [none], proto TCP (6), length 40)
10.200.1.11.11111 > 8.8.8.8.22222: Flags [S], cksum 0x5623 (correct), seq 7492057, win 512, length 0[/shell]
After NAT Translation (VyOS):
[shell]08:37:57.157410 IP (tos 0x0, ttl 63, id 12345, offset 0, flags [none], proto TCP (6), length 40)
221.245.168.210.11111 > 8.8.8.8.22222: Flags [S], cksum 0xdb2d (correct), seq 7492057, win 512, length 0[/shell]
In this case, the ID field (12345) is kept intact across the NAT translation.

Until recently, I believed that no intervening device should modify the ID field, as this field is used for reassembly, which happens only at the ultimate destination of an IP datagram; hence, a NAT implementation that rewrites the ID field would be "misbehaving," if not outright violating the spec. That changed after I read RFC6864, "Updated Specification of the IPv4 ID Field."

This RFC classifies datagrams into two kinds:

  • Atomic: datagrams that are not yet fragmented and for which further fragmentation has been inhibited
  • Non-atomic: datagrams that either have already been fragmented or for which fragmentation remains possible

Historically, the ID field had usages other than reassembly (e.g. de-duplication). This RFC limits the use of the ID field strictly to fragmentation and reassembly, and relaxes some requirements imposed on the field, as it has no meaning for atomic datagrams. For non-atomic datagrams, however, this RFC still mandates (just like RFC 791 does) that the value of the ID field be unique within the MDL (Maximum Datagram Lifetime, typically 2 minutes).
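The atomic/non-atomic distinction boils down to three header bits. A small illustrative sketch of my reading of RFC 6864 (not code from the RFC itself):

```python
def is_atomic(df: bool, mf: bool, frag_offset: int) -> bool:
    """RFC 6864 terms: a datagram is 'atomic' iff it has not been
    fragmented (MF == 0 and fragment offset == 0) and further
    fragmentation is inhibited (DF == 1). Everything else is
    non-atomic, and its ID must stay unique within the MDL."""
    return df and not mf and frag_offset == 0

print(is_atomic(df=True, mf=False, frag_offset=0))    # typical DF-set TCP segment → True
print(is_atomic(df=False, mf=False, frag_offset=0))   # still fragmentable → False
print(is_atomic(df=False, mf=True, frag_offset=0))    # first fragment of a series → False
```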

This requirement poses a challenge for middleboxes like NAT. Specifically, this RFC says:

NATs/ASMs/rewriters present a particularly challenging situation for
fragmentation. Because they overwrite portions of the reassembly
tuple in both directions, they can destroy tuple uniqueness and
result in a reassembly hazard. Whenever IPv4 source address,
destination address, or protocol fields are modified, a
NAT/ASM/rewriter needs to ensure that the ID field is generated
appropriately, rather than simply copied from the incoming datagram.

Specifically:
>> Address-sharing or rewriting devices MUST ensure that the IPv4 ID
field of datagrams whose addresses or protocols are translated
comply with these requirements as if the datagram were sourced by
that device.

Now I fully agree with this RFC; to guarantee the uniqueness of the ID field value, a device like a NAT that rewrites IP addresses should generate a unique ID as if the packet were sourced by that device, and should not simply copy the ID field from the original packet. Although the majority of NAT implementations today would behave like VyOS, based on my past experience, I now think Apple AirMac Extreme/Express is the more compliant and well-behaved one.

Sorry Apple, I’ve been treating you bad..

OVN (Open Virtual Network) Introduction

OVN (Open Virtual Network) is open source software that provides virtual networking features such as L2, L3, and ACLs. OVN works with OVS (Open vSwitch), which has been adopted widely. OVN is developed by the same community as OVS and is treated almost as a sub-project of OVS. Just like OVS, the development of OVN is 100% open, with discussions taking place on a public mailing list and IRC. While VMware and Red Hat are the primary contributors today, it is open to everybody who wishes to contribute.

The target platforms for OVN are Linux-based hypervisors such as KVM and Xen, as well as containers. DPDK is also supported. As of this writing, there is no ESXi support. Because OVS has been ported to Hyper-V (still in progress, though), OVN may support Hyper-V in the future.

It is important to note that OVN is CMS (Cloud Management System) agnostic, with OpenStack, Docker and Mesos in scope. Among these CMSs, OpenStack is probably the most significant for OVN. A better integration with OpenStack (compared to today's Neutron OVS plugin) is probably one of the driving factors of OVN development.

OVN shares the same goal as OVS: supporting large-scale deployments consisting of thousands of hypervisors with production quality.

The diagram below shows the high-level architecture of OVN.

OVN Architecture

As you can see, OVN introduces two new components (processes): “ovn-northd” and “ovn-controller”. As its name implies, ovn-northd provides a northbound interface to the CMS. As of this writing, only one ovn-northd exists per OVN deployment, but this part will be enhanced in the future to support some sort of clustering for redundancy and scale-out.

One of the most unique aspects of the OVN architecture is probably that its integration point with the CMS is a database (DB). Although many people would expect a RESTful API as the northbound interface, ovn-northd is designed to interact with the CMS via a DB.

ovn-northd uses two databases. One is the “Northbound DB”, which holds the “desired state” of the virtual network. More specifically, it holds information about logical switches, logical ports, logical routers, logical router ports, and ACLs. It is worth noting that the Northbound DB doesn't hold any “physical” information. An entity sitting on the north side of ovn-northd (typically the CMS) writes to this Northbound DB to interact with ovn-northd. The other database ovn-northd takes care of is the “Southbound DB”, which holds runtime state such as physical, logical, and binding information. Specifically, the Southbound DB includes information about chassis, datapath bindings, encapsulations, port bindings, and logical flows. One of the important roles of ovn-northd is to read the Northbound DB, translate it into logical flows, and write them to the Southbound DB.
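To make ovn-northd's role concrete, here is a toy model in Python. The table shapes and flow fields are invented for illustration and are not OVN's real schema; the point is only the Northbound-to-Southbound translation step:

```python
# Toy "Northbound DB": the desired state written by the CMS.
northbound = {
    "logical_switch": {"ls0": {"ports": ["lp1", "lp2"]}},
}

def compile_logical_flows(nb):
    """Sketch of ovn-northd's job: expand each logical port into
    toy logical flows (an ingress match on inport and an egress
    match on outport). Real OVN emits far richer pipelines."""
    flows = []
    for ls, data in nb["logical_switch"].items():
        for lp in data["ports"]:
            flows.append({"datapath": ls, "pipeline": "ingress",
                          "match": f'inport == "{lp}"', "action": "next;"})
            flows.append({"datapath": ls, "pipeline": "egress",
                          "match": f'outport == "{lp}"', "action": "output;"})
    return flows

# Toy "Southbound DB": logical flows consumed by ovn-controller.
southbound = {"logical_flow": compile_logical_flows(northbound)}
print(len(southbound["logical_flow"]))  # → 4
```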

ovn-controller, on the other hand, is a distributed controller installed on every hypervisor. It reads information from the Southbound DB and configures the ovsdb-server and ovs-vswitchd running on the same host accordingly: it translates the logical flows populated in the Southbound DB into physical flows and installs them into OVS.

Today OVN uses OVSDB as its database system. Inherently, the database could be anything and doesn't have to be OVSDB. However, since OVSDB is an intrinsic part of OVS, on which OVN always depends, and the developers of OVN know the characteristics of OVSDB very well, they decided to use OVSDB for the time being.

Basically, OVN provides three features: L2, L3, and ACL (Security Group).

As an L2 feature, OVN provides a logical switch. Specifically, OVN creates L2-over-L3 overlay tunnels between hosts automatically. OVN uses Geneve as the default encapsulation. Considering the need for metadata support, multipath capability, and hardware acceleration, Geneve is the most desirable choice. In cases where hardware acceleration for Geneve is not available on the NIC, OVN allows STT as a second choice. Because HW-VTEPs generally don't support Geneve/STT today, OVN uses VXLAN when talking to HW-VTEPs.

In terms of L3, OVN provides so-called “distributed logical routing”. The L3 features provided by OVN are not centralized; each host executes the L3 function autonomously. The L3 topology that OVN supports today is very simple: a logical router routes traffic between the logical switches directly connected to it and to the default gateway. As of this writing it is not possible to configure a static route other than the default route. This is simple but sufficient to support the basic L3 functions that OpenStack Neutron requires. NAT will be supported soon.

The OVS plugin in OpenStack Neutron today implements Security Groups by applying iptables to the tap interface (vnet) on a Linux Bridge. Having both OVS and Linux Bridge at the same time makes the architecture somewhat complex.

Conventional OpenStack Neutron OVS plugin Architecture (an excerpt from http://docs.ocselected.org/openstack-manuals/kilo/networking-guide/content/figures/6/a/a/common/figures/under-the-hood-scenario-1-ovs-compute.png)

Since OVS 2.4, OVS has been integrated with the “conntrack” feature available on Linux, so it is possible to implement stateful ACLs in OVS without relying on iptables. OVN takes advantage of this OVS and conntrack integration to implement its ACLs.

Since conntrack integration is an OVS feature, one can use OVS+conntrack without OVN. However, OVN lets you use stateful ACLs without any explicit awareness of conntrack, because OVN compiles logical ACL rules into conntrack-based rules automatically, which many people will appreciate.
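The shape of that compilation can be sketched as follows. This is a toy illustration of the idea, not OVN's actual compiler output; the flow fields and match syntax are invented for readability:

```python
def compile_stateful_acl(priority: int, match: str):
    """Toy expansion of one logical 'allow' ACL into several
    conntrack-qualified flows, showing why the user never has to
    write ct_state rules by hand."""
    return [
        # untracked packets are first sent through conntrack
        {"priority": priority, "match": "ct.trk == 0", "action": "ct_next;"},
        # new connections matching the ACL are committed to conntrack
        {"priority": priority, "match": f"ct.new && ({match})",
         "action": "ct_commit;"},
        # established/related traffic is allowed without re-evaluating the ACL
        {"priority": priority, "match": "ct.est || ct.rel", "action": "next;"},
    ]

flows = compile_stateful_acl(1000, "ip4.src == 10.0.0.0/24")
print(len(flows))  # → 3
```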

I will go into a bit more detail about L2, L3 and ACL features of OVN in the subsequent posts.

NetFlow on Open vSwitch

Open vSwitch (OVS) has supported NetFlow for a long time (since 2009). To enable NetFlow on OVS, you can use the following command, for example:

[shell]# ovs-vsctl -- set Bridge br0 netflow=@nf -- --id=@nf create NetFlow targets=\"10.127.1.67\"[/shell]

When you want to disable NetFlow, you can do it in the following way:

[shell]# ovs-vsctl clear Bridge br0 netflow[/shell]

NetFlow has several versions; V5 and V9 are the most commonly used today. OVS supports NetFlow V5 only. NetFlow V9 is not supported as of this writing (and is very unlikely to be supported, because OVS already supports IPFIX, the direct successor of NetFlow V9).

A NetFlow V5 packet consists of a header followed by up to 30 flow records (see below).

NetFlow V5 header format

NetFlow V5 flow record format

NetFlow V5 cannot handle IPv6 flow records. If you need to monitor IPv6 traffic, you must use sFlow or IPFIX.
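For reference, the fixed 24-byte V5 header is easy to pack or parse with a few lines of Python (field order per the header format above; the sample values here are arbitrary):

```python
import struct

# version, count, sys_uptime, unix_secs, unix_nsecs,
# flow_sequence, engine_type, engine_id, sampling_interval
NETFLOW_V5_HEADER = struct.Struct("!HHIIIIBBH")  # 24 bytes total

def build_v5_header(count, uptime_ms, secs, nsecs, seq,
                    engine_type=0, engine_id=0, sampling=0):
    """Pack a NetFlow V5 export-packet header (version is always 5)."""
    return NETFLOW_V5_HEADER.pack(5, count, uptime_ms, secs, nsecs,
                                  seq, engine_type, engine_id, sampling)

hdr = build_v5_header(count=2, uptime_ms=123456, secs=1700000000,
                      nsecs=0, seq=42)
version, count = struct.unpack("!HH", hdr[:4])
print(version, count, len(hdr))  # → 5 2 24
```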

Compared to the NetFlow implementation on typical routers and switches, the one in OVS has a couple of unique points that you should keep in mind. I describe them below.

Most NetFlow-capable switches and routers support so-called “sampling”, where only a subset of packets is processed for NetFlow (there are a couple of ways to sample packets, but they are beyond the scope of this post). NetFlow on OVS doesn't support sampling. If you need to sample traffic, use sFlow or IPFIX instead.

Somewhat related to the fact that NetFlow on OVS doesn't sample, it is worth noting that the “byte count (dOctets)” and “packet count (dPkts)” fields in a NetFlow flow record (both 32-bit fields) may wrap around for elephant flows. To circumvent this, OVS sends multiple flow records when the byte or packet count exceeds the 32-bit maximum, so that it can report accurate byte and packet counts to the collector.
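The idea of splitting an oversized counter across several records can be sketched like this (my own illustration of the approach described above, not OVS's exact algorithm):

```python
MAX32 = 2**32 - 1  # largest value a dOctets/dPkts field can carry

def split_counts(total_bytes, total_pkts):
    """Split flow counters into multiple (bytes, pkts) records so that
    each field fits in the 32-bit NetFlow V5 fields."""
    records = []
    while total_bytes > MAX32 or total_pkts > MAX32:
        b, p = min(total_bytes, MAX32), min(total_pkts, MAX32)
        records.append((b, p))
        total_bytes -= b
        total_pkts -= p
    records.append((total_bytes, total_pkts))
    return records

# An 8 GiB elephant flow needs three records.
print(len(split_counts(2**33, 10)))  # → 3
```

Summing the per-record fields on the collector side recovers the true totals.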

Typically, NetFlow-capable switches and routers have a per-interface configuration to enable/disable NetFlow in addition to a global NetFlow configuration. OVS, on the other hand, has no per-interface configuration; instead, a per-bridge configuration lets you enable/disable NetFlow on a per-bridge basis.

Most router/switch-based NetFlow exporters allow you to configure the source IP address of exported NetFlow packets (in which case a loopback address is a reasonable choice). OVS, however, doesn't have this capability: the source IP address of a NetFlow packet is determined by the IP stack of the host operating system, and is usually the IP address associated with the outgoing interface. Since NetFlow V5 has no concept like sFlow's “agent address”, most collectors distinguish exporters by the source IP address of NetFlow packets. Because OVS doesn't let us configure this address explicitly, be aware that the source IP address of NetFlow packets can change when the outgoing interface changes.

Although it is not clearly described in the documentation, OVS in fact supports multiple collectors, as shown in the example below. This configuration provides collector redundancy.

[shell]# ovs-vsctl -- set Bridge br0 netflow=@nf -- --id=@nf create NetFlow targets=\[\"10.127.1.67:2055\",\"10.127.1.68:2055\"\][/shell]

When flow-based network management is adopted, the In/Out interface numbers included in a flow record are significant because they are often used to filter the traffic of interest, and most commercial collector products have sophisticated filtering capabilities. Router/switch-based NetFlow exporters use the SNMP IfIndex to report In/Out interface numbers. NetFlow on OVS, on the other hand, uses OpenFlow port numbers in the flow record. The OpenFlow port number can be found with the ovs-ofctl command as follows:

[shell]# ovs-ofctl show br0
OFPT_FEATURES_REPLY (xid=0x2): dpid:0000000c29eed295
n_tables:254, n_buffers:256
capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IP
actions: output enqueue set_vlan_vid set_vlan_pcp strip_vlan mod_dl_src mod_dl_dst mod_nw_src mod_nw_dst mod_nw_tos mod_tp_src mod_tp_dst
1(eth1): addr:00:0c:29:ee:d2:95
config: 0
state: 0
current: 1GB-FD COPPER AUTO_NEG
advertised: 10MB-HD 10MB-FD 100MB-HD 100MB-FD 1GB-FD COPPER AUTO_NEG
supported: 10MB-HD 10MB-FD 100MB-HD 100MB-FD 1GB-FD COPPER AUTO_NEG
speed: 1000 Mbps now, 1000 Mbps max
LOCAL(br0): addr:00:0c:29:ee:d2:95
config: 0
state: 0
speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0[/shell]

In this example, the OpenFlow port number of “eth1” is 1. Some In/Out interface numbers have a special meaning in OVS: an interface local to the host (the one labeled “LOCAL” in the example above) is represented as 65534, and output interface number 65535 is used for broadcast/multicast packets, whereas most router/switch-based NetFlow exporters use “0” in both of these cases.

Those who know that IfIndex information was added to the “Interface” table in OVS relatively recently may think that using IfIndex instead of the OpenFlow port number is the right thing to do. That may be true, but it is not that simple. For example, a tunnel interface created by OVS doesn't have an IfIndex, so it would not be possible to export flow records for traffic traversing tunnel interfaces if we simply chose IfIndex as the NetFlow In/Out interface number.

The NetFlow V5 header has fields called “Engine ID” and “Engine Type”. How OVS sets these fields by default depends on the type of datapath. If OVS runs in user space using netdev, Engine Type and Engine ID are derived from a hash of the datapath name: Engine ID is the most significant 8 bits and Engine Type the least significant 8 bits of the hash value. In the case of the Linux kernel datapath using netlink, the IfIndex of the datapath is set to both Engine ID and Engine Type. You can find the IfIndex of the OVS datapath with the following command:

[shell]# cat /sys/class/net/ovs-system/ifindex[/shell]
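The MSB/LSB split of the userspace-datapath default can be illustrated in a few lines of Python. Note that this is only a sketch of the scheme described above; the actual hash function OVS uses is different:

```python
import hashlib

def default_engine(datapath_name: str):
    """Illustrate deriving Engine ID / Engine Type from a hash of the
    datapath name: Engine ID takes the most significant 8 bits and
    Engine Type the least significant 8 bits of a 16-bit hash value."""
    h = int.from_bytes(hashlib.md5(datapath_name.encode()).digest()[:2], "big")
    engine_id, engine_type = h >> 8, h & 0xFF
    return engine_type, engine_id

et, eid = default_engine("ovs-netdev")
print(0 <= et <= 255 and 0 <= eid <= 255)  # → True
```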

If you don’t like the default values of these fields, you can configure them explicitly as shown below:

[shell]# ovs-vsctl -- set Bridge br0 netflow=@nf -- --id=@nf create NetFlow targets=\"10.127.1.67:2055\" engine_id=10 engine_type=20[/shell]

The example above sets Engine Type and Engine ID at the time the NetFlow configuration is first created. You can also change NetFlow-related parameters after the configuration has been created, like so:

[shell]# ovs-vsctl set NetFlow br0 engine_id=10 engine_type=20[/shell]

In general, the typical use case for Engine Type and Engine ID is to distinguish logically separate but physically single exporters. A good example is the Cisco 6500, which has an MSFC and a PFC in a single chassis, each with its own NetFlow export engine. In the OVS case, they can be used to distinguish two or more bridges that are generating NetFlow flow records. As I mentioned earlier, the source IP address of NetFlow packets exported by OVS is determined by the standard IP stack (and it is usually not the IP address associated with the NetFlow-enabled bridge interface in OVS), so it is not possible to use the source IP address of NetFlow packets to tell which bridge exported the flow records. By setting a distinct Engine Type and Engine ID on each bridge, you can let the collector tell them apart. To my knowledge, however, not many collectors can use Engine Type and/or Engine ID to distinguish multiple logical exporters.

There is another use case for Engine ID. As I already explained, OVS uses the OpenFlow port number as the In/Out interface number in NetFlow flow records. Because the OpenFlow port number is unique only per bridge, these numbers may collide across bridges. To get around this problem, you can set “add_id_to_interface” to true.

[shell]# ovs-vsctl -- set Bridge br0 netflow=@nf -- --id=@nf create NetFlow targets=\"10.127.1.67:2055\" add_id_to_interface=true[/shell]

When this parameter is set to true, the 7 most significant bits of the In/Out interface number are replaced with the 7 least significant bits of the Engine ID. This makes interface number collisions across bridges less likely.
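The bit manipulation just described amounts to the following (a sketch of the scheme as I understand it, applied to the 16-bit interface-number field):

```python
def netflow_iface(ofport: int, engine_id: int) -> int:
    """Replace the 7 most significant bits of the 16-bit In/Out
    interface number with the 7 least significant bits of the
    Engine ID, keeping the low 9 bits of the OpenFlow port number."""
    return ((engine_id & 0x7F) << 9) | (ofport & 0x1FF)

# With engine_id=10, OpenFlow port 1 becomes 0x1401 in the flow record.
print(hex(netflow_iface(1, 10)))  # → 0x1401
```

Two bridges with distinct Engine IDs thus report distinct interface numbers even for the same OpenFlow port.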

Similar to typical router/switch-based NetFlow exporters, OVS also has the concepts of active and inactive timeouts. You can explicitly configure the active timeout (in seconds) using the following command:

[shell]# ovs-vsctl -- set Bridge br0 netflow=@nf -- --id=@nf create NetFlow targets=\"10.127.1.67:2055\" active_timeout=60[/shell]

If not explicitly specified, it defaults to 600 seconds. If it is set to -1, the active timeout is disabled.

While OVS has an inactive-timeout mechanism for NetFlow, it doesn't have an explicit configuration knob for it. When a flow that OVS maintains is removed from the datapath, information about that flow is also exported via NetFlow. This timeout is dynamic; it varies depending on many factors such as the OVS version and the CPU and memory load, but it is typically 1 to 2 seconds in recent OVS. This is considerably shorter than that of typical router/switch-based NetFlow exporters, which is 15 seconds in most cases.

As with most router/switch-based NetFlow exporters, OVS exports flow records for ICMP packets by filling the source port number in the flow record with the value “ICMP Type * 256 + Code”. The destination port number for ICMP packets is always set to 0.
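That encoding is trivial but easy to forget when reading collector output, so here it is spelled out:

```python
def icmp_src_port(icmp_type: int, icmp_code: int) -> int:
    """NetFlow encodes the ICMP type/code pair in the flow record's
    source-port field as type * 256 + code."""
    return icmp_type * 256 + icmp_code

print(icmp_src_port(8, 0))  # echo request → 2048
print(icmp_src_port(3, 1))  # host unreachable → 769
```

So a "source port" of 2048 on an ICMP flow is an echo request, not a real port.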

NextHop, the source and destination AS numbers, and the source and destination netmasks are always set to 0. This is expected behavior, as OVS is inherently a “switch”.

While there are some caveats as described above, NetFlow on OVS is a very useful tool for monitoring traffic handled by OVS. One advantage of NetFlow over sFlow or IPFIX is that there are many open source and commercial collectors available today; whatever flow collector you choose most likely supports NetFlow V5. Please give it a try. It will give you great visibility into the traffic on your network.

BGP4 library implementation in Ruby

A while back, I posted a simple BGP4 library implemented in Ruby to GitHub. I wrote this code when I was working at Fivefront, which was selling GenieATM, a NetFlow/sFlow/IPFIX collector.

https://github.com/mshindo/bgp-server

Using this library, you can easily write code that speaks BGP4, as shown below:

#
# example1.rb
#
require 'bgp-server'

open = Open.new(7675, '172.16.167.1')
openmsg = BgpMsg.new(open)

origin = Origin.new
pathseg1 = AsPathSegment.new([100, 101, 102])
aspath = AsPath.new
aspath.add(pathseg1)
nexthop = NextHop.new('11.0.0.2')
localpref = LocalPreference.new

path_attr = PathAttribute.new
path_attr.add(origin)
path_attr.add(aspath)
path_attr.add(nexthop)
path_attr.add(localpref)

nlri = Nlri.new(['10.0.0.0/8', '20.0.0.0/16'])
update = Update.new(path_attr, nlri)
updatemsg = BgpMsg.new(update)

keepalive = KeepAlive.new
keepalivemsg = BgpMsg.new(keepalive)

bgp = Bgp.new
bgp.start
bgp.send openmsg
puts "OPEN sent"
bgp.send keepalivemsg
puts "KEEPALIVE sent"

bgp.send updatemsg
puts "UPDATE sent"

loop do
  sleep 30
  bgp.send keepalivemsg
  puts "KEEPALIVE sent"
end

I wrote this code primarily because we were so “poor” at the time (we were a startup with just 5 people) that we couldn't afford a decent physical router from Cisco or Juniper 🙂 Another reason was that we needed to inject a large number of routes (e.g. 1,000,000 routes) into GenieATM, and Zebra didn't do a good job in that case, as it has to keep all routes in memory. These situations motivated me to write stateless BGP code for route injection.

Over time, I added more features as needed and the code became more generic. Sometimes it was a great help, for example when I needed to debug an issue caused by a subtle difference in how BGP4 Capability Advertisements are made by Juniper and Force10 (now Dell): Juniper announces capabilities as multiple Capability Advertisements, each including only one capability, while Force10 announces a single Capability Advertisement that includes multiple capabilities. This BGP4 library was useful for simulating these two different behaviors.
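To see why both forms are legal, recall the wire format from RFC 5492: a capability is a TLV, and the OPEN message's Optional Parameter (type 2) may wrap one or many of them. A quick sketch of the two encodings (in Python for brevity; the capability values are just examples):

```python
def cap(code: int, value: bytes = b"") -> bytes:
    """One BGP capability TLV: code, length, value (RFC 5492)."""
    return bytes([code, len(value)]) + value

def opt_param(caps: bytes) -> bytes:
    """Wrap capability TLVs in one Optional Parameter (type 2 = Capabilities)."""
    return bytes([2, len(caps)]) + caps

mp_ipv4 = cap(1, b"\x00\x01\x00\x01")  # Multiprotocol: AFI=1 (IPv4), SAFI=1
rr = cap(2)                            # Route Refresh (no value)

# "Juniper style": one capability per Optional Parameter
juniper = opt_param(mp_ipv4) + opt_param(rr)
# "Force10 style": all capabilities in a single Optional Parameter
force10 = opt_param(mp_ipv4 + rr)
print(len(juniper), len(force10))  # → 12 10
```

Both byte strings advertise the same capabilities, which is exactly why a peer that parses only one form can trip up.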

Because of the intended use case of this library, the abstraction level of BGP was intentionally kept low. It could have been abstracted more, but that would in turn lose the flexibility of crafting BGP messages arbitrarily.

Please note that there are some limitations and constraints, because this code was primarily developed to test GenieATM, a flow collector. First, while a BGP session can generally be established from either peer, this code never initiates a BGP session itself; instead, it simply expects the peer to establish a session (a.k.a. passive mode). Second, no matter what messages the peer sends, it doesn't do anything with them. In essence, this BGP4 library aims to test how a peer BGP implementation behaves when various kinds of BGP messages are sent to it.

Needless to say, this code is not a complete implementation of BGP4. Compared to complete BGP4 implementations like Zebra/Quagga, this code is just a toy. That said, I decided to make it publicly available, hoping that someone with a similar use case may find it useful. I would like to thank Tajima-san, who suggested that I publish this code on GitHub.

A book about NSX

It has long been my desire to contribute something “visible” to the world, and my dream has come true (at least to some extent). I am pleased to announce that I wrote a book about VMware NSX with a couple of my colleagues, which can be found here.

VMware NSX Illustrated

This is not a translation of an existing book; it was written from scratch. We apologize that this book is available only in Japanese for the time being, while we are getting several requests from outside Japan for translations into other languages. We are not certain whether such work will take place, but we do hope it will. That said, you can purchase it today from www.amazon.co.jp (sorry, it is not available on www.amazon.com) if you're interested.

The table of contents is as follows:

  • Chapter 01 Technical Background and Definitions
  • Chapter 02 Standardization
  • Chapter 03 Some challenges in networking today
  • Chapter 04 Network Virtualization API
  • Chapter 05 Technical Details about NSX
  • Chapter 06 OpenStack and Neutron
  • Chapter 07 SDDC and NSX

Of these, I was responsible for the first half of Chapter 05 and some columns. While I have contributed many articles to magazines and books in the past, this was the first time my name was explicitly listed as an author, so it was a profound experience.

What drove us to publish this book was our day-to-day experience. Because network virtualization is a relatively new concept, we have often come across people who don't understand it very well, or who even misunderstand it. In a sense this was our fault, as we hadn't been able to provide enough information about NSX publicly, particularly in the Japanese market. We thought the best way to change this situation was to write a book about NSX, and we made it!

It was not an easy journey, though. It was almost a year ago that we first came up with the idea to write this book, and we had many chances to give it up. I would like to thank all my co-authors (Mizumoto-san, Tanaka-san, Yokoi-san, Takata-san and Ogura-san) for their efforts to make this book available at vForum Tokyo 2014, which I know was tough. I am also very thankful to Mr. Maruyama (Hecula Inc.) and Mr. Hatanaka (Impress Inc.) for their day-and-night devotion to editing this book.

Publishing the book is not our goal. We hope that this book helps people understand network virtualization better and get more traction on NSX.

Geneve on Open vSwitch

A few weeks back, I posted a blog entry (sorry, Japanese only) about a new encapsulation called “Geneve”, which is being proposed to the IETF as an Internet-Draft. Recently the first implementation of Geneve became available for Open vSwitch (OVS), contributed by Jesse Gross, a main author of the Geneve Internet-Draft, and the patch was upstreamed to the master branch on GitHub, where the latest OVS code resides. I pulled the latest OVS source code from GitHub and played with the Geneve encapsulation. The rest of this post explains how I tested it. Since this effort is purely for testing Geneve and nothing else, I didn't use KVM this time. Instead, I used two Ubuntu 14.04 VM instances (host-1 and host-2) running on VMware Fusion with the latest OVS installed. In terms of VMware Fusion configuration, I assigned one Ethernet NIC to each VM, which obtains an IP address from the DHCP service Fusion provides by default. In the following example, let's assume host-1 and host-2 obtained the IP addresses 192.168.203.151 and 192.168.203.149 respectively. Next, two bridges (br0 and br1) are created on each VM: br0 connects to the network via eth0, while the br1 instances talk to each other using Geneve encapsulation.

Geneve Test with Open vSwitch

The OVS configurations for host-1 and host-2 are shown below:

[shell]mshindo@host-1:~$ sudo ovs-vsctl add-br br0
mshindo@host-1:~$ sudo ovs-vsctl add-br br1
mshindo@host-1:~$ sudo ovs-vsctl add-port br0 eth0
mshindo@host-1:~$ sudo ifconfig eth0 0
mshindo@host-1:~$ sudo dhclient br0
mshindo@host-1:~$ sudo ifconfig br1 10.0.0.1 netmask 255.255.255.0
mshindo@host-1:~$ sudo ovs-vsctl add-port br1 geneve1 -- set interface geneve1 type=geneve options:remote_ip=192.168.203.149[/shell]

[shell]mshindo@host-2:~$ sudo ovs-vsctl add-br br0
mshindo@host-2:~$ sudo ovs-vsctl add-br br1
mshindo@host-2:~$ sudo ovs-vsctl add-port br0 eth0
mshindo@host-2:~$ sudo ifconfig eth0 0
mshindo@host-2:~$ sudo dhclient br0
mshindo@host-2:~$ sudo ifconfig br1 10.0.0.2 netmask 255.255.255.0
mshindo@host-2:~$ sudo ovs-vsctl add-port br1 geneve1 -- set interface geneve1 type=geneve options:remote_ip=192.168.203.151[/shell]

Once this configuration is done, ping should work between the br1 interfaces on the two VMs, and those ping packets are encapsulated with Geneve.

[shell]mshindo@host-1:~$ ping 10.0.0.2
PING 10.0.0.2 (10.0.0.2) 56(84) bytes of data.
64 bytes from 10.0.0.2: icmp_seq=1 ttl=64 time=0.759 ms
64 bytes from 10.0.0.2: icmp_seq=2 ttl=64 time=0.486 ms
64 bytes from 10.0.0.2: icmp_seq=3 ttl=64 time=0.514 ms
64 bytes from 10.0.0.2: icmp_seq=4 ttl=64 time=0.544 ms
64 bytes from 10.0.0.2: icmp_seq=5 ttl=64 time=0.527 ms
^C
--- 10.0.0.2 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 3998ms
rtt min/avg/max/mdev = 0.486/0.566/0.759/0.098 ms
mshindo@host-1:~$ [/shell]

Let’s take a closer look at what Geneve-encapsulated packets look like using Wireshark. A Geneve dissector for Wireshark became available recently (also a contribution from Jesse, thanks again!) and has been merged into the latest master branch. Using this latest Wireshark, we can see a Geneve packet as follows:

Geneve Frame by Wireshark

As you can see, Geneve uses port 6081/udp, the port number officially assigned by IANA on March 27, 2014. To simply connect two bridges with a Geneve tunnel, there is no need to specify a VNI (Virtual Network Identifier). If no VNI is specified, VNI=0 is used, as you can see in this Wireshark capture.

On the other hand, if you need to multiplex more than one virtual network over a single Geneve tunnel, a VNI needs to be specified. In that case, you can designate the VNI using a parameter called “key” as an option to the ovs-vsctl command, as shown below:

[shell]mshindo@host-1:~$ sudo ovs-vsctl add-port br1 geneve1 -- set interface geneve1 type=geneve options:remote_ip=192.168.203.149 options:key=5000[/shell]

The following is a Wireshark capture where the VNI was specified as 5000 (0x1388):

Geneve Frame with VNI 5000 by Wireshark

Geneve is capable of encapsulating not only Ethernet frames but arbitrary frame types; for this purpose the Geneve header has a field called “Protocol Type”. In this example Ethernet frames are encapsulated, so this field is set to 0x6558, meaning “Transparent Ethernet Bridging”.
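The fields visible in the captures above (version, option length, Protocol Type, 24-bit VNI) make up the 8-byte Geneve base header, which is easy to construct by hand. A Python sketch for an options-free header, useful for checking captures against expected bytes:

```python
import struct

def geneve_header(vni: int, proto: int = 0x6558, opt_len_words: int = 0) -> bytes:
    """Pack a minimal 8-byte Geneve base header: version 0, no options.
    proto defaults to 0x6558 (Transparent Ethernet Bridging)."""
    first = (0 << 6) | (opt_len_words & 0x3F)   # version=0, option length in 4-byte words
    return struct.pack("!BBH", first, 0, proto) + \
           (vni << 8).to_bytes(4, "big")         # 24-bit VNI followed by a reserved byte

hdr = geneve_header(5000)
print(len(hdr), hex(int.from_bytes(hdr[4:7], "big")))  # → 8 0x1388
```

The VNI bytes 0x001388 match what Wireshark shows for the key=5000 tunnel above.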

As of this writing, Geneve Options are not supported (more specifically, there is no way to specify which Geneve Options should be added to the Geneve header). Please note that Geneve Options are yet to be defined in the Geneve Internet-Draft; most likely a separate Internet-Draft will be submitted to define them sooner or later. As the standardization process progresses, the Geneve implementation in OVS will surely evolve as well.

Although Geneve-aware NICs that can perform TSO on Geneve-encapsulated packets are not on the market yet, OVS is at least “Geneve ready” now. The Geneve code is only in the latest master branch of OVS at this point, but it will be included in a subsequent official release (hopefully 2.2), and when that happens you can play with it more easily. Enjoy!