Network Working Group                                          W. Kumari
Internet-Draft                                                    Google
Intended status: Informational                                J. Halpern
Expires: February 12, 2012                                      Ericsson
                                                         August 11, 2011


                Virtual Machine mobility in L3 Networks.
                  draft-wkumari-dcops-l3-vmmobility-00

Abstract

   This document outlines how Virtual Machine mobility can be
   accomplished in datacenter networks that are based on L3
   technologies.  It is not really intended to solve (or fully define)
   the problem, but rather to outline it at a very high level to
   determine if standardization within the IETF makes sense.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on February 12, 2012.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.


Table of Contents

   1.  Author Notes
   2.  Introduction
     2.1.  Requirements notation
   3.  Terminology
   4.  Overview
   5.  IANA Considerations
   6.  Security Considerations
   7.  Privacy
   8.  Acknowledgements
   9.  Normative References
   Authors' Addresses


1.  Author Notes

   [ RFC Editor -- Please remove this section before publication! ]

   1.  Fix terminology section!
   2.  Rejigger Introduction into Intro and Background!
   3.  Do we need to extend the mapping service to include
       (Customer_ID)?  This will allow the use of overlapping addresses
       by customers, but *does* limit the encapsulating technologies.
   4.  Currently I'm envisioning this as IP only.  It would be fairly
       trivial to make the query be for the MAC address instead of the
       IP.  This does lead to some interesting issues, like what do we
       do with broadcast, such as ARP?  Have the mapping server reply
       with all of the destinations and then have the source replicate
       the packet?!


2.  Introduction

   There are many ways to design and build a datacenter network (and
   the definition of what exactly constitutes a datacenter network is
   very vague!), but in general they can be separated into two main
   classes: Layer-2 based and Layer-3 based.

   A Layer-2 based datacenter is one in which the majority of the
   traffic is bridged (or switched) in a large, flat Layer-2 domain or
   a number of Layer-2 domains.  VLANs are often employed to provide
   customer isolation.

   A Layer-3 based datacenter is one in which much of the communication
   between hosts is routed.  In this architecture there are a large
   number of separate Layer-3 domains (for example, one subnet per
   rack), and communication between hosts in different subnets is
   routed, while communication between hosts in the same subnet is
   (obviously) bridged / switched.  While customer isolation can be
   provided through careful layout and access control lists, in general
   this architecture is better suited to a single user (or a small
   number of users), such as a single organization.

   This delineation is obviously a huge simplification, as the design
   and build-out of a datacenter have many dimensions, and most real-
   world datacenters have properties of both Layer-2 and Layer-3
   designs.

   Virtual Machines are fast gaining popularity as they allow a
   datacenter operator to more fully leverage their hardware resources
   and, in essence, provide statistical multiplexing of compute
   resources.  By selling multiple VMs on a single physical machine,
   operators can maximise their investment, quickly allocate resources
   to customers, and potentially move VMs to other hosts when needed.

   One of the factors driving the design of datacenters is the desire
   to provide Virtual Machine mobility.  This allows an operator to
   move a guest machine's state from one machine to another, including
   all of its network state, so that, for example, established TCP
   connections stay alive.  This allows a datacenter operator to
   dynamically move guest machines around to better allocate resources
   and to take devices offline for maintenance without negatively
   impacting customers.  VM mobility can even be used to move running
   machines around to provide better latency - for example, an instance
   can be moved from the East Coast of the USA to Australia and back on
   a daily basis to "follow the sun".

   In many cases VM mobility requires that the source and destination
   host machines be on the same Layer-2 network, which has led to the
   formation of large Layer-2 networks containing thousands (or tens of
   thousands) of machines.  This has led to some scaling concerns, such
   as those being addressed in the ARMD Working Group.  Some operators
   are more comfortable running Layer-3 networks (and, to be honest,
   think that big Layer-2 networks are bad juju).

   This document outlines how VM mobility can be designed to work in a
   datacenter (or across datacenters) that is broken up into multiple
   Layer-3 domains.

2.1.  Requirements notation

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].


3.  Terminology

   There is a whole industry built around these technologies, and as
   with many new industries, each vendor has its own, unique
   nomenclature for the various parts.  This is the terminology that we
   use within this document -- it may not line up with what others call
   something, but "A rose by any other name..."
   Guest network  A virtual network connecting guest instances owned by
      a customer.  This is also referred to as a Guest LAN or Customer
      Network.
   Host Machine  A machine that "hosts" guest (virtual) machines and
      runs a Hypervisor.  This is usually a powerful server, using
      software and hardware (the Hypervisor, see below) to provide
      isolation between the guest machines.  The host machine emulates
      all of the functions of a "normal" machine so that, ideally, the
      guest OS is unaware that it is not running on dedicated hardware.
   Gateway  A device that provides access to external networks.  It
      decapsulates traffic from guest machines that is destined to
      locations outside the virtual network, and encapsulates traffic
      arriving from external networks that is bound for guest machines.
   Guest Machine  A "Virtual Machine" that runs on a Host Machine.
   Hypervisor  A somewhat loose term that encompasses the hardware and
      software that provides isolation between guest machines and
      emulates all of the functions of a bare-metal server.  This
      usually includes such things as a virtual Network Interface Card
      (NIC), a virtual CPU (usually assisted by specialized hardware in
      the host machine's CPU), virtual memory, etc.
   Mapping Service  A service providing a mapping between guest machines
      and host machines on which those guests are running.  This mapping
      service also provides mappings to Gateways that provide
      connectivity to devices outside the customer networks.
   Virtual Machine  A synonym for Guest Machine.
   Virtual Switch  A virtualized bridge created by the Hypervisor,
      bridging the virtual NICs in the virtual machines and providing
      access to the physical network.


4.  Overview

   By providing a "shim" layer within the network stack provided by the
   Hypervisor (or Guest machine), we can create a virtual L2 network
   connecting the machines belonging to a customer, even if these
   machines are in different L3 networks (subnets).

   When an application on a virtual machine sends a packet to a
   receiver on another virtual machine, the operating system on the
   sending VM needs to resolve the hardware address of the destination
   IP address (using ARP in IPv4 or Neighbor Discovery / Neighbor
   Solicitation in IPv6).  To do this, it generates an ARP / NS packet
   and broadcasts / multicasts it.  As with all traffic sent by the VM,
   this is handed to a virtual network card, which is simulated by the
   hypervisor (yes, some VM technologies provide direct access to the
   hardware; this will be discussed further later).  The hypervisor
   examines the packet to provide access control (and similar) and then
   discards, munges, or sends the packet on the physical network.  So
   far this describes the current operation of VM networking.

   In order to provide Layer-2 connectivity between a set of virtual
   machines that run on host machines in different IP subnets (for
   example, in a Layer-3 based datacenter, or even in datacenters owned
   and operated by different providers), we simply build an overlay
   network connecting the host machines.

   When the VM passes the ARP / NS packet to the virtual NIC, the
   hypervisor intercepts the packet, records which VM generated the
   request, and extracts the IP address to be resolved.  It then
   queries a mapping server with the guest VM identifier and the
   requested address to determine the IP address of the host machine
   that hosts the requested destination VM, the VM identifier on that
   host, and the virtual MAC address assigned to that virtual machine.
   Once the source hypervisor receives this information it caches it,
   and then either encapsulates the original ARP / NS in an
   encapsulation / tunneling mechanism (similar to GRE) or simply
   synthesizes a response and hands that back to the source VM.
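
   As a rough illustration, the following sketch (in Python) shows the
   interception and lookup logic described above.  All of the names in
   it are hypothetical, it uses an in-memory stand-in for the mapping
   server, and it does not reflect any particular hypervisor's API:

      # Hypothetical sketch of ARP / NS interception in the
      # hypervisor.  None of these names come from a real API.
      from dataclasses import dataclass

      @dataclass
      class Mapping:
          host_ip: str      # host machine hosting the destination VM
          dest_vm_id: str   # VM identifier on that host
          virtual_mac: str  # virtual MAC of the destination VM

      # Stand-in for the mapping server, keyed on (source VM ID,
      # destination IP).  In reality this is a query over the network.
      MAPPING_SERVER = {
          ("vm-a", "10.0.2.9"):
              Mapping("192.0.2.20", "vm-b", "02:00:00:00:00:02"),
      }
      cache = {}

      def resolve(source_vm_id, dest_ip):
          """Called when the hypervisor intercepts an ARP / NS
          request from source_vm_id for dest_ip."""
          key = (source_vm_id, dest_ip)
          if key not in cache:
              mapping = MAPPING_SERVER.get(key)
              if mapping is None:
                  return None          # no mapping: drop the request
              cache[key] = mapping     # cache for later data packets
          # Synthesize a response: the guest just sees an ARP reply.
          return cache[key].virtual_mac

      print(resolve("vm-a", "10.0.2.9"))  # -> 02:00:00:00:00:02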

   Presumably the source VM initiated resolution of the destination VM
   because it wanted to send traffic to it, so shortly after the source
   has resolved the destination it will try to send a data packet to
   it.  Once this data packet reaches the hypervisor on the source host
   machine, the hypervisor simply encapsulates the packet in a
   tunneling protocol and ships it over the IP network to the
   destination.  When the packet reaches the destination host machine,
   the packet is decapsulated, the VM ID is extracted, and the packet
   is passed up to the destination VM.  (TODO (WK): We need a tunneling
   mechanism that has a place to put the VM ID -- find one, extend one,
   or simply define a new one.)  In many ways much of this is similar
   to LISP...
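
   One candidate for the TODO above is the optional 32-bit Key field in
   the GRE header, which could carry the VM ID.  Purely as an
   illustration of that option (this is a sketch, not a design
   decision):

      # Sketch: GRE-style encapsulation carrying the VM ID in the
      # 32-bit Key field.  0x6558 is the EtherType for Transparent
      # Ethernet Bridging (i.e. an encapsulated Ethernet frame).
      import struct

      GRE_KEY_PRESENT = 0x2000   # the 'K' bit in the GRE flags

      def encapsulate(inner_frame: bytes, vm_id: int) -> bytes:
          header = struct.pack("!HHI", GRE_KEY_PRESENT, 0x6558, vm_id)
          return header + inner_frame

      def decapsulate(packet: bytes):
          flags, ethertype, vm_id = struct.unpack("!HHI", packet[:8])
          assert flags & GRE_KEY_PRESENT and ethertype == 0x6558
          return vm_id, packet[8:]   # (VM ID, original frame)

      vm_id, frame = decapsulate(encapsulate(b"...frame...", 42))
      print(vm_id)   # -> 42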

   As the ability to resolve (and so send traffic to) a given machine
   requires getting the information from a mapping server,
   communication between hosts can be easily granted and revoked by the
   mapping server.  It is expected that the mapping server will know
   which VMs are owned by each customer and will, by default, allow
   access between only those VMs (and a gateway, see below), but if the
   operator so chooses it can (but probably shouldn't!) allow access
   between VMs owned by different customers, etc.  In addition, because
   the mapping server uses both the IP address and the VM ID to look up
   the destination information (and the traffic between VMs is
   encapsulated), overlapping customer address space is seamlessly
   handled (other than in the pathological case where operators allow
   the customers to interconnect at L2!).
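
   A hedged sketch of such a mapping table, keyed on both the customer
   identity and the IP address (the table contents are invented for
   illustration):

      # Mapping-server table keyed on (customer ID, destination IP).
      # Two customers can use the same addresses without conflict,
      # and deleting a row revokes future resolutions.
      TABLE = {
          ("cust-1", "10.0.0.5"): ("192.0.2.20", "vm-7"),
          ("cust-2", "10.0.0.5"): ("198.51.100.9", "vm-3"),  # same IP
      }

      def lookup(customer_id, dest_ip):
          # None means "no mapping": access is simply not granted.
          return TABLE.get((customer_id, dest_ip))

      def revoke(customer_id, dest_ip):
          TABLE.pop((customer_id, dest_ip), None)

      print(lookup("cust-1", "10.0.0.5"))  # -> ('192.0.2.20', 'vm-7')
      revoke("cust-1", "10.0.0.5")
      print(lookup("cust-1", "10.0.0.5"))  # -> None (access revoked)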

   Obviously, just having a bunch of customer machines communicating
   amongst themselves isn't very useful - the customer will want to
   reach them externally, they will be serving traffic to / from the
   Internet, etc.  This functionality is provided by gateway machines -
   these machines decapsulate traffic that is destined to locations
   outside the virtual network, encapsulate traffic bound for the
   destination network, etc.

   By encapsulating the packet (for example, in a GRE packet) the
   Hypervisor can provide a virtual, transparent network to the
   receiver.  In order to obtain the information necessary to
   encapsulate the packet (for example, the IP address of the machine
   hosting the receiving VM), the sending Hypervisor queries the
   Mapping Service.  This service maps the tuple of (Customer_ID,
   Destination Address) to the host machine hosting the instance.

   For example, if guest machine GA, owned by customer CA, on host
   machine HX wishes to send a packet to guest machine GB (also owned
   by customer CA) on host machine HY, it would generate an ARP request
   (or, in IPv6 land, a neighbor solicitation) for GB.  The Hypervisor
   process on HX would intercept the ARP and query the Mapping Service
   for (CA, GB), which would reply with the address of HY (Hypervisors
   also cache this information).  The Hypervisor on HX would then
   encapsulate the ARP request packet in a GRE packet, setting the
   destination to be HY, and send the packet.  When the Hypervisor
   process on HY receives the packet, it would decapsulate it and hand
   it to the guest instance GB.  This process is transparent to GA and
   GB - as far as they are concerned, they are both connected to a
   single network.  While the above might sound like a heavyweight
   operation, the hypervisor is (in general) already examining all
   packets to provide a virtualized switch, performing access control
   functions and similar - performing the mapping functionality and
   encapsulation / decapsulation is not expected to be expensive.
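
   The exchange above can be condensed into a short end-to-end sketch
   (GA, GB, CA, HX, and HY are taken from the example; the data
   structures are, of course, invented):

      # End-to-end sketch of the GA -> GB example.
      MAPPING_SERVICE = {("CA", "GB"): "HY"}  # (customer, guest)->host
      hx_cache = {}

      def hx_intercept_arp(customer="CA", target="GB"):
          host = hx_cache.get((customer, target))
          if host is None:
              host = MAPPING_SERVICE[(customer, target)]
              hx_cache[(customer, target)] = host  # Hypervisors cache
          # Encapsulate the ARP request (e.g. in GRE), destined to HY.
          return {"outer_dst": host,
                  "inner": ("ARP request for", target)}

      def hy_receive(packet):
          # Decapsulate and hand the inner frame to guest GB; GA and
          # GB never see the encapsulation.
          return packet["inner"]

      print(hy_receive(hx_intercept_arp()))
      # -> ('ARP request for', 'GB')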

   The Mapping Service contains information about all of the guest
   machines, which customer they are associated with, and routes to
   external networks.  If a guest machine sends a packet that is
   destined to an external network (such as a host on the Internet), the
   mapping server returns the address of a Gateway.
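
   A sketch of that fallback behaviour (the subnets and addresses are
   illustrative only):

      # Sketch: fall back to a Gateway for external destinations.
      import ipaddress

      GUEST_SUBNETS = [ipaddress.ip_network("10.0.0.0/16")]
      GUEST_TABLE = {"10.0.2.9": "192.0.2.20"}  # guest IP -> host
      GATEWAY = "192.0.2.1"                     # a gateway machine

      def resolve(dest_ip):
          addr = ipaddress.ip_address(dest_ip)
          if any(addr in net for net in GUEST_SUBNETS):
              return GUEST_TABLE[dest_ip]  # internal: hosting machine
          return GATEWAY                   # external: via the Gateway

      print(resolve("10.0.2.9"))     # -> 192.0.2.20
      print(resolve("203.0.113.7"))  # -> 192.0.2.1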


5.  IANA Considerations

   No action required.


6.  Security Considerations


7.  Privacy

   There


8.  Acknowledgements

   I would like to thank Google for 20% time.


9.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.


Authors' Addresses

   Warren Kumari
   Google
   1600 Amphitheatre Parkway
   Mountain View, CA  94043
   US

   Email: warren@kumari.net


   Joel M. Halpern
   Ericsson


   Email: joel.halpern@ericsson.com