Internet Engineering Task Force                           T. Narten, Ed.
Internet-Draft                                                       IBM
Intended status: Informational                            September 2011
Expires: March 4, 2012


    Problem Statement: Using L3 Overlays for Network Virtualization
              draft-narten-overlay-problem-statement-00

Abstract

   This document lays out the case for developing L3 overlays to provide
   network virtualization.  In addition, the document describes the
   issues that need to be resolved in order to produce an interoperable
   standard.

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on March 4, 2012.

Narten                  Expires March 4, 2012                   [Page 1]

Table of Contents

   1.  Introduction
   2.  Problem Details
     2.1.  Limitations Imposed by Spanning Tree and VLAN Spaces
     2.2.  Multitenant Environments
     2.3.  Inadequate Table Sizes at ToR Switch
     2.4.  Decoupling Logical and Physical Configuration
   3.  Overlay Network Framework
     3.1.  Standardization Issues for Overlay Networks
   4.  Benefits of an Overlay Approach
   5.  Related Work
     5.1.  ARMD
     5.2.  TRILL
     5.3.  L2VPNs
   6.  Further Work
   7.  Summary
   8.  Acknowledgments
   9.  IANA Considerations
   10. Security Considerations
   11. Informative References
   Author's Address
   Intellectual Property and Copyright Statements

1.  Introduction

   Server virtualization is increasingly becoming the norm in data
   centers.  With server virtualization, each physical server supports
   multiple virtual machines (VMs), each running its own operating
   system, middleware and applications.  Virtualization is a key enabler
   of workload agility, i.e., allowing any server to host any
   application and providing the flexibility to add, shrink, or move
   services within the physical infrastructure.  Server virtualization
   provides numerous benefits, including higher utilization, increased
   data security, reduced user downtime, and reduced power usage.

   Server virtualization is driving and accentuating scaling limitations
   in existing datacenter networks.  Placement and movement of VMs in a
   network effectively requires that VM IP addresses be fixed and
   static.  From an IP perspective, that means VMs can only move within
   a single IP subnet and, in particular, cannot migrate from one IP
   subnet to another.
   In practice, this leads to a desire for larger and flatter L2
   networks, so that a given VM can be placed at (or moved to) any
   physical location within the datacenter without being constrained by
   subnet boundaries.

   The general scaling problems of large, flat L2 networks are well
   known; broadcast storms, for example, date back to the earliest
   deployments.  Current network deployments, however, are experiencing
   additional and new pain points.  For example, the lack of
   multipathing in traditional STP/RSTP Ethernets makes it impossible to
   fully use available network capacity, because STP/RSTP eliminates
   loops by excluding (and not using) redundant links.  The increase in
   both the number of physical machines and the number of VMs per
   physical machine has led to MAC address explosion, whereby switches
   need increasingly large forwarding tables to handle the traffic they
   switch.  Finally, the 4094-VLAN limit is no longer sufficient in a
   shared infrastructure servicing multiple tenants.

   This document outlines the problems encountered in scaling L2
   networks in a datacenter and makes the case that an overlay-based
   approach, in which individual L2 networks are implemented within
   individual L3 "domains", provides a number of advantages over current
   approaches.

2.  Problem Details

2.1.  Limitations Imposed by Spanning Tree and VLAN Spaces

   Current Layer 2 networks use the Spanning Tree Protocol (STP) to
   avoid loops in the network due to duplicate paths.  STP turns off
   links to avoid the replication and looping of frames.  Some data
   center operators see this as a problem with Layer 2 networks in
   general, since with STP they are effectively paying for more ports
   and links than they can use.  In addition, resiliency via
   multipathing is not available with the STP model.  Newer initiatives
   such as TRILL have been proposed to enable multipathing and thus
   surmount some of the problems with STP.
   STP limitations may be avoided by configuring servers within a rack
   to be on the same Layer 3 network, with switching happening at Layer
   3 both within the rack and between racks.  However, this is
   incompatible with the desire to be able to move VMs anywhere within
   the datacenter.

   Another characteristic of Layer 2 data center networks is their use
   of Virtual LANs (VLANs) to provide broadcast isolation.  A 12-bit
   VLAN ID carried in Ethernet data frames divides the larger Layer 2
   network into multiple broadcast domains.  This has served well for
   smaller data centers that need fewer than 4094 VLANs.  With the
   growing adoption of virtualization, this upper limit is coming under
   pressure.  Moreover, due to STP, several data centers limit the
   number of VLANs that can be used.  In addition, requirements for
   multitenant environments accelerate the need for larger VLAN limits,
   as discussed below.

2.2.  Multitenant Environments

   Cloud computing involves on-demand elastic provisioning of resources
   for multitenant environments.  The most common example of cloud
   computing is the public cloud, where a cloud service provider offers
   these elastic services to multiple customers over the same
   infrastructure.

   Isolation of network traffic by tenant could be done via Layer 2 or
   Layer 3 networks.  For Layer 2 networks, VLANs are often used to
   segregate traffic; a tenant could be identified by its own VLAN, for
   example.  Due to the large number of tenants that a cloud provider
   might service, the 4094-VLAN limit is often inadequate.  In addition,
   there is often a need for multiple VLANs per tenant, which
   exacerbates the issue.

   Another use case is cross-pod expansion.  A pod typically consists of
   one or more racks of servers with associated network and storage
   connectivity.  Tenants may start off on one pod and, due to
   expansion, require servers/VMs on other pods, especially when tenants
   on the other pods are not fully utilizing all their resources.
   This use case requires a "stretched" Layer 2 environment connecting
   the individual servers/VMs.

   Layer 3 networks are not a complete solution for multitenancy either.
   Two tenants might use the same set of Layer 3 addresses within their
   networks, which requires the cloud provider to provide isolation in
   some other form.  Further, requiring all tenants to use IP excludes
   customers relying on direct Layer 2 or non-IP Layer 3 protocols for
   inter-VM communication.

2.3.  Inadequate Table Sizes at ToR Switch

   Today's virtualized environments place additional demands on the MAC
   address tables of Top-of-Rack (ToR) switches, which connect to the
   servers.  Instead of just one MAC address per server link, the ToR
   now has to learn the MAC addresses of the individual VMs (which could
   number in the hundreds per server).  This is required because traffic
   between the VMs and the rest of the physical network traverses the
   link to the switch.

   A typical ToR switch could connect to 24 or 48 servers, depending on
   the number of its server-facing ports.  A data center might consist
   of several racks, so each ToR switch would need to maintain an
   address table covering the communicating VMs across the various
   physical servers.  This places a much larger demand on table capacity
   than in non-virtualized environments.  If the table overflows, the
   switch may stop learning new addresses until idle entries age out,
   leading to significant flooding of unknown-destination frames.

2.4.  Decoupling Logical and Physical Configuration

   Data center operators must be able to achieve high utilization of
   server and network capacity.  To achieve efficiency, it should be
   possible to assign workloads that operate in a single Layer 2 network
   to any server in any rack in the network.  It should also be possible
   to migrate workloads to any server anywhere in the network while
   retaining the workload's addresses.
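   The table-capacity pressure described in Section 2.3 can be made
   concrete with simple arithmetic.  The figures below (rack count,
   servers per rack, VMs per server) are illustrative assumptions, not
   numbers taken from this document:

```python
# Illustrative worst-case estimate of ToR MAC table demand when every
# VM in the data center may exchange traffic with VMs behind this ToR.
# All parameter values are assumed examples, not measurements.

def mac_table_demand(racks, servers_per_rack, vms_per_server):
    # One learned entry per VM; the worst case is simply the total
    # VM population that this switch's table may have to hold.
    return racks * servers_per_rack * vms_per_server

# Non-virtualized baseline: one MAC address per server link.
baseline = mac_table_demand(racks=20, servers_per_rack=48, vms_per_server=1)

# Virtualized: 50 VMs per server multiplies the demand 50-fold.
virtualized = mac_table_demand(racks=20, servers_per_rack=48, vms_per_server=50)

print(baseline, virtualized)
```

   With these assumed figures, the requirement grows from under a
   thousand entries to tens of thousands, which is the overflow and
   flooding risk described above.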
   This can be achieved today by stretching VLANs.  When workloads
   migrate, however, the physical network needs to be reconfigured,
   which is typically error prone.  By decoupling the workload's
   location on the infrastructure LAN from the network addresses VMs use
   when communicating with each other, the network administrator can
   configure the network once rather than every time a service migrates.
   This decoupling enables any server to become part of any server
   resource pool.

3.  Overlay Network Framework

   The idea behind overlays is straightforward.  Take the set of
   machines that are allowed to communicate with each other and group
   them into a high-level construct called a domain.  A domain could be
   one L2 VLAN, a single IP subnet, or just an arbitrary collection of
   machines.  The domain identifies the set of machines that are allowed
   to communicate with each other directly and provides isolation from
   machines not in the same domain.  The overlay connects the machines
   of a particular domain together.  A switch connects each machine to
   its domain, accepting Ethernet frames from attached VMs and
   encapsulating them for transport across the IP overlay.  An egress
   switch decapsulates the frame and delivers it to the target VM.

3.1.  Standardization Issues for Overlay Networks

   To provide a complete overlay solution, several issues need to be
   resolved.

   First, an overlay header is needed for transporting encapsulated
   Ethernet frames across the IP network to their ultimate destination
   within a specific domain.  To provide multitenancy, the overlay
   header needs a field identifying which domain an encapsulated packet
   belongs to; consequently, some sort of Domain Identifier is needed.
   VXLAN [I-D.mahalingam-dutt-dcops-vxlan] uses a 24-bit VXLAN Network
   Identifier (VNI), while NVGRE [I-D.sridharan-virtualization-nvgre]
   uses a 24-bit Tenant Network Identifier (TNI).
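   As a concrete (and purely hypothetical) illustration of such an
   encapsulation, the sketch below packs an 8-byte overlay header
   carrying a 24-bit domain identifier in front of an inner Ethernet
   frame.  The field layout is invented for illustration; it is not the
   VXLAN or NVGRE wire format:

```python
import struct

# Hypothetical 8-byte overlay header: 1 flags byte, 3 reserved bytes,
# a 3-byte (24-bit) domain identifier, and 1 trailing reserved byte.
# This layout is illustrative only, not a format defined by any draft.

def encapsulate(domain_id, inner_frame):
    """Prefix an inner Ethernet frame with the overlay header."""
    if not 0 <= domain_id < 1 << 24:
        raise ValueError("domain identifier must fit in 24 bits")
    header = struct.pack("!B3s3sB", 0x08, b"\x00\x00\x00",
                         domain_id.to_bytes(3, "big"), 0x00)
    return header + inner_frame

def decapsulate(packet):
    """Return (domain_id, inner_frame) from an encapsulated packet."""
    domain_id = int.from_bytes(packet[4:7], "big")
    return domain_id, packet[8:]

# Round trip: the egress side recovers the domain and the inner frame.
pkt = encapsulate(5001, b"inner-ethernet-frame")
assert decapsulate(pkt) == (5001, b"inner-ethernet-frame")
```

   In a real deployment the resulting packet would additionally be
   wrapped in outer UDP/IP (or IP) headers addressed to the egress
   switch.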
   The details of a specific overlay header format need to be worked
   out.  Questions to be resolved include whether to use a standard
   format such as GRE [RFC2784] [RFC2890] or to define one specifically
   tailored to the requirements of an overlay network, and whether to
   use UDP (in order to facilitate transport through NATs and other
   middlebox devices) or to build directly on top of IP, as GRE does.
   Additionally, the encapsulated payload could include a full Ethernet
   header (source and destination MAC addresses, VLAN information,
   etc.) or some subset thereof.

   Second, an address mapping system is needed to map the destination
   address specified by the originating VM to the IP address of the
   egress switch to which the Ethernet frame will be tunneled.  VXLAN
   uses a "learning" approach for this, similar to what L2 bridges use.
   NVGRE [I-D.sridharan-virtualization-nvgre] has not indicated how it
   proposes to perform address mapping, leaving the details for a later
   document.  Other approaches are possible, such as managing mappings
   in a centralized control manager; such managers commonly exist in
   datacenters already as part of managing VM placement and migration,
   and already maintain knowledge about the current locations of VMs.
   Use of a centralized controller would require the development of a
   protocol for distributing address mappings from the controller to
   the switches where encapsulation takes place.

   Another aspect of address mapping concerns the handling of broadcast
   and multicast frames, as well as the delivery of unicast packets for
   which no mapping exists.  One approach is to flood such frames to
   all machines belonging to the domain.  Both VXLAN and NVGRE suggest
   associating an IP multicast address from the network infrastructure
   with each domain as a way of connecting together all the machines in
   that domain.
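   The learning approach can be sketched as a simple mapping table
   maintained by each encapsulating switch; the class and method names
   below are illustrative, not drawn from any specification:

```python
# Sketch of bridge-style "learning" for the overlay address mapping:
# decapsulation learns (domain, inner source MAC) -> egress switch IP;
# encapsulation looks up the destination, falling back to the domain's
# flood address (e.g., an IP multicast group) on a miss.

class LearningMapper:
    def __init__(self, flood_addr):
        self.flood_addr = flood_addr   # per-domain flood/multicast address
        self.mappings = {}             # (domain_id, mac) -> egress switch IP

    def learn(self, domain_id, src_mac, outer_src_ip):
        """Invoked on decapsulation: the inner source MAC must sit
        behind the switch that sent the outer packet."""
        self.mappings[(domain_id, src_mac)] = outer_src_ip

    def egress_for(self, domain_id, dst_mac):
        """Invoked on encapsulation: unicast tunnel endpoint if known,
        otherwise flood to every switch in the domain."""
        return self.mappings.get((domain_id, dst_mac), self.flood_addr)

m = LearningMapper(flood_addr="239.1.1.1")
m.learn(5001, "52:54:00:aa:bb:01", "10.0.1.7")
print(m.egress_for(5001, "52:54:00:aa:bb:01"))   # learned entry
print(m.egress_for(5001, "52:54:00:aa:bb:02"))   # unknown: flood address
```

   A centralized-controller design would replace the learn-on-receive
   step with mappings pushed down by a management system.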
   All VMs within a domain can then be reached by sending encapsulated
   packets to the domain's IP multicast address.

   Another issue is whether fragmentation is needed.  Whenever tunneling
   is used, one faces the potential problem that the packet plus
   encapsulation overhead will exceed the MTU of the path to the egress
   router.  Fragmentation could be left to IP, could be done at the
   overlay level in a more optimized fashion, or could be left out
   altogether if it is believed that datacenter networks can be
   engineered to prevent MTU issues from arising.

   Finally, successful deployment of an overlay approach will likely
   require appropriate Operations, Administration and Maintenance (OAM)
   facilities.

4.  Benefits of an Overlay Approach

   A key aspect of overlays is the decoupling of the "virtual" MAC and
   IP addresses used by VMs from the physical network infrastructure and
   the infrastructure IP addresses used by the datacenter.  If a VM
   changes location, the switches at the edge of the overlay simply
   update their mapping tables to reflect the new location of the VM
   within the data center's infrastructure address space.  Because IP is
   used, a VM can be located anywhere in the data center without regard
   to traditional constraints implied by L2 properties, such as VLAN
   numbering or the need to scope an L2 broadcast domain to a single pod
   or ToR switch.

   Multitenancy is supported by isolating the traffic of one domain from
   that of another.  Traffic from one domain cannot be delivered to
   another domain without (conceptually) exiting the first domain and
   reentering the other.  Likewise, external communication (from a VM
   within a domain to a machine outside of any domain) is handled by
   having the ingress switch tunnel traffic to an egress switch, which
   decapsulates the tunneled packet and delivers it to an external
   router for normal processing.  Such a router is external to the
   overlay and behaves much like existing external-facing routers in
   datacenters today.
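   The fragmentation question raised in Section 3.1 can be quantified
   with back-of-the-envelope arithmetic.  The header sizes below assume
   a hypothetical UDP-based encapsulation with an 8-byte overlay header;
   they are illustrative assumptions, not values from this document:

```python
# Encapsulation overhead versus a standard 1500-byte path MTU, assuming
# IPv4 outer headers without options and an untagged inner frame.

INNER_ETH_HDR = 14     # inner Ethernet header (no 802.1Q tag)
OUTER_IPV4_HDR = 20    # outer IPv4 header, no options
OUTER_UDP_HDR = 8      # outer UDP header
OVERLAY_HDR = 8        # assumed 8-byte overlay header

def outer_packet_size(inner_ip_payload):
    """Size of the outer IP packet carrying one inner Ethernet frame."""
    inner_frame = INNER_ETH_HDR + inner_ip_payload
    return OUTER_IPV4_HDR + OUTER_UDP_HDR + OVERLAY_HDR + inner_frame

overhead = outer_packet_size(0)        # bytes of pure encapsulation
full_inner = outer_packet_size(1500)   # full-sized inner IP packet

print(overhead, full_inner)
```

   Under these assumptions a full-sized inner packet overflows a
   1500-byte path MTU by 50 bytes, so an operator must raise the
   infrastructure MTU, lower the VM-visible MTU, or rely on some form
   of fragmentation.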
   The use of a large (e.g., 24-bit) domain identifier would allow up to
   16 million distinct domains within a single datacenter, eliminating
   current VLAN size limitations.  Using an overlay that sits above IP
   also allows leveraging the full range of IP technologies, including
   quality-of-service (QoS) and Equal-Cost Multipath (ECMP) routing for
   load balancing across multiple links.

   Overlays are designed to handle the common case of a set of VMs
   placed within a single L2 broadcast domain.  Such configurations
   include VMs placed within a single VLAN or IP subnet.  All such VMs
   would be placed into a common overlay domain.

5.  Related Work

5.1.  ARMD

   ARMD is chartered to look at data center scaling issues with a focus
   on address resolution.  ARMD is currently chartered to develop a
   problem statement and is not currently developing solutions.  An
   overlay-based approach would address many, if not all, of the "pain
   points" that have been raised in ARMD.

5.2.  TRILL

   TRILL is an L2-based approach aimed at improving deficiencies and
   limitations of current Ethernet networks.  While TRILL provides a
   good approach to improving current Ethernets, it is entirely L2 based
   and is not designed to scale to the sizes that current data centers
   require.  [RFC6325] explicitly says:

      The TRILL protocol, as specified herein, is designed to be a Local
      Area Network protocol and not designed with the goal of scaling
      beyond the size of existing bridged LANs.

5.3.  L2VPNs

   The IETF has specified a number of approaches for connecting L2
   domains together as part of the L2VPN Working Group.  That group,
   however, is focused on provider-provisioned L2 VPNs, whereas overlay
   approaches can be used in data centers where the overlay network is
   managed by the datacenter operator.  Other L2VPN approaches, such as
   L2TP [RFC2661], require significant tunnel state at the
   encapsulating and decapsulating endpoints.
   Overlays require less tunnel state than such approaches, which is
   important to allow overlays to scale to hundreds of thousands of
   endpoints.  It is assumed that smaller switches (i.e., virtual
   switches in hypervisors or the physical switches to which VMs
   connect) will be part of the overlay network and will be responsible
   for encapsulating and decapsulating packets.

6.  Further Work

   It is believed that overlay-based approaches can reduce the overall
   amount of flooding and other multicast- and broadcast-related
   traffic (e.g., ARP and ND) experienced within current datacenters
   with a large, flat L2 network.  Further analysis is needed to
   characterize the expected improvement.

7.  Summary

   This document has argued that network virtualization using L3
   overlays addresses a number of issues being faced as data centers
   scale in size.

8.  Acknowledgments

   This document incorporates significant amounts of text from
   [I-D.mahalingam-dutt-dcops-vxlan].  Specifically, much of Section 2
   is incorporated verbatim from Section 3 of
   [I-D.mahalingam-dutt-dcops-vxlan].  The authors of that document
   include Mallik Mahalingam (VMware), Dinesh G. Dutt (Cisco), Kenneth
   Duda (Arista Networks), Puneet Agarwal (Broadcom), Lawrence Kreeger
   (Cisco), T. Sridhar (VMware), Mike Bursell (Citrix), and Chris
   Wright (Red Hat).

   Additional text in Section 2 was taken from
   [I-D.sridharan-virtualization-nvgre].  The authors of that document
   include Murari Sridharan (Microsoft), Kenneth Duda (Arista
   Networks), Ilango Ganga (Intel), Albert Greenberg (Microsoft), Geng
   Lin (Dell), Mark Pearson (Hewlett-Packard), Patricia Thaler
   (Broadcom), Chait Tumuluri (Emulex), Narasimhan Venkataramiah
   (Microsoft) and Yu-Shun Wang (Microsoft).

9.  IANA Considerations

   This memo includes no request to IANA.

10.  Security Considerations

   TBD

11.  Informative References

   [I-D.mahalingam-dutt-dcops-vxlan]
              Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger,
              L., Sridhar, T., Bursell, M., and C. Wright, "VXLAN: A
              Framework for Overlaying Virtualized Layer 2 Networks
              over Layer 3 Networks",
              draft-mahalingam-dutt-dcops-vxlan-00 (work in progress),
              August 2011.

   [I-D.sridharan-virtualization-nvgre]
              Sridharan, M., Duda, K., Ganga, I., Greenberg, A., Lin,
              G., Pearson, M., Thaler, P., Tumuluri, C.,
              Venkataramaiah, N., and Y. Wang, "NVGRE: Network
              Virtualization using Generic Routing Encapsulation",
              draft-sridharan-virtualization-nvgre-00 (work in
              progress), September 2011.

   [RFC2661]  Townsley, W., Valencia, A., Rubens, A., Pall, G., Zorn,
              G., and B. Palter, "Layer Two Tunneling Protocol
              "L2TP"", RFC 2661, August 1999.

   [RFC2784]  Farinacci, D., Li, T., Hanks, S., Meyer, D., and P.
              Traina, "Generic Routing Encapsulation (GRE)", RFC 2784,
              March 2000.

   [RFC2890]  Dommety, G., "Key and Sequence Number Extensions to GRE",
              RFC 2890, September 2000.

   [RFC6325]  Perlman, R., Eastlake, D., Dutt, D., Gai, S., and A.
              Ghanwani, "Routing Bridges (RBridges): Base Protocol
              Specification", RFC 6325, July 2011.

Author's Address

   Thomas Narten (editor)
   IBM

   Email: narten@us.ibm.com

Full Copyright Statement

   Copyright (C) The IETF Trust (2011).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.
   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.