Internet Engineering Task Force                               T. Narten
Internet-Draft                                                      IBM
Intended status: Informational                             May 03, 2013
Expires: November 04, 2013


              An Architecture for Overlay Networks (NVO3)
                      draft-narten-nvo3-arch-00

Abstract

This document presents a high-level overview of a possible architecture for building overlay networks in NVO3. The architecture is given at a high level, showing the major components of an overall system. An important goal is to divide the space into individual smaller components that can be implemented independently, with clear interfaces and interactions with other components. It should be possible to build and implement individual components in isolation and have them work with other components without requiring changes to those components. That way, implementers have flexibility in implementing individual components and can optimize and innovate within their respective components without necessarily requiring changes to other components.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on November 04, 2013.

Copyright Notice

Copyright (c) 2013 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

1.  Introduction
2.  Terminology
3.  Background
4.  Network Virtualization Edge (NVE)
  4.1.  NVE Co-located With Server Hypervisor
  4.2.  Bare Metal Servers
  4.3.  Split-NVE
5.  Address Mapping Dissemination
  5.1.  Network Virtualization Authority
  5.2.  NVE-NVA Interaction Models
  5.3.  NVE-to-NVA Protocol
    5.3.1.  Push vs. Pull Model
  5.4.  Intra-NVA Control Protocol
  5.5.  Inter-NVA Control Protocol
  5.6.  Control Protocol Summary
6.  NVO3 Data Plane Encapsulation
7.  Summary
8.  IANA Considerations
9.  Security Considerations
10. Informative References
Author's Address

1.  Introduction

This document presents a high-level overview of a possible architecture for building overlay networks in NVO3. The architecture is given at a high level, showing the major components of an overall system. An important goal is to divide the space into individual smaller components that can be implemented independently, with clear interfaces and interactions with other components. It should be possible to build and implement individual components in isolation and have them work with other components without requiring changes to those components. That way, implementers have flexibility in implementing individual components and can optimize and innovate within their respective components without necessarily requiring changes to other components.

The motivation for overlay networks is given in [I-D.ietf-nvo3-overlay-problem-statement]. "Framework for DC Network Virtualization" [I-D.ietf-nvo3-framework] provides a framework for discussing overlay networks generally and the various components that must work together in building such systems. This document differs from the framework document in that it does not attempt to cover all possible approaches within the general design space. Rather, it describes one particular approach. This document is intended to be a concrete strawman that can be used for discussion within the IETF NVO3 WG on what the NVO3 architecture should look like.

2.  Terminology

This document uses the same terminology as [I-D.ietf-nvo3-framework].

3.  Background

Overlay networks provide networking service to a set of Tenant Systems (TSs) [I-D.ietf-nvo3-framework]. Tenant Systems connect to Virtual Networks (VNs), with the VN's attributes defining aspects of the network, including the set of members belonging to that specific virtual network. Tenant Systems connected to a virtual network communicate freely with other Tenant Systems on the same VN, but communication between Tenant Systems on one VN and those on another VN (or not connected to any VN) is carefully restricted and governed by policy.

A Virtual Network provides either L2 or L3 service to connected tenants. For L2 service, VNs transport Ethernet frames, and a Tenant System is provided with a service that is analogous to being connected to a specific L2 C-VLAN. L2 broadcast frames are delivered to all (and multicast frames delivered to a subset of) the other Tenant Systems on the VN. To a Tenant System, it appears as if it were connected to a regular L2 Ethernet link. Within NVO3, tenant frames are tunneled to remote NVEs based on the MAC addresses of the frame headers as originated by the Tenant System. On the underlay, NVO3 packets are forwarded between NVEs based on the outer addresses of tunneled packets.

For L3 service, a Tenant System still connects to the network via an L2 Ethernet link, but all traffic to and from the Tenant System is assumed to be IP. The L2 headers are only used to provide backwards compatibility, so that unmodified Tenant Systems can operate unchanged when using NVO3. Within NVO3, tenant frames are tunneled to remote NVEs based on the IP addresses of the packets originated by the Tenant System; the L2 destination addresses provided by Tenant Systems are effectively ignored.
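The practical difference between the two service types shows up in the key an NVE uses when deciding where to tunnel a tenant frame. The following Python fragment is a purely illustrative sketch (not part of this architecture, with all names hypothetical): for L2 service the lookup is keyed on the tenant destination MAC address, while for L3 service it is keyed on the inner destination IP address.

   # Illustrative sketch only: how an NVE's lookup key differs between
   # L2 and L3 service.  All names here are hypothetical.

   from dataclasses import dataclass
   from typing import Dict, Optional, Tuple

   @dataclass(frozen=True)
   class VnContext:
       vn_id: int          # Context ID / virtual network identifier
       service: str        # "L2" or "L3"

   # Per-VN mapping tables populated from the NVA (see Section 5):
   #   L2 service: (vn_id, tenant MAC) -> egress NVE underlay IP
   #   L3 service: (vn_id, tenant IP)  -> egress NVE underlay IP
   l2_mappings: Dict[Tuple[int, str], str] = {}
   l3_mappings: Dict[Tuple[int, str], str] = {}

   def egress_nve(vn: VnContext, dst_mac: str, dst_ip: str) -> Optional[str]:
       """Return the underlay IP of the egress NVE for a tenant frame."""
       if vn.service == "L2":
           # L2 service: forward on the tenant frame's destination MAC.
           return l2_mappings.get((vn.vn_id, dst_mac))
       # L3 service: forward on the inner IP destination; the tenant's
       # L2 destination address is effectively ignored.
       return l3_mappings.get((vn.vn_id, dst_ip))

How the mapping tables themselves are populated is the subject of Section 5.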
It is important to note that whether NVO3 provides L2 or L3 service to a Tenant System, the tenant endpoint does not need to be aware of the distinction. The Tenant System still connects to an NVO3 network via an L2 link. L2 service is intended for systems that need native L2 Ethernet service and the ability to run protocols directly over Ethernet (i.e., not based on IP). L3 service is intended for systems in which all the traffic can safely be assumed to be IP.

4.  Network Virtualization Edge (NVE)

As described in [I-D.ietf-nvo3-framework], a Network Virtualization Edge (NVE) is the entity that resides at the boundary between a Tenant System and the overlay network and implements the overlay functionality. Towards the Tenant System, the NVE provides L2 (or L3) service. Towards the data center network, the NVE sends and receives native IP traffic. When ingressing traffic from a Tenant System, the NVE identifies the egress NVE to which the packet should be sent, adds an overlay encapsulation header, and sends the packet on the underlay network. When egressing traffic, an NVE receives an encapsulated packet from a remote NVE via the underlay network, strips off the encapsulation header, and delivers the (original) packet to the appropriate Tenant System.

Conceptually, the NVE is a single entity implementing the NVO3 functionality. An NVE will have two external interfaces:

Tenant Facing: On the tenant-facing side, an NVE interacts with the Tenant System to provide the NVO3 service. An NVE will need to learn when a Tenant System "attaches" to a virtual network (so it can validate the request and set up any state needed to send and receive traffic on behalf of the Tenant System on that VN). Likewise, an NVE will need to be informed when the Tenant System "detaches" from the virtual network so that it can reclaim state and resources appropriately.

DCN Facing: On the data center network facing side, an NVE interfaces with the data center underlay network, sending and receiving IP packets to and from the underlay.

4.1.  NVE Co-located With Server Hypervisor

When server virtualization is used, the entire NVE functionality will typically be implemented as part of the hypervisor and/or vSwitch on the server. In such cases, the Tenant System interacts with the hypervisor, and the hypervisor interacts with the NVE. Because the hypervisor and NVE interaction is implemented entirely in software on the server, there is no "on-the-wire" protocol between Tenant Systems (or the hypervisor) and the NVE that needs to be standardized. While there may be APIs between the NVE and hypervisor to support the necessary interaction, the details of such an API are not in scope for the IETF to work on.

Implementing NVE functionality entirely on a server has the disadvantage that precious server CPU resources must be spent implementing the NVO3 functionality. Experimentation with overlay approaches suggests that offloading at least the encapsulation and decapsulation operations an NVE implements can produce significant performance improvements. As has been done with checksum and/or TCP offload and other optimization approaches, there may be benefits to offloading common operations onto adaptors where possible. For server systems, such offloading is an implementation matter between server and adaptor vendors and does not require any IETF standardization.
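To make the ingress/egress behaviour described above concrete, the following sketch shows the two data-path operations in isolation. It is illustrative only: the 8-byte header carrying a 24-bit Context ID is an assumed stand-in for whichever encapsulation is actually used (see Section 6), and the outer UDP/IP handling and mapping lookups are omitted.

   # Illustrative sketch only: the two NVE data-path operations
   # (ingress encapsulation, egress decapsulation) using a generic
   # 8-byte overlay header with a 24-bit Context ID.  The header
   # layout and all names are assumptions, not a specification.

   import struct
   from typing import Tuple

   OVERLAY_HDR = struct.Struct("!II")   # 4 bytes flags/reserved, 4 bytes (Context ID << 8)

   def encapsulate(vn_id: int, tenant_frame: bytes) -> bytes:
       """Ingress: prepend an overlay header; the result is then carried
       in an outer UDP/IP packet addressed to the egress NVE (not shown)."""
       return OVERLAY_HDR.pack(0, (vn_id & 0xFFFFFF) << 8) + tenant_frame

   def decapsulate(overlay_packet: bytes) -> Tuple[int, bytes]:
       """Egress: strip the overlay header and recover the Context ID and
       the original tenant frame for delivery to the local Tenant System."""
       _, word = OVERLAY_HDR.unpack_from(overlay_packet)
       return word >> 8, overlay_packet[OVERLAY_HDR.size:]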
4.2.  Bare Metal Servers

Many data centers will continue to have at least some servers operate as non-virtualized (or "bare metal") machines running a traditional operating system and workload. In such systems, there will be no NVE functionality on the server, and the server will have no knowledge of NVO3 (including whether overlays are even in use). In such environments, the NVE functionality can reside on the first-hop physical switch that understands NVO3. In such a case, the network administrator would (manually) configure the switch to enable the appropriate NVO3 functionality on the network port that connects to the server. Such configuration would typically be static since the server is not virtualized; once configured, it is unlikely to change frequently. Consequently, this scenario does not require any protocol or standards work.

4.3.  Split-NVE

One final possible scenario leads to the need for a split-NVE implementation. A hypervisor running on a server could be aware that NVO3 is in use, but have some of the actual NVO3 functionality implemented on the first-hop switch to which the server is attached. While one could imagine a number of link types between a server and the NVE, we assume a common-case deployment where the server and NVE are separated by a simple L2 Ethernet link, across which LLDP runs. More complicated scenarios, e.g., where the server and NVE are separated by a bridged access network, should be considered only if a compelling use case emerges.

[note: the above eliminates a scenario where the NVE resides on a ToR, but is separated from the server by an embedded switch. Is this reasonable? And would handling this scenario eliminate VDP as a candidate solution?]

For the split-NVE case, protocols will be needed that allow the hypervisor and NVE to negotiate and set up the necessary state so that traffic sent across the access link between a server and the NVE can be associated with the correct virtual network instance. Specifically, on the access link, traffic belonging to a specific Tenant System would be tagged with a specific VLAN C-TAG that identifies which specific NVO3 virtual network instance it belongs to. The hypervisor-NVE protocol would negotiate which VLAN C-TAG to use for a particular virtual network instance. More details of the protocol requirements for this functionality can be found in [I-D.kreeger-nvo3-hypervisor-nve-cp].

5.  Address Mapping Dissemination

Address dissemination refers to the process of learning, building and distributing the necessary mapping/forwarding information that NVEs need in order to tunnel traffic between communicating Tenant Systems from one NVE to another. Before sending and receiving traffic on behalf of a Tenant System attached to a virtual network, the NVE must obtain the information needed to build its internal forwarding tables and state. For example, the NVE will need to know what encapsulation header to use (in the case that there are choices), what Context ID to associate with a given VN, mapping/forwarding tables that indicate where traffic should be tunneled to, etc. An NVE obtains such information from a Network Virtualization Authority.
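As a purely illustrative sketch of the kind of state involved (field names are hypothetical; no information model is being defined here), an NVE might organize the NVA-supplied data per virtual network roughly as follows:

   # Illustrative sketch only: the kind of per-VN state an NVE might
   # need to obtain from the NVA before forwarding tenant traffic.
   # Field names are hypothetical.

   from dataclasses import dataclass, field
   from typing import Dict

   @dataclass
   class VnInfo:
       context_id: int                  # e.g., a 24-bit VN identifier on the wire
       encapsulation: str               # which encapsulation to use for this VN
       # Inner (tenant) address -> egress NVE underlay IP address
       mappings: Dict[str, str] = field(default_factory=dict)

   @dataclass
   class NveState:
       # State an NVE builds up, per virtual network, from NVA-supplied data
       vns: Dict[str, VnInfo] = field(default_factory=dict)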
5.1.  Network Virtualization Authority

The Network Virtualization Authority (NVA) is the entity that NVEs interact with to obtain any required address mapping information. The term NVA refers to the overall system, without regard to its scope or how it is implemented. That said, the NVA does not consist of a single standalone server; it will consist of numerous components, designed for fault tolerance, performance, and availability. The internal organization and architecture of the NVA is hidden from NVEs. NVEs simply interact with NVAs via a carefully-defined, purpose-specific NVE-NVA protocol.

NVAs operate in a federated manner, with the overall NVA operating as a loosely-coordinated federation of individual local NVAs. Local NVAs are operated by a single administrative entity and typically operate within a single data center. A local NVA provides service to the NVEs residing at that site. If a virtual network spans multiple data centers and an NVE needs to tunnel traffic to an NVE at a remote data center, it still interacts only with the local NVA at its local site, even when obtaining mappings for NVEs at remote sites.

Individual NVAs provide address mappings to local NVEs in a highly resilient and performance-sensitive manner. To avoid single points of failure, an NVA would be implemented in a distributed or replicated manner, but the internal details of the implementation are not visible to NVEs.

NVAs at one site share information and interact with NVAs at other sites, but only in a controlled manner. It is expected that policy and access control will be applied at the boundaries between different sites (and NVAs) so as to minimize dependencies on external NVAs that could negatively impact the operation within a site. It is an architectural principle that operations involving NVAs at one site not be immediately impacted by failures or errors at another site. (Of course, communication between NVEs in different data centers may be impacted by such failures or errors.) It is a strong requirement that a local NVA continue to operate properly for local NVEs even if external communication is interrupted (e.g., should communication between a local and remote NVA fail).

At a high level, a federation of interconnected NVAs has some analogies to BGP and Autonomous Systems. Like an Autonomous System, NVAs at one site are managed by a single administrative entity and do not interact with external NVAs except as allowed by policy. Likewise, the interface between NVAs at different sites is well defined, so that the internal details of operations at one site are invisible to another site. Finally, an NVA only peers with other NVAs that it has a relationship with, i.e., where an overlay network needs to span multiple data centers. Reasons for using a federated model include:

o  Provide isolation between NVAs operating at different sites at different geographic locations.

o  Control the quantity and rate of information updates that flow (and must be processed) between different NVAs in different data centers.

o  Control the set of external NVAs (and external sites) a site peers with. A site will only peer with other sites that are cooperating in providing an overlay service.

o  Allow policy to be applied between sites, as sketched in the example following this list. A site will want to carefully control what information it exports (and to whom) as well as what information it is willing to import (and from whom).

o  Allow different protocols and architectures to be used for intra- vs. inter-NVA communication. For example, within a single data center, a replicated transaction server using database techniques might be an attractive implementation option for an NVA, and protocols optimized for intra-NVA communication would likely be different from protocols involving inter-NVA communication between different sites.

o  Allow for optimized protocols, rather than using a one-size-fits-all approach. Within a data center, networks tend to have lower latency, higher speed, and higher redundancy when compared with WAN links interconnecting data centers. The design constraints and tradeoffs for a protocol operating within a data center network are different from those operating over WAN links. While a single protocol could be used for both cases, there could be advantages to using different and more specialized protocols for the intra- and inter-NVA case.
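The following fragment is a hypothetical illustration of the per-site policy mentioned in the list above; the configuration structure and names are invented for illustration and are not a proposal. The point is only that export and import decisions are made at the site boundary, under local control.

   # Illustrative sketch only: a hypothetical per-site policy naming
   # which peer NVAs a local NVA will talk to and which VNs it is
   # willing to export to or import from each peer.

   from dataclasses import dataclass, field
   from typing import Set

   @dataclass
   class PeerNvaPolicy:
       peer_name: str                                       # remote NVA / remote site
       export_vns: Set[str] = field(default_factory=set)    # VNs advertised to the peer
       import_vns: Set[str] = field(default_factory=set)    # VNs accepted from the peer

   # A site peers only with sites cooperating in providing an overlay service.
   local_policy = [
       PeerNvaPolicy("dc-east", export_vns={"vn-blue"}, import_vns={"vn-blue"}),
       PeerNvaPolicy("dc-west", export_vns={"vn-green", "vn-blue"}, import_vns={"vn-green"}),
   ]

   def may_export(vn: str, peer: PeerNvaPolicy) -> bool:
       """Apply export policy before sending mapping updates to a peer NVA."""
       return vn in peer.export_vns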
5.2.  NVE-NVA Interaction Models

An NVE can obtain the information it needs to forward traffic in at least two ways:

o  An NVE can obtain the necessary information entirely through the tenant-facing side. Such an approach is most appropriate in virtualized environments where server virtualization software running on the hypervisor already has most (if not all) of that information, and where an existing VM orchestration protocol can be leveraged to obtain any needed information. Specifically, VM orchestration systems used to create, terminate and migrate VMs already have well-defined (though possibly proprietary) protocols to handle the interactions between the hypervisor and back-end orchestration systems (e.g., VMware's vCenter or Microsoft's System Center). For such systems, an obvious approach for an NVE to obtain additional information would be to leverage the existing orchestration protocol.

o  Alternatively, an NVE can obtain needed information from the DCN-facing side, by using a protocol that operates across the underlay network directly, rather than obtaining that information (indirectly) through the hypervisor on the tenant-facing side.

The NVO3 architecture should support both of the above models, and indeed it is possible that both models could be used simultaneously. Existing virtualization environments will use (and already are using) the first model. But the first model alone is not sufficient to cover the case of standalone gateways -- such gateways do not support virtualization and do not interface with existing VM orchestration systems. Also, a hybrid approach might be desirable, where the first model is used to obtain the information, but the latter approach is used to validate and further authenticate the information before using it.

5.3.  NVE-to-NVA Protocol

An NVE uses a dedicated NVE-to-NVA protocol to interact with the NVA. Using a dedicated protocol allows the protocols used by the NVE to evolve independently from the protocols used for intra-NVA communication. Using a dedicated protocol also ensures that both NVE and NVA implementations can evolve independently and without dependencies on each other. In practice, it is assumed that an NVE will be implemented once, and then (hopefully) not again, whereas an NVA (and its associated protocols) are more likely to evolve over time as experience is gained from usage.
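As a purely illustrative sketch (message names, fields, and the JSON encoding are assumptions made for illustration; no wire format is being proposed), an NVE-to-NVA exchange might include messages along these lines:

   # Illustrative sketch only: hypothetical NVE-to-NVA messages for
   # registering a locally attached Tenant System and requesting a
   # mapping.  All names and the encoding are assumptions.

   import json

   def register_request(vn_id: int, tenant_addr: str, local_nve_ip: str) -> bytes:
       """NVE -> NVA: a Tenant System with tenant_addr has attached to vn_id here."""
       return json.dumps({
           "type": "register",
           "vn_id": vn_id,
           "tenant_address": tenant_addr,   # inner MAC (L2 service) or IP (L3 service)
           "nve_underlay_ip": local_nve_ip,
       }).encode()

   def mapping_request(vn_id: int, tenant_addr: str) -> bytes:
       """NVE -> NVA: which egress NVE should traffic for tenant_addr go to?"""
       return json.dumps({
           "type": "map-request",
           "vn_id": vn_id,
           "tenant_address": tenant_addr,
       }).encode()

   def mapping_reply(vn_id: int, tenant_addr: str, egress_nve_ip: str) -> bytes:
       """NVA -> NVE: the requested mapping (a negative reply is not shown)."""
       return json.dumps({
           "type": "map-reply",
           "vn_id": vn_id,
           "tenant_address": tenant_addr,
           "egress_nve_ip": egress_nve_ip,
       }).encode()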
It should be noted that, with an appropriate design, the NVE-to-NVA protocol could also be implemented by existing VM orchestration systems as an alternate way of providing an interface to obtain address mapping information out of an existing VM orchestration system. Such an approach could be used by standalone gateways, which forward traffic to and from virtual networks. For standalone gateways that relay traffic onto and off of a virtual network (i.e., those implemented on physical switches), there is no easy way today for them to obtain the needed address mapping and other information. Whereas in virtualized systems the VM orchestration system can push addressing information to the NVE via server software, no such entry point is readily available on a standalone gateway.

5.3.1.  Push vs. Pull Model

There has been discussion within NVO3 about a "push vs. pull" model for NVE-to-NVA interaction. In the push model, the NVA would push address binding information to the NVE. Since the NVA has current knowledge of which NVE each Tenant System is connected to, the NVA can simply push updates out to the NVEs when they occur. The push model has the benefit that NVEs will always have the mapping information they need and do not need to query the NVA on a cache miss.

In the pull model, an NVE may not have all the mappings it needs when it attempts to forward tenant traffic. If an NVE attempts to send traffic to a destination for which it has no forwarding entry, the NVE queries the NVA to get the needed information or to definitively determine that no such entry exists. While the pull model has the advantage that an NVE doesn't need table entries for destinations it is not forwarding traffic to, it has the disadvantage of delaying the sending of traffic on a cache miss.

The NVO3 architecture should support both models, or even a combination model that supports both push and pull. In the case that the NVA wants to push information to the NVEs, there is no reason not to support such a model. In the case that the NVA is willing to answer queries on demand, there is no reason to have the architecture prevent such a model.

5.4.  Intra-NVA Control Protocol

Given a well-defined interface between an NVE and an NVA, a number of architectural approaches could be used to implement local NVAs themselves. NVAs manage address bindings and distribute them to where they need to go. One approach would be to use BGP (possibly with extensions) and route reflectors. Another approach could use a transaction-based database model with replicated servers. Because the implementation details are local to an NVA, there is no need to pick exactly one solution technology, so long as the external interfaces to the NVEs (and remote NVAs) are sufficiently well defined to achieve interoperability.

5.5.  Inter-NVA Control Protocol

The NVAs running at different locations (whether separated geographically, administratively, or both) may need to communicate and share information with each other. It is assumed that the protocol will be used to share addressing information between data centers and must scale well over WAN links.
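Tying Sections 5.3.1 and 5.4 together: regardless of the technology used inside a local NVA (BGP with route reflectors, a replicated database, etc.), its externally visible behaviour can be thought of as a mapping store that answers pull-style queries and pushes updates to interested NVEs. The following sketch is illustrative only, with all names hypothetical:

   # Illustrative sketch only: a local NVA viewed as a mapping store
   # supporting both pull (lookup) and push (subscription/notify)
   # styles of NVE interaction.  All names are hypothetical.

   from typing import Callable, Dict, Optional, Set, Tuple

   Key = Tuple[int, str]                     # (vn_id, tenant address)
   Subscriber = Callable[[Key, str], None]   # callback taking (key, egress NVE IP)

   class LocalNvaStore:
       def __init__(self) -> None:
           self.bindings: Dict[Key, str] = {}                 # key -> egress NVE underlay IP
           self.subscribers: Dict[int, Set[Subscriber]] = {}  # vn_id -> NVE callbacks

       def register(self, key: Key, egress_nve_ip: str) -> None:
           """Record a binding and push it to NVEs subscribed to this VN."""
           self.bindings[key] = egress_nve_ip
           for notify in self.subscribers.get(key[0], set()):
               notify(key, egress_nve_ip)                     # push model

       def lookup(self, key: Key) -> Optional[str]:
           """Pull model: answer an NVE query; None means 'no such entry'."""
           return self.bindings.get(key)

       def subscribe(self, vn_id: int, notify: Subscriber) -> None:
           """An NVE asks to receive pushed updates for a given VN."""
           self.subscribers.setdefault(vn_id, set()).add(notify)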
5.6.  Control Protocol Summary

The NVO3 address dissemination architecture consists of three major distinct components: the NVE, the local NVA, and one or more remote NVAs. In order to provide isolation and independence for each of these entities, the NVO3 architecture calls for a well-defined protocol for interfacing between these three components. For the local NVA, the NVO3 architecture calls for a single conceptual entity that could be implemented in a distributed or replicated fashion. While the IETF may choose to define one or more specific approaches to the local NVA (e.g., one using BGP), there is no need for it to pick exactly one to the exclusion of others. For the inter-NVA protocol, a protocol such as BGP could work well. A profile would be needed to define the specific set of features and extensions needed to support NVO3. For the NVE-to-NVA protocol, a purpose-specific protocol seems appropriate.

6.  NVO3 Data Plane Encapsulation

A key requirement for the NVO3 encapsulation protocol is support for a Context ID of sufficient size. A number of encapsulations already exist that provide a Context ID of sufficient size for NVO3. For example, VXLAN [I-D.mahalingam-dutt-dcops-vxlan] has a 24-bit VXLAN Network Identifier (VNI). NVGRE [I-D.sridharan-virtualization-nvgre] has a 24-bit Tenant Network ID (TNI). MPLS-over-GRE provides a 20-bit label field. While there is widespread recognition that a 12-bit Context ID would be too small (only 4096 distinct values), it is generally agreed that 20 bits (1 million distinct values) and 24 bits (16.8 million distinct values) are sufficient for a wide variety of deployment scenarios.

While one might argue that a new encapsulation should be defined just for NVO3, no compelling requirements for doing so have been identified yet. Moreover, optimized implementations for existing encapsulations are already starting to become available on the market (e.g., in silicon). If the IETF were to define a new encapsulation format, it would take at least 2 (and likely more) years before optimized implementations of the new format would become available in products. In addition, a new encapsulation format would not likely displace existing formats, at least not for years. Thus, there seems little reason to define a new encapsulation. However, it does make sense for NVO3 to support multiple encapsulation formats, so as to allow NVEs to use their preferred encapsulations when possible. This implies that the address dissemination protocols must also include an indication of supported encapsulations along with the address mapping details.

7.  Summary

This document provides a start at a general architecture for overlays in NVO3.

8.  IANA Considerations

This memo includes no request to IANA.

9.  Security Considerations

10.  Informative References

[I-D.ietf-nvo3-framework]
           Lasserre, M., Balus, F., Morin, T., Bitar, N., and Y. Rekhter, "Framework for DC Network Virtualization", draft-ietf-nvo3-framework-02 (work in progress), February 2013.

[I-D.ietf-nvo3-overlay-problem-statement]
           Narten, T., Gray, E., Black, D., Dutt, D., Fang, L., Kreeger, L., Napierala, M., and M. Sridharan, "Problem Statement: Overlays for Network Virtualization", draft-ietf-nvo3-overlay-problem-statement-02 (work in progress), February 2013.

[I-D.kreeger-nvo3-hypervisor-nve-cp]
           Kreeger, L., Narten, T., and D. Black, "Network Virtualization Hypervisor-to-NVE Overlay Control Protocol Requirements", draft-kreeger-nvo3-hypervisor-nve-cp-01 (work in progress), February 2013.
Black, "Network Virtualization Hypervisor-to-NVE Overlay Control Protocol Requirements", draft-kreeger-nvo3-hypervisor-nve-cp-01 (work in progress), February 2013. [I-D.mahalingam-dutt-dcops-vxlan] Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger, L., Sridhar, T., Bursell, M., and C. Wright, "VXLAN: A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks", draft-mahalingam-dutt-dcops-vxlan-03 (work in progress), February 2013. [I-D.sridharan-virtualization-nvgre] Sridharan, M., Greenberg, A., Venkataramaiah, N., Wang, Y., Duda, K., Ganga, I., Lin, G., Pearson, M., Thaler, P., and C. Tumuluri, "NVGRE: Network Virtualization using Generic Routing Encapsulation", draft-sridharan- virtualization-nvgre-02 (work in progress), February 2013. Author's Address Thomas Narten IBM Email: narten@us.ibm.com Narten Expires November 04, 2013 [Page 12]