Cloud Native Microservice Principles

How cloud native organizations deliver software using pipelines

Cloud Native Microservices Writ Large

P1 - If an organization or a set of organizations deliver cloud native software, the software’s features will be delivered as microservice software (a microservice, set of microservices, or component of a microservice), declarative APIs, and immutable infrastructure.

P2 - If an organization develops microservice software, the features of the microservice software will be constrained by the organization's business capabilities [1],[2],[3],[4] and structure [5].

Conway’s law predicts that a system will represent the organizational structure that created that system. The various groups within an organization have different rates of change and concerns with respect to business capability. The delivery of microservice software harnesses the existing group boundaries within an organization and works with, not against, the different rates of change and business capabilities residing within those boundaries.

P3 - If an organization develops microservice software, responsibility for that software, from inception to delivery, will rest with that organization.

P4 - If an organization has all of the responsibilities for the microservice software, that organization has the structure of a product team [6].

Microservice teams are product delivery teams. These teams are responsible for every part of feature delivery, from requirements gathering to production deployment. This allows the team to deploy based on business capability and to be sensitive to those capabilities’ rate of change.

P5 - If a product team has responsibility for microservice software, the rate of change [7],[8],[9],[10],[11],[12], cycle time [13],[14], and pipeline [15],[16],[17],[18] for that microservice software will be driven by that product team.

Just as a building’s components have different rates of change (foundation, plumbing, exterior, etc.), software components and services also have different rates of change. When we split up services based on business capability, the responsibility for changes and the actual rate of change are coupled. At the same time, conflicting agendas, road maps, and concerns are decoupled.

With this new freedom over their rate of change, microservice teams can adopt techniques that are sensitive to cycle time and MTTR (mean time to recovery). This leads to software delivery methods and best practices that are compatible with deployment pipelines.
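As a small worked example, both metrics reduce to averaging intervals. The sketch below is only illustrative: the function name `mean_duration` and the sample timestamps are hypothetical. It assumes cycle time is measured from the decision to make a change until that change is in production, and recovery time from a problem being identified until it is resolved.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# Each interval is (start, end): for cycle time, (change decided, change in
# production); for MTTR, (problem identified, problem resolved).
Interval = Tuple[datetime, datetime]


def mean_duration(intervals: List[Interval]) -> timedelta:
    """Return the average of (end - start) over all intervals."""
    if not intervals:
        raise ValueError("need at least one interval")
    total = sum((end - start for start, end in intervals), timedelta())
    return total / len(intervals)


# Hypothetical sample data for one product team.
changes = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 15, 0)),  # 6 hours
    (datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 2, 11, 0)),  # 2 hours
]
incidents = [
    (datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 30)),  # 30 minutes
    (datetime(2024, 1, 4, 10, 0), datetime(2024, 1, 4, 11, 30)),  # 90 minutes
]

cycle_time = mean_duration(changes)   # 4 hours on average
mttr = mean_duration(incidents)       # 1 hour on average
```

Because the team owns its own pipeline, it can track these two numbers per microservice rather than per release train.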

P6 - If the microservice has dependencies, the dependencies constrain the relationship structure between multiple organizations [19],[20].

Organizations and product teams that deliver software require varying levels of coordination with one another. Teams that have higher levels of coordination with other teams need to coordinate deployment pipelines and integration testing.

P7 - If the microservice software has a dependency, it will be delivered from a provider [21] to a consumer in the form of a library [22],[23] or a service instance [24],[25],[26] via a pipeline.

The rate of change between providers of microservice software and consumers of that software needs to be managed. When software is delivered as a library, it has a release number that can be referenced in the pipeline of the consumer. When software is delivered as a service instance, it can either be self-service (and therefore can be referenced via release number in the consumer’s pipeline) or it can be hosted. If microservice software is hosted, there needs to be a way to reference a test instance of that service from the consumer’s pipeline. The license registration process for service instances should be automated and flexible, and should avoid impeding the development of a deployment pipeline.
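Referencing a provider by release number can be sketched as follows. This is a minimal illustration, not a prescribed mechanism: the functions `parse_release` and `resolve_pin` and the list of published releases are hypothetical, and release numbers are assumed to be simple dotted integers.

```python
from typing import Iterable, Optional, Tuple


def parse_release(release: str) -> Tuple[int, ...]:
    """Turn a dotted release number such as '1.4.2' into (1, 4, 2)."""
    return tuple(int(part) for part in release.split("."))


def resolve_pin(available: Iterable[str], pin: Optional[str] = None) -> str:
    """Pick the provider release a consumer pipeline should build against.

    With a pin, the consumer stays on that exact release (it must exist).
    Without a pin, the consumer takes the newest published release.
    """
    releases = sorted(available, key=parse_release)
    if not releases:
        raise ValueError("provider has published no releases")
    if pin is None:
        return releases[-1]
    if pin not in releases:
        raise ValueError(f"pinned release {pin} is not published")
    return pin


# Hypothetical releases published by a provider's pipeline.
published = ["1.2.0", "1.10.0", "1.9.3"]

latest = resolve_pin(published)            # consumer follows the provider
pinned = resolve_pin(published, "1.9.3")   # consumer opts to stay behind
```

Pinning lets the consumer release on its own schedule, at the cost of possibly missing provider fixes until the pin is moved.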

P8 - If a microservice is deployed, the microservice will be deployed with all of its library dependencies [27],[28],[29],[30],[31],[32].

The microservice has all of its dependencies deployed with it during the deployment phase of the pipeline. These dependencies are decoupled from the infrastructure environment (e.g. a node) so the rate of change of the environment is separate from the microservices it hosts.

P9 - If a microservice is deployed [33],[34],[35], the pipeline artifacts and configuration for the microservice will be versioned and associated with the stack [36],[37] of infrastructure elements that were provisioned with it.

The provisioning of infrastructure and the deployment of a microservice are related. Each deployment of a microservice must record the version of the infrastructure that it was deployed and tested with.
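One way to picture this association is a versioned deployment record. The sketch below is an assumption-laden illustration: the class names `StackVersion` and `DeploymentRecord`, the field names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class StackVersion:
    """Version of the infrastructure stack definition that was provisioned."""
    name: str
    version: str


@dataclass(frozen=True)
class DeploymentRecord:
    """Ties a microservice artifact and its configuration to a stack version."""
    service: str
    artifact_version: str
    config_version: str
    stack: StackVersion


def is_tested_combination(candidate: DeploymentRecord,
                          tested: Set[DeploymentRecord]) -> bool:
    """True only if this exact artifact/config/stack combination passed tests."""
    return candidate in tested


# Hypothetical record written by the pipeline after its test stages pass.
stack_v7 = StackVersion("edge-cluster", "7")
tested = {DeploymentRecord("billing", "2.3.1", "14", stack_v7)}

same = DeploymentRecord("billing", "2.3.1", "14", stack_v7)
newer_stack = DeploymentRecord(
    "billing", "2.3.1", "14", StackVersion("edge-cluster", "8")
)

ok = is_tested_combination(same, tested)              # True
drifted = is_tested_combination(newer_stack, tested)  # False
```

The records are frozen so they can be stored in a set and compared by value; a change to any version produces a different, untested combination.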

Cloud Native in the Small: Microservices and Networking

Cloud native network functions follow the same principles as cloud native microservices with few exceptions.

P10 - If an organization or a set of organizations deliver cloud native network functions, the software’s features will be delivered as microservice software (a microservice, set of microservices, or component of a microservice).

P11 - If a pipeline provisions network infrastructure [38] (physical or virtual layer 1 and layer 2 [39] networking functions), it will be provisioned [40] using declarative configuration.

Network infrastructure (the platform that the cloud native network functions will be deployed into) is provisioned (instances are made available for use to consumers) using declarative configuration. Configuration should designate what the outcome is, while the tools that provision that network infrastructure should create that outcome.
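The split between configuration that states the outcome and tooling that produces it can be sketched as a planning step. This is a toy model, not the behavior of any particular provisioning tool: the `plan` function, the action names, and the mapping of network names to VLAN IDs are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Desired and actual state are plain mappings: network name -> VLAN ID.
State = Dict[str, int]
Action = Tuple[str, str]


def plan(desired: State, actual: State) -> List[Action]:
    """Compute the actions a provisioning tool would take to reach `desired`."""
    actions: List[Action] = []
    for name in sorted(desired):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != desired[name]:
            # In this sketch a mismatched network is replaced, not edited.
            actions.append(("replace", name))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name))
    return actions


# The configuration only declares the outcome that is wanted.
desired = {"mgmt": 10, "storage": 20, "tenant-a": 30}
# Hypothetical state discovered on the infrastructure.
actual = {"mgmt": 10, "storage": 25, "legacy": 99}

steps = plan(desired, actual)
# steps == [("replace", "storage"), ("create", "tenant-a"), ("delete", "legacy")]
```

The author of `desired` never writes the steps; running `plan` again once the infrastructure matches returns an empty list.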

P12 - If a pipeline provisions network infrastructure, it will be provisioned immutably.

P13 - If a provider for network infrastructure delivers software or hardware, it will be delivered to the consumer as a library dependency or service instance (whether self service or hosted).

P14 - If the provider of networking software delivers cloud native service chains, the service chains will be composed of immutable microservices with declarative APIs [41],[42],[43],[44].

Cloud native network functions can be composed with one another. During this composition, their configuration is not modified after deployment (immutable), and it designates the network outcome that is wanted (declarative) rather than the steps needed to reach that outcome (imperative).

P15 - If an application developer consumes a cloud native networking function, it will be consumed using a declarative API.

A cloud native network function exposes its configuration using a declarative API, such as a YAML file. An application developer can reference cloud native functions at a higher level, using elements that were provided by operators.

P16 - If an operator combines cloud native network functions into a service chain, they will be combined using a declarative API and will be exposed as a declarative API.

Operators compose fine-grained cloud native functions and provide them as a coarse-grained element to consumers (e.g. application developers) via a declarative API.
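The composition can be sketched with immutable value objects. The class names `NetworkFunction` and `ServiceChain`, the `describe` method, and the sample functions are hypothetical; the sketch only illustrates the shape of the idea and does not model any real service-chaining API.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class NetworkFunction:
    """A fine-grained function described only by its desired settings."""
    name: str
    settings: Tuple[Tuple[str, str], ...] = ()


@dataclass(frozen=True)
class ServiceChain:
    """A coarse-grained element an operator exposes to application developers."""
    name: str
    functions: Tuple[NetworkFunction, ...]

    def describe(self) -> dict:
        """Render the chain as a declarative document (what, not how)."""
        return {
            "chain": self.name,
            "functions": [
                {"name": f.name, "settings": dict(f.settings)}
                for f in self.functions
            ],
        }


# An operator composes two fine-grained functions into one chain.
firewall = NetworkFunction("firewall", (("default", "deny"),))
balancer = NetworkFunction("load-balancer", (("algorithm", "round-robin"),))
secure_ingress = ServiceChain("secure-ingress", (firewall, balancer))

document = secure_ingress.describe()
```

An application developer would reference `secure-ingress` by name. Because both classes are frozen, changing a setting means building a new chain rather than mutating a deployed one.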

P17 - If a cloud native network function developer creates networking software, it will expose a declarative API.

The cloud native network functions themselves are developed in such a way as to expose a way to configure them declaratively.

LICENSE

This work is licensed under a Creative Commons Attribution 4.0 International License.

LIST OF CONTRIBUTORS

If you would like credit for helping with these documents (for either this document or any of the other four documents linked above), please add your name to the list of contributors.

W Watson Vulk Coop

Taylor Carpenter Vulk Coop

Denver Williams Vulk Coop

Jeffrey Saelens Charter Communications

Endnotes

  1. Stine, Matt. Migrating to Cloud-Native Application Architecture, O'Reilly, 2015, p. 16. “Microservices represent the decomposition of monolithic business systems into independently deployable services that do ‘one thing well.’ That one thing usually represents a business capability, or the smallest, ‘atomic’ unit of service that delivers business value.”

  2. Stine, Matt. Migrating to Cloud-Native Application Architecture, O'Reilly, 2015, p. 16. “As we decouple the business domain into independently deployable bounded contexts of capabilities, we also decouple the associated change cycles. As long as the changes are restricted to a single bounded context, and the service continues to fulfill its existing contracts, those changes can be made and deployed independent of any coordination with the rest of the business. The result is enablement of more frequent and rapid deployments, allowing for a continuous flow of value.”

  3. Stine, Matt. Migrating to Cloud-Native Application Architecture, O'Reilly, 2015, pp. 16–17. “Development can be accelerated by scaling the development organization itself. It’s very difficult to build software faster by adding more people due to the overhead of communication and coordination. Fred Brooks taught us years ago that adding more people to a late software project makes it later. However, rather than placing all of the developers in a single sandbox, we can create parallel work streams by building more sandboxes through bounded contexts.”

  4. Stine, Matt. Migrating to Cloud-Native Application Architecture, O'Reilly, 2015, p. 17. “The new developers that we add to each sandbox can ramp up and become productive more rapidly due to the reduced cognitive load of learning the business domain and the existing code, and building relationships within a smaller team.”

  5. Conway’s law describes the relationship between the structure of an organization and its systems: Any organization that designs a system (defined more broadly here than just information systems) will inevitably produce a design whose structure is a copy of the organization’s communication structure. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 4914-4922). O'Reilly Media. Kindle Edition.

  6. Cross-functional teams put all of the people responsible for building and running an aspect of a system together. This may include testers, project managers, analysts, and a commercial or product owner, as well as different types of engineers. These teams should be small; Amazon uses the term “two-pizza teams,” meaning the team is small enough that two pizzas is enough to feed everyone. The advantage of this approach is that people are dedicated to a single, focused service or small set of services, avoiding the need to multitask between projects. Teams formed of a consistent set of people work far more effectively than those whose membership changes from day to day. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 6457-6462). O'Reilly Media. Kindle Edition.

  7. The peculiarity of buildings that turned Architectural Digest into a contradiction of itself is that different parts of buildings change at different rates. Brand, Stewart. How Buildings Learn (p. 21). Penguin Publishing Group. Kindle Edition.

  8. “Our basic argument is that there isn’t such a thing as a building,” says Duffy. “A building properly conceived is several layers of longevity of built components.” He distinguishes four layers, which he calls Shell, Services, Scenery, and Set. Shell is the structure, which lasts the lifetime of the building (fifty years in Britain, closer to thirty-five in North America). Services are the cabling, plumbing, air conditioning, and elevators (“lifts”), which have to be replaced every fifteen years or so. Scenery is the layout of partitions, dropped ceilings, etc., which changes every five to seven years. Set is the shifting of furniture by the occupants, often a matter of months or weeks. Brand, Stewart. How Buildings Learn (pp. 21-22). Penguin Publishing Group. Kindle Edition.

  9. I’ve taken the liberty of expanding Duffy’s “four S’s”—which are oriented toward interior work in commercial buildings—into a slightly revised, general-purpose “six S’s”: • SITE - This is the geographical setting, the urban location, and the legally defined lot, whose boundaries and context outlast generations of ephemeral buildings. “Site is eternal,” Duffy agrees. • STRUCTURE - The foundation and load-bearing elements are perilous and expensive to change, so people don’t. These are the building. Structural life ranges from 30 to 300 years (but few buildings make it past 60, for other reasons). • SKIN - Exterior surfaces now change every 20 years or so, to keep up with fashion or technology, or for wholesale repair. Recent focus on energy costs has led to re-engineered Skins that are air-tight and better-insulated. • SERVICES - These are the working guts of a building: communications wiring, electrical wiring, plumbing, sprinkler system, HVAC (heating, ventilating, and air conditioning), and moving parts like elevators and escalators. They wear out or obsolesce every 7 to 15 years. Many buildings are demolished early if their outdated systems are too deeply embedded to replace easily. • SPACE PLAN - The interior layout—where walls, ceilings, floors, and doors go. Turbulent commercial space can change every 3 years or so; exceptionally quiet homes might wait 30 years. • STUFF - Chairs, desks, phones, pictures; kitchen appliances, lamps, hair brushes; all the things that twitch around daily to monthly. Furniture is called mobilia in Italian for good reason. Brand, Stewart. How Buildings Learn (pp. 24-25). Penguin Publishing Group. Kindle Edition.

  10. Frank Duffy: “Thinking about buildings in this time-laden way is very practical. As a designer you avoid such classic mistakes as solving a five-minute problem with a fifty-year solution, or vice versa. It legitimizes the existence of different design skills—architects, service engineers, space planners, interior designers—all with their different agendas defined by this time scale. It means you invent building forms which are very adaptive.” Brand, Stewart. How Buildings Learn (p. 32). Penguin Publishing Group. Kindle Edition.

  11. The layering also defines how a building relates to people. Organizational levels of responsibility match the pace levels. The building interacts with individuals at the level of Stuff; with the tenant organization (or family) at the Space plan level; with the landlord via the Services (and slower levels) which must be maintained; with the public via the Skin and entry; and with the whole community through city or county decisions about the footprint and volume of the Structure and restrictions on the Site. The community does not tell you where to put your desk or your bed; you do not tell the community where the building will go on the Site (unless you’re way out in the country). Brand, Stewart. How Buildings Learn (pp. 32-33). Penguin Publishing Group. Kindle Edition.

  12. O’Neill’s A Hierarchical Concept of Ecosystems. O’Neill and his co-authors noted that ecosystems could be better understood by observing the rates of change of different components. Hummingbirds and flowers are quick, redwood trees slow, and whole redwood forests even slower. Most interaction is within the same pace level—hummingbirds and flowers pay attention to each other, oblivious to redwoods, who are oblivious to them. Meanwhile the forest is attentive to climate change but not to the hasty fate of individual trees. Brand, Stewart. How Buildings Learn (p. 33). Penguin Publishing Group. Kindle Edition.

  13. The most effective measurement of a change management pipeline is the cycle time. Cycle time is the time between deciding on the need for a change to seeing that change in production use. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 4867-4868). O'Reilly Media. Kindle Edition.

  14. Metrics are best used by the team to help itself, and should be continually reviewed to decide whether they are still providing value. Some common metrics used by infrastructure teams include: Cycle time: The time taken from a need being identified to fulfilling it. This is a measure of the efficiency and speed of change management. Cycle time is discussed in more detail later in this chapter. Mean time to recover (MTTR): The time taken from an availability problem (which includes critically degraded performance or functionality) being identified to a resolution, even where it’s a workaround. This is a measure of the efficiency and speed of problem resolution. Mean time between failures (MTBF): The time between critical availability issues. This is a measure of the stability of the system, and the quality of the change management process. Although it’s a valuable metric, over-optimizing for MTBF is a common cause of poor performance on other metrics. Availability: The percentage of time that the system is available, usually excluding time the system is offline for planned maintenance. This is another measurement of system stability. It is often used as an SLA in service contracts. True availability: The percentage of time that the system is available, not excluding planned maintenance.

  15. Continuous delivery for software is implemented using a deployment pipeline. A deployment pipeline is an automated manifestation of a release process. It builds the application code, and deploys and tests it on a series of environments before allowing it to be deployed to production. The same concept is applied to infrastructure changes. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 3823-3826). O'Reilly Media. Kindle Edition.

  16. The point of CD and the software deployment pipeline is to allow changes to be delivered in a continuous flow, rather than in large batches. Changes can be validated more thoroughly, not only because they are applied with an automated process, but also because changes are tested when they are small, and because they are tested immediately after being committed. The result, when done well, is that changes can be made more frequently, more rapidly, and more reliably. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 4525-4529). O'Reilly Media. Kindle Edition.

  17. Teams who embrace the pipeline as the way to manage changes to their infrastructure find a number of benefits: Their infrastructure management tooling and codebase is always production ready. There is never a situation where extra work is needed (e.g., merging, regression testing, and “hardening”) to take work live. Delivering changes is nearly painless. Once a change has passed the technical validation stages of the pipeline, it shouldn’t need technical attention to carry through to production unless there is a problem. There is no need to make technical decisions about how to apply a change to production, as those decisions have been made, implemented, and tested in earlier stages. It’s easier to make changes through the pipeline than any other way. Hacking a change manually other than to bring up a system that is down is more work, and scarier, than just pushing it through the pipeline. Compliance and governance are easy. The scripts, tools, and configuration for making changes are transparent to reviewers. Logs can be audited to prove what changes were made, when, and by whom. With an automated change management pipeline, a team can prove what process was followed for each and every change. This tends to be stronger than taking someone’s word that documented manual processes are always followed. Change management processes can be more lightweight. People who might otherwise need to discuss and inspect each change can build their requirements into the automated tooling and tests. They can periodically review the pipeline implementation and logs, and make improvements as needed. Their time and attention goes to the process and tooling, rather than inspecting each change one by one. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 4551-4563). O'Reilly Media. Kindle Edition.

  18. The design of your change management pipelines is a manifestation of your system’s architecture. Both of these are a manifestation of your team structure. Conway’s law describes the relationship between the structure of an organization and its systems: Any organization that designs a system (defined more broadly here than just information systems) will inevitably produce a design whose structure is a copy of the organization’s communication structure. Organizations can take advantage of this to shape their teams, systems, and pipeline to optimize for the outcomes they want. This is sometimes called the Inverse Conway Maneuver. Ensure that the people needed to deliver a given change through to production are all a part of the same team. This may involve restructuring the team but may also be done by changing the system’s design. It can often be achieved by changing the service model, which is the goal of self-service systems. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 4914-4922). O'Reilly Media. Kindle Edition.

  19. Integration Models: The design and implementation of pipelines for testing how systems and infrastructure elements integrate depends on the relationships between them, and the relationships between the teams responsible for them. There are several typical situations: Single team: One team owns all of the elements of the system and is fully responsible for managing changes to them. In this case, a single pipeline, with fan-in as needed, is often sufficient. Group of teams: A group of teams works together on a single system with multiple services and/or infrastructure elements. Different teams own different parts of the system, which all integrate together. In this case, a single fan-in pipeline may work up to a point, but as the size of the group and its system grows, decoupling may become necessary. Separate teams with high coordination: Each team (which may itself be a group of teams) owns a system, which integrates with systems owned by other teams. A given system may integrate with multiple systems. Each team will have its own pipeline and manage its releases independently. But they may have a close enough relationship that one team is willing to customize its systems and releases to support another team’s requirements. This is often seen with different groups within a large company and with close vendor relationships. Separate teams with low coordination: As with the previous situation, except one of the teams is a vendor with many other customers. Their release process is designed to meet the requirements of many teams, with little or no customizations to the requirements of individual customer teams. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 4892-4907). O'Reilly Media. Kindle Edition.

  20. Practice: Decouple Pipelines When separate teams build different components of a system, such as microservices, joining pipeline branches for these components together with the fan-in pattern can create a bottleneck. The teams need to spend more effort on coordinating the way they handle releases, testing, and fixing. This may be fine for a small number of teams who work closely together, but the overhead grows exponentially as the number of teams grows. Decoupling pipelines involves structuring the pipelines so that a change to each component can be released independently. The components may still have dependencies between each other, so they may need integration testing. But rather than requiring all of the components to be released to production together in a “big bang” deployment, a change to one component could go ahead to production before changes to the second component are released. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 4883-4889). O'Reilly Media. Kindle Edition.

  21. Given two integrated components, one provides a service, and the other consumes it. The provider component needs to test that it is providing the service correctly for its consumers. And the consumer needs to test that it is consuming the provider service correctly. For example, one team may manage a monitoring service, which is used by multiple application teams. The monitoring team is the provider, and the application teams are the consumers. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 4930-4933). O'Reilly Media. Kindle Edition.

  22. Pattern: Library Dependency One way that one component can provide a capability to another is to work like a library. The consumer pulls a version of the provider and incorporates it into its own artifact, usually in the build stage of a pipeline. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 4937-4939). O'Reilly Media. Kindle Edition.

  23. The important characteristic is that the library component is versioned, and the consumer can choose which version to use. If a newer version of the library is released, the consumer may opt to immediately pull it in, and then run tests on it. However, it has the option to “pin” to an older version of the library. This gives the consumer team the flexibility to release changes even if they haven’t yet incorporated new, incompatible changes to their provider library. But it creates the risk that important changes, such as security patches, aren’t integrated in a timely way. This is a major source of security vulnerability in IT systems. For the provider, this pattern gives freedom to release new changes without having to wait for all consumer teams to update their components. But it can result in having many different versions of the component in production, which increases the time and hassle of support. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 4941-4947). O'Reilly Media. Kindle Edition.

  24. Pattern: Self-Provisioned Service Instance The library pattern can be adapted for full-blown services. A well-known example of this is AWS’s Relational Database Service, or RDS, offered by AWS. A team can provision complete working database instances for itself, which it can use in a pipeline for a component that uses a database. As a provider, Amazon releases new database versions, while still making older versions available as “previous generation DB instances”. This has the same effect as the library pattern, in that the provider can release new versions without waiting for consumer teams to upgrade their own components. Being a service rather than a library, the provider is able to transparently release minor updates to the service. Amazon can apply security patches to its RDS offering, and new instances created by consumer teams will automatically use the updated version. The key is for the provider to keep close track of the interface contract, to make sure the service behaves as expected after updates have been applied. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 4963-4973). O'Reilly Media. Kindle Edition.

  25. Prefer Products with Cloud-Compatible Licensing Licensing can make dynamic infrastructure difficult with some products. Some examples of licensing approaches that work poorly include: A manual process to register each new instance, agent, node, etc., for licensing. Clearly, this defeats automated provisioning. If a product’s license does require registering infrastructure elements, there needs to be an automatable process for adding and removing them. Inflexible licensing periods. Some products require customers to buy a fixed set of licenses for a long period. For example, a monitoring tool may have licensing based on the maximum number of nodes that can be monitored. The licenses may need to be purchased on a monthly cycle. This forces the customer to pay for the maximum number of nodes they might use during a given month, even when they only run that number of nodes for a fraction of the time. This cloud-unfriendly pricing model discourages customers from taking advantage of the ability to scale capacity up and down with demand. Vendors pricing for cloud charge by the hour at most. Heavyweight purchasing process to increase capacity. This is closely related to the licensing period. When an organization is hit with an unexpected surge in business, they shouldn’t need to spend days or weeks to purchase the extra capacity they need to meet the demand. It’s common for vendors to have limits in place to protect customers against accidentally over-provisioning, but it should be possible to raise these limits quickly. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 1880-1892). O'Reilly Media. Kindle Edition.

  26. Providing Test Instances of a Service to Consumers The provider of a hosted service needs to provide support for consumers to develop and test their integration with the service. This is useful to consumers for a number of purposes: To learn how to correctly integrate to the service. To test that integration still works after changes to the consumer system. To test that the consumer system still works after changes to the provider. To test and develop against new provider functionality before it is released. To reproduce production issues for troubleshooting. To run demonstrations of the consumer system without affecting production data. An effective way for a provider to support these is to provide self-provisioned service instances. If consumers can create and configure instances on-demand, then they can easily handle their own testing and demonstration needs. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 4992-4999). O'Reilly Media. Kindle Edition.

  27. The benefits of decoupling runtime requirements from the host system are particularly powerful for infrastructure management. It creates a clean separation of concerns between infrastructure and applications. The host system only needs to have the container runtime software installed, and then it can run nearly any container image. Applications, services, and jobs are packaged into containers along with all of their dependencies [...]. These dependencies can include operating system packages, language runtimes, libraries, and system files. Different containers may have different, even conflicting dependencies, but still run on the same host without issues. Changes to the dependencies can be made without any changes to the host system. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 1652-1658). O'Reilly Media. Kindle Edition.

  28. The important thing is how the artifact is treated, conceptually. A configuration artifact is an atomic, versioned collection of materials that provision and/or configure a system component. An artifact is: Atomic: A given set of materials is assembled, tested, and applied together as a unit. Portable: It can be progressed through the pipeline, and different versions can be applied to different environments or instances. It can be reliably and repeatably applied to any environment, and so any given environment has an unambiguous version of the component. Complete: A given artifact should have everything needed to provision or configure the relevant component. It should not assume that previous versions of the component artifacts have been applied before. Consistent: Applying the artifact to any two component instances should have the same results. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 4669-4678). O'Reilly Media. Kindle Edition.

  29. The best way to think of a container is as a method to package a service, application, or job. It’s an RPM on steroids, taking the application and adding in its dependencies, as well as providing a standard way for its host system to manage its runtime environment. Rather than a single container running multiple processes, aim for multiple containers, each running one process. These processes then become independent, loosely coupled entities. This makes containers a nice match for microservice application architectures. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 1708-1711). O'Reilly Media. Kindle Edition.

  30. The benefits of containerization include: Decoupling the runtime requirements of specific applications from the host server that the container runs on. Repeatably create consistent runtime environments by having a container image that can be distributed and run on any host server that supports the runtime. Defining containers as code (e.g., in a Dockerfile) that can be managed in a VCS, used to trigger automated testing, and generally having all of the characteristics for infrastructure as code. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 1633-1637). O'Reilly Media. Kindle Edition.

  31. The immutable server pattern mentioned in “Server Change Management Models” doesn’t make configuration updates to existing servers. Instead, changes are made by building a new server with the new configuration. With immutable servers, configuration is usually baked into the server template. When the configuration is updated, a new template is packaged. New instances of existing servers are built from the new template and used to replace the older servers. This approach treats server templates like software artifacts. Each build is versioned and tested before being deployed for production use. This creates a high level of confidence in the consistency of the server configuration between testing and production. Advocates of immutable servers view making a change to the configuration of a production server as bad practice, no better than modifying the source code of software directly on a production server. Immutable servers can also simplify configuration management, by reducing the area of the server that needs to be managed by definition files. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 2239-2247). O'Reilly Media. Kindle Edition.

  1. Using the term “immutable” to describe this pattern can be misleading. “Immutable” means that a thing can’t be changed, so a truly immutable server would be useless. As soon as a server boots, its runtime state changes: processes run, entries are written to logfiles, and application data is added, updated, and removed. It’s more useful to think of the term “immutable” as applying to the server’s configuration, rather than to the server as a whole. This creates a clear line between configuration and data. It forces teams to explicitly define which elements of a server they will manage deterministically as configuration and which elements will be treated as data. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 2918-2926). O'Reilly Media. Kindle Edition.

  2. Blue-green replacement is the most straightforward pattern to replace an infrastructure element without downtime. This is the blue-green deployment pattern for software applied to infrastructure. It requires running two instances of the affected infrastructure, keeping one of them live at any point in time. Changes and upgrades are made to the offline instance, which can be thoroughly tested before switching usage over to it. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 5681-5685). O'Reilly Media. Kindle Edition.

  3. Phoenix replacement is the natural progression from blue-green using dynamic infrastructure. Rather than keeping an idle instance around between changes, a new instance can be created each time a change is needed. As with blue-green, the change is tested on the new instance before putting it into use. The previous instance can be kept up for a short time, until the new instance has been proven in use. But then the previous instance is destroyed. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 5694-5697). O'Reilly Media. Kindle Edition.

  4. The canary pattern involves deploying the new version of an element alongside the old one, and then routing some portion of usage to the new elements. For example, with version A of an application running on 20 servers, version B may be deployed to two servers. A subset of traffic, perhaps flagged by IP address or by randomly setting a cookie, is sent to the servers for version B. The behavior, performance, and resource usage of the new element can be monitored to validate that it’s ready for wider use. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 5724-5728). O'Reilly Media. Kindle Edition.

  5. A stack is a collection of infrastructure elements that are defined as a unit (the inspiration for choosing the term stack comes mainly from the term’s use by AWS CloudFormation). A stack can be any size. It could be a single server. It could be a pool of servers with their networking and storage. It could be all of the servers and other infrastructure involved in a given application. Or it could be everything in an entire data center. What makes a set of infrastructure elements a stack isn’t the size, but whether it’s defined and changed as a unit. The concept of a stack hasn’t been commonly used with manually managed infrastructures. Elements are added organically, and networking boundaries are naturally used to think about infrastructure groupings. But automation tools force more explicit groupings of infrastructure elements. It’s certainly possible to put everything into one large group. And it’s also possible to structure stacks by following network boundaries. But these aren’t the only ways to organize stacks. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 3227-3230). O'Reilly Media. Kindle Edition.

  6. “I Heard Calico Is Suggesting Layer 2: I Thought You Were Layer 3! What’s Happening?” Project Calico Documentation, docs.projectcalico.org/v3.5/usage/troubleshooting/faq#i-heard-calico-is-suggesting-layer-2-i-thought-you-were-layer-3-whats-happening. It’s important to distinguish what Calico provides to the workloads hosted in a data center (a purely layer 3 network) with what the Calico project recommends operators use to build their underlying network fabric. Calico’s core principle is that applications and workloads overwhelmingly need only IP connectivity to communicate. For this reason we build an IP-forwarded network to connect the tenant applications and workloads to each other, and the broader world. However, the underlying physical fabric obviously needs to be set up too. Here, Calico has discussed how both a layer 2 (see here) or a layer 3 (see here) fabric could be integrated with Calico. This is one of the great strengths of the Calico model: it allows the infrastructure to be decoupled from what we show to the tenant applications and workloads. We have some thoughts on different interconnect approaches (as noted above), but just because we say that there are layer 2 and layer 3 ways of building the fabric, and that those decisions may have an impact on route scale, does not mean that Calico is “going back to Ethernet” or that we’re recommending layer 2 for tenant applications. In all cases we forward on IP packets, no matter what architecture is used to build the fabric.

  7. “Concerns over Ethernet at scale” Calico over an Ethernet interconnect fabric, https://docs.projectcalico.org/v3.5/reference/private-cloud/l2-interconnect-fabric. It has been acknowledged by the industry for years that, beyond a certain size, classical Ethernet networks are unsuitable for production deployment. Although there have been multiple attempts to address these issues, the scale-out networking community has largely abandoned Ethernet for anything other than providing physical point-to-point links in the networking fabric. The principal reasons for Ethernet failures at large scale are: 1. Large numbers of end points. Each switch in an Ethernet network must learn the path to all Ethernet endpoints that are connected to the Ethernet network. Learning this amount of state can become a substantial task when we are talking about hundreds of thousands of end points. 2. High rate of churn or change in the network. With that many end points, most of them being ephemeral (such as virtual machines or containers), there is a large amount of churn in the network. That load of re-learning paths can be a substantial burden on the control plane processor of most Ethernet switches. 3. High volumes of broadcast traffic. As each node on the Ethernet network must use Broadcast packets to locate peers, and many use broadcast for other purposes, the resultant packet replication to each and every end point can lead to broadcast storms in large Ethernet networks, effectively consuming most, if not all resources in the network and the attached end points. 4. Spanning tree. Spanning tree is the protocol used to keep an Ethernet network from forming loops. The protocol was designed in the era of smaller, simpler networks, and it has not aged well. As the number of links and interconnects in an Ethernet network goes up, many implementations of spanning tree become more fragile. Unfortunately, when spanning tree fails in an Ethernet network, the effect is a catastrophic loop or partition (or both) in the network, and, in most cases, difficult to troubleshoot or resolve. While many of these issues are crippling at VM scale (tens of thousands of end points that live for hours, days, weeks), they will be absolutely lethal at container scale (hundreds of thousands of end points that live for seconds, minutes, days).

  8. “Provisioning” is a term that can be used to mean somewhat different things [...] provisioning is used to mean making an infrastructure element such as a server or network device ready for use. Depending on what is being provisioned, this can involve: Assigning resources to the element. Instantiating the element. Installing software onto the element. Configuring the element. Registering the element with infrastructure services. At the end of the provisioning process, the element is fully ready for use. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 1206-1208). O'Reilly Media. Kindle Edition.

  9. “Declarative configuration is different from imperative configuration, where you simply take a series of actions (e.g., apt-get install foo) to modify the world. Years of production experience have taught us that maintaining a written record of the system’s desired state leads to a more manageable, reliable system. Declarative configuration enables numerous advantages, including code review for configurations as well as documenting the current state of the world for distributed teams. Additionally, it is the basis for all of the self-healing behaviors in Kubernetes that keep applications running without user action.” Hightower, Kelsey; Burns, Brendan; Beda, Joe. Kubernetes: Up and Running: Dive into the Future of Infrastructure (Kindle Locations 892-896). Kindle Edition.

  10. “The combination of declarative state stored in a version control system and Kubernetes’s ability to make reality match this declarative state makes rollback of a change trivially easy. It is simply restating the previous declarative state of the system. With imperative systems this is usually impossible, since while the imperative instructions describe how to get you from point A to point B, they rarely include the reverse instructions that can get you back.” Hightower, Kelsey; Burns, Brendan; Beda, Joe. Kubernetes: Up and Running: Dive into the Future of Infrastructure (Kindle Locations 186-190). Kindle Edition.

  11. “Because it describes the state of the world, declarative configuration does not have to be executed to be understood. Its impact is concretely declared. Since the effects of declarative configuration can be understood before they are executed, declarative configuration is far less error-prone. Further, the traditional tools of software development, such as source control, code review, and unit testing, can be used in declarative configuration in ways that are impossible for imperative instructions.” Hightower, Kelsey; Burns, Brendan; Beda, Joe. Kubernetes: Up and Running: Dive into the Future of Infrastructure (Kindle Locations 183-186). Kindle Edition.

  12. So declarative definitions lend themselves to running idempotently. You can safely apply your definitions over and over again, without thinking about it too much. If something is changed to a system outside of the tool, applying the definition will bring it back into line, eliminating sources of configuration drift. When you need to make a change, you simply modify the definition, and then let the tooling work out what to do. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 1275-1278). O'Reilly Media. Kindle Edition.
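
The notes on declarative configuration above describe definitions that can be applied repeatedly, that correct drift, and that let the tooling work out what to do. A minimal sketch of that idea follows. It is not taken from any of the cited tools: the reconcile function, the package names, and the versions are hypothetical, and a dictionary stands in for the real state of a system.

```python
# Hypothetical sketch: a dict of package -> version stands in for system state.

def reconcile(desired: dict, actual: dict) -> list:
    """Mutate `actual` until it matches `desired`; return the actions taken."""
    actions = []
    # Install or change anything that differs from the desired definition.
    for name, version in desired.items():
        if name not in actual:
            actions.append(f"install {name}={version}")
        elif actual[name] != version:
            actions.append(f"change {name} {actual[name]} -> {version}")
        actual[name] = version
    # Remove anything the definition does not mention.
    for name in list(actual):
        if name not in desired:
            actions.append(f"remove {name}")
            del actual[name]
    return actions


desired = {"nginx": "1.25", "foo": "2.0"}   # the declared definition
actual = {"nginx": "1.24", "bar": "0.9"}    # the system as it currently is

first = reconcile(desired, actual)    # does the work needed to converge
second = reconcile(desired, actual)   # applying again changes nothing

actual["foo"] = "1.0"                 # a change made outside the tool (drift)
third = reconcile(desired, actual)    # reapplying brings it back into line

print(first)
print(second)
print(third)
```

The second call returns no actions because the state already matches the definition, which is what running idempotently means here. The third call repairs only the drifted entry. An imperative script would instead repeat every step or need its own checks for what has already been done.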