# Cloud Native Immutable Infrastructure Principles

***P1*** *- If infrastructure is* ***immutable**, it is easily* ***reproduced*** \[1],\[2]*,* ***consistent*** \[3], ***disposable*** \[4],\[5]*, will have a* ***repeatable*** \[6] ***provisioning process**, and will not have configuration or artifacts that are modifiable in place.*

Many qualities characterize an immutable process. Reproducibility, consistency, disposability, and repeatability are mandatory attributes of any well-designed infrastructure process, including an immutable one.

***P2*** *- If the provisioning of cloud native infrastructure is dynamic, it will use unattended code execution* \[7],\[8],\[9] *with declarative configuration* \[10],\[11],\[12],\[13],\[14]*.*

The gold standard for cloud infrastructure is that it can be provisioned without any human assistance. The tools that provision that infrastructure should accept declarative configuration as input.
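As a minimal sketch of what "unattended code execution with declarative configuration" can look like, the following Python example declares the desired state and lets an `apply` function work out what to create. The names `DesiredServer`, `FakePlatform`, and `apply` are hypothetical, and the in-memory platform stands in for a real provisioning API; none of them come from a real tool.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DesiredServer:
    """Declarative description of one element (hypothetical fields)."""
    name: str
    image: str
    size: str


@dataclass
class FakePlatform:
    """In-memory stand-in for a provisioning API (an assumption, not a real SDK)."""
    servers: dict = field(default_factory=dict)
    create_calls: int = 0

    def create(self, spec: DesiredServer) -> None:
        self.create_calls += 1
        self.servers[spec.name] = spec


def apply(desired: list[DesiredServer], platform: FakePlatform) -> list[str]:
    """Create every declared element that is missing or differs from its spec.

    Returns the names that were (re)created, so a second run with the same
    input returns an empty list.
    """
    created = []
    for spec in desired:
        if platform.servers.get(spec.name) != spec:
            # A differing element is replaced by a new one, not edited in place.
            platform.create(spec)
            created.append(spec.name)
    return created


platform = FakePlatform()
desired = [DesiredServer("web-1", "base-image-v3", "small")]
first = apply(desired, platform)   # creates web-1
second = apply(desired, platform)  # nothing left to do
```

Because `apply` compares declared state with actual state before acting, it can be run repeatedly and without supervision; applying the same definition twice creates the server only once.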

***P3*** *- If a cloud native* ***infrastructure element*** \[15],\[16] *(**compute, storage, or network**) is* ***provisioned*** \[17],\[18],\[19],\[20],\[21],\[22]*, that element is* ***ready for use**.*

***P4*** *- If a cloud native infrastructure element is* ***provisioned**, it will be provisioned* ***immutably**.*

Once immutable infrastructure (the orchestrator and all of the software and hardware that it depends on) is provisioned, it is not changed after it is made ready for use. Changes are rolled out as new instances of the infrastructure.

***P5*** *- If an infrastructure element’s provisioning is* ***immutable**, its* ***base configuration*** \[23],\[24],\[25] *will* ***not*** *be* ***changed.***

In the immutable change management model \[38], immutable infrastructure elements are built from scratch (or from artifacts and configuration with a known state) as a new instance of the element. The artifacts of an immutable infrastructure are composed of scripts, binaries, containers, images, and server templates, while the configuration is the declaration of what that infrastructure should look like after it is instantiated.

If the infrastructure element is hardware, such as a physical layer 1 networking device, it should be ‘flashed’ (a complete replacement of its software) for its artifact updates. Changes to virtual infrastructure should be treated the same way, with a new virtual instance being deployed from a current artifact. The *configuration* for such a device should be managed with an atomic application of a versioned configuration file, which replaces all of the configuration on the device at once. This applies only to non-base configuration, not to the base configuration that defines the provisioning of the device itself.

The choice of **change management model** is **separate** from the choice of **deployment strategy**, such as a phoenix \[46] or canary \[47] deployment. Any deployment strategy that supports immutable infrastructure (i.e. when infrastructure such as a server needs a configuration or artifact change, a brand new instance of that infrastructure is created) can be used.
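The atomic, whole-file replacement of a versioned configuration described above might be sketched as follows. This is an illustration only: the function name `apply_config_atomically`, the JSON layout, and the example settings are assumptions, and a real device would have its own configuration format. The sketch relies on `os.replace`, which is atomic on POSIX systems when the temporary file and the target are on the same filesystem.

```python
import json
import os
import tempfile


def apply_config_atomically(path: str, version: str, settings: dict) -> None:
    """Replace the entire configuration file in one step.

    The new document is written to a temporary file in the same directory
    and then renamed over the target, so a reader sees either the old file
    or the new one, never a partial mix.
    """
    document = {"version": version, "settings": settings}
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(document, tmp, sort_keys=True)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


workdir = tempfile.mkdtemp()
config_path = os.path.join(workdir, "device.json")

apply_config_atomically(config_path, "1.0.0", {"mtu": 1500, "vlan": 10})
apply_config_atomically(config_path, "1.1.0", {"mtu": 9000})

with open(config_path) as handle:
    current = json.load(handle)
```

Note that the second call does not merge with the first: the `vlan` setting is gone because the new version did not declare it, which is the "replaces all of the configuration at once" behavior.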

***P6*** *- If an infrastructure element is* ***immutable*** \[26],\[27],\[28],\[29],\[30],\[31]*, its* ***base*** ***configuration*** *is stored as a* ***template*** \[32],\[33],\[34],\[35],\[36],\[37]*.*

An infrastructure **image** resides at the lowest level and usually includes an operating system, but may also include an **orchestrator** for the higher-level applications or any other dependencies that have a low rate of change but are needed by applications. This **image** is managed using a **template system** with **versioning** (e.g. a versioned image of an operating system), which minimizes the MTTR \[44] (mean time to recovery) and deployment time of the infrastructure.

***P7*** *- If there is* ***configuration*** *outside of an infrastructure element’s template, it is* ***versioned*** *and stored in* ***source control*** \[41],\[43]*.*

Any configuration that is applied after an infrastructure element’s base image/template has been created (also called **bootstrapping**) will be applied **before** the element is considered ready for use. After the element is in use, no further configuration is allowed.
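One way to picture this rule is an element that accepts bootstrap configuration only until it is marked ready, after which any change has to arrive as a replacement instance. The sketch below is hypothetical throughout: `InfrastructureElement`, `ElementInUseError`, and `replace` are illustrative names, not part of any real tool.

```python
class ElementInUseError(RuntimeError):
    """Raised when configuration is attempted on an element already in use."""


class InfrastructureElement:
    """Hypothetical element: built from a template, bootstrapped, then frozen."""

    def __init__(self, template_version: str) -> None:
        self.template_version = template_version
        self._config: dict[str, str] = {}
        self._ready = False

    def bootstrap(self, key: str, value: str) -> None:
        # Bootstrapping is only legal before the element is ready for use.
        if self._ready:
            raise ElementInUseError(
                f"cannot set {key!r}: element is in use; provision a replacement"
            )
        self._config[key] = value

    def mark_ready(self) -> None:
        self._ready = True

    @property
    def config(self) -> dict[str, str]:
        # Return a copy so callers cannot mutate the stored configuration.
        return dict(self._config)


def replace(old: InfrastructureElement, changes: dict[str, str]) -> InfrastructureElement:
    """Roll out a change as a new instance instead of editing the old one."""
    new = InfrastructureElement(old.template_version)
    for key, value in {**old.config, **changes}.items():
        new.bootstrap(key, value)
    new.mark_ready()
    return new


node = InfrastructureElement("base-image-v3")
node.bootstrap("hostname", "node-a")
node.mark_ready()

try:
    node.bootstrap("hostname", "node-b")
    rejected = False
except ElementInUseError:
    rejected = True

replacement = replace(node, {"hostname": "node-b"})
```

The original `node` keeps its configuration untouched; the change exists only on `replacement`, which can then take over from the old instance.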

***P8*** *- If an infrastructure element is* ***immutable**, the* ***dependencies*** *of the applications that run on that infrastructure element will be* ***decoupled*** \[48],\[49],\[50],\[51],\[52],\[53] *from that* ***infrastructure element**.*

An infrastructure element such as a generic host server, a network device, or a storage device has a different rate of change from, and is separate from, the applications that are deployed on or in relation to it. For instance, a generic host server \[42] can act as a network router, but the dependencies that make it ready for use should still be kept separate from the network applications deployed on that server. A microservice has all of its dependencies deployed with it during the deployment phase of its pipeline, which is separate from the infrastructure’s pipeline. These dependencies are decoupled from the infrastructure environment (e.g. a node), so the rate of change of the environment is independent of the microservices it hosts.

***P9*** *- If an infrastructure element is* ***immutable**, the applications that run on that infrastructure element will run in an* ***unprivileged mode*** \[54],\[55]

Applications should not require elevated security permissions on the underlying infrastructure, as they should have no need to modify it.

***P10*** *- If an application and its dependencies are* ***decoupled*** *from the infrastructure element that it runs on, the* ***application*** *will be* ***orchestrated*** \[56].

All application components that run on the infrastructure, except data \[57],\[58],\[59],\[60],\[61], will be orchestrated and built from versioned artifacts and configuration templates during each deployment.

***P11*** *- If an infrastructure element provides a* ***service*** \[62],\[63],\[64],\[65]*, that service will be available via* ***service discovery.***

***P12*** *- If an infrastructure is* ***immutable**, its changes will be managed via a* ***change management pipeline*** \[66],\[67],\[68]

***P13*** *- If* ***artifacts*** \[69],\[70] *are* ***provisioned*** *as new* ***immutable*** ***infrastructure*** ***elements**, they are deployed using* ***templates**.*

The pipeline for low-level infrastructure artifacts should combine them into templates and/or images, which reduces the provisioning time for everything that depends on them.

***P14*** *- If an immutable infrastructure has a change management pipeline, the* ***change management pipeline*** *applies* ***tests*** \[71],\[72],\[73],\[74],\[75],\[76] *to the codebase with increasing levels of complexity.*

***P15*** *- If the immutable or idempotent infrastructure has dependencies, the* ***dependencies*** ***constrain the relationship structure*** \[77] *between* ***multiple organizations***

***P16*** *- If the immutable or idempotent infrastructure has a dependency, that* ***dependency*** *will be* ***delivered*** \[78] *from a* ***provider*** *to a* ***consumer*** \[79] *in the form of a* ***library*** \[80],\[81] *or a* ***service instance*** \[82] *via a* ***pipeline**.*

All software components delivered by a provider must be capable of being integration **tested** \[85],\[86] using the consumer’s pipeline \[95] in a non-production environment. The consumer should be able to do this either in its own pipeline, using a library or a self-serviced instance, or in a hosted environment \[83] supplied by the provider. The software should resist **breaking** the **contract** formed **between** the **provider** of that software and the **consumer** of that software. This means paying attention to forward and backward compatibility \[87],\[88],\[89],\[90],\[91],\[92] with respect to the interfaces between the provider and consumer. Any break in compatibility should force a major version \[93],\[94] change.
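The rule that a break in compatibility forces a major version change can be illustrated with a small sketch. It assumes a `MAJOR.MINOR.PATCH` numbering scheme in the style of Semantic Versioning; the text above only says "major version", so the scheme, the `Version` class, and the `next_version` and `satisfies` helpers are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Version:
    """A MAJOR.MINOR.PATCH version; ordering compares the fields in that order."""
    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, text: str) -> "Version":
        major, minor, patch = (int(part) for part in text.split("."))
        return cls(major, minor, patch)

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.patch}"


def next_version(current: Version, breaks_contract: bool, adds_feature: bool) -> Version:
    """Pick the next version number for a provider release."""
    if breaks_contract:
        # Any break in the provider/consumer contract forces a major bump.
        return Version(current.major + 1, 0, 0)
    if adds_feature:
        return Version(current.major, current.minor + 1, 0)
    return Version(current.major, current.minor, current.patch + 1)


def satisfies(provided: Version, required: Version) -> bool:
    """True if a consumer built against `required` can accept `provided`.

    Assumed rule: same major version, and `provided` is not older.
    """
    return provided.major == required.major and provided >= required


current = Version.parse("2.4.1")
breaking = next_version(current, breaks_contract=True, adds_feature=False)
feature = next_version(current, breaks_contract=False, adds_feature=True)
```

A consumer's pipeline could use a check like `satisfies` to reject a provider release whose major version differs from the one it was tested against, until the consumer has integration tested the new contract.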

**LICENSE**

This work is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).

**LIST OF CONTRIBUTORS**

If you would like credit for helping with these documents (for either this document or any of the other four documents linked above), please add your name to the list of contributors.

W Watson Vulk Coop

Taylor Carpenter Vulk Coop

Denver Williams Vulk Coop

Jeffrey Saelens Charter Communications

## Endnotes

1. It should be possible to **effortlessly** and reliably rebuild any element of an infrastructure. Effortlessly means that there is **no need to make any significant decisions** about **how** to **rebuild** the thing. Decisions about which software and versions to install on a server, how to choose a hostname, and so on should be captured in the scripts and tooling that provision it. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 349-352). O'Reilly Media. Kindle Edition.
2. When problems are discovered, fixes may not be rolled out to all of the systems that could be affected by them. Differences in versions and configurations across servers mean that software and scripts that work on some machines don’t work on others. This leads to **inconsistency** across the **servers**, called **configuration drift**. \[...] Even when servers are initially created and configured consistently, **differences** can creep in **over time**: \[...]. But **variations should be captured and managed in a way that makes it easy to reproduce and to rebuild servers and services.** Unmanaged variation between servers leads to **snowflake servers** and automation **fear**. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 278-290). O'Reilly Media. Kindle Edition.
3. Given **two infrastructure elements** providing a **similar service** (for example, two application servers in a cluster), the servers should be nearly **identical**. Their system software and configuration should be the same, except for those **bits** of **configuration** that differentiate them, like their **IP addresses**. Letting inconsistencies slip into an infrastructure keeps you from being able to trust your automation. If one file server has an 80 GB partition, while another has 100 GB, and a third has 200 GB, then you can’t rely on an action to work the same on all of them. This encourages doing special things for servers that don’t quite match, which leads to **unreliable** **automation**. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 380-384). O'Reilly Media. Kindle Edition.
4. One of the **benefits** of **dynamic infrastructure** is that **resources** can be easily **created, destroyed, replaced, resized, and moved**. In order to take advantage of this, systems should be designed to **assume** that the infrastructure will **always** be **changing**. **Software** should **continue running** even when **servers** **disappear**, appear, and when they are resized. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 357-359). O'Reilly Media. Kindle Edition.
5. A popular expression is to “**treat your servers like cattle, not pets**.” Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 362-363). O'Reilly Media. Kindle Edition.
6. Building on the **reproducibility** principle, any action you carry out on your infrastructure should be **repeatable**. This is an obvious benefit of **using scripts and configuration management tools** **rather than** making changes **manually**, but it can be hard to stick to doing things this way, especially for experienced system administrators. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 393-395). O'Reilly Media. Kindle Edition.
7. These are some characteristics of scripts and tasks that help to support reliable unattended execution: **Idempotent** It should be possible to execute the same script or task multiple times without bad effects. **Pre-checks** A task should validate its starting conditions are correct, and fail with a visible and useful error if not, leaving things in a usable state. **Post-checks** A task should check that it has succeeded in making the changes. This isn’t just a matter of checking return codes on commands, but proving that the end result is there. For example, checking that a virtual host has been added to a web server could involve making an HTTP request to the web server. **Visible failure** When a task fails to execute correctly, it should be visible to the team. This may involve an information radiator and/or integration with monitoring services (covered in “What Is An Information Radiator?” and “Alerting: Tell Me When Something Is Wrong” ). **Parameterized** Tasks should be applicable to multiple operations of a similar type. For example, a single script can be used to configure multiple virtual hosts, even ones with different characteristics. The script will need a way to find the parameters for a particular virtual host, and some conditional logic or templating to configure it for the specific situation. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 1071-1086). O'Reilly Media. Kindle Edition.
8. In order to allow a **tool** to **repeatedly run unattended**, it needs to be **idempotent**. This means that the result of running the tool should be the same no matter how many times it’s run. **Idempotent scripts** and tools can be set to **run continuously** (for example, at a fixed time interval), which helps to prevent configuration drift and improve confidence in automation. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 1090-1092). O'Reilly Media. Kindle Edition.
9. If you apply the Terraform definition five times, it will only create one web server. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Location 1291). O'Reilly Media. Kindle Edition.
10. A good **domain-specific language (DSL)** for server configuration works by having you **define** the **state** you **want** something to be in, and then doing whatever is needed to **bring it into that state**. This should happen without side effects from being applied to the same server many times. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 1096-1097). O'Reilly Media. Kindle Edition.
11. So **declarative definitions** lend themselves to running **idempotently**. You can safely **apply** your definitions **over and over** again, without thinking about it too much. If something is changed to a system outside of the tool, applying the definition will bring it back into line, eliminating sources of configuration drift. When you need to make a change, you simply modify the definition, and then let the tooling work out what to do. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 1275-1278). O'Reilly Media. Kindle Edition.
12. “**Configuration definition file**” is a generic term for the tool-specific **files** used to **drive infrastructure automation tools**. Most tools seem to have their own names: playbooks, cookbooks, manifests, templates, and so on. A configuration definition could be any one of these, or even a configuration file or script. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 1156-1158). O'Reilly Media. Kindle Edition.
13. A **configuration registry** is a directory of information about the elements of an infrastructure. It provides a **means** for scripts, **tools**, applications, and services to **find the information** they need in order to **manage** and integrate with **infrastructure**. This is particularly useful with dynamic infrastructure because this information changes continuously as elements are added and removed.
14. There are many **configuration registry** products. Some examples include **Zookeeper**, **Consul**, and **etcd**. Many server configuration tool vendors provide their own configuration registry, for example, Chef Server, PuppetDB, and Ansible Tower. These products are designed to integrate easily with the configuration tool itself, and often with other elements such as a dashboard. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 1335-1339). O'Reilly Media. Kindle Edition.
15. A **stack** is a **collection** of **infrastructure elements** that are **defined** as a **unit** (the inspiration for choosing the term stack comes mainly from the term’s use by AWS CloudFormation). A stack can be any size. It could be a single server. It could be a pool of servers with their networking and storage. It could be all of the servers and other infrastructure involved in a given application. Or it could be everything in an entire data center. What makes a set of infrastructure elements a **stack** isn’t the size, but **whether it’s defined and changed as a unit.** The concept of a stack **hasn’t** been commonly used with **manually managed infrastructures.** Elements are added organically, and **networking boundaries** are naturally used to think about infrastructure groupings. But automation tools force more explicit groupings of infrastructure elements. It’s certainly possible to put everything into one large group. And it’s also possible to structure stacks by following **network boundaries.** But these **aren’t the only ways to organize stacks.** Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 3227-3230). O'Reilly Media. Kindle Edition.
16. The difficulty of a **monolithic definition** is that it becomes **cumbersome** to change. With most definition tools, the file can be organized into separate files. But if making a change involves **running the tool** against the **entire infrastructure stack**, things become dicey: It’s **easy** for a **small change** to **break many things**. It’s hard to avoid tight coupling between the parts of the infrastructure. Each **instance** of the environment is **large** and **expensive**. Developing and testing changes requires **testing an entire stack at once**, which is **cumbersome**. If **many people** can make changes to the infrastructure, there is a high **risk** someone will **break** something. On the other hand, if **changes** are **limited** to a **small group** to minimize the risk, then there are likely to be **long delays** waiting for changes to be made. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 3357-3363). O'Reilly Media. Kindle Edition.
17. “**Provisioning”** is a term that can be used to mean somewhat different things \[...] provisioning is used to mean **making** an **infrastructure element** such as a **server** or **network device** **ready for use**. Depending on what is being provisioned, this can involve: Assigning **resources** to the element. **Instantiating** the element. **Installing** software onto the element. **Configuring** the element. **Registering** the element with infrastructure services. At the end of the provisioning process, the element is **fully ready for use.**
18. “**Provisioning**” is sometimes used to refer to a more **narrow** part of this process. For instance, **Terraform** and **Vagrant** both use it to define the callout to a server configuration tool like Chef or Puppet to **configure** a **server** **after** it has been **created**. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 1218-1220). O'Reilly Media. Kindle Edition.
19. The **infrastructure element** could be a **server**; a **part** of a **server**, such as a user account; **network** **configuration**, such as a **load balancer rule**; or many other things. Different tools have different terms for this: for example, playbooks (Ansible), recipes (Chef), or manifests (Puppet). The term “**configuration definition file”** is used \[...] as a generic term for these. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 421-424). O'Reilly Media. Kindle Edition.
20. An **infrastructure definition tool** will **create servers** but **isn’t** responsible for what’s **on the server** itself. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 1304-1305). O'Reilly Media. Kindle Edition.
21. The term “**environment**” is typically used when there are **multiple stacks** that are actually **different instances** of the **same** **service** or set of services. The most common use of environments is for testing. An application or service may have “development,” “test,” “preproduction,” and “production” environments. The infrastructure for these should generally be the same, with perhaps some variations due to scale. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 3227-3230). O'Reilly Media. Kindle Edition.
22. This dependency on **manual effort** to keep multiple files up to date and consistent is the **weakness** of **per-environment** **definition** **files**. There is no way to ensure that changes are made consistently. Also, because the change is made manually to each environment, it isn’t possible to use an automated pipeline to test and promote the change from one environment to the next.
23. \[...] server configuration should result in the following: A new **server** can be **completely provisioned on demand**, without waiting more than a few minutes. A new server can be completely provisioned **without human involvement**, for example, in response to events. When a server configuration **change** is defined, it is **applied** to servers **without human involvement**. Each **change is applied to all the servers** it is relevant to, and is reflected in all new servers provisioned after the change has been made. The processes for provisioning and for applying changes to servers are **repeatable, consistent, self-documented, and transparent**. It is **easy and safe to make changes** to the processes used to provision servers and change their configuration. **Automated tests are run every time a change is made** to a server configuration definition, and to any process involved in provisioning and modifying servers. **Changes** to configuration, and changes to the processes that carry out tasks on an infrastructure, are **versioned** and applied to different environments, in order to support controlled testing and staged release strategies. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 1440-1449). O'Reilly Media. Kindle Edition.
24. A **new server** is created by the dynamic infrastructure platform using an **infrastructure definition tool** \[...] The server is created from a **server template**, which is a **base image** of some kind. This might be in a VM image format specific to the infrastructure platform (e.g., an AWS AMI image or VMware VM template), or it could be an OS installation disk image from a vendor (e.g., an ISO image of the Red Hat installation DVD). Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 1460-1465). O'Reilly Media. Kindle Edition.
25. There are many **use cases** where **new servers** are created: A member of the infrastructure team needs to build a new server of a standard type, for **example**, **adding a new file server to a cluster**. They **change** an **infrastructure definition file** to specify the new server. A user wants to set up a new instance of a standard application, for example, a bug-tracking application. They use a **self-service portal**, which builds an application server with the bug-tracking software installed. A web server VM crashes because of a hardware issue. The **monitoring service** **detects** the **failure** and **triggers** the creation of a **new VM to replace it**. User **traffic** grows beyond the capacity of the existing application server pool, so the infrastructure platform’s **autoscaling** functionality creates new application servers and adds them to the pool to meet the demand. A **developer commits a change** to the software they are working on. The CI software (e.g., Jenkins or GoCD) **automatically provisions** an application server in a test environment with the new build of the software so it can run an automated test suite against it. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 1467-1475). O'Reilly Media. Kindle Edition.
26. The **immutable server pattern** mentioned in “Server Change Management Models” **doesn’t make configuration updates to existing servers**. Instead, changes are made by **building a new server** with the new configuration. With **immutable servers**, **configuration** is **usually** **baked** into the **server template**. When the configuration is updated, a new template is **packaged**. **New instances** of **existing servers** are built from the **new template** and used to **replace** the **older servers**. This approach **treats** **server templates** like **software artifacts**. Each build is versioned and tested before being deployed for production use. This creates a high level of confidence in the consistency of the server configuration between testing and production. **Advocates** of **immutable server**s view making a **change** to the **configuration** of a **production** **server** as **bad** practice, no better than modifying the source code of software directly on a production server. Immutable servers can also **simplify configuration** management, by **reducing** the area of the server that **needs** to be managed by **definition files**. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 2239-2247). O'Reilly Media. Kindle Edition.
27. Using the term “**immutable**” to describe this pattern can be misleading. “Immutable” means that a thing can’t be changed, so a **truly immutable server would be useless**. As soon as a server boots, its **runtime** **state** **changes**: **processes** run, entries are written to logfiles, and **application data** is added, updated, and removed. It’s more **useful** to think of the term “**immutable**” as applying to the **server’s configuration,** rather than to the server as a whole. This creates a clear **line** between **configuration** and **data**. It forces teams to explicitly **define** which elements of a server they will **manage** deterministically as **configuration** and which elements will be treated as **data**. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 2918-2926). O'Reilly Media. Kindle Edition.
28. **Antipattern: Handcrafted Server** ... **manually building** servers leads almost immediately to **configuration drift** and **snowflake** servers. Server creation is **not traceable, versionable, or testable, and is certainly not self-documenting.** Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 2376-2391). O'Reilly Media. Kindle Edition.
29. **Antipattern: Hot Cloned Server** ... A server that has been **cloned** from a **running server** is **not** **reproducible**. You can’t create a third server from the same starting point, because both the original and the new server have moved on: they’ve been in use, so various things on them will have changed. **Cloned servers** also suffer because they have **runtime data** from the **original server.** Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 2412-2423). O'Reilly Media. Kindle Edition.
30. **Antipattern: Snowflake Factory** Many organizations adopt automated tools to provision servers, but a person still creates each one by **running the tool** and **choosing options** for the particular server. This typically happens when processes that were used for **manual server provisioning** are simply carried over to the automated tools. The result is that it may be quicker to build servers, but that servers are still inconsistent, which can make it difficult to automate the process of keeping them patched and updated. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 2436-2441). O'Reilly Media. Kindle Edition.
31. The phrase **lift and shift** describes installing **software** that was **written** for **static infrastructure** (i.e., most software designed with legacy, pre-cloud assumptions) **onto dynamic infrastructure**. Although the software ends up running on a dynamic infrastructure platform, the infrastructure must be **managed statically** to avoid breaking anything. These applications are **unlikely** to be able to take advantage of advanced infrastructure capabilities such as **automatic** **scaling** and **recovery**, or the creation of ad hoc instances. Some **characteristics** of non-cloud-native software that require “lift and shift” migrations: **Stateful sessions**. Storing **data on the local filesystem**. **Slow-running startup routines**. **Static configuration** of **infrastructure** **parameters**. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 5628-5634). O'Reilly Media. Kindle Edition.
32. In many cases, new servers can be built using off-the-shelf server template images. Infrastructure platforms, such as **IaaS clouds**, often provide **template images** for common **operating systems**. Many also offer libraries of **templates** built by vendors and third parties, who may provide images that have been preinstalled and configured for particular purposes, such as **application servers**. But many infrastructure teams find it useful to **build their own server templates**. They can pre-configure them with their team’s preferred tools, software, and configuration. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 1506-1512). O'Reilly Media. Kindle Edition.
33. **Packaging** **common elements** onto a template makes it **faster** to **provision** new servers. Some teams take this further by creating server templates for particular **roles** such as web servers and application servers. One of the key **trade-offs** is that, as **more elements** are **managed** by packaging them into server **templates**, the **templates** need to be **updated** **more** often. This then requires more sophisticated processes and tooling to build and manage templates. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 1510-1515). O'Reilly Media. Kindle Edition.
34. Keeping **templates** **minimal** makes **sense** when there is a **lot of variation** in what may be installed on a server. For example, if people create servers by self-service, choosing from a large menu of configuration options, it makes sense to provision dynamically when the server is created. Otherwise, the **library** of **prebuilt** templates would need to be **huge** to include **all of the variations** that a user might select. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 2564-2567). O'Reilly Media. Kindle Edition.
35. At the other end of the provisioning spectrum is **putting** nearly **everything** into the **server template**. **Building** new servers then becomes very **quick** and simple, just a matter of selecting a template and applying instance-specific configuration such as the hostname. This can be **useful** for infrastructures where new instances **need** to be spun up very **quickly**, for **example**, to support **automated scaling**. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 2574-2577). O'Reilly Media. Kindle Edition.
36. With **immutable servers**, **templates** are treated like a **software** **artifact** in a **continuous delivery pipeline**. Any change to a server’s configuration is made by building a **new version of the template**. Each new template version is automatically **tested** before it is rolled out to production environments. This ensures that every production server’s configuration has been thoroughly and reliably tested. There is **no opportunity** to **introduce** an **untested configuration** change to production. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 2587-2590). O'Reilly Media. Kindle Edition.
37. It may be useful to **create a separate server template** for each **role**, with the relevant **software** and **configuration** baked into it. This requires more **sophisticated** (i.e., automated) processes for **building** and managing **templates** but makes **creating** **servers** **faster** and simpler. Different server templates may be needed for reasons other than the functional role of a server. Different operating systems and distributions will each need their own server template, as will significant versions. For example, a team could have separate server templates for CentOS 6.5.x, CentOS 7.0.x, Windows Server 2012 R2, and Windows Server 2016. In other cases, it could **make** **sense** to have server **templates** **tuned** for different **purposes**. Database server nodes could be built from one template that has been tuned for high-performance file access, while web servers may be tuned for network I/O throughput. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 2796-2802). O'Reilly Media. Kindle Edition.
38. The **four models** for making **configuration changes** to servers are **ad hoc**, **configuration synchronization**, **immutable servers**, and **containerized servers**. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 2887-2888). O'Reilly Media. Kindle Edition.
39. **Ad** **hoc** change management makes **changes** to **servers** only **when** a specific **change** is **needed**. This was the traditional approach before the automated server configuration tools became mainstream, and is still the most commonly used approach. It is **vulnerable** to **configuration** **drift**, **snowflakes**, and all of the evils described \[...]. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 1599-1602). O'Reilly Media. Kindle Edition.
40. **Configuration synchronization** **repeatedly applies configuration** definitions to servers, for example, by running a Puppet or Chef agent on an hourly schedule. This ensures that any **changes** to parts of the system managed by these definitions are **kept in line**. Configuration synchronization is the mainstream approach for infrastructure as code, and most server configuration tools are designed with this approach in mind. The main limitation of this approach is that **many areas** of a server are **left unmanaged**, leaving them vulnerable to **configuration drift**. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 1605-1609). O'Reilly Media. Kindle Edition.
41. **Immutable infrastructure** makes configuration changes by **completely replacing servers**. Changes are made by **building new server templates**, and then rebuilding relevant servers using those templates. This increases predictability, as there is little variance between servers as tested, and servers in production. It requires **sophistication in server template management**. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 1611-1614). O'Reilly Media. Kindle Edition.
42. **Containerized services** works by packaging applications and services in **lightweight containers** (as popularized by Docker). This **reduces coupling** between **server configuration** and the things that **run on** the **servers**. **So host servers tend to be very simple, with a lower rate of change.** One of the other change management **models** still needs to be **applied** to these **hosts**, but their implementation becomes much simpler and easier to maintain. **Most effort and attention goes into packaging, testing, distributing, and orchestrating the services and applications**, but this follows something similar to the immutable infrastructure model, which again is simpler than managing the configuration of full-blown virtual machines and servers. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 1617-1621). O'Reilly Media. Kindle Edition.
43. **Bootstrap Configuration** with **Immutable Servers**: The **purest use** of **immutable servers** is to **bake** everything onto the server **template** and **change nothing**, even when creating server instances from the template. **But some teams have found that for certain types of changes, the turnaround time needed to build a new template is too slow.** **An emerging practice is to put almost everything into the server template, but add one or two elements when bootstrapping a new server.** This might be a **configuration setting that is only known when the server is created, or it might be a frequently changing element such as an application build for testing**. A small development team using continuous integration (CI) or continuous delivery (CD) is likely to deploy dozens of builds of their application a day, so **building a new server template for every build may be unacceptably slow**. Having a **standard server template image** that can **pull** **in** and **start** a **specified** **application** **build** when it is started is particularly useful for **microservices**. This still follows the **immutable** **server** **pattern**, in that **any** **change** to the server’s **configuration** (such as a new version of the microservice) is carried out by **building a new server instance**. It shortens the turnaround time for changes to a microservice, because it **doesn’t** **require** building a **new** **server** **template**. However, this practice arguably **weakens** the **testing** **benefits** from the **pure immutable model**. Ideally, a given **combination** of **server** **template** **version** and **microservice** **version** will have been **tested** through each stage of a change management **pipeline**. But there is some **risk** that the process of installing a microservice, or making other changes, when creating a server will behave slightly differently when done for different servers. This could cause unexpected behavior. So **this practice trades some of the consistency benefits of baking everything into a template and using it unchanged in every instance in order to speed up turnaround times for changes made in this way. In many cases, such as those involving frequent changes, this trade-off works quite well.** Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 3099-3116). O'Reilly Media. Kindle Edition.
44. The most **effective** **measurement** of a change management pipeline is the **cycle** **time**. Cycle time is the time between deciding on the need for a change to seeing that change in production use. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 4867-4868). O'Reilly Media. Kindle Edition.
45. **Blue-green** **replacement** is the most straightforward pattern to replace an infrastructure element **without** **downtime**. This is the blue-green deployment pattern for software applied to infrastructure. It requires running two instances of the affected infrastructure, **keeping** one of them **live** at any point in time. Changes and **upgrades** are made to the **offline** **instance**, which can be **thoroughly** **tested** before **switching** usage over to it. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 5681-5685). O'Reilly Media. Kindle Edition.
46. **Phoenix** **replacement** is the natural progression from blue-green using dynamic infrastructure. Rather than keeping an idle instance around between changes, a **new** **instance** can be created each time a **change** is needed. As with blue-green, the change is **tested** on the new instance before putting it into use. The previous instance can be **kept** **up** for a **short** **time**, until the new instance has been **proven** in use. But then the **previous** **instance** is **destroyed**. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 5694-5697). O'Reilly Media. Kindle Edition.
47. The **canary** **pattern** involves deploying the **new** **version** of an element alongside the old one, and then **routing** some **portion** of usage to the new elements. For example, with version A of an application running on 20 servers, version B may be deployed to two servers. A subset of traffic, perhaps flagged by IP address or by randomly setting a cookie, is sent to the servers for version B. The behavior, performance, and resource usage of the new element can be monitored to validate that it’s ready for wider use. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 5724-5728). O'Reilly Media. Kindle Edition.
48. The **benefits** of **containerization** include: **Decoupling** the **runtime** **requirements** of specific applications from the **host** **server** that the container runs on. Repeatably create **consistent** **runtime** **environments** by having a **container** **image** that can be distributed and run on **any** **host** **server** that supports the runtime. Defining **containers** as **code** (e.g., in a **Dockerfile**) that can be **managed** in a **VCS**, used to trigger **automated** **testing**, and generally having all of the characteristics for infrastructure as code. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 1633-1637). O'Reilly Media. Kindle Edition.
49. The benefits of **decoupling** **runtime** **requirements** from the **host** **system** are particularly powerful for infrastructure management. It creates a clean **separation** of concerns between **infrastructure** and **applications**. The host system **only** needs to have the **container** **runtime** **software** installed, and then it can run nearly any container image. Applications, services, and jobs are packaged into containers along with all of their dependencies \[...]. These dependencies can include operating system packages, language runtimes, libraries, and system files. **Different** **containers** may have different, even **conflicting** **dependencies**, but still run on the **same** **host** without issues. **Changes** to the **dependencies** can be made **without** any **changes** to the **host** system. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 1652-1658). O'Reilly Media. Kindle Edition.
50. Sharing the **OS** **kernel** means a container has **less** **overhead** than a hardware **virtual** **machine**. A **container** image can be much **smaller** than a **VM** image, because it doesn’t need to include the entire OS. It can start up in seconds, as it **doesn’t** **need** to **boot** a **kernel** from scratch. And it consumes **fewer** system **resources**, because it **doesn’t** need to run its own **kernel**. So a given **host** can run **more** **container** processes **than** full **VMs**. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 1701-1703). O'Reilly Media. Kindle Edition.
51. The best way to think of a **container** is as a **method** to **package** a **service**, application, or job. It’s an RPM on steroids, taking the application and adding in its dependencies, as well as providing a standard way for its **host** system to **manage** its **runtime** environment. Rather than a single container running multiple processes, aim for **multiple** **containers**, each running **one** **process**. These processes then become **independent**, **loosely** **coupled** entities. This makes containers a nice match for microservice application architectures. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 1708-1711). O'Reilly Media. Kindle Edition.
52. **Containerization** has the potential to create a clean **separation** between layers of **infrastructure** and the **services** and **applications** that **run** **on** it. **Host** servers that run containers can be kept very **simple**, without needing to be tailored to the requirements of specific applications, and without imposing constraints on the applications beyond those imposed by containerization and supporting services like logging and monitoring. So **the infrastructure that runs containers consists of generic container hosts**. These can be stripped down to a bare minimum, including only the minimum toolsets to run containers, and potentially a few agents for monitoring and other administrative tasks. This **simplifies** management of these **hosts**, as they **change** **less** often and have fewer things that can break or need updating. It also reduces the surface area for security exploits. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 1723-1729). O'Reilly Media. Kindle Edition.
53. **Containerization** offers a different **model** for managing **server** **processes**. **Processes** are **packaged** so that they can be **run** on servers that **haven’t** been **specifically** **built** for the purpose. A pool of **generic container hosts** can be available to run a variety of different containerized processes or jobs. Assigning containerized processes to hosts is flexible and quick. The **number** of container **hosts** can be **adjusted** **automatically** based on the **aggregated** **demand** across many different types of services. However, this approach requires a **scheduler** to start and manage container instances. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 2051-2055). O'Reilly Media. Kindle Edition.
54. “**Don’t Make Changes Directly on the Production Environment**: Most downtime in production environments is caused by uncontrolled changes. Production environments should be completely **locked** **down**, so that **only** your **deployment** **pipeline** can make **changes** to it. That includes everything from the **configuration** of the environment to the **applications** deployed on it and their **data**.” Humble, Jez. Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation (Addison-Wesley Signature Series (Fowler)) (p. 273). Pearson Education. Kindle Edition.
55. “The **CNF** should **run** **without** **privileges**. **Privileged** actions should be **managed** by the **scheduler** and environment”, x-factor-cnfs, Fred Kautz, <https://github.com/fkautz/x-factor-cnfs/blob/master/content/process-containers.md> To achieve process isolation, X-factor CNFs are designed to run in process containers without privileges. **Privileged** **actions** may be **requested** by the container and performed by **privileged** **delegate**. Running **unprivileged** promotes **loose** **coupling** between environments, **reduces** the overall **attack** **surface**, and gives the scheduler the ability to **clean** **up** after the pod in the case of the **pod** **failing**. The X-factor CNF methodology recognizes the need for hardware which requires additional kernel modules. When possible, **kernel** **modules** must follow standard Linux kernel device driver standards \[...] and do **not** affect the **kernel's** **runtime** **environment** **beyond** **enabling the device**. These devices must also **not** be **bound** directly from the **CNF**. Instead, they are listed as an **interface** **mechanism** and **injected** into the **container** **runtime** by the **orchestrator**. The existence of a hardware device should not affect other CNFs. Some **kernel** **modifications** may be acceptable, e.g. **DPDK** or drivers. This should be immutable infrastructure with a **clean** **interface** for pods. In short, **pods** should **not** be allowed to **modify** their **infrastructure**.
56. Container **orchestration** tools have emerged following the rise of containerization systems like Docker. Most of these run agents on a pool of container hosts and are able to **automatically** **select** **hosts** to run new **container** **instances**, **replace** **failed** instances, and **scale** numbers of instances **up** and **down**. Some tools also handle **service** **discovery**, **network** **routing**, **storage**, scheduled **jobs**, and other capabilities. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 2063-2066). O'Reilly Media. Kindle Edition.
57. **Data**: **Files generated and updated by the system**, applications, and so on. It may change frequently. The **infrastructure** may have some **responsibility** for this **data** (e.g., **distributing** it, **backing** it **up**, **replicating** it, etc.). But the infrastructure will normally **treat** the **contents** of the **files** as a **black** **box**, not caring about what’s in the files. Database data files and logs files are examples of data in this sense. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 2306-2309). O'Reilly Media. Kindle Edition.
58. The key **difference** between **configuration** and **data** is whether automation **tools** will automatically **manage** what’s **inside** the **file**. So even though some infrastructure tools do care about what’s in system logfiles, they’re normally treated as data files.
59. **Data** creates a particular **challenge** for **zero-downtime** change strategies. All of the patterns described involve running multiple versions of a component simultaneously, with the option to roll back and keep the first version of the element if there is a problem with the new one. However, if the components use read/write data storage, this can be a problem. The **problem** comes when the **new** **version** of the **component** involves a **change** to **data** **formats** so that it’s **not** **possible** to have **both** **versions** share the **same data storage** **without** **issues**. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 5766-5770). O'Reilly Media. Kindle Edition.
60. An effective way to approach **data** for **zero-downtime** deployments is to **decouple** **data** **format** **changes** **from** **software** **releases**. This requires the **software** to be written so that it can **work** with **two** **different data formats**, the **original** version and the **new** version. It is first **deployed** and **validated** with the data in the **original** **format**. The data is then **migrated** as a background task **while** the **software** is **running**. The **software** is able to **cope** with **whichever** **data** **format** a given record is in. Once the **data** **migration** is complete, **compatibility** with the **old** **data** format should be **removed** from the **next** **release** of the software, to keep the code clean. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 5776-5780). O'Reilly Media. Kindle Edition.
61. Any **mechanism** that efficiently **replicates** **data** across distributed storage will replicate **corrupted** data just as happily as good data. A data availability strategy needs to address how to **preserve** and **restore** **previous** **versions** of data. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 5823-5826). O'Reilly Media. Kindle Edition.
62. \[...] **core infrastructure resources: compute, networking, and storage**. These provide the basic building blocks for infrastructure as code. However, most infrastructures will need a variety of other supporting services and tools. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 1821-1822). O'Reilly Media. Kindle Edition.
63. \[...] principles of infrastructure as code for services can be summarized as: The **service** can be **easily** **rebuilt** or **reproduced**. The elements of the service are **disposable**. The infrastructure **elements** **managed** by the **service** are **disposable**. The infrastructure elements managed by the service are always changing. Instances of the service are configured consistently. Processes for **managing** and using the **service** are **repeatable**. Routine requests are fulfilled quickly, with little effort, preferably through **self-service** or automatically. Complex changes can be made easily and safely. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 1832-1837). O'Reilly Media. Kindle Edition.
64. Some of the specific practices include: Use **externalized definition files. Self-document** systems and processes. Version all the things. Continuously **test** systems and processes. **Make small changes rather than batches of them**. Keep services available continuously. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 1838-1840). O'Reilly Media. Kindle Edition.
65. Prefer Products with **Cloud-Compatible Licensing** Licensing can make dynamic infrastructure difficult with some products. Some examples of **licensing** approaches that work **poorly** include: A **manual** process to **register** each new instance, agent, node, etc., for licensing. Clearly, this defeats automated provisioning. If a product’s license does require registering infrastructure elements, there needs to be an **automatable** **process** for adding and removing them. **Inflexible licensing periods.** Some products require customers to buy a **fixed** set of **licenses** for a **long** **period**. For example, a monitoring tool may have licensing based on the maximum number of nodes that can be monitored. The licenses may need to be purchased on a **monthly** cycle. This forces the customer to pay for the maximum number of nodes they might use during a given month, even when they only run that number of nodes for a fraction of the time. This **cloud-unfriendly** **pricing** model **discourages** customers from taking advantage of the ability to **scale capacity up and down with demand.** Vendors pricing for cloud **charge by the hour at most.** Heavyweight purchasing process to increase capacity. This is closely related to the licensing period. When an organization is hit with an unexpected surge in business, they **shouldn’t** **need** to spend **days or weeks** to **purchase the extra capacity** they need to meet the demand. It’s common for vendors to have limits in place to protect customers against accidentally over-provisioning, but it should be possible to raise these limits quickly. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 1880-1892). O'Reilly Media. Kindle Edition.
66. Continuous delivery for software is implemented using a **deployment** **pipeline**. A deployment pipeline is an **automated** manifestation of a **release** **process**. It **builds** the application code, and **deploys** and **tests** it on a series of **environments** **before** allowing it to be deployed to **production**. The **same** concept is applied to **infrastructure** **changes**. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 3823-3826). O'Reilly Media. Kindle Edition.
67. The **point** of **CD** and the software deployment pipeline is to allow changes to be delivered in a **continuous** **flow**, **rather** than in **large** **batches**. Changes can be **validated** more **thoroughly**, not only because they are applied with an automated process, but also because **changes** are **tested** when they are **small**, and because they are tested immediately after being committed. The result, when done well, is that changes can be made **more** **frequently**, more **rapidly**, and more **reliably**. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 4525-4529). O'Reilly Media. Kindle Edition.
68. Teams who embrace the **pipeline** as the way to manage **changes** to their **infrastructure** find a number of benefits: Their infrastructure management **tooling** and **codebase** is always **production** **ready**. There is **never** a situation where **extra** **work** is needed (e.g., merging, regression testing, and “hardening”) to take work live. Delivering changes is nearly painless. Once a change has passed the technical validation stages of the pipeline, it **shouldn’t** **need** **technical** **attention** to carry through to **production** unless there is a problem. There is **no** **need** to make technical **decisions** about how to apply a change to **production**, as those decisions have been made, implemented, and tested in earlier stages. It’s easier to make changes through the pipeline than any other way. Hacking a change manually other than to bring up a system that is down is more work, and scarier, than just pushing it through the pipeline. **Compliance** and **governance** are easy. The **scripts**, **tools**, and **configuration** for making changes are **transparent** to **reviewers**. **Logs** can be **audited** to prove what **changes** were made, when, and by whom. With an automated change management pipeline, a team can prove what process was followed for each and every change. This tends to be **stronger** than taking someone’s word that **documented manual processes** are always followed. Change management processes can be more lightweight. People who might otherwise need to discuss and inspect each change can build their requirements into the automated tooling and tests. They can periodically review the pipeline implementation and logs, and make improvements as needed. Their time and attention goes to the process and tooling, rather than inspecting each change one by one. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 4551-4563). O'Reilly Media. Kindle Edition.
69. The important thing is **how the artifact is treated**, conceptually. A **configuration** **artifact** is an **atomic**, **versioned** collection of materials that **provision and/or configure a system component**. An **artifact** is: ***Atomic*** A given **set** of **materials** is **assembled**, **tested**, and **applied** together as a unit. ***Portable*** It can be **progressed** through the **pipeline**, and different versions can be applied to **different environments** or instances. It can be reliably and **repeatably applied to any environment,** and so any given environment has an unambiguous version of the component. ***Complete*** A given artifact should have **everything needed** to **provision** or **configure** the relevant **component**. It should **not assume** that **previous versions** of the **component artifacts** have been **applied** **before**. ***Consistent*** **Applying** the artifact to any two component instances should have the **same results**. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 4669-4678). O'Reilly Media. Kindle Edition.
70. With immutable servers \[...], the **artifact** is the **server template image**. The **build** stage checks a server **template definition file,** such as a **Packer template,** or script out from **VCS**, **builds** the server **template** **image**, and **publishes it** by making it available in the **infrastructure platform**. With AWS, for example, this results in an AMI image. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 4697-4701). O'Reilly Media. Kindle Edition.
71. A **pipeline** typically applies **tests** with **increasing levels** of **complexity**. The **earlier** **stages** focus on faster and **simpler** tests, such as unit tests, and testing a single service. **Later** **stages** cover **broader** sections of a system, and often replicate more of the complexities of production, such as **integration** with other services. It’s important that the environments, tooling, and processes involved in applying changes and deploying software are consistent across all of the stages of the pipeline. This ensures that **problems** that might appear in production are **discovered early**. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 3827-3831). O'Reilly Media. Kindle Edition.
72. A **top-heavy test suite** is **difficult** to **maintain**, **slow** to run, and doesn’t pinpoint errors as well as a more balanced suite. **High-level** tests tend to be **brittle**. One change in the system can break a large number of tests, which can be more work to fix than the original change. This leads to the **test** **suite** falling **behind** **development**, which means it **can’t be run continuously.** Higher-level tests are also slower to run than the more focused lower-level tests, which makes it impractical to run the full suite frequently. And because higher-level tests cover a broad scope of code and components, when a test fails it may take a **while** to **track** **down** and **fix** the cause. This usually comes about when a team puts a UI-based test automation tool at the core of their test automation strategy. This in turn often happens when testing is treated as a separate function from building. **Testers** who **aren’t** **involved** in **building** the system **don’t** have the **visibility** or involvement with the different layers of the stack. This **prevents** them from **developing** **lower-level** **tests** and incorporating them into the build and change process. For someone who only interacts with the system as a black box, the UI is the easiest way to interact with it. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 4025-4034). O'Reilly Media. Kindle Edition.
73. So it’s good sense to **sanity** **check** each new **server** when it’s **created**. Automated server **smoke** **testing** scripts can check the basic things you expect for all of your servers, things specific to the server’s role, and general compliance. For example: Is the server running and accessible? Is the monitoring agent running? Has the server appeared in DNS, monitoring, and other network services? Are all of the necessary services (web, app, database, etc.) running? Are required user accounts in place? Are there any ports open that shouldn’t be? Are any user accounts enabled that shouldn’t be? Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 2511-2515). O'Reilly Media. Kindle Edition.
74. People managing projects to develop and deploy software have a bucket of requirements they call **non-functional** **requirements**, or NFRs; these are also sometimes referred to as cross-functional requirements (CFRs). **Performance**, **availability**, and **security** tend to be swept into this bucket. NFRs related to infrastructure can be labeled **operational** **qualities** for convenience. These are things that can’t easily be described using functional terms: take an action or see a result. Operational requirements are only apparent to users and stakeholders when they go wrong. If the system is slow, flaky, or compromised by attackers, people notice. **Automated** **testing** is **essential** to ensuring **operational** **requirements**. Every time a **change** is made to a system or its infrastructure, it’s important to **prove** that the change **won’t** cause **operational** **problems**. The team should have targets and thresholds defined which it can write tests against. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 4242-4249). O'Reilly Media. Kindle Edition.
75. **Metrics** are best used by the team to help itself, and should be continually reviewed to decide whether they are still providing **value**. Some common metrics used by infrastructure teams include: ***Cycle time*** The **time** taken from a **need** being **identified** to **fulfilling** it. This is a measure of the **efficiency** and **speed** of **change management**. \[...] ***Mean time to recover*** (MTTR) The time taken from an **availability problem** (which includes critically degraded performance or functionality) being **identified** to a **resolution**, even where it’s a workaround. This is a measure of the **efficiency** and **speed** of **problem resolution.** ***Mean time between failures*** (MTBF) The **time** **between** **critical** availability **issues**. This is a measure of the stability of the system, and the quality of the change management process. Although it’s a valuable metric, **over-optimizing for MTBF is a common cause of poor performance on other metrics**. ***Availability*** The **percentage** of **time** that the **system** is **available**, usually excluding time the system is offline for planned maintenance. This is another measurement of system stability. It is often used as an SLA in service contracts. ***True availability*** The percentage of time that the system is available, not excluding planned maintenance.
76. **Layered** **templates** fit well, conceptually at least, with a change management pipeline (see Chapter 12), because changes to the base image can then **ripple out** to **automatically** **build** the **role-specific** **templates**. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 2805-2807). O'Reilly Media. Kindle Edition.
77. **Integration Models** The design and implementation of pipelines for testing how systems and infrastructure elements integrate depends on the relationships between them, and the relationships between the teams responsible for them. There are several typical situations: ***Single team*** One team owns all of the elements of the system and is fully responsible for managing changes to them. In this case, a **single pipeline,** with **fan-in** as needed, is often sufficient. ***Group of teams*** A group of teams works together on a single system with multiple services and/or infrastructure elements. Different teams own different parts of the system, which all integrate together. In this case, a **single fan-in pipeline may work** up to a point, but as the **size** of the group and its system **grows**, **decoupling** may become necessary. ***Separate teams with high coordination*** Each team (which may itself be a group of teams) **owns** a **system**, which **integrates** with **systems** **owned** by **other teams.** A given system may integrate with multiple systems. Each **team** will have its **own pipeline** and manage its **releases independently.** But they may have a close enough relationship that one team is willing to **customize** its **systems** and releases to **support another team’s requirements.** This is often seen with different **groups within a large company** and with close vendor relationships. ***Separate teams with low coordination*** As with the previous situation, except **one** of the **teams** is a **vendor** with many other customers. Their release process is designed to **meet the requirements of many teams,** with **little** or **no customizations** to the requirements of individual customer teams. “X as a Service” vendors, providing logging, infrastructure, web analytics, and so on, tend to use this model. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 4892-4907). O'Reilly Media. Kindle Edition.
78. **Managing** the configuration of **hardware network devices** to support a dynamic infrastructure is a particular **challenge**. For example, you may need to add and remove servers from a load balancer if they are created and destroyed automatically in response to demand. **Network** **devices** tend to be difficult to **automatically** **configure**, although many are able to load configuration files over the network, for example, from a TFTP server. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 859-861). O'Reilly Media. Kindle Edition.
79. Given two **integrated** **components**, one **provides** a **service**, and the other **consumes** it. The **provider** component needs to **test** that it is providing the service correctly for its consumers. And the consumer needs to **test** that it is consuming the provider service correctly. For example, one team may manage a monitoring service, which is used by multiple application teams. The monitoring team is the provider, and the application teams are the consumers. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 4930-4933). O'Reilly Media. Kindle Edition.
80. **Pattern: Library Dependency** One way that one component can provide a capability to another is to work like a **library**. The **consumer** pulls a **version** of the **provider** and **incorporates** it into its own **artifact**, usually in the **build** stage of a **pipeline**. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 4937-4939). O'Reilly Media. Kindle Edition.
81. The important characteristic is that the **library** component is **versioned**, and the **consumer** can **choose** which **version** to use. If a **newer version** of the library is released, the **consumer** may **opt** to immediately pull it in, and then run **tests** on it. However, it has the option to **“pin” to an older version** of the library. This gives the **consumer** team the flexibility to **release changes** even if they **haven’t** yet **incorporated** **new**, incompatible changes to their **provider library**. But it creates the **risk** that important changes, such as **security** patches, aren’t integrated in a timely way. This is a major source of security vulnerability in IT systems. For the **provider**, this pattern gives freedom to **release new changes** **without** having to wait for all **consumer** teams to update their components. But it can result in having **many different versions** of the **component** in **production**, which increases the time and **hassle** of **support**. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 4941-4947). O'Reilly Media. Kindle Edition.
82. **Pattern: Self-Provisioned Service Instance** The library pattern can be adapted for full-blown services. A well-known example of this is AWS’s Relational Database Service, or RDS, offered by AWS. A team can provision complete working database instances for itself, which it can use in a pipeline for a component that uses a database. As a provider, Amazon releases new database versions, while still making older versions available as “previous generation DB instances”. This has the **same** effect as the **library pattern**, in that the **provider** can **release new versions** **without waiting** for **consumer teams** to **upgrade** their own components. Being a **service** **rather** than a **library**, the provider is able to **transparently** **release** **minor updates** to the service. Amazon can apply **security** patches to its RDS offering, and **new instances** created by consumer teams will automatically **use the updated version**. The key is for the **provider** to keep close track of the **interface** **contract**, to make sure the service behaves as expected after updates have been applied. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 4963-4973). O'Reilly Media. Kindle Edition.
83. **Providing Test Instances of a Service to Consumers** The **provider** of a **hosted** service needs to **provide** **support** for **consumers** to **develop** and **test** their **integration** with the **service**. This is useful to consumers for a number of purposes: To learn how to correctly integrate to the service. To **test** that integration still works after **changes** to the **consumer** system. To **test** that the consumer system still works **after** **changes** to the **provider**. To **test** and develop against **new** **provider** **functionality** **before** it is **released**. To **reproduce** **production** **issues** for troubleshooting. To run **demonstrations** of the consumer system without affecting production data. An effective way for a provider to support these is to provide self-provisioned service instances. If **consumers** can **create** and **configure** **instances** **on-demand**, then they can easily handle their own testing and demonstration needs. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 4992-4999). O'Reilly Media. Kindle Edition.
84. \[...] a common technique is to **deploy** **changes** into **production** **without** necessarily **releasing** them to end users. New **versions** of components are put into the production environment and integrated with other components in ways that won’t impact normal operations. This allows them to be **tested** in true **production** conditions, **before** the **switch is flipped** to put them into active use. They can even be put into use in a drip-feed fashion, measuring their performance and impact before rolling them out to the full user base. Techniques for doing this include those used to hide unfinished changes in the codebase, such as **feature** **toggles**, as well as zero-downtime replacement patterns. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 5062-5067). O'Reilly Media. Kindle Edition.
85. **Contract tests** are automated tests that check whether a provider **interface** **behaves** as **consumers** **expect**. This is a much smaller set of tests than full functional tests, purely **focused on the API** that the service has committed to provide to its consumers. By running contract tests in their own **pipeline**, the **provider** team will be alerted if they accidentally make a change that **breaks** the **contract**. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 5092-5095). O'Reilly Media. Kindle Edition.
86. **Practice: Run Consumer-Driven Contract (CDC) Tests** A variation on these previous practices is for a provider to run **tests** **provided by consumer** teams. These tests are written to formalize the expectations the consumer has for the provider’s interface. The **provider** runs the **tests** in a stage of its own **pipeline**, and fails any build that fails to pass these tests. A failure of a CDC test tells the provider they need to investigate the nature of the failed expectation. In some cases, the provider will realize they have made an error, so they can correct it. In others, they **may see** that the **consumer’s** **expectations** are **incorrect**, so they can let them know to change their own code. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 5112-5118). O'Reilly Media. Kindle Edition.
87. **Practice: Ensure Backward Compatibility of Interfaces** Providers should work hard to **avoid** making changes that **break** existing **interfaces**. Once a release is published and in use, **new releases** should not **change** the **interfaces** used by consumers. It’s normally easy to **add** new functionality **without** **breaking** interfaces. A command-line tool can add a new argument without changing the behavior of existing arguments. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 5045-5048). O'Reilly Media. Kindle Edition.
88. If there is a need to make **drastic changes** to an existing interface, it’s often better to **create** a **new** **interface**, leaving the old one in place. The old interface can be deprecated, with warnings to users that they should move to using the new version. This **gives consumers** **time** to **develop** and test their use of the new interface before they switch, which in turn gives the provider the flexibility to release and iterate on the new functionality without having to wait for all of their consumers to switch. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 5052-5055). O'Reilly Media. Kindle Edition.
89. **Backward-compatible** changes are those that **allow** any **future** **implementations** to **consume** **older** **versions** of the design. For example, if a **client** that **supports** one or more of the **new** extensions receives a representation from a **server** that does **not** **support** these **new** extensions, the **client** implementation will continue to **function** **successfully**. Usually, this means the **new** extensions do **not** create a **required** **dependency** that causes implementations to break if that extension is missing. Amundsen, Mike. Building Hypermedia APIs with HTML5 and Node (p. 147). Kindle Edition.
90. Often **backward-compatible** changes mean that **implementations** need to be **prepared** to **handle** representations that are **missing expected elements**. This may mean that some functionality or feature is not available to an implementation and that implementation must work around that missing element or fall back to a mode that supports an older design of the media type. Amundsen, Mike. Building Hypermedia APIs with HTML5 and Node (p. 147). Kindle Edition.
91. An **extension** can be considered “**forward** **compatible**” if the **changes** do **not** **break** **existing** **implementations**. This means existing client applications will continue to **work** **successfully** when they **receive** a **representation** that contains the **new** **features**. Servers will continue to function properly when they receive a representation that contains the new features. Usually, this means the **new features/functionality can be safely ignored by implementations that do not recognize them**. Amundsen, Mike. Building Hypermedia APIs with HTML5 and Node (p. 146). Kindle Edition.
92. The flipside of maintaining backward compatibility for providers is for consumers to ensure version tolerance. A consumer team should ensure that a single build of their system can easily work with different versions of a provider that they integrate with, if it’s a likely situation. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 5070-5071). O'Reilly Media. Kindle Edition.
93. In order to support both forward and backward compatibility, there are some general guidelines that should be followed when making changes to media type designs. **Existing design elements cannot be removed** \[...] **The meaning or processing of existing elements cannot be changed** \[...] **New design elements must be treated as optional** Amundsen, Mike. Building Hypermedia APIs with HTML5 and Node (pp. 147-148). Kindle Edition.
94. **Versioning** a media type means **making** **changes** to the media type that will likely cause **existing** **implementations** of the original media type to “**break**” or misbehave in some significant way. \[...] Versioning should be seen as a last resort. Amundsen, Mike. Building Hypermedia APIs with HTML5 and Node (p. 148). Kindle Edition.
95. **Practice: Decouple Pipelines** When separate teams build different components of a system, such as microservices, **joining** **pipeline** branches for these components together with the **fan-in pattern** can create a **bottleneck**. The **teams** need to spend more **effort** on **coordinating** the way they handle releases, testing, and fixing. This may be **fine** for a **small number** of teams who work closely together, but the overhead grows exponentially as the number of teams grows. Decoupling pipelines involves **structuring** the **pipelines** so that a **change** to each **component** can be **released** **independently**. The components may still have dependencies between each other, so they may need integration testing. But rather than requiring all of the components to be released to production together in a “big bang” deployment, a **change** to one **component** could **go** ahead to **production** **before** changes to the **second** **component** are released. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 4883-4889). O'Reilly Media. Kindle Edition.
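
The three compatibility guidelines quoted in note 93 (existing elements cannot be removed, their meaning cannot be changed, new elements must be optional) can be illustrated with a small check that a provider might run in its own pipeline before releasing an interface change. This is only a minimal sketch and does not come from either cited book: the `Element` model and the `compatibility_violations` function are hypothetical names, and "meaning or processing" is approximated here by a simple `kind` string, which is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """One design element of an interface (hypothetical model for illustration)."""
    kind: str               # stands in for the element's meaning/processing (assumption)
    required: bool = False


def compatibility_violations(old: dict[str, Element],
                             new: dict[str, Element]) -> list[str]:
    """Return the guideline violations introduced by moving from `old` to `new`."""
    violations = []
    for name, element in old.items():
        if name not in new:
            # Guideline 1: existing design elements cannot be removed.
            violations.append(f"removed: {name}")
        elif new[name].kind != element.kind:
            # Guideline 2: the meaning of existing elements cannot be changed.
            violations.append(f"changed: {name}")
    for name, element in new.items():
        if name not in old and element.required:
            # Guideline 3: new design elements must be treated as optional.
            violations.append(f"new element is required: {name}")
    return violations


# Example interface versions (made-up element names).
v1 = {"id": Element("string", required=True), "name": Element("string")}
v2_ok = {**v1, "email": Element("string")}
v2_bad = {"id": Element("integer", required=True),
          "phone": Element("string", required=True)}

print(compatibility_violations(v1, v2_ok))
print(compatibility_violations(v1, v2_bad))
```

An empty result means the change only adds optional elements, so existing consumers could keep working without an immediate upgrade; a non-empty result suggests the change may need a new interface or a version, which note 94 describes as a last resort.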
