The Kubernetes Migration Playbook
I have helped migrate services to Kubernetes more times than I can count. The technology is not the hard part. The hard part is everything else: the team habits, the dependency surface, the operational processes that worked fine on VMs but break in containers.
Here is the playbook I use, stripped of the parts that do not matter.
Before You Write a Single YAML File
The most important work happens before you touch Kubernetes at all.
Inventory your services. For each service: What does it depend on? What depends on it? How is configuration delivered (environment variables, files, secrets)? How does it behave across its lifecycle: does it shut down gracefully on SIGTERM, and does it expose health checks?
Most services that were "fine on EC2" have never been asked these questions seriously. You will find things that need to be fixed in the application before they can run cleanly in a container.
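As an illustration of the SIGTERM question, here is a minimal Python sketch of a worker that stops picking up new work once it is asked to terminate. The names GracefulShutdown and run_worker are hypothetical, and a real service would also need to finish in-flight requests and close connections before exiting.

```python
import signal
import threading


class GracefulShutdown:
    """Tracks whether the process has been asked to stop (hypothetical helper)."""

    def __init__(self):
        self._stop = threading.Event()

    def install(self):
        # Kubernetes sends SIGTERM first and only sends SIGKILL once the
        # pod's termination grace period has expired.
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame):
        # Only record the request; the main loop decides when to exit.
        self._stop.set()

    @property
    def stopping(self):
        return self._stop.is_set()


def run_worker(shutdown, jobs, process):
    """Process jobs in order until none remain or shutdown is requested.

    Returns the jobs that were never started, so the caller can hand
    them back to a queue instead of silently dropping them.
    """
    pending = list(jobs)
    while pending and not shutdown.stopping:
        process(pending.pop(0))
    return pending
```

In a real process you would call install() once at startup, from the main thread. The point of the sketch is the shape: the signal handler only flips a flag, and the work loop checks it between units of work.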
Decide on your cluster topology early. Single cluster with namespaces per environment? Separate clusters per environment? Multi-region? Each has a different operational cost. I generally recommend separate staging and production clusters with a single dev cluster, but this depends on your team size and compliance requirements.
Pick your tooling before you start. Helm or Kustomize for manifests? ArgoCD or Flux for deployment? Cluster Autoscaler or Karpenter for scaling? Calico or VPC-native CNI? These choices have downstream effects. Making them upfront avoids replumbing things mid-migration.
The Migration Sequence
Migrate in this order. It is not glamorous but it reduces blast radius.
1. Stateless services first. If your service has no persistent state and is horizontally scalable, it is the easiest thing to containerize. Start here. Build your operational muscle — how do you read logs, how do you exec into a pod, how does your alerting work — before things get complicated.
2. Add proper health checks. Every service needs a liveness probe and a readiness probe. They should check different things: liveness checks whether the process is alive, readiness checks whether it can serve traffic. A container that fails liveness gets restarted. A pod that passes liveness but fails readiness keeps running but is removed from its Service's endpoints, so it stops receiving traffic. That is exactly what you want during startup or when a dependency is down.
3. Set resource requests and limits. Do not skip this. Without requests, the scheduler cannot tell how much room a pod needs, so nodes get overpacked and pods are evicted or OOMKilled at inopportune moments. Measure actual usage first, then set requests to the p75 of observed usage and limits to the p99. Revisit after a week.
4. Stateful services last. Databases, caches, message queues. If you can keep these outside Kubernetes (RDS, ElastiCache, MSK) and just run your application layer on Kubernetes, do that. Running stateful services on Kubernetes is possible and sometimes the right call, but it requires more operational depth than most teams have when they are starting out.
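Steps 2 and 3 can be sketched together. The Python below builds the probe and memory-resource portion of a container spec as a plain dict, deriving the request and limit from measured usage samples. The /healthz and /readyz paths, the port, the probe periods, and the function names are assumptions for illustration; the percentile uses the nearest-rank method, which is one of several reasonable definitions.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def container_settings(memory_mib_samples, port=8080):
    """Build probe and resource settings for one container.

    memory_mib_samples: observed memory usage in MiB, e.g. from a week
    of metrics. The endpoint paths below are assumed, not standard.
    """
    request = math.ceil(percentile(memory_mib_samples, 75))
    limit = math.ceil(percentile(memory_mib_samples, 99))
    return {
        # Liveness: is the process alive at all?
        "livenessProbe": {
            "httpGet": {"path": "/healthz", "port": port},
            "periodSeconds": 10,
        },
        # Readiness: can it serve traffic right now?
        "readinessProbe": {
            "httpGet": {"path": "/readyz", "port": port},
            "periodSeconds": 5,
        },
        "resources": {
            "requests": {"memory": f"{request}Mi"},
            "limits": {"memory": f"{limit}Mi"},
        },
    }
```

The resulting dict mirrors the field names used in a pod's container spec, so it can be serialized into a manifest by whatever templating tool you picked. CPU would follow the same pattern with its own samples.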
The Operational Gaps Nobody Mentions
Log aggregation. On EC2 you probably relied on CloudWatch Logs and log files. In Kubernetes, pods are ephemeral. When a pod dies, its logs go with it unless something has already collected them. Fluent Bit shipping to CloudWatch Logs or Loki works well. Set this up before your first production deploy.
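On the application side, the usual arrangement is to write logs to stdout rather than to files inside the container, since the container's output streams are what node-level collectors normally read. Here is a minimal sketch using Python's standard logging module that emits one JSON object per line. The field names and the build_logger helper are assumptions, not a required schema.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


def build_logger(name, stream=sys.stdout):
    """Return a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]   # replace any handlers set up earlier
    logger.setLevel(logging.INFO)
    logger.propagate = False      # avoid duplicate lines via the root logger
    return logger
```

One line per event keeps parsing on the collector side simple, and structured fields make it easier to filter by service or level once the logs land in CloudWatch Logs or Loki.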
Secret management. Kubernetes Secrets are base64-encoded, which is an encoding, not encryption. By default they are stored unencrypted in etcd, and anyone whose RBAC permissions allow reading Secrets can see the values. If you are serious about secret management, and you should be, integrate with AWS Secrets Manager or HashiCorp Vault from the start. The External Secrets Operator makes this straightforward.
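To make the base64 point concrete, the short Python sketch below encodes a value the way it would appear in a Secret manifest and then decodes it again without any key. The function names are hypothetical.

```python
import base64


def encode_secret_value(plaintext):
    """Encode a value the way it appears under `data:` in a Secret manifest."""
    return base64.b64encode(plaintext.encode("utf-8")).decode("ascii")


def decode_secret_value(encoded):
    """Recover the plaintext. No key is involved, only the encoded string."""
    return base64.b64decode(encoded).decode("utf-8")
```

Anyone who can read the encoded string can run the second function, which is why access control and an external secret store matter more than the encoding itself.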
Node group sizing. It is tempting to run everything on a single node group with large instances. This creates concentration risk and makes cost optimization harder. Run at least two node groups: one for burstable, interruptible workloads (Spot instances work well here) and one for stable, latency-sensitive services on On-Demand.
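Pod disruption budgets. When you drain a node for maintenance or an upgrade, Kubernetes evicts the pods running on it. Without a PodDisruptionBudget, nothing stops a drain from evicting every replica of a service at the same time if they all sit on the nodes being drained. A simple minAvailable: 1 PDB on each deployment prevents this, provided the deployment runs at least two replicas; with a single replica, that budget allows no evictions and the drain will stall.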
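The arithmetic behind a minAvailable budget is simple enough to sketch in Python. This is a simplified model with hypothetical function names: it ignores percentage values and maxUnavailable, and treats the healthy replica count as given.

```python
def disruptions_allowed(healthy_replicas, min_available):
    """How many pods may be voluntarily evicted right now under minAvailable."""
    return max(0, healthy_replicas - min_available)


def can_evict(healthy_replicas, min_available):
    """True if evicting one more pod would still respect the budget."""
    return disruptions_allowed(healthy_replicas, min_available) > 0
```

With three healthy replicas and minAvailable: 1, two pods can be evicted before the budget blocks further evictions. With one healthy replica and the same budget, can_evict returns False, which is worth checking for before a planned node drain.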
What Success Looks Like
A successful Kubernetes migration is not measured by whether everything is running in the cluster. It is measured by:
- Developers can deploy without help from the platform team
- An incident in one service does not cascade to unrelated services (resource isolation working)
- A node failure or cluster upgrade does not cause an outage
- You can reproduce the production environment locally or in a test cluster
The technology is mature. The patterns are well-established. The failures I have seen come from moving too fast, skipping the boring operational foundations, or underestimating how much the team needs to learn. Go slow enough to build the habits, and the migration will stick.