Vignesh Venkataraman
Tech
June 19, 2022

Sometimes The Grass Really Is Greener — The Story of Our (Second) Cloud Migration


Once upon a time, we migrated a significant inherited portion of Curai Health’s infrastructure from an international cloud provider to Google Cloud Platform (GCP). You can read all about that journey here. A few years later, we chose to do another large-scale infrastructure migration, this time to Amazon Web Services (AWS). Why would we embark on such a journey, and what was the ultimate payoff? Read on for a story of adventure, of perseverance, of documented and undocumented gotchas, of heroism, of villainy, and ultimately, of success!

The Status Quo, Circa 2020

At the end of our last migration, we had moved our infrastructure into GCP. There were many reasons for this, including, but not limited to: the familiarity of our early engineers with GCP, a very welcoming “startup” package (with credits!) from Google, a solid IAM integration with our Google Workspace (formerly G-Suite) productivity suite, and the promise of solid AI / ML support. The cutting-edge solution at the time, in terms of container orchestration, was Google Kubernetes Engine (GKE), of which we were among the earliest adopters. We were also using Google Cloud SQL and Google Cloud Memorystore in our production stack; in addition, we were using BigQuery as our data warehouse and analytics platform.


In terms of running workloads, we had your usual mix of backend and frontend web services and scheduled recurring tasks. In an effort to wrangle these pieces, we also adopted Helm as a part of our Kubernetes infrastructure, which allowed us to at least templatize and reduce the complexity of our various Kube manifests (read: YAML files). Also via Helm, we leveraged a custom ingress controller, Argo CI / CD for complex containerized workflows, and Prometheus for monitoring. We also installed cert-manager (unfortunately not manageable via Helm) to try to automagically re-issue our SSL certificates via Let’s Encrypt.

The Proverbial Nuclear Warhead to Kill a Cockroach

All of the above sounds solid in theory, right? We’re leveraging a good number of first-party GCP managed services, which should have solid developer ergonomics and uptime guarantees. For the rest of the stack, we’re using the same tooling just about any Kube developer would be using, between Helm and the various open source Helm charts we installed. In general, as we consulted GCP experts for advice on how to best design our stack to meet our needs, they stated pretty firmly that our architecture was sound, at least based on the offerings available in GCP at the time. Put another way, the only other alternative we could even consider on the GCP side, in terms of container orchestration, was Google Cloud Run. We did use this for a few standalone toy services, but circa 2020, there wasn’t even a way to specify a minimum replica count for tasks or services running on Cloud Run. It was not a viable solution.


In reality, this stack was nearly impossible for our lean engineering team to manage. The issues were multifarious and appeared at multiple levels. One issue was related to the choice of GKE itself. Kube is an incredibly powerful tool, but in this case, it was tantamount to deploying the proverbial nuclear warhead to kill a cockroach. All we needed to do was orchestrate some containers; all of the bells, whistles, and knobs that Kube has were wholly unnecessary. More fundamentally, we were never able to achieve a good efficiency tradeoff between our nodes and node pools and the workloads we were running on them. Some of our containers were especially beefy and resource-intensive, so at least some of our nodes had to be provisioned with enough capacity to fit them. But a node pool had to be at least a few nodes large to meet GKE's SLA requirements, and the waste due to excess capacity was significant. There are numerous other criticisms of Kube, which I will not rehash here (see here and here for detailed looks), but suffice it to say that the best practices of maintaining a production-grade GKE cluster are not things that a small engineering team can reasonably manage, even if Google bears the onus on the infrastructure front.

Similarly, Helm was a step in the right direction but more often than not led us astray. Whether it was a complete inability to ever upgrade to Helm 3, or significant issues with “production grade” public Helm charts, or poor developer ergonomics, we never got the hang of using it, even among the infrastructure-inclined engineers on our team. To add insult to injury, even minor version upgrades to Helm were enough to cause us grief. We again consulted GCP experts for advice on how to do config-as-code in the context of GKE, and all they could point us to were enterprise-grade solutions far heavier than what we were looking for.

All of these issues are manageable, but what we couldn’t really escape were fundamental platform-level problems that GCP was not able to address. We were not pleased with the reliability of their data stack: in particular, Google Cloud SQL and Google Cloud Memorystore failed more often than we would like, and the inability to be notified about upcoming maintenance, much less schedule it, bit us roughly twice a year. We had to tear down and reconstruct our GKE clusters from scratch not once, not twice, but three times — once to move from “legacy” networking to VPC native networking, once more to make our data stack accessible to our clusters, and once to recover from a botched Helm upgrade (this one was our fault). By the third time, our eyes were wandering and we were looking for alternatives that valued backwards compatibility and developer experience more.

Why AWS?

As we started to take stock of the competitive landscape, we began to solidify requirements around what we were looking for in a cloud provider. In short:

  1. Obviously, as a health tech company, we needed a HIPAA-compliant partner who would sign a BAA with us.
  2. We wanted rock-solid reliability and configurability in the data tier.
  3. We wanted a way to orchestrate our containers without a huge amount of technical (or monetary) overhead.
  4. We wanted a developer experience that didn’t involve us rewriting code or rebuilding infrastructure to keep up with a breakneck pace of breaking changes.
  5. We wanted to lean as far into “infrastructure-as-code” as possible.
  6. And finally, as a bonus, we wanted a cloud provider who could offer these same developer experience guarantees for AI / ML infrastructure.

Point (5) in particular was the result of many hours of clicking through the Google Cloud console, performing repetitive tasks for which we’d need extensive runbooks to make sure we didn’t mess things up. The idea of writing declarative code to provision our cloud environments was highly appealing and something we wanted to prioritize as we shopped around.

AWS checked a lot of boxes right off the bat. RDS and Elasticache are rock solid, and offer the feature whose absence hurt us most in GCP: scheduled maintenance windows and notifications. ECS Fargate was a serverless solution that allowed us to pay only for the actual usage of our workloads, without any maintenance of node pools or virtual machines. AWS is notorious, in a good way, for maintaining backwards compatibility for exhaustive periods of time. Amazon proudly markets SageMaker as a battle-tested AI / ML infrastructure solution. And with AWS, we had not one, not two, but three options for infrastructure-as-code: CloudFormation, AWS's in-house declarative configuration language; the Cloud Development Kit (CDK), which can loosely be thought of as CloudFormation expressed in actual programming languages rather than YAML; and of course the AWS Terraform provider, which exposes all of the AWS concepts you would expect as declarative Terraform resources.
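
To make that concrete, here is a minimal Terraform sketch of the kind of control we were after: an RDS Postgres instance with an explicit maintenance window. The names, versions, and windows below are hypothetical placeholders, not our actual settings.

    # Hypothetical example: identifiers, versions, and windows are illustrative.
    provider "aws" {
      region = "us-west-2"
    }

    variable "db_password" {
      type      = string
      sensitive = true
    }

    resource "aws_db_instance" "app" {
      identifier        = "app-postgres-dev"
      engine            = "postgres"
      engine_version    = "13.4"
      instance_class    = "db.t3.medium"
      allocated_storage = 50
      username          = "app"
      password          = var.db_password

      # The knobs we were missing in GCP: patches are applied only inside this
      # window, and pending maintenance is surfaced ahead of time.
      backup_window      = "08:00-09:00"          # UTC
      maintenance_window = "sun:09:30-sun:10:30"  # UTC
    }

Even a tiny snippet like this captures the appeal: the maintenance window lives in version control rather than in someone's memory of a console page.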

Intrigued, we loosely floated the idea of doing a migration internally, but didn’t think the investment would be worth it without some factors tipping the scales. And on that note…

The Bumpy Migration Path


Through our contacts, we came across an AWS preferred partner who offered to help us with the migration, with the cost borne by AWS! This seemed too good to be true, and so we spent some time digging in. In hindsight, perhaps we should have dug a little deeper, but with our GCP frustrations looming large, we bit the bullet and kicked off the migration preparation.

Unfortunately, while AWS was incredibly supportive and helpful, the choice of migration partner ended up being a mistake. The partner submitted a lowball bid to AWS in order to gain our, and their, business, but then compounded that (lack of) investment by saddling us with their C team. They refused to leverage open-source modules, didn’t take our design input, and quickly fell behind even the most conservative time estimates for this admittedly-complex rollover.

At the end of the day, our Curai engineers swooped in to salvage this situation, and we terminated our engagement with the partner after successfully getting our dev environment migrated. This was a good lesson learned for us: when working with a technical partner who needs to be deeply integrated with your team and your code, make sure that they pass muster and will give you their best shot. In particular, we should have asked specific questions on implementation details, requested a tech spec, and inquired about staffing before committing.

Our New Architecture

All that aside, by the time our dev environment came to life, we were delighted with the results of our new architecture. After some experimentation, we settled on Terraform (and Terraform Cloud) as our infrastructure-as-code solution; within that ecosystem, we elected to create four workspaces: dev, staging, production, and a “shared” workspace that spanned all environments and contained things like global IAM roles, users and groups, SSL / TLS certificates, Route 53 DNS records, and other non-env-specific resources. We decided right off the bat to exclude our data platform (based on BigQuery) from the migration; it could be handled separately from the migration of our core stack.
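
As a rough illustration, each environment's configuration can be pinned to its own Terraform Cloud workspace with a remote backend block along these lines (the organization and workspace names here are made up for the sake of the example):

    # Hypothetical sketch: one such block per environment
    # (dev, staging, production, plus the "shared" workspace).
    terraform {
      required_version = ">= 0.14"

      backend "remote" {
        hostname     = "app.terraform.io"
        organization = "example-org"   # placeholder

        workspaces {
          name = "infra-dev"           # e.g. infra-staging, infra-production, infra-shared
        }
      }

      required_providers {
        aws = {
          source  = "hashicorp/aws"
          version = "~> 3.0"
        }
      }
    }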

Within each workspace, the structure was roughly similar. The networking stack was the standard AWS VPC architecture, with a separate VPC per environment. Each environment (dev, staging, and production) had an application load balancer (ALB) and public and private subnets, with internet and NAT gateways attached. All of our workloads ran on ECS Fargate, where we were able to allocate (and pay for) precisely the amount of CPU, RAM, and disk space our containers actually used. Our recurring tasks were scheduled through CloudWatch cron event targets. Speaking of CloudWatch, we set up alarms and log groups to collect all of our container insights and output. We used (as you might expect) RDS for our PostgreSQL databases, Elasticache for our Redis caches, S3 for object storage, CodeBuild for building our Docker containers, and Elastic Container Registry (ECR) for storing our built Docker containers. All in all, this is about as standard as you can get for any cloud provider.
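
To give a flavor of what one of those recurring tasks looks like in Terraform, here is a simplified, hypothetical sketch (names, roles, and schedules are placeholders, not our actual configuration) of a Fargate task definition wired up to a CloudWatch cron rule:

    # Hypothetical sketch; the cluster, IAM roles, subnets, and security group
    # are assumed to be defined elsewhere and passed in as variables.
    variable "cluster_arn"            { type = string }
    variable "execution_role_arn"     { type = string }
    variable "events_role_arn"        { type = string }
    variable "private_subnet_ids"     { type = list(string) }
    variable "task_security_group_id" { type = string }

    resource "aws_cloudwatch_log_group" "nightly_job" {
      name              = "/ecs/nightly-job"
      retention_in_days = 30
    }

    resource "aws_ecs_task_definition" "nightly_job" {
      family                   = "nightly-job"
      requires_compatibilities = ["FARGATE"]
      network_mode             = "awsvpc"
      cpu                      = 512    # 0.5 vCPU, billed only while the task runs
      memory                   = 1024   # MiB
      execution_role_arn       = var.execution_role_arn

      container_definitions = jsonencode([{
        name      = "nightly-job"
        image     = "123456789012.dkr.ecr.us-west-2.amazonaws.com/nightly-job:latest"  # placeholder
        essential = true
        logConfiguration = {
          logDriver = "awslogs"
          options = {
            "awslogs-group"         = aws_cloudwatch_log_group.nightly_job.name
            "awslogs-region"        = "us-west-2"
            "awslogs-stream-prefix" = "nightly-job"
          }
        }
      }])
    }

    # Fire the task every day at 09:00 UTC.
    resource "aws_cloudwatch_event_rule" "nightly" {
      name                = "nightly-job"
      schedule_expression = "cron(0 9 * * ? *)"
    }

    resource "aws_cloudwatch_event_target" "nightly" {
      rule     = aws_cloudwatch_event_rule.nightly.name
      arn      = var.cluster_arn
      role_arn = var.events_role_arn

      ecs_target {
        task_definition_arn = aws_ecs_task_definition.nightly_job.arn
        launch_type         = "FARGATE"

        network_configuration {
          subnets         = var.private_subnet_ids
          security_groups = [var.task_security_group_id]
        }
      }
    }

The long-running, ALB-fronted services follow the same shape, with an aws_ecs_service pointing the task definition at a target group instead of an event rule.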

Perhaps the most interesting thing we did was related to deploying changes into these environments. One of the big gotchas with infrastructure as code is what happens when you want to roll something out. Most orchestration platforms accept imperative commands to update running workloads to new configurations or containers, but this is immediately in tension with the declarative guarantees that infra-as-code provides. We decided to kill two birds with one stone and made our deployments declarative as well: we’d write changes to a “hashes” file, and have the hashes in that file be consumed by our ECS task definitions in Terraform. This works great for manual deploys, but what if you want to enable continuous integration? We wrote a little Git-aware bot (cleverly named “bumpahashabot”) that does this automatically when invoked by a container-building utility such as CodeBuild. This allows our deployments to be as simple as filing pull requests from main to staging and then to production.
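
As a hypothetical sketch of the general pattern (the file name, format, and tag values below are illustrative, not our actual setup), imagine a JSON file of image tags committed to the repo and read by Terraform when building a task definition; a deploy then becomes nothing more than a commit that changes that file:

    # hashes.json (illustrative):
    #   { "api": "sha-1a2b3c4" }
    #
    # All names and the registry URL below are placeholders.
    variable "execution_role_arn" { type = string }

    locals {
      image_hashes = jsondecode(file("${path.module}/hashes.json"))
      ecr_repo_url = "123456789012.dkr.ecr.us-west-2.amazonaws.com"
    }

    resource "aws_ecs_task_definition" "api" {
      family                   = "api"
      requires_compatibilities = ["FARGATE"]
      network_mode             = "awsvpc"
      cpu                      = 1024
      memory                   = 2048
      execution_role_arn       = var.execution_role_arn

      container_definitions = jsonencode([{
        name      = "api"
        # A CI job (or a bot like bumpahashabot) rewrites hashes.json with the
        # freshly built tag; the next terraform apply rolls out a new revision.
        image     = "${local.ecr_repo_url}/api:${local.image_hashes["api"]}"
        essential = true
        portMappings = [{ containerPort = 8080, protocol = "tcp" }]
      }])
    }

Because the tag flows through Terraform, a deploy shows up in terraform plan like any other infrastructure change, which is exactly the declarative property we were after.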

The Devil Is In The Details

(Relevant xkcd: https://xkcd.com/908/)

With the finish line clearly drawn, now we had to figure out how to get there. Immediately, we decided to mark specific things as out of scope. First on the chopping block were things like a Google Cloud Storage to S3 migration; object storage could be moved over later, and our priority was to get our critical workloads moved over. Other things that didn’t make the cut included our GitHub pre-merge checks, which were running on Google Cloud Build, and our data platform, which, as mentioned earlier, remained on BigQuery. Everything else had to be shunted over; in particular, our running workloads and the data that backed them. Workloads could be spun up in parallel in both environments, which was one of the first steps we took, but when we started to chart out our data migration, things started to get hairy.

Our initial hope was to spin up AWS PostgreSQL databases and Redis caches as read-replicas of their counterparts in GCP. This proved to be impossible, because the GCP-managed databases and caches were not set up for cross-cloud replication. That left us with an unpalatable plan B: shut down all traffic to GCP, take manual exports of the data, shift them across cloud providers into AWS, and then restore the data into RDS and Elasticache. We spent a great deal of time investigating other options, but ultimately this was the way it had to be. Obviously, no one wants to be administering and maintaining their own databases in the cloud, but when you go with a managed solution, this is the kind of lock-in that you might end up fighting with at a future date.

With the details ironed out, all that was left to do was actually take the downtime and perform the migration, which we successfully did in early February 2021!

The Aftermath


Overall, the migration went very well! We did ship out one bug as a result of the migration — a misconfigured environment variable that we caught within a few hours and corrected — but otherwise things went very smoothly. As a result of the migration, even if you disregard the AWS credits we received, our monthly cloud bill was sawed in half. And while AWS (of course!) had some outages last year, they didn’t affect us too badly, and we appreciated the fact that our once-a-quarter fires due to unexpected database maintenance were eliminated entirely.

Our infrastructure-as-code setup is an enormous step up in terms of ergonomics, especially when compared to our old “death by a thousand clicks” approach. While Terraform has its quirks, and entirely new infrastructure features usually take a few attempts to get right in our dev environment, the complexity and scariness of infrastructure changes go down tremendously when you can inspect every single change before applying it.

We’ve since, as promised, migrated our cloud storage infrastructure over to AWS as well, and have zero complaints on that front. Our data platform remains in GCP, since BigQuery remains one of the better “serverless” data platforms / warehouses out there, but we are looking to migrate it over in the near future — stay tuned for another blog post on that topic!

Overall, if I were to summarize the lessons learned in this saga, they would be:

  1. Sometimes, the grass actually is greener on the other side! But it takes a great deal of care to make sure you don’t talk yourself into this conclusion.
  2. Vetting technical vendors thoroughly is critical. I cannot overstate this.
  3. Infrastructure-as-code is your friend, quirks and all. Embrace it when you have the chance! You won’t regret it. Terraform in particular does a pretty solid job balancing ergonomics with complexity, even if certain resources are more difficult to manage than others.
  4. Bias towards providers whose aim it is to minimize the number of times you change something for their benefit, rather than yours. In tech, you should focus on where your comparative advantage lies, not in rewriting things to continue to work on a provider’s systems.
  5. Try to avoid vendor lock-in, with the caveat that this may be in tension with some of the bullet points above. As a practical example, if you embrace infrastructure-as-code, you will ultimately have to write some code that is tied directly to your cloud provider. However, there are ways to abstract it such that you decouple the interface from the implementation, and if you do need to switch things up, you have the ability to do so relatively elegantly.

Thanks for reading! If you are at all interested in the kinds of things covered in this blog post, or want to work in a highly complex and meaningful space, Curai is hiring for multiple roles across multiple teams! Check out our careers page here, or get in touch with me via email at viggy@curai.com.
