One of Curai’s core products is First Opinion (now Curai Health), a chat application where users can connect with medical providers to access health information and address their primary care needs. Within our chat app, messages are exchanged between users and providers, and many of those messages require calls out to ML services. Our legacy codebase handled much of this time-consuming, CPU-intensive work via a job queue system. That system was a classic example of high-complexity, low-reliability architecture. As we ported our codebase from Python 2 to 3, we saw an opportunity to replace the jobs and move toward a simpler, more maintainable architecture better suited to our product needs. For anyone weighing whether they need a job queue, we hope this post serves as an instructive example.
The Legacy Architecture
In our legacy architecture, messages were sent from the frontend to the backend over a WebSocket connection. Our backend asynchronously forwarded pertinent payloads to Amazon’s SQS, which queued up the jobs in FIFO order. Some (but not all) of the queued jobs made additional calls out to separate machine learning services.
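To make the indirection concrete, here is a rough sketch of what the enqueue step might have looked like, assuming a boto3 client; the queue URL, job type, and payload shape are hypothetical placeholders rather than our actual schema.

```python
# Hypothetical sketch of the legacy enqueue step. boto3 is the real AWS SDK;
# the queue URL and payload shape are invented for illustration.
import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/chat-jobs.fifo"

def enqueue_job(job_type: str, payload: dict) -> None:
    """Push a chat-related job onto the FIFO queue for a consumer to pick up."""
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"type": job_type, "payload": payload}),
        # FIFO queues require a group ID; ordering is preserved per group.
        # Assumes content-based deduplication is enabled on the queue.
        MessageGroupId="chat-jobs",
    )
```

A separate consumer process then polled the queue, performed the work (sometimes calling out to an ML service), and pushed results back to the chat server, which is where the indirection piled up.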
This system resulted in a number of product-impacting issues. It had too much indirection for our current product needs, and the complexity did not carry the added benefit of preparing us for future scale. Adding a new message type required engineers to modify five separate files and familiarize themselves with SQS tooling, and even once that work was done, flaky performance was hard to test and debug.
Requirements and considerations
As we set out to build something better, we drew up a set of requirements that any replacement had to satisfy.
In deciding our path forward, deprecating the queue jobs completely was not a foregone conclusion. Many of the disadvantages we’ve enumerated thus far were a result of our particular implementation; these drawbacks are certainly not endemic to queues or to SQS itself. It is also worth noting that when we look across the tech ecosystem at how large-scale chat applications are built, job queues are a common pattern. Notably, as of 2017 Slack’s job queue system was processing 1.4 billion jobs daily; for Slack, the job queue is an integral part of the architecture, enabling reliable service to more than 10 million daily active users. For comparison, in 2019 First Opinion saw an average of 14.7k messages sent per day. With Slack and other companies (such as Quora) using message queues to great effect, why then would we forgo our job queue completely? What it came down to was identifying the engineering solution best fitted to our current business requirements. Our choice here was a nod to the agile and extreme programming idea to “Do the simplest thing that could possibly work.”
The path forward
We satisfied all our requirements by integrating Flask-SocketIO into our stack. Flask-SocketIO gives our Flask server access to Socket.IO, which provides real-time, bidirectional client-server communication. In terms of implementation details, we started with about 20 jobs to replace. We migrated them incrementally, removing the jobs one by one and tearing out the old infrastructure at the very end.
With each job migration, we had a decision to make: would we run the task synchronously or asynchronously? As a guiding principle, we handled tasks that relied on results from ML models asynchronously and everything else synchronously. For example, sending a wait-time message to the user could execute synchronously on the parent thread of execution. We used SocketIO’s emit() function to send messages between our front and back ends. Many tasks that once navigated a complex loop (chat server to SQS, down to a queue consumer, and back to the chat server again) are now handled quite simply, directly between the frontend and chat server! To avoid delays on more time-consuming tasks, such as calls out to ML models, we leveraged SocketIO’s start_background_task() function, which spawns a background green thread within our server so that other work is not blocked during execution.
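Here is a minimal sketch of that split, assuming a Flask-SocketIO server; the event names, payloads, and the call_ml_service() helper are hypothetical stand-ins rather than our production code.

```python
# Minimal sketch of the sync/async split described above. Event names,
# payload shapes, and call_ml_service() are hypothetical placeholders.
from flask import Flask
from flask_socketio import SocketIO, emit

app = Flask(__name__)
socketio = SocketIO(app)

def call_ml_service(payload):
    """Hypothetical client for one of our ML services."""
    return {"message_id": payload.get("message_id"), "label": "triage"}

@socketio.on("message_sent")
def handle_message_sent(payload):
    # Fast work (e.g., a wait-time message) runs synchronously,
    # straight back to the connected client.
    emit("wait_time", {"minutes": 5})

    # Time-consuming work (e.g., an ML model call) runs in a
    # background green thread so the handler returns immediately.
    socketio.start_background_task(run_ml_task, payload)

def run_ml_task(payload):
    result = call_ml_service(payload)
    # Outside a request context we emit via the SocketIO instance
    # rather than the context-bound emit().
    socketio.emit("ml_result", result)

if __name__ == "__main__":
    socketio.run(app)
```

The key design point is that the event handler itself never blocks on the model call; the result is pushed to the client over the same socket whenever it is ready.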
The pros and cons of the path we chose
Investing engineering time to port a legacy system is never a given. Because we had to make changes for Python 3 compatibility anyway, and because our entire team had consistently felt the pain of maintaining the legacy system, we felt a few days of investment here was worth it. Replacing our job queue increased the percentage of tasks handled synchronously, which carried with it the risk of increasing latency for some message deliveries. On the other hand, without the queue jobs our architecture is much simpler, and being able to spawn green threads for the specifically time-consuming tasks has its own latency-reducing benefits. In summary, we feel the pros outweigh the cons: we chose to refrain from over-engineering and to architect thoughtfully for our current scale. For fans of pro/con lists, the trade-offs boil down to this. Pros: a much simpler architecture, and green threads that keep time-consuming tasks from blocking. Cons: more tasks handled synchronously, with the attendant risk of added latency on some message deliveries.
Results and a look to the future
As our application scales and as we introduce more ML services to augment our doctors’ work, we can imagine integrating PubSub or a similar service so that our frontend can speak to microservices directly. For now, with basic SocketIO tooling, we are handling events more reliably and with faster speeds for the end user.
In short, with simpler, less fancy infrastructure we now have better performance. While chat applications handling much larger volumes may benefit from the job queue pattern, we learned that by deprecating our job queue we could vastly simplify a core piece of our codebase while serving our users better. Our work here was a lesson in how the shiniest or most “industry standard” solution is not always the best one for your current needs, and in how engineering simplicity really does have positive impacts for engineers and users alike.
If Curai’s engineering work interests you, please check out our jobs page and don’t hesitate to say hi 👋. To continue the conversation, you can also find me on Twitter.
Thanks to Matt Willian for collaboration and mentorship on this project.