At Curai we accept interns on all our different teams all year round. For obvious reasons, we have gone to a fully remote model over the past two years, and we expect it to stay that way at least for now. We have always been fortunate to attract amazing interns every year (see here or here for some past examples), and being fully remote has even enabled us to access a larger pool of great candidates. So, this year we again had a great class of interns whose work is worth sharing!
This post focuses on the work done by the wonderful machine learning engineering interns we had over the past year. We will update this blog with work by interns in other teams in the coming months. Here we include work done by Abdullah Ahmed, currently a medical student at Brown, Costa Huang, a computer science Ph.D. student at Drexel, Cecilia Li, a graduate student at UC Santa Cruz, and Poojan Palwai, a graduate student at Carnegie Mellon.
What follows is an edited short version of their work in their own words. Enjoy reading about their work, and please take a look at our website if you are interested in helping us with our mission by doing an internship or taking on any of our other positions. We pride ourselves on providing our interns with meaningful projects and exposing them to some of the most relevant tools of the trade including PyTorch, AWS, and research at the cutting edge of natural language understanding for healthcare data.
My name is Abdullah Ahmed, and I’m currently a second-year medical student at Brown (go Brunos!) with a background in AI/ML. The goal of my internship was to combine my clinical and ML knowledge to improve Curai’s Entity Linking Module (CELM).
Let’s take a step back. Entity Linking (EL) is the NLP task of extracting entities from free text and mapping them to concepts in a Knowledge Base (KB). CELM was built to perform this task specifically for concepts in Curai’s proprietary KB.
To give a concrete example, suppose a patient logs onto the Curai platform and sends the following message:
I have had pain while peeing for the past five days. Also, I noticed blood in my pee last night. What’s wrong with me, doc?
CELM takes in this text and spits out two IDs that correspond to medical concepts in Curai’s KB:
1. CID0072 → Dysuria (pain on urination)
2. CID00189 → Hematuria (blood in urine)
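To make the input/output contract concrete, here is a toy sketch of entity linking; the dictionary lookup below is a deliberately naive stand-in for CELM’s actual extraction and KB-mapping logic, and only the concept IDs come from the example above.

```python
# Toy illustration of entity linking: map phrases in free text to KB concepts.
# The synonym table and string matching are simplified stand-ins for CELM.
TOY_KB = {
    "pain while peeing": ("CID0072", "Dysuria"),
    "blood in my pee": ("CID00189", "Hematuria"),
}

def link_entities(text: str):
    text = text.lower()
    return [concept for phrase, concept in TOY_KB.items() if phrase in text]

message = ("I have had pain while peeing for the past five days. "
           "Also, I noticed blood in my pee last night. What's wrong with me, doc?")
print(link_entities(message))
# [('CID0072', 'Dysuria'), ('CID00189', 'Hematuria')]
```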
Before I made any improvements to CELM, I first had to settle on a way to measure success. The standard metrics for generic EL tasks are precision, recall, and F1. However, given that CELM has a direct impact on the patients who use the application, I wanted to create a metric that was more closely linked to clinical outcomes.
The intuition behind the metric I created is that CELM should be penalized more heavily for failing to extract emergent symptoms. For instance, it is a much bigger problem if CELM misses “chest pain” than if it misses “runny nose,” as the former is a symptom associated with serious cardiac complications like a myocardial infarction (i.e. heart attack).
I used metadata from our KB to calculate this new, weighted, clinically-relevant metric.
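One simple way to encode this intuition is a severity-weighted recall; a rough sketch follows, where the weights and symptom names are illustrative placeholders (in practice the weights come from our KB metadata).

```python
# Illustrative severity-weighted recall: missing a high-severity concept
# (e.g. chest pain) costs more than missing a low-severity one.
# The weights here are made up for illustration.
def weighted_recall(gold_concepts, predicted_concepts, severity_weights):
    total = sum(severity_weights.get(c, 1.0) for c in gold_concepts)
    found = sum(severity_weights.get(c, 1.0)
                for c in gold_concepts if c in predicted_concepts)
    return found / total if total else 1.0

weights = {"chest pain": 5.0, "runny nose": 1.0}  # hypothetical severities
gold = {"chest pain", "runny nose"}
print(weighted_recall(gold, {"runny nose"}, weights))   # ~0.17, missed chest pain
print(weighted_recall(gold, {"chest pain"}, weights))   # ~0.83, missed runny nose
```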
With this new metric, I was able to see the impact of changes I made to CELM. For instance, we showed that adding a medical spellchecker to the preprocessing step improved the overall performance of our system.
Overall, my experience at Curai Health was fantastic! I’m incredibly grateful to my mentor, Niraj, and my managers, François and Namit, for curating a project that perfectly complemented my dual background and helping me see it through. I had the privilege of working with some incredible folks here in both the engineering and clinical departments. It was certainly a challenge to balance my work at Curai with my studies in medical school, but in retrospect it was totally worth it! Highly recommend to anyone interested 😁
Hello, my name is Costa. I am a third-year computer science Ph.D. student at Drexel University focusing on deep reinforcement learning. During my internship, I had the amazing opportunity to join Curai as an ML engineering intern. Over the course of a few months, I 1) implemented an experiment management solution that was adopted into Curai’s ML workflow, under my ML engineering mentor Luladay Price, and 2) led the design of a prototype project to generate medical conversations by leveraging DialoGPT, under my ML research mentor Namit Katariya.
My main project (80% of my time) was to implement an experiment management solution. Prior to this project, the workflow relied primarily on custom scripts to track experiments. This was problematic because experiment dependencies or datasets were sometimes not logged, which made experiments challenging to reproduce.
My first step was to analyze 4 years of ML experiment needs. My amazing team members were especially helpful in creating a document listing the pain points of Curai’s ML projects. To summarize, we wanted our experiment management solution to have the following features:
1. Experiment tracking, analysis, and dataset versioning
2. Experiment orchestration and model deployment
That’s a lot of features to maintain if we were to build everything in-house, so a more viable approach was to look for vendors that provide these services. My next step was to survey experiment management solution providers. In total, I looked at over 15 state-of-the-art solutions, including Weights and Biases (W&B), Neptune, Comet.ml, SageMaker, Valohai, Verta, Aim, Kubeflow, ClearML, Polyaxon, Sacred + Omniboard, MLflow, DVC, Guild AI, Pachyderm, and Deepkit-ml. After reading their docs and building prototype projects with them, I found that most solutions cover experiment tracking well, but W&B really shines in its analysis features and data versioning, so I recommended it to my team.
My team agreed with my assessment, and I helped lead the technical discussion with W&B to set up a trial. To help this project take root in the team’s workflow, I wrote tutorials to help everyone get on board and set up 1:1s with ML product owners to help them track experiments using W&B.
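For readers unfamiliar with W&B, a typical tracked run looks roughly like the sketch below; the project name, config values, metrics, and file path are placeholders rather than anything from our actual setup.

```python
# Minimal sketch of experiment tracking and dataset versioning with W&B.
# Project name, config, metrics, and file path are placeholders.
import wandb

run = wandb.init(project="my-project", config={
    "classifier": "logistic_regression",
    "learning_rate": 0.01,
})

for epoch in range(5):
    # ... train and evaluate the model here ...
    wandb.log({"epoch": epoch, "val_accuracy": 0.70 + 0.02 * epoch})

# Log the training data as an artifact so this run records exactly which
# dataset version it used (assumes train.csv exists locally).
artifact = wandb.Artifact("training-data", type="dataset")
artifact.add_file("train.csv")
run.log_artifact(artifact)
run.finish()
```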
In particular, my mentor Luladay worked with me to adopt W&B in the is_diagnosable routing model project. This model looks at the patient’s chief complaints and determines whether we should send the patient automated history-taking questions. Using W&B made it much easier to track experiments on dataset feature ablation and classifier comparison. We could also visualize the effects and dependencies of the experiments, as shown below:
My side project (20% of my time) was to build a medical conversation chatbot by fine-tuning DialoGPT on our custom dataset. The end result looks like the following, where my input starts with “>>>” and the simulator replies:
As you can see, the simulator is surprisingly good at the beginning of the conversation, asking relevant questions about my symptoms. However, as the conversation goes on, the simulator usually degenerates and starts asking repeated questions. This is a pretty interesting project that I can see having actual product applications. For example, it could power an auto-completion feature for doctors as they type, potentially saving them time on predictable questions. For future work, we would like to train our models on a much larger dataset and experiment with newer language models to improve performance.
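For reference, interacting with a DialoGPT-style model through Hugging Face transformers looks roughly like the sketch below; it uses the public microsoft/DialoGPT-medium checkpoint, whereas our simulator was fine-tuned on Curai’s own conversation data.

```python
# Sketch of an interactive loop with DialoGPT via Hugging Face transformers.
# Uses the public checkpoint; the actual simulator was fine-tuned in-house.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

chat_history_ids = None
for _ in range(3):
    user_input = input(">>> ")
    new_ids = tokenizer.encode(user_input + tokenizer.eos_token,
                               return_tensors="pt")
    # Append the new user turn to the running conversation history.
    bot_input_ids = (torch.cat([chat_history_ids, new_ids], dim=-1)
                     if chat_history_ids is not None else new_ids)
    chat_history_ids = model.generate(bot_input_ids, max_length=1000,
                                      pad_token_id=tokenizer.eos_token_id)
    reply = tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0],
                             skip_special_tokens=True)
    print(reply)
```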
Overall, I feel really grateful for this awesome internship experience. The internship projects were tailored perfectly for me, and I really enjoyed working with everyone. My personal favorites were the company events, such as trivia and cheese-making lessons, which added a lot of color to my experience. This experience allowed me to peek into how things are done in industry, and will ultimately help me evaluate going into industry vs. academia after my graduation. My big thanks to everyone on the team for making this experience so great.
Hello, my name is Cecilia Li. I am a graduate student in UC Santa Cruz’s M.S. program in natural language processing. Under the mentorship of Niraj Kumar Singh, I had the opportunity to put into practice some of the principles I have been learning in my graduate program by implementing a writing assistant for Curai’s custom Electronic Health Record.
Curai’s mission to scale high-quality healthcare translates into several different technical goals. One of these goals is to improve Electronic Health Record (EHR) effectiveness and usability. To achieve this, we would like to help doctors save time by reducing how much repetitive or similar information they have to type, pre-filling or auto-completing it with an algorithm instead.
For my internship, I was in charge of building such a feature for our EHR system. The feature pre-fills the first part of a History of Present Illness (HPI) note, which contains the patient’s demographic information. The suggestion appears as faded-out gray text and can be selected with a simple tab keystroke. The auto-complete feature also makes predictions, similar to Google’s Smart Compose, as providers type medical notes. Again, doctors select the grayed-out predicted text with a tab keystroke (as shown in the following image).
To explore and develop the pre-fill feature, we needed to construct a dataset on which we could train and evaluate our models. We also needed to do feature engineering, using domain knowledge to select the features required for pre-fill. For example, most HPI notes start with the patient’s demographic information (gender and age) followed by the findings charted by the provider, while other features such as the episode of care were not required. We then needed to define metrics to evaluate different techniques and models. Last but not least, we needed to design the techniques and models that would power the writing assistant. The next section provides details on these steps.
Dataset: We created a dataset that included the patient’s age, gender, the HPI notes written by providers and the findings charted by the providers during their chat with the patient. This dataset was created using a SQL query by merging different tables in our data warehouse (BigQuery).
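As a rough idea of what this looks like, the snippet below pulls such a dataset out of BigQuery with the Python client; the project, table, and column names are invented for illustration.

```python
# Rough sketch of assembling the pre-fill dataset from BigQuery.
# Project, dataset, table, and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()
query = """
SELECT p.age, p.gender, n.hpi_text, f.findings
FROM `project.warehouse.patients` AS p
JOIN `project.warehouse.hpi_notes` AS n ON n.patient_id = p.id
JOIN `project.warehouse.findings` AS f ON f.encounter_id = n.encounter_id
"""
df = client.query(query).to_dataframe()
df.to_csv("hpi_prefill_dataset.csv", index=False)
```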
Metrics: To evaluate a model, we used two metrics: standard classification accuracy (checking whether the pre-fill prediction matched the ground truth) and a custom scoring function assessing the quality of the pre-fill (longer correct predictions are generally preferred over shorter ones).
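One simple way to reward longer correct completions is to credit the length of the matching prefix; the sketch below is only illustrative of the idea, not the exact scoring function we used.

```python
# Illustrative pre-fill quality score: credit the fraction of the ground-truth
# text that the prediction gets right from the start, so longer correct
# completions score higher than shorter ones.
def prefill_score(prediction: str, ground_truth: str) -> float:
    matched = 0
    for p, g in zip(prediction, ground_truth):
        if p != g:
            break
        matched += 1
    return matched / len(ground_truth) if ground_truth else 0.0

print(prefill_score("45-year-old male presents with",
                    "45-year-old male presents with chest pain"))  # ~0.73
```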
Model: For the baseline we favored simplicity, using an ensemble of traditional bi-gram/n-gram and string-matching approaches, with the hope of exploring contemporary neural network-based approaches in the future.
Hyperparameter tuning: To maximize accuracy, we tuned the input prefix and prediction lengths, and modified the traditional n-gram approach to do partial-word prediction. Our best tuned ensemble model achieved an accuracy of 0.88.
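As an illustration of the n-gram idea, here is a minimal bigram next-word suggester; the toy notes and this single-component model are illustrative only and are much simpler than the tuned ensemble described above.

```python
# Minimal bigram-based next-word suggester, in the spirit of the baseline.
# The real system ensembles n-gram and string-match approaches and also
# predicts partial words; this toy version suggests whole words only.
from collections import Counter, defaultdict

def train_bigrams(notes):
    counts = defaultdict(Counter)
    for note in notes:
        tokens = note.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def suggest(counts, prefix, num_words=3):
    tokens = prefix.lower().split()
    prev, suggestion = (tokens[-1] if tokens else ""), []
    for _ in range(num_words):
        if not counts[prev]:
            break
        prev = counts[prev].most_common(1)[0][0]
        suggestion.append(prev)
    return " ".join(suggestion)

notes = ["patient presents with chest pain for two days",
         "patient presents with shortness of breath"]
model = train_bigrams(notes)
print(suggest(model, "Patient presents"))  # "with chest pain"
```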
What was really interesting to me in this process was being able to explore and understand Curai’s data, and then come up with a dataset applicable to this task. The challenges along the way helped me realize how much customization a model can have in a real-life setting; there really is no limit on what we can try and test out. I really appreciate Curai giving me this opportunity to contribute to a great community with a very important mission, and giving me the freedom to do exploratory work like this. The work environment at Curai is very enjoyable, as everyone is easy to talk to and communicate with. With all the help and support I received, the internship was a truly enjoyable and productive experience.
My name is Poojan Palwai. I am a machine learning master’s student at Carnegie Mellon University. During my internship at Curai, under the guidance of my mentor, Luladay Price, I was able to dive deep into a critical machine learning component in Curai Health’s provider-facing app: the Question Serving Panel.
The Question Serving Panel is a feature in the Curai app that provides a way for physicians to send structured questions to patients, and select diagnoses to filter these questions. They can accept AI-suggested questions and diagnoses, or use a search field that exposes Curai’s proprietary medical knowledge base (Question Serving Panel Search) to search for other questions and diagnoses. My internship focused on improving the Question Serving Panel Search feature — specifically (1) creating a debugging notebook, (2) reducing the feature latency, and (3) curating a dataset and metrics to evaluate the feature’s performance.
The first part of my internship was to implement an offline debugging notebook for the backend of the Question Serving Panel. The goal of the notebook was to have an offline way to understand why a particular query returned certain results, and to test out different scenarios. I implemented three commands for the notebook:
1. “query”, which returns the unfiltered Question Serving Panel search results for a query
2. “verbose”, which returns the unfiltered search results for a query along with the concept id, score, and concept synonyms for the top n KB concepts
3. “contains”, which checks whether a concept is contained in the results of a Question Serving Panel search
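A rough sketch of what these commands can look like is shown below; SearchResult, panel_search, and the toy index are hypothetical stand-ins for the internal Question Serving Panel search API.

```python
# Sketch of the three debugging commands against a toy in-memory index.
# SearchResult, panel_search, and their fields are hypothetical stand-ins
# for the internal Question Serving Panel search API.
from dataclasses import dataclass, field
from typing import List

@dataclass
class SearchResult:
    concept_id: str
    score: float
    synonyms: List[str] = field(default_factory=list)

def panel_search(text: str) -> List[SearchResult]:
    # Placeholder for the real knowledge-base search.
    toy_index = [SearchResult("C001", 0.91, ["dysuria", "painful urination"]),
                 SearchResult("C002", 0.64, ["hematuria", "blood in urine"])]
    return [r for r in toy_index if any(text.lower() in s for s in r.synonyms)]

def query(text):
    """'query' command: unfiltered search results for a query."""
    return panel_search(text)

def verbose(text, n=5):
    """'verbose' command: id, score, and synonyms for the top-n concepts."""
    return [(r.concept_id, r.score, r.synonyms) for r in panel_search(text)[:n]]

def contains(text, concept_id):
    """'contains' command: is the concept in the results for this query?"""
    return any(r.concept_id == concept_id for r in panel_search(text))

print(verbose("urination"))
print(contains("urination", "C001"))  # True
```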
The second part of my internship focused on reducing the latency of the feature. Providers had reported a lag in the display of search results that was leading to a poor user experience, and my task was to investigate and reduce this latency. After reviewing how the feature works, I discovered that it makes two separate calls to our internal Knowledge Base to help compile a list of relevant medical concepts. I was able to consolidate those two calls into a single request and filter the results locally. I measured the latency difference by checking the mean running time of 20 queries, and found that the new function averaged 32% lower execution time than the old function across all 20 queries.
Old Function Example:
allergies m mean time: 3.74s, std time: 0.36s
allergies me mean time: 3.44s, std time: 0.43s
allergies med mean time: 3.32s, std time: 0.49s
New Function Example:
allergies m mean time: 2.54s, std time: 0.37s
allergies me mean time: 2.25s, std time: 0.41s
allergies med mean time: 2.20s, std time: 0.42s
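For reference, a comparison like the one above can be produced with a small timing harness along these lines; the stand-in search function and run counts below are placeholders.

```python
# Sketch of the latency measurement: time each query repeatedly and report
# the mean and standard deviation. `search_fn` is a stand-in for the old or
# new Question Serving Panel search function.
import statistics
import time

def benchmark(search_fn, query, runs=20):
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        search_fn(query)
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), statistics.stdev(timings)

for q in ["allergies m", "allergies me", "allergies med"]:
    mean_s, std_s = benchmark(lambda text: None, q)  # stand-in search function
    print(f"{q} mean time: {mean_s:.2f}s, std time: {std_s:.2f}s")
```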
The final part of my internship was to introduce a dataset and metric to measure the accuracy of Question Serving Panel Search. Before improving search results, we needed to define metrics to measure improvements and a dataset we could use to calculate those metrics. The proposed dataset associates a query with an ordered list of questions/diagnoses, where each question/diagnosis is assigned a relevance rank of 1, 2, or 3. The question and diagnosis bank was compiled offline by taking the 100 most frequent questions/diagnoses used by physicians in online encounters where the query was issued. The proposed metric to be used with this dataset is normalized discounted cumulative gain (nDCG), a score between 0 and 1 that takes into account both the relevance and the ranking of the returned items, rewarding result lists that place the most relevant items highest.
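For readers unfamiliar with nDCG, it can be computed for a single query as in the sketch below; this is just the standard formula (with higher grades meaning more relevant), not Curai-specific code.

```python
# Standard nDCG for one query: `relevances` are graded relevance scores for
# the returned items, in ranked order (higher = more relevant).
import math

def dcg(relevances):
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(ndcg([3, 3, 2, 1]))  # 1.0   -- ideal ordering
print(ndcg([1, 3, 3, 2]))  # ~0.83 -- relevant items pushed down the list
```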
Overall, the internship was a great experience. I gained experience analyzing log files and spotting general trends to see where the system was doing well and where it needed improvement. I also learned the process of writing design documents that explained my thought process to the entire automation team. I loved my time at Curai and found it a great place to do an internship because of all the support.