Solugen is a biotech startup that produces industrial chemicals from sources other than petroleum. Solugen was using a single EC2 instance to run large high-performance computing (HPC) workloads with RosettaCommons, a software suite for computational modeling and analysis of protein structures. A single job would take weeks to complete. They needed a solution that was more time- and resource-efficient.
Solugen uses a software platform called RosettaCommons to model protein folding behavior. This analysis is core to their research, and this research is core to their business. Because the software is complex and computationally demanding, scaling the business with traditional computing methods would have meant considerable investment in individual on-premises or cloud machines without yielding a scalable solution: additional jobs would have meant additional machines. They needed a creative way to scale the application up for larger jobs only when that computational power was needed, in a reasonable amount of time and at an affordable cost.
Cloud303's engagements follow a streamlined five-phase lifecycle: Requirements, Design, Implementation, Testing, and Maintenance. Initially, a comprehensive assessment is conducted through a Well-Architected Review to identify client needs. This is followed by a scoping call to fine-tune the architectural design, upon which a Statement of Work (SoW) is agreed and signed.
The implementation phase kicks in next, closely adhering to the approved designs. Rigorous testing ensures that all components meet the client's specifications and industry standards. Finally, clients have the option to either manage the deployed solutions themselves or to enroll in Cloud303's Managed Services for ongoing maintenance, an option many choose due to their high satisfaction with the services provided.
Optimizing with AWS Batch
Cloud303 chose AWS Batch for Solugen's HPC workload because the customer would pay only for the resources actually used.
To start, a templatized Rosetta environment was created using Docker containers so the jobs could seamlessly scale. Then, multiple compute environments were deployed (for testing as well as production jobs). To optimize costs, S3 buckets were used to house data. To give faster access to storage and ensure data did not leave Solugen’s VPC, VPC endpoints were created.
One goal was to simplify Solugen’s experience as much as possible, so the data pipeline starts with the upload of an input file to S3. That file contains all the relevant instructions. The runtime will download the file, read the instructions and start the job based on those instructions. It is also possible to use the environment to spin up a single server (which picks up the job from S3, runs the job, then uploads the output artifact back to S3).
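The download, read, run, upload flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Cloud303's actual runtime: the case study does not describe the instruction file's schema, so the JSON keys "executable", "flags", and "output", the "output/" key prefix, and the function names are all hypothetical. The only real API assumed is the boto3 S3 client's download_file and upload_file methods.

```python
import json
import subprocess
from pathlib import Path


def build_command(instructions):
    """Turn the parsed instruction file into an argv list.

    The keys "executable" and "flags" are hypothetical; the real
    instruction-file schema is not described in the case study.
    """
    cmd = [instructions["executable"]]
    # Sort the flags so the generated command is deterministic.
    for name, value in sorted(instructions.get("flags", {}).items()):
        cmd += [f"-{name}", str(value)]
    return cmd


def run_job(s3, bucket, input_key, workdir, runner=subprocess.run):
    """Download the instruction file, run the job, upload the output.

    `s3` is expected to behave like a boto3 S3 client, i.e. to provide
    download_file(bucket, key, filename) and
    upload_file(filename, bucket, key).
    """
    workdir = Path(workdir)
    local_input = workdir / "input.json"

    # 1. Fetch the instruction file that the user uploaded to S3.
    s3.download_file(bucket, input_key, str(local_input))
    instructions = json.loads(local_input.read_text())

    # 2. Start the job described by the instructions.
    runner(build_command(instructions), cwd=workdir, check=True)

    # 3. Push the output artifact back to S3 (hypothetical key layout).
    output_name = instructions["output"]
    output_key = f"output/{output_name}"
    s3.upload_file(str(workdir / output_name), bucket, output_key)
    return output_key
```

In a real container this might be called as run_job(boto3.client("s3"), bucket, key, "/work"), with the bucket and key passed in through the job's environment. Injecting the client and the runner keeps the sketch testable without AWS access.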
OpenMPI Framework
The more elegant solution, however, and the one that truly changed Solugen’s workflow, was the parallel computing design. By leveraging the OpenMPI framework in the runtime environment, AWS Batch could spin up multiple nodes to process a single job: one instance assigned as the master and the rest as worker nodes. Each worker node would report its unique ID to the master, and once the master had enough nodes to run the submitted job, it would run the Rosetta script while OpenMPI managed the distribution of computation across the worker nodes.
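The master/worker handshake might look something like the sketch below. It assumes the job runs as an AWS Batch multi-node parallel job, which exposes the environment variables AWS_BATCH_JOB_NODE_INDEX, AWS_BATCH_JOB_MAIN_NODE_INDEX, and AWS_BATCH_JOB_NUM_NODES to each container; the case study does not confirm that this exact feature was used, nor how workers reported to the master. The function names, the slots-per-host value, and the Rosetta command are hypothetical. The hostfile format ("host slots=N") and the mpirun options --hostfile and -np are standard Open MPI usage.

```python
def node_role(env):
    """Decide whether this container is the master or a worker."""
    node = int(env["AWS_BATCH_JOB_NODE_INDEX"])
    main = int(env["AWS_BATCH_JOB_MAIN_NODE_INDEX"])
    return "master" if node == main else "worker"


def cluster_ready(registered_hosts, env):
    """True once every expected node has reported in.

    Assumes the master adds its own address to registered_hosts.
    """
    return len(set(registered_hosts)) >= int(env["AWS_BATCH_JOB_NUM_NODES"])


def hostfile_text(hosts, slots_per_host):
    """Render an Open MPI hostfile: one 'host slots=N' line per node."""
    return "".join(f"{host} slots={slots_per_host}\n" for host in sorted(set(hosts)))


def mpirun_command(hostfile_path, hosts, slots_per_host, rosetta_cmd):
    """Build the mpirun invocation the master would execute."""
    total_processes = len(set(hosts)) * slots_per_host
    return ["mpirun", "--hostfile", hostfile_path,
            "-np", str(total_processes), *rosetta_cmd]
```

On the master, the flow would be: collect worker addresses until cluster_ready returns True, write hostfile_text to disk, then execute the list returned by mpirun_command. Workers only need to report their address and wait for Open MPI to launch processes on them.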
Building an ephemeral, distributed workload like this does pose one significant challenge compared to a single server: storage. To solve that, Cloud303 incorporated an EFS file system to serve as common storage. The EFS share was mounted on every worker node as a local drive, so all artifacts produced by the cluster ended up in the same place when the nodes finished processing. The master node would then compile the artifacts into a deliverable and upload it to an S3 bucket.
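The final "compile the artifacts" step on the master node could be sketched as below. This is an illustration only: the case study does not say what form the deliverable takes, so the choice of a gzip-compressed tar archive, the function name, and any paths such as an EFS mount point are assumptions.

```python
import tarfile
from pathlib import Path


def bundle_artifacts(shared_dir, archive_path):
    """Pack every file under the shared (EFS-mounted) directory into
    one .tar.gz archive and return the relative paths that were added.

    The archive itself is skipped in case it lives inside shared_dir.
    """
    shared_dir = Path(shared_dir)
    archive_path = Path(archive_path)
    members = sorted(
        path for path in shared_dir.rglob("*")
        if path.is_file() and path.resolve() != archive_path.resolve()
    )
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in members:
            # Store paths relative to the share so the archive is portable.
            tar.add(path, arcname=path.relative_to(shared_dir).as_posix())
    return [path.relative_to(shared_dir).as_posix() for path in members]
```

Because every node writes to the same share, the master can build the archive without copying anything between instances; the resulting file would then be uploaded to S3, for example with the boto3 S3 client's upload_file method.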
Working with Cloud303 and leveraging AWS resources has been a game-changer for us. The computational bottleneck was a major hurdle to our growth. Thanks to the new infrastructure, we're not only performing more experiments at a fraction of the time, but we're also focusing on what we do best: innovating in biotechnology.
The challenge with Solugen was fascinating because it wasn't just about offloading computational work to the cloud. It was about doing so efficiently, economically, and securely, all while keeping their unique scientific requirements in mind. Utilizing AWS Batch along with OpenMPI really took their workflow to the next level.
Prior to this solution, Solugen was running Rosetta on a single EC2 instance, and jobs took about two weeks to complete. Thanks to the massive parallelization the new solution enables, a job that previously took two weeks now takes about two hours, a reduction of roughly 168-fold (about 336 hours down to about 2). This has been an enormous benefit to their business. In September 2021, Solugen completed a US$357 million Series C round.