by Michael Person and Carter Brown
At Expedition (EXP), we develop state-of-the-art systems for our customers that leverage cutting-edge Machine Learning algorithms. While our solutions often overlap with published research, each program requires algorithmic customization, exploration, and experimental hypothesis verification. It was clear early on that, to meet our customers’ requirements, we needed a way to quickly perform experiments and obtain trusted, repeatable results. Unfortunately, modern Deep Learning models frequently require more memory than a typical laptop GPU provides.
As a “cloud-first” company, EXP explores deep learning architectures and parameters with broad sets of experiments scaled on AWS resources – experiments that would be difficult and costly to perform without the scaling capabilities of the cloud. Most of our broad experimentation is performed on AWS, with the remainder on a DGX-class server on site.
Compute and Jenkins
We currently use Jenkins, an open-source automation tool, to interface with AWS EC2 instances for our experiments. Our Jenkins-based experiment runner was developed for IC3D, a point cloud processing program, in late 2018 and has since evolved into a template that many subsequent projects have adopted and improved. As our machine learning automation has increasingly adopted containers, we’re currently evaluating alternatives to Jenkins for this role.
While Jenkins is primarily used for automating builds and Continuous Integration tests, its flexibility allowed us to connect it with other cloud services to provide an extensible experiment-running platform. We use the Jenkins EC2 plugin to spin up an arbitrary number of GPU-enabled nodes to run experiments using specific feature branches from Git. This way, each developer can test their experiment independently of other developers’ code. Each experiment can be modified with a custom configuration specifying hyperparameter values, data splits, and fine-tuning checkpoints. These configurations can be set individually or via configuration files stored in S3 or Git. After cloning a feature branch, data stored in AWS S3 is downloaded and mounted into a Docker container for the experiment. The Docker images used to run the experiments are built from base images stored either in public Docker Hub repos or in our own Elastic Container Registry (ECR) repos, and they include packages from the public PyPI server as well as proprietary packages from our internal PyPI server. Throughout training, hooks save TensorBoard summaries and model checkpoints to S3 and log relevant hyperparameters and evaluation metrics in Elasticsearch. Once the model has converged, the experiment’s results and metrics are logged and the EC2 node is terminated.
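To make the training-side hooks concrete, here is a minimal sketch of what a post-epoch hook along these lines could look like, assuming boto3 and the official Elasticsearch Python client; the bucket, index, and endpoint names are placeholders rather than our production values.

```python
# Illustrative sketch only; bucket, index, and endpoint names are placeholders.
import os
import boto3
from elasticsearch import Elasticsearch

s3 = boto3.client("s3")
es = Elasticsearch("http://elasticsearch.internal:9200")  # placeholder endpoint


def on_epoch_end(experiment_id, epoch, checkpoint_path, summary_dir, metrics):
    """Post-epoch hook: persist artifacts to S3 and metrics to Elasticsearch."""
    # Persist the checkpoint so a run can be resumed or audited after the node terminates.
    s3.upload_file(checkpoint_path, "exp-experiment-artifacts",
                   f"{experiment_id}/checkpoints/epoch-{epoch}.ckpt")

    # Mirror the TensorBoard event files so developers can follow learning curves
    # from S3 while the EC2 node is still training.
    for root, _, files in os.walk(summary_dir):
        for name in files:
            local_path = os.path.join(root, name)
            key = os.path.relpath(local_path, summary_dir)
            s3.upload_file(local_path, "exp-experiment-artifacts",
                           f"{experiment_id}/tensorboard/{key}")

    # Index hyperparameters and evaluation metrics for Kibana dashboards.
    es.index(index="ml-experiments", document={
        "experiment_id": experiment_id,
        "epoch": epoch,
        **metrics,
    })
```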
Experiment Scaling and Tracking
Version-controlled experiments and configuration files, coupled with robust data provenance practices such as the file versioning features in S3, provide experiment reproducibility. The Jenkins EC2 plugin lets developers easily spin up simultaneous experiments to rapidly gain insight into the hypothesis under test. TensorBoard summaries in S3 give our developers real-time views into a model’s learning dynamics. We also build Kibana dashboards from queries of our experiments’ Elasticsearch data to showcase progress on a given dataset or model. Finally, we use carefully designed AWS IAM roles and CloudFormation templates to ensure our Jenkins nodes and cloud resources adhere to the Principle of Least Privilege, maintaining proper security.
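As a small illustration of how S3 versioning supports reproducibility, a run could pin its configuration to an explicit object version, roughly as sketched below, assuming boto3 and a versioned bucket; the bucket, key, and version ID here are hypothetical.

```python
# Illustrative sketch only; bucket, key, and version ID are hypothetical.
import json
import boto3

s3 = boto3.client("s3")


def load_pinned_config(bucket, key, version_id):
    # Requesting an explicit VersionId means a later re-run retrieves exactly
    # the same configuration, even if the object has since been overwritten.
    obj = s3.get_object(Bucket=bucket, Key=key, VersionId=version_id)
    return json.loads(obj["Body"].read())


config = load_pinned_config(
    bucket="exp-experiment-configs",      # hypothetical bucket
    key="ic3d/experiment-config.json",    # hypothetical config key
    version_id="EXAMPLEVERSIONID",        # placeholder S3 object version
)
```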
We also use AWS resource tags in our CloudFormation templates to ensure costs are billed to the correct project, since many different projects have active AWS resources at any given time. Our experiment runner gives developers a platform that abstracts away infrastructure details, allowing them to focus on the experimentation process.
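For example, project tags can be attached when a stack is created, as in this hedged boto3 sketch (the stack name, template file, and tag values are hypothetical); CloudFormation propagates stack-level tags to supported resources, so cost reports can then be filtered by project.

```python
# Illustrative sketch only; stack name, template, and tag values are hypothetical.
import boto3

cfn = boto3.client("cloudformation")

with open("experiment-runner.yaml") as f:   # hypothetical CloudFormation template
    template_body = f.read()

cfn.create_stack(
    StackName="ic3d-experiment-runner",     # hypothetical stack name
    TemplateBody=template_body,
    # Stack-level tags are propagated to supported resources in the stack,
    # which is what lets billing reports break costs out by project.
    Tags=[
        {"Key": "Project", "Value": "IC3D"},
        {"Key": "CostCenter", "Value": "ml-experiments"},
    ],
)
```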
Machine Learning is a rapidly advancing area of research, and the tools, workflows, and industry best practices surrounding these algorithms are constantly evolving. Experiment-running platforms, alongside tools such as Data Version Control and Continuous Integration and Deployment pipelines, are being tailored specifically for machine learning and form the foundation of modern MLOps practices, analogous to standard software DevSecOps practices. When our Jenkins-based experiment runner was originally developed, it was more capable and flexible than the available alternatives. As more MLOps-centric tools have been released and improved (e.g., AWS SageMaker, Kubeflow on Kubernetes), we have reevaluated our Jenkins-based experiment runner against them. Due to either large premiums in compute cost or unjustifiable requirements for self-managed infrastructure, our Jenkins experiment runner remained the better overall option. With the features released at the December 2020 AWS re:Invent conference, price decreases for SageMaker instances, and automated Kubernetes infrastructure management tools like eksctl, the value proposition of adopting some of these tools is changing.
Iterative Improvements and the Future
An exciting part of being at the forefront of Machine Learning research and implementation is surveying state-of-the-art research and tools and balancing adoption costs against long-term strategy. Our corporate focus on iterative improvement has helped develop our ML experiment-running platform into its current state and positions us to adopt more productive processes in the future. If a dynamic environment where you tackle modern software-meets-ML problems sounds interesting to you, head over to our careers page and apply now!