Success Stories

Cloud HPC Cluster

MLOps @ FAANG company

Client challenge


The AI/ML explosion required more researchers and more GPUs per researcher. On-prem HPC clusters have a few advantages (customizability, performance, security, and control) but many disadvantages: a massive upfront investment, a long time to ROI, and a multi-year build during which fast-moving hardware becomes obsolete.

Cloud HPC, while less performant because of cloud constraints (hardware co-location, storage technologies, networking), provides a flexible and cost-effective environment, making it ideal for testing cutting-edge hardware and for overflow capacity.

Off-the-shelf cloud HPC solutions have limited features. Integration with internal services was a priority.

Solution delivered


HPC Slurm cluster deployed on AWS, using AWS ParallelCluster as the base layer and extended with many custom features to make it production ready: secure access for internal users, Unix user management, 2FA, S3 data pipelines, support for multiple FSx for Lustre file systems, Slurm partitions and limits, Slurm accounting, observability, hardware testing, login nodes, support for multiple tenants on different accounts, persistent $HOME, Lustre eviction, capacity planning, and custom safeguards for AWS services.
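As an illustration only, the sketch below shows how Slurm partitions with EFA and several existing FSx for Lustre mounts could be expressed as a ParallelCluster 3 style configuration. It is not the configuration used in this engagement. The helper `build_cluster_config`, the queue names, instance types, region, subnet ID, and file system IDs are all hypothetical, and the key names should be checked against the official ParallelCluster configuration reference before use.

```python
import json


def build_cluster_config(queues, lustre_mounts, subnet_id):
    """Build a dict shaped like a ParallelCluster 3 cluster config.

    Hypothetical helper: all defaults below are placeholder assumptions,
    and the result is deliberately incomplete (no SSH key, IAM, tags, ...).
    """
    return {
        "Region": "us-east-1",  # assumption: placeholder region
        "Image": {"Os": "alinux2"},
        "HeadNode": {
            "InstanceType": "m5.2xlarge",  # assumption: placeholder size
            "Networking": {"SubnetId": subnet_id},
        },
        "Scheduling": {
            "Scheduler": "slurm",
            # One Slurm partition per queue entry.
            "SlurmQueues": [
                {
                    "Name": q["name"],
                    "Networking": {
                        "SubnetIds": [subnet_id],
                        "PlacementGroup": {"Enabled": True},
                    },
                    "ComputeResources": [
                        {
                            "Name": f"{q['name']}-cr",
                            "InstanceType": q["instance_type"],
                            "MinCount": 0,  # scale to zero when idle
                            "MaxCount": q["max_nodes"],  # per-partition node limit
                            "Efa": {"Enabled": True},
                        }
                    ],
                }
                for q in queues
            ],
        },
        # Attach several existing FSx for Lustre file systems.
        "SharedStorage": [
            {
                "Name": m["name"],
                "MountDir": m["mount_dir"],
                "StorageType": "FsxLustre",
                "FsxLustreSettings": {"FileSystemId": m["file_system_id"]},
            }
            for m in lustre_mounts
        ],
    }


# Hypothetical inputs, for illustration only.
config = build_cluster_config(
    queues=[
        {"name": "train", "instance_type": "p4d.24xlarge", "max_nodes": 64},
        {"name": "debug", "instance_type": "p4d.24xlarge", "max_nodes": 2},
    ],
    lustre_mounts=[
        {"name": "datasets", "mount_dir": "/fsx/datasets",
         "file_system_id": "fs-0123456789abcdef0"},
        {"name": "scratch", "mount_dir": "/fsx/scratch",
         "file_system_id": "fs-0fedcba9876543210"},
    ],
    subnet_id="subnet-0123456789abcdef0",
)

print(json.dumps(config, indent=2))
```

ParallelCluster reads its configuration as YAML, so in practice a dict like this would be serialized to YAML or templated by a tool such as Terraform.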

Over time, an Azure cluster was also added to the stack using Azure CycleCloud.

Tech stack: Terraform, Packer, AWS (EC2 + EFA, FSx, EFS, S3, SES, SNS, SQS, Step Functions, Cognito, DynamoDB, CloudWatch), PyTorch + NCCL, Duo.

Business Results


500+ researchers
20+ clusters
5+ accounts/tenants
5000+ GPUs under management
multiple PB on S3/FSx

AWS ParallelCluster took many ideas from this engagement.