HPC Slurm cluster deployed on AWS, using AWS ParallelCluster as the base layer and extended with custom features to make it production-ready: Secure access for internal users. Unix user management. 2FA. S3 data pipelines. Support for multiple FSx for Lustre file systems. Slurm partitions and limits. Slurm accounting. Observability. Hardware testing. Login nodes. Support for multiple tenants in different accounts. Persistent $HOME. Lustre eviction. Capacity planning. Custom safeguards for AWS services.
Over time, an Azure cluster was also added to the stack using Azure CycleCloud.
Tech stack: Terraform. Packer. AWS (EC2 + EFA, FSx, EFS, S3, SES, SNS, SQS, Step Functions, Cognito, DynamoDB, CloudWatch). PyTorch + NCCL. Duo.