About Me
I'm Ritvik Chemudupati, an AI Infrastructure & Engineering professional with 3 years at Deloitte where I've built ML systems on AWS and Kubernetes. I specialize in moving teams from experimental models to reliable, scalable deployments by designing distributed ML infrastructure, automating CI/CD pipelines, and operating GPU-accelerated workloads. My focus is on creating systems that empower data scientists while maintaining production reliability.
My work has helped reduce infrastructure costs by 20% (~$600K annually) while supporting 75+ data scientists across multiple teams and managing 4-8 production Kubernetes clusters. I'm passionate about bridging the gap between ML research and production systems.

Experience
Platform Engineer
- Reduced Kubernetes infrastructure costs by 20% (~$600K annually) by consolidating from 7-8 to 3-4 clusters per account, rightsizing pods, tightening autoscaling bounds, culling idle notebooks, and removing orphaned EBS volumes and ECR images across multiple AWS accounts using AWS Cost Explorer.
- Deployed NVIDIA Morpheus on AWS EKS as sole owner of a production GPU inference platform serving 3 teams, configuring GPU scheduling with taints, tolerations, node selectors, IRSA for pod-level AWS credentials, and Lambda-to-EKS authentication through the aws-auth ConfigMap. Built Ansible automation for idempotent redeployment across clusters.
- Built a self-service CI/CD platform for ML deployments on KServe, eliminating manual platform engineer involvement through GitHub OIDC federation with AWS for keyless ECR pushes and ArgoCD Image Updater for automatic rollouts. Scaled platform usage to 75+ data scientists across teams, surfacing VPC IP constraints at production-scale load.
- Operated Kubeflow on AWS EKS and helped recover a full production outage by manually reconstructing cluster state without terraform apply, restoring AZ-specific EBS volumes, creating AZ-targeted Kubernetes Jobs, and recovering data with Python-based S3 sync scripts.
- Provisioned EKS infrastructure using Terraform, including node groups, GPU p3/g5 and CPU m5 instance types, node labels for workload isolation, and staging/production cluster configuration.
- Deployed Robust Intelligence AI Firewall in a hybrid architecture, resolving production networking failures involving AWS hairpin NAT limitations, ALB annotations, and security group rules to connect the vendor control plane to on-cluster services.
- Implemented observability with Prometheus and Grafana across multiple Kubernetes clusters, building dashboards from pod logs and metrics for resource utilization and pipeline health.
Technical Skills
Languages
Python
Bash
C++
SQL

Cloud Platforms
EKS
EC2
S3
IAM
Lambda
SageMaker
ECR
Azure

Container Orchestration
Kubernetes
Docker
Helm
Knative
Kustomize

MLOps & ML Platforms
Kubeflow
KServe
MLflow
NVIDIA Morpheus
Robust Intelligence
Hopsworks
PyTorch

CI/CD & GitOps
GitHub Actions
ArgoCD
Git

Infrastructure as Code
Terraform
Ansible

Monitoring & Observability
Prometheus
Grafana
CloudWatch
Kubecost

Certifications

Getting Started with Deep Learning
NVIDIA

Fundamentals of Accelerated Data Science
NVIDIA
Education

Master of Science in Computer Science
Specialization: Artificial Intelligence
Georgia Institute of Technology

Bachelor of Engineering in Computer Science
Birla Institute of Technology and Science, Pilani