About Me

I'm Ritvik Chemudupati, an AI Infrastructure & Engineering professional with 3 years at Deloitte, where I built ML systems on AWS and Kubernetes. I specialize in moving teams from experimental models to reliable, scalable deployments by designing distributed ML infrastructure, automating CI/CD pipelines, and operating GPU-accelerated workloads. My focus is on creating systems that empower data scientists while maintaining production reliability.

My work has helped reduce infrastructure costs by 20% (~$600K annually) while supporting 75+ data scientists across multiple teams and managing 4-8 production Kubernetes clusters. I'm passionate about bridging the gap between ML research and production systems.

Experience

AI/ML Infrastructure Engineer

Deloitte

Hyderabad, India

July 2022 - July 2025

20% cost reduction | ~$600K annual savings | 75+ data scientists | 3 production teams

• Reduced Kubernetes infrastructure costs by 20% (~$600K annually) by consolidating from 7-8 to 3-4 clusters per account, rightsizing pods, tightening autoscaling bounds, culling idle notebooks, and removing orphaned EBS volumes and ECR images across multiple AWS accounts using AWS Cost Explorer.
• Deployed NVIDIA Morpheus on AWS EKS as sole owner of a production GPU inference platform serving 3 teams, configuring GPU scheduling (taints, tolerations, and node selectors), IRSA for pod-level AWS credentials, and Lambda-to-EKS authentication through the aws-auth ConfigMap. Built Ansible automation for idempotent redeployment across clusters.
• Designed and operated end-to-end ML model serving infrastructure on Kubernetes, covering model packaging, inference service configuration on KServe, GPU resource allocation, and production rollout via GitOps, reducing deployment time for data science teams from days to hours.
• Built a self-service CI/CD platform for ML deployments on KServe, eliminating manual platform engineer involvement through GitHub OIDC federation with AWS for keyless ECR pushes and ArgoCD Image Updater for automatic rollouts. Scaled platform usage to 75+ data scientists across teams, which surfaced VPC IP address constraints under production-scale load.
• Operated Kubeflow on AWS EKS and helped recover a full production outage by manually reconstructing cluster state without terraform apply, restoring AZ-specific EBS volumes, creating AZ-targeted Kubernetes Jobs, and recovering data with Python-based S3 sync scripts.
• Provisioned EKS infrastructure using Terraform, including node groups, GPU (p3/g5) and CPU (m5) instance types, node labels for workload isolation, and staging/production cluster configuration.
• Deployed Robust Intelligence AI Firewall in a hybrid architecture, resolving production networking failures involving AWS hairpin NAT limitations, ALB annotations, and security group rules to connect the vendor control plane to on-cluster services.
• Implemented observability with Prometheus and Grafana across multiple Kubernetes clusters, building dashboards from pod logs and metrics for resource utilization and pipeline health.