MLOps Engineer - AWS, ML Infrastructure (Advertising Services)
Experience: 5+ years
Commitment: Long-term (40h/week)
Industry: Advertising Services
Location: Fully remote
Skills: AWS, ML Infrastructure, Python, Terraform, CI/CD, Docker
Requirements
Must-haves
- 5+ years of DevOps, MLOps, or Cloud Infrastructure Engineering experience
- Experience with AWS services (CDK, Lambda, EC2, S3, SageMaker, CloudWatch)
- Proficiency with Infrastructure as Code (IaC) tools such as Terraform and CloudFormation
- Strong experience with Python for scripting and automation
- Proficiency with containerization using Docker
- Experience building and maintaining CI/CD pipelines for ML workflows
- Deep knowledge of ML model lifecycle management, including deployment, monitoring, and retraining
- Based in Brazil, Argentina, Paraguay, Colombia, or Mexico
- Strong communication skills in both spoken and written English
Nice-to-haves
- Startup experience
- AWS certifications (e.g., DevOps Engineer, Solutions Architect, Machine Learning Specialty)
- Background in software engineering or ML/AI infrastructure
- Bachelor’s Degree in Computer Engineering, Computer Science, or equivalent
What you will work on
ML Infrastructure Architecture & Automation
- Design, provision, and manage AWS infrastructure for ML workloads using AWS CDK and CloudFormation
- Architect secure, scalable, and cost-efficient ML environments for experimentation, training, and inference
- Implement cloud-native services (e.g., EC2, ECS, Lambda, S3, RDS, SageMaker, Bedrock, Step Functions)
- Apply best practices for security, compliance, and disaster recovery in ML infrastructure
Model Deployment & CI/CD
- Design and maintain CI/CD pipelines for training, deployment, and retraining of models using CodePipeline, CodeBuild, GitHub Actions, or similar
- Automate testing, versioning, and rollback strategies for applications and ML models
- Build and manage Docker containers for microservices and ML applications
MLOps Enablement
- Collaborate with ML engineers to deploy, monitor, and maintain models in SageMaker
- Develop end-to-end pipelines for data preprocessing, feature engineering, training, inference, and retraining
- Integrate model monitoring, drift detection, and automated retraining triggers
Monitoring, Observability & Performance
- Implement observability frameworks for ML workloads using CloudWatch, Datadog, and other tools
- Track inference latency, accuracy, and resource usage to optimize performance
- Troubleshoot production ML systems and lead incident resolution
Collaboration & Documentation
- Partner with software, ML, and data teams to promote MLOps best practices
- Maintain clear documentation for infrastructure, deployments, and operational processes
- Contribute to code reviews and architectural discussions