MLOps Engineer - AWS, ML Infrastructure (Advertising Services)
Experience: 5+ years
Commitment: Long-term (40h/week)
Industry: Advertising Services
Location: Fully remote
Skills: AWS, ML Infrastructure, Python, Terraform, CI/CD, Docker
Requirements
Must-haves
- 5+ years of DevOps, MLOps, or Cloud Infrastructure Engineering experience
- Experience with AWS services (CDK, Lambda, EC2, S3, SageMaker, CloudWatch)
- Proficiency with Infrastructure as Code (IaC) tools such as Terraform and CloudFormation
- Strong experience with Python for scripting and automation
- Proficiency with containerization using Docker
- Experience building and maintaining CI/CD pipelines for ML workflows
- Deep knowledge of ML model lifecycle management, including deployment, monitoring, and retraining
- Based in Brazil, Argentina, Paraguay, Colombia, or Mexico
- Strong communication skills in both spoken and written English
Nice-to-haves
- Startup experience
- AWS certifications (e.g., DevOps Engineer, Solutions Architect, Machine Learning Specialty)
- Background in software engineering or ML/AI infrastructure
- Bachelor’s Degree in Computer Engineering, Computer Science, or equivalent
What you will work on
ML Infrastructure Architecture & Automation
- Design, provision, and manage AWS infrastructure for ML workloads using AWS CDK and CloudFormation
- Architect secure, scalable, and cost-efficient ML environments for experimentation, training, and inference
- Implement cloud-native services (e.g., EC2, ECS, Lambda, S3, RDS, SageMaker, Bedrock, Step Functions)
- Apply best practices for security, compliance, and disaster recovery in ML infrastructure
Model Deployment & CI/CD
- Design and maintain CI/CD pipelines for training, deployment, and retraining of models using CodePipeline, CodeBuild, GitHub Actions, or similar
- Automate testing, versioning, and rollback strategies for applications and ML models
- Build and manage Docker containers for microservices and ML applications
MLOps Enablement
- Collaborate with ML engineers to deploy, monitor, and maintain models in SageMaker
- Develop end-to-end pipelines for data preprocessing, feature engineering, training, inference, and retraining
- Integrate model monitoring, drift detection, and automated retraining triggers
Monitoring, Observability & Performance
- Implement observability frameworks for ML workloads using CloudWatch, Datadog, and other tools
- Track inference latency, accuracy, and resource usage to optimize performance
- Troubleshoot production ML systems and lead incident resolution
Collaboration & Documentation
- Partner with software, ML, and data teams to promote MLOps best practices
- Maintain clear documentation for infrastructure, deployments, and operational processes
- Contribute to code reviews and architectural discussions