MLOps Engineer
CNTXT AI
Description
Job role:
A dedicated startup is being formed to industrialize and scale a secure, AI-enabled, multi-source decision-support software offering. The platform is a multi-sensor fusion and agentic AI solution connecting to diverse data sources (for example geospatial layers, imagery, video, and other operational signals). This role will support the delivery of a scalable product and contribute to establishing the processes, standards, and collaboration practices required for sustainable growth.
The MLOps Engineer is responsible for designing, deploying, and maintaining secure, scalable, and highly available cloud environments. This role focuses on building robust infrastructure on AWS (or multi-cloud environments, if applicable), automating operational processes, and ensuring the reliability and performance of cloud-based systems. The ideal candidate combines deep technical expertise with strong problem-solving skills and a passion for automation and cloud-native technologies.
Job Responsibilities
- Design and operate end-to-end ML/LLM delivery pipelines, from data through training/fine-tuning, evaluation, and packaging to deployment
- Build CI/CD for models and services, including automated testing, validation gates, and rollback strategies
- Standardize experiment tracking, model/version lineage, and artifact management (datasets, prompts, checkpoints, embeddings)
- Implement monitoring and observability: latency, cost, drift, quality signals, and safety/guardrails metrics
- Optimize inference performance and cost (batching, caching, quantization, hardware choices)
- Define and enforce environment and dependency management across dev/stage/prod
- Work with engineering on scalable serving patterns (APIs, streaming, event-driven), and with security on access controls and secrets
- Support release readiness: runbooks, incident response, SLOs/SLAs, and post-release stability tracking
- Coordinate with procurement and legal where needed for tooling, cloud services, and vendor onboarding
- Startup mode: hands-on, flexible, comfortable pivoting, and able to unblock teams quickly
- Interfaces / stakeholders
Qualifications & Experience
- Typically 5+ years in MLOps/DevOps/Data Platform roles, including production deployments of ML and/or LLM-powered systems.
- Experience in fast-paced product environments preferred.
Tools (examples)
- ML lifecycle: MLflow / Weights & Biases / equivalent
- Serving: FastAPI, Triton (plus), Ray Serve (plus)
- Orchestration: Airflow/Dagster (plus)
- Observability: Prometheus/Grafana, OpenTelemetry, ELK
- Cloud: AWS/Azure/GCP (or private cloud)
KPIs
- Deployment frequency and lead time for model releases
- Production stability: incident rate, MTTR, SLO compliance
- Model quality health: drift detection coverage, evaluation gate pass rate
- Inference cost and latency improvements
- Reproducibility and traceability coverage (lineage completeness)
Competencies
- Strong MLOps fundamentals: model lifecycle, reproducibility, evaluation, deployment, monitoring
- Proficiency with containers and orchestration (Docker; Kubernetes is a plus)
- CI/CD and automation (GitHub Actions/GitLab CI/Jenkins), infrastructure-as-code (Terraform is a plus)
- Experience with model serving patterns (REST/gRPC) and observability tools
- Comfort with cloud primitives (compute, storage, networking) and cost management practices
- Clear communication and documentation; strong ownership and operational discipline