Job role:

A dedicated startup is being formed to industrialize and scale a secure, AI-enabled, multi-source decision-support software offering. The platform is a multi-sensor fusion and agentic AI solution connecting to diverse data sources (for example geospatial layers, imagery, video, and other operational signals). This role will support the delivery of a scalable product and contribute to establishing the processes, standards, and collaboration practices required for sustainable growth.

The Cloud Infrastructure Engineer is responsible for designing, deploying, and maintaining secure, scalable, and highly available cloud environments. This role focuses on building robust infrastructure on AWS (or multi-cloud environments, if applicable), automating operational processes, and ensuring the reliability and performance of cloud-based systems. The ideal candidate combines deep technical expertise with strong problem-solving skills and a passion for automation and cloud-native technologies.

Job Responsibilities

Design and operate end-to-end ML/LLM delivery pipelines, from data through training/fine-tuning, evaluation, and packaging to deployment
Build CI/CD for models and services, including automated testing, validation gates, and rollback strategies
Standardize experiment tracking, model/version lineage, and artifact management (datasets, prompts, checkpoints, embeddings)
Implement monitoring and observability: latency, cost, drift, quality signals, and safety/guardrails metrics
Optimize inference performance and cost (batching, caching, quantization, hardware choices)
Define and enforce environment and dependency management across dev/stage/prod
Work with engineering on scalable serving patterns (APIs, streaming, event-driven), and with security on access controls and secrets
Support release readiness: runbooks, incident response, SLOs/SLAs, and post-release stability tracking
Coordinate with procurement and legal where needed for tooling, cloud services, and vendor onboarding
Startup mode: hands-on, flexible, comfortable pivoting, and able to unblock teams quickly
Interfaces / stakeholders

Qualifications & Experience

Typically 5+ years in MLOps/DevOps/Data Platform roles, including production deployments of ML and/or LLM-powered systems.
Experience in fast-paced product environments preferred.
Tools (examples)
ML lifecycle: MLflow / Weights & Biases / equivalent
Serving: FastAPI, Triton (plus), Ray Serve (plus)
Orchestration: Airflow/Dagster (plus)
Observability: Prometheus/Grafana, OpenTelemetry, ELK
Cloud: AWS/Azure/GCP (or private cloud)
KPIs
Deployment frequency and lead time for model releases
Production stability: incident rate, MTTR, SLO compliance
Model quality health: drift detection coverage, evaluation gate pass rate
Inference cost and latency improvements
Reproducibility and traceability coverage (lineage completeness)

Competencies

Strong MLOps fundamentals: model lifecycle, reproducibility, evaluation, deployment, monitoring
Proficiency with containers and orchestration (Docker; Kubernetes is a plus)
CI/CD and automation (GitHub Actions/GitLab CI/Jenkins), infrastructure-as-code (Terraform is a plus)
Experience with model serving patterns (REST/gRPC) and observability tools
Comfort with cloud primitives (compute, storage, networking) and cost management practices
Clear communication and documentation; strong ownership and operational discipline

MLOps Engineer
