Senior SRE / Platform Engineer (m/f/d)

SimScale GmbH

Home Office, DE · Remote · Full-Time · 3w ago

Description

The Role

We are looking for a Senior SRE / Platform Engineer (m/f/d) to own and improve the cloud infrastructure behind SimScale's browser-based simulation platform. The role spans AWS and EKS, observability, disaster recovery, security and compliance controls, multi-region architecture, elastic GPU/HPC capacity, and internal developer tooling.

SimScale's engineering teams run workloads directly on AWS; you will build the standards, guardrails, and self-service tooling that let them do so safely, raising reliability and security without slowing engineering velocity. You will join a small, tightly knit infrastructure team supporting 50+ engineers across the company. This is a hands-on senior individual contributor role; people management is not required, but there is a genuine path toward tech-lead ownership as the team grows.

Your Opportunity

  • Evolve our Kubernetes platform: Evaluate and adopt technologies such as the Kubernetes Gateway API and service mesh patterns, and coordinate platform evolution across 10+ engineering teams.
  • Take observability to the next level: Drive organization-wide adoption of OpenTelemetry for distributed tracing and metrics, and help teams define meaningful SLOs.
  • Shape multi-region architecture and data residency: Support our move from an EU-centered footprint toward a global, multi-cloud architecture that satisfies disaster-recovery and data-residency requirements.
  • Own cloud cost and efficiency at scale: Keep petabyte-scale infrastructure cost-efficient, secure, and well-instrumented.
  • Improve tooling: Build self-service AWS account provisioning, guardrails, and AI-assisted automations that help engineering teams manage infrastructure safely and efficiently at scale.

What We Expect from You

  • 5+ years of professional experience in SRE, platform, or infrastructure engineering.
  • Software development experience: Your background is rooted in software development, and you moved into SRE from there. You write production-quality software in at least one of Python, Go, Rust, or Java.
  • Strong systems foundation: You understand Linux internals and distributed systems well enough to debug complex production behavior.
  • Hands-on cloud and infrastructure experience: AWS (or GCP), declarative infrastructure (Terraform), GitOps workflows (Argo CD), and container orchestration (Kubernetes).
  • Observability and reliability experience: You have worked with OpenTelemetry, Prometheus, distributed tracing, monitoring, and meaningful SLOs/SLIs.
  • Production debugging depth: You can investigate complex failures, communicate clearly during incidents, and turn findings into durable improvements.
  • Security and compliance awareness: You understand how infrastructure decisions affect access control, auditability, disaster recovery, logging, and standards such as SOC 2.
  • Clear communication: You can explain trade-offs to engineering teams and help others adopt better platform practices without unnecessary friction.

Bonus Points

  • An open source portfolio or contributions.
  • Prior technical leadership experience, especially in infrastructure, reliability, or platform engineering.

Location: Remote (within CET ±5h)

What you can expect from us

  • Join a dedicated, supportive team with unlimited growth opportunities and leadership potential
  • Make an impact quickly by sharing ideas and contributing to creative, goal-oriented projects
  • Work in a diverse, inclusive environment with colleagues from over 35 countries
  • Enjoy flexible hours and the freedom to work remotely from anywhere in the world
  • Access comprehensive health coverage, retirement plans, paid time off, and wellness support
  • Enjoy fresh office lunches or gift cards as a remote employee
  • Grow as a professional with online/offline