
Cloud Reliability Engineer (SRE)

Adobe

Hamburg | On-site | Full-Time | 3d ago

Description

The Opportunity

Our team builds and operates the cloud platforms behind Photoshop, Firefly, Express, and other Adobe creative tools used by millions of people every day. Behind those creative tools is a large microservice landscape running on Kubernetes and AWS.

We're looking for a hands-on Site Reliability / DevOps Engineer to help keep that platform fast, available, efficient, and secure as it grows. Reliability here isn't a fixed target — it's a property we continually push forward, increasingly with agentic and LLM-assisted workflows driving triage, root-cause analysis, and remediation.

This is a role for someone who likes to go deep on distributed systems, cares about operational excellence, and wants ownership over the reliability of services that sit at the heart of Adobe's Creative Cloud. You'll join a globally distributed Reliability Engineering team working alongside engineers in Hamburg, the US, and India.

What You'll Do

  • Improve the reliability, scalability, performance, security, and cost-efficiency of the platform's microservices running on Kubernetes and AWS.
  • Build and maintain strong observability using metrics, logs, traces, dashboards, and meaningful alerting, with monitoring solutions like Prometheus, New Relic, Grafana, and Splunk, so that we detect and understand issues before customers do.
  • Own infrastructure-as-code and automated delivery with Terraform, Kubernetes, Helm, ArgoCD, and CI/CD pipelines — keeping infrastructure across AWS repeatable, consistent, reviewable, and auditable.
  • Drive down toil with AI-assisted and agentic automation — auto-remediation, self-healing workflows, and LLM-generated runbooks and IaC — rather than hand-crafting one-off scripts, so the team's effort compounds.
  • Help grow a shared automation platform that tackles auto-remediation, self-healing workflows, and infrastructure-as-code — where AI accelerates the build, and every contribution compounds the team's capability.
  • Partner with engineering teams, for example to forecast capacity based on usage trends or to implement new technologies, so that the platform scales to meet growing demand.
  • Contribute to the security and compliance posture of the platform, partnering with collaborators on controls, evidence, and audit readiness throughout daily reliability work.
  • Help set the bar for how the team uses AI in operations — choosing where agentic and LLM-assisted tooling adds real leverage, and where human judgment must stay in the loop.
  • Participate in a healthy, sustainable on-call rotation, and help continuously improve our runbooks and operational practices.
  • Collaborate across Adobe's global Reliability organization to advance the shared mission of "delivering better software faster."

What You Need To Succeed

We don't expect any single person to check every box. When you bring most of the core skills below and are excited about the rest, we'd love to hear from you.

  • Several years of professional experience operating, scaling, or building distributed systems in production (SRE, DevOps, platform, or backend engineering backgrounds all welcome).
  • Hands-on production experience with AWS and with container orchestration on Kubernetes (plus tooling like Docker, Helm, and ArgoCD).
  • Practical experience with infrastructure-as-code, ideally Terraform, and with modern GitOps-based CI/CD workflows.
  • Experience with monitoring and observability solutions — for example Prometheus, New Relic, Grafana, or Splunk.
  • A modern, AI-forward mindset: you reach for agentic and LLM-assisted tooling to do the work, and you have the judgment to know where it accelerates you and where humans must stay in the loop.
  • Enough programming ability to read, debug, and contribute to services and tooling. These are largely Java/Spring services, so comfort reading and debugging Java is valuable, and Python is a strong advantage for automation and tooling.
  • We expect enough
