(Senior) Site Reliability Engineer (m/f/d) in Berlin or Konstanz
KNIME
Description
Mission
The Site Reliability Engineer at KNIME ensures that our next-generation cloud platform is built, operated, and scaled with reliability, security, and cost-efficiency as first-class engineering outcomes. Through active on-call ownership and hands-on engagement with engineering teams, this role sets and drives the operational standards that keep KNIME SaaS stable, observable, and production-ready at scale.
Role Overview
We are currently designing, building, and launching our next generation of products and services in the cloud, and we are looking for someone eager to be part of this process. This includes adapting our industry-leading data science and analytics platform into a managed platform capable of serving thousands of users. The ability to handle Infrastructure as Code development is crucial as we strive to match our quality and functionality with innovative solutions that address growth, cost, and durability concerns. You will work in a cloud platform team that interfaces with multiple development teams to drive KNIME products into production-ready environments.
Responsibilities
- Using code to automate the deployment and operation of large-scale SaaS systems. Experience with Kubernetes operators is a plus.
- Building out infrastructure as code using tools such as Helm, Terraform, AWS CloudFormation, and Azure Resource Manager (ARM) templates.
- Participating in on-call rotations, incident triage and mitigation, troubleshooting issues in live environments, providing root cause analysis and issue resolution.
- Setting standards for product deployments, including reliability, scalability, traceability, and monitoring, and communicating with product and development teams to help drive adoption.
- Instrumenting deployed systems for performance, reliability, and cost-effectiveness.
- Embedding with product and engineering teams to lead planning, own dependency risk, and drive consistent adoption of reliability and operational standards across teams.
Requirements
- One or more current certifications in AWS, Kubernetes, Linux, or similar technologies.
- Strong cloud experience with at least one of AWS and Azure; the ideal candidate has mastered both. In-depth knowledge of VPC, IAM, EKS, ECR, EC2, S3, RDS, CloudWatch, and their Azure counterparts.
- Experience deploying software systems to a Kubernetes environment, with a working knowledge of Kubernetes concepts and the ability to craft deployment solutions using common Kubernetes patterns.
- Scripting knowledge in Python and Shell is required; additional programming experience in Go or Java is a plus.
- Systems-level knowledge of and experience with Linux. Expertise in networking, including security, routing, load balancers, and firewalls.
- Knowledge of best practices around service telemetry, including metrics aggregation, distributed logging, and tracing in large, distributed systems.
- Working knowledge of OAuth/OIDC identity providers such as Keycloak.
- Working knowledge of relational databases such as Postgres.
- Ability to work independently and within a team environment. This includes clear and concise communication across an organization that is geographically and culturally dispersed.
What Success Looks Like
- Platform stability at scale: The multi-tenant SaaS platform maintains its availability SLO across all tenant tiers as the infrastructure grows from its current state toward a globally distributed commercial release.
- Operational excellence embedded in engineering: Every team shipping to the platform follows a shared production readiness standard, reducing escaped defects and repeated incidents.
- Automated, scalable operations: Tenant onboarding, deployments, and incident remediation are pipeline-driven, eliminating manual toil and enabling the team to scale without growing headcount at the same rate.
- Commercial readiness: The platform meets enterprise securit