Training Infrastructure Engineer (up to €200k)
Dex
Description
This role is with one of Dex's trusted partner companies. We work closely with their teams to truly understand their culture, goals, and what they're looking for, so we can match you with the right opportunity and give you context about the role before you commit to a process.
If you're interested, sign up to Dex to apply. Dex is an AI recruiter agent that helps you run your job search. Tell Dex your stack, seniority, and what you want to build. We will manage your applications and surface other opportunities that are a fit.
The role
Imagine a world where silent video comes to life with hyper-realistic, custom-generated sound, speech, and music. This company is building the foundational generative AI models to make that a reality, empowering creators and transforming content across gaming, video platforms, and beyond. They're backed by significant early funding from leading VCs, and are rapidly scaling their engineering team.
This isn't an MLOps role focused on deploying existing models. This is a deep-dive into the full ML training stack, where you'll design and optimize the infrastructure that powers cutting-edge generative AI. You'll own the entire pipeline, from GPU profiling and parallelism strategies to efficient data pipelines and cluster management, directly shaping the foundation for all future models. This is a rare seat for an engineer who lives at the intersection of hardware, distributed systems, and deep learning.
The work
- Architect and implement optimal training strategies (parallelism, precision) for diverse generative AI models, from small to very large scale.
- Profile, debug, and optimize single- and multi-GPU operations, digging into hardware-level behavior with tools like Nsight.
- Drive end-to-end efficiency across the entire training pipeline: from data storage and loading to distributed training, checkpointing, and logging.
- Design, deploy, and maintain large-scale ML training clusters, leveraging orchestrators like SLURM.
- Build scalable systems for experiment tracking, data/model versioning, and insights, ensuring reproducibility and rapid iteration.
What you bring
- You've implemented advanced training and inference optimization techniques at scale, not just read about them.
- You have a deep understanding of the GPU memory hierarchy and compute capabilities, and you know where actual performance falls short of the theoretical peak.
- You have a proven ability to optimize both memory-bound and compute-bound operations, and you understand how they interact.
- You have practical expertise with efficient attention algorithms and their performance characteristics across different model scales.
Why apply through Dex
This is a highly sought-after role at a rapidly scaling AI company, often hard to find or apply to directly. Apply through Dex to get a direct line to the hiring team, skip the cold application process, and receive a full brief on the company and role before your first interview. We cut through the noise to connect you with opportunities that truly fit.
If you're interested, sign up to Dex to apply: https://jobs.meetdex.ai/jobs/a4935e9b-6e51-434e-b571-54ec66644680
As part of the recruitment process at Dex, we process your personal data in accordance with our Privacy Notice for Job Applicants. This notice explains how and why your data is collected and used, and how you can contact us if you have any concerns.