Technical Product Manager

Europe · Remote · Senior

At Nebius, we’re building a next-generation AI compute platform for large-scale ML training and inference — from a few nodes to thousands of GPUs.
We’re looking for a Technical Product Manager to lead Mission Control — the product area responsible for reliability and performance across the full infrastructure stack.
As PM for Mission Control, you will own foundational capabilities that determine how well AI infrastructure performs in real-world training and inference workloads — from bare metal and networking to scheduler/runtime behavior and user-facing outcomes. This is a deeply technical PM role.

A prior PM title is not mandatory: strong candidates from HPC, ML infrastructure, distributed systems, SRE, cloud engineering, or ML solution architecture who want to grow into product are welcome.

Your responsibilities will include: 

• Own reliability and performance opportunities across the Nebius stack: from bare metal to applications.
• Define product direction end-to-end: problem discovery → design → delivery → adoption.
• Drive cross-functional execution across compute, networking, storage, observability, platform, and hardware teams.
• Lead deep problem research using customer interviews, analytics, workload studies, and log investigations.
• Identify and prioritize bottlenecks affecting large-scale training/inference performance and stability.
• Translate advanced ML/infrastructure research into practical, scalable product capabilities.
• Define and operationalize product metrics for cluster experience (e.g. reliability, efficiency, latency-to-start, utilization, throughput).

We expect you to have: 

• 3–5+ years of experience in one or more of: product management, HPC, ML infrastructure/MLOps, distributed systems, SRE, cloud architecture, or GPU platforms.
• Strong technical foundation in distributed systems, cloud infrastructure, or ML platforms.
• Hands-on familiarity with ML orchestration environments (e.g. Slurm, Kubernetes, Ray, or similar).
• Experience delivering technically complex initiatives with multiple engineering teams.
• Strong communication skills and ability to influence engineering, research, and customer stakeholders.
• Experience using analytics and data to prioritize roadmap decisions.
• High ownership, learning speed, and comfort in fast-evolving AI infrastructure environments.

It will be an added bonus if you have: 

• Experience with GPU platforms and HPC technologies (InfiniBand/RDMA, topology-aware systems).
• Familiarity with modern ML training stacks (PyTorch, DeepSpeed, FSDP/ZeRO, NCCL).
• Understanding of training efficiency metrics and operational signals (Goodput, MFU, scheduling quality, health checks).
• Exposure to large-scale LLM training or inference systems.
• Background in observability, performance tuning, or reliability engineering.
• Customer-facing technical experience supporting ML or infrastructure workloads.

Published on: 3/27/2026

Nebius

The Nebius AI Cloud brings powerful full-stack infrastructure for AI developers and practitioners across startups, enterprises and science institutes to build and deploy generative AI applications and rapidly deliver scientific breakthroughs by training and running ML models within a secure, high-performance, and cost-optimized cloud environment.
