This job has expired and no longer accepts applications.

Machine Learning Performance Engineer

Netherlands, remote

Amsterdam, hybrid

We’re looking for a performance-focused ML Engineer to help speed up large-scale model training by optimizing our internal stack and compute infrastructure. You’ll work across the full training pipeline — from GPU kernels to system-level throughput — applying profiling, CUDA-level tuning, and distributed systems techniques. The goal is to reduce training time, boost iteration speed, and use compute more efficiently.
This is a key role in a growing team building deep technical expertise in ML training systems.

Responsibilities

  • Optimize our model training pipeline to improve both speed and reliability, enabling faster and more efficient experimentation;

  • Apply GPU-level optimization techniques using tools like JAX, Triton, and low-level CUDA to improve training performance and efficiency at scale;

  • Identify and resolve performance bottlenecks across the entire ML pipeline from data loading and preprocessing to CUDA kernels;

  • Build tools and extend internal infrastructure to support scalable, reproducible, and high-performance training workflows;

  • Mentor and support engineers and researchers in adopting performance best practices across the team;

  • Help grow the team’s GPU and systems-level capabilities, and contribute to a culture of engineering excellence and rapid experimentation.
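The bottlenecks mentioned above often sit in the input pipeline rather than the compute itself. As a toy, stdlib-only illustration of one common fix, the sketch below overlaps (simulated) data loading with the (simulated) training step using a background prefetch thread; all function names and timings here are hypothetical, not part of the actual stack.

```python
# Toy sketch: hiding input-pipeline latency by prefetching batches on a
# background thread while the "training step" runs. Stdlib only; the
# sleeps stand in for I/O-bound loading and compute-bound training.
import queue
import threading
import time

def load_batch(i):
    time.sleep(0.01)  # simulated I/O-bound data loading
    return list(range(i, i + 4))

def train_step(batch):
    time.sleep(0.01)  # simulated compute-bound training step
    return sum(batch)

def run_serial(n):
    # Baseline: load and train strictly one after the other.
    return [train_step(load_batch(i)) for i in range(n)]

def run_prefetched(n, depth=2):
    # Producer thread loads batches ahead of the consumer, bounded by
    # a small queue so memory use stays capped.
    q = queue.Queue(maxsize=depth)

    def producer():
        for i in range(n):
            q.put(load_batch(i))
        q.put(None)  # sentinel: no more batches

    threading.Thread(target=producer, daemon=True).start()
    results = []
    while (batch := q.get()) is not None:
        results.append(train_step(batch))
    return results

t0 = time.perf_counter(); serial = run_serial(20); t_serial = time.perf_counter() - t0
t0 = time.perf_counter(); overlapped = run_prefetched(20); t_overlap = time.perf_counter() - t0
print(f"serial: {t_serial:.2f}s, prefetched: {t_overlap:.2f}s")
```

With loading and compute each taking roughly equal time, overlapping them approaches a 2x throughput improvement while producing identical results; real training pipelines apply the same idea via framework-level prefetching.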

Requirements

  • Demonstrated experience optimizing neural network training in production or large-scale research settings, e.g. reducing training time, improving hardware utilization, or accelerating feedback cycles for ML researchers;

  • Extensive practical experience with ML frameworks such as PyTorch or JAX;

  • Hands-on experience with training and optimizing deep learning architectures such as LSTMs and Transformer-based models, including different attention mechanisms;

  • Experience working with CUDA, Triton, or other low-level GPU technologies for performance tuning;

  • Proficiency in profiling and debugging training pipelines, using tools such as Nsight, cProfile, cuda-gdb, or the PyTorch profiler;

  • Understanding of distributed training concepts (e.g. data/model/tensor/sequence/pipeline/context parallelism, memory and compute tradeoffs);

  • A collaborative and proactive mindset, with strong communication skills and the ability to mentor teammates and partner effectively within the team;

  • Strong proficiency in Python for building infrastructure-level tooling, debugging training systems, and integrating with ML frameworks and profiling tools.
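For the profiling requirement above, a minimal sketch of the workflow using Python's built-in cProfile (one of the tools named in the list) is shown below; the `preprocess` and `train_step` functions are illustrative stand-ins, not real pipeline code.

```python
# Minimal sketch: finding hot spots in a (toy) training loop with the
# stdlib cProfile/pstats modules. Function names are hypothetical.
import cProfile
import io
import pstats

def preprocess(batch):
    # Stand-in for a data-loading / preprocessing step.
    return [x * 2 for x in batch]

def train_step(batch):
    # Stand-in for a single training step.
    return sum(preprocess(batch))

profiler = cProfile.Profile()
profiler.enable()
for _ in range(1000):
    train_step(list(range(100)))
profiler.disable()

# Rank functions by cumulative time to see where the loop spends it.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats()
report = stream.getvalue()
print(report)
```

The same sort-by-cumulative-time workflow carries over to torch.profiler or Nsight traces, just with GPU kernel and memory-transfer rows added to the picture.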

What we offer

  • Competitive, above-market compensation, with twice-yearly bonuses of up to 50% of annual salary;

  • Sophisticated internal training and development programs;

  • Comprehensive health insurance;

  • Reimbursement for sports activities;

  • Engaging corporate events twice a year;

  • A high level of influence over, and ownership of, your work;

  • Close collaboration with an experienced team in a flat organizational structure.

Posted on: 8/10/2025

Pinely

Pinely is a privately owned and funded algorithmic trading firm specializing in high-frequency and ultra-low latency trading.
