Machine Learning Performance Engineer
Amsterdam, hybrid
We’re looking for a performance-focused ML Engineer to help speed up large-scale model training by optimizing our internal stack and compute infrastructure. You’ll work across the full training pipeline — from GPU kernels to system-level throughput — applying profiling, CUDA-level tuning, and distributed systems techniques. The goal is to reduce training time, boost iteration speed, and use compute more efficiently.
This is a key role in a growing team building deep technical expertise in ML training systems.
Responsibilities
Optimize our model training pipeline to improve both speed and reliability, enabling faster and more efficient experimentation;
Apply GPU-level optimization techniques using tools such as JAX, Triton, and low-level CUDA to improve training performance and efficiency at scale (see the illustrative kernel sketch after this list);
Identify and resolve performance bottlenecks across the entire ML pipeline from data loading and preprocessing to CUDA kernels;
Build tools and extend internal infrastructure to support scalable, reproducible, and high-performance training workflows;
Mentor and support engineers and researchers in adopting performance best practices across the team;
Help grow the team’s GPU and systems-level capabilities, and contribute to a culture of engineering excellence and rapid experimentation.
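To give a flavour of the GPU-level work above, here is a minimal, illustrative Triton sketch (our example, not part of the role description; it assumes a CUDA-capable GPU and a recent Triton install). It fuses a scale-and-add, y = a*x + y, into a single pass over memory, the kind of memory-bound fusion this sort of tuning often targets.

```python
# Illustrative sketch only: a tiny Triton kernel fusing a scale-and-add
# (y = a * x + y) into one memory-bound pass over the tensors.
import torch
import triton
import triton.language as tl

@triton.jit
def scale_add_kernel(x_ptr, y_ptr, a, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements            # guard the tail of the tensor
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(y_ptr + offsets, a * x + y, mask=mask)

def scale_add_(x: torch.Tensor, y: torch.Tensor, a: float) -> torch.Tensor:
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)         # one program instance per 1024 elements
    scale_add_kernel[grid](x, y, a, n, BLOCK_SIZE=1024)
    return y

# Quick correctness check against the eager PyTorch result.
x = torch.randn(1 << 20, device="cuda")
y = torch.randn(1 << 20, device="cuda")
expected = 2.0 * x + y
scale_add_(x, y, 2.0)
torch.testing.assert_close(y, expected)
```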
Requirements
Demonstrated experience optimizing neural network training in production or large-scale research settings (e.g. reducing training time, improving hardware utilization, or accelerating feedback cycles for ML researchers);
Extensive practical experience with ML frameworks such as PyTorch or JAX;
Hands-on experience with training and optimizing deep learning architectures such as LSTM- and Transformer-based models, including different attention mechanisms;
Experience working with CUDA, Triton, or other low-level GPU technologies for performance tuning;
Proficiency in profiling and debugging training pipelines using tools such as Nsight, cProfile, cuda-gdb, and the PyTorch profiler (a minimal profiler sketch follows this list);
Understanding of distributed training concepts (e.g. data/model/tensor/sequence/pipeline/context parallelism, memory and compute tradeoffs);
A collaborative and proactive mindset, with strong communication skills and the ability to mentor teammates and partner effectively within the team;
Strong proficiency in Python for building infrastructure-level tooling, debugging training systems, and integrating with ML frameworks and profiling tools;
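As a hedged illustration of the profiling workflow referenced above, the following sketch (ours, with an arbitrary placeholder model and batch size) uses the PyTorch profiler to rank the operators of one training step by GPU time, a typical first step when hunting for bottlenecks.

```python
# Illustrative sketch only: profiling a single training step with torch.profiler
# to see where time goes (CPU-side overhead, host-to-device copies, CUDA kernels).
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 10)
).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()

x = torch.randn(256, 1024, device="cuda")
y = torch.randint(0, 10, (256,), device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True) as prof:
    optimizer.zero_grad(set_to_none=True)
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

# Rank operators by total GPU time to locate the dominant kernels.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```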
What we offer
Above-market compensation, with twice-yearly bonuses of up to 50% of annual salary;
Sophisticated internal training and development programs;
Comprehensive health insurance;
Reimbursement for sports activities;
Corporate events twice a year;
A high level of influence and ownership over your work;
Close collaboration with an experienced team in a flat organizational structure.
Posted on: 8/10/2025

Pinely
Pinely is a privately owned and funded algorithmic trading firm specializing in high-frequency and ultra-low latency trading.