Site Reliability Engineer

Europe · France · Remote · Senior

We are looking for a Senior Site Reliability Engineer to help scale the AI infrastructure powering Photoroom. You’ll own and evolve the systems responsible for serving millions of machine learning inference requests every day, working closely with our ML, Product, Web and Mobile teams to ensure reliability, performance and scalability as we continue to grow.

Compensation: €70k–€100k* + Stock Options/BSPCE (local currency)

Location: Remote, with monthly visits to the Paris office (fully reimbursed). We can only hire people already based in one of the following countries: France, Germany, Ireland, Italy, Portugal, Spain, UK, Poland.

About the role

You’ll own the Machine Learning Inference Infrastructure at Photoroom, powering millions of AI requests every day across GPU-based systems.
You’ll be responsible for the infrastructure that deploys and runs machine learning workloads, partnering closely with ML engineers to ensure services remain reliable, scalable and cost-efficient.
You’ll design and build cloud-agnostic infrastructure solutions that support both current and future AI workloads. Requests range from milliseconds to several seconds in duration and can carry payloads of tens of megabytes.
You’ll work across the full infrastructure lifecycle, from architecture and implementation through to monitoring, optimisation and incident management, using tools such as Datadog to maintain system health.
You’ll build and improve systems for load balancing, autoscaling, queuing and workload orchestration, ensuring consistent performance under rapidly growing demand.
You’ll work directly with engineers across ML, Product, Mobile and Web teams to identify bottlenecks, improve deployment workflows and enable faster iteration.
You’ll monitor production systems, analyse usage patterns and make infrastructure decisions based on real-world performance and user impact.
Optionally, you’ll participate in the team’s on-call rotation to help maintain platform reliability.

About you

You have experience designing and operating large-scale distributed systems with high availability and reliability requirements.
You have hands-on experience with load balancing, autoscaling, queuing systems and traffic management at scale.
You have worked on low-latency, real-time backend systems and understand how to optimise for performance and throughput.
You have experience building resilient, redundant infrastructure capable of handling failures gracefully.
You have designed and operated platforms that deploy and run containerised workloads at high scale while maintaining an excellent developer experience.
You have experience supporting workloads that vary significantly in duration, from milliseconds to several seconds.
You are highly pragmatic and focus on delivering business impact quickly, leveraging existing tools and frameworks where appropriate rather than reinventing solutions.
You demonstrate strong ownership and are comfortable making technical decisions independently while collaborating effectively across teams.
You have previously worked in a high-growth startup or similarly fast-moving environment.
You enjoy learning from others, sharing knowledge and contributing to a collaborative engineering culture.
Experience supporting machine learning infrastructure or GPU workloads is a plus, but not required.
You are fluent in English (French is not required).

If you think you have what it takes but don’t meet every single point above, please still apply. We’d love to chat and see if you could be a great fit.

Published on: 6/3/2026