Site Reliability Engineering (SRE)

Europe · Georgia · Serbia · Remote · Senior

We currently have several large-scale projects and are expanding our infrastructure team. Our product is an advanced platform for creating and managing AI agents. It can be deployed directly inside a customer’s infrastructure and delivered as an enterprise solution, while also being available as a SaaS version.

Under the hood, there is real-time voice and telephony, GPU and LLM inference, and streaming analytics, all running both in the cloud and on-prem, including in banking environments. There is a lot of infrastructure; it is complex, interesting, and sometimes at the edge of what is possible. That is why we are looking for a strong SRE who, like us, cares about making systems transparent, reliable, and built the right way.

This is a role for a strong, independent engineer: a Senior SRE with real influence and a voice in how things are built and operated.

You will also handle DevOps tasks for the team, but your main focus and area of expertise should be SRE: reliability, observability, incident management, and performance under load.

Requirements

  • 5+ years in SRE/DevOps. You have not just seen production; you have been responsible for the reliability of high-load production systems.

  • Deep, practical understanding of Docker and Kubernetes. You have operated them in production, not just used them in tutorials.

  • Mature understanding of metrics and alerts, with real hands-on experience writing, tuning, and maintaining them.

  • Practical experience with Prometheus, Alertmanager, and Grafana.

  • Ability and willingness to build dashboards and make them clear, useful, and easy to work with.

  • Experience with SLIs/SLOs, reliability management, incident investigation, and postmortems.

  • Experience with load testing and basic capacity planning.

  • Python: you can write code and confidently read and modify other people’s code for automation, exporters, tooling, and related tasks.

  • Cloud experience with GCP and/or AWS, strong Linux skills, and solid networking knowledge at an operational level.

  • DevOps fundamentals: CI/CD and infrastructure as code, including GitHub Actions, Terraform, Ansible, and similar tools.

  • Willingness to understand and support the product in customer environments, including on-prem deployments.

  • Ownership mindset: you take responsibility for a task, drive it to completion, and think one step ahead.

  • Friendly, non-toxic, and pleasant to work with.

  • Strong communication with developers: you can clearly and constructively explain your position, defend it when needed, and find common ground.

  • Willingness and ability to mentor, teach, and share knowledge with others.

  • Analytical mindset: you dig down to the root cause instead of just treating symptoms.

  • Proactivity: you would rather prevent an outage than heroically fight it later.

  • Strong attention to detail and reliability.

Nice to have

  • Experience using AI agents for routine and recurring tasks.

  • Real-time telephony: SIP, FreeSWITCH, RTP, WebRTC.

  • GPU/ML serving: Triton, vLLM, RunPod, Nebius, Lambda, run:ai, DCGM; understanding of the specifics of deploying LLM/ML models.

  • Streaming data and analytics: Kafka, ClickHouse.

  • Deep experience with IaC and GitOps, such as Terraform, Ansible, ArgoCD; logging with Loki/ELK; gRPC.

  • Experience working in isolated and highly secure environments.

  • Experience preparing systems for significant growth in load.

Responsibilities

  • You will be responsible for the reliability of our services: SLIs/SLOs, availability, and identifying and eliminating bottlenecks across the system.

  • You will set up monitoring for services, metrics, alerts, and dashboards. This will rarely come as a clearly defined task; more often, you will decide what is important to measure and bring it to a clear, usable view.

  • You will build and maintain Grafana dashboards that people actually use, both our team and our customers.

  • You will run load testing, analyze the results, and provide recommendations on resources and scaling.

  • You will investigate incidents, participate in on-call rotations, write and lead postmortems, and ensure the same failure does not happen again.

  • You will work closely with developers: communicate and defend your position, challenge technical decisions, and find win-win solutions.

  • You will develop and support Kubernetes-based infrastructure across our clouds, including GCP and AWS, automate routine work, and help with CI/CD and general team tasks.

  • You will take part in delivering and supporting the platform for customers, including on-prem deployments.

  • You will mentor colleagues and help raise the engineering bar across the team.

What we offer

  • The team has built award-winning AI products for tech corporations: devices, voice assistants, products that are actually in the world

  • Cutting-edge tech stack: Speech Technologies, NLP, Generative AI (LLMs, diffusion models), voice-first agentic architecture with privacy-first, on-premises deployment

  • High engineering bar and real ownership: the team cares about what actually works in production, not what looks good in a demo, and you'll see the impact of your work directly

  • Fast career progression: a senior-heavy team and a high volume of real problems mean you grow faster than you would anywhere else

  • Startup pace with enterprise stability: real clients, real revenue, no bureaucracy

  • Fully remote across Europe

  • 21 vacation days + public holidays + 5 sick days 

  • Private English lessons via Preply

Published on: 6/8/2026

Acclaim

Acclaim is a voice-first AI customer experience (CX) platform purpose-built for regulated industries including banking, fintech, healthcare, and insurance.
