Site Reliability Engineer (SRE)

Europe, Armenia, Cyprus, Georgia, United Kingdom, Remote, Hybrid, Senior

London, Limassol, Tbilisi, Yerevan

Join Gamingtec as a Site Reliability Engineer and power high-performance iGaming systems with automation, observability, and 24/7 reliability — all in a fully remote, flexible, and rewarding environment.

We are expanding the engineering team responsible for ensuring the stability and predictable behaviour of our distributed services and platforms. The role involves working with production infrastructure, analysing system behaviour, and implementing practices that improve reliability across multiple platforms.

This position is intended for engineers who clearly understand the difference between SRE and DevOps practices, and for whom SLOs, error budgets, and availability targets such as 99.85–99.95% are practical tools rather than abstract concepts.

You will work as part of an SRE shift schedule covering late-evening and night hours (17:00–01:00 and 00:00–08:00 CET, in rotation) to ensure end-to-end ownership of incidents, from user impact to root cause and follow-up improvements.

All you need is:

Core skills:

Strong Linux skills in production environments (debugging, performance, system services);
Solid understanding of networking (TCP/IP, DNS, HTTP, load balancing, TLS);
Hands-on experience operating Kubernetes in production (not just local clusters);
Experience with AWS cloud services (for example: EC2, ALB/NLB, RDS, S3, IAM, EKS or self-managed Kubernetes);
Confident use of Terraform and Ansible in real environments (multi-environment IaC, reusable modules/roles);
Experience with observability tools:

metrics and alerting (Prometheus/Alertmanager or similar),
dashboards (Grafana or similar),
logging (ELK stack, Loki or comparable solutions).

Ability to troubleshoot across application, network, and infrastructure layers, using scripting and tools (Python/Go/Bash, curl, tcpdump, log analysis, etc.);
Experience with containers and image lifecycle (Docker or compatible runtimes).

Experience:

Participation in production incidents and technical post-incident reviews (not just on-call escalation);
2–5 years of practical experience in SRE, infrastructure, platform or production-focused DevOps engineering;
Experience working within CI/CD pipelines (for example: Jenkins, GitLab CI, GitHub Actions, ArgoCD or similar);
Exposure to environments with high availability requirements (e.g. low tolerance to downtime, strict SLAs/SLOs);
Availability to work between 5 PM and 8 AM CET, in one of the following shifts: 17:00–01:00 or 00:00–08:00.

Also, it will be great if you have:

Experience with high-load or real-time systems (payments, finance, gaming, streaming);
Experience with CDNs or real-time log aggregation/analytics;
Familiarity with databases and message systems (for example: PostgreSQL, MySQL, MongoDB, Kafka, Redis, RabbitMQ);
Experience with external integrations and third-party APIs (payment providers, KYC, risk/anti-fraud, content providers);
Experience with service meshes, API gateways or ingress controllers (Istio, Linkerd, NGINX, Envoy, etc.).

Your daily adventures will look like:

Contributing to architectural changes affecting the reliability and scalability of services and platforms;
Operating and improving Kubernetes clusters (cluster model, networking, ingress, load balancing);
Working with AWS-based environments (networking, storage, compute, managed services);
Managing infrastructure using Terraform and configuration management with Ansible;
Developing and refining monitoring and observability across platforms (Prometheus, Alertmanager, Grafana, and log aggregation such as ELK / Loki);
Participating in incident handling: initial classification, technical investigation, coordination with product/engineering teams, and follow-up improvements;
Reducing operational toil and building tools that support reliability and efficiency (internal utilities, automation, CI/CD improvements);
Collaborating with development teams to embed SRE practices into the lifecycle of services (SLIs/SLOs, error budgets, readiness for production).

Success Metrics:

Maintain and improve SLOs for key services in the 99.85–99.95% availability range, with clear SLIs and error budgets;
Keep unplanned downtime below 1% for critical user-facing functionality;
Ensure that the majority of infrastructure and platform configuration (target ≥ 90–95%) is managed as code (Terraform, Ansible, Kubernetes manifests/Helm charts);
Systematically reduce MTTR (Mean Time To Recovery) for incidents by improving detection, diagnostics and standard operating procedures;
Prevent repeated high-severity incidents by driving post-incident reviews and concrete follow-up actions (configuration changes, automation, runbooks, architectural adjustments);
Maintain up-to-date operational documentation and runbooks for core services, so that incidents can be handled consistently across the team.
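
To make the availability range above concrete, here is a minimal sketch of the error-budget arithmetic behind an availability SLO. The 30-day window and the function names are assumptions chosen for illustration; the posting does not state how Gamingtec measures its SLOs.

```python
# Illustrative sketch only: the window length and helper names are assumptions.

def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the error budget (minutes of downtime) an availability SLO permits."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be between 0 and 100 (exclusive)")
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


def budget_remaining_fraction(
    slo_percent: float, downtime_minutes: float, window_days: int = 30
) -> float:
    """Return the share of the error budget still unspent (negative if overspent)."""
    budget = allowed_downtime_minutes(slo_percent, window_days)
    return 1.0 - downtime_minutes / budget


if __name__ == "__main__":
    for slo in (99.85, 99.95):
        minutes = allowed_downtime_minutes(slo)
        print(f"SLO {slo}%: {round(minutes, 1)} min of downtime per 30 days")
```

Under these assumptions, a 99.85% target leaves about 64.8 minutes of downtime per 30 days and a 99.95% target about 21.6 minutes, which is why a single long incident can consume most of a month's budget.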

Published on: 1/15/2026

Gamingtec

B2B iGaming platform developer: Casino, Sportsbook, Payments Gateway, Affiliate & more. Complete software solutions.
