Senior Site Reliability Engineer
We are looking for a Senior SRE to drive the design, implementation, and evolution of our Kubernetes-based platform in a multi-cloud environment (GCP/AWS). At Finom, SREs are not just executors of tasks; they are the architects of reliability.
This role requires strong ownership of reliability, scalability, and platform architecture for high-load, mission-critical systems operating 24/7.
What You Will Be Doing
Lead the Platform Evolution: Design and operate our Kubernetes ecosystem (GKE, multi-cluster) with a focus on high availability and zero-downtime operations.
Build "Paved Roads": Own and evolve our PaaS strategy, using GitOps (ArgoCD) and CI/CD (GitLab) to empower domain teams to deploy independently.
Architect Reliability: Define and implement our observability strategy across metrics, logs, and tracing (Prometheus, VictoriaMetrics, OpenTelemetry).
Drive Infrastructure-as-Code: Lead the automation of our infrastructure using Terraform, ensuring all resources are standardized and version-controlled.
Own the Error Budget: Partner with engineering teams to establish and manage SLOs, SLAs, and incident management frameworks.
Disaster Recovery Mastery: Design and participate in regular DR drills, implementing blue/green and active/passive strategies across regions to ensure service continuity.
Innovate Operations: Proactively apply AI-driven approaches to improve operational efficiency and automate bottleneck detection.
Who You Are
Production K8s Mastery: Strong hands-on experience managing Kubernetes (GKE preferred) in high-load, multi-cluster production environments.
Cloud Infrastructure: Deep experience with GCP (AWS is a strong plus) and Terraform for large-scale infrastructure.
GitOps Expertise: Solid experience with ArgoCD, GitLab CI, and the "Infrastructure as Code" philosophy.
Observability Expert: Deep knowledge of the Prometheus/Grafana stack and experience implementing tracing and logging at scale.
System Design: Proven ability to design highly available 24/7 systems with automated failover and rollback capabilities.
English Fluency: English level B2+ for effective cross-functional communication.
Nice-to-Haves
Compliance Knowledge: Understanding of standards and regulations relevant to banking, such as PCI DSS, GDPR, or ISO 27001.
Distributed Systems: Experience with Kafka (Confluent), RabbitMQ, or managing high-load Redis and PostgreSQL clusters.
AI for Ops: Experience using AI tools to improve alerting, anomaly detection, or engineering efficiency.
Security-Minded: Experience with Vault for secret management and credential rotation.
Our Infrastructure Landscape
Primary Cloud: GCP (~90%)
Orchestration & Deploy: GKE, ArgoCD, GitLab CI
Automation: Terraform
Data & Messaging: PostgreSQL, Kafka, Redis, RabbitMQ
Observability: Prometheus, Grafana, VictoriaMetrics, OpenTelemetry, Cloud Logging
Security: Vault
Published on: 5/27/2026

Finom
Finom is an online payment solution for entrepreneurs that makes it easy to open a business account and securely manage their finances.




