Middle/High-Middle DevOps / SRE Engineer

Indeed
Full-time
Onsite
No experience limit
No degree limit
PV49+C7 Lisbon, Portugal

Description

Summary: Seeking a Middle/High-Middle DevOps/SRE Engineer to enhance and operate a high-load production platform in GCP/GKE, focusing on reliability, monitoring, incident response, and cost optimization.

Highlights:

1. Opportunity to impact reliability, scalability, and developer velocity.
2. Small team with ownership, autonomy, and quick iteration.
3. Strong growth potential into platform ownership and SRE leadership.

We’re looking for a **Middle/High-Middle DevOps / SRE Engineer** to help run and improve our production platform in GCP + GKE, fronted by Cloudflare, with observability in Datadog and CI/CD in GitHub Actions.

You’ll work closely with Senior/Principal engineers, implementing reliability improvements, expanding monitoring coverage, and reducing operational toil, which is especially important in a high-load system with sudden traffic spikes.

**Role Responsibilities**
-------------------------

* **Platform Operations (GCP/GKE)**
    * Operate and support production systems on **GCP**, primarily **GKE** and managed services.
    * Execute platform improvements and operational tasks delegated by Senior/Principal owners.
* **IaC & Delivery Enablement**
    * Implement infrastructure changes via **Terraform** (and **Terragrunt** where used).
    * Maintain and evolve **Helm charts** and Kubernetes manifests.
    * Improve reliability of **GitHub Actions / CI/CD** workflows and deployment automation.
* **Observability & Monitoring (Datadog)**
    * Build and maintain Datadog dashboards/monitors and keep alerting healthy.
    * Close monitoring gaps across critical components; reduce noisy alerts and improve signal quality.
* **Incident Response**
    * Participate in incident response and operational support: triage, mitigation using runbooks, escalation, and follow-up fixes.
    * Contribute to postmortems with clear facts, timelines, and actionable remediation tasks.
* **Security Basics (DevSecOps)**
    * Run/configure security tooling and monitoring, help triage findings, and implement fixes under guidance.
    * Support secure-by-default practices (secrets hygiene, access controls, baseline hardening).
* **Cost Awareness**
    * Identify and implement cost optimizations (right-sizing, waste removal, efficiency improvements) without harming reliability.

**Required Qualifications**
---------------------------

* Hands-on production experience with **Kubernetes** (ideally **GKE**) and basic cluster operations.
* Working experience with **Terraform** and **Helm** in PR-based workflows.
* Familiarity with GCP services used in SaaS operations (e.g., **Cloud SQL, BigQuery, Bigtable, Pub/Sub, Cloud Run, Memorystore**).
* Monitoring/alerting and troubleshooting skills (preferably **Datadog**).
* Strong scripting/automation mindset to reduce manual work and prevent repetitive incidents.
* Reliability awareness: understanding how changes affect availability/latency and how to operate under SLA constraints.

**Preferred Qualifications**
----------------------------

* Cloudflare basics (WAF/DNS, edge concepts; Workers/CDN is a plus).
* Experience writing/maintaining runbooks and participating in postmortems.
* Exposure to **SOC 2 / PCI-DSS** requirements or willingness to learn.
* Experience in high-load consumer products or game development.

**What Success Looks Like**
---------------------------

* Improved monitoring coverage and healthier alerting (less noise, faster detection).
* Faster, safer deployments with fewer manual steps and fewer production regressions.
* Incidents are triaged effectively and resolved within expected timelines.
* Platform reliability improves through steady delivery of operational fixes and automation.
* Costs trend in the right direction thanks to recurring optimizations and guardrails.

**Why Join Us**
---------------

* Cloud-only, high-load environment with real engineering challenges (not “just keep the lights on”).
* Small team with ownership, autonomy, and quick iteration.
* Strong opportunity to grow into broader platform ownership and SRE leadership paths.
* Direct impact on reliability, scalability, and developer velocity.

***Aghanim helps game developers achieve financial and creative independence by providing the solutions they need to launch, run, and grow their businesses.***

Source: Indeed
João Santos
Indeed · HR

Company

Indeed
© 2025 Servanan International Pte. Ltd.