Senior MLOps Platform Architect (AWS | Kubernetes | Terraform)

Negotiable Salary

Indeed

Full-time

Onsite

No experience limit

No degree limit

Rua da Torrinha 127, 4000-007 Porto, Portugal

Description

We are hiring a senior MLOps/DevOps/SRE hybrid who can build an entire AI platform infrastructure end-to-end. This is not a research role and not a standard ML Engineer role. If you haven’t designed production-grade MLOps infrastructure, haven’t built CI/CD for ML, or haven’t deployed ML workloads on Kubernetes at scale, this role is not a fit.

You will design, build, and own the AWS-based infrastructure, Kubernetes platform, CI/CD pipelines, and observability stack that supports our AI models (Agentic AI, NLU, ASR, Voice Biometrics, TTS). You will be the technical owner of MLOps infrastructure decisions, patterns, and standards.

**Location:** Remote - Europe (PL/ES/PT/CZ/CY)

**Key Responsibilities:**

**MLOps Platform Architecture (from scratch)**

* Design and build AWS-based AI/ML infrastructure using **Terraform (required)**.
* Define standards for security, automation, cost efficiency, and governance.
* Architect infrastructure for ML workloads, GPU/accelerators, scaling, and high availability.

**Kubernetes & Model Deployment**

* Architect, build, and operate production Kubernetes clusters.
* Containerize and productize ML models (Docker, Helm).
* Deploy latency-sensitive and high-throughput models (ASR/TTS/NLU/Agentic AI).
* Ensure GPU and accelerator nodes are properly integrated and optimized.

**CI/CD for Machine Learning**

* Build automated training, validation, and deployment pipelines (GitLab/Jenkins).
* Implement canary, blue-green, and automated rollback strategies.
* Integrate MLOps lifecycle tools (MLflow, Kubeflow, SageMaker Model Registry, etc.).

**Observability & Reliability**

* Implement full observability (Prometheus + Grafana).
* Own uptime, performance, and reliability for ML production services.
* Establish monitoring for latency, drift, model health, and infrastructure health.

**Collaboration & Technical Leadership**

* Work closely with ML engineers, researchers, and data scientists.
* Translate experimental models into production-ready deployments.
* Define best practices for MLOps across the company.

**Qualifications and Skills:**

We’re looking for a senior engineer with a strong DevOps/SRE background who has worked extensively with ML systems in production. The ideal candidate brings a combination of infrastructure, automation, and hands-on MLOps experience.

* **5+ years** in a Senior DevOps, SRE, or MLOps Engineering role supporting production environments.
* Strong experience designing, building, and maintaining **Kubernetes clusters** in production.
* Hands-on expertise with **Terraform** (or similar IaC tools) to manage cloud infrastructure.
* Solid programming skills in **Python or Go** for building automation, tooling, and ML workflows.
* Proven experience creating and maintaining **CI/CD pipelines** (GitLab or Jenkins).
* Practical experience deploying and supporting **ML models** in production (e.g., ASR, TTS, NLU, LLM/Agentic AI).
* Familiarity with ML workflow orchestration tools such as **Kubeflow**, **Apache Airflow**, or similar.
* Experience with experiment tracking and model registry tools (e.g., **MLflow**, **SageMaker Model Registry**).
* Exposure to deploying models on **GPU** or specialized hardware (e.g., **Inferentia**, **Trainium**).
* Solid understanding of cloud infrastructure on **AWS**, including networking, scaling, storage, and security best practices.
* Experience with deployment tooling (Docker, Helm) and observability stacks (Prometheus, Grafana).

**Ways to Know You’ll Succeed**

* You enjoy building platforms from the ground up and owning technical decisions.
* You’re comfortable collaborating with ML engineers, researchers, and software teams to turn research into stable production systems.
* You like solving performance, automation, and reliability challenges in distributed systems.
* You bring a structured, pragmatic, and scalable approach to infrastructure design.
* Energetic and proactive individual, with a natural drive to take initiative and move things forward.
* Enjoys working closely with people: researchers, ML engineers, cloud architects, product teams.
* Comfortable sharing ideas openly, challenging assumptions, and contributing to technical discussions.
* Collaborative mindset: you like to build together, not work in isolation.
* Strong ownership mentality: you enjoy taking responsibility for systems end-to-end.
* Curious, hands-on, and motivated by solving complex technical challenges.
* Clear communicator who can translate technical work into practical recommendations.
* Thrives in a fast-paced environment where you can experiment, improve, and shape how things are done.

**What we offer**

* Competitive fixed compensation based on experience and expertise.
* Work on cutting-edge AI systems used globally.
* Dynamic, multi-disciplinary teams engaged in digital transformation.
* Remote-first work model
* Long-term B2B contract
* 20+ days paid time off
* Apple gear
* Training & development budget

**Our Core values at TheHRchapter**

* Transparency: We believe in transparent and smooth recruitment processes. You will get feedback from us.
* Candidate experience: A blend of automated and humanized recruitment processes. Don't hesitate to ask us for feedback, anytime.
* Talented pool: We bring highly skilled, motivated candidates to our clients. Our candidates match their company values and management style.
* Diversity and inclusion: There is no place for discrimination and intolerance. We care about diversity awareness and respect for any differences.

Source: Indeed