Senior DevOps Engineer
Deepchecks
About the job
We’re looking for a Senior DevOps Engineer to join our team! If you are an experienced DevOps engineer with a proven track record of building and maintaining robust infrastructure at scale, we want to hear from you. You will play a crucial role in enabling Deepchecks to scale its product into new markets, collaborating with our brilliant team to deliver groundbreaking capabilities in AI evaluation and observability.
What you will do
You will architect and maintain the infrastructure that powers our LLM evaluation and monitoring platform. You’ll be responsible for designing and implementing scalable solutions that can handle large-scale ML inference workloads in both cloud and on-premises environments. This includes building and optimizing our Kubernetes clusters, implementing robust monitoring solutions, and ensuring high availability of our services. You’ll work closely with our engineering team to streamline development and deployment processes. This role offers a unique opportunity to work with cutting-edge LLM technologies while building enterprise-grade infrastructure solutions.
We’ll be going through a lot together, so we want your character and mindset to be a good fit for a fast-moving startup.
You’re a great match if:
- You have 5+ years of experience in DevOps roles.
- Kubernetes & Helm: You’ve built production-grade k8s clusters from scratch and have experience with container orchestration in production environments.
- AWS Infrastructure: You’ve designed and implemented AWS deployments with complex networking, IAM, and significant infra (EKS, DBs, queues, and more).
- Infrastructure as Code: You have expertise in creating modular Terraform codebases managing multiple environments.
- Cross-Cloud: You’ve built and maintained systems that operate across cloud and on-premises environments, managing the complexity of different deployment models.
- Scripting: You have extensive, hands-on experience with Bash and other scripting languages.
- Production Ownership: You’ve demonstrated end-to-end ownership of production systems, including monitoring, incident response, and proactive improvement of operational procedures.
- Team Player: You’ve worked in cross-functional teams and have been an enabler for everyone around you.
Nice to Have:
- Experience with GitHub Actions CI/CD, Datadog, and on-premises monitoring solutions
- Hands-on experience with Redis, RDS, and GPU infrastructure for ML inference
- Experience with SecOps: cloud and application security, vulnerability handling, and compliance (SOC 2, ISO, etc.)
- Previous startup experience or experience building DevOps from scratch