About the Company
AGENTIC SOFTWARE, INC. is a U.S.-based company specializing in AI-powered software that helps businesses modernize and automate their operations. We are expanding our engineering team to strengthen the reliability and resilience of our platforms.
We are seeking an experienced Lead DevOps Engineer to join us remotely. The ideal candidate is a leader who can drive observability improvements, guide incident response, and mentor engineers to build systems with resilience in mind.
What You'll Do
- Design and enforce reliability standards (SLIs, SLOs) and error budgets across engineering teams.
- Architect, implement, and maintain observability systems (monitoring, logging, alerting) that give teams clear visibility into system health.
- Build and maintain infrastructure automation using modern infrastructure-as-code tooling (Terraform, CloudFormation).
- Lead incident response, conduct postmortems, and drive preventive improvements that reduce operational friction.
- Collaborate with engineering leadership to guide system architecture decisions that improve resilience, fault tolerance, and scalability.
- Mentor junior and mid-level engineers, fostering operational excellence and best practices.
What We're Looking For
- 5+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering.
- Strong proficiency in scripting languages (e.g., Python, Bash) and infrastructure-as-code tools (Terraform).
- Deep familiarity with observability ecosystems such as Prometheus, ELK/EFK, or equivalent.
- Hands-on experience with cloud platforms (AWS preferred) and containerized environments.
- Demonstrated ability to lead incident management efforts and influence reliability strategy.
- Upper-intermediate English (CEFR B2 or equivalent) is required for effective communication with our distributed team.
- Bachelor's degree in Computer Science or equivalent professional experience.
Nice to Have
- Experience with chaos engineering or advanced resiliency testing.
- Background in designing fault-tolerant distributed systems.
- Solid understanding of CI/CD systems and deployment automation.
Position Details
- Employment Type: Full-time Contractor (Long-term collaboration).
- Compensation: $35–$45 USD per hour, based on experience.
- Engagement: Indefinite, with a focus on high-impact initiatives.