Requirements & Qualifications
- Bachelor’s degree in Software Engineering, Computer Science, Information Technology, Project Management, or a related field.
- Minimum of 8 years of working experience in this or a similar role.
- Strong knowledge of Linux/Unix systems and administration.
- Strong knowledge of cloud infrastructure and services, particularly AWS.
- Experience with containerization and orchestration technologies such as Docker and Kubernetes.
- Experience with automation and configuration management tools such as AWS CDK, Terraform, Ansible, Puppet, or Chef.
- Experience with monitoring and logging tools such as Prometheus, Grafana, ELK or equivalent.
- Experience implementing observability platforms using product suites such as Datadog, New Relic, ELK, or Prometheus.
- Familiarity with build tools such as GitLab CI, Travis CI, or equivalent.
- Strong scripting skills in languages such as Bash, Python, or Ruby.
- Experience with networking concepts and protocols.
- Experience with database management and administration.
- Familiarity with service-mesh technologies such as Istio and Linkerd.
- Experience with modern cloud development practices (microservices architectures, REST interfaces, etc.).
- Experience with source code management using Git.
- Deep hands-on technical expertise.
- Strong understanding of software development methodologies and principles.
- Strong problem-solving and analytical skills.
- Good communication and collaboration skills.
Responsibilities
- Designing, implementing, and maintaining high-availability systems.
- Proactively monitoring and troubleshooting production systems to identify and resolve issues.
- Creating and maintaining automated systems for deployment, scaling, and monitoring.
- Managing incident response and conducting post-mortem analysis to prevent similar issues from recurring.
- Collaborating with development teams to resolve any production issues in a timely manner.
- Continuously improving the performance, scalability, and reliability of the managed systems.
- Participating in the on-call rotation for incident response.
- Developing and maintaining documentation for processes and procedures.