Engineer IV, Software – SRE
Required Knowledge and Skills:
- Implement the tools and processes needed to meet the required SLOs for this company's platform.
- Define and implement CI/CD pipelines.
- Automate delivery of platform services using infrastructure as code. Build self-service playbooks for the platform that globally distributed teams at this company can use.
- Define and implement the incident response management process and deploy the necessary tools.
- Resolve support and escalation issues.
- Conduct post-incident reviews.
- Collaborate with application and business stakeholders to ensure a high-quality product is developed and deployed to production. Work diligently with other engineering teams to ratify the release processes needed to meet business goals.
- Drive the continuous improvement process.
- Expert knowledge of one of the major public cloud platforms (Azure, AWS, GCP).
- Hands-on programming experience in Python or other object-oriented programming languages.
- Expert knowledge of infrastructure and application monitoring tools: Prometheus, Grafana, Datadog, etc.
- Experience implementing IaC concepts using Terraform, Chef, or Puppet.
- Experience with Elasticsearch and Kibana.
- Experience administering databases.
- Expert in Linux administration.
- Expert knowledge of Docker and Helm.
- Experience implementing CI/CD for cloud native applications.
- Experience deploying applications that use a service mesh.
- Experience administering Kubernetes clusters.
- Experience defining and implementing incident response management processes.
Preferred Knowledge and Skills:
- Bachelor’s degree
- 8+ years’ experience in software engineering
- Master’s degree
- Understanding of GitOps principles.
- Experience implementing secure and compliant Kubernetes platforms.
- Experience deploying and managing stateful distributed services in Kubernetes.
- Experience with security scanning tools.
- Experience with intrusion detection systems.
- Experience with various messaging systems, such as Kafka or RabbitMQ.
- Working knowledge of Databricks, Team Foundation Server, TeamCity, Octopus Deploy, and Datadog.
- Corporate office/lab environment.
- Ability to travel 10% of the time.
- Position can be remote.