The Senior Site Reliability Engineer will be responsible for ensuring the reliability, performance, and scalability of our blockchain platform. The ideal candidate will have extensive experience with Kubernetes, CI/CD, Terraform, and public cloud providers (AWS & IBM). This role involves collaborating with engineering teams, implementing robust infrastructure solutions, and driving continuous improvement in our operations.
Responsibilities
- Guidance and Mentorship: Provide technical guidance and mentorship to engineers, fostering a culture of learning and collaboration.
- Decision Making: Assist stakeholders in making informed technical decisions that align with best practices and business goals.
- Knowledge Sharing: Actively share knowledge and expertise with team members to enhance overall team capability.
- Kubernetes: Manage and optimize Kubernetes clusters to ensure high availability, performance, and scalability.
- CI/CD Pipelines: Design, implement, and maintain continuous integration and continuous deployment pipelines to streamline development and deployment processes.
- Terraform: Use Terraform for infrastructure as code, ensuring consistent and repeatable infrastructure deployments.
- AWS & IBM: Leverage public cloud services from AWS and IBM to build and maintain scalable and resilient infrastructure solutions.
- Monitoring and Optimization: Implement monitoring and alerting systems to proactively manage and optimize cloud infrastructure.
- Incident Management: Lead incident response efforts to quickly diagnose and resolve reliability and performance issues.
- Continuous Improvement: Identify areas for improvement in infrastructure and operations, implementing solutions to enhance reliability and efficiency.
- Security: Ensure infrastructure security best practices are followed and proactively address potential vulnerabilities.
- Cross-functional Collaboration: Work closely with development, product, and operations teams to ensure alignment and effective communication.
- Stakeholder Engagement: Engage with stakeholders to understand their needs and translate them into technical requirements and solutions.
- Documentation: Maintain comprehensive documentation of infrastructure, processes, and procedures to ensure knowledge transfer and operational continuity.
Requirements
- Experience: 5-15 years in site reliability engineering, DevOps, or a related field.
- Technical Skills: Proficiency in Kubernetes, CI/CD, Terraform, and public cloud providers (AWS & IBM).
- Soft Skills: Strong communication skills and a low-ego, approachable demeanor.
- Problem-Solving: Excellent problem-solving skills and the ability to work independently as a self-starter.
- Education: Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
Benefits
- Fully remote, work-from-home environment
- Flexible working hours
- Paid time off
- Periodic in-person offsites globally (travel permitting)
- Long-term incentive programs
- Continued education support
- Advancement opportunities