Site Reliability Engineer

Job description

Halifax

 

Our client, a dynamic information technology services company that partners with leading global organizations to deliver innovative, high-quality IT solutions, is looking for a Site Reliability Engineer.

As a Site Reliability Engineer (SRE), you will be responsible for ensuring the reliability, availability, and performance of our mission-critical applications and systems. You will be part of a team that bridges the gap between development and operations, leveraging your deep technical expertise and problem-solving skills to implement best practices in infrastructure automation, monitoring, scaling, and incident response.

As a member of the team, you will mentor and guide more junior SREs, work with cross-functional teams, and drive improvements to systems and processes. A passion for building highly resilient, scalable, and efficient systems is key to success in this role.

The SRE will play a key role in helping Team Leads and Senior SREs bridge the gap between the organization as a customer and the team as a service provider. The SRE is expected to lead, mentor, and inspire people while applying strong technical knowledge to troubleshoot and improve systems.

 

Responsibilities

  • Reliability & Availability:
    • Maintain and improve system reliability, uptime, and performance across production environments.
    • Set and track service-level objectives (SLOs), service-level indicators (SLIs), and service-level agreements (SLAs).
    • Drive improvements in incident response processes, ensuring systems are fault-tolerant and highly available.
  • Automation & Infrastructure:
    • Design, implement, and maintain automation tools to deploy and manage infrastructure at scale.
    • Collaborate with software engineering teams to integrate reliability practices into CI/CD pipelines.
    • Improve the scalability, resilience, and efficiency of cloud infrastructure.
  • Monitoring & Observability:
    • Implement and maintain monitoring systems and alerts to ensure proactive identification of issues.
    • Define key performance metrics and implement logging, monitoring, and alerting solutions across all services and platforms.
  • Incident Management & Root Cause Analysis:
    • Lead and participate in incident response efforts, performing root cause analysis and post-mortems to prevent recurrence.
    • Champion a culture of blameless post-mortems and continuously improve incident response playbooks.
  • Mentorship & Collaboration:
    • Provide technical leadership and mentorship to junior SREs and other team members.
    • Work closely with engineering teams to ensure reliability is a key consideration in application design and development.
    • Foster a culture of collaboration between development, operations, and SRE teams to ensure continuous improvement in service reliability.
  • Continuous Improvement:
    • Advocate for and implement changes to improve performance, reduce toil, and optimize resource utilization.
    • Drive the evolution of operational tooling and processes to enhance the quality of service provided to customers.

 

Qualifications

 

7+ years of experience working as a Site Reliability Engineer is required.

  • Infrastructure Automation & Configuration Management:
    • Proficiency in infrastructure-as-code (IaC) and automation tools such as Terraform, Ansible, and AWX.
    • Knowledge of the KVM hypervisor.
    • Experience with containerization technologies like Docker and Kubernetes.
    • Knowledge of cloud platforms such as AWS (especially S3), Google Cloud, or Azure is a plus.
  • Monitoring & Observability:
    • Hands-on experience with monitoring tools such as Zabbix, Prometheus, Grafana, etc.
    • Strong understanding of logging and tracing technologies (e.g., ELK stack, Fluentd, OpenTelemetry).
    • Experience with Redis and RabbitMQ.
  • Distributed Systems & Networking:
    • Solid understanding of distributed system design principles (CAP theorem, eventual consistency, etc.).
    • Familiarity with network protocols and debugging tools (e.g., TCP/IP, HTTP/HTTPS, DNS, SMTP, load balancing).
    • Solid knowledge of HTTP/HTTPS-related services such as Apache HTTP Server, HAProxy, and HTTP/HTTPS proxies, as well as mail services (Postfix, StrongMail).
    • Familiarity with distributed concepts like GeoDNS and CLB (Cloud Load Balancing).
    • Deep knowledge of operational security (service hardening, WAF, honeypots, SIEMs).
    • Mastery of IPv4 fundamentals (including basic knowledge of routing protocols like BGP and OSPF). IPv6 experience is a plus.
  • Incident Management & Root Cause Analysis:
    • Proven ability to lead post-incident reviews and write detailed post-mortems.
    • Experience with incident management tools.
  • CI/CD & DevOps Practices:
    • Experience with CI/CD tools (e.g., Jenkins, GitLab, GitHub) and implementing continuous integration pipelines.
    • Understanding of DevOps practices for automation, testing, and deployment in production environments.
    • Basic knowledge of DevSecOps practices
  • Scripting & Programming:
    • Proficiency in scripting languages such as Python, Go, or Bash.
    • Familiarity with at least one systems-level programming language (e.g., C++, Rust) is a plus.
  • Performance Tuning & Optimization:
    • Ability to analyze and improve the performance of complex and potentially distributed systems (CPU, memory, I/O, bandwidth).
    • Familiarity with profiling tools and techniques for identifying bottlenecks in production environments.

 

This is a fantastic opportunity to join a growing team. The company offers a competitive compensation package, medical and health benefits, and RRSP matching.

If this sounds like the ideal position for you, apply today!