Job description

Site Reliability Engineer

Halifax

Our client, a dynamic Information Technology services company that partners with leading global organizations to deliver innovative, high-quality IT solutions, is looking for a Site Reliability Engineer.

As a Site Reliability Engineer (SRE), you will be responsible for ensuring the reliability, availability, and performance of our mission-critical applications and systems. You will be part of a team that bridges the gap between development and operations, leveraging your deep technical expertise and problem-solving skills to implement best practices in infrastructure automation, monitoring, scaling, and incident response.

As a member of the team, you will mentor and guide more junior SREs, work with cross-functional teams, and drive improvements to systems and processes. A passion for building highly resilient, scalable, and efficient systems is key to success in this role.

The SRE will play a key role in helping Team Leads and Senior SREs bridge the gap between the organization as a customer and the team as a service provider. The SRE is expected to lead, mentor, and inspire people while applying superb technical knowledge to troubleshoot and improve systems.

Responsibilities

Reliability & Availability:

Maintain and improve system reliability, uptime, and performance across production environments.
Set and track service-level objectives (SLOs), service-level indicators (SLIs), and service-level agreements (SLAs).
Drive improvements in incident response processes, ensuring systems are fault-tolerant and highly available.

Automation & Infrastructure:

Design, implement, and maintain automation tools to deploy and manage infrastructure at scale.
Collaborate with software engineering teams to integrate reliability practices into CI/CD pipelines.
Improve the scalability, resilience, and efficiency of cloud infrastructure.

Monitoring & Observability:

Implement and maintain monitoring systems and alerts to ensure proactive identification of issues.
Define key performance metrics and implement logging, monitoring, and alerting solutions across all services and platforms.

Incident Management & Root Cause Analysis:

Lead and participate in incident response efforts, performing root cause analysis and post-mortems to prevent recurrence.
Champion a culture of blameless post-mortems and continuously improve incident response playbooks.

Mentorship & Collaboration:

Provide technical leadership and mentorship to junior SREs and other team members.
Work closely with engineering teams to ensure reliability is a key consideration in application design and development.
Foster a culture of collaboration between development, operations, and SRE teams to ensure continuous improvement in service reliability.

Continuous Improvement:

Advocate for and implement changes to improve performance, reduce toil, and optimize resource utilization.
Drive the evolution of operational tooling and processes to enhance the quality of service provided to customers.

Qualifications

7+ years’ experience working as a Site Reliability Engineer is required.

Infrastructure Automation & Configuration Management:

Proficiency in infrastructure-as-code (IaC) and automation tools such as Terraform, Ansible, and AWX.
Knowledge of the KVM hypervisor.
Experience with containerization technologies like Docker and Kubernetes.
Knowledge of cloud platforms such as AWS (especially S3), Google Cloud, or Azure is a plus.

Monitoring & Observability:

Hands-on experience with monitoring tools such as Zabbix, Prometheus, Grafana, etc.
Strong understanding of logging and tracing technologies (e.g., ELK stack, Fluentd, OpenTelemetry).
Experience with Redis and RabbitMQ.

Distributed Systems & Networking:

Solid understanding of distributed system design principles (CAP theorem, eventual consistency, etc.).
Familiarity with network protocols, load balancing, and network debugging tools (e.g., TCP/IP, HTTP/HTTPS, DNS, SMTP).
Solid knowledge of HTTP/HTTPS-related services such as Apache HTTP Server, HAProxy, and HTTP/HTTPS proxies, as well as mail services (Postfix, StrongMail).
Familiarity with distributed concepts such as GeoDNS, CLB (Cloud Load Balancing), etc.
Deep knowledge of operational security (service hardening, WAFs, honeypots, SIEMs).
Mastery of IPv4 fundamentals (including basic knowledge of routing protocols such as BGP and OSPF). IPv6 experience is a plus.

Incident Management & Root Cause Analysis:

Proven ability to lead post-incident reviews and write detailed post-mortems.
Experience with incident management tools.

CI/CD & DevOps Practices:

Experience with CI/CD tools (e.g., Jenkins, GitLab, GitHub) and implementing continuous integration pipelines.
Understanding of DevOps practices for automation, testing, and deployment in production environments.
Basic knowledge of DevSecOps practices.

Scripting & Programming:

Proficiency in scripting languages such as Python, Go, or Bash.
Familiarity with at least one systems-level programming language (e.g., C++, Rust) is a plus.

Performance Tuning & Optimization:

Ability to analyze and improve the performance of complex and potentially distributed systems (CPU, memory, I/O, bandwidth).
Familiarity with profiling tools and techniques for identifying bottlenecks in production environments.

This is a fantastic opportunity to join a growing team. The company offers a competitive compensation package, medical and health benefits, and RRSP matching.

If this sounds like the ideal position for you, apply today!

Consultant

Kellie

Principal Consultant
