- Posted 07 November 2025
- Location: Halifax
- Job type: Permanent
- Discipline: Information Technology
- Reference: 56393
Site Reliability Engineer
Job description
Our client, a dynamic Information Technology services company that partners with leading global organizations to deliver innovative, high-quality IT solutions, is looking for a Site Reliability Engineer.
As a Site Reliability Engineer (SRE), you will be responsible for ensuring the reliability, availability, and performance of our mission-critical applications and systems. You will be part of a team that bridges the gap between development and operations, leveraging your deep technical expertise and problem-solving skills to implement best practices in infrastructure automation, monitoring, scaling, and incident response.
As a member of the team, you will mentor and guide more junior SREs, work with cross-functional teams, and drive improvements to systems and processes. A passion for building highly resilient, scalable, and efficient systems is key to success in this role.
The SRE will play a key role in helping Team Leads and Senior SREs bridge the gap between the organization as a customer and the team as a service provider. The SRE is expected to lead, mentor, and inspire people while applying strong technical knowledge to troubleshoot and improve systems.
Responsibilities
- Reliability & Availability:
- Maintain and improve system reliability, uptime, and performance across production environments.
- Set and track service-level objectives (SLOs), service-level indicators (SLIs), and service-level agreements (SLAs).
- Drive improvements in incident response processes, ensuring systems are fault-tolerant and highly available.
- Automation & Infrastructure:
- Design, implement, and maintain automation tools to deploy and manage infrastructure at scale.
- Collaborate with software engineering teams to integrate reliability practices into CI/CD pipelines.
- Improve the scalability, resilience, and efficiency of cloud infrastructure.
- Monitoring & Observability:
- Implement and maintain monitoring systems and alerts to ensure proactive identification of issues.
- Define key performance metrics and implement logging, monitoring, and alerting solutions across all services and platforms.
- Incident Management & Root Cause Analysis:
- Lead and participate in incident response efforts, performing root cause analysis and post-mortems to prevent recurrence.
- Champion a culture of blameless post-mortems and continuously improve incident response playbooks.
- Mentorship & Collaboration:
- Provide technical leadership and mentorship to junior SREs and other team members.
- Work closely with engineering teams to ensure reliability is a key consideration in application design and development.
- Foster a culture of collaboration between development, operations, and SRE teams to ensure continuous improvement in service reliability.
- Continuous Improvement:
- Advocate for and implement changes to improve performance, reduce toil, and optimize resource utilization.
- Drive the evolution of operational tooling and processes to enhance the quality of service provided to customers.
Qualifications
7+ years' experience working as a Site Reliability Engineer is required.
- Infrastructure Automation & Configuration Management:
- Proficiency in infrastructure-as-code (IaC) and automation tools such as Terraform, Ansible, and AWX.
- Knowledge of the KVM hypervisor.
- Experience with containerization technologies like Docker and Kubernetes.
- Knowledge of cloud platforms such as AWS (especially S3), Google Cloud, or Azure is a plus.
- Monitoring & Observability:
- Hands-on experience with monitoring tools such as Zabbix, Prometheus, Grafana, etc.
- Strong understanding of logging and tracing technologies (e.g., ELK stack, Fluentd, OpenTelemetry).
- Experience with Redis and RabbitMQ.
- Distributed Systems & Networking:
- Solid understanding of distributed system design principles (CAP theorem, eventual consistency, etc.).
- Familiarity with network protocols and concepts (e.g., TCP/IP, HTTP/HTTPS, DNS, SMTP, load balancing) and the tools used to debug them.
- Solid knowledge of HTTP/HTTPS-related services such as Apache HTTP Server, HAProxy, and HTTP/HTTPS proxies, as well as mail services (Postfix, StrongMail).
- Familiarity with distributed-systems concepts such as GeoDNS and Cloud Load Balancing (CLB).
- Deep knowledge of operational security (service hardening, WAFs, honeypots, SIEMs).
- Mastery of IPv4 fundamentals, including basic knowledge of routing protocols such as BGP and OSPF. IPv6 experience is a plus.
- Incident Management & Root Cause Analysis:
- Proven ability to lead post-incident reviews and write detailed post-mortems.
- Experience with incident management tools.
- CI/CD & DevOps Practices:
- Experience with CI/CD tools (e.g., Jenkins, GitLab, GitHub) and implementing continuous integration pipelines.
- Understanding of DevOps practices for automation, testing, and deployment in production environments.
- Basic knowledge of DevSecOps practices.
- Scripting & Programming:
- Proficiency in scripting languages such as Python, Go, or Bash.
- Familiarity with at least one systems-level programming language (e.g., C++, Rust) is a plus.
- Performance Tuning & Optimization:
- Ability to analyze and improve the performance of complex and potentially distributed systems (CPU, memory, I/O, bandwidth).
- Familiarity with profiling tools and techniques for identifying bottlenecks in production environments.
This is a fantastic opportunity to join a growing team. The company offers a competitive compensation package, medical & health benefits and RRSP matching.
If this sounds like the ideal position for you, apply today!