Sr. Site Reliability Engineer
Qode · Austin, TX / Fort Mill, SC (Hybrid)
About The Role
Hybrid · South Carolina, United States · Full-time
Description
Role: Sr. Site Reliability Engineer (SRE) – Unified Observability & AIOps
Location: Austin, TX / Fort Mill, SC (Hybrid)
Job Type: Full Time
Role Summary
We are seeking a Senior SRE with strong expertise in Unified Observability, proactive detection, AIOps, and GenAI-driven operations to support complex, distributed financial services platforms. The role requires hands-on experience designing SLI/SLO-driven monitoring, dynamic thresholds, intelligent alerting, and AI/ML-based anomaly detection across multi-stream architectures.
Key Responsibilities
Observability & Reliability Engineering
- Design and implement unified observability dashboards across metrics, logs, traces, events, and topology
- Define and manage SLIs, SLOs, and error budgets aligned to business outcomes
- Build actionable dashboards for operations, engineering, and leadership
- Implement alerting strategies using static and dynamic thresholds
Proactive Detection & AIOps
- Leverage AI/ML/AIOps to detect anomalies, predict incidents, and reduce MTTR
- Transition monitoring from reactive alerts to proactive insights
- Implement noise reduction, alert correlation, and root cause analysis
- Apply baseline modeling, seasonality detection, and anomaly scoring
Distributed Systems & Dependency Analysis
- Monitor and troubleshoot multi-service architectures involving microservices, downstream APIs, Kafka / streaming platforms, and cloud infrastructure (Terraform, IaC)
- Identify whether issues originate from upstream/downstream dependencies, the streaming platform, infrastructure, or application code
Tooling & Platforms
- Deep hands-on experience with Dynatrace (mandatory)
- Experience with OpenTelemetry, Prometheus / Grafana, ELK / EFK, and cloud-native monitoring (AWS/Azure/GCP)
- Strong JSON-based telemetry manipulation and enrichment
GenAI & LLM Enablement
- Apply GenAI / LLMs for incident summarization, root cause explanation, runbook recommendations, and auto-remediation suggestions
- Collaborate with platform teams to operationalize GenAI safely
Required Skills & Experience
- ✅ 15+ years in SRE / Production Engineering
- ✅ Strong Unified Observability background (not infra-only)
- ✅ Hands-on Dynatrace experience (metrics, traces, logs, Davis AI)
- ✅ SLI/SLO engineering experience in production systems
- ✅ Experience implementing dynamic thresholds and anomaly detection
- ✅ Knowledge of AI/ML concepts applied to Ops (AIOps)
- ✅ Distributed systems troubleshooting expertise
- ✅ Experience with Kafka or streaming data platforms
Differentiators (Highly Valued)
- Experience in financial services or regulated environments
- Proven reduction of alert noise and MTTR using AIOps
- GenAI / LLM integration into operations workflows
Qode is dedicated to helping technical talent around the world find meaningful careers that match their skills and interests. Our platform provides a range of resources and tools that empower job seekers to take control of their careers and connect with top employers across a variety of industries. We believe that every individual deserves to find work that they're passionate about, and we are committed to making that vision a reality.
Qode's team of experienced professionals is passionate about creating a better world of work by providing innovative solutions that improve the job search process for both job seekers and employers. We believe in transparency, trust, and collaboration, and we strive to build strong relationships with our customers and partners. Through our platform, we aim to create a more engaged and fulfilled global workforce that drives innovation and growth.
This listing was posted by a verified recruiter at Qode.
JobSpring