Site-Reliability Engineer

TechDigital

Scottsdale, AZ

JOB DETAILS
SKILLS
Ansible, Application Performance Management, Application Programming Interface (API), Artificial Intelligence (AI), Caching, Cloud Computing, Continuous Deployment/Delivery, Continuous Integration, DNS (Domain Name System), Debugging Skills, Distributed Computing, GCP (Google Cloud Platform), GitHub, GraphQL, HTTP (Hypertext Transfer Protocol), High Availability, Home Automation, Identify Issues, Java, Linux Administration, Linux Operating System, Load Balancing, Memory Hardware, Microsoft SQL Server, Microsoft Windows System Administration, Network Protocols, Node.js, Oracle Database, PostgreSQL, Programming Languages, Python Programming/Scripting Language, Redis, Reliability Engineering, Reporting Dashboards, Rust Programming Language, SQL (Structured Query Language), Scripting (Scripting Languages), Software Administration, Splunk, TCP/IP (Transmission Control Protocol/Internet Protocol), Time Tracking, Transaction Processing/Management
LOCATION
Scottsdale, AZ
POSTED
1 day ago
Required Skills:
· Service reliability/operation experience running large-scale, high-performance applications in a hybrid environment (on-prem and cloud).
· Experience writing automation scripts and building application performance management (APM) dashboards to track transaction journeys.
· Experience working with programming languages such as Go, Python, Java, and Rust.
· Working knowledge of one or more databases: Oracle, SQL Server, Redis, ClickHouse, PostgreSQL, MongoDB, or any time-series database.
· Experience transitioning platforms to the cloud and containerizing them (GCP and Rancher).
· Experience maintaining containerized applications in GKE/RKE/AKS environments.
· Experience implementing cloud observability using OpenTelemetry (OTel) to enable real-time monitoring, distributed tracing, and incident resolution.
· Experience working with a GraphQL framework (Apollo, Prisma, Hasura, etc.).
· Experience applying knowledge of networking protocols and concepts such as TCP/IP, HTTP, DNS, load balancing, and service mesh to troubleshoot issues in high-pressure situations.

Preferred Skills:
· Proven experience managing application availability, building creative solutions to automate repetitive activities, and improving gating and detection for applications at every touchpoint on a 24x7 high-availability platform exposed to critical clients and customers.
· Working knowledge of monitoring tools: Splunk, AppDynamics, Grafana/Prometheus, and Dynatrace.
· Experience with tools like Rally, Confluence and other CI/CD extenders.
· Hands-on experience implementing in-memory caching solutions. Experience with Redis is a plus.
· Excellent debugging skills across a variety of integrated technical platforms on an API gateway.
· Hands-on experience with GCS, Cloud SQL, Spanner, and Firestore.
· Extensive experience in enterprise-level infrastructure and operations.
· Experience with high-availability and distributed systems, and with Linux and Windows administration, troubleshooting, and support.
· Experience monitoring and troubleshooting HashiCorp Vault environments, ensuring minimal downtime and rapid recovery from incidents.
· Working knowledge of Vertex AI, generative AI, and BigQuery.

Mandatory Skills:
· Google Cloud Platform (GCP), containerization, and Kubernetes
· Infrastructure as Code (Terraform), CI/CD (GitHub Actions), and Helm
· Automation and scripting using Python, Ansible, and Node.js
· Monitoring and observability with Prometheus and Grafana
· Linux systems and troubleshooting

About the Company

TechDigital

COMPANY SIZE
100 to 499 employees
INDUSTRY
Other/Not Classified