DevOps SRE

eTeam Inc.

Coppell, TX

JOB DETAILS
SALARY
$75–$82.14 Per Hour
SKILLS
Amazon Web Services (AWS), Analysis Skills, Applications Security, Artificial Intelligence (AI), Automation, Banking Services, Best Practices, Capacity Analysis, Capacity Management, Capacity and Performance Management, Capital Markets, Cloud Computing, Computer Science, Consulting, Continuous Deployment/Delivery, Continuous Improvement, Continuous Integration, DevOps, Disaster Recovery, Docker, Documentation, Enterprise Applications, Financial Services, GCP (Google Cloud Platform), Go Programming Language (Golang), High Availability, Hybrid Cloud, Identify Issues, Improvement Metrics, Incident Management, Incident Response, Information Technology & Information Systems, Java, Knowledge Base, Knowledge Management, Machine Tool, Mentoring, Messaging Technology, Metrics, Microsoft Azure, Operational Audit, Operations Processes, Performance Engineering, Performance Metrics, Performance Testing, Process Improvement, Production Support, Production Systems, Project/Program Management, Protective Services, Python Programming/Scripting Language, Release Management/Engineering, Reliability Engineering, Reporting Dashboards, Risk, Risk Analysis, Risk Management, Root Cause Analysis, Scripting (Scripting Languages), Software Administration, Software Engineering, Splunk, Technical Support, Time Management, Unix Shell Programming
LOCATION
Coppell, TX
POSTED
17 days ago
Job Title: Application Support Engineer
Location: Dallas, TX / Jersey City, NJ (Hybrid)
Shift Schedule: Monday – Friday (9am – 5pm)
Type: Contract-to-Hire (CTH)
Duration: 6 months

Position Overview
We are seeking an experienced Application Support Engineer with a strong background in Site Reliability Engineering (SRE), DevOps, Production Support, Observability, Automation, and Cloud Technologies. This role is responsible for the reliability, scalability, performance, and operational readiness of enterprise applications running in production environments.
The ideal candidate will work closely with Development, Infrastructure, Release Management, Risk, Security, and Application Support teams to drive operational excellence through automation, monitoring, resiliency engineering, and continuous improvement initiatives.
This position requires a hands-on engineer who can proactively identify risks, improve application recovery capabilities, lead incident management efforts, and champion SRE best practices across the organization.

Key Responsibilities
Site Reliability & Operational Excellence
  • Partner with development teams during design reviews, Sprint Zero, and delivery planning to ensure Non-Functional Requirements (NFRs) are incorporated into solutions.
  • Drive application resiliency, fault tolerance, disaster recovery, observability, and high availability initiatives.
  • Ensure proper support for holiday processing, special business events, and critical operational activities.
Release & Deployment Readiness
  • Collaborate with Major Release Management teams to ensure production releases meet organizational SRE standards.
  • Validate deployment readiness through monitoring, observability, resiliency, and recovery testing.
  • Ensure all releases include proper documentation, runbooks, and knowledge base articles.
Monitoring & Observability
  • Design, implement, and optimize monitoring and alerting solutions.
  • Define and manage:
    • Service Level Indicators (SLIs)
    • Service Level Objectives (SLOs)
    • Operational dashboards
    • Alerting frameworks
  • Utilize AI/ML-based analytics for:
    • Anomaly detection
    • Incident correlation
    • Predictive risk identification
Incident Management & Root Cause Analysis
  • Participate in major incident response activities.
  • Lead troubleshooting efforts for critical production issues.
  • Conduct Root Cause Analysis (RCA) and implement preventive measures.
  • Develop automated recovery processes and operational playbooks to reduce Mean Time to Recovery (MTTR).
Automation & Self-Healing
  • Design and implement automation frameworks to eliminate manual operational tasks.
  • Develop intelligent remediation solutions and self-healing mechanisms.
  • Build operational tooling using scripting and programming languages.
Operational Readiness & Risk Management
  • Participate in project management and operational readiness reviews.
  • Present application support readiness to stakeholders.
  • Identify operational risks and ensure mitigation plans are established.
  • Partner with Risk, Security, and Compliance teams to ensure governance requirements are met.
Capacity Planning & Performance Engineering
  • Perform capacity analysis and performance testing.
  • Ensure applications can scale effectively under increasing workloads.
  • Identify and resolve performance bottlenecks.
Metrics & Continuous Improvement
  • Define and track operational KPIs and reliability metrics.
  • Drive continuous improvement initiatives to increase platform stability and operational maturity.
  • Measure effectiveness of automation and resiliency improvements.
Mentorship & SRE Culture
  • Promote SRE best practices across Development and Support teams.
  • Mentor engineers on observability, automation, reliability engineering, and operational excellence.
  • Leverage AI-enabled tools to improve:
    • Monitoring
    • Performance
    • Security
    • Knowledge management
    • Code analysis


Required Qualifications
  • Bachelor's Degree in Computer Science, Engineering, Information Technology, or equivalent practical experience.
  • Minimum 8 years of experience in:
    • Application Support
    • Site Reliability Engineering (SRE)
    • DevOps Engineering
    • Production Support
    • Infrastructure Engineering
Technical Skills
SRE / DevOps
  • Strong experience implementing SRE principles and best practices.
  • Hands-on experience with:
    • CI/CD Pipelines
    • Infrastructure as Code (IaC)
    • Automation Frameworks
    • Release Management
Monitoring & Observability
  • Expertise with monitoring platforms such as:
    • Grafana
    • Prometheus
    • Splunk
    • Dynatrace
    • AppDynamics
    • Similar observability tools
Programming & Automation
  • Strong scripting and programming experience in one or more of:
    • Python
    • Java
    • Go (Golang)
    • Shell Scripting
Cloud & Infrastructure
  • Experience working with:
    • AWS
    • Azure
    • Google Cloud Platform (GCP)
    • Hybrid Cloud Environments
  • Knowledge of containerized platforms such as:
    • Docker
    • Kubernetes
    • OpenShift
Incident & Problem Management
  • Hands-on experience with:
    • Incident Management
    • Problem Management
    • Root Cause Analysis (RCA)
    • Major Incident Response
Reliability & Performance Engineering
  • Experience with:
    • Disaster Recovery Testing
    • Resiliency Testing
    • Performance Testing
    • Capacity Planning
Enterprise Technologies
  • Understanding of:
    • Messaging Systems
    • Data Platforms
    • Batch Processing Systems
    • Real-Time Processing Systems
    • AI/ML Concepts


Preferred Qualifications
  • Cloud certifications (AWS, Azure, GCP).
  • Experience in:
    • Financial Services
    • Banking
    • Capital Markets
    • Highly Regulated Environments
  • Exposure to AI-enabled operational tooling and analytics.

About the Company

eTeam Inc.

Looking for a great job? Join eTeam. We’re looking for talented staffing professionals to join our team. We also provide contract assignments and full-time jobs at Fortune 2000 companies. We’ve been named one of the best companies to work for by Staffing Industry Analysts and New Jersey Business.
COMPANY SIZE
100 to 499 employees
INDUSTRY
Other/Not Classified
FOUNDED
1998
WEBSITE
www.eteaminc.com