DevOps SRE

eTeam Inc.

Coppell, TX

JOB DETAILS

SALARY

$75–$82.14 Per Hour

SKILLS

Amazon Web Services (AWS), Analysis Skills, Applications Security, Artificial Intelligence (AI), Automation, Banking Services, Best Practices, Capacity Analysis, Capacity Management, Capacity and Performance Management, Capital Markets, Cloud Computing, Computer Science, Consulting, Continuous Deployment/Delivery, Continuous Improvement, Continuous Integration, DevOps, Disaster Recovery, Docker, Documentation, Enterprise Applications, Financial Services, GCP (Google Cloud Platform), Go Programming Language (Golang), High Availability, Hybrid Cloud, Identify Issues, Improvement Metrics, Incident Management, Incident Response, Information Technology & Information Systems, Java, Knowledge Base, Knowledge Management, Machine Tool, Mentoring, Messaging Technology, Metrics, Microsoft Windows Azure, Operational Audit, Operations Processes, Performance Engineering, Performance Metrics, Performance Testing, Process Improvement, Production Support, Production Systems, Project/Program Management, Protective Services, Python Programming/Scripting Language, Release Management/Engineering, Reliability Engineering, Reporting Dashboards, Risk, Risk Analysis, Risk Management, Root Cause Analysis, Scripting (Scripting Languages), Software Administration, Software Engineering, Splunk, Technical Support, Time Management, Unix Shell Programming

LOCATION

Coppell, TX

POSTED

17 days ago

Job Title: Application Support Engineer
Location: Dallas, TX / Jersey City, NJ (Hybrid)
Shift Schedule: Monday – Friday (9am – 5pm)
Type: Contract-to-Hire (CTH)
Duration: 6 months

Position Overview
We are seeking an experienced Application Support Engineer with a strong background in Site Reliability Engineering (SRE), DevOps, Production Support, Observability, Automation, and Cloud Technologies. This role will be responsible for ensuring the reliability, scalability, performance, and operational readiness of enterprise applications running in production environments.
The ideal candidate will work closely with Development, Infrastructure, Release Management, Risk, Security, and Application Support teams to drive operational excellence through automation, monitoring, resiliency engineering, and continuous improvement initiatives.
This position requires a hands-on engineer who can proactively identify risks, improve application recovery capabilities, lead incident management efforts, and champion SRE best practices across the organization.

Key Responsibilities
Site Reliability & Operational Excellence

Partner with development teams during design reviews, Sprint Zero, and delivery planning to ensure Non-Functional Requirements (NFRs) are incorporated into solutions.
Drive application resiliency, fault tolerance, disaster recovery, observability, and high availability initiatives.
Ensure proper support for holiday processing, special business events, and critical operational activities.

Release & Deployment Readiness

Collaborate with Major Release Management teams to ensure production releases meet organizational SRE standards.
Validate deployment readiness through monitoring, observability, resiliency, and recovery testing.
Ensure all releases include proper documentation, runbooks, and knowledge base articles.

Monitoring & Observability

Design, implement, and optimize monitoring and alerting solutions.
Define and manage:
- Service Level Indicators (SLIs)
- Service Level Objectives (SLOs)
- Operational dashboards
- Alerting frameworks
Utilize AI/ML-based analytics for:
- Anomaly detection
- Incident correlation
- Predictive risk identification

Incident Management & Root Cause Analysis

Participate in major incident response activities.
Lead troubleshooting efforts for critical production issues.
Conduct Root Cause Analysis (RCA) and implement preventive measures.
Develop automated recovery processes and operational playbooks to reduce Mean Time to Recovery (MTTR).

Automation & Self-Healing

Design and implement automation frameworks to eliminate manual operational tasks.
Develop intelligent remediation solutions and self-healing mechanisms.
Build operational tooling using scripting and programming languages.

Operational Readiness & Risk Management

Participate in project management and operational readiness reviews.
Present application support readiness to stakeholders.
Identify operational risks and ensure mitigation plans are established.
Partner with Risk, Security, and Compliance teams to ensure governance requirements are met.

Capacity Planning & Performance Engineering

Perform capacity analysis and performance testing.
Ensure applications can scale effectively under increasing workloads.
Identify and resolve performance bottlenecks.

Metrics & Continuous Improvement

Define and track operational KPIs and reliability metrics.
Drive continuous improvement initiatives to increase platform stability and operational maturity.
Measure effectiveness of automation and resiliency improvements.

Mentorship & SRE Culture

Promote SRE best practices across Development and Support teams.
Mentor engineers on observability, automation, reliability engineering, and operational excellence.
Leverage AI-enabled tools to improve:
- Monitoring
- Performance
- Security
- Knowledge management
- Code analysis

Required Qualifications

Bachelor's Degree in Computer Science, Engineering, Information Technology, or equivalent practical experience.
Minimum 8 years of experience in:
- Application Support
- Site Reliability Engineering (SRE)
- DevOps Engineering
- Production Support
- Infrastructure Engineering

Technical Skills
SRE / DevOps

Strong experience implementing SRE principles and best practices.
Hands-on experience with:
- CI/CD Pipelines
- Infrastructure as Code (IaC)
- Automation Frameworks
- Release Management

Monitoring & Observability

Expertise with monitoring platforms such as:
- Grafana
- Prometheus
- Splunk
- Dynatrace
- AppDynamics
- Similar observability tools

Programming & Automation

Strong scripting and programming experience in one or more of:
- Python
- Java
- Go (Golang)
- Shell Scripting

Cloud & Infrastructure

Experience working with:
- AWS
- Azure
- Google Cloud Platform (GCP)
- Hybrid Cloud Environments
Knowledge of containerized platforms such as:
- Docker
- Kubernetes
- OpenShift

Incident & Problem Management

Hands-on experience with:
- Incident Management
- Problem Management
- Root Cause Analysis (RCA)
- Major Incident Response

Reliability & Performance Engineering

Experience with:
- Disaster Recovery Testing
- Resiliency Testing
- Performance Testing
- Capacity Planning

Enterprise Technologies

Understanding of:
- Messaging Systems
- Data Platforms
- Batch Processing Systems
- Real-Time Processing Systems
- AI/ML Concepts

Preferred Qualifications

Cloud certifications (AWS, Azure, or GCP).
Experience in:
- Financial Services
- Banking
- Capital Markets
- Highly Regulated Environments
Exposure to AI-enabled operational tooling and analytics.

About the Company

eTeam Inc.

Looking for a great job? Join eTeam. We’re looking for talented staffing professionals to join our staff. We also provide contract assignments and full-time jobs at Fortune 2000 Companies. We’ve been named one of the best companies to work for by Staffing Industry Analysts and New Jersey Business.

COMPANY SIZE

100 to 499 employees

INDUSTRY

Other/Not Classified

FOUNDED

1998

WEBSITE

www.eteaminc.com
