Senior Reliability Engineer

Thunderhawk Technology Partners LLC

Hamilton, NJ

JOB DETAILS
SKILLS
Amazon Web Services (AWS), Applications Security, Artificial Intelligence (AI), Automation, Bash Scripting, Best Practices, Budgeting, Cloud Architecture, Cloud Computing, Communication Skills, Computer Security, Continuous Deployment/Delivery, Continuous Improvement, Continuous Integration, Design Flows, DevOps, Disaster Recovery, Documentation, Documentation Design, Google Cloud Platform (GCP), High Availability, Identify Issues, Incident Management, Incident Response, Linux Operating System, Microsoft Windows Azure, Microsoft Windows Operating System, Model Review, Operating Systems, Performance Engineering, Performance Management, Performance Testing, Problem Solving Skills, Process Improvement, Public Cloud, Python Programming/Scripting Language, Reliability Engineering, Risk, Risk Management, Root Cause Analysis, Scripting (Scripting Languages), Service Level Agreement (SLA), Software Administration, Software Development Lifecycle (SDLC), Systems Engineering, Technical Leadership, Technical Writing, Threat Modeling, Unix Operating Systems, User Interface/Experience (UI/UX), Windows PowerShell
LOCATION
Hamilton, NJ
POSTED
30+ days ago
Reliability Engineer
12+ month contract
Location: Hamilton, New Jersey
3-4 days onsite

Position
The Reliability Engineer is responsible for architecting, implementing, and operating resiliency and observability solutions across cloud, data, application, and AI domains. This role emphasizes operational excellence, automation, and proactive risk mitigation to ensure high availability, performance, and recoverability of critical business platforms. The Reliability Engineer will drive continuous improvement in reliability and security controls, reduce operational risk, and enable rapid recovery and business continuity.

Responsibilities
Proactive Testing & Continuous Monitoring
Design and implement capabilities to proactively test operational performance and user experience end-to-end.
Build and maintain monitoring solutions for applications and security controls to detect unexpected behaviors and silent failures before they impact clients or end users.

Automation & Self-Service Delivery
Engineer automated reliability controls for incident triage, remediation, and production artifact generation (e.g., documentation, design reviews, threat models).
Integrate reliability automation into CI/CD pipelines, infrastructure-as-code, and cloud-native services to reduce manual effort and scale efficiently.

Business Resiliency & Survivability
Introduce runtime protection against threats and failures for business-critical applications and infrastructure.
Ensure systems are reproducibly built and can sustain disruptions with minimal impact.
Lead technical investigations, root cause analysis, and incident response for service disruptions, performance degradation, and availability risks.

Rapid Incident Response & Recovery
Unify technical and security incident response practices, including SLA management, root cause analysis, and problem resolution.
Provide rapid cold-start recovery capabilities to rebuild the business on demand after a disruption.
Document and communicate technical risks, solutions, and best practices to technical and non-technical stakeholders.

Continuous Improvement & Knowledge Sharing
Fix root causes so that problems do not recur.
Document designs, data flows, changes, and troubleshooting runbooks to ensure shared learnings and prevent history from repeating.
Educate teams on reliability principles and foster a culture of operational excellence.

Qualifications
• Strong engineering fundamentals in systems, software, and cloud infrastructure.
• Deep knowledge of networking, operating systems (Windows, Linux, Unix), and distributed/cloud architectures.
• Experience with application reliability, performance engineering, and secure software development lifecycle.
• Expertise in monitoring, observability, incident response, and disaster recovery for cloud and data platforms.
• Familiarity with SRE principles, service-level objectives (SLOs), and error budgets.

Skills
• Proficiency in scripting and automation (Python, PowerShell, Bash, etc.).
• Proficiency in reliability automation tools, CI/CD, infrastructure-as-code, and DevOps practices.
• Hands-on experience with public cloud infrastructure and security (Azure preferred; AWS/GCP a plus).
• Ability to communicate complex technical concepts clearly and collaborate across teams.
• Track record of engineering excellence, integrity, and continuous learning.
