Senior Reliability Engineer - Customer Data Platform

TekWissen LLC

Atlanta, GA

JOB DETAILS
SALARY
$56.46–$56.46
SKILLS
Access Control, Analysis Skills, Application Programming Interface (API), Artificial Intelligence (AI), Automation, Bridge Building, Capacity Management, Civil Engineering, Cloud Computing, Communication Skills, Computer Science, Computer Security, Continuous Deployment/Delivery, Continuous Improvement, Continuous Integration, Cryptography, Customer Experience, Customer/Client Research, Data Management, Data Processing, Database Extract Transform and Load (ETL), Distributed Computing, Diversity, Ecosystems, Failover, High Availability, Incident Management, Information Technology & Information Systems, Information/Data Security (InfoSec), Leadership, Machine Tool, Maintain Compliance, Mentoring, Metrics, Microsoft Windows Azure, Multiplatform/Cross-Platform, Offshoring, On Call, Operational Audit, Organizational Skills, Privacy Controls, Problem Solving Skills, Procedure Implementation, Production Control, Production Support, Python Programming/Scripting Language, Regulatory Requirements, Release Management/Engineering, Reliability Engineering, Root Cause Analysis, Sales Pipeline, Scripting (Scripting Languages), Security Auditing, Security Compliance, Security Monitoring, Service Level Agreement (SLA), ServiceNow, Software Patches, Splunk, Strategic Planning, System Integration (SI), Systems Reliability, Team Lead/Manager, Time Management, Unix Shell Programming, Workforce Management
LOCATION
Atlanta, GA
POSTED
1 day ago
Overview:
TekWissen is a global workforce management provider headquartered in Ann Arbor, Michigan that offers strategic talent solutions to our clients worldwide. Our client is a provider of digital technology and transformation, information technology, and services.
Position: Senior Reliability Engineer - Customer Data Platform
Location: Atlanta, GA
Duration: 6 Months
Job Type: Temporary Assignment
Work Type: Onsite
JOB SUMMARY
  • We are seeking a Senior Reliability Engineer to own production excellence for our Customer Data Platform (CDP), the authoritative source of truth for customer data across the entire US adult population.
  • An authoritative platform is only authoritative if it is available, secure, and timely. This role ensures exactly that: high availability, operational resilience, and compliance for the critical data systems that power customer experiences across every touchpoint.
  • You will lead 24x7 production support, incident management, platform governance, and security compliance, ensuring CDP remains the trusted foundation the business depends on.
  • You will act as the bridge between engineering, platform, security, and compliance teams, driving the operational discipline that keeps CDP resilient, secure, and audit-ready at all times.
Job Responsibilities:
  • KTLO Leadership and Production Support
  • Lead KTLO operations, including 24x7 monitoring, incident management, and on-call processes, understanding that CDP downtime directly impacts customer experiences and business decisions
  • Oversee production support for data pipelines, APIs, and platform services across Azure and Databricks ecosystems
  • Manage job orchestration and monitoring (e.g., Control-M), ensuring SLA adherence and timely resolution, because timeliness is a core promise of the authoritative source of truth
  • Establish and enforce runbooks, SOPs, and escalation procedures tailored to CDP's criticality
  • Drive root cause analysis (RCA) and implement preventive measures to reduce recurring issues and protect data trust
  • Reliability Engineering and Operations
  • Improve system reliability through automation, observability, proactive monitoring, and near-real-time availability targets
  • Define and track SLAs, SLIs, and SLOs for critical CDP systems, with metrics aligned to data freshness, accuracy, and availability commitments
  • Partner with engineering teams to implement resiliency patterns, failover strategies, and capacity planning for population-scale data processing
  • Identify and eliminate operational bottlenecks and manual processes that threaten CDP's reliability and timeliness
  • Compliance, Security, and Governance
  • Lead execution of compliance mandates, audits, and regulatory requirements impacting CDP systems, ensuring that the platform holding data for the entire US adult population meets the highest security standards
  • Manage and remediate security violations, vulnerabilities, and policy breaches with urgency
  • Oversee access controls, audit readiness, and governance processes in collaboration with security teams, protecting the trust that makes CDP authoritative
  • Ensure adherence to data protection and privacy standards across all customer data systems
  • Platform Maintenance and Operational Hygiene
  • Manage patching, upgrades, and vulnerability remediation across CDP platforms
  • Lead password and credential rotation processes across systems and integrations
  • Ensure operational readiness for infrastructure and platform changes with zero-downtime deployment practices
  • Coordinate with vendors and platform teams for issue resolution and maintenance activities
  • Collaboration and Leadership
  • Lead and coordinate onshore/offshore support teams, ensuring effective coverage and handoffs for 24x7 operations
  • Partner with Data Engineering, AI/ML, and Platform teams to ensure operability and supportability of all CDP systems
  • Provide operational readiness reviews for new deployments and features before they enter production
  • Mentor team members and drive a culture of accountability, ownership, and continuous improvement
Education and Work Experience:
  • Bachelor's degree in Computer Science, Engineering, or related field
  • 6+ years of experience in production support, SRE, or platform operations roles
  • Proven experience managing 24x7 support models and distributed teams
  • Experience supporting large-scale data platforms in cloud environments (Azure preferred)
  • Experience with security compliance and audit processes for systems handling sensitive customer data
Technical Skills:
  • Strong experience with the Azure ecosystem (ADLS, Databricks, ADF, Event Hub, etc.)
  • Experience with job orchestration tools (Control-M or similar)
  • Solid understanding of data pipelines, ETL/ELT processes, and distributed systems at scale
  • Experience with monitoring and observability tools (e.g., Azure Monitor, Log Analytics, Splunk, Prometheus)
  • Familiarity with incident management tools and processes (PagerDuty, ServiceNow, etc.)
  • Experience with CI/CD pipelines and release management
  • Knowledge of security practices, access control, encryption, and compliance frameworks relevant to customer data
  • Scripting experience (Python, Shell) for automation and operational tooling
Knowledge, Skills, and Abilities:
  • Strong operational mindset with unwavering focus on stability, reliability, and uptime for a platform the entire business depends on
  • Ability to manage high-pressure production incidents and drive resolution with urgency and precision
  • Deep understanding of why platform reliability and security are foundational to CDP's authority as the source of truth
  • Strong problem-solving and root cause analysis skills
  • Excellent coordination and communication across engineering, security, and business teams
  • Ability to balance short-term fixes with long-term reliability improvements
  • Leadership skills in managing global support teams and rotations
TekWissen Group is an equal opportunity employer supporting workforce diversity.

About the Company

TekWissen LLC

WE THE TEKWISSEN PEOPLE

TekWissen offers you a broader portfolio of services, industry-leading solutions, and the meaningful innovations that give you greater flexibility and speed to respond to market dynamics, reduced costs and risk to improve enterprise performance, and increased productivity to enable growth.

To keep pace with global market demands, TekWissen keeps its finger on the pulse of change. We take an organized approach to guiding a project from inception to closure. Managing projects is becoming more and more important as we enter the digital era. To cope with the pace this transition demands, a method is required for managing projects so that they yield quality work while making efficient use of time and resources.

Quality planning involves identifying which quality standards are relevant to the project and determining how to satisfy them.

Quality planning should be performed during the planning process, alongside the other project planning processes, because changes in quality will likely require changes in the other planning processes, or the desired product quality may require a detailed risk analysis of an identified problem. It is important to remember that quality should be planned, designed, and then built in, not added on after the fact.

Capabilities and accomplishments in one TekWissen business enhance the opportunity for success in the others. Put simply, TekWissen's unique combination of attributes promotes success.



COMPANY SIZE
100 to 499 employees
INDUSTRY
Computer/IT Services
FOUNDED
2009
WEBSITE
http://www.tekwissen.com/