Production Support Lead - Data

TekWissen LLC

Hendron, VA

JOB DETAILS
SALARY
$43.43
SKILLS
Analysis Skills, Automation, Budgeting, Cadence, Change Management, Communication Skills, Continuous Improvement, Corrective Action, Corrective and Preventative Action (CAPA) Systems, Cross-Functional, Data Management, Data Quality, Data Sets, Documentation, Environmental Management, High Availability, High Reliability, ITIL (IT Infrastructure Library), Identify Issues, Incident Management, Incident Response, Information Technology & Information Systems, Instrumentation, Java, Knowledge Management, Leadership, Linux Operating System, Machine Tool, Maintain Compliance, Mentoring, Metrics, Microservices, Microsoft Windows Azure, Multiplatform/Cross-Platform, Offshoring, On Call, Operational Support, Oracle, Performance Analysis, Product Testing, Production Support, Python Programming/Scripting Language, Quality Engineering, Quality Management, Reconciliation, Release Management/Engineering, Reliability Engineering, Replication and Remote Mirroring, Reporting Dashboards, Reporting Skills, Root Cause Analysis, Service Level Agreement (SLA), Snowflake Schema, Splunk, Standard Operating Procedures (SOP), Team Lead/Manager, Time Tracking, Unix Shell Programming, Validation Plan, Workforce Management
LOCATION
Hendron, VA
POSTED
3 days ago
Overview:
TekWissen is a global workforce management provider headquartered in Ann Arbor, Michigan, that offers strategic talent solutions to clients worldwide. Our client is a provider of digital technology and transformation, information technology, and services.
Position: Production Support Lead - Data
Location: Hendron, VA
Duration: 6 Months
Job Type: Temporary Assignment
Work Type: Hybrid
Job Description:
Production Support Lead - DataONE Enterprise Platform (24x7 Ops)
Job Title: Production Support Lead / Platform Operations Lead - DataONE (24x7 Onshore + Offshore)
Location: Onsite/Hybrid (US Onshore) + Offshore leadership coordination (India)
Experience: 10-15+ years overall; 5+ years leading production support/operations teams for enterprise platforms
Work Model: 24x7 coverage ownership (on-call leadership, shift rotations, incident commander duties)
Role Summary:
  • We are seeking a highly accountable Production Support Lead to own 24x7 operational stability of the DataONE enterprise data platform.
  • This role will lead onshore and offshore teams (Data Engineering, SRE, Data Quality) to ensure high availability, reliability, performance, and data correctness across a platform powered by Snowflake, Kafka, Oracle GoldenGate, and Java-based microservices hosted on Azure/Kubernetes.
  • The ideal candidate is a strong operations leader with hands-on technical depth, proven incident/problem management, strong stakeholder communication, and the ability to build a mature Ops readiness model (runbooks, SOPs, monitoring, SLIs/SLOs, automation).
Key Responsibilities:
A) 24x7 Operations Ownership
  • Own end-to-end production support for the DataONE platform, with round-the-clock coverage across onshore/offshore teams
  • Establish shift model, on-call rotations, escalations, and operational governance
  • Ensure adherence to SLAs/OLAs for incidents, requests, data issues, and platform uptime
B) Incident, Problem & Major Incident Management
  • Serve as Incident Commander for Severity 1/2 outages and customer-impacting events
  • Drive triage, containment, recovery, and communication cadence (internal + customer)
  • Lead root cause analysis (RCA), corrective actions (CAPA), and prevention via automation/engineering fixes
  • Implement problem management to reduce repeat incidents
C) Observability, Monitoring & Reliability Engineering
  • Partner with SRE team to establish/operate:
      • Platform health dashboards, alerting, and run-time instrumentation
      • Monitoring for Snowflake workloads, Kafka lag, GoldenGate replication, microservices health, Kubernetes, and data pipeline SLAs
  • Define and track SLIs/SLOs, error budgets, and availability/performance targets
D) Data Correctness & Data Quality Operations
  • Oversee data-quality processes with DQ engineers:
      • Source-to-target reconciliation, anomaly detection, threshold-based alerting
      • Governance of data validation gates before publishing downstream datasets
  • Ensure data incident handling is as rigorous as platform incidents (impact, ETA, mitigation, prevention)
E) Release, Change & Environment Management
  • Coordinate with engineering teams for release management, change windows, and deployment readiness
  • Ensure rollback plans, validation steps, and post-release monitoring are standard practice
  • Manage environment stability across dev/test/prod, access provisioning, and audit readiness
F) Runbooks, SOPs & Knowledge Management
  • Build and maintain high-quality operational documentation:
      • Runbooks, SOPs for common tickets, troubleshooting guides, escalation trees
      • Platform known issues registry and operational playbooks
  • Ensure continuous knowledge acquisition and reduce dependency on individuals
G) Stakeholder Management & Reporting
  • Provide crisp, outcome-driven status to customer leadership: uptime/availability, incident trends, backlog health, repeat issues, MTTR, change success rate
  • Translate technical problems into business impact and propose corrective actions
H) Team Leadership & Delivery Excellence
  • Lead and mentor a multi-discipline team:
      • Data Engineers (platform pipelines)
      • SREs (reliability/observability/automation)
      • Data Quality Engineers (validation/reconciliation/data incidents)
  • Drive culture of ownership, operational discipline, and continuous improvement
Required Skills & Qualifications:
Operations / Leadership
  • Strong experience leading 24x7 production operations for enterprise systems/platforms
  • Deep knowledge of ITIL processes: incident/problem/change management
  • Proven track record of improving MTTR, reducing repeat incidents, and operationalizing platforms
Technical (must-have working knowledge)
  • Snowflake: monitoring, performance troubleshooting, workload tuning concepts
  • Kafka: consumer lag, throughput, partitions, operational triage
  • Oracle GoldenGate (or similar replication): monitoring replication health, lag, failure patterns
  • Java microservices: logs/metrics/tracing, dependency failures, deployment issues
  • Azure + Kubernetes: container operations, scaling, pod restarts, config/secrets, networking basics
  • Strong troubleshooting in Linux, log analysis, and automation (Shell/Python)
Communication
  • Ability to communicate clearly with non-technical stakeholders
  • Strong documentation and executive status reporting skills
Preferred Skills:
  • Experience supporting telecom-scale or large enterprise data platforms
  • Familiarity with observability tools: Splunk, Grafana, Prometheus, ELK, Dynatrace
  • Exposure to data observability/data quality tooling and automated validation frameworks
  • Experience running Ops maturity initiatives (SLOs, error budgets, automation-first, self-healing patterns)
Key Deliverables (First 30-60 Days):
  • Implement 24x7 support model: RACI + escalation + shift/on-call schedule
  • Establish operational dashboards and alerts across Snowflake/Kafka/GoldenGate/services
  • Publish runbooks + SOPs for top ticket categories
  • Start weekly operational reporting: incidents, MTTR, repeat issues, backlog, release health
  • Define initial SLIs/SLOs and action plan to stabilize key failure areas
TekWissen Group is an equal opportunity employer supporting workforce diversity.

About the Company

TekWissen LLC

WE THE TEKWISSEN PEOPLE

TekWissen offers you a broader portfolio of services, industry-leading solutions, and the meaningful innovations that give you greater flexibility and speed to respond to market dynamics, reduced costs and risk to improve enterprise performance, and increased productivity to enable growth.

To keep pace with global market demands, TekWissen keeps its finger on the pulse of change. We take an organized approach to guiding a project from its inception to closure. Managing projects is becoming more and more important as we enter the digital era. To cope with the pace that this transition demands, a method is required to manage projects so that they yield quality work while making efficient use of time and resources.

Quality planning involves identifying which quality standards are relevant to the project and determining how to satisfy them.

Quality planning should be performed during the Planning Process, alongside the other project planning processes, because changes in quality will likely require changes in the other planning processes, and the desired product quality may require a detailed risk analysis of an identified problem. It is important to remember that quality should be planned, designed, and then built in, not added on after the fact.

Capabilities and accomplishments in one TekWissen business enhance the opportunity for success in the others. Put simply, TekWissen's unique combination of attributes promotes success.



COMPANY SIZE
100 to 499 employees
INDUSTRY
Computer/IT Services
FOUNDED
2009
WEBSITE
http://www.tekwissen.com/