The role focuses on maintaining and optimizing existing data pipelines, consolidating data from multiple sources, and working closely with business stakeholders to define and implement transformation logic.
Job Responsibilities
Core Responsibilities
· Maintain and monitor existing production data pipelines (little to no new pipeline build)
· Manage and troubleshoot AWS Glue jobs
· Validate and verify data outputs (including SageMaker checks)
· Perform cost optimization across AWS workloads and SQL queries
· Consolidate data from 20+ sandbox tables into a single scalable "gold" table
· Build datasets supporting daily snapshots, monthly snapshots, and client- and crew-level metrics
· Translate business metrics into reliable transformation logic
Tech Stack (Must-Have)
· PySpark
· Python
· SQL
· AWS, with emphasis on AWS Glue and cloud cost awareness
· Experience supporting production ETL pipelines
· Strong query optimization for performance and cost efficiency
Nice-to-Have / Preferred
· Experience with Amazon SageMaker (monitoring / validation)
· Prior experience in the Vanguard environment
· Hybrid background as Data Analyst → Data Engineer
· Exposure to cloud cost-optimization initiatives
Experience Level
· 5–8 years of overall IT experience
· Strong hands-on data engineering background
· Experience working with analyst-driven or business-led data use cases
Non-Technical Traits (Important)
· Strong business understanding of data, metrics, and tables
· Comfortable with ambiguous or evolving requirements
· Able to collaborate closely with analysts during discovery/research
· Works well in Kanban / Agile environments
· Strong communicator and team collaborator
Ideal Candidate Profile (Quick Check)
· Can hit the ground running
· More focused on pipeline maintenance and optimization than greenfield builds
· Strong AWS + PySpark engineer with business-facing experience
· Comfortable delivering value quickly in a short-term assignment
· Fits well in a highly collaborative, delivery-focused team