Data Engineer Gen AI / Python

Siritech Solutions Corp

Whippany, NJ

JOB DETAILS
JOB TYPE
Contractor
SKILLS
Amazon Web Services (AWS), Apache Kafka, Apache Spark, Application Programming Interface (API), Artificial Intelligence (AI), Cloud Computing, Continuous Deployment/Delivery, Continuous Integration, Data Cleaning, Data Management, Data Modeling, Data Quality, Data Sets, Database Extract Transform and Load (ETL), Docker, GCP (Google Cloud Platform), High Throughput, Mathematics, Microsoft Azure, Modeling Languages, Natural Language Processing (NLP), NoSQL, Performance Analysis, Performance Modeling, Performance Testing, Precision Testing, Python Programming/Scripting Language, Quality Management, SQL (Structured Query Language), Software Engineering, Training Data Sets, Unstructured Data
LOCATION
Whippany, NJ
POSTED
30+ days ago

Role: Data Engineer Gen AI / Python
Location: Whippany, NJ (hybrid, 2 days in office)

Please share profiles at jaya@siritechsol.com

Job Description:

 

The ideal candidate is experienced and skilled in designing, building, and maintaining high-quality data pipelines, preprocessing workflows, and vector databases required for training, fine-tuning, and deploying Large Language Models (LLMs). The role involves building and maintaining high-throughput data pipelines, infrastructure, and storage solutions that feed, train, and deploy AI/ML models, as well as implementing RAG (Retrieval-Augmented Generation) systems, data cleaning, and model evaluation to ensure efficient, scalable, and reliable LLM applications.

Required Skills & Qualifications

Strong proficiency in Python is essential, along with SQL and NoSQL for data management.
Experience with LangChain, LlamaIndex, Hugging Face Transformers, and the OpenAI API.
Experience with Apache Spark, Kafka, or modern data stack tools.
Knowledge of NLP techniques, word embeddings, tokenization, and vector mathematics.
Familiarity with TensorFlow, PyTorch, or Hugging Face.
Familiarity with cloud platforms (AWS, GCP, Azure), CI/CD, Docker, and Kubernetes.

Key Responsibilities

Design and build robust ETL/ELT pipelines for unstructured text data, including scraping, cleaning, deduplication, and transformation for LLM training.
Build and maintain vector search solutions (e.g., Pinecone, Milvus, Weaviate, Chroma) to store and retrieve embeddings for RAG systems.
Prepare high-quality datasets for adapter-based fine-tuning (e.g., LoRA) and train LLMs using frameworks like PyTorch or TensorFlow.
Implement Retrieval-Augmented Generation using frameworks like LangChain or LlamaIndex to connect LLMs to company data.
Develop evaluation frameworks for model performance, testing for accuracy, hallucination, and bias, and monitor deployed models.
Create APIs and internal web tools for data annotation, curation, and model interaction.


About the Company

Siritech Solutions Corp