Infrastructure Engineer

Advanced Tech Placement

Roseland, NJ

JOB DETAILS
SKILLS
Amazon Web Services (AWS), Analysis Skills, Artificial Intelligence (AI), Automation, Business Operations, Cloud Computing, Communication Skills, Continuous Improvement, Cost Control, Cross-Functional, Data Management, Data Modeling, Data Processing, Database Design, DevOps, Emerging Technology, Establish Priorities, GPU (Graphics Processing Unit), GitHub, High Availability, Identify Issues, Incident Response, Infrastructure Software, Leadership, Machine Tool, Mine Production, Operational Improvement, Operational Strategy, Operational Support, Operations Planning, Organizational Skills, Performance Management, Problem Solving Skills, Production Systems, Python Programming/Scripting Language, Reliability Engineering, Risk, Risk Analysis, Risk Management, Software Development, Standards Development, Team Player
POSTED
19 days ago

We are looking for an Infrastructure Engineer

We are seeking a highly skilled Infrastructure Engineer to help design, build, automate, and operate scalable, high-availability production infrastructure in a fast-paced enterprise technology environment. This individual will play a key role in driving reliability, automation, cloud infrastructure strategy, operational excellence, and AI-enabled engineering practices across mission-critical systems.

Responsibilities:

  • Design, build, automate, and support large-scale, highly available cloud infrastructure environments
  • Manage and optimize containerized production platforms and orchestration environments
  • Develop and maintain Infrastructure as Code (IaC) solutions using tools such as Terraform or Pulumi
  • Build automation tooling, operational utilities, and platform enhancements using Python or Go
  • Drive infrastructure reliability, scalability, observability, and resiliency initiatives
  • Partner closely with engineering, product, security, AI/ML, and platform teams to support enterprise-wide initiatives
  • Implement and maintain monitoring, logging, alerting, and performance management solutions
  • Troubleshoot complex production issues and proactively identify systemic risks or operational weaknesses
  • Lead infrastructure improvements with a focus on reversibility, risk mitigation, and minimizing production blast radius
  • Create operational standards, automation frameworks, and deployment strategies that improve engineering velocity and reliability
  • Support AI-driven infrastructure operations, intelligent automation initiatives, and AI-assisted engineering workflows
  • Evaluate and implement emerging AI-enabled operational tooling to improve efficiency, incident response, automation, and developer productivity
  • Collaborate with engineering teams supporting AI/ML workloads, data platforms, and model deployment pipelines
  • Own infrastructure initiatives end-to-end, including architecture, implementation, rollout, rollback planning, and operational support

Requirements:

  • 5 years of experience in Infrastructure Engineering, DevOps, Site Reliability Engineering, or similar roles supporting large-scale production environments
  • Hands-on experience operating containerized production environments and orchestration platforms in enterprise or high-growth environments
  • Strong experience with Kubernetes, Helm, and Infrastructure as Code tools such as Terraform or Pulumi
  • Experience supporting cloud infrastructure environments, preferably AWS
  • Proficiency in Python or Go for automation, tooling, and infrastructure development
  • Strong experience with monitoring, observability, and logging platforms such as Prometheus, Grafana, ELK, or equivalent technologies
  • Experience implementing resilient infrastructure designs focused on scalability, reliability, rollback strategies, and operational safety
  • Strong understanding of infrastructure tradeoffs involving reliability, cost optimization, deployment velocity, and operational risk
  • Demonstrated experience leveraging AI-assisted engineering tools and agentic AI workflows within day-to-day development and operational practices
  • Experience utilizing AI-enabled platforms such as Claude Code, Codex, GitHub Copilot, or similar tools to improve automation, troubleshooting, deployment efficiency, and operational workflows
  • Familiarity with infrastructure requirements for AI/ML environments, including compute scalability, data processing pipelines, model deployment, or GPU-enabled workloads (highly desirable)

Required Skills:

  • Excellent communication and cross-functional collaboration skills
  • Strong analytical and problem-solving capabilities
  • Ability to challenge assumptions, identify operational gaps, and recommend innovative infrastructure solutions
  • Proven ownership mindset with experience leading infrastructure initiatives from concept through production deployment
  • Strong organizational skills with the ability to prioritize and execute in fast-paced environments
  • Passion for continuous improvement, emerging technologies, and modern AI-enabled operational practices

Preferred Skills:

  • Software engineering background with experience building and maintaining production-grade applications, services, libraries, or internal frameworks
  • Ability to read, troubleshoot, and modify application codebases supporting infrastructure platforms
  • Experience bridging infrastructure engineering and software development practices
  • Experience building reusable platform tooling, developer enablement frameworks, or internal infrastructure products
  • Experience supporting enterprise-scale cloud transformation or modernization initiatives
  • Exposure to MLOps, AI infrastructure, vector databases, model serving frameworks, or intelligent automation platforms
  • Experience supporting AI/ML engineering teams through scalable infrastructure and deployment automation

About the Company

Advanced Tech Placement