Network Engineer, Reliability & Observability

Fluidstack Ltd

New York City, NY

JOB DETAILS
SALARY
$150,000–$250,000 Per Year
SKILLS
Agile Programming Methodologies, Artificial Intelligence (AI), Atlassian JIRA, BGP, Computer Networks, Data Analysis, Data Collection, Data Management, Data Processing, Debugging Skills, Electricity, Electronics, Establish Priorities, Ethernet, Extreme Programming, Failure Analysis, Hardware Repair, IP (Internet Protocol) Routing, ITIL (IT Infrastructure Library), Identify Issues, Incident Response, Leadership, Logistics, Machine Learning, Manufacturing Operations, Metrics, Network Administration/Management, Network Architecture/Engineering, Network Monitoring, Network Operations Center, Network System Hardware, Operational Strategy, Operational Support, Optical Ethernet, Optics, Problem Solving Skills, Process Development, Process Engineering, Process Improvement, Quality Assurance, Quality Assurance Methodology, Quality Management, Reliability Engineering, Return on Capital Employed (ROCE), Sales Pipeline, Service Level Agreement (SLA), Software Development, Software Testing, Sourcing Strategy, Team Player, Telemetry, Test Driven Development (TDD), Topology
LOCATION
New York City, NY
POSTED
30+ days ago

About Fluidstack

At Fluidstack, we're building the infrastructure for abundant intelligence. We partner with top AI labs, governments, and enterprises - including Mistral, Poolside, Black Forest Labs, Meta, and more - to unlock compute at the speed of light.

We're working with urgency to make AGI a reality. As such, our team is highly motivated and committed to delivering world-class infrastructure. We treat our customers' outcomes as our own, taking pride in the systems we build and the trust we earn.

If you're motivated by purpose, obsessed with excellence, and ready to work very hard to accelerate the future of intelligence, join us in building what's next.

About the Role

Fluidstack is seeking a Network Engineer, Reliability & Observability to serve as a reliability engineer who champions and builds processes, data collection, and reliability metrics, with the objective of improving the quality and reliability of AI networks from deployment through the full lifecycle of operations. This role is focused on developing processes, systems, tools, data and data pipelines, and observability to improve the quality of networks and deliver automated metrics (24x7) as well as periodic reliability reports for both internal and external customers.

This role is ideal for experienced network operators who are passionate about reliability and have experience designing and building full-lifecycle software for tasks such as Quality Assurance audits, circuit audits, periodic audits, failure rates, and failure analysis. You are passionate about hardware (electronics and optics) and software development, and you value and promote the use of data to make informed decisions in deployment, operations, and strategic sourcing.

Responsibilities

Ownership of Quality Assurance: Design, develop, and support the QA process for network hardware and networks.
Pipelines: Develop and deploy serverless workflows as well as server-based and manually triggered data pipelines that produce network quality and reliability observability for internal and external customers.
Deployment and Operations Support: Support full-lifecycle data collection and analysis, partnering with Deployment, Operations, DC hardware, and logistics teams to produce data that drives process improvements and delivers on SLAs and SLOs.
Process Engineering: Develop, pilot, and deploy process improvements for deployment and repair that produce data and consume data with Machine Learning to fulfill our mission.
Cross-Team Collaboration: Own without ego and execute in a collaborative team with design, deployment, and operations engineers and software developers.
Subject Matter Expert: Bring deep expertise in at least two subjects such as IP routing, optics, optical transport, Ethernet, RDMA/RoCE, or electrical power.

About You

We are looking for a strong operations background: 5+ years in network engineering and at least 3 years in operations, with significant hands-on operational experience. You've run production networks or compute, responded to incidents at all hours, and debugged complex failures under pressure. You understand the difference between "working" and "production-ready".

Requirements

Datacenter Fabric Expertise: Deep experience operating modern datacenter networks, including EVPN/VXLAN, BGP, Clos topologies, and high-radix switching.
Incident Response Excellence: Proven ability to lead incident response, perform systematic troubleshooting, and drive issues to resolution.
Matrix Leadership Experience: You understand how to build relationships with onsite teams, coordinate physical infrastructure work, and represent network engineering in a field environment.
Operational Pragmatism: You balance perfection with progress. You can troubleshoot with imperfect information, make pragmatic decisions under time pressure, and prioritize based on business impact.
Self-Driven: You embrace complex challenges with undefined processes and key results. You can dive in to learn, then zoom back out to build Objectives, develop Key Results, and build a software development project and pipeline in Jira on your own.

Nice to Haves

AI/HPC Fabric Operations: Experience operating AI/ML or HPC fabrics with RDMA (RoCEv2), lossless Ethernet (PFC, ECN), or high-performance networking.
Reliability Engineering: You have experience with observability and reliability engineering from network operations or manufacturing quality.
Hardware Repair Experience: Hands-on experience coordinating hardware repairs, RMAs, and physical infrastructure work.
Observability & Monitoring: Familiarity with network monitoring platforms, alerting systems, and telemetry collection.
Software Development: You have experience with ITIL, Agile (XP), and TDD, including developing and leading programs and projects.

Salary & Benefits

Competitive total compensation package (salary + equity).
Retirement or pension plan, in line with local norms.
Health, dental, and vision insurance.
Generous PTO policy, in line with local norms.

The base salary range for this position is $150,000 - $250,000 per year, depending on experience, skills, qualifications, and location.

Fluidstack is an Equal Employment Opportunity Employer

All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability and protected veterans' status, or any other characteristic protected by law. Fluidstack will consider for employment qualified applicants with arrest and conviction records pursuant to applicable law.
