The Team
You will join a high-availability Neocloud data center operations team supporting GPU-accelerated cloud infrastructure used for AI training, inference, and high-performance workloads. The team operates in a 24/7 production environment where reliability, speed, and operational excellence directly impact customer experience and platform performance.
This role sits at the heart of day-to-day operations, working closely with Data Center IT Managers, Infrastructure Engineering, Network, Security, and Vendor partners to keep systems running smoothly and resolve incidents quickly.
The Role
As IT Support Manager, you will lead the front-line IT support and operations function responsible for maintaining the health, availability, and performance of data center IT infrastructure. You will manage teams handling incident response, hardware troubleshooting, change execution, and service delivery, with strong exposure to GPU-based servers, enterprise networking, and colocation environments. This is a hands-on leadership role focused on operational execution, team performance, and service quality.
Key Responsibilities
Day-to-Day Operations & Incident Management
• Lead daily IT operations across data center environments, ensuring high availability and SLA adherence.
• Own incident management, including triage, escalation, coordination, and communication.
• Drive root cause analysis (RCA) and follow-through on corrective and preventive actions.
• Ensure operational readiness for GPU-dense infrastructure, including power, cooling, and hardware health monitoring.
Team Leadership & Service Delivery
• Manage, schedule, and develop IT support engineers operating in shift-based / 24×7 environments.
• Define and track KPIs, SLAs, and service quality metrics.
• Provide hands-on guidance during complex troubleshooting scenarios.
• Maintain consistent operational standards through runbooks, SOPs, and playbooks.
Hardware Support & Asset Operations
• Oversee diagnosis and resolution of issues related to servers, GPU systems, networking equipment, and cabling.
• Manage hardware lifecycle activities, including installations, upgrades, swaps, and decommissioning.
• Coordinate RMAs, spare parts, inventory accuracy, and asset tracking.
Change, Process & Continuous Improvement
• Execute approved changes and maintenance activities with minimal risk.
• Identify recurring issues and drive process improvement to reduce incidents and MTTR.
• Ensure adherence to ITIL / ITSM operational processes.
Vendor & Compliance Support
• Act as the operational interface to vendors, OEMs, and colocation providers for day-to-day support issues.
• Support audits, compliance checks, and operational controls related to asset handling and access.
• Ensure secure handling, storage, and decommissioning of IT assets.
Required Experience & Skills
• 3-5+ years of experience in IT support or data center operations, including people management.
• Strong hands-on experience with server hardware, including exposure to GPU-based systems.
• Solid understanding of data center operations, networking basics, and structured cabling.
• Experience leading incident response and operational troubleshooting.
• Working knowledge of ITIL / ITSM frameworks.
• Comfortable working with Linux systems and basic command-line tools.
• Strong organizational skills and ability to prioritize in high-pressure environments.
• Clear, concise communication skills for technical and non-technical stakeholders.
Nice to Have
• Experience in Neocloud, hyperscale, or AI/HPC environments.
• Prior ownership of 24/7 support operations.
• ITIL certification.
• Familiarity with GPU health monitoring, firmware, or platform tooling.
• Experience working with colocation facilities.