Lead Site Reliability Engineer (SRE)

Macpower Digital Assets Edge Private Limited

Rockville, MD

JOB DETAILS
SKILLS
AWS Lambda, Access Control, Amazon CloudFront, Amazon Elastic Compute Cloud (EC2), Amazon Simple Storage Service (S3), Amazon Web Services (AWS), Analysis Skills, Ansible, Automation, Business Continuity Planning (BCP), Capacity Management, Chef (Configuration Management), Cloud Architecture, Cloud Computing, Communication Skills, Computer Science, Continuous Deployment/Delivery, Continuous Improvement, Continuous Integration, Contract Negotiation, DHCP (Dynamic Host Configuration Protocol), DNS (Domain Name System), DevOps, Disaster Recovery, Due Diligence, Electrical Engineering, Git, HIPAA (Health Insurance Portability and Accountability Act), High Availability, High Reliability, IT Service Management (ITSM), ITIL (IT Infrastructure Library), Incident Management, Information Technology & Information Systems, Jenkins, Leadership, Linux Operating System, Local Area Network (LAN), Maintain Compliance, Mentoring, Microsoft Active Directory, Microsoft Windows Azure, Microsoft Windows System Administration, Operational Improvement, Operations, Operations Management, Operations Processes, Performance Analysis, Problem Solving Skills, Process Improvement, Public Key Infrastructure (PKI), Puppet (Configuration Management), Python Programming/Scripting Language, Regulatory Compliance, Reliability Engineering, Root Cause Analysis, Scripting (Scripting Languages), Secure/SSH File Transfer Protocol (SFTP), Single Sign-On (SSO), Software Engineering, Software Patches, Source Code/Configuration Management (SCM), Standard Operating Procedures (SOP), Systems Administration/Management, Team Lead/Manager, Test Plan/Schedule, Time Management, VMWare, Vendor/Supplier Management, Wide Area Network (WAN), Windows PowerShell
LOCATION
Rockville, MD
POSTED
12 days ago
Note: This is a fully hands-on role. Architect-level applicants will not be considered.
Key Focus Areas:
  • Manage and optimize AWS Control Tower, AWS Organizations policies, and multi-account environments.
  • Oversee AWS backups, SSM patching, AMI deployments, and configuration pushes across multiple accounts.
  • Manage and maintain core AWS services including EC2, ECS, EKS, RDS, S3, SageMaker, CloudFront, and Lambda.
  • Implement S3, SFTP, and site externalization methods.
  • Develop Infrastructure as Code (IaC) using Terraform, CloudFormation, and Python.
  • Manage IAM policies, access controls, and permissions.
Core Responsibilities:
  • Manage and maintain cloud infrastructure to ensure high availability, reliability, and performance.
  • Serve as the primary escalation point for all cloud infrastructure issues.
  • Monitor cloud resource performance and cost efficiency.
  • Lead major incident management and communicate timely updates to stakeholders.
  • Perform due diligence and impact analysis before implementing changes to cloud platforms.
  • Lead and mentor a team of cloud engineers to ensure performance and collaboration.
  • Manage daily operations and ensure alignment with organizational objectives.
  • Develop and implement incident management processes and conduct root cause analysis.
  • Identify and automate repetitive infrastructure tasks using IaC principles.
  • Continuously improve operational processes and standard operating procedures.
  • Implement and enforce security controls, ensuring compliance with standards such as GDPR and HIPAA.
  • Monitor cloud usage and conduct capacity planning to balance efficiency and scalability.
  • Develop and test disaster recovery and business continuity plans.
  • Collaborate with IT, business units, and vendors to deliver scalable cloud solutions.
  • Document cloud configurations, processes, and reports, ensuring accessibility and version control.
Technical Skills:
  • Proficiency in AWS (EC2, ECS, EKS, RDS, S3, Lambda, SageMaker, CloudFront).
  • Experience with Azure and OCI cloud environments.
  • Experience with Infrastructure as Code and configuration management tools (Terraform, CloudFormation, Ansible, Puppet, Chef).
  • Scripting in Python and PowerShell.
  • Strong understanding of cloud architecture, monitoring, and automation tools.
  • System administration experience (Windows, Linux, VMware, Active Directory, Azure AD SSO).
  • Strong networking knowledge (DNS, DHCP, PKI, LAN/WAN).
Leadership and Behavioral Skills:
  • Demonstrated experience in leading teams and managing cloud operations.
  • Strong communication and stakeholder management across technical and business functions.
  • Proactive problem-solver with excellent analytical and root cause analysis skills.
  • Self-motivated with a continuous improvement mindset.
  • Experienced in vendor management and contract negotiations.
Basic Qualifications:
  • Bachelor's degree in Computer Science, Information Technology, Electrical Engineering, or equivalent.
  • Experience in cloud operations and team leadership in technical environments.
Preferred Certifications and Experience:
  • AWS Certified Solutions Architect Associate or Professional.
  • Microsoft Certified: Azure Solutions Architect Expert.
  • Familiarity with DevOps practices and tools (CI/CD, Jenkins, Git).
  • Experience with ITIL or ITSM frameworks.

About the Company

Macpower Digital Assets Edge Private Limited