Software Development Engineer III - Infrastructure (SDE III)

Valiant Harbor International, LLC

Washington, DC (remote)

JOB DETAILS
SKILLS
Access Control, Application Programming Interface (API), Artificial Intelligence (AI), Authentication, Automation, Budgeting, Coding Standards, Communication Skills, Computer Science, Continuous Deployment/Delivery, Continuous Integration, Cost Control, Customer Relations, Data Management, Debugging Skills, Design Patterns, Programming Methodologies, Distributed Computing, Ecosystems, Engineering, Error Handling, Expense Tracking, Government, Incident Response, Interoperability, MCP (Model Context Protocol), Mentoring, Microsoft Azure, OAuth, On Call, Privacy Controls, Product Shipments, Production Systems, Project/Program Management, Python Programming/Scripting Language, Quality Assurance Methodology, Quality Metrics, Semantic Search, Shallow Parsing, Software Design, Software Development, Software Engineering, Startup, System Integration (SI), Team Lead/Manager, Team Player, Technical Writing, Traffic Shaping, Willing to Travel, Writing Skills
LOCATION
Washington, DC
POSTED
2 days ago
Valiant Harbor International is seeking a Software Development Engineer III – Infrastructure (SDE III) to support the Director’s Office at the Advanced Research Projects Agency for Health (ARPA-H). The candidate will contribute to the General Research Assistant and Content Engine (GRACE) development team in building the next generation of agentic AI to transform how ARPA-H Program Managers accelerate research, make decisions, and ship products at scale. GRACE is ARPA-H’s production AI assistant, and ARPA-H intends to evolve it into an ecosystem of autonomous, multi-agent systems. This is a full-time, remote position. The candidate must be able to travel within the U.S.

Key Responsibilities:
  • Manage end-to-end backend infrastructure for GRACE on Microsoft Azure:
    • Operate Azure Functions, Azure API Management, Azure Container Apps, and Azure OpenAI Service.
    • Manage storage, retrieval pipelines, vector databases, and document indexing that power GRACE's internal knowledge search.
    • Integrate authentication and identity, including ARPA-H Entra ID and application-level access control.
    • Implement and maintain infrastructure as code for all environments.
    • Own CI/CD pipelines, deployment automation, and release processes including canary and gradual rollouts.
    • Be responsible for production system basics (e.g., monitoring, alerting, logging, distributed tracing, SLOs, and incident response runbooks).
    • Manage secrets, API keys, and credential rotation across all integrations with external providers.
    • Control costs across all LLM providers: track spending, set budgets, build guardrails, and optimize cost-per-query without sacrificing quality.
  • Agentic AI and Protocol Infrastructure:  
    • Manage the backend implementation of MCP, including MCP server hosting, tool registration, versioning, and lifecycle management on Azure.
    • Implement and evolve A2A communication patterns to enable GRACE agent interoperability with internal and external systems.
    • Design and maintain LLM orchestration, routing, and multi-model switching infrastructure across OpenAI GPT, Anthropic Claude, and Google Gemini families.
    • Build and operate RAG pipelines: document ingestion, chunking, embedding, and semantic search.
    • Implement robust fallback, retry, circuit-breaker, and graceful degradation patterns for all AI service dependencies.
    • Manage tool-calling infrastructure, including registration, execution, error handling, and observability for all GRACE tools.
  • Manage observability and production quality:  
    • Build and maintain end-to-end observability for agentic workflows: latency, throughput, error rates, token usage, and LLM quality metrics.
    • Implement LLM evaluation pipelines including safety checks, regression monitoring, and grounding assessment.
    • Define and enforce system-level SLOs for availability, response time, and tool call reliability.
    • Manage alerting and on-call runbooks.
  • Collaborate and foster teamwork:
    • Establish and improve coding standards, design review processes, and testing practices.
    • Communicate technical decisions in writing and in conversation to both engineers and non-engineers.
    • Mentor and guide other engineers.
    • Think inventively and consider other perspectives; work backward from the user to understand problems before proposing solutions.
    • Ensure strict privacy, security, and compliance in all systems, integrations, and data handling.

Required Qualifications:
  • Bachelor's or Master's in Computer Science, Software Engineering, or related field, or equivalent practical experience.
  • 7+ years of professional software engineering experience building and operating production systems.
  • Proven experience in high-velocity environments shipping complex products end-to-end.
  • Strong proficiency in backend languages (including Python); familiarity with modern backend frameworks and async patterns.
  • Solid understanding of distributed systems, APIs, data pipelines, and software design patterns.
  • Hands-on experience with Microsoft Azure: Azure Functions, API Management, Container Apps, and Azure OpenAI Service.
  • Experience with containerization, CI/CD, and infrastructure as code.
  • Strong understanding of authentication and identity systems (OAuth2, OIDC, Azure Entra ID or equivalent).
  • Demonstrated experience/ability with production systems (having been on-call, debugged incidents, etc.).
  • Excellent communication and team building skills; focused on making others around them better.

Preferred Qualifications:
  • Hands-on experience building and operating MCP servers in production, including tool registration, versioning, and hosting on Azure Functions or equivalent serverless infrastructure.
  • Experience implementing A2A communication patterns and multi-agent orchestration frameworks.
  • Significant experience building on top of LLMs in production (tool-calling, RAG, multi-step reasoning, multi-model routing, and context window management).
  • Demonstrated treatment of cost-per-query, context budgets, and prompt efficiency as first-class engineering concerns.
  • Experience managing multi-provider LLM integrations, including rate limits, fallback routing, and API versioning.
  • Experience in security-conscious engineering within regulated or government environments.
  • Previous track record in startup or early-stage environments (0-to-1 product building, comfort with ambiguity, and a high sense of urgency).
  • Experience in big tech building customer-facing platforms or developer infrastructure at scale.
  • Familiarity with vector databases, embedding pipelines, and semantic search infrastructure.

Salary Range: Negotiable
 
EEO Statement: Valiant Harbor International, LLC is an Equal Opportunity/Affirmative Action employer. Valiant Harbor International prohibits discrimination with respect to the hiring or promotion of individuals, conditions of employment, disciplinary and discharge practices, or any other aspect of employment on the basis of sex, race, color, age, national origin, religion, disability, marital status, sexual orientation, gender identity, pregnancy, veteran status, or any other protected class.  If you are an individual with a disability and require a reasonable accommodation to complete any part of the application process, or are limited in the ability or unable to access or use this online application process and need an alternative method for applying, you may contact (202) 417-6705 for assistance.

About the Company

Valiant Harbor International, LLC