Software Development Engineer 3

Talent Software Services

Redmond, WA

JOB DETAILS

SALARY

$40–$50 Per Hour

JOB TYPE

Full-time, Employee

SKILLS

Analysis Skills, Automation, Business Operations, Cloud Computing, Communication Skills, Computer Engineering, Computer Maintenance, Computer Science, Continuous Deployment/Delivery, Continuous Integration, Data Collection, Debugging Skills, DevOps, Documentation, Environmental Health, Hardware Configuration Management, Hardware-Software Integration, Hybrid Cloud, Identify Issues, Incident Management, Incident Response, Information Technology & Information Systems, Leadership, Metrics, Operational Audit, Operational Communications, Operational Improvement, Operational Support, Operations Processes, Problem Solving Skills, Product Development, Product Support, Prototyping, Quality Assurance, Quality Metrics, Record Keeping, Reliability Analysis, Reliability Engineering, Reporting Dashboards, Risk Analysis, Sales, Software Administration, Software Development, Software Engineering, Software Testing, Software Validation, Systems Analysis, Team Player, Technical Delivery, Technical Leadership, Technical Operations, Telemetry, Testing, Validation Documentation, Validation Testing

LOCATION

Redmond, WA

POSTED

4 days ago

Typical Day in the Role
• Purpose of the Team: The team supports a confidential company's device project by ensuring software quality and stability during the internal self-host program.

• Key projects: This role will contribute to monitoring device health through telemetry dashboards, investigating issues, assigning bugs, and gathering logs (including hands-on device support) to ensure stability in production.

Candidate Requirements
• Best vs. Average: The ideal resume would show experience with Android (mobile OS). A minimum of 5-7 years of experience is required for the role, and more experience would be a bonus. The candidate should be able to work independently.

# Senior Operations / Reliability Engineer

## Summary

We are seeking a **Senior Operations / Reliability Engineer** to support live operations, service reliability, release stability, and prototype device monitoring for a new hardware and software product. This role will focus on monitoring telemetry, diagnosing live issues, validating software releases, supporting incident response, and helping improve operational readiness across services, applications, and prototype device environments.

This is an engineering-oriented operations role. The ideal candidate will be comfortable working with logs, dashboards, alerts, deployment signals, and live system behavior, while partnering closely with software engineers, QA, infrastructure teams, PMs, and product leadership.

The role will be strongly supported by experienced engineers on the team, who will provide technical guidance on service architecture, prototype device workflows, telemetry interpretation, release processes, and complex debugging. The engineer will collaborate closely with these senior team members while taking ownership of day-to-day monitoring, release validation, live issue triage, documentation, and operational reporting.

# Scope of Work & Responsibilities
## Live Monitoring & Telemetry

* Monitor telemetry from services, applications, and prototype devices to assess operational health.
* Observe dashboards, alerts, logs, and metrics to identify anomalies, failures, performance degradation, or emerging reliability risks.
* Analyze real-time metrics and logs to support troubleshooting across cloud, on-premises, and prototype device environments.
* Triage operational issues and communicate findings clearly to engineering, QA, PM, and product teams.
* Provide actionable insights based on telemetry trends, system behavior, and recurring failure patterns.
* Help improve monitoring coverage, alert quality, dashboard usefulness, and operational visibility.

## Release & Service Operations

* Support software releases by validating deployments, monitoring live systems, and assessing post-deployment stability.
* Track service health during rollouts, ring deployments, updates, and release validation windows.
* Identify, debug, and help resolve live issues affecting services, devices, internal users, or product readiness.
* Partner with engineering teams to support mitigations, fixes, rollbacks, or follow-up validation.
* Assist with post-release verification and stabilization reporting.
* Document release observations, risks, incidents, and readiness concerns.

## Incident Response & Reliability Support

* Support incident response by gathering data, summarizing impact, identifying suspected causes, and tracking mitigation progress.
* Participate in post-incident reviews and help document lessons learned.
* Recommend improvements to monitoring, alerting, operational procedures, and service reliability practices.
* Maintain clear records of incidents, recurring issues, known risks, and follow-up actions.
* Help reduce operational toil by identifying repeatable troubleshooting steps, documentation gaps, and automation opportunities.

## On-Site Hardware & Environment Support

* Perform in-person troubleshooting for self-hosted systems, prototype devices, or test environments when telemetry or dashboards indicate issues.
* Assist with device configuration, deployment, validation, and live verification.
* Run smoke checks or readiness checks to confirm device, service, and environment health.
* Maintain documentation of hardware configurations, operational procedures, environment setup, and observed issues.
* Coordinate with engineering and infrastructure teams to resolve environment or device-level reliability problems.

## Collaboration & Communication

* Work closely with software, QA, infrastructure, PM, and product teams to support operational readiness and release reliability.
* Communicate operational status, risks, and technical findings clearly and promptly.
* Provide concise summaries of system health, release readiness, incident status, and recommended next steps.
* Operate independently on assigned areas while escalating appropriately when issues require deeper engineering involvement.

---

# Deliverables

* Real-time telemetry dashboards, monitoring views, and actionable alerting improvements.
* Release verification and stabilization reports.
* Incident reports, issue summaries, and operational analysis for live events.
* Documentation of hardware configurations, device workflows, operational procedures, and troubleshooting steps.
* Service health summaries, risk assessments, and recommendations for reliability improvements.
* Clear communication of live issues, suspected causes, mitigation status, and follow-up actions.
* Recommendations for improving monitoring, alerting, release validation, and operational readiness.

---

# Qualifications

* Bachelor's degree in Computer Science, Computer Engineering, Software Engineering, or a related technical field, or equivalent practical experience.
* 5–7 years of relevant experience in software engineering, DevOps, SRE, production operations, infrastructure, service reliability, or related technical operations roles.
* Experience monitoring live services, applications, infrastructure, or device environments.
* Experience using dashboards, alerts, logs, metrics, and telemetry to diagnose system health and troubleshoot issues.
* Experience supporting software releases, deployments, production validation, or service rollouts.
* Ability to investigate technical issues, summarize findings, and communicate risks clearly to engineering and product teams.
* Experience documenting incidents, operational procedures, known issues, and troubleshooting steps.
* Familiarity with CI/CD workflows, cloud or hybrid infrastructure, release validation, and incident response practices.
* Strong problem-solving skills, communication skills, and ability to work independently in a fast-moving engineering environment.

Explain a typical day in the role:
A typical day may include reviewing dashboards and alerts, checking telemetry from recent builds or deployments, investigating anomalies, and summarizing operational health for the team. The engineer may help validate a software rollout, monitor ring deployments, troubleshoot prototype device issues on-site, or gather logs and metrics for an active investigation. They will work closely with engineering, QA, PM, and infrastructure teams to communicate issues, document findings, verify fixes, and identify improvements to monitoring, alerting, and reliability practices.

What is the ideal background of a candidate for this role?
The ideal candidate will have a software engineering, DevOps, SRE, production engineering, service operations, or infrastructure background. They should be comfortable diagnosing live system issues, interpreting logs and telemetry, validating deployments, and working closely with engineering teams to resolve reliability problems.

A strong candidate will have experience with operational monitoring, alerting systems, cloud or hybrid environments, CI/CD or release workflows, incident response, and technical troubleshooting. Experience with prototype devices, hardware/software integration, or on-site lab environments is a strong plus.

What are the unique selling points that would get candidates interested in your role over another?
This role offers the opportunity to work on an interesting new hardware product and the software and services that support it. Candidates will gain hands-on exposure to prototype devices, live telemetry, release operations, and real-world reliability challenges. The role is supported by a strong engineering team with experienced technical leaders, a collaborative culture, and meaningful opportunities to improve operational practices for a developing product area.

How will contractor performance be measured?
Contractor performance will be measured by the quality and timeliness of operational monitoring, issue triage, release validation, incident documentation, and reliability recommendations. Success will also be evaluated based on the contractor's ability to identify meaningful risks, communicate findings clearly, support live issue resolution, maintain accurate operational documentation, and collaborate effectively with engineering, QA, PM, infrastructure, and product teams.

Top 3 Must-Have HARD Skills & years of experience for each:
1. **Software engineering, DevOps, SRE, or production operations experience** — 5+ years
2. **Monitoring, telemetry analysis, logging, and live issue troubleshooting** — 3+ years
3. **Ability to independently drive technical work and deliver operational value** — 3+ years

About the Company

Talent Software Services
