CDAO Advana - Site Reliability Engineering Lead - Model Serving
Join GDIT and be a part of the team of men and women who solve some of the world's most complex technical challenges. The CDAO Advana team is seeking a Site Reliability Engineering Lead - Model Serving to join their efforts in the DC area. Advana is the Chief Digital and Artificial Intelligence Office's (CDAO) enterprise-wide, multi-domain data, analytics, and artificial intelligence (AI) platform that provides all DoD military and civilian decision makers, analysts, and builders with unprecedented access to enterprise data, tools, and capabilities. This is a proposal with award expected in June 2026. If interested, please apply, as we are interviewing and making contingent offers now.
Duties include:
Owns production reliability strategy for artificial intelligence and machine learning model serving across Advana enclaves supporting Department of Defense missions, Joint Staff analysts, Combatant Command elements, and Senior Executive Service leadership. Defines service-level objectives, alerting philosophy, operational runbooks, and release safety patterns governing production deployment of model artifacts across multiple security domains. Establishes reliability governance across serving surfaces by developing operational standards, on-call expectations, escalation pathways, and incident response patterns aligned with enterprise DevSecOps practices. Implements reliability engineering methodologies using Kubernetes, Prometheus, Grafana, Elastic Stack, GitLab Continuous Integration, VMware environments, and hardened deployment pipelines to maintain operational stability, mission assurance posture, and cross-domain readiness. Develops automated reliability checks integrated into deployment workflows to validate performance, latency, availability, and operational suitability of production-ready models.
Leads coordination with Platform One, Cloud One, multinational engineering teams, and cross-service mission partners to align reliability strategy with evolving architectures, security requirements, and mission priorities. Produces mission-critical deliverables including service-level objective documentation, alerting configurations, operational runbooks, reliability scorecards, incident post-action reports, and release safety assessments. Strengthens program value by advancing operational readiness, reducing mission risk, and reinforcing deployment consistency across all enclaves. Supports Tier 4 incident response actions by maintaining authoritative reliability artifacts required for rapid triage, operational continuity, and sustained mission performance.
Basic Qualifications:
- BS degree; additional years of experience may be considered in lieu of degree
- 8+ years of experience developing reliability strategy
- AI and machine learning experience
- CompTIA Security+
- TS with SCI eligibility
WHAT CAN GDIT OFFER YOU?
- Excellent customizable health benefits (Medical, Dental and Vision)
- 401(k) with company match
- Educational Assistance and eLearning
- Flexible work week
- Internal mobility team dedicated to employee advancement
- Rewards and Recognition programs
- Innovative and collaborative environment that encourages highly motivated critical thinking