HPC Technical Lead
GDIT is looking for an HPC Technical Lead that will provide deep technical expertise and leadership for the WCOSS production environment, ensuring stability, performance, and scalability of NOAAs operational HPC systems. This role focuses on technical execution, troubleshooting, and guiding the team in implementing best practices for HPC operations. Key Responsibilities - Technical Leadership
- Serve as the primary technical authority for all HPC operational matters.
- Lead and drive rootcause analysis and problem resolution for critical incidents across compute, storage, and interconnect components.
- System Performance & Reliability
- Monitor, analyze, and optimize performance across compute nodes, interconnects, and storage subsystems.
- Conduct proactive health checks and performance tuning to ensure system readiness for 247 missioncritical NOAA workloads.
- Change & Configuration Management
- Lead technical planning and readiness reviews for all system upgrades, patches, and enhancements.
- Maintain configuration baselines and ensure compliance with security requirements (e.g., RMF/STIG).
- Collaboration & Customer Technical Interface
- Act as the primary technical point of contact for NWS/NOAA for detailed HPC discussions, system behavior, and operational issues.
- Coordinate with NOAA scientific teams to understand modeling workload needs and optimize HPC resources accordingly.
Required Skills - 10+ years of handson HPC systems administration and troubleshooting experience, including Cray, SGI, or comparable largescale systems.
- Extensive experience supporting Federal HPC environments, demonstrating readiness for NOAA/NWS operational environments.
- Deep Linux expertise, including SLES, RHEL, and CentOS across multiple HPC platforms.
- Strong technical experience with HPC storage (e.g., Lustre), interconnects (e.g., InfiniBand), and performance tuning of largescale computing systems.
- Proven leadership of HPC technical teams, including mentoring and directing system administrators and engineers supporting very large core.
- Demonstrated success performing root cause analysis, escalated troubleshooting, and incident recovery in production HPC environments.
- Experience implementing STIG/RMF security controls across HPC systems and applying DoDgrade configuration compliance.
- Excellent communication skills, capable of translating complex technical issues to customers and stakeholders.
Preferred Qualifications - Prior experience supporting NOAA/NWS operational HPC systems, especially in realtime weather or climate modeling environments.
- Experience designing or improving HPC system architectures, monitoring frameworks, or performance analysis pipelines.
- Experience presenting technical findings to scientific, engineering, or federal customer groups.
|