Senior HPC Architect
San Jose, California
Open to Remote
Direct Hire
$165k - $175k
An industry leader in the chip space is seeking a highly skilled HPC Queue Architect to design, implement, and optimize its high-performance computing (HPC) queuing systems. The ideal candidate will have a deep understanding of HPC architecture and workload management and will play a critical role in ensuring efficient resource allocation and job scheduling across the company's HPC infrastructure.
Key Responsibilities:
- Queue Management:
  - Design and implement queuing systems for optimal workload management and resource allocation.
  - Monitor and analyze queue performance, identifying bottlenecks and proposing improvements.
- System Architecture:
  - Collaborate with HPC engineers to develop and maintain the overall architecture of the HPC environment.
  - Ensure that the queuing system integrates seamlessly with HPC resources, storage, and networking components.
- Job Scheduling:
  - Develop and manage job scheduling policies to optimize resource utilization and minimize job wait times.
  - Implement and configure scheduling software (e.g., Slurm, PBS, Torque) to meet the needs of diverse workloads.
- Performance Tuning:
  - Conduct performance analysis and benchmarking of queuing systems and HPC resources.
  - Provide recommendations for hardware and software upgrades to enhance system performance.
- Collaboration and Support:
  - Work closely with researchers and users to understand their HPC needs and provide support for job submissions and troubleshooting.
  - Develop and maintain documentation for queuing systems and job scheduling processes.
Qualifications:
- Education:
  - Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience).
- Experience:
  - 3+ years of experience in HPC architecture, job scheduling, and workload management.
  - Hands-on experience with queuing systems and job schedulers in HPC environments.
- Skills:
  - Strong understanding of HPC hardware, networking, and storage technologies.
  - Proficiency in scripting languages (e.g., Python, Bash) for automation and performance analysis.
  - Familiarity with cluster management tools and performance monitoring software.
- Certifications:
  - Relevant certifications in HPC or cloud computing are a plus.
Personal Attributes:
- Strong analytical and problem-solving skills.
- Excellent communication and interpersonal abilities.
- Ability to work collaboratively in a fast-paced, team-oriented environment.
- Detail-oriented with a focus on optimizing performance and user experience.
What We Offer:
- Competitive salary and comprehensive benefits package.
- Opportunities for professional development and continuing education.
- A collaborative and innovative work environment.