Senior HPC Architect

San Jose, California

Open to Remote

Direct Hire

$165k - $175k

An industry leader in the chip space is seeking a highly skilled HPC Queue Architect to design, implement, and optimize its high-performance computing (HPC) queuing systems. The ideal candidate will have a deep understanding of HPC architecture and workload management, and will play a critical role in ensuring efficient resource allocation and job scheduling across the company's HPC infrastructure.

Key Responsibilities:

  • Queue Management:

    • Design and implement queuing systems for optimal workload management and resource allocation.
    • Monitor and analyze queue performance, identifying bottlenecks and proposing improvements.
  • System Architecture:

    • Collaborate with HPC engineers to develop and maintain the overall architecture of the HPC environment.
    • Ensure that the queuing system integrates seamlessly with HPC resources, storage, and networking components.
  • Job Scheduling:

    • Develop and manage job scheduling policies to optimize resource utilization and minimize job wait times.
    • Implement and configure scheduling software (e.g., Slurm, PBS, TORQUE) to meet the needs of diverse workloads.
  • Performance Tuning:

    • Conduct performance analysis and benchmarking of queuing systems and HPC resources.
    • Provide recommendations for hardware and software upgrades to enhance system performance.
  • Collaboration and Support:

    • Work closely with researchers and users to understand their HPC needs and provide support for job submissions and troubleshooting.
    • Develop and maintain documentation for queuing systems and job scheduling processes.

Qualifications:

  • Education:

    • Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience).
  • Experience:

    • 3+ years of experience in HPC architecture, job scheduling, and workload management.
    • Hands-on experience with queuing systems and job schedulers in HPC environments.
  • Skills:

    • Strong understanding of HPC hardware, networking, and storage technologies.
    • Proficiency in scripting languages (e.g., Python, Bash) for automation and performance analysis.
    • Familiarity with cluster management tools and performance monitoring software.
  • Certifications:

    • Relevant certifications in HPC or cloud computing are a plus.

Personal Attributes:

  • Strong analytical and problem-solving skills.
  • Excellent communication and interpersonal abilities.
  • Ability to work collaboratively in a fast-paced, team-oriented environment.
  • Detail-oriented with a focus on optimizing performance and user experience.

What We Offer:

  • Competitive salary and comprehensive benefits package.
  • Opportunities for professional development and continuing education.
  • A collaborative and innovative work environment.

Posted by: Scott Brosnan

Specialization: Linux / Unix