AI/ML HPC Principal Engineer

Chan Zuckerberg Biohub - San Francisco

The Opportunity

The Chan Zuckerberg Biohub Network has an immediate opening for an AI/ML High Performance Computing (HPC) Principal Engineer. The CZ Biohub Network is composed of several new institutes that the Chan Zuckerberg Initiative created to do great science that cannot be done in conventional environments. The CZ Biohub Network brings together researchers from across disciplines to pursue audacious, important scientific challenges. The Network consists of four institutes throughout the country: San Francisco, Silicon Valley, Chicago, and New York City. Each institute closely collaborates with the major universities in its local area. Along with the world-class engineering team at the Chan Zuckerberg Initiative, the CZ Biohub supports several hundred of the brightest, boldest engineers, data scientists, and biomedical researchers in the country, with the mission of understanding the mysteries of the cell and how cells interact within systems.

The Biohub is expanding its global scientific leadership, particularly in the area of AI/ML, with the acquisition of the largest GPU cluster dedicated to AI for biology. The AI/ML HPC Principal Engineer will be tasked with helping to realize the full potential of this capability, in addition to providing advanced computing capabilities and consulting support to science and technical programs. This position will work closely with many science teams simultaneously to translate experimental descriptions into software and hardware requirements across all phases of the scientific lifecycle, including data ingest, analysis, management and storage, computation, authentication, tool development, and many other computing needs expressed by scientific projects.

This position reports to the Director for Scientific Computing and will be hired at a level commensurate with the skills, knowledge, and abilities of the successful candidate.

What You’ll Do

  • Work with a wide community of scientific disciplinary experts to identify emerging and essential information technology needs and translate those needs into information technology requirements
  • Build an on-prem HPC infrastructure supplemented with cloud computing to support the expanding IT needs of the Biohub
  • Support the efficiency and effectiveness of capabilities for data ingest, data analysis, data management, data storage, computation, identity management, and many other IT needs expressed by scientific projects
  • Plan, organize, track and execute projects
  • Foster cross-domain community and knowledge-sharing between science teams with similar IT challenges
  • Research, evaluate and implement new technologies on a wide range of scientific compute, storage, networking, and data analytics capabilities
  • Promote the use of cloud compute services (primarily AWS and GCP), containerization tools, and similar technologies to scientific clients and research groups, and assist researchers in using them
  • Work on problems of diverse scope where analysis of data requires evaluation of identifiable factors
  • Assist in cost and schedule estimation for the IT needs of scientists, as part of supporting architecture development and scientific program execution
  • Support Machine Learning capability growth at the CZ Biohub
  • Provide scientist support in deployment and maintenance of developed tools
  • Plan and execute all of the above responsibilities independently, with minimal intervention

What You’ll Bring 

Essential –

  • Bachelor’s degree in Biology or Life Sciences is preferred. Degrees in Computer Science, Mathematics, Systems Engineering, or a related field, or equivalent training/experience, are also acceptable.
  • A minimum of 8 years of experience designing and building web-based working projects using modern languages, tools, and frameworks
  • Experience building on-prem HPC infrastructure and capacity planning
  • Experience and expertise working on complex issues where analysis of situations or data requires an in-depth evaluation of variable factors
  • Experience supporting scientific facilities, and prior knowledge of scientific user needs, program management, data management planning or lab-bench IT needs
  • Experience with HPC and cloud computing environments
  • Ability to interact with a variety of technical and scientific personnel with varied academic backgrounds
  • Strong written and verbal communication skills to present and disseminate scientific software developments at group meetings
  • Demonstrated ability to reason clearly about load, latency, bandwidth, performance, reliability, and cost and make sound engineering decisions balancing them
  • Demonstrated ability to quickly and creatively implement novel solutions and ideas

Technical experience includes –

  • Proven ability to analyze, troubleshoot, and resolve complex problems that arise in HPC production compute, interconnect, storage hardware, software systems, and storage subsystems
  • Configuring and administering parallel and network-attached storage (Lustre, GPFS on ESS, NFS, Ceph) and storage subsystems (e.g. IBM, NetApp, DataDirect Networks, LSI, VAST)
  • Installing, configuring, and maintaining job management tools (such as SLURM, Moab, TORQUE, PBS) and implementing fairshare, node sharing, backfill, etc. for compute and GPUs
  • Red Hat Enterprise Linux, CentOS, or derivatives, and Linux services and technologies such as dnsmasq, systemd, LDAP, PAM, sssd, OpenSSH, and cgroups
  • Scripting languages (including Bash, Python, or Perl)
  • OpenACC, nvhpc, and an understanding of CUDA driver compatibility issues
  • Virtualization (ESXi or KVM/libvirt), containerization (Docker or Singularity), configuration management and automation (tools like xCAT, Puppet, Kickstart), and orchestration (Kubernetes, docker-compose, CloudFormation, Terraform)
  • High performance networking technologies (Ethernet and InfiniBand) and hardware (Mellanox and Juniper)
  • Configuring, installing, tuning, and maintaining scientific application software (Modules, Spack)
  • Familiarity with source control tools (Git or SVN)
  • Experience supporting the use of popular ML frameworks such as PyTorch and TensorFlow
  • Familiarity with cybersecurity tools, methodologies, and best practices for protecting systems used for science
  • Experience with the movement, storage, backup, and archiving of large-scale data

Nice to have – 

  • An advanced degree is strongly desired

The Chan Zuckerberg Biohub requires all employees, contractors, and interns, regardless of work location or type of role, to provide proof of full COVID-19 vaccination, including a booster vaccine dose, if eligible, by their start date. Those who are unable to get vaccinated or obtain a booster dose because of a disability, or who choose not to be vaccinated due to a sincerely held religious belief, practice, or observance must have an approved exception prior to their start date.

Compensation 

  • $212,000 – $291,500

New hires are typically hired into the lower portion of the range, enabling employee growth in the range over time. To determine starting pay, we consider multiple job-related factors including a candidate’s skills, education and experience, market demand, business needs, and internal parity. We may also adjust this range in the future based on market data. Your recruiter can share more about the specific pay range during the hiring process.
