HPC Sr. Scientific Software Engineer (IT@JH Research Computing)
Johns Hopkins University
IT@JH Research Computing is seeking an HPC Sr. Scientific Software Engineer who will design, build, and support Johns Hopkins University’s high-performance computing and AI research infrastructure. This role integrates elements of both systems and software engineering, ensuring scalable, secure, and reproducible environments for scientific and data-intensive research. The Engineer develops and automates system and application workflows across CPU/GPU clusters, parallel storage, and hybrid cloud platforms. Responsibilities include configuring and optimizing large-scale Linux environments, implementing job scheduling and orchestration frameworks, containerizing applications, and supporting researchers in optimizing performance and reproducibility. Work combines project-based engineering with operational support, requiring both independent problem-solving and close collaboration with the Research Computing team and faculty stakeholders.
Specific Duties & Responsibilities
Software Deployment and Design
- Develop and refine deployment strategies for scientific software on HPC and AI systems.
- Design computational workflows, select optimal software configurations, and use tools like Ansible for automation.
- Assist teams in implementing, tuning, and optimizing AI models and gateway applications (e.g., XDMoD, ColdFront, Open OnDemand, CryoSPARC Live, SBGrid, AI agents).
Performance Optimization
- Analyze and optimize the performance of AI models and HPC applications, focusing on GPU-enabled computing.
- Implement parallel processing, distributed computing, and resource management techniques for efficient job execution.
Integration and Optimization
- Develop, debug, and maintain software tools, libraries, and frameworks supporting HPC and AI workloads.
- Collaborate with the systems team and software vendors (e.g., NVIDIA, Intel, MATLAB) to optimize systems for maximum performance.
- Utilize CUDA, DNN, TensorRT, and Intel Compilers to enhance system performance.
HPC Scientific Software Support
- Manage and support scientific software deployment across HPC, cloud-based, and colocation facilities.
- Oversee installation, configuration, and maintenance of HPC packages with tools like CMake, Make, EasyBuild, Spack, and Lua module files.
Collaboration and Mentorship
- Work closely with cross-functional teams, including researchers, data scientists, and software developers, to address complex HPC/AI challenges.
- Mentor junior engineers and foster a culture of continuous learning.
Technical Support, Training Workshops, and Troubleshooting
- Resolve complex technical issues and perform root cause analysis for HPC/AI software challenges.
- Implement effective solutions to prevent recurrence and improve system reliability.
- Provide training workshops for researchers and students, focusing on troubleshooting, optimizing workflows, and effectively using HPC systems.
Learning and Development
- Stay current with advances in HPC and AI technologies and methodologies.
- Incorporate new research findings into existing systems to improve performance and capabilities.
Container Orchestration
- Develop and manage container orchestration strategies to ensure scalability, reliability, and security of applications.
- Oversee the container lifecycle from creation and deployment to scaling and removal.
Documentation and Compliance
- Create comprehensive documentation for system designs, performance metrics, and project status.
- Ensure compliance with security and regulatory standards for all HPC and AI systems.
In Addition to the Duties Described Above
- Design, deploy, and maintain large-scale Linux HPC clusters with CPU/GPU resources, high-speed networks, and distributed storage.
- Develop and maintain automation frameworks for provisioning, monitoring, and software lifecycle management.
- Implement and optimize job scheduling, container orchestration, and workflow automation tools to support diverse research workloads.
- Collaborate with faculty and research teams to parallelize, containerize, and scale computational workflows for multi-GPU and distributed environments.
- Benchmark and tune application performance across architectures, documenting findings and sharing best practices.
- Integrate and support AI/ML frameworks, scientific libraries, and workflow engines (Snakemake, Nextflow, Dask, Ray).
- Ensure system and application reliability through proactive monitoring (Prometheus, Grafana, ELK) and incident response participation.
- Support reproducibility and FAIR data principles through version-controlled, containerized environments.
- Contribute to documentation, training materials, and technical guidance to enhance user experience and self-service capabilities.
- Participate in evaluation and adoption of new technologies to advance performance, efficiency, and sustainability in research computing.
Minimum Qualifications
- PhD in a quantitative discipline.
- Five years of experience in HPC user support, software deployment, and performance optimization within an academic or research environment.
- Additional education may substitute for required experience, and additional related experience may substitute for required education beyond a high school diploma/graduation equivalent, to the extent permitted by the JHU equivalency formula.
Preferred Qualifications
- Eight or more years of professional experience in high-performance computing, large-scale systems, or research software engineering.
- Deep proficiency in Linux systems administration, performance tuning, and automation tools (Ansible, Terraform, Jenkins, or similar).
- Experience with cluster management, workload schedulers (e.g., Slurm), and distributed or parallel file systems (e.g., GPFS, Lustre, WekaFS, Ceph).
- Strong background in programming or scripting (Python, Bash, C/C++, Go, or Rust).
- Familiarity with containerization and orchestration technologies used in HPC (Singularity, Apptainer, Docker, Kubernetes).
- Understanding of high-speed interconnects (InfiniBand, 100/400 Gb Ethernet) and storage/data access patterns for AI and analytics.
- Experience developing or maintaining CI/CD pipelines and module environments (Lmod/Spack) for research software.
- Knowledge of GPU computing (CUDA, ROCm), MPI/OpenMP, and AI/ML frameworks.
- Demonstrated ability to collaborate with researchers on performance optimization, workflow design, and reproducible computing.
Classified Title: HPC Sr. Scientific Software Engineer
Job Posting Title (Working Title): HPC Sr. Scientific Software Engineer (IT@JH Research Computing)
Role/Level/Range: ATP/04/PG
Starting Salary Range: $99,800 - $175,000 Annually (commensurate with experience)
Employee group: Full Time
Schedule: Mon-Fri, 8:30am-5pm
FLSA Status: Exempt
Location: Johns Hopkins Bayview
Department name: IT@JH Research Computing
Personnel area: University Administration