AnitaB.org Talent Network

Connecting women in tech with the best professional opportunities!

Master Principal Cloud Engineer - GPU & AI Infrastructure

Oracle

Software Engineering, Other Engineering, Data Science
Shanghai, China · Beijing, China · Guangzhou, Guangdong, China · Shenzhen, Guangdong, China · China
Posted on Jan 27, 2026

Position Overview

As a GPU Specialist Cloud Engineer (CE) within the Oracle Cloud Infrastructure (OCI) Pre-Sales organization, you will serve as the primary technical authority for high-performance computing (HPC) and Artificial Intelligence infrastructure. You are not just a generalist; you are the bridge between complex silicon capabilities and transformative business outcomes.

You will partner with Enterprise Sales teams to lead the technical discovery, architectural design, and proof-of-concept (PoC) execution for customers building the next generation of Large Language Models (LLMs), generative AI applications, and computationally intensive simulations. This role requires a deep understanding of NVIDIA/AMD hardware stacks, RDMA networking, and the software orchestration layers that make massive-scale GPU clusters hum.

Core Responsibilities

1. Strategic Technical Advisory

  • Architectural Design: Design end-to-end AI infrastructure solutions on OCI, focusing on Superclusters that leverage NVIDIA H200/B300/GB300 or AMD Instinct™ accelerators.
  • Optimization: Advise customers on right-sizing GPU shapes based on workload requirements (e.g., training vs. inference, FP8 vs. FP16 precision); a short precision sketch follows this list.
  • Networking Excellence: Design high-throughput, low-latency interconnect fabrics using RoCE v2 (RDMA over Converged Ethernet) and OCI’s non-blocking leaf-spine architecture.
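
To make the precision point concrete, here is a minimal PyTorch sketch contrasting an FP16 training-style pass with a BF16 inference-style pass. The model and tensor shapes are illustrative placeholders, not a recommended configuration; FP8 additionally requires hardware and library support (for example, NVIDIA's Transformer Engine), so it is not shown.

```python
# Minimal sketch, assuming a CUDA device with Tensor Cores; the model,
# sizes, and loss are placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096)).cuda()
x = torch.randn(64, 4096, device="cuda")

# Training-style pass: FP16 autocast plus loss scaling to guard against
# gradient underflow (optimizer and scaler.step() are omitted here).
scaler = torch.cuda.amp.GradScaler()
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = model(x).square().mean()
scaler.scale(loss).backward()

# Inference-style pass: BF16 keeps FP32's exponent range, so no scaler
# is needed.
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    y = model(x)
```

The trade-off in one line: FP16 needs loss scaling because its narrow exponent range underflows small gradients, while BF16 keeps FP32's range at the cost of mantissa precision.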

2. Hands-on Execution & Validation

  • Proof of Concept (PoC): Lead deep-dive technical evaluations, demonstrating OCI’s superior price-performance ratios for model training and fine-tuning.
  • Stack Integration: Assist customers in deploying and optimizing the NVIDIA AI Enterprise stack, Triton Inference Server, and NeMo Framework on OCI.
  • Performance Tuning: Work directly with engineering teams to troubleshoot bottlenecks, whether they reside in the kernel, the NCCL (NVIDIA Collective Communications Library) configuration, or storage IOPS; a micro-benchmark sketch follows this list.
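
The micro-benchmark referenced above, as a hedged sketch: time NCCL all_reduce across ranks to see whether the interconnect, rather than compute or storage, is the limiting factor. The payload size, iteration counts, and launch command are illustrative; setting NCCL_DEBUG=INFO in the environment additionally surfaces NCCL's transport and topology choices.

```python
# Minimal sketch: timing NCCL all_reduce to isolate interconnect bottlenecks.
# Illustrative launch (script name assumed): torchrun --nproc_per_node=8 bench.py
import os
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")  # RANK/WORLD_SIZE come from torchrun
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

tensor = torch.ones(64 * 1024 * 1024, device="cuda")  # 64Mi floats = 256 MiB

# Warm up so CUDA context and NCCL channel setup don't skew the timing.
for _ in range(5):
    dist.all_reduce(tensor)
torch.cuda.synchronize()

start = time.perf_counter()
for _ in range(20):
    dist.all_reduce(tensor)
torch.cuda.synchronize()
elapsed = (time.perf_counter() - start) / 20

if dist.get_rank() == 0:
    # Ring all-reduce bus bandwidth estimate: 2*(n-1)/n * bytes / time.
    n = dist.get_world_size()
    busbw = 2 * (n - 1) / n * tensor.numel() * 4 / elapsed / 1e9
    print(f"all_reduce avg {elapsed * 1e3:.2f} ms, ~{busbw:.1f} GB/s bus bandwidth")

dist.destroy_process_group()
```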

3. Thought Leadership & Enablement

  • Content Creation: Develop whitepapers, reference architectures, and blog posts detailing OCI’s competitive advantages in the AI sovereign cloud and private AI spaces.
  • Market Intelligence: Stay ahead of the evolving landscape of AI accelerators, interconnects (InfiniBand vs. Ethernet), and distributed training frameworks (PyTorch, JAX, DeepSpeed); a minimal framework sketch follows.
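
As a reference point for the frameworks listed above, a minimal PyTorch DistributedDataParallel (DDP) loop, the data-parallel baseline that libraries such as DeepSpeed extend. The model, batch shape, and hyperparameters are placeholders; it assumes a torchrun launch so LOCAL_RANK is set.

```python
# Minimal DDP sketch; launch with torchrun so the process group env vars exist.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda()       # placeholder model
ddp_model = DDP(model, device_ids=[local_rank])  # grads sync via NCCL all-reduce
optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

for step in range(10):
    x = torch.randn(32, 1024, device="cuda")     # placeholder batch
    loss = ddp_model(x).square().mean()
    optimizer.zero_grad()
    loss.backward()   # DDP overlaps gradient all-reduce with backprop
    optimizer.step()

dist.destroy_process_group()
```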

As a world leader in cloud solutions, Oracle uses tomorrow’s technology to tackle today’s challenges. We’ve partnered with industry leaders in almost every sector, and we continue to thrive after 40+ years of change by operating with integrity.

We know that true innovation starts when everyone is empowered to contribute. That’s why we’re committed to growing an inclusive workforce that promotes opportunities for all.

Oracle careers open the door to global opportunities where work-life balance flourishes. We offer competitive benefits based on parity and consistency and support our people with flexible medical, life insurance, and retirement options. We also encourage employees to give back to their communities through our volunteer programs.

We’re committed to including people with disabilities at all stages of the employment process. If you require accessibility assistance or accommodation for a disability at any point, let us know by emailing accommodation-request_mb@oracle.com or by calling +1 888 404 2494 in the United States.

Oracle is an Equal Employment Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability and protected veterans’ status, or any other characteristic protected by law. Oracle will consider for employment qualified applicants with arrest and conviction records pursuant to applicable law.


Career Level - IC5


Required Technical Competencies

  • GPU Architecture: Deep knowledge of CUDA cores, Tensor Cores, HBM3 memory, and NVLink/NVSwitch topologies (an introspection sketch follows this list).
  • Networking: Mastery of RDMA, RoCE, and high-speed fabric management for multi-node distributed training.
  • Storage: Experience with high-performance parallel file systems such as Lustre, Weka, or OCI’s High-Performance Storage for feeding data to GPUs at scale.
  • Orchestration: Proficiency in Kubernetes (OKE) for AI, Slurm for batch job scheduling, and the NVIDIA GPU Operator.
  • AI Frameworks: Hands-on experience with PyTorch, TensorFlow, and libraries for distributed computing such as Megatron-LM.
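
A small introspection sketch for the GPU Architecture item above, using only torch.cuda calls from PyTorch's public API. It assumes a machine with at least one CUDA device; peer access is a coarse signal and does not by itself distinguish NVLink from PCIe P2P.

```python
# Minimal sketch: inspect device memory, SM count, and peer reachability
# before sizing a workload. Assumes at least one CUDA device is present.
import torch

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, "
          f"{props.total_memory / 2**30:.0f} GiB device memory, "
          f"{props.multi_processor_count} SMs, "
          f"compute capability {props.major}.{props.minor}")

# Peer access indicates a direct GPU-to-GPU path (NVLink or PCIe P2P).
if torch.cuda.device_count() > 1:
    print("GPU0 -> GPU1 peer access:", torch.cuda.can_device_access_peer(0, 1))
```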

Candidate Qualifications

  • Education: Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, or a related quantitative field.
  • Experience: 10+ years in Pre-Sales Engineering, Systems Architecture, or HPC. At least 3 years specifically focused on GPU-accelerated computing.
  • The OCI edge: Familiarity with OCI’s off-box virtualization and how it enables bare-metal performance in a cloud environment.
  • Communication: Able to explain the difference between latency and throughput to a CTO, and to debug a Python script alongside a Data Scientist; a toy sketch of that trade-off follows.
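
On the latency-versus-throughput point, a toy, CPU-only sketch of the trade-off: batching raises throughput (items per second) but each request waits longer. The synthetic workload and constants stand in for a real inference call.

```python
# Toy sketch: larger batches improve throughput but increase per-request latency.
import time

def process(batch):
    # Stand-in for an inference call whose cost grows with batch size.
    total = 0.0
    for item in batch:
        for _ in range(20_000):
            total += item * 1.0000001
    return total

for batch_size in (1, 8, 64):
    batch = list(range(batch_size))
    start = time.perf_counter()
    process(batch)
    latency = time.perf_counter() - start
    print(f"batch={batch_size:3d}  latency={latency * 1e3:7.2f} ms  "
          f"throughput={batch_size / latency:8.1f} items/s")
```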