Server Lab Engineer, ML-IL
Amazon
Software Engineering, Data Science
Tel Aviv-Yafo, Israel
Description
Machine Learning Israel (MLIL), as part of Annapurna Labs / Amazon, is hiring a Lab Engineer to own and operate the labs that power the bring-up and validation of our next-generation ML training and inference racks. In this role you will build, maintain, and continuously evolve the lab infrastructure, from bench setups to server racks, used daily by HW, FW, and SW engineers. You will be the go-to person for delivering working, instrumented setups that the R&D teams can pick up and run with.
Key job responsibilities
• Own the MLIL hardware lab in the Tel Aviv office: physical layout, power and cooling budget, network topology, cabling, asset tracking, and day-to-day operations.
• Build, configure, and connect new lab setups for HW, FW, and SW engineers (including servers, GPU sleds, PCIe switches, retimers, NICs, and DRAM modules) and deliver them ready for R&D use.
• Administer and maintain Linux-based servers and systems, including installation, configuration, and optimization.
• Manage and configure network services such as DHCP, PXE, and other critical infrastructure components.
• Run sanity tests on every delivered setup — boot, PCIe enumeration, basic DRAM check, network reachability — so R&D teams pick up a known-good baseline and can focus on their work.
• Write and maintain automation scripts (Python / Bash) for repetitive lab tasks — power cycling, log collection, provisioning, imaging, test-harness setup.
• Procure, inventory, and manage lab equipment: bench PSUs, scopes, protocol analyzers, thermal chambers, JTAG debuggers, cables, and fixtures.
• Triage lab-level issues (power, network, cabling, imaging) to unblock R&D fast; escalate deep HW / FW / SW debug (e.g., RDMA / GPU / EFA internals) to the relevant specialist teams.