Data Engineer

Data Engineer – Biological Data Pipelines & AI Systems

Location: Berkeley, CA (Hybrid)

Compensation: $110,000 - $140,000 + Equity

About Shiru

Shiru is building the future of ingredient discovery and operating at the cutting edge of applied AI, from our core R&D differentiation to company operations. We’re a lean, fast-moving deep tech startup that achieves outsized impact with a small, high-performing team. As we scale, we’re looking for a talented data engineer to help us push the boundaries of what’s possible in biological data and AI by building the infrastructure that powers next-generation ingredient discovery.

The Opportunity

As a Data Engineer at Shiru, you’ll be at the heart of our mission: building the data infrastructure that powers the future of ingredient discovery. You’ll architect and scale advanced biological data pipelines, transforming massive protein and nucleotide datasets into actionable insights that drive scientific breakthroughs. From designing robust APIs and automated workflows to deploying AI systems such as RAG pipelines, alongside LIMS integrations, you’ll enable our team to move faster and smarter than ever before.

This is a unique opportunity for a data engineer who thrives at the intersection of bioinformatics, large-scale data processing, and AI innovation. If you’re passionate about building systems that accelerate discovery and want to shape how technology transforms food and sustainability, we’d love to meet you.

What You’ll Do:

Data Engineering & Pipelines

Design, implement, and maintain scalable biological data pipelines for processing protein and nucleotide datasets.

Develop robust workflows for ingestion, transformation, and indexing of biological datasets (e.g., UniProt, UniParc, BFD, PDB, GenBank).

Build automated data validation and quality control pipelines.

Optimize workflows for high-performance computing and large-scale data processing.

Workflow Orchestration

Develop and maintain pipeline orchestration systems using Dagster.

Implement reliable data lineage, observability, and reproducibility within workflows.

Monitor and improve pipeline reliability and runtime performance.

Database & Data Infrastructure

Design, maintain, and optimize large-scale biological databases.

Manage relational and non-relational data systems.

Implement data versioning, indexing, and efficient query strategies.

AI Data Infrastructure

Design and maintain RAG (Retrieval-Augmented Generation) pipelines.

Build infrastructure for embedding generation, vector storage, and retrieval systems.

Integrate AI systems with biological data pipelines.

Systems Integration

Deploy and integrate Laboratory Information Management Systems (LIMS).

Develop APIs to expose biological datasets and AI services.

Ensure interoperability across research tools and data platforms.

Infrastructure & Cost Optimization

Deploy and maintain services on cloud platforms (AWS, GCP, or Azure).

Optimize compute, storage, and workflow costs.

Manage containerized deployments and distributed systems.

Engineering Best Practices

Implement robust CI/CD pipelines and testing frameworks.

Apply strong automation, monitoring, and alerting practices.

Maintain high-quality documentation and reproducible environments.

Requirements:

PhD or Master's degree (in the final year or about to finish) in CS, CE, ML, Statistics, Mathematics, or related fields.

Experience processing large-scale biological datasets such as UniProt, UniParc, BFD, PDB, and GenBank.

Strong experience in data pipeline development and orchestration.

Experience with large-scale data processing and distributed systems.

Experience designing data schemas for biological or scientific datasets.

Strong software engineering practices: hypothesis-driven development; unit, integration, and behavioral testing; CI/CD workflows; and technical debt management.

Experience deploying systems to cloud environments and containerized infrastructure.

Technical Skills

Programming: Python, C++, Java, SQL

Data Engineering: Dagster, Apache Airflow, Apache Spark, Dask, Ray, Prefect, dbt, Pandas, Polars, Apache Arrow, Parquet

Databases & Storage: PostgreSQL, MySQL, ClickHouse, MongoDB, Redis, Elasticsearch, OpenSearch, FAISS, Milvus, Weaviate, Pinecone

AI / ML Infrastructure: PyTorch, TensorFlow, Hugging Face ecosystem, LangChain, LlamaIndex, embedding pipelines

Infrastructure & DevOps: Docker, Kubernetes, Terraform, AWS (S3, Lambda, ECS, EKS, Batch), GitHub Actions, GitLab CI

Bioinformatics (Preferred): BLAST, MMseqs2, AlphaFold pipelines, FASTA, FASTQ, sequence alignment workflows

Nice to have:

Experience with high-performance computing (HPC) clusters.

Experience processing petabyte-scale biological datasets.

Experience with vector search and semantic retrieval systems.

Familiarity with scientific computing workflows and research environments.

Why Shiru?

Front-row seat to company building: You’ll get hands-on experience with every aspect of scaling a deep tech startup, working directly with our Head of Machine Learning.

Accelerated growth: You’ll do in one year what might take years elsewhere, building systems and growing your professional skill set.

Equity and impact: You’ll be an equity holder, with the chance to realize real value as we grow.

Unique culture: We’re a high-trust, low-ego team that values curiosity, ownership, and collaboration. You’ll make lasting connections and open new doors for your career.

AI-native environment: You’ll help define how humans and AI collaborate to achieve extraordinary results.

Closing

At Shiru, we’re looking for people with passion, grit, and integrity. If your experience doesn’t precisely match the job description, we still encourage you to apply—especially if your career has taken extraordinary twists and turns. Join us in this singular opportunity to create the future of food and work!

Shiru is an equal opportunity employer that values diversity. We do not discriminate based on race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status. Legal authorization to work in the United States is required.

In compliance with federal law, all persons hired will be required to verify identity and eligibility to work in the United States and to complete the required employment eligibility verification form upon hire. Shiru is not able to sponsor employment visas for this role.
