About the Role
We are seeking a skilled and pragmatic Data Platform Engineer to architect and scale the data systems that support our AI and ML pipelines, with a specific focus on code-based text datasets. You will play a central role in building the infrastructure that powers data ingestion, transformation, and delivery for our models. This includes developing systems for web-scale data discovery and crawling, designing robust data pipelines, and enabling our scientists to experiment and iterate with confidence. If you are excited by building scalable, ML-ready data platforms at the intersection of engineering and AI, we want to hear from you.
Core Responsibilities
- Design and implement scalable data infrastructure to ingest, transform, and manage large-scale code datasets, ensuring high reliability and modularity.
- Build systems and tools for automated web crawling, parsing, deduplication, and metadata extraction from open-source and public code repositories.
- Develop robust data pipelines for ingesting and processing structured text datasets using distributed compute frameworks. Monitor quality, throughput, and performance.
- Build tools to support data visualization, sampling, and analytics to drive better model outcomes and data understanding.
- Collaborate across research, infrastructure, and compliance teams to meet technical, operational, and regulatory requirements.
Required Skills
- 5+ years of software engineering experience in data-intensive environments
- Proven experience building and maintaining scalable data systems and infrastructure
- Experience with web crawling, scraping frameworks, and large-scale data ingestion
- Comfort working in AWS or other cloud environments, including storage, containerized compute, and security
- Working experience with a data-centric tech stack, including Python, Go, or Scala; Spark or Ray; Airflow or Prefect; Kafka; Redis; PostgreSQL or ClickHouse; and GitHub APIs
- Understanding of how datasets feed into AI/ML workflows
Preferred Qualifications
- Experience curating and preparing code-based datasets for language models or code intelligence applications
- Familiarity with code parsing, tokenization, embedding, and static analysis
- Prior experience in a startup or fast-paced, high-ownership engineering environment
- Strong written and verbal communication skills
What We Offer
- Opportunity to shape the technical direction of a disruptive AI startup
- Work with cutting-edge technologies in AI/ML and cloud computing
- Competitive compensation package including equity
- High-caliber collaborators from diverse disciplines
- Collaborative and innovative startup culture