What You’ll Do

We are seeking a Senior Software Engineer with expertise in web scraping, data processing, and search technologies to help build a large-scale data ingestion and classification system. You will extract data from diverse sources (web pages, APIs, PDFs), clean and normalize it, and build search capabilities using Elasticsearch/OpenSearch. You will work with Python, Scrapy, Airflow, Kubernetes, AWS, and Spark to create scalable, high-performance data pipelines.

Design and build large-scale, distributed crawling bots (possibly AI agents) and supporting infrastructure that operate in an adversarial environment with low operational overhead.
Develop and maintain data pipelines to extract data from large volumes of web pages, documents, PDFs (OCR), and APIs.
Help unify heterogeneous documents from varied source formats into a coherent data schema.
Preprocess and normalize raw data for downstream classification, ML/NLP, and search indexing.
Build APIs to expose structured, classified data via Elasticsearch/OpenSearch.
Collaborate with ML/NLP teams to integrate classification models into the pipeline.
Automate workflows using Apache Airflow and deploy solutions in Kubernetes on AWS.
Optimize and scale data pipelines using Spark (EMR) for processing large datasets.

What You’ll Bring

4 years of Python experience building crawling/scraping solutions at scale.
Experience working with REST APIs and PDF processing (OCR, Tesseract, PyMuPDF, etc.).
Proficiency in data processing and search technologies (Elasticsearch/OpenSearch, NoSQL/SQL databases).
Experience with React.
Strong problem-solving skills in handling anti-scraping mechanisms and data scaling challenges.
Hands-on experience with AWS or GCP.

Nice to Have

Familiarity with NLP and machine learning.
Experience with LLMs, NLP models, or ML frameworks (e.g., Hugging Face, spaCy, TensorFlow, PyTorch).
Prior experience in automated document classification.
Experience working in high-scale, production environments with petabytes of data.
Hands-on experience with Kubernetes.

Senior Software Engineer - Python (Contract)
