We're looking for a skilled Web Scraping Data Engineer (Intern) to design and implement robust data extraction systems. In this role, you'll develop scalable crawling architectures to collect high-quality data while ensuring compliance with ethical standards and data regulations.
Design and maintain efficient web crawling systems using frameworks like Scrapy, Playwright, or Selenium
Implement data processing pipelines to clean, normalize, and structure extracted content
Optimize crawling strategies to improve efficiency while respecting website policies
Develop monitoring systems to identify and resolve scraping issues quickly
Deliver high-quality datasets for analysis and model training
Implement storage solutions for large-scale data management
Ensure compliance with data regulations and ethical scraping practices
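To give a flavor of the "respecting website policies" responsibility above, here is a minimal sketch using only Python's standard library to check robots.txt rules before fetching; the rules and URLs are hypothetical, and in practice the robots.txt would be fetched from the target site:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules; a real crawler would fetch these
# from the target site's /robots.txt before making any requests.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A polite crawler checks each URL against the rules before fetching.
allowed = parser.can_fetch("*", "https://example.com/articles/1")
blocked = parser.can_fetch("*", "https://example.com/private/x")
```

Frameworks such as Scrapy can perform this check automatically, but understanding the underlying mechanism is part of building compliant crawlers.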
Strong Python programming experience
Working knowledge of SQL
Hands-on experience with web scraping tools (BeautifulSoup, Scrapy, Selenium)
Proficiency with HTML, JavaScript, and the HTTP protocol
Experience with data processing libraries (pandas, PySpark)
Familiarity with Linux/UNIX environments
Knowledge of version control systems and code review practices
Strong problem-solving abilities and attention to detail
Excellent communication skills (written and verbal English)
Good to have (optional):
Familiarity with AI frameworks (Hugging Face, LangChain, OpenAI)
Familiarity with LLM training pipelines and data requirements
Experience with text data augmentation and synthetic data generation
Experience with large-scale distributed crawling systems
Knowledge of proxy management and anti-bot evasion techniques
Familiarity with any cloud platforms (AWS, GCP, Azure)
Experience with containerization (Docker, Kubernetes)
Opportunity to work on cutting-edge data collection projects
Collaborative environment with talented engineers
Competitive compensation package
Professional growth and development opportunities