A Newbie’s Journey Building a Full-Stack AI-Powered Web Scraper

CHAIRI Chaimae
5 min read · Sep 18, 2024


Highlights from My First Hiring Hackathon Experience

Planning and executing to win. Image from Visteon.

Participating in my first hiring hackathon was both exciting and challenging. The hackathon, supervised by Headstarter AI, offered multiple problem tracks, each targeting different real-world challenges. After reviewing the options, I chose to join the Olostep track.

Why?

Because I had some related technical experience, and for a long time, I’ve wanted to work on a web scraper that truly solves real-world problems. This track offered the perfect opportunity to apply my skills to a project with tangible impact.

Our goal?

Build a full-stack application that scrapes, analyzes, and categorizes data from user-provided URLs using a lightweight AI model, complete with a CI/CD pipeline for automated deployment and real-time performance monitoring.

1. Project Overview

Objective

Our team’s task in the Olostep track was to create a web scraper that not only extracts data from user-provided URLs but also utilizes AI to analyze and categorize that data. Additionally, we needed to set up a CI/CD pipeline for seamless deployment and integrate logging and monitoring to track real-time performance.

Why It Matters

This project was particularly relevant to Olostep’s mission of programmatically accessing and interacting with the web at scale. By adding AI-powered data analysis, we took a step closer to enabling AI agents to intelligently gather, process, and act on web data — aligning with Olostep’s vision for AI integration in web interaction.

2. Team Collaboration

For the first time, I had the opportunity to work with an international team consisting of four other members, each bringing a unique skill set to the table:

  • Ivan: Team Leader
  • Deilen: UI Designer
  • Brandon: Spokesperson and Mentor
  • Belal: AI Specialist
  • Me, Chaimae: Full-Stack Developer with CI/CD Focus

Despite working across different time zones and cultures, we managed to align quickly, dividing responsibilities based on our strengths and maintaining open communication throughout via Discord. The diversity in our team not only enhanced the quality of our project but also taught me valuable lessons about teamwork and collaboration in a global context.

3. Tech Stack & Tools

To build this project, we used a combination of modern tools and technologies (a minimal wiring sketch follows the list):

  1. User Interface: Basic form to input URLs.
  2. Backend: Node.js server using Express.js for handling requests, scraping with Playwright or Puppeteer.
  3. Database: MongoDB for storing scraped data.
  4. AI Integration: TensorFlow.js for on-the-fly data categorization.
  5. CI/CD Pipeline: Automated deployment with GitHub Actions, monitoring with Prometheus.
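
To make the flow concrete, here is a minimal wiring sketch of the request path in Node.js. The route, database, and collection names are illustrative assumptions, not the project’s exact code: the form posts a URL, and the server persists the submission in MongoDB.

```javascript
// Minimal sketch: Express accepts a URL from the form and stores it in
// MongoDB. Names like 'scraper' and 'pages' are illustrative placeholders.
const express = require('express');
const { MongoClient } = require('mongodb');

const app = express();
app.use(express.json()); // parse JSON bodies sent by the frontend form

const mongo = new MongoClient('mongodb://localhost:27017'); // local dev URI

app.post('/scrape', async (req, res) => {
  const { url } = req.body;
  if (!url) return res.status(400).json({ error: 'url is required' });

  // Scraping and AI categorization plug in here (see the sections below);
  // this sketch only persists the submitted URL.
  const doc = { url, submittedAt: new Date() };
  await mongo.db('scraper').collection('pages').insertOne(doc);
  res.json(doc);
});

mongo.connect().then(() => {
  app.listen(3000, () => console.log('Listening on :3000'));
});
```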

4. Key Features of the Application

  • Frontend: HTML, CSS, JavaScript
  • Backend: Node.js, Express.js
  • Database: MongoDB
  • AI Model: TensorFlow.js or a pre-trained model for NLP
  • Automation: CI/CD with GitHub Actions
  • Monitoring: Prometheus, Grafana
  • Preferred Tools: Selenium, Playwright, Puppeteer for advanced scraping

A screenshot from MongoDB Compass showing the stored portfolio URL submitted through the form.

Web Scraping

  • Users submit URLs through the frontend, and the backend scrapes the content using Puppeteer.
  • It handles both static and dynamic websites, including those that rely on JavaScript to load data; a minimal sketch of this step follows below.
A screenshot from the alpha version of the scraper before AI integration.
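
Here is a minimal sketch of that scraping step with Puppeteer (the helper name is an illustrative assumption). Waiting for the network to go idle gives JavaScript-rendered pages time to finish loading before the DOM is read:

```javascript
// Sketch of the scraping step. 'networkidle0' waits until the page has
// stopped making network requests, so dynamic pages finish rendering;
// static pages pass through the same path unchanged.
const puppeteer = require('puppeteer');

async function scrapePage(url) {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle0', timeout: 30000 });

    // Extract the rendered title and visible text from the live DOM.
    return await page.evaluate(() => ({
      title: document.title,
      text: document.body.innerText,
    }));
  } finally {
    await browser.close(); // always release the browser, even on errors
  }
}

scrapePage('https://example.com').then((data) => console.log(data.title));
```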

AI-Powered Data Analysis

  • Once the data is scraped, it’s sent to the AI model, which categorizes it based on content type (e.g., blog posts, e-commerce products); a sketch of one possible approach follows this list.
  • The model provides insights automatically, allowing users to better understand the data without manual analysis.
  • Here’s a short demo of the result: https://youtu.be/0BYhpN2n98A
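
I won’t reproduce our exact model here; as one lightweight possibility, the categorization step could use the pre-trained Universal Sentence Encoder for TensorFlow.js (an assumption for illustration, not necessarily the model we shipped): embed the scraped text alongside a few candidate labels and pick the closest match.

```javascript
// Sketch of lightweight categorization: embed candidate labels and the
// scraped text with the Universal Sentence Encoder, then pick the label
// whose embedding is most similar to the text. Labels are illustrative.
require('@tensorflow/tfjs-node'); // native TensorFlow backend for Node.js
const use = require('@tensorflow-models/universal-sentence-encoder');

const LABELS = ['blog post', 'e-commerce product', 'news article', 'documentation'];

// Cosine similarity between two plain number arrays.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function categorize(text) {
  const model = await use.load();
  // Embed labels and text in one batch: one 512-dim vector per input string.
  const embeddings = await model.embed([...LABELS, text]);
  const vectors = await embeddings.array();
  const textVec = vectors[vectors.length - 1];

  // Score every label against the text and keep the best match.
  const scores = LABELS.map((label, i) => ({ label, score: cosine(vectors[i], textVec) }));
  scores.sort((a, b) => b.score - a.score);
  return scores[0];
}

categorize('50% off wireless headphones, free shipping on orders over $25')
  .then((best) => console.log(best)); // e.g. { label: 'e-commerce product', ... }
```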

CI/CD Pipeline

  • We automated testing and deployment using GitHub Actions. This ensured that every new code commit was automatically tested and deployed to production.
  • The pipeline helped us maintain agility and speed during development, a crucial factor in a time-limited hackathon; a sample workflow sketch follows this list.
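
A minimal GitHub Actions workflow along these lines might look like the sketch below; the job layout, Node version, and deploy script are illustrative placeholders rather than the repository’s exact configuration.

```yaml
# Sample workflow sketch (.github/workflows/ci.yml); step names, the Node
# version, and the deploy script are illustrative placeholders.
name: CI/CD
on:
  push:
    branches: [main]

jobs:
  test-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm test
      # The deploy step depends on the hosting platform; shown as a placeholder.
      - name: Deploy
        if: github.ref == 'refs/heads/main'
        run: npm run deploy # hypothetical script
```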

Here is the GitHub repository: CXaymae/Hackathon-Olostep-track (github.com)

Logging and Monitoring

  • With Grafana, we were able to monitor performance in real time, track potential issues with web scraping, and ensure the AI model’s analysis was efficient; an instrumentation sketch follows this list.
  • Alerts were set up to notify us of any system slowdowns or failures, keeping us ahead of potential problems.
  • Here’s a short demo of the result: https://youtu.be/UNEiWkgoo30
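
On the instrumentation side, a minimal sketch with the prom-client library could look like this (metric and route names are illustrative): the Node.js server exposes a /metrics endpoint for Prometheus to scrape, and Grafana charts the resulting series.

```javascript
// Sketch of Prometheus instrumentation for the scraper with prom-client.
// Metric names are illustrative, not the project's exact ones.
const express = require('express');
const client = require('prom-client');

const app = express();
client.collectDefaultMetrics(); // CPU, memory, event-loop lag, etc.

// Count scrape requests and time how long each scrape takes.
const scrapeCounter = new client.Counter({
  name: 'scrapes_total',
  help: 'Total number of scrape requests',
});
const scrapeDuration = new client.Histogram({
  name: 'scrape_duration_seconds',
  help: 'Time spent scraping a page',
});

app.post('/scrape', async (req, res) => {
  scrapeCounter.inc();
  const end = scrapeDuration.startTimer();
  // ... run the scraper here ...
  end(); // records the elapsed time in the histogram
  res.sendStatus(200);
});

// Prometheus scrapes this endpoint; Grafana dashboards query Prometheus.
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

app.listen(3000);
```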

5. How It Aligns with Olostep’s Vision

Our project perfectly aligned with Olostep’s goals:

  • Web Scraping at Scale: The application demonstrated the ability to programmatically interact with the web, handling large amounts of data from different sources.
  • AI-Powered Insights: By integrating a lightweight AI model, we turned raw data into categorized, useful information, adding significant value.
  • Automation: The CI/CD pipeline ensured that our solution could be deployed quickly and efficiently, meeting real-world scalability and performance requirements.

6. Future Plans

  • Text Summarization and Sentiment Analysis: The application can summarize text and analyze sentiment (negative or positive). It applies this functionality to various content types, including podcasts, articles, e-commerce listings, technology reviews, and news.
  • Content Classification: The scraper identifies and categorizes the type of content (e.g., podcast, article).
  • Regional and Language Translation: It supports content specific to different regions and offers language translation.
  • Sensitive Content Disclaimer: A disclaimer is included if sensitive content is detected. If the content is refused, the scraper output remains blank.
  • User Filtering: Users can filter content according to their preferences for more relevant results.

These features enhance the scraper’s versatility and usability.

7. Key Takeaways & Lessons Learned

  1. Collaboration is Key: Working with an international team taught me the importance of clear communication and leveraging everyone’s strengths.
  2. Automation Simplifies Life: Setting up a CI/CD pipeline early on helped us focus on building the core features instead of worrying about manual deployments.
  3. AI Can Be Simple Yet Powerful: Even a lightweight AI model can provide valuable insights when integrated into a larger system.
  4. Monitoring is Critical: Real-time logging and monitoring helped us quickly catch and resolve issues, improving the overall reliability of our application.

Conclusion

Participating in my first hiring hackathon, led by Headstarter AI, was a real eye-opener. On the Olostep track, my international team and I developed a full-stack web scraper with AI capabilities. Although we didn’t win in a highly competitive environment, we gained valuable experience in system design and presentation. I’m eager to apply these learnings to future projects and keep growing as a future software engineer.
