
An AI Engineer's Guide: Web browsing for AI

Learn about the latest trends, tools, and techniques top AI engineers are using when selecting web scraping and browsing tools and platforms, including pros, cons, and pricing

An AI Engineer's Guide: Web Browsing and Scraping Tools for AI Apps and Agents

This is a Build Club open-source community document, where we share insights and learnings from builders at the frontier of AI! If you are looking to start your AI journey or level up to be in the top x%, then come Build with us!

👉️ You can join our OS community here

A huge shout out to our OS community contributors: Davis Grainger, Andy Taylor, Arun Prasad, Caroline Davis, Bruno Guedes, Vuong Ngo, Muhammad Ridwan, Yoeven DK, Thomas Gomez, Annie Liao

Written with ❤️ from Build Club

An intro to web browsing

Web browsing is essential for AI engineers, enabling data collection for machine learning, market analysis, and more.

As web environments grow more complex, sophisticated scraping techniques are crucial. AI-powered automation, scalable solutions, and ethical considerations are transforming web browsing.

Web scraping typically involves three steps:

  1. Exploring web elements to identify the content structure

  2. Accessing and fetching the data with automated scripts, written in languages such as Python

  3. Storing the extracted data

This article focuses on the first two, which have evolved the most with the rise of new AI capabilities 🌟
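As a rough sketch, the three steps above can be strung together in a few lines of Python. This version sticks to the standard library and parses an inline HTML snippet so it runs offline; in a real scraper, step 2 would fetch the page with urllib or requests first, and the HTML structure and class names here are made up for the example.

```python
# A minimal sketch of the three scraping steps, standard library only.
# The inline HTML stands in for a page you would normally fetch.
import csv
import io
from html.parser import HTMLParser

PAGE = """
<html><body>
  <h2 class="title">Post one</h2>
  <h2 class="title">Post two</h2>
</body></html>
"""

class TitleParser(HTMLParser):
    """Step 1: explore the structure -- here we target <h2 class="title">."""
    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "title") in attrs:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.titles.append(data.strip())

# Step 2: access and fetch the data with an automated script.
parser = TitleParser()
parser.feed(PAGE)

# Step 3: store the extracted data (CSV written to an in-memory buffer).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["title"])
writer.writerows([t] for t in parser.titles)
print(parser.titles)  # ['Post one', 'Post two']
```

For anything beyond a toy page, a real parser like Beautiful Soup (covered below) replaces the hand-rolled HTMLParser subclass.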

Here's a look at the latest trends, key tools, and best practices.

  1. AI-Powered Automation AI models like GPT-4 are revolutionizing web browsing by dynamically generating and refining scripts. This adaptability makes scraping more efficient and capable of handling dynamic content with ease.

  2. Scaling and Anti-Blocking Techniques Scaling operations face challenges like IP blocking. Tools like Apify and Firecrawl offer proxy rotation and user-agent switching to avoid detection. They also bypass CAPTCHAs, ensuring smooth, large-scale data scraping.

  3. Community-Driven Innovations Platforms like Hacker News and Reddit are hotbeds for new scraping techniques. Engineers share tips on using AI for CAPTCHA solving and dynamic content scraping, with tools like Puppeteer often in the spotlight.

  4. Ethical and Legal Considerations As scraping tools evolve, so do the ethical and legal implications. Compliance with robots.txt, respecting privacy, and avoiding undue server strain are increasingly important. Tools that automatically follow these guidelines, such as ZenRows, are becoming essential. As noted by Andy Taylor, "scraping is never illegal; however, what you do with the data can be."
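The compliance point above is easy to automate: Python's standard library ships urllib.robotparser for exactly this. Below is a small sketch that parses an inline robots.txt so it runs offline; normally you would point set_url() at the live file and call read(). The paths and crawl delay are illustrative.

```python
# Check robots.txt rules before scraping, using only the standard library.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.parse("""
User-agent: *
Disallow: /private/
Crawl-delay: 5
""".splitlines())
# parse() alone leaves last_checked unset, and can_fetch() then refuses
# everything; modified() records a fetch time so queries answer normally.
robots.modified()

print(robots.can_fetch("*", "https://example.com/public/page"))   # True
print(robots.can_fetch("*", "https://example.com/private/page"))  # False
print(robots.crawl_delay("*"))                                    # 5
```

Honouring crawl_delay() between requests also helps with the "undue server strain" point.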

The newest and hottest tools top AI engineers are using

We asked our community which tools they are using the most…

  1. Beautiful Soup: Beautiful Soup emerged as one of the most popular tools among the participants. This Python library is known for its simplicity and effectiveness in parsing HTML and XML documents, making it a go-to for many developers. Examples include scraping job descriptions to match them against candidates' CVs, automating a process that would otherwise be highly manual. Another benefit of Beautiful Soup is the community support and examples available online, which make troubleshooting and implementation much more manageable.

  2. Web Scraper.io and Firecrawl: For more extensive web browsing projects, especially those involving large-scale data collection, commercial tools like Web Scraper.io and Firecrawl offer advanced functionalities. These platforms allow users to automate data extraction across multiple pages by setting up custom rules. They are particularly effective for scraping content from hundreds of financial websites or monitoring market trends. Engineers found Web Scraper.io to be reliable, especially when dealing with tasks requiring significant data storage and indexing.

  3. Using Large Language Models (LLMs) for Scraping: With the advent of Large Language Models (LLMs) like GPT-4, there is growing interest in leveraging these models to enhance web browsing efforts. For instance, developers have explored using LLMs to interpret and extract meaningful insights from scraped HTML content. However, challenges remain, such as the models misinterpreting HTML tags or drifting from extraction into content generation. Despite these limitations, LLMs can be effective in specific scenarios, such as performing Q&A on extracted web data.
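As an illustration of the Beautiful Soup use case above, here is a minimal sketch that pulls the title and requirements out of a job-posting snippet. The inline HTML and class names are invented for the example; a real scraper would fetch the page first.

```python
# Parse a job-posting snippet with Beautiful Soup (pip install beautifulsoup4).
from bs4 import BeautifulSoup

html = """
<div class="job">
  <h1>Machine Learning Engineer</h1>
  <ul class="requirements">
    <li>Python</li>
    <li>PyTorch</li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
title = soup.find("h1").get_text(strip=True)
skills = [li.get_text(strip=True) for li in soup.select("ul.requirements li")]
print(title, skills)  # Machine Learning Engineer ['Python', 'PyTorch']
```

From here, the extracted fields can be matched against a candidate's CV with simple keyword overlap, or handed to an LLM for fuzzier matching.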

Full market map of rising tools

Here is the full list of tools top AI engineers are using and interested in experimenting with 👇️ 

Market map of web browsing tools, including new AI tools

The breakdown:

  • Browser Automation Tools

    • Selenium: Versatile and widely used for dynamic content scraping, integrating well with AI-driven workflows.

    • Puppeteer: A Node.js library, excellent for scraping dynamic, modern web apps.

    • Puppeteer-stealth: A stealth plugin widely used to reduce detection when self-hosting scraping infrastructure.

    • Playwright: Supports multiple browser engines, which can be rotated alongside a user-agent list to work around user-agent checks.

  • Headless Browser Platforms

    • Browserbase: Headless browser as a service that mimics human behavior, reducing detection risk—an affordable option for stealth and proxies.

    • Browserless: Another headless browser platform with excellent stealth and proxy features, though at a higher cost.

  • Proxy and Anti-Detection Tools

    • BrightData: Offers high-quality residential proxies and the Web Unlocker, a product that is expensive but highly effective at bypassing detection.

    • ZenRows: High-quality, though expensive, tools for bypassing detection and paywalls, especially useful for scraping news sites.

    • Apify/Crawlee: Scalable and easy to use, with built-in proxy management, ideal for both small and large-scale tasks.

    • Firecrawl: User-friendly, handles large-scale scraping, and integrates seamlessly with AI workflows.

    • Serverless platforms (preferably containers): Cloud Run, AWS Lambda, DigitalOcean, etc… Rotating API calls ensures each container runs on a fresh IP address.

  • Scalable Web Browsing Platforms

    • Octoparse: A no-code, cloud-based platform that is beginner-friendly yet capable of handling complex, scalable tasks.

    • JigsawStackr: Native AI web scraper with built-in proxies to scrape any website accurately and provides structured data along with CSS selector data.

  • AI-Powered Scraping Solutions

    • Jina's Reader API: An AI-driven tool that provides rich data extraction for AI models.

    • TaxyAI Browser Extension: Powered by GPT-4, this extension is great for lightweight, in-browser scraping tasks.

    • ScrapeGraphAI: A web browsing Python library that uses LLM and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown).

    • AgentQL: A powerful AI-driven tool for reliable web data extraction and automation, using natural language queries to find elements without fragile selectors.

  • Open Source and Custom Solutions

    • Universal HTML Parser (Grub Search): Open-source, versatile tool ideal for custom scraping needs.

    • Any LLM: Clean the HTML body first with Beautiful Soup, then call an LLM to generate a Beautiful Soup extraction script via tool calls. Run that script to attempt extraction first, then fall back to Goose, Newspaper, or other old-school extraction tools.

    • Using LangChain Community AsyncChromiumLoader and Html2TextTransformer: Tools for extracting text content from websites using asynchronous browser loading and HTML-to-text conversion. Code:

      import html2text
      from langchain_community.document_loaders import AsyncChromiumLoader
      from langchain_community.document_transformers import Html2TextTransformer

      def extract_text_from_url(url):
          # Load the page in a headless Chromium instance.
          loader = AsyncChromiumLoader([url])
          html_content = loader.load()
          # Convert the loaded HTML documents to plain text.
          html2text_transformer = Html2TextTransformer()
          docs_transformed = html2text_transformer.transform_documents(html_content)
          content_text = []
          for doc in docs_transformed:
              page_content = html2text.html2text(doc.page_content)
              content_text.append(page_content)
          return content_text
  • Web Scraping Programming Language Libraries

    • Beautiful Soup: A Python library for parsing HTML and XML documents, widely used for small-scale web scraping tasks.

    • Scrapy: A Python-based open-source web scraping framework designed for handling large-scale data extraction.

    • Requests: A simple and elegant HTTP library for Python, often used in conjunction with other scraping tools.

    • lxml: A powerful and feature-rich library for processing XML and HTML in Python.

    • Jsoup: A Java library for working with real-world HTML, providing a fast and flexible way to extract and manipulate data.

    • Pandas: A Python library used for data manipulation and analysis, often employed to process and store scraped data.
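Several of the tools above lean on user-agent rotation to avoid naive blocking. Here is a hedged, standard-library-only sketch of that idea; the agent strings are illustrative placeholders, not a vetted list.

```python
# Rotate user agents across outgoing requests, standard library only.
import itertools
import urllib.request

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_0)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]
_agent_cycle = itertools.cycle(USER_AGENTS)

def build_request(url: str) -> urllib.request.Request:
    """Attach the next user agent in the rotation to an outgoing request."""
    return urllib.request.Request(url, headers={"User-Agent": next(_agent_cycle)})

req = build_request("https://example.com/")
print(req.get_header("User-agent"))  # Mozilla/5.0 (Windows NT 10.0; Win64; x64)
```

Real deployments pair this with proxy rotation (for example, via the Apify or BrightData offerings above) rather than relying on user agents alone.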

Conclusion

There are so many new tools and AI-powered browsing capabilities on the market, unlocking new ways to deliver that sweet business value.

Web browsing remains a powerful tool, but it requires careful attention to best practices, particularly regarding legality and ethics. Always check a website's robots.txt file before scraping, respect copyright laws, and avoid putting undue strain on servers.

Additionally, leveraging community insights from platforms like Hacker News and Reddit can provide valuable tips on optimizing your scraping strategy.

Web browsing is evolving rapidly with AI, scalable solutions, and ethical considerations at the forefront. Whether you need robust, scalable tools like Apify or AI-powered solutions like Browserbase, the right tool depends on your specific needs. Stay updated on these trends and choose tools that best align with your scraping goals.

Curious to continue the discussion? Join us in our Slack here!

Appendix

Full wiki of tools mentioned:

| Category | Tool | Description | Pros | Cons | Link | Pricing |
| --- | --- | --- | --- | --- | --- | --- |
| Browser Automation Tools | Selenium | A browser automation tool for testing and scraping dynamic content. | Versatile, widely used, supports dynamic content. | Can be slow, high resource usage, requires manual updates. | https://www.selenium.dev/ | Free (Open-source) |
| Browser Automation Tools | Browserbase | AI-enhanced browser automation tool that mimics human behavior to reduce detection. | AI-enhanced, reduces detection risk. | Relatively new, smaller community. | https://github.com/browserbase | Starts at $39/month, $0.10/browser hour, $10/GB proxy |
| Browser Automation Tools | Puppeteer | A Node.js library for controlling headless Chrome/Chromium, ideal for scraping dynamic content. | High-level API, integrates with modern web frameworks. | Requires Node.js, more complex setup. | https://pptr.dev/ | Free (Open-source) |
| Browser Automation Tools | Playwright | A Node.js library for browser automation, similar to Puppeteer but with broader browser support. | Supports multiple browsers, robust API. | Can be complex to set up and use efficiently. | https://playwright.dev/ | Free (Open-source) |
| Browser Automation Tools | Puppeteer-stealth | A stealth plugin for Puppeteer to avoid detection. | Reduces detection risk, customizable settings. | Requires technical setup, limited community support. | https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth | Free (Open-source) |
| Browser Infrastructure Platforms | Browserless | Headless browser platform with advanced stealth and proxy features. | High-quality stealth features, strong community support. | Expensive compared to alternatives. | https://www.browserless.io/ | From $50/month |
| Browser Infrastructure Platforms | BrightData | Offers high-quality residential proxies and scraping tools. | Reliable, extensive proxy network, anti-detection features. | Expensive, requires a subscription for full features. | https://brightdata.com/ | From $300/month |
| Scalable Web Browsing Platforms | Apify/Crawlee | Scalable web browsing platform with built-in proxy management, ideal for large-scale operations. | Scalable, easy to use, proxy management. | May require a subscription for advanced features. | https://apify.com/ | Pay-as-you-go: $49/month + usage fees |
| Scalable Web Browsing Platforms | Firecrawl | User-friendly, handles large-scale scraping, and integrates seamlessly with AI workflows. | User-friendly, handles large-scale tasks, AI integration. | Limited flexibility for custom tasks. | https://firecrawl.com | Pricing not available online |
| Scalable Web Browsing Platforms | Octoparse | A no-code, cloud-based platform that is beginner-friendly yet capable of handling complex, scalable tasks. | No-code, beginner-friendly, cloud-based. | Limited customization, paid tiers for advanced features. | https://www.octoparse.com/ | Free plan available; paid plans from $75/month |
| Scalable Web Browsing Platforms | Web Scraper.io | A powerful cloud-based scraper with a Chrome extension for creating sitemaps. | Easy to use, supports various formats, scalable. | Limited to structured sitemaps, less flexible for custom tasks. | https://webscraper.io/ | Free for local use; cloud from $50/month |
| Scalable Web Browsing Platforms | JigsawStackr | AI-powered web scraper with built-in proxies to scrape and structure data. | AI integration, handles complex websites effectively. | Newer tool, smaller community. | https://jigsawstackr.com | Pricing not available online |
| AI-Powered Scraping Solutions | Jina's Reader API | An AI-driven API that reads URLs and searches the web, providing context-rich data for AI models. | AI-driven, excellent for AI model training. | Limited to specific use cases, dependent on API access. | https://jina.ai/reader/ | Free (for limited usage) |
| AI-Powered Scraping Solutions | TaxyAI Browser Extension | Powered by GPT-4, this extension is great for lightweight, in-browser scraping tasks. | Simple setup, ideal for lightweight scraping. | Limited to browser automation, not suitable for large-scale scraping. | https://github.com/TaxyAI/browser-extension | Free (Open-source) |
| AI-Powered Scraping Solutions | ScrapeGraphAI | A web browsing Python library that uses LLM and direct graph logic to create scraping pipelines for websites and local documents. | Easy LLM integration for dynamic scraping pipelines. | Limited to Python environments. | https://github.com/ScrapeGraphAI/Scrapegraph-ai | Free (Open-source) |
| AI-Powered Scraping Solutions | AgentQL | An AI-driven tool for reliable web data extraction and automation, using natural language queries to find elements without fragile selectors. | Uses natural language for element selection, simplifying queries. | Requires Chrome extension installation for initial setup. | https://www.agentql.com | Pricing not available online |
| Anti-Blocking Tools | ZenRows | A tool designed to bypass detection, including CAPTCHAs and paywalls. | Effective against advanced anti-bot systems, premium proxies available. | Expensive, complex configuration. | https://www.zenrows.com/ | From $69/month |
| Open Source and Custom Solutions | Beautiful Soup | A Python library for parsing HTML and XML documents, used for simple web browsing tasks. | Easy to use, highly customizable, strong Python integration. | Best suited for smaller tasks, requires manual updates for dynamic content. | https://www.crummy.com/software/BeautifulSoup/ | Free (Open-source) |
| Open Source and Custom Solutions | Universal HTML Parser (Grub Search) | An open-source HTML parser under the GNU AGPL license, suitable for custom web browsing needs. | Customizable, open-source, versatile. | Requires more technical expertise, less support. | https://gitlab.com/grubs/grub-search/-/blob/master/src/scraper/universal-html-parser.rkt | Free (Open-source) |
| Open Source and Custom Solutions | Scrapy | An open-source web browsing framework, highly flexible, supports various data formats. | Flexible, supports various formats, strong community. | Steeper learning curve, can be overkill for simple tasks. | https://scrapy.org/ | Free (Open-source) |