Web scraping is a “growth hacking” technique popularized by the SEO industry, but it can be performed by anyone with an interest in collecting web data. It is typically defined as the act of automatically downloading a web page’s data and extracting specific information from it, usually to be compiled into a database.
Typically, web scraping is used to collect useful data from websites for tracking purposes, which can give online businesses an advantage. For example, an ecommerce website can scrape the prices of particular products from competitor websites to get fast, near real-time updates on the market. Or you could scrape the <H2> header text from a site’s pages to build a list of the most common keywords used in headers, and evaluate how competitive it would be to rank for those keywords.
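As a taste of what the price-tracking idea might look like, here is a minimal Python sketch using the requests and BeautifulSoup libraries. The URL and the CSS selector are made up for illustration, since every site structures its HTML differently:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical competitor product page and selector; real sites will differ
PRODUCT_URL = "https://competitor.example.com/products/widget-9000"
PRICE_SELECTOR = "span.product-price"

# Download the product page and parse its HTML
response = requests.get(PRODUCT_URL, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

price_tag = soup.select_one(PRICE_SELECTOR)
if price_tag:
    # Strip the currency symbol, e.g. "$19.99" -> 19.99
    price = float(price_tag.get_text(strip=True).lstrip("$"))
    print(f"Competitor price: {price}")
else:
    print("Price element not found; the selector needs updating")
```

Run on a schedule and saved to a database, a script like this is all it takes to keep a running history of a competitor’s prices.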
Web scraping is generally legal so long as you’re scraping data that is already publicly available, but it does fall into a moral grey area. Aside from the fact that web scrapers can be used for malicious purposes, many webmasters really don’t like web scrapers crawling their sites, as the constant floods of requests can end up overloading a server’s resources and crashing the website.
Web scraping is also a time-saver: instead of fretting over how to download, copy, and save the data you see on a web page by hand, a scraper (which you can write in PHP, Python, or most other languages) gives you an easy, automated way to collect whatever data the page displays.
Done excessively, though, web scraping is the equivalent of a DoS (denial of service) attack, which is why webmasters either try their best to block web scrapers, or ask that scrapers rate-limit themselves by leaving time between requests.
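If you do scrape, being polite is straightforward. Here is a minimal Python sketch of a throttled scraper; the site, paths, and delay are placeholders, and honoring robots.txt is a courtesy rather than a legal shield:

```python
import time
import urllib.robotparser

import requests

BASE_URL = "https://example.com"  # placeholder target site
DELAY_SECONDS = 5                 # polite gap between requests
PATHS = ["/", "/about", "/blog"]  # hypothetical pages to scrape

# Check robots.txt before crawling
robots = urllib.robotparser.RobotFileParser(BASE_URL + "/robots.txt")
robots.read()

for path in PATHS:
    url = BASE_URL + path
    if not robots.can_fetch("my-scraper", url):
        print(f"robots.txt disallows {url}, skipping")
        continue
    # Identify yourself with a User-Agent so the webmaster knows who's asking
    response = requests.get(url, headers={"User-Agent": "my-scraper"})
    print(url, response.status_code)
    time.sleep(DELAY_SECONDS)  # wait so we don't flood the server
```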
Let’s dig a bit deeper into what web scraping is used for and how it’s done.
How is web scraping performed?
Web scrapers are typically coded in Python, a dynamic language that is easy to read compared to other programming languages. Python uses many English words as keywords, so Python code often reads almost like a human sentence (a line such as if humans: print("talked like this") is close to plain English).
Building a web scraper is a fun and easy way to start learning Python, and you can follow a Python web scraping tutorial, which makes for a perfect beginner’s project.
Alternatively, people can use web scraper services online, such as ScrapeSimple or Octoparse, where a scraper will pretty much be built for you based on what kind of data you want to scrape.
What a web scraper technically does is ‘visit’ the target website and inspect its HTML/CSS elements, looking for the kind of data you requested. So you could have a web scraper that pulls all of the <H2> tags from a website and displays them to you as output, or only the <H2> tags that contain a specific keyword.
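As a rough sketch of how that looks in Python with the requests and BeautifulSoup libraries (the URL and keyword below are just placeholders):

```python
import requests
from bs4 import BeautifulSoup

URL = "https://example.com"  # placeholder target page
KEYWORD = "python"           # optional keyword filter

# Download the page and parse its HTML
response = requests.get(URL, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Pull every <h2> tag from the page
headers = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
print("All <h2> headers:", headers)

# ...or only the ones containing a specific keyword
matching = [h for h in headers if KEYWORD.lower() in h.lower()]
print(f"Headers containing '{KEYWORD}':", matching)
```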
That’s a very simple example of what a web scraper can do, and something you can easily learn in the tutorial we linked above, but there are many more complex uses for web scrapers.
Why does web scraping get a bad reputation?
Web scrapers and web crawlers have actually been around pretty much since the invention of the internet. However, the practice of web scraping has picked up a bad reputation in the past few years, particularly when LinkedIn went after data analytics company HiQ for scraping profile information, citing the Computer Fraud and Abuse Act (CFAA).
However, the CFAA was written with unauthorized computer access in mind, i.e. actually hacking into a website and stealing data. HiQ was only scraping publicly available LinkedIn profiles, which did not even require a log-in to view, as anyone with a browser can visit LinkedIn profile pages.
After a lengthy and complex court battle, the Ninth Circuit Court of Appeals basically ruled that LinkedIn could not weaponize the CFAA to stop HiQ from scraping publicly available information, although LinkedIn could still potentially pursue other claims, such as copyright infringement.
The HiQ case is actually still ongoing, and there have been many similar cases with complex rulings lately. In Compulife v. Newman, No. 18-12004 (11th Cir. May 20, 2020), for example, the Eleventh Circuit ruled that “a database may contain trade secret information even though the database contents can be accessed through a publicly available website.” This is important stuff to follow if you plan on getting into the world of web scraping.