The Other Side of Website Scraping
by Gábor Hajba
There are a lot of articles out there to give you an introduction to website scraping, but most of them miss some key points I find crucial to know before you start your project.
Gathering data is superb for your business, but what about the other parties: website operators, owners, and users?
If you do website scraping the wrong way, you can end up facing legal consequences. Even if you don’t know about it, there might be a regulation that protects websites and restricts scraping.
One of the well-known legal cases on this topic is LinkedIn suing unknown persons or entities over scraping. Controversially, LinkedIn doesn’t want to prohibit scraping completely, because it wants search bots (like the one Google uses) to index its websites and surface its pages as search results. This differentiation between “friendly bots” and “criminal hackers” can easily give big search engines a monopoly.
Doing Website Scraping Right
I’m not saying that following my personal guidelines guarantees white-hat website scraping. But every time you start a project, think about the consequences and verify that you are doing your utmost to scrape safely.
This is the toughest part of all. You have to research the current laws on website scraping: for your region, for where the target website is hosted, and for where the owners of the target website are located.
Laws are a dry read, I know, but ignorance won’t keep you out of trouble. Invest some time in going through at least your own country’s restrictions. They won’t change every day, and knowing the basics is helpful when you deal with projects outside your country.
Because of the new regulation in the European Union (the regulation was adopted in 2016, but it has been enforced since May 2018), it is critical to take a look at what data you gather.
To make it clear: if you gather personal information about people living in the EU or the European Economic Area (Norway, Iceland, Liechtenstein), you need confirmation from those people that you’re allowed to gather, store, and transfer their data.
Reading Terms & Conditions
A big concern is that many developers don’t care whether they’re working legally on their data-gathering projects. I don’t want to go into ethical development in this article, but I want you to think about this the next time you’re about to accept a project proposal: are you allowed to scrape the required data?
For automated scrapers, like search engine bots, it would be hard to read every website’s terms. Because of this limitation, those bots align their work with the robots.txt file provided by the target website. And you should do the same.
Because I use Python for my website scraping projects, I know that Python has libraries to parse robots.txt files. And Scrapy, in my opinion the best available Python scraping toolbox, obeys robots.txt files (by default only since version 1.5.1; the support had been present for years, it just wasn’t enabled by default).
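As a minimal sketch of that workflow, Python’s standard library ships urllib.robotparser. The site, rules, and bot name below are placeholders of my own, not from any real robots.txt file:

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
# Point the parser at the target site's robots.txt (example.com is a placeholder):
parser.set_url("https://example.com/robots.txt")
# parser.read()  # would fetch and parse the live file over HTTP

# For illustration, feed the parser a robots.txt body directly:
parser.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# Ask before every request whether your spider may fetch the URL:
parser.can_fetch("MyScraperBot", "https://example.com/private/page")  # False
parser.can_fetch("MyScraperBot", "https://example.com/public/page")   # True
```

Calling can_fetch before each request is all it takes to respect the rules the site owner published.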
Interesting fact: as of writing this article (October 2018) Amazon has a special entry for the EtaoSpider in their robots.txt file:
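I haven’t reproduced Amazon’s exact file here, and its wording may have changed since, but an entry that locks a single spider out of an entire site takes this general form (the spider name is from the article; the rule body is my assumption):

```
User-agent: EtaoSpider
Disallow: /
```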
Identify Yourself
You can see that robots.txt files define different rules for different spiders. And as a good Internet citizen, your spider should introduce itself.
It doesn’t matter what tool you use: every time you send a request to a URL, provide the User-Agent header. This shows website owners who your scraper is, and they can add entries for you to follow in their robots.txt file.
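A hedged sketch with Python’s standard library; the bot name and contact URL are hypothetical placeholders you should replace with your own:

```python
import urllib.request

# Hypothetical identity: name your bot and give site owners a way to reach you.
USER_AGENT = "MyScraperBot/1.0 (+https://example.com/bot-info)"

request = urllib.request.Request(
    "https://example.com/page-to-scrape",
    headers={"User-Agent": USER_AGENT},
)
# urllib.request.urlopen(request) would now send the identifying header.
```

Including a contact URL or email in the User-Agent string is a common courtesy: it lets an operator reach you instead of simply banning your IP.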
Limit Your Requests
Another aspect is not legal but physical. When you call a web page to gather data from it, you cause some server load. But if you create a Scrapy scraper and turn all the speed knobs up to the maximum, you can overwhelm the target website.
You will notice this yourself: as response times go up, your scraper needs more time to gather the data. In the worst case, the web server goes down and nobody gets access to anything.
When the website is down, the owners aren’t making money. And because they make no money, they get angry. And because they’re angry, they will try to find you and make you pay for their lost revenue. You don’t want that.
Website scraping is a great and common way to gather information from the internet. However, there are some issues you have to consider if you want to act as a good Internet citizen.
Next time you want to gather some data, read the website’s Terms & Conditions page (if any) and see whether they restrict bots from scraping their site. After this, look at the target site’s robots.txt file and respect it in your scraper. Finally, add a delay between your requests. You don’t want to hurt your project because the target website is always down or because you were banned.
As an alternative, many companies provide publicly accessible APIs with their data. Why not give this a try?
Disclaimer: I am not a lawyer. Following my guidelines doesn’t mean you’ll be 100% out of trouble. But you’ll be a better Internet citizen.
About the Author
Gábor Laszlo Hajba is an IT Consultant who specializes in Java and Python and holds workshops about Java and Java Enterprise Edition. As the CEO of JaPy Szoftver Kft in Sopron, Hungary, he is responsible for designing and developing solutions to customer needs in the enterprise software world. He has also held roles as a software developer with EBCONT Enterprise Technologies and as an Advanced Software Engineer with Zühlke Group. He considers himself a workaholic, a (hard)core and well-grounded developer, functionally minded, a fan of portable apps, and "a champion Javavore who loves pushing code", and he loves to develop in Python.
This article was contributed by Gábor Hajba, author of Website Scraping with Python.