Web Scraping: A Guide
You’ve probably heard it many times: “Data is the new oil!” And yes, it’s true: data is the new differentiator. It is the core of market research and business strategy. Whether you want to start a new project or devise a new strategy for an existing business, you invariably need to access and analyze a vast amount of data.
Introduction to Web Scraping
Web scraping is a technique for fetching data from websites. One way is to manually copy and paste the needed data from a website, but this is very time-consuming. Web scraping automates the extraction of data from websites with the help of software programs called scrapers. These can be custom-built to work for one site, or configured to work with any website based on user requirements.
Web scraping finds many uses at both a professional and a personal level, serving different needs at different levels. Some popular uses of web scraping are:
- To analyze customer feedback and understand the customer’s perspective on a product, which helps companies develop new strategies.
- To collect data for services like content aggregators that serve content to consumers.
- To analyze financial statements and develop more efficient programs to manage finance.
- To understand social media trends and how people react to them.
- To do any other kind of research and development work.
Let’s talk technical now!
Alright, we have a rough idea of what web scraping is and what it’s used for. Now the question is: how do we do it? Well, we have various options and techniques available, but I would choose Python for this task any time. Ask me why? Simple answer: Python has various features that make it the most suitable.
- Python is super simple and easy to use. We have to write fewer lines of code, and it’s less messy.
- It supports a large number of libraries, like pandas, Requests, Matplotlib, Selenium, Scrapy, etc., which makes it well suited for data extraction and manipulation.
- Since Python is open source, it has a large and active community that helps us if we get stuck anywhere.
Basic steps involved in web scraping that we need to follow:
- We need the website URL that we want to scrape the data from.
- We make a request to the URL that will give us some response.
- We parse the response.
- We find the data we want to extract.
- We store the data in the required format.
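As a sketch, the steps above can be wired together with the requests and Beautiful Soup libraries (both covered in more detail below); the URL here is only a placeholder:

```python
import requests
from bs4 import BeautifulSoup

def fetch(url):
    # Step 2: make a request to the URL and get a response
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text

def extract_titles(html):
    # Steps 3-4: parse the response and find the data we want
    soup = BeautifulSoup(html, "html.parser")
    return [h1.get_text(strip=True) for h1 in soup.find_all("h1")]

# Steps 1 and 5: pick a URL, then store/print the extracted data, e.g.
# print(extract_titles(fetch("https://example.com")))
```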
Libraries ease the work; they come with various built-in implementations that we can use directly. There are tons of libraries and niche scrapers around the community, but I’d like to share the most popular of them.
1. Requests
Python Requests is the most basic HTTP library available. Requests lets the user send requests to an HTTP server and get responses in the form of HTML or JSON. It also allows the user to submit POST requests to the server in order to change or add content. It does not parse the received HTML data, so another library is needed for that.
When you’re just getting started with web scraping and have an API to work with, Requests is the best option. It’s simple to use and doesn’t take much effort to master. Requests also eliminates the need to manually add query strings to your URLs. It has excellent documentation and fully supports the REST verbs (PUT, GET, DELETE, and POST).
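A minimal sketch of everyday Requests usage. The `build_url` helper is a hypothetical name used here to show how Requests assembles query strings for you, and httpbin.org is a public echo service used purely for illustration:

```python
import requests

def build_url(base, params):
    # Preparing a request lets us inspect the final URL, with the query
    # string that Requests builds for us, without sending anything
    return requests.Request("GET", base, params=params).prepare().url

def demo():
    # A GET with query parameters; no manual "?q=..." needed
    resp = requests.get("https://httpbin.org/get",
                        params={"q": "web scraping"}, timeout=10)
    print(resp.status_code, resp.json()["args"])

    # A POST that submits form data to the server
    resp = requests.post("https://httpbin.org/post",
                         data={"name": "demo"}, timeout=10)
    print(resp.json()["form"])

# Call demo() to try the live requests.
```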
2. Beautiful Soup
Beautiful Soup is a Python library for parsing structured data. It allows you to interact with HTML in a similar way to how you interact with a web page using developer tools.
The library exposes a couple of intuitive functions you can use to explore the HTML you received. To get started, use your terminal to install Beautiful Soup:
# To install — pip install beautifulsoup4
# To import — from bs4 import BeautifulSoup
The best use case for this library is static HTML pages.
Here are some basic operation examples:
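For instance, given a small HTML snippet (invented here for illustration), a few core operations look like this:

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>Demo Page</title></head>
<body>
  <p class="intro">Hello</p>
  <a href="/page1">First link</a>
  <a href="/page2">Second link</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Navigate the tree directly by tag name
print(soup.title.string)                          # Demo Page
# Look up a tag by name and attributes
print(soup.find("p", class_="intro").get_text())  # Hello
# Collect attribute values from every matching tag
print([a["href"] for a in soup.find_all("a")])    # ['/page1', '/page2']
```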
3. lxml
lxml is one of the best parsers for HTML and XML. It eases the handling of XML and HTML files and is widely used for its simplicity and extremely fast response. This library is very useful in web scraping, as it can easily parse large HTML or XML files.
Among all the Python web scraping libraries, we’ve enjoyed using lxml the most. It’s straightforward, fast, and feature-rich.
Even so, it’s quite easy to pick up if you have experience with either XPath or CSS selectors. Its raw speed and power have also helped it become widely adopted in the industry.
Beautiful Soup vs lxml
Historically, the rule of thumb was:
- If you need speed, go for lxml.
- If you need to handle messy documents, choose Beautiful Soup.
Yet, this distinction no longer holds. Beautiful Soup now supports using the lxml parser, and vice-versa. It’s also pretty easy to learn the other once you’ve learned one.
So to start, we recommend trying both and picking the one that feels more intuitive for you. We prefer lxml, but many swear by Beautiful Soup.
# To install — pip install lxml
# To import — import lxml
You can see the documentation here.
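A short sketch of lxml in action; note that in practice you usually import a submodule such as lxml.html rather than the bare package, and the markup below is invented for illustration:

```python
from lxml import html

doc = html.fromstring("""
<div id="products">
  <p class="name">Laptop</p><span class="price">999</span>
  <p class="name">Phone</p><span class="price">499</span>
</div>
""")

# XPath queries are lxml's bread and butter: grab the text of every
# matching element in one expression
names = doc.xpath('//p[@class="name"]/text()')
prices = doc.xpath('//span[@class="price"]/text()')
print(list(zip(names, prices)))  # [('Laptop', '999'), ('Phone', '499')]
```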
4. Selenium
Selenium acts as a web driver. This API provides a way to drive browsers like Firefox, Internet Explorer, Chrome, a remote browser, etc. A program can do almost all the tasks that a user can perform in a web browser, like filling forms, clicking buttons, opening the browser, and a lot more. It is a very useful tool for web scraping in Python.
Other sites may require you to click through forms before seeing their content, or select options from a dropdown, or perform a tribal rain dance…
For these sites, you’ll need something more powerful. You’ll need Selenium (which can handle everything except tribal rain dancing).
Selenium is a tool that automates browsers, also known as a web-driver. With it, you can actually open a Google Chrome window, visit a site, and click on links. Pretty cool, right?
It also comes with Python bindings for controlling it right from your application. This makes it a breeze to integrate with your chosen parsing library.
You can find the Selenium documentation here.
# To install — pip install selenium
# To import — import selenium
Here are some basic operation examples.
5. Scrapy
Scrapy is a web scraping framework, one of the most advanced available in Python. Scrapy provides bots that can scrape thousands of web pages at once. Here you create a web spider that goes from one page to another and gives you the data.
Scrapy is arguably the greatest web scraping framework, and it was developed by a team with a lot of enterprise scraping experience. Software built on top of this library can be a crawler, a scraper, a data extractor, or all of these together.
With this framework, you can create spiders that crawl web pages and scrape the desired data from the web.
# To install — pip install Scrapy
# To import — import scrapy
Here is an image of the Scrapy architecture, borrowed from the documentation.
This is the basic code for creating a spider with Scrapy. There are tons of predefined classes and methods, and you just have to use them to create your spider. It is easy to create a web spider with this package, although it is quite difficult for a beginner to create a fully functional web scraper.
It’s a complete web scraping framework. That means you can use it to manage requests, preserve user sessions, follow redirects, and handle output pipelines.
It also means you can swap out individual modules for other Python web scraping libraries. For instance, if you need to insert Selenium for scraping dynamic web pages, you can do that (see examples on Stack Overflow here).
So if you need to reuse your crawler, scale it, manage complex data pipelines, or cook up some other sophisticated spider, then Scrapy was made for you.
6. Pyppeteer
When Selenium is used, there is one piece of trouble: the configuration of the environment. You need to install the relevant browsers, such as Chrome or Firefox, and then go to the official website to download the corresponding drivers. Most importantly, you have to install the corresponding Python selenium library, which is really not very convenient. In addition, if you want to do a large-scale deployment, environment configuration becomes a headache.
In Pyppeteer, there is actually a Chromium browser behind the scenes, performing the actions needed to render web pages.
# To install — pip install pyppeteer
# To import — import pyppeteer
Here are basic usage examples, taken from the documentation.
Example: open web page and take a screenshot.
Example: evaluate script on the page.
How to choose the right tool?
Now, we have discussed the most used libraries. Yes, I understand it was a short introduction; anyway, I have provided the links if you want to know more about the functionality of the individual libraries. There are so many options here that one can easily get confused about which library or tool will be right for the task at hand. I will try to answer this based on my experience.
To answer this, I basically divide the data sources into a few categories:
- The first one is to use the API. In this case, owners grant access to the users as they have a vested interest in subscriptions, newsletters, and so on.
For this kind of task, the requests module will be more than enough.
- Then come static pages without JavaScript. For these we can use Beautiful Soup along with the requests module to parse the response.
The output of soup is an object that holds the information in a tree-like structure, and we can easily find the data we are looking for with the help of various built-in methods like find and find_all, as shown in the example below.
For example:
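A small sketch of find and find_all; `links_from` is a hypothetical helper name and the markup is invented for illustration:

```python
import requests
from bs4 import BeautifulSoup

def links_from(html):
    soup = BeautifulSoup(html, "html.parser")
    # find returns the FIRST match (or None if nothing matches)
    heading = soup.find("h2")
    # find_all returns EVERY match; href=True keeps only real links
    anchors = soup.find_all("a", href=True)
    return (heading.get_text() if heading else None,
            [a["href"] for a in anchors])

# To use it on a live static page:
# heading, links = links_from(
#     requests.get("https://example.com", timeout=10).text)
```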
- Another type is static websites with JavaScript enabled. For these websites we can use a session-based approach; the requests library also has a Session class. Use a session for your requests instead, and it should work like a charm.
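A sketch of the session-based approach with requests.Session; the URLs and the header value are placeholders:

```python
import requests

def make_session():
    # A Session keeps cookies across requests, which many JS-light
    # sites need before they will serve their content
    session = requests.Session()
    session.headers.update({"User-Agent": "Mozilla/5.0 (demo)"})
    return session

# Usage sketch:
# session = make_session()
# session.get("https://example.com/")             # first visit sets cookies
# page = session.get("https://example.com/data")  # cookies sent automatically
```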
- Finally, there are websites with dynamic data rendering. Dynamic content is one of the main challenges of web scraping. In this case, we need a web browser capable of running the dynamic content and scripts on the client side, and analyzing the DOM structure through screen scraping can be a good choice. Selenium can be the go-to option, since we control an actual browser, and it can also extract data from websites that require a login. However, configuring it is sometimes really painful. As another option, we can use Pyppeteer to solve the configuration problem.
As an example, I logged in to Instagram and obtained media from the home page using Selenium.
Challenges to Web Scraping
Okay, we know a lot of the general stuff now. In short, we know that for static websites the requests library with Beautiful Soup will do most of our work, and for dynamic websites Selenium or Pyppeteer can be options.
But there is more that we should consider. I have tried to include all the major problems here.
- Anti-Scraping Technologies
Some websites use anti-scraping technologies that push away any scraping attempt. They apply dynamic coding algorithms to prevent any bot intervention. It requires a lot of time and money to work around such anti-scraping technologies.
- IP blocking
IP address blocking is another common issue that a web crawler faces. If you make requests too often to a website from the same IP, there is a high chance that the site will block your IP address. Some websites use anti-scraping technologies that make the site hard to scrape; LinkedIn and Instagram are examples of this.
- Website Structure Changes
Every website periodically updates its user interface to improve its attractiveness and user experience, and this brings various structural changes too. Since web scrapers are set up according to the code elements of the website at a given time, they require changes as well, sometimes weekly, to keep targeting the website correctly, because incomplete information about the website’s structure will lead to improper scraping of data.
- Slow Scraping Speed
If we are scraping a large number of websites or pages, we can run into the problem of scrapers taking a lot of time. It is really important to optimize our scripts to address this. We can also use multiple instances or Docker containers to minimize it.
- Quality and Freshness of the Data
The quality and freshness of data must be among your most important criteria when choosing sources for data extraction. The data you acquire should be fresh and relevant to the current time period for it to be of any use at all. When selecting sources for your data extraction project, always look for sites that are updated frequently with fresh and relevant data. You can check the last-modified date in the site’s source code to get an idea of how fresh the data is.
- Captcha
A captcha is a type of challenge-response test used in computing to determine whether or not the user is human, and it is quite popular these days for keeping spammers away. Web services like Cloudflare prevent bots and provide DDoS protection services, which makes it even harder for bots to perform their tasks.
- Website Specific Traps
Many websites do such things to detect whether the user is a bot or a human. They use AI-based tools for that, which tell the difference between how a human operates a website and how a bot does.
This detection is not easy and requires a significant amount of programming work to accomplish correctly.
- Authentication
Sometimes we need to scrape private data that is available only once you are authenticated on the website. Some websites are simple and straightforward: just pass the credentials in the POST headers and you are good to go. But some are not that easy; they expect hidden inputs, CSRF tokens, or other specific headers, which we have to look for. Here, too, Selenium can be your solution: since we control the browser, it will handle all of this most of the time.
Best Practices for Web Scraping
To address the above problems, we can follow practices that make us look less like a bot and be harder to detect. Here is a compilation of the best practices that you should follow while scraping websites.
- Check the robots.txt file
You should always check the robots.txt file of a website you are planning to extract data from. Websites set rules on how bots should interact with the site in their robots.txt file; some websites, like Flipkart, provide it at https://www.flipkart.com/robots.txt. Think of a robots.txt file as being like a “Code of Conduct” sign posted on the wall at a gym, a bar, or a community center: the sign itself has no power to enforce the listed rules, but “good” patrons will follow the rules, while “bad” ones are likely to break them and get themselves banned.
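The standard library can do this check for us. Here is a sketch using urllib.robotparser with a tiny invented robots.txt; for a real site you would call set_url() and read() instead of parse():

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# For a live site:
#   rp.set_url("https://www.flipkart.com/robots.txt"); rp.read()
# Here we parse an invented ruleset directly for illustration
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# can_fetch tells us whether the rules allow a given URL
print(rp.can_fetch("*", "https://example.com/public/page"))   # True
print(rp.can_fetch("*", "https://example.com/private/page"))  # False
```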
- Pay respect to source website
You have to understand that you are accessing someone else’s platform to achieve your data goal, so why not pay them some respect? Web servers are susceptible to downtime if the load is very high. Just like human users, bots also add load to the website’s server, and if the load exceeds a certain limit, the server might slow down or crash, rendering the website unresponsive to users. This creates a bad user experience for human visitors, which defeats the entire purpose of the site. Do not hit the servers too frequently, or choose to scrape during off-peak hours.
- Finding reliable sources
The quality of data is also very important, so to ensure it we can:
a. Avoid Sites with too many broken links
b. Avoid Sites with Highly Dynamic Coding Practices
c. Always check the freshness of data
- Avoid getting IP blocked
Making multiple requests to the source website can get you blocked for a limited or unlimited time. To address this, you can do the following:
a. Use proxies — The best way to avoid IP blocking is by regularly rotating your identity, i.e. your IP address. It is always better to rotate IPs and use proxy and VPN services so that your scraper won’t get blocked. By doing so, you make requests through different IPs. There are sites that provide lists of free proxies you can start with, but don’t use them in a production environment, since that would not be efficient.
b. Use random time delays — To make the operation look less like a bot and more like a human, add a random time delay after each request. This will lower your chances of being detected as a bot.
c. User-Agent — The User-Agent request header is a character string that lets servers and network peers identify the application, operating system, vendor, and/or version of the requesting user agent. If you use the same User-Agent for every request, you will be banned in no time. To avoid this, you have to either create a list of User-Agents or use a library like fake-useragent.
d. Headers — When you make a request to a website from your browser, it sends a list of headers. Websites analyze these headers to learn about your identity. To make your scraper look more human, you can send such headers too.
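A sketch combining points b, c, and d: rotating User-Agents, browser-like headers, and random delays. The helper names and the small User-Agent pool are invented for illustration:

```python
import random
import time

# A small pool of User-Agent strings to rotate through; in practice
# you might use the fake-useragent library or a much longer list
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]

def polite_headers():
    # Browser-like headers with a randomly chosen User-Agent
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }

def polite_pause(low=1.0, high=4.0):
    # Random delay between requests so the traffic looks less bot-like
    time.sleep(random.uniform(low, high))

# Usage sketch with requests (URL is a placeholder):
# resp = requests.get("https://example.com",
#                     headers=polite_headers(), timeout=10)
# polite_pause()
```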
- Avoid Captcha / Use Captcha Solving Services
Many websites use reCAPTCHA from Google, which makes you pass a test. If the test succeeds within a certain time frame, it concludes that you are not a bot but a real human being. If you are scraping frequently, this problem comes up often. In this case you can use captcha-solving services. Note that some of these services are fairly slow and expensive, so you may need to consider whether it is still economically viable to scrape sites that require continuous captcha solving over time. I am linking an article in case you come across this problem.
- Avoid Honeypot Traps
These are traps set by some websites to detect hacking activity, and they can easily detect whether the user is a bot. This problem is sometimes very hard to deal with. Just follow the best practices, scrape data slowly, act less like a bot, and you can avoid it.
The Tricky Question: Legality
I look at web scraping as a tool, a tool that we should never misuse.
We now have a pretty big picture of web scraping. I tried to answer many questions here, and I hope I answered them well. If there are any problems in this article, if you have any suggestions, if any part needs to be corrected, or if there is anything I can help you with, please write to me.
In writing this article I took help from these articles:
Introduction to Web Scraping — GeeksforGeeks
The Ultimate Guide to Web Data Extraction — PromptCloud
Advanced Python Web Scraping Tactics | Pluralsight
Web Scraping with Python and Scrapy | Pluralsight
Web Scraping Techniques: How to Scrape Data from the Internet | ParseHub
The Most Effective Web Scraping Methods | by JetRuby Agency | JetRuby
Python’s Requests Library (Guide) — Real Python
5 Tasty Python Web Scraping Libraries (elitedatascience.com)
Top 5 Popular Python Libraries for Web Scraping in 2020 | ScrapingAnt Blog