Web Scraping: A Guide

Introduction to Web Scraping

Web scraping is a technique for fetching data from websites. One way is to manually copy and paste the needed data from websites, but that is very time-consuming. Web scraping automates the extraction of data from websites with the help of software or programs called scrapers. These can be custom built for one site or configured to work with any website based on user requirements.

Potential needs for web scraping:

  1. To analyze customer feedback and understand customers’ perspectives on a product, which helps companies develop new strategies.
  2. To collect data for services like content aggregators that serve content to consumers.
  3. To analyze financial statements and develop more efficient programs to manage finances.
  4. To understand social media trends and how people react to them.
  5. To do any other kind of research and development work.

Let’s talk technical now!!

Alright, we have a rough idea of what web scraping is and its uses. Now the question is: how do we do it? Well, we have various options and techniques available, but I will choose Python any time for this task. Ask me why? Simple answer: Python has various features that make it the most suitable.

  1. Python is super simple and easy to use. We have to write fewer lines of code, and it’s less messy.
  2. Support for a large number of libraries like Pandas, Requests, Matplotlib, Selenium, Scrapy, etc., which makes it well suited for data extraction and manipulation.
  3. Since Python is open source, we have a large and active community that helps us if we get stuck anywhere.
Whatever tools we pick, the general workflow stays the same:

  1. We need the URL of the website that we want to scrape the data from.
  2. We make a request to the URL, which gives us a response.
  3. We parse the response.
  4. We find the data we want to extract.
  5. We store the data in the required format.
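
To make these steps concrete, here is a minimal sketch, assuming the requests and Beautiful Soup libraries covered below and a placeholder URL:

```python
# Minimal sketch of the five steps; the URL and the <h2> selector
# are placeholders, adjust them to the site you are scraping.
import csv

import requests
from bs4 import BeautifulSoup

url = "https://example.com"                          # 1. the website URL
response = requests.get(url, timeout=10)             # 2. make a request
soup = BeautifulSoup(response.text, "html.parser")   # 3. parse the response
titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]  # 4. find the data

with open("titles.csv", "w", newline="") as f:       # 5. store it in the required format
    csv.writer(f).writerows([t] for t in titles)
```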

Libraries

Libraries ease the work; they come with various built-in implementations that we can use directly. There are tons of libraries and niche scrapers around the community, but I’d like to share the most popular of them.

1. Requests

Python Requests is the most basic HTTP library available. Requests allows the user to send requests to an HTTP server and get responses in the form of HTML or JSON. It also allows the user to submit POST requests to the server in order to change or add content. It does not parse the received HTML data, so another library is needed for that.
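
As a minimal sketch of the two request types, using httpbin.org as a public echo service purely for illustration:

```python
import requests

# GET: fetch a page; the response body still needs a separate parser
resp = requests.get("https://httpbin.org/html", timeout=10)
print(resp.status_code)                  # e.g. 200
html = resp.text                         # raw HTML string

# POST: submit data to the server
resp = requests.post("https://httpbin.org/post",
                     data={"name": "scraper"}, timeout=10)
print(resp.json()["form"])               # httpbin echoes the submitted form back
```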

2. Beautiful Soup

Beautiful Soup is a Python library for parsing structured data. It allows you to interact with HTML in a similar way to how you interact with a web page using developer tools.
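
A small sketch of that interaction, using an inline HTML string so it runs on its own:

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1>Products</h1>
  <ul>
    <li class="item">Laptop</li>
    <li class="item">Phone</li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.find("h1").get_text())              # first matching tag
for li in soup.find_all("li", class_="item"):  # every matching tag
    print(li.get_text())
```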

3. LXML

This is one of the best parsers for HTML and XML. It eases the handling of XML and HTML files and is widely used for its simplicity and extremely fast parsing. It is very useful in web scraping because it can easily parse large HTML or XML files.

  • If you need speed, go for lxml.
  • If you need to handle messy documents, choose Beautiful Soup.
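
For comparison, a minimal lxml sketch querying the same kind of document with XPath:

```python
from lxml import html

doc = html.fromstring("""
<html><body>
  <div id="prices">
    <span class="price">10</span>
    <span class="price">20</span>
  </div>
</body></html>
""")

# XPath gives fast, precise access to nodes
prices = doc.xpath('//span[@class="price"]/text()')
print(prices)  # ['10', '20']
```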

4. Selenium

Selenium acts as a web driver. This API provides a way to drive browsers like Firefox, IE, and Chrome, locally or remotely. A program can do almost all the tasks that a user can perform in a web browser: filling forms, clicking buttons, opening pages, and a lot more. It is a very useful tool for web scraping in Python.
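
A hedged sketch of driving a browser with the Selenium 4 API; it assumes Chrome and its driver are available, and the search-box name attribute is specific to the example site:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    driver.get("https://www.wikipedia.org")
    box = driver.find_element(By.NAME, "search")  # locate the search field
    box.send_keys("web scraping")                 # type like a user would
    box.submit()                                  # submit the form
    print(driver.title)                           # title of the results page
finally:
    driver.quit()                                 # always close the browser
```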

5. Scrapy

Scrapy is a web scraping framework, one of the most advanced available in Python. Scrapy lets you build bots (spiders) that can scrape thousands of web pages at once. You create a web spider that goes from one page to another and hands you the data.
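
A minimal spider sketch in the style of the Scrapy tutorial; quotes.toscrape.com is a public practice site, and you can run the file with `scrapy runspider quotes_spider.py`:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # yield one item per quote on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # follow the "Next" link so the spider goes from page to page
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```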

6. Pyppeteer

Puppeteer is a tool developed by Google based on Node.js. With it, we can control some operations of the Chrome browser through JavaScript. Pyppeteer is a Python port of Puppeteer, but it is not developed by Google; it is an unofficial version built by a Japanese engineer on top of some of Puppeteer’s features. It uses Chromium and avoids tedious environment configuration.
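
A minimal Pyppeteer sketch; on first use it downloads its own Chromium, so no manual driver setup is needed:

```python
import asyncio

from pyppeteer import launch


async def main():
    browser = await launch(headless=True)
    page = await browser.newPage()
    await page.goto("https://example.com")
    content = await page.content()   # HTML after client-side JavaScript has run
    print(content[:200])
    await browser.close()


asyncio.run(main())
```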

How to Choose the Right Tools?

Now we have discussed the most used libraries. Yes, I understand it was a short introduction; I have also provided links if you want to know more about the functionality of the individual libraries. There are so many options here that one can easily get confused about which library or tool is right for the required task. I will try to answer this based on my experience.

  • The first option is to use an API. In this case, owners grant access to users because they have a vested interest in subscriptions, newsletters, and so on.
    For this kind of task the requests module (see the example above) will be more than enough.
  • Then come static pages without JavaScript. For these we can use Beautiful Soup along with the requests module to parse the response.
    The soup object holds the information in a tree-like structure, and we can easily find the data we are looking for with built-in methods like find and find_all, as shown in the Beautiful Soup example above.
  • Another type is static websites with JavaScript enabled. For these we can use a session-based approach; the requests library also has a Session class. Swap the session in for plain requests and it should work like a charm (a sketch appears after this list).
  • Finally, there are websites with dynamically rendered data. Dynamic content is one of the main challenges of web scraping. In this case, we need a web browser capable of running the dynamic content and scripts on the client side, and analyzing the DOM structure through screen scraping can be a good choice. Selenium is a go-to option since we can control an actual browser, and it can also extract data from websites that require a login. However, configuring it is sometimes really painful; Pyppeteer is another option that solves the configuration problem.
    As an example, you can log in to Instagram with Selenium and obtain media from the home page; a sketch follows this list.
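
Here are the sketches referenced above. First, the session-based approach: requests.Session keeps cookies and headers between calls, which is often enough for pages that only need a session cookie.

```python
import requests
from bs4 import BeautifulSoup

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})  # placeholder UA string

resp = session.get("https://example.com")      # cookies are stored on the session
soup = BeautifulSoup(resp.text, "html.parser")
print(soup.title.get_text() if soup.title else "no title found")
```

And a hedged Selenium sketch of the log-in-then-scrape idea. The URL, field names, and credentials are placeholders based on Instagram’s login form at the time of writing; inspect the real form before relying on them.

```python
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    driver.get("https://www.instagram.com/accounts/login/")
    time.sleep(5)  # crude wait for the login form to render

    driver.find_element(By.NAME, "username").send_keys("your_username")
    driver.find_element(By.NAME, "password").send_keys("your_password")
    driver.find_element(By.XPATH, "//button[@type='submit']").click()
    time.sleep(5)  # wait for the home feed to load

    # collect media URLs from the home page
    for img in driver.find_elements(By.TAG_NAME, "img")[:10]:
        print(img.get_attribute("src"))
finally:
    driver.quit()
```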
Challenges in Web Scraping

  1. Anti-Scraping Technologies
    Some websites use anti-scraping technologies that push away any scraping attempt. They apply dynamic coding to prevent bot intervention, and it requires a lot of time and money to work around such protections.
  2. IP Blocking
    IP address blocking is another common issue that a web crawler faces. If you make requests to a website too often from the same IP, there is a high chance that the site will block your IP address. Some websites use anti-scraping technologies that make them hard to scrape; LinkedIn and Instagram are examples of this.
  3. Website Structure Changes
    Every website periodically updates its user interface to improve its attractiveness and user experience, and that usually means structural changes too. Since web scrapers are written against the page elements as they exist at a given time, they need to be updated as well, sometimes weekly, to keep targeting the website correctly; outdated assumptions about the site structure lead to improper scraping of data.
  4. Slowness
    If we are scraping a large number of websites or pages, we can run into the problem of scrapers taking a lot of time. It is really important to optimize our scripts to address this. We can also use multiple instances, containers, or concurrent requests to speed things up (a sketch follows this list).
  5. Quality and Freshness of the Data
    The quality and freshness of data must be among your most important criteria when choosing sources for data extraction. The data you acquire should be fresh and relevant to the current time period for it to be of any use at all. Look for sites that are updated frequently with fresh and relevant data when selecting sources for your project. You can check the last-modified date in the site’s source code to get an idea of how fresh the data is.
  6. Captcha
    A captcha is a type of challenge-response test used in computing to determine whether or not the user is human, and it is quite popular these days for keeping spammers away. Web services like Cloudflare prevent bots and provide DDoS protection, which makes it even harder for bots to perform their tasks.
  7. Website-Specific Traps
    Many websites try to detect whether the user is a bot or a human. They use AI-based tools for that, which distinguish how a human operates a website from how a bot does.
    Working around this detection is not easy and requires a significant amount of programming work to accomplish correctly.
  8. Authentication
    Sometimes we need to scrape private data that is only available once you are authenticated on the website. Some websites are simple and straightforward: just pass the credentials in the POST request and you are good to go. Others are not that easy; they expect hidden inputs, CSRF tokens, or other specific headers that we have to look for. Selenium can be a solution here, since by controlling the browser it handles all of this most of the time.
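
On the slowness point above, here is a hedged sketch of fetching many pages concurrently with a thread pool; the URLs are placeholders, and max_workers should stay small enough to be polite to the server.

```python
from concurrent.futures import ThreadPoolExecutor

import requests

urls = [f"https://example.com/page/{i}" for i in range(1, 21)]  # placeholder URLs


def fetch(url):
    # one worker thread per in-flight request
    resp = requests.get(url, timeout=10)
    return url, resp.status_code, len(resp.text)


with ThreadPoolExecutor(max_workers=5) as pool:
    for url, status, size in pool.map(fetch, urls):
        print(url, status, size)
```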
Best Practices

  1. robots.txt
    Some websites, Flipkart for example, provide a robots.txt file: https://www.flipkart.com/robots.txt.
    You should always check the robots.txt file of a website you are planning to extract data from. Websites set rules on how bots should interact with the site in their robots.txt file. Think of a robots.txt file as being like a “Code of Conduct” sign posted on the wall at a gym, a bar, or a community center: the sign itself has no power to enforce the listed rules, but “good” patrons will follow the rules, while “bad” ones are likely to break them and get themselves banned.
  2. Pay Respect to the Source Website
    You have to understand that you are accessing someone else’s platform to meet your data goal, so why not pay them some respect? Web servers are susceptible to downtime if the load is very high. Just like human users, bots add load to the website’s server, and if the load exceeds a certain limit the server might slow down or crash, rendering the website unresponsive to its users. That creates a bad experience for the human visitors, which defeats the entire purpose of the site. Do not hit the servers too frequently, or choose to scrape during off-peak hours.
  3. Finding Reliable Sources
    The quality of the data is also very important, so to ensure this we can:
    a. Avoid sites with too many broken links
    b. Avoid sites with highly dynamic coding practices
    c. Always check the freshness of the data
  4. Avoid Getting IP Blocked
    Making multiple requests to the source website can get you blocked for a limited or unlimited time. To address this you can do the following (a combined sketch follows this list):
    a. Use proxies: the best way to avoid IP blocking is to regularly rotate your identity, i.e. your IP address. It is always better to rotate IPs and use proxy or VPN services so that your scraper won’t get blocked, since you are then making requests through different IPs. There are sites that provide lists of free proxies you can start with, but don’t use them in a production environment, since that would not be efficient.
    b. Use random time delays: to make the operation look less like a bot and more like a human, add a random delay after each request. This lowers your chances of being detected as a bot.
    c. User-Agent: the User-Agent request header is a string that lets servers and network peers identify the application, operating system, vendor, and/or version of the requesting user agent. If you use the same User-Agent for every request, you will be banned in no time. To avoid this, either create a list of User-Agents or use a library like fake-useragent.
    d. Headers: when your browser makes a request to a website, it sends a list of headers which the website uses to analyse your identity. To make your scraper look more human, send these headers too.
  5. Avoid Captchas / Use Captcha-Solving Services
    Many websites use Google’s reCAPTCHA, which makes you pass a test; if you pass it within a certain time frame, it considers you a real human being rather than a bot. If you are scraping frequently, this problem is very common. In that case you can use captcha-solving services. Note that some of these services are fairly slow and expensive, so you may need to consider whether it is still economically viable to scrape sites that require continuous captcha solving over time.
  6. Avoid Honeypot Traps
    These are traps that some websites set up to detect hacking activity; they can easily detect whether the user is a bot. This problem is sometimes very hard to deal with. Just follow the best practices, scrape data slowly, and act less like a bot, and you can avoid it.
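
A sketch that ties a few of these practices together: checking robots.txt with the standard library, rotating the User-Agent header, and sleeping a random interval between requests. The base URL, paths, and User-Agent strings are placeholders.

```python
import random
import time
from urllib.robotparser import RobotFileParser

import requests

BASE = "https://example.com"
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (X11; Linux x86_64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

robots = RobotFileParser()              # respect the site's robots.txt
robots.set_url(f"{BASE}/robots.txt")
robots.read()

for path in ["/page/1", "/page/2", "/page/3"]:    # placeholder paths
    if not robots.can_fetch("*", BASE + path):
        continue                                  # skip disallowed paths

    headers = {
        "User-Agent": random.choice(USER_AGENTS),  # rotate the User-Agent
        "Accept-Language": "en-US,en;q=0.9",       # look more like a browser
    }
    resp = requests.get(BASE + path, headers=headers, timeout=10)
    print(path, resp.status_code)

    time.sleep(random.uniform(2, 5))              # random delay between requests
```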

The Tricky Question: Legality

I am not a lawyer; I am an engineer who uses this as a tool to fulfil my data needs without abusing its power. There are various arguments in favor and against, and various factors to consider when answering this question, such as copyright and violating a site’s Terms of Service (ToS). The legality of web scraping is a grey area that keeps developing as time goes on. Although web scrapers technically just speed up data surfing, loading, copying, and pasting, web scraping is also the key culprit behind rising cases of copyright violation, breached terms of use, and other activities that are highly disruptive to a company’s business.

A Short Word!

I look at web scraping as a tool, a tool that we should never misuse.
We now have a pretty big picture of web scraping. I tried to answer many questions here, and I hope I answered them well. If there are any problems in this article or you have any suggestions, please write to me. If you don’t like any part and it needs to be corrected, or if there is anything I can help you with, let me know.

MY PROFILES —

LinkedIn, GitHub

HAPPY SCRAPING!!!
