When it comes to search engines, Google is king. Google was visited 62.9 billion times in 2019 alone. With millions of websites indexed on Google, you can extract this data to help you write up-to-date articles and industry analyses, do keyword research and find SEO opportunities. You can also find great peer-reviewed books to help you with your research!
We are ParseHub, and today we’ll show you how to scrape Google Scholar to extract content that is relevant to your research.
With a web scraper like ParseHub, we will be able to extract the PDF and book URLs that are related to a certain keyword. We will also extract the page title, description and author.
Now let’s begin!
Web scraping Google Scholar
For this project, we are going to scrape the PDFs and books that target the term “data science”.
If you would like to follow along with this example, you can use this link here.
How to scrape Google Scholar
1. Download and install ParseHub. Click on the New Project button and submit the URL into the text box. The website will now render inside the app.
2. A select command will automatically be created. While using the select command, click on the first title that is on the results page. You should notice the title you selected will be in green. ParseHub will now suggest which other elements you want to extract in yellow.
3. Click on the next title that is in yellow to select them all. You may need to do this 2 to 3 times to teach ParseHub what to extract. The rest of the page titles will now be highlighted in green.
4. On the left sidebar, rename your selection to something more appropriate. We’re going to name it “title”.
5. Click on the PLUS (+) sign next to your title and choose the relative select command.
6. Click on the title that is highlighted in orange, then click on the description below it. An arrow will appear showing the association you have created. You may need to repeat this step to fully train the Web scraper. Rename your selection to “description”.
7. Repeat the previous two steps to extract other data, such as the author.
Your project should look like this:
If we were to run our project now, we would only extract the 10 results on the first page. We will now show you how to add pagination to your web scraping project.
1. Click the PLUS (+) sign next to your page selection and choose the “Select” command.
2. Using the Select command, scroll all the way down to the Next Page link. Click on it to select it and rename your selection to next_button.
3. Click on the icon next to your next_button selection to expand it.
4. Delete the two commands under the next_button selection.
5. Click on the PLUS (+) sign next to your next_button selection and add a Click command.
6. A pop-up will appear asking you if this is a “next page” link. Click on Yes and enter the number of times you’d like to repeat this process. In this case, we will repeat it 4 times.
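To get a feel for how much data these pagination settings will collect, here is a minimal Python sketch. It assumes that Google Scholar shows 10 results per page, that each repeat clicks the Next Page link once after the first page has been scraped, and that result pages are addressed with a `start` offset in the URL. The helper name `scholar_page_urls` is made up for illustration and is not part of ParseHub.

```python
from urllib.parse import urlencode

RESULTS_PER_PAGE = 10  # assumption: Google Scholar's default page size


def scholar_page_urls(query, next_clicks):
    """Return the URL of the first results page plus one URL per Next click."""
    base = "https://scholar.google.com/scholar"
    pages = next_clicks + 1  # the first page is scraped before any click
    urls = []
    for page in range(pages):
        params = urlencode({"q": query, "start": page * RESULTS_PER_PAGE})
        urls.append(f"{base}?{params}")
    return urls


urls = scholar_page_urls("data science", next_clicks=4)
for url in urls:
    print(url)
print(f"Pages: {len(urls)}, expected results: about {len(urls) * RESULTS_PER_PAGE}")
```

Under these assumptions, repeating the click 4 times covers 5 pages in total, or roughly 50 results.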
Running your Scrape
It is now time to run your scrape. To do this, click on the green Get Data button on the left sidebar. Here you will be able to test, schedule, or run your scrape job.
For larger projects, we recommend that you always test your job before running it. In this case, we will run it right away.
Once your run is completed, you will be able to download the results as an Excel or JSON file.
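If you choose the JSON file, you can post-process it with a few lines of Python. The sketch below is only an illustration: it assumes the export holds a top-level "title" list (the selection we named earlier) whose items carry "name", "url", "description" and "author" fields. The exact keys depend on how you named your selections, and the sample data and the function name `results_to_csv` are hypothetical.

```python
import csv
import io
import json

# Hypothetical sample standing in for the downloaded JSON file.
SAMPLE_EXPORT = """
{
  "title": [
    {"name": "Example result A", "url": "https://example.com/a.pdf",
     "description": "Placeholder description.", "author": "A. Author"},
    {"name": "Example result B", "url": "https://example.com/b",
     "description": "Another placeholder."}
  ]
}
"""

FIELDS = ["name", "url", "description", "author"]  # assumed field names


def results_to_csv(export, selection="title"):
    """Flatten one selection of the export into CSV text, one row per result."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for item in export.get(selection, []):
        # Missing fields (for example, no author found) become empty cells.
        writer.writerow({field: item.get(field, "") for field in FIELDS})
    return buffer.getvalue()


export = json.loads(SAMPLE_EXPORT)
csv_text = results_to_csv(export)
print(csv_text)
```

To use a real download, load your own file with `json.load` instead of parsing `SAMPLE_EXPORT`, and adjust `FIELDS` to match your selection names.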
Enabling IP Rotation (Paid Feature)
If your web scraping project comes back blank, you may be getting blocked from scraping the data. In that case, enabling IP Rotation should allow you to still scrape the Google Scholar data.
If you’re getting blocked, here is how you can enable IP Rotation.
Note: If you do enable IP rotation, your project will take longer to complete.
- Click on the gear icon, and then select Settings.
- Click on Rotate IP address.
- A pop-up will appear with a warning about your run speed. Click on OK.
Now run your project as normal.
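After the new run finishes, a quick way to confirm that it no longer comes back blank is to inspect the JSON export. This is a rough sketch that assumes the results sit in a top-level "title" list whose items have a "name" field; both keys and the function name `is_blank_run` are assumptions for illustration, so adapt them to your own selection names.

```python
def is_blank_run(export, selection="title"):
    """Return True when the export has no usable rows for the given selection."""
    rows = export.get(selection) or []
    # A row counts as usable only if it has a non-empty "name" value.
    return not any(row.get("name") for row in rows)


# Hypothetical exports: an empty one, an empty selection, and one with a result.
print(is_blank_run({}))
print(is_blank_run({"title": []}))
print(is_blank_run({"title": [{"name": "Example result"}]}))
```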
You now know how to scrape Google Scholar. The great thing about ParseHub is you can schedule your project to run every hour, day or week, depending on what you need.
If you run into any issues during this project, reach out to us via the live chat on our site and we will be happy to assist you with your project.