How to Scrape PubMed and Research Papers

In this blog, we will show you how to scrape medical journals and research papers from PubMed with our free web scraper, ParseHub, in just a few steps!

You can also check out our guide on web scraping Google Scholar.

Whether you need to scrape journals and papers for research or for citations, this guide will help you gather large numbers of references, abstracts and articles from PubMed. PubMed was released in 1996 and has been used ever since as a free search engine for life science and biomedical research papers. It holds over 30 million records, 7.5 million of which are available for free.

Ready to scrape scholarly journals and papers? Let’s get started!

Step 1: Scraping Articles

  1. Firstly, open the ParseHub application and log in.
  2. On the front page, click “New Project” to start a new project.
  3. Enter the PubMed URL you would like to scrape from. We will use this URL to scrape articles containing the term ‘web scraping’: https://pubmed.ncbi.nlm.nih.gov/?term=web+scraping
  4. Once the page fully loads, click the first article’s name to extract it. It should be an A (link) tag.
  5. The rest of the article titles should turn yellow. Click the next one to train the algorithm.
  6. All 10 article titles on the first page should now be extracted. Rename this selection to “article” on the left pane.
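
If you want to scrape a different search term, the start URL from step 3 can be built programmatically. Here is a minimal Python sketch that assumes only the URL pattern shown above. The helper name `pubmed_search_url` is our own invention for illustration and is not part of ParseHub or PubMed.

```python
from urllib.parse import urlencode

PUBMED_BASE = "https://pubmed.ncbi.nlm.nih.gov/"


def pubmed_search_url(term: str) -> str:
    """Build a PubMed search URL for a free-text term (hypothetical helper)."""
    # urlencode escapes spaces as "+", which matches the URL used in this guide.
    return PUBMED_BASE + "?" + urlencode({"term": term})


if __name__ == "__main__":
    print(pubmed_search_url("web scraping"))
```

Paste the printed URL into ParseHub when you create the project.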

Step 2: Scraping Additional Information

Now that we have each article title, let’s scrape the authors, summary and PubMed ID (PMID) for each one:

  1. Firstly, click the PLUS(+) button on the “article” selection you just made.
  2. Choose “Relative Select” and click the first article’s title again.
  3. Move the arrow to the respective article’s author(s) and click to close the arrow.
  4. All 10 article authors will now be extracted. Rename this selection to “author” on the left pane.
  5. Now do another “Relative Select” by clicking the PLUS(+) button next to your “article” selection from the first step.
  6. Click the first article’s title and this time close the arrow on its summary.
  7. Rename this extraction on the left to “info”.
  8. Finally, repeat the “Relative Select” step, this time clicking the title and then the PubMed ID.
  9. Rename this selection on the left to “PMID”.
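
Once you have the “PMID” values, a little post-processing can turn each one into a direct article link. The sketch below is a rough illustration, not part of ParseHub: the helper names are hypothetical, the ID `12345678` is a made-up placeholder, and we assume the extracted text may or may not carry a “PMID:” label in front of the digits.

```python
import re
from typing import Optional


def clean_pmid(raw: str) -> Optional[str]:
    """Return the first run of digits in the scraped text, or None if there is none."""
    match = re.search(r"\d+", raw)
    return match.group(0) if match else None


def pubmed_article_url(pmid: str) -> str:
    """Build the article page URL for a PubMed ID."""
    return f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/"


if __name__ == "__main__":
    scraped = "PMID: 12345678"  # placeholder value, not a real scraped record
    pmid = clean_pmid(scraped)
    if pmid is not None:
        print(pubmed_article_url(pmid))
```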

Step 3: Pagination

If we begin the script now, we will only scrape a single page. To scrape multiple or all pages, we need to use ParseHub’s pagination.

  1. Begin by scrolling down the PubMed page until you see the next page navigation bar.
  2. Click the PLUS(+) button next to your “page” selection and choose “Select”.
  3. Click the next button on the navigation bar to select it.
  4. Rename this selection to “pagination”, expand it, and delete the extraction.
  5. Click the PLUS(+) button next to your “pagination” selection and choose “Click”.
  6. Choose “Yes” on the popup, as this is a next-page button.
  7. Enter the number of additional pages you wish to scrape. We chose 2, which gives 3 pages of scraped data in total (the first page plus 2 more). Entering 0 will scrape every single page available!
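
The page arithmetic in the last step is easy to get wrong by one, so here is a small Python sketch of it. It assumes 10 results per page, as on the search page used in this guide. The function names are ours, and `None` is simply our way of representing “every page available”.

```python
from typing import Optional

RESULTS_PER_PAGE = 10  # assumption: the search page in this guide listed 10 titles


def total_pages(additional_pages: int) -> Optional[int]:
    """Pages scraped for a given 'additional pages' setting; None means all pages."""
    if additional_pages < 0:
        raise ValueError("additional_pages must be 0 or greater")
    if additional_pages == 0:
        return None  # 0 tells ParseHub to keep clicking "next" until the last page
    return additional_pages + 1  # the first page plus the extra ones


def max_articles(additional_pages: int, per_page: int = RESULTS_PER_PAGE) -> Optional[int]:
    """Upper bound on the number of articles scraped; None when unbounded."""
    pages = total_pages(additional_pages)
    return None if pages is None else pages * per_page


if __name__ == "__main__":
    print(total_pages(2), max_articles(2))
```

With the value 2 used in this guide, that works out to 3 pages and at most 30 articles.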

Step 4: Start PubMed Scraping

Awesome, just one step left, and that is to begin scraping on ParseHub’s servers!

To begin scraping, click the green “Get Data” button on the left-hand side of ParseHub. You can test, run or schedule your scrape. We chose “Run” to scrape a single time, which resulted in 3 pages of scraped data, as specified in the pagination step.

If you followed our tutorial, each scraped article in your results should include its title, author(s), summary (“info”) and PMID.
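
If you download the results as JSON and want them in a spreadsheet-friendly form, a short script can flatten them. This is only a sketch under assumptions: we assume the export groups rows under an “article” key, with the title stored as “name” alongside the “author”, “info” and “PMID” selections from this guide. Your export may be shaped differently, and the sample record is made up.

```python
import csv
import io
import json

# Assumed field names: "name" for the title, plus the selections named in this guide.
FIELDS = ["name", "author", "info", "PMID"]


def export_to_csv(export_json: str) -> str:
    """Flatten an assumed {"article": [...]} JSON export into CSV text."""
    data = json.loads(export_json)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in data.get("article", []):
        # Missing fields become empty cells instead of raising an error.
        writer.writerow({field: row.get(field, "") for field in FIELDS})
    return buffer.getvalue()


# Made-up placeholder record, not real PubMed data.
SAMPLE = json.dumps({
    "article": [
        {"name": "Example title", "author": "Doe J", "info": "Example summary", "PMID": "12345678"}
    ]
})

if __name__ == "__main__":
    print(export_to_csv(SAMPLE))
```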

If you need help scraping scientific journals, articles or any other publication, you can contact our live support.

Happy Scraping! 💻