Most websites have multiple pages. Some have as few as 10, while others have more than 5,000 pages on their domain.

You may want to scrape a sitemap to see all of the pages the website has.

Today we’ll quickly show you how to scrape a website’s sitemap.

Getting started

Before we begin, you’ll need a web scraping tool. While there are several tools available, we think you’ll enjoy ParseHub. It’s free to use and comes with a suite of features, like:

  • Cloud-based scraping
  • IP rotation
  • Dropbox integration
  • Many more!

Download ParseHub for free

Web scraping a sitemap without any coding skills

To find a website’s sitemap, add one of the following after the domain name:

  • /sitemap
  • /sitemap.xml
  • /sitemap_index.xml

For example:

https://yourwebsite.com/sitemap.xml
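If you're curious what this looks like in code, here is a minimal Python sketch that builds the candidate sitemap URLs for a domain. The helper name `candidate_sitemap_urls` is made up for illustration, and the sketch only builds the URLs; it does not check whether any of them actually exist.

```python
from urllib.parse import urljoin

# The common sitemap locations listed above.
SITEMAP_PATHS = ["/sitemap", "/sitemap.xml", "/sitemap_index.xml"]


def candidate_sitemap_urls(base_url):
    """Return the sitemap URLs worth trying for a site (hypothetical helper)."""
    root = base_url.rstrip("/") + "/"
    # A path that starts with "/" replaces any existing path,
    # so every result sits directly after the domain name.
    return [urljoin(root, path) for path in SITEMAP_PATHS]


for url in candidate_sitemap_urls("https://yourwebsite.com"):
    print(url)
```

You would then open each URL in your browser until one of them shows the sitemap.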

Once you've found the sitemap URL you want to scrape, let's get started!

1. Download and install ParseHub. Click on the new project button and enter the sitemap URL into the text box. The website will now render inside the app.

[Image: Sitemap rendering inside of ParseHub]

2. A Select command will automatically be created. While using the Select command, click on the first URL on the sitemap. The URL you’ve selected will turn green, and ParseHub will highlight the other URLs it suggests extracting in yellow.

[Image: Selecting the first URL on the page]

3. Click on the next URL that is in yellow to select them all. The rest of the URLs will now be highlighted in green.

4. On the left sidebar, rename your URL selection to something more appropriate. We’re going to name it “URL_index”.

[Image: Renaming your URL selection]

5. Expand your new extraction by clicking on the icon, and delete the name that is being extracted, since the name is the same as the URL extraction.

6. Now use the PLUS(+) button next to the URL selection and choose the “Click” command. A pop-up will appear asking whether this link is a “next page” button. Click “No”, and next to “Create New Template” enter a new template name. In this case, we will use URL_Pages.

[Image: Click command]

7. Once the new page is loaded, click on the first URL on the page to extract it.

8. ParseHub will now suggest which other URLs you want to extract in yellow.

9. Click on the next URL that is in yellow to select them all. The rest of the URLs will now be highlighted in green.

[Image: Selecting all URLs]

10. Rename your new selection to something more appropriate. We're going to call it "page_url". Then expand your new selection and delete the name that is being extracted.

11. On the left sidebar, click the PLUS(+) sign next to the URL selection and choose the Relative Select command.

[Image: Using the Relative Select command to extract the date modified]

12. Using the Relative Select command, click on the URL that is highlighted in orange and then on its last modified date. You will see an arrow connect the two selections.

[Image: Date modified]
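For readers who want to see roughly what these steps extract, here is a minimal Python sketch that reads a sitemap's XML and pulls out each URL and its last modified date. It is not how ParseHub works internally. It assumes the sitemap follows the standard sitemaps.org XML format, and the function name, the field names and the sample XML are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Made-up sample data: a sitemap index and one of the sitemaps it points to.
INDEX_XML = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://yourwebsite.com/post-sitemap.xml</loc></sitemap>
</sitemapindex>"""

PAGES_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://yourwebsite.com/blog/first-post</loc><lastmod>2021-01-15</lastmod></url>
  <url><loc>https://yourwebsite.com/about</loc></url>
</urlset>"""


def parse_sitemap(xml_text):
    """Return (kind, entries); kind is "index" or "urlset" (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    # A sitemap index lists <sitemap> children; a regular sitemap lists <url>.
    if root.tag.endswith("sitemapindex"):
        kind, child = "index", "sm:sitemap"
    else:
        kind, child = "urlset", "sm:url"
    entries = []
    for node in root.findall(child, NS):
        loc = node.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = node.findtext("sm:lastmod", default=None, namespaces=NS)
        entries.append({
            "url": loc,
            # <lastmod> is optional, so it may be missing.
            "last_modified": lastmod.strip() if lastmod else None,
        })
    return kind, entries


print(parse_sitemap(INDEX_XML))
print(parse_sitemap(PAGES_XML))
```

The index plays the role of the “URL_index” selection, and each sitemap it links to plays the role of the URL_Pages template, where the page URLs and their dates are collected.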

Running and Exporting your Project

Now our project is ready to scrape our sitemap. To do this, simply click on the green “Get Data” button on the left sidebar.

This is where you can test, run or schedule your project. We recommend doing a Test Run for longer and bigger projects just to make sure your data will be extracted and formatted correctly.

But for this project, click on the “Run” button to begin your scrape.

Once ParseHub is done scraping the website, you will be notified by email and you’ll be able to download your extracted data as an Excel/CSV or as a JSON file.
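To give a feel for the difference between the export formats, here is a small Python sketch that writes the same rows as CSV and as JSON. The column names and sample rows are assumptions made for illustration; the files you download from ParseHub may be structured differently.

```python
import csv
import io
import json

# Illustrative column names, not necessarily what ParseHub exports.
FIELDS = ["url", "last_modified"]


def to_csv(rows):
    """Render rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)  # None values are written as empty cells
    return buffer.getvalue()


def to_json(rows):
    """Render rows as an indented JSON array."""
    return json.dumps(rows, indent=2)


# Made-up sample rows.
rows = [
    {"url": "https://yourwebsite.com/blog/first-post", "last_modified": "2021-01-15"},
    {"url": "https://yourwebsite.com/about", "last_modified": None},
]

print(to_csv(rows))
print(to_json(rows))
```

CSV opens directly in Excel and other spreadsheet tools, while JSON is usually easier to feed into another program.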

Closing Thoughts

Now you know how to scrape a sitemap and export it as a CSV/Excel or JSON file without any coding skills.

This is a great way to analyze your competitor’s websites and see what pages they have.

We understand that projects can get quite complex. If you need any help, you can contact our customer support team through our live chat. We will be more than happy to assist you!

What will you scrape?

Happy Scraping!

Download ParseHub for Free Today!