Web scraping projects can get quite complex. According to TechRadar, "Web scraping (web data extraction, web harvesting) is the process of fetching data from websites to be processed later."
For example, you might be trying to extract data from multiple different URLs from the same website.
Today, we will go over how to set up a web scraper to extract data from multiple different URLs.
A Free and Powerful Web Scraper
First up, you will need the right web scraper to tackle this task.
We personally recommend ParseHub, a free and powerful web scraper that can extract data from any website.
To get started with this guide, download and install ParseHub for free.
Is It Legal To Scrape Multiple Websites?
Scraping multiple websites is generally legal as long as the data you collect is publicly available. If you scrape sensitive or private data, you may be harvesting it unlawfully.
To be safe, only scrape public data that does not require a login to access. You should also check your local and federal laws to make sure you are conducting legal web scraping.
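Beyond the legal side, a common courtesy check before scraping any site is its robots.txt file. Here is a minimal Python sketch using only the standard library; it is not part of the ParseHub workflow, and the example.com URLs are placeholders you would swap for the site you plan to scrape.

```python
from urllib import robotparser

# Minimal robots.txt check (illustration only, not part of ParseHub).
# The example.com URLs below are placeholders.
parser = robotparser.RobotFileParser()
parser.set_url("https://www.example.com/robots.txt")
parser.read()

allowed = parser.can_fetch("*", "https://www.example.com/some/product/page")
print("Allowed by robots.txt:", allowed)
```

If can_fetch returns False for the pages you care about, that is a strong hint the site owner does not want automated access to them.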
Submitting your list of URLs to Scrape
Now it’s time to get started scraping.
For this example, we will extract data from 4 random Amazon product listings. We will set up our scraper to open the URLs for each product page and extract some data we have selected.
- Start by opening ParseHub. Click on “new project” and enter a basic URL. We will start with the Amazon.ca homepage.
- ParseHub will now render the webpage inside the app.
- Now, let’s give ParseHub the list of URLs we will be scraping data from. To do this, start by clicking on the Settings icon at the top right of the screen.
- Here in the settings menu, we can submit our list of URLs under the “Starting Value” section. We can do that either by clicking on the green “Import from CSV/JSON” button or by copy/pasting our list of URLs in JSON format right into the text box.
- If you are submitting a CSV file of URLs, make sure the header reads “urls” and the URLs are listed below it, just like in the image below. (If you’d rather generate this file with a script, see the short sketch after this list of steps.)
- In this case, we will just copy-paste our list of URLs in JSON format into the text box.
- Here is the code we pasted in, in case you want to use it and swap in your own URLs.
```json
{
  "urls": [
    "https://www.amazon.ca/Fire-TV-Stick-with-All-New-Alexa-Voice-Remote/dp/B0791Z1G6W",
    "https://www.amazon.ca/CM7000-Surround-Headphones-Canceling-Nintendo/dp/B07W22F9G3",
    "https://www.amazon.ca/Nintendo-Switch-with-Gray-Joy%E2%80%91Con/dp/B07VJRZ62R",
    "https://www.amazon.ca/Etubby-Microphone-Suspension-Adjustable-Management/dp/B07DHFDDS1"
  ]
}
```
- Now close the settings panel by clicking on “Back to Commands” on the top right.
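By the way, if your URL list lives in a spreadsheet or a script, you can generate the starting-value file programmatically. Below is a minimal Python sketch that writes both the CSV (a single “urls” header with one URL per row) and the JSON shown above; the file names urls.csv and urls.json are just placeholders.

```python
import csv
import json

# Replace these with your own product URLs.
urls = [
    "https://www.amazon.ca/Fire-TV-Stick-with-All-New-Alexa-Voice-Remote/dp/B0791Z1G6W",
    "https://www.amazon.ca/Nintendo-Switch-with-Gray-Joy%E2%80%91Con/dp/B07VJRZ62R",
]

# CSV format: a "urls" header followed by one URL per row.
with open("urls.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["urls"])
    for url in urls:
        writer.writerow([url])

# Equivalent JSON starting value, matching the snippet above.
with open("urls.json", "w") as f:
    json.dump({"urls": urls}, f, indent=2)
```

Either file can then be imported through the “Import from CSV/JSON” button or pasted into the text box.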
How to Scrape Multiple URLs
Now that we have submitted our list of URLs, it is time to start setting up ParseHub to navigate through all the URLs and extract data.
- Back on the Commands screen, click on the PLUS (+) sign next to your “select page” command.
- Click on “Advanced” and select the “Loop” command.
- By default, the Loop command should be looping through every item in “urls”. If it isn’t, use the dropdown to select “urls”.
- Click on the PLUS (+) sign next to your new Loop command and choose the “Begin New Entry” command under “Advanced”.
- Make sure that this command has a different name than “urls”. We will rename ours to “amzn_urls”.
- This step is optional, but we want our final scrape to keep the URL for each product we are scraping data from. To do this, click on the PLUS (+) sign next to your “Begin New Entry” command and choose the “Extract” command under “Advanced”.
- Rename this command to “link” and replace “$location.href” with “item” in the text box below.
- Now we will tell ParseHub to start loading the URLs we’ve selected. Start by clicking on the PLUS (+) sign next to your “Begin new entry” command and choose the “Go To Template” command under “Advanced”.
- A pop-up will appear with a few available selections. In the “Go to URL” field, we will write in “link” without quotations. In the “Create New Template” field, we will name our new template “product_page”. Once done, click on the green “Create New Template” button.
- The page for the first URL on the list will now render inside the app. Make sure to select the new “product_page” template tab on the left sidebar.
- A “Select” command will be created by default. We will use this command to extract data from this page. Start by clicking on the product name on the page to extract it. The product name on the page will be highlighted in green to indicate that it has been selected. On the left sidebar, rename your selection to “product_name”.
- Now, click on the PLUS (+) sign next to your “select page” command to add a new “Select” command and extract more data, as we did in the previous step. We will click on the product price to extract it and rename this new command accordingly. Repeat these steps if you want to add more Select commands and extract more data. Our template ended up looking like this:
Want to extract even more data from Amazon? Check out our in-depth guide on how to scrape data from Amazon.
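To make the loop we just built a bit more concrete: ParseHub iterates over every item in “urls”, opens each page with the product_page template, and records the link along with whatever you selected. The Python sketch below mirrors that flow using the requests and BeautifulSoup libraries purely as an illustration; it is not ParseHub’s code, the CSS selectors (#productTitle, .a-price .a-offscreen) are guesses about Amazon’s markup, and Amazon may block plain HTTP requests anyway.

```python
import requests
from bs4 import BeautifulSoup

# Illustration of the same loop ParseHub performs; not ParseHub's own code.
# The CSS selectors are assumptions about Amazon's markup and may not match.
urls = [
    "https://www.amazon.ca/Fire-TV-Stick-with-All-New-Alexa-Voice-Remote/dp/B0791Z1G6W",
    "https://www.amazon.ca/Nintendo-Switch-with-Gray-Joy%E2%80%91Con/dp/B07VJRZ62R",
]

results = []
for url in urls:  # "Loop" over every item in "urls"
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    soup = BeautifulSoup(response.text, "html.parser")

    entry = {"link": url}  # "Begin New Entry" keeps the source URL
    title = soup.select_one("#productTitle")          # assumed selector
    price = soup.select_one(".a-price .a-offscreen")  # assumed selector
    entry["product_name"] = title.get_text(strip=True) if title else None
    entry["price"] = price.get_text(strip=True) if price else None
    results.append(entry)

print(results)
```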
Running Your Scrape
It’s now time to run our scrape job and extract all the data we have selected.
Start by clicking on the green Get Data button on the left sidebar. Here you will be able to Test, Run or Schedule your project. In this case, we will run it right away.
ParseHub will now go and scrape the data you’ve selected. You will be notified when it’s done.
Once the scrape is completed, you will be able to download your data as a CSV or JSON file.
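If you export as JSON, a quick way to sanity-check the run is to load the file and count the entries. The sketch below assumes the export is named run_results.json and that the rows sit under an “amzn_urls” key (the name we gave our “Begin New Entry” command); adjust both to match your own export.

```python
import json

# Assumed file name and top-level key; adjust to match your own export.
with open("run_results.json", encoding="utf-8") as f:
    data = json.load(f)

entries = data.get("amzn_urls", [])
print(f"Scraped {len(entries)} products")
for entry in entries:
    print(entry.get("product_name"), "-", entry.get("price"))
```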
Closing Thoughts
Thank you for reading our updated 2023 guide on scraping data from multiple URLs. You should now be able to scrape data from a list of URLs. If you still need help, feel free to contact our live chat support.
Looking to scrape data from multiple pages via navigation? Check out our guide on how to scrape data from multiple pages via navigation.
You can also read our guide on how to scrape data behind a login screen.
Happy scraping!