Government websites have valuable information you can use for your research. Depending on the website, they can have valuable statistics and news articles related to your topic.
Since these are articles/ stats are on a government website, you can trust that the information is trustworthy.
Getting Started
With a web scraper like ParseHub, we will be able to scrape the latest press releases in a specific industry. We will extract the headline, description publisher and date.
Make sure to download and install ParseHub for free before you get started.
Now let’s begin!
Web scraping a government website
For this project, we are going to scrape the UK government website for stats that are related to COVID-19
If you would like to follow along with this example, you can use this link here.
How to scrape a government website
- Download and install PareseHub. Click on the new project and button and submit the URL into the text box. The website will now render inside the ParseHub.
2. A select command will automatically be created. If not simply click on the PLUS (+) sign and choose the select command. While using the select command, click on the first headline that is on the page. You should notice the headline you selected will be in green. ParseHub will now suggest which other elements you want to extract in yellow.
3. Click on the next headline that is in yellow to select them all. You may need to do this 2-3 times to teach ParseHub what to extract. The rest of the headlines will now be highlighted in green.
4. On the left sidebar rename your headline selection to something more appropriate, we’re going to name it “headline”
5. Click on the PLUS (+) sign next to your headline and choose the relative select command.
6. Click on the first headline that is highlighted in orange, then click on the description below it. An arrow will appear showing the association you have created. You may need to repeat this step to fully train the Web scraper. Rename your selection to “description”.
7. Repeat steps 5-6 to extract data like publisher and date posted.
Adding pagination
If we were to start our project, we would only give extract 20 headlines. We will now teach you how to add pagination to your web scraping project.
- Click the PLUS(+) sign next to your page selection and choose the “Select” command.
2. Using the Select command, scroll all the way down to the Next Page link. Click on it to select it and rename your selection to next_button.
3. Click on the icon next to your next_button selection to expand it.
4. Delete the two commands under the next selection.
5. Click on the PLUS(+) sign next to your next selection and add a Click command.
6. A pop-up will appear asking you if this a “next page” link. Click on Yes and enter the number of times you’d like to repeat this process. In this case, we will repeat it 3 times.
Running your Scrape
It is now time to run your scrape. To do this, click on the green Get Data button on the left sidebar. Here you will be able to test, schedule, or run your scrape job.
For larger projects, we recommend that you always test your job before running it. In this case, we will run it right away.
Once your run is completed, you will be able to download it as an Excel or JSON file.
Closing Thoughts
There you go, now you know how to scrape a government website without any coding skills! Government websites have a lot of data that can be used for your own research pieces.
If you run into any issues during this project, reach out to us via the live chat on our site and we will be happy to assist you with your project.
Happy Scraping!