Press release distribution websites are a rich source of industry information and news. You can extract this data to help you write timely articles, produce industry analysis and spot investment opportunities.
Constantly browsing these websites to find press releases that you’re interested in can be a tedious process.
We are ParseHub, and today we’ll show you how you can scrape a press release distribution website like Cision to get the latest news in any industry.
Getting Started
With a web scraper like ParseHub, we will be able to scrape the latest press releases in a specific industry. We will extract the headline, description, time posted and press release URL.
Make sure to download and install ParseHub for free before you get started.
Now let’s begin!
Web scraping a press release website like Cision
Cision is a public relations and earned media company that continuously publishes press releases related to many different industries around the world.
For this project, we are going to scrape press releases and stories that are related to the energy industry.
If you would like to follow along with this example, you can use this link here.
How to scrape press release data
1. Download and install ParseHub. Click on the New Project button and submit the URL into the text box. The website will now render inside ParseHub.
2. A Select command will automatically be created. If not, simply click on the PLUS (+) sign and choose the Select command. While using the Select command, click on the first headline on the page. You should notice the headline you selected will be in green. ParseHub will now suggest which other elements you want to extract in yellow.
3. Click on the next headline that is in yellow to select them all. You may need to do this 2-3 times to teach ParseHub what to extract. The rest of the headlines will now be highlighted in green.
4. On the left sidebar, rename your selection to something more appropriate. We’re going to name it “headline”.
5. Click on the PLUS (+) sign next to your headline selection and choose the Relative Select command.
6. Click on the first headline that is highlighted in orange, then click on the description below it. An arrow will appear showing the association you have created. You may need to repeat this step to fully train the web scraper. Rename your selection to “description”.
7. Repeat steps 5-6 to extract data like time posted.
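To make the target concrete, here is a minimal Python sketch of the record each headline should produce once the selections above are in place. The class name `PressRelease`, its field names and the sample values are hypothetical, chosen to mirror the selection names used in this tutorial. They are not part of ParseHub.

```python
from dataclasses import dataclass, asdict


@dataclass
class PressRelease:
    """One scraped row: a headline plus the fields tied to it via Relative Select."""
    headline: str
    url: str
    description: str
    time_posted: str


def is_complete(item: PressRelease) -> bool:
    """Return True when every field was captured (no empty or blank strings)."""
    return all(value.strip() for value in asdict(item).values())


# Placeholder values for illustration only, not real scraped data.
example = PressRelease(
    headline="Example energy headline",
    url="https://example.com/news-releases/example.html",
    description="Example description text.",
    time_posted="Example timestamp",
)
print(is_complete(example))
```

A check like `is_complete` is one way to spot rows where a description or time posted was missed, which usually means the selection needs another round of training.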
Adding pagination
If we were to run our project now, we would only extract the 25 headlines on the first page. We will now show you how to add pagination to your web scraping project.
1. Click the PLUS (+) sign next to your page selection and choose the Select command.
2. Using the Select command, scroll down to the Next Page link. Click on it to select it and rename your selection to next_button.
3. Click on the icon next to your next_button selection to expand it.
4. Delete the two commands under the next_button selection.
5. Click on the PLUS (+) sign next to your next_button selection and add a Click command.
6. A pop-up will appear asking you if this is a “next page” link. Click on Yes and enter the number of times you’d like to repeat this process. In this case, we will repeat it 4 times.
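The logic these steps set up can be sketched in plain Python. This is only an illustration of the loop, not how ParseHub works internally: `scrape_with_pagination`, `fake_fetch_page` and the made-up page data are hypothetical, and the 25-per-page figure comes from the example above.

```python
from typing import Callable, List, Tuple

# A page fetcher returns (headlines_on_page, has_next_page_link).
Page = Tuple[List[str], bool]


def scrape_with_pagination(fetch_page: Callable[[int], Page], max_clicks: int) -> List[str]:
    """Collect headlines from the first page, then follow "next" up to max_clicks times."""
    collected: List[str] = []
    page_number = 1
    clicks = 0
    while True:
        headlines, has_next = fetch_page(page_number)
        collected.extend(headlines)
        # Stop when there is no next link or the click budget is used up.
        if not has_next or clicks >= max_clicks:
            break
        clicks += 1
        page_number += 1
    return collected


def fake_fetch_page(page_number: int) -> Page:
    """Stand-in for a real page load: 25 made-up headlines per page, 10 pages in total."""
    headlines = [f"page {page_number} headline {i}" for i in range(1, 26)]
    return headlines, page_number < 10


results = scrape_with_pagination(fake_fetch_page, max_clicks=4)
print(len(results))  # first page plus 4 more pages, 25 headlines each: 125
```

Repeating the click 4 times covers the first page plus four more, so you should expect roughly 125 headlines if every page holds 25.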
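Your project should now contain the headline selection with its description and time posted selections, followed by the next_button selection with its Click command.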
Running your Scrape
It is now time to run your scrape. To do this, click on the green Get Data button on the left sidebar. Here you will be able to test, schedule, or run your scrape job.
For larger projects, we recommend that you always test your job before running it. In this case, we will run it right away.
Once your run is completed, you will be able to download it as an Excel or JSON file.
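If you download the JSON file, a short Python script can flatten it into rows for further analysis. The sketch below assumes the export has a top-level "headline" list whose items carry "name", "url", "description" and "time" keys. That shape is an assumption based on the selection names used in this tutorial, so check your own file and adjust the keys. The sample data is made up.

```python
import csv
import io
import json

# Assumed export shape: a top-level "headline" list whose items carry the
# headline text ("name"), its link ("url"), and the related selections.
# The real keys depend on how you named your selections.
SAMPLE_EXPORT = """
{
  "headline": [
    {"name": "Example headline A", "url": "https://example.com/a",
     "description": "Example description A", "time": "Example time A"},
    {"name": "Example headline B", "url": "https://example.com/b",
     "description": "Example description B"}
  ]
}
"""

FIELDS = ["name", "url", "description", "time"]


def export_to_rows(raw_json: str, list_key: str = "headline") -> list:
    """Flatten the exported JSON into rows, filling missing fields with ''."""
    data = json.loads(raw_json)
    return [{field: item.get(field, "") for field in FIELDS}
            for item in data.get(list_key, [])]


def rows_to_csv(rows: list) -> str:
    """Render the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = export_to_rows(SAMPLE_EXPORT)
print(rows_to_csv(rows))
```

To use this on a real download, read the file's contents into a string and pass it to `export_to_rows` in place of `SAMPLE_EXPORT`.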
Closing Thoughts
You now know how to scrape a press release website like Cision. The great thing about ParseHub is you can schedule your project to run every hour, day or week, depending on what you need. This way you can always get the latest news.
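When a project runs on a schedule, consecutive runs will overlap, so you may only care about what is new. Here is a minimal sketch of one way to compare two runs, assuming each item is a dictionary with a "url" key as in a JSON export. The function name `new_releases` and the sample items are hypothetical.

```python
from typing import Dict, List


def new_releases(previous: List[Dict[str, str]],
                 latest: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return items from the latest run whose URL was not in the previous run.

    Items without a "url" key are kept, since they cannot be matched.
    """
    seen_urls = {item["url"] for item in previous if "url" in item}
    return [item for item in latest if item.get("url") not in seen_urls]


# Made-up data standing in for two scheduled runs.
yesterday = [
    {"name": "Example headline A", "url": "https://example.com/a"},
    {"name": "Example headline B", "url": "https://example.com/b"},
]
today = [
    {"name": "Example headline C", "url": "https://example.com/c"},
    {"name": "Example headline A", "url": "https://example.com/a"},
]

for item in new_releases(yesterday, today):
    print(item["name"])
```

Matching on the URL rather than the headline text is a design choice: headlines can be edited after publication, while the link to a release is less likely to change.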
If you run into any issues during this project, reach out to us via the live chat on our site and we will be happy to assist you with your project.
Happy Scraping!