If you like shopping for clothes or household goods, you have almost certainly heard of Macy’s! Macy’s is one of the largest department store chains in the United States by retail sales. The company was founded in 1858 and has since acquired other chains, such as Bloomingdale’s. As of 2017, it had over 130,000 employees and 725 stores, and earned more than 25 billion dollars in annual revenue. Its headquarters and flagship store are in New York City, where it hosts Fourth of July fireworks and the Thanksgiving Day Parade every year. In this blog post, we will show you how to scrape the Macy’s website and the thousands of products listed online.
To follow along with this guide, register and download ParseHub for free.
Let’s scrape Macy’s products!
Scraping Product Names
- Begin by opening ParseHub and signing in.
- Create a new project by clicking the blue “New Project” button.
- Enter the Macy’s URL you wish to scrape. We will use this URL to scrape shoes on clearance: https://www.macys.com/shop/shoes/sale-clearance?id=13604&edge=hybrid
- Click the first product’s name to extract it. The rest should turn yellow.
- Click the next product name and all products should now be extracted!
- Rename this selection on the left to “shoe”.
Scraping Prices
When scraping additional data from each product, you will need to use ParseHub’s Relative Select tool.
- Begin by clicking the PLUS(+) icon next to the “shoe” extraction.
- Choose the “Relative Select” command.
- Click the first shoe’s name and an arrow will appear.
- Point the arrow to the respective shoe’s price, and click to select it.
- Click the next shoe’s name, and then its price to train the algorithm.
- All prices should now be extracted. Rename this selection to “price” on the left.
Using RegEx
To clean up your price extraction so that it contains only the numeric value, you will need to use regular expressions (regex) in ParseHub.
- Begin by clicking the price extraction from the last step.
- Expand the selection by clicking on the expand icon.
- Tick the “Use regex” box and enter: (\d+\.\d+)
- Now, you will have clean prices without the “Now CAD” text!
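To see what this kind of pattern does outside of ParseHub, here is a minimal Python sketch. It assumes Python's `re` module treats the pattern the same way ParseHub's regex option does, and the sample strings are hypothetical, modeled on the “Now CAD” prefix mentioned above. The dot is escaped so that it matches only a literal decimal point.

```python
import re

# Escaping the dot (\.) makes it match only a literal decimal point;
# an unescaped "." would match any character.
PRICE_PATTERN = re.compile(r"(\d+\.\d+)")


def clean_price(raw_text):
    """Return the first decimal number found in raw_text, or None."""
    match = PRICE_PATTERN.search(raw_text)
    return match.group(1) if match else None


# Hypothetical sample strings, not real Macy's data.
samples = ["Now CAD 45.99", "Now CAD 120.00", "Sold out"]
cleaned = [clean_price(text) for text in samples]
print(cleaned)  # ['45.99', '120.00', None]
```

Note that a price with a thousands separator, such as 1,299.00, would only be captured as 299.00 by this pattern, so a broader expression may be needed for expensive items.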
Scraping Multiple Pages
To scrape more than the 60 products on the first page, we need to move through the following pages using ParseHub’s pagination.
- Begin by scrolling down the webpage until you see the next page chevron.
- Click the PLUS(+) button next to the “page” selection, and choose Select.
- Click the next page chevron to select it.
- Rename this selection on the left to “pagination” and expand it to remove the data extraction.
- Click the PLUS(+) button next to this selection, and choose Click.
- A popup will appear asking if this is a next page button, choose Yes.
- You can now choose the number of additional pages to scrape. We will choose 2, which means 3 pages scraped in total!
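The page arithmetic can be sketched in a few lines of Python. This is only an illustration: it assumes every results page holds the same 60 products as the first one, which may not be true for the last page, and the function names are hypothetical.

```python
PRODUCTS_PER_PAGE = 60  # size of the first results page, as noted above


def total_pages(additional_pages):
    """The first page plus the additional pages chosen in the popup."""
    return 1 + additional_pages


def max_products(additional_pages, per_page=PRODUCTS_PER_PAGE):
    """Upper bound on products scraped, assuming every page is full."""
    return total_pages(additional_pages) * per_page


print(total_pages(2))   # 3
print(max_products(2))  # 180
```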
Bypassing Blocks (Paid Feature)
Many websites, especially eCommerce stores, have methods in place to block web scrapers and web crawlers. To bypass scraping blocks on Macy’s, you need to enable IP Rotation. Simply click the settings cog at the top left, and under Settings find the IP Rotation checkbox. Click to enable it, and you should now be able to scrape without blocks!
Note: You may need to use your own custom proxies. We recommend IPRoyal’s Residential Proxies, and they have a tutorial on integrating proxies with ParseHub.
Starting Your Scrape
To begin scraping, simply click the green “Get Data” button on the left-hand side of ParseHub. You will be able to test, run or schedule your scrape. Scheduling can be useful for having up-to-date products and prices at your disposal. In our case, we will choose Run, to have the scrape run a single time. This will get us 3 pages of results, as we specified in the pagination step!
If you followed our guide correctly, your exported data should contain one entry per shoe, each with its name and cleaned price.
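Once the results are downloaded, they can be processed with a short script. The sketch below is a hypothetical example, not actual output from this project: it assumes a JSON export in which the “shoe” selection is a list of objects with `name` and `price` keys, and the product names and prices are invented placeholders.

```python
import json

# Hypothetical export, assuming the "shoe" selection holds a list of
# objects with "name" and "price" keys. Not real Macy's data.
sample_export = """
{"shoe": [
  {"name": "Example Sneaker", "price": "45.99"},
  {"name": "Example Boot", "price": "120.00"}
]}
"""


def load_shoes(json_text):
    """Parse the export and convert each price string to a float."""
    data = json.loads(json_text)
    rows = []
    for item in data.get("shoe", []):
        price = item.get("price")
        rows.append({
            "name": item.get("name"),
            "price": float(price) if price else None,
        })
    return rows


shoes = load_shoes(sample_export)

# Find the cheapest shoe among the entries that have a price.
priced = [shoe for shoe in shoes if shoe["price"] is not None]
cheapest = min(priced, key=lambda shoe: shoe["price"])
print(cheapest["name"], cheapest["price"])  # Example Sneaker 45.99
```

If your export uses different key names, adjust the `"shoe"`, `"name"` and `"price"` lookups to match.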
That concludes our Macy’s scraping tutorial. We hope you enjoyed it!
If you run into any issues web scraping eCommerce stores or other websites, feel free to reach out to our support.
Happy Scraping! 🛍️