Posted on 5th Aug 2017 00:00:00 in Teachers
If we take our scraper script so far, we can perform a basic search on IMDb and scrape the single page of results that is returned for the movies’ URLs.
But what if we want to scrape all of the results pages? What if we then want to scrape all of the results for their specific attributes, such as movie name, release date, description, director and so on…?
Well, that’s what we’ll be covering today: using PHP and cURL to navigate the results pages, scrape multiple pages of the website for data, and organise that data into a logical structure for further use.
So, our first task is to get the URLs from all of the results pages. This involves evaluating whether there is another page of results and, if there is, visiting it, scraping the results URLs and adding them to our array.
If we take our script from last time and include our scrape_between() and curl() functions, we need to make the following changes to the script. Don’t worry, I’ll talk them through after.
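In case you don’t have the two helpers to hand from the earlier posts, they might look something like the following. This is a minimal sketch: the cURL option values (user agent string, timeout) are my own illustrative choices, not fixed requirements.

```php
<?php
// Fetch a URL and return the response body as a string
function curl($url) {
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,  // Return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // Follow any redirects
        CURLOPT_USERAGENT => "Mozilla/5.0 (compatible; MyScraper/1.0)",
        CURLOPT_TIMEOUT => 30,
    ));
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

// Return the substring of $data that sits between $start and $end
function scrape_between($data, $start, $end) {
    $data = stristr($data, $start);                // Cut everything before $start
    $data = substr($data, strlen($start));         // Drop $start itself
    return substr($data, 0, stripos($data, $end)); // Keep up to, but not including, $end
}
```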
First up, we retrieve the initial results page. Then we scrape all of the results and add them to the array $results_urls. Next we check whether there is a “Next” link to another page of results; if there is, we follow it and the loop repeats, scraping the results from that page. The loop keeps visiting the next page and scraping its results until there are no more pages.
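The loop just described can be sketched as follows. To keep the sketch self-contained and testable, page fetching is passed in as a callable (in the real script that would be the curl() helper), and the link patterns are illustrative assumptions rather than IMDb’s actual markup, so check them against the live page source and adjust.

```php
<?php
// Collect result URLs from every results page, following "Next" links.
// $fetch is any callable that takes a URL and returns the page's HTML.
function collect_result_urls(callable $fetch, $start_url) {
    $results_urls = array();
    $url = $start_url;
    while ($url !== null) {
        $html = $fetch($url);
        // Grab every result link on the current page (pattern is an assumption)
        preg_match_all('/href="(\/title\/tt\d+\/)"/', $html, $m);
        foreach (array_unique($m[1]) as $path) {
            $results_urls[] = "https://www.imdb.com" . $path;
        }
        // Follow the "Next" link if there is one, otherwise stop looping
        if (preg_match('/<a[^>]+href="([^"]+)"[^>]*>Next<\/a>/', $html, $next)) {
            $url = $next[1];
        } else {
            $url = null;
        }
    }
    return $results_urls;
}
```

In the real script you would call it as `$urls = collect_result_urls('curl', $search_url);`, since PHP accepts a function name as a callable.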
Now we have an array containing all of the results URLs, which we can loop over with foreach() to visit each URL and scrape the results. I’ll leave that to you; with what we’ve covered so far it should be easy to figure out.
I’ll get you started:
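As a nudge in that direction, here is one possible shape for that loop. The fetcher is again passed in as a callable so the function can be exercised without hitting the network; the &lt;title&gt; tag is just the easiest first field to grab, and the remaining attributes need markers you pick out of a movie page’s HTML yourself (scrape_between() works well for that).

```php
<?php
// Visit each result URL and pull out movie attributes.
// $fetch is any callable that takes a URL and returns the page's HTML;
// in the real script this would be the curl() helper.
function scrape_movies(array $results_urls, callable $fetch) {
    $results = array();
    foreach ($results_urls as $url) {
        $html = $fetch($url);
        $movie = array();
        // The <title> tag is an easy first field; add your own markers
        // for release date, description, director and so on
        if (preg_match('/<title>(.*?)<\/title>/is', $html, $m)) {
            $movie['name'] = trim($m[1]);
        }
        $results[$url] = $movie;
        sleep(1); // Be polite: pause between requests
    }
    return $results;
}
```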
In the next post in the series I’ll post up the code you should have ended up with, and then we’ll cover downloading images and other files.
Up next time: Downloading Images And Files With PHP & CURL