Callback on cheerio in Node.js


I'm trying to write a scraper using 'request' and 'cheerio'. I have an array of 100 URLs. I'm looping over the array, calling 'request' on each URL and then doing cheerio.load(body). If I let the loop run for more than about 3 iterations (for testing I limit it with i < 3), the scraper breaks because the value I build productNumber from is undefined, and I can't call split on undefined. I think the for loop is moving on before the webpage responds and has time to load the body with cheerio, and this question: nodeJS - Using a callback function with Cheerio would seem to agree.

My problem is that I don't understand how I can make sure the webpage has 'loaded' or been parsed in each iteration of the loop so that I don't get any undefined variables. According to the other answer I don't need a callback, but then how do I do it?

for (var i = 0; i < productLinks.length; i++) {
    productUrl = productLinks[i];
    request(productUrl, function(err, resp, body) {
        if (err)
            throw err;
        $ = cheerio.load(body);
        var imageUrl = $("#bigImage").attr('src'),
            productNumber = $("#product").attr('class').split(/\s+/)[3].split("_")[1];
        // (rest of the callback was not included in the original post)
    });
}


Example of output:


TypeError: Cannot call method 'split' of undefined
Problem courtesy of: brownie3003


You are scraping some external site(s). You can't be sure the HTML all fits exactly the same structure, so you need to be defensive on how you traverse it.

var product = $('#product');
// A cheerio selection is always truthy, even when empty, so check its length
if (!product.length) return console.log('Cannot find a product element');
var productClass = product.attr('class');
if (!productClass) return console.log('Product element does not have a class defined');
var productNumber = productClass.split(/\s+/)[3].split("_")[1];

This'll help you debug where things are going wrong, and perhaps indicate that you can't scrape your dataset as easily as you'd hoped.
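
The checks above still assume that the class attribute holds at least four space-separated names and that the fourth one contains an underscore. As a rough sketch, the same defensive idea can be pushed into a small helper. The name extractProductNumber and the sample class strings are invented for illustration; only the index 3 and the "_" separator come from the question's code.

```javascript
// Hypothetical helper: returns the product number encoded in a class
// attribute, or null if the attribute does not have the expected shape.
function extractProductNumber(classAttr) {
    if (typeof classAttr !== 'string') return null;   // attr() returned undefined
    var classes = classAttr.trim().split(/\s+/);
    var target = classes[3];                          // fourth class name, as in the question
    if (!target) return null;
    var parts = target.split('_');
    return parts.length > 1 ? parts[1] : null;        // second underscore-separated piece
}

// Sample class strings are made up for illustration.
console.log(extractProductNumber('a b c product_12345'));  // 12345
console.log(extractProductNumber('a b'));                  // null
console.log(extractProductNumber(undefined));              // null
```

Inside the request callback this could be called as extractProductNumber($('#product').attr('class')), logging the URL whenever it returns null so that the odd pages are easy to find.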

Solution courtesy of: David Ellis


Since you're assigning to $ without var, you're not creating a new $ for each callback: it is a single implicit global that gets overwritten every time a request completes. This can lead to unexpected behaviour if one callback is still using $ in some later asynchronous step after another callback has already replaced it.

So try creating a new variable:

var $ = cheerio.load(body);
^^^ this is the important part

Also, you are correct in assuming that the loop continues before the request is completed (in your situation, it isn't cheerio.load that is asynchronous, but request is). That's how asynchronous I/O works.
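
To see that ordering in isolation, here is a minimal sketch that needs no network access. fakeRequest is a hypothetical stand-in for request (it just calls back on a later tick of the event loop), and the URLs are placeholders.

```javascript
// Hypothetical stand-in for request(): invokes the callback on a later tick.
function fakeRequest(url, callback) {
    setImmediate(function () {
        callback(null, {}, 'body of ' + url);
    });
}

var events = [];
var urls = ['a', 'b', 'c'];   // placeholder URLs

for (var i = 0; i < urls.length; i++) {
    events.push('start ' + urls[i]);
    fakeRequest(urls[i], function (err, resp, body) {
        // Runs only after the for loop has already finished.
        events.push('done: ' + body);
    });
}
events.push('loop finished');

// Queued after the three callbacks above, so it prints the full sequence:
// start a, start b, start c, loop finished, done: body of a, ...
setImmediate(function () {
    console.log(events.join('\n'));
});
```

All three requests are started before any callback runs, which is why variables shared between iterations are risky.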

To coordinate asynchronous operations you can use, for instance, the async module; in this case, async.eachSeries might be useful.
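
As a rough illustration of what running the requests in series means, the sketch below uses a hand-written eachSeries with the same calling shape as async.eachSeries(collection, iteratee, callback). It is a simplified stand-in, not the async module's implementation, and fakeRequest, scrapeAll and the URLs are all hypothetical.

```javascript
// Hand-written stand-in with the same calling shape as async.eachSeries:
// runs iteratee(item, next) for one item at a time, then calls done(err).
function eachSeries(items, iteratee, done) {
    var index = 0;
    function next(err) {
        if (err || index >= items.length) return done(err || null);
        iteratee(items[index++], next);
    }
    next();
}

// Hypothetical stand-in for request(): the delay varies with the URL,
// so responses would arrive out of order if started all at once.
function fakeRequest(url, callback) {
    var delay = url.length % 2 === 0 ? 20 : 5;
    setTimeout(function () { callback(null, {}, 'body of ' + url); }, delay);
}

function scrapeAll(urls, done) {
    var results = [];
    eachSeries(urls, function (url, next) {
        fakeRequest(url, function (err, resp, body) {
            if (err) return next(err);   // stop the series on the first error
            results.push(body);          // this is where cheerio.load(body) would go
            next();                      // only now start the next URL
        });
    }, function (err) {
        done(err, results);
    });
}

scrapeAll(['/p/1', '/p/22', '/p/3'], function (err, results) {
    if (err) throw err;
    console.log(results);
});
```

With the real modules, eachSeries would come from require('async') and fakeRequest would be replaced by request. Processing one URL at a time is slower than firing all requests at once, but results arrive in input order and the remote site is not hit with 100 simultaneous requests.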

Discussion courtesy of: robertklep

This recipe can be found in its original form on Stack Overflow.