There's no sleep()/wait for mutex in node.js, so how to deal with large IO tasks?

Problem

I have a large array of filenames I need to check, but I also need to respond to network clients. The easiest way is to perform:


    for(var i=0;i < array.length;i++) {
        fs.readFile(array[i], function(err, data) {...}); 
    }

However, the array can be of any length, say 100000 entries, so starting 100000 reads at once does not seem like a good idea. On the other hand, doing fs.readFileSync() in a loop can block for too long. Launching the next fs.readFile() from the callback, like this:


    var Idx = 0;
    function checkFile() {
       fs.readFile(array[Idx], function (err, data) {
          Idx++;
          if (Idx < array.length) {
             checkFile();
          } else {
             Idx = 0;
             setTimeout(checkFile, 10000); // start the next pass in ten seconds
          }
       });
    }

is also not the best option, because array[] is constantly updated by network clients: some items are deleted, new ones are added, and so on.

What is the best way to accomplish such a task in node.js?

Problem courtesy of: d0rc

Solution

You should stick to your first solution (fs.readFile). For file I/O, node.js uses a thread pool. The reason is that most Unix kernels don't provide efficient asynchronous APIs for the file system. Even if you start 10,000 reads concurrently, only a few reads will actually run and the rest will wait in a queue.
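
As a rough illustration of the first approach, here is a minimal sketch that starts one fs.readFile() per file and reports back once every callback has fired. The names checkFiles and onDone are hypothetical, and copying the array with slice() is an assumption added here so that changes made by network clients during a pass do not affect it.

```javascript
const fs = require('fs');

// Hypothetical helper: start one asynchronous read per file and call
// onDone once every callback has fired.
function checkFiles(files, onDone) {
  // Copy the list so that later changes to the original array
  // (items added or removed by clients) do not affect this pass.
  const snapshot = files.slice();
  const results = [];
  let pending = snapshot.length;

  if (pending === 0) {
    // Nothing to read; still report asynchronously.
    return process.nextTick(function () { onDone(results); });
  }

  snapshot.forEach(function (name, i) {
    fs.readFile(name, function (err, data) {
      // Record either the error or the size of the file contents.
      results[i] = err
        ? { name: name, error: err }
        : { name: name, size: data.length };
      pending -= 1;
      if (pending === 0) {
        onDone(results);
      }
    });
  });
}
```

The event loop stays free to serve network clients while the reads are in flight, because each read only queues a request and returns immediately.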

In order to make this answer more interesting, I browsed through node's code again to make sure that things hadn't changed.

Long story short, file I/O uses blocking system calls that are executed by a thread pool with at most 4 concurrent threads.
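
To make the queueing behaviour easier to picture, here is a small model in JavaScript of a request queue drained by a fixed number of workers. It is only an illustration of the idea, not node's or libeio's actual code; createPool, submit and peak are made-up names.

```javascript
// Illustrative model only: a request queue drained by at most `wanted`
// concurrent workers, similar in spirit to the thread pool described above.
function createPool(wanted) {
  const queue = [];
  let running = 0;
  let peak = 0;

  function maybeStart() {
    // Start work only while fewer than `wanted` tasks are running.
    while (running < wanted && queue.length > 0) {
      const task = queue.shift();
      running += 1;
      peak = Math.max(peak, running);
      task(function done() {
        running -= 1;
        maybeStart();
      });
    }
  }

  return {
    // Queue a task; it receives a `done` callback to call when it finishes.
    submit: function (task) {
      queue.push(task);
      maybeStart();
    },
    // Highest number of tasks that were ever running at once.
    peak: function () { return peak; }
  };
}
```

Submitting 10,000 tasks to createPool(4) would leave 9,996 of them waiting in the queue, which is roughly what happens to the extra read requests.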

The important code is in libeio, which is abstracted by libuv. All I/O code is wrapped by macros which queue requests. For example:

    eio_req *eio_read (int fd, void *buf, size_t length, off_t offset, int pri, eio_cb cb, void *data, eio_channel *channel)
    {
      REQ (EIO_READ); req->int1 = fd; req->offs = offset; req->size = length; req->ptr2 = buf; SEND;
    }

REQ prepares the request and SEND queues it. We eventually end up in etp_maybe_start_thread:

    static unsigned int started, idle, wanted = 4;

    (...)

    static void
    etp_maybe_start_thread (void)
    {
      if (ecb_expect_true (etp_nthreads () >= wanted))
        return;
    (...)

The queue keeps 4 threads running to process the requests. When our read request is finally executed, eio simply uses the blocking read from unistd.h:

    case EIO_READ:      ALLOC (req->size);
                        req->result = req->offs >= 0
                                    ? pread     (req->int1, req->ptr2, req->size, req->offs)
                                    : read      (req->int1, req->ptr2, req->size); break;

Solution courtesy of: Laurent Perrin

This recipe can be found in its original form on Stack Overflow.