Wednesday, December 8, 2010

Verify a List of URLs in C# Asynchronously

Recently I wanted to test a bunch of URLs to see whether they were valid or broken.  In my scenario, I was checking URLs for advertisements served by Lake Quincy Media’s ad server (LQM is the largest Microsoft developer focused advertising network/agency).  However, this is an extremely common task that any web developer, or even just a website administrator, might want to do, and it should be very easy.  It also gave me an opportunity to use the new parallel features in .NET 4 in production, since prior to this I’d only played with samples.

Check if a URL is OK

First, you’ll need a method that will tell you whether a given URL is OK.  What OK means might vary based on your needs; in my case I was just looking for the status code.  I found the following code here.

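    // Requires: using System.Net;
    private static bool RemoteFileExists(string url)
    {
        try
        {
            var request = (HttpWebRequest)WebRequest.Create(url);
            request.Method = "HEAD"; // ask for headers only, not the body
            // Dispose the response so the connection goes back to the pool.
            using (var response = (HttpWebResponse)request.GetResponse())
            {
                return response.StatusCode == HttpStatusCode.OK;
            }
        }
        catch
        {
            // Bad URL, network failure, or an error status (GetResponse throws on 4xx/5xx).
            return false;
        }
    }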
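Since what counts as OK varies, one option is to separate fetching the status code from deciding whether it is acceptable.  What follows is only a rough sketch under a few assumptions: UrlChecker, GetStatusCode and IsSuccess are hypothetical names that do not appear in the code above, and treating any 2xx status as success is an assumption you may want to tighten.

```csharp
using System;
using System.Net;

public static class UrlChecker
{
    // Hypothetical helper: returns the HTTP status code,
    // or null if no HTTP response was received at all.
    public static HttpStatusCode? GetStatusCode(string url)
    {
        try
        {
            var request = (HttpWebRequest)WebRequest.Create(url);
            request.Method = "HEAD";
            using (var response = (HttpWebResponse)request.GetResponse())
            {
                return response.StatusCode;
            }
        }
        catch (WebException ex)
        {
            // For 4xx/5xx statuses GetResponse throws, but the response
            // (if any) is still attached to the exception.
            using (var errorResponse = ex.Response as HttpWebResponse)
            {
                return errorResponse != null
                    ? errorResponse.StatusCode
                    : (HttpStatusCode?)null;
            }
        }
        catch (Exception)
        {
            // Malformed or non-HTTP URL and similar failures.
            return null;
        }
    }

    // Assumption: any 2xx status counts as "OK".
    public static bool IsSuccess(HttpStatusCode? status)
    {
        return status.HasValue
            && (int)status.Value >= 200
            && (int)status.Value <= 299;
    }
}
```

With this split, RemoteFileExists(url) could be written as UrlChecker.IsSuccess(UrlChecker.GetStatusCode(url)), and the status code stays available if you want to log why a link failed.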


Using this Synchronously

If you want to use this synchronously, it’s pretty simple.  Get a list of URLs and write a loop something like this:

    foreach (var link in myLinksToCheck)
    {
        link.IsValid = RemoteFileExists(link.Url);
    }


I checked about 1500 URLs with my script, which I wrote with a flag that lets me run it either synchronously or in parallel.  The synchronous version took about an hour and forty minutes to complete.  The parallel one took about seventeen minutes.

Make it Parallel

If you want to see how to do things using the parallel libraries that are now part of .NET 4, there’s no better place to start than the Samples for Parallel Programming with the .NET Framework 4.  There’s some very cool stuff here.  Be sure to check out the Conway’s Game of Life WPF sample.

For me, there were two steps I had to take to turn my synchronous process into a parallelizable process.

1. Create an Action<T> delegate that performs the URL check operation and stores the result in my collection.  I created a method UpdateLinkStatus(Link linkToCheck) to do this work and wrapped it in an Action<int>.

2. Call this method using the new Parallel.For() helper found in System.Threading.Tasks.
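To make the first step concrete, here is a minimal sketch of what the link type and the update method might look like.  The Link class here is a hypothetical stand-in with only the Url and IsValid properties used in the synchronous loop earlier, and LinkChecker is just a container name chosen for this example; the real domain class will have more to it.

```csharp
using System.Net;

// Hypothetical stand-in for the domain class; only the two
// properties used in this post are included.
public class Link
{
    public string Url { get; set; }
    public bool IsValid { get; set; }
}

public static class LinkChecker
{
    // Checks one link and stores the result on the link itself.
    // Each call writes only to its own Link, so parallel calls on
    // different links do not need a lock.
    public static void UpdateLinkStatus(Link linkToCheck)
    {
        linkToCheck.IsValid = RemoteFileExists(linkToCheck.Url);
    }

    // Same check as earlier in the post, repeated so this compiles alone.
    private static bool RemoteFileExists(string url)
    {
        try
        {
            var request = (HttpWebRequest)WebRequest.Create(url);
            request.Method = "HEAD";
            using (var response = (HttpWebResponse)request.GetResponse())
            {
                return response.StatusCode == HttpStatusCode.OK;
            }
        }
        catch
        {
            return false;
        }
    }
}
```

Keeping the result on the Link object means the parallel loop has no shared state to coordinate, which is what makes this kind of work easy to parallelize.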

Here’s the code, slightly modified from my own domain-specific code:

    var linkList = GetLinks();
    Console.WriteLine("Loaded {0} links.", linkList.Count);

    // Check one link by index and print a progress dot.
    Action<int> updateLink = i =>
        {
            UpdateLinkStatus(linkList[i]);
            Console.Write(".");
        };
    Parallel.For(0, linkList.Count, updateLink);

    // replaces this synchronous version:
    // for (int i = 0; i < linkList.Count; i++)
    // {
    //     updateLink(i);
    // }
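If you would rather not index into the list, Parallel.ForEach (also in System.Threading.Tasks) is an alternative.  The sketch below is not the code from the script above: ForEachDemo and CheckAll are hypothetical names, and capping concurrency with ParallelOptions.MaxDegreeOfParallelism is an assumption about what you might want, for example to avoid hammering a single server.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ForEachDemo
{
    // Applies checkLink to every item, running at most
    // maxParallel checks at the same time.
    public static void CheckAll<T>(
        IEnumerable<T> links, Action<T> checkLink, int maxParallel)
    {
        var options = new ParallelOptions
        {
            MaxDegreeOfParallelism = maxParallel
        };
        Parallel.ForEach(links, options, checkLink);
    }
}
```

A call might look like ForEachDemo.CheckAll(linkList, UpdateLinkStatus, 8), where 8 is an arbitrary cap chosen for illustration.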


In my scenario, using the parallel instead of the iterative approach dropped the time from about 100 minutes down to about 17.  That’s on a machine that appears to Windows to have 8 cores.  100/8 = 12.5, so it’s not quite a straight eightfold speedup (100/17 is closer to sixfold), but it’s close.  If you’ve got applications that are doing a lot of the same kind of work and each operation has little or no dependency on the other operations, consider using Action<T> and Parallel.For() to take advantage of the many cores available on most modern computers to speed it up.
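If you want to measure the difference yourself, a flag plus a Stopwatch is enough.  The following is a sketch under stated assumptions rather than the original script: TimingDemo, Run and SimulatedCheck are hypothetical names, and the URL check is replaced by a 50 ms sleep so the example runs without a network.  The numbers it prints will not match the timings above.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

public static class TimingDemo
{
    // Stand-in for a URL check: sleeps to simulate network latency.
    private static void SimulatedCheck(int i)
    {
        Thread.Sleep(50);
    }

    // Runs the checks in parallel or sequentially, depending on the
    // flag, and returns the elapsed wall-clock time.
    public static TimeSpan Run(int count, bool parallel)
    {
        var stopwatch = Stopwatch.StartNew();
        if (parallel)
        {
            Parallel.For(0, count, i => SimulatedCheck(i));
        }
        else
        {
            for (int i = 0; i < count; i++)
            {
                SimulatedCheck(i);
            }
        }
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }

    public static void Main()
    {
        Console.WriteLine("Sequential: {0}", Run(40, false));
        Console.WriteLine("Parallel:   {0}", Run(40, true));
    }
}
```

Swapping SimulatedCheck for a real check against your own list gives a like-for-like comparison on your machine.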



