Wide Finder with LINQ

02Oct07

I’ve been dabbling with the Wide Finder problem and have found a solution in C# 3.0 that is startlingly elegant.

To recap, the Wide Finder challenge is to write a program that:

  1. Scans logfiles for hits on blog articles
  2. which are counted and
  3. sorted with the
  4. top 10 most popular being printed to stdout. It should also
  5. be about as elegant and concise as Tim’s Ruby version and
  6. its performance should scale with additional CPU cores.

Here’s my solution in C# 3.0/.NET 3.5.

using (new QuickTimer("Total time"))
{
    IEnumerable<string> data = new LineReader(args[0]);

    Regex regex = new Regex(@"GET /ongoing/When/\d\d\dx/\d\d\d\d/\d\d/\d\d/([^ ]+) ",
        RegexOptions.Compiled | RegexOptions.CultureInvariant);

    var result = from line in data
                 let match = regex.Match(line)
                 where match.Success
                 group match by match.Groups[1].Value into grp
                 orderby grp.Count() descending
                 select new { Article = grp.Key, Count = grp.Count() };

    foreach (var v in result.Take(10))
        Console.WriteLine("{0}: {1}", v.Article, v.Count);
}

In case you’re unfamiliar with C# 3.0, the big chunk in the middle is a LINQ query. Think of it as a Python list comprehension, except 1) more expressive, 2) more declarative, and 3) more extensible.

On my desktop machine (Core 2 Duo 1.83GHz, 2GB, Windows Vista 32-bit), it processes 1,000,000 lines in 4.5 seconds once the OS disk cache has gotten warmed up. Under the same conditions, Tim’s original Ruby script takes about 6.5 seconds (±0.5 sec due to me using an actual stopwatch!). Now, It’s easy to write a faster serial implementation than this in C#–I wrote a (still fairly naive) version that took about 2.5 seconds, in 85 lines of code. But this implementation seemed more in keeping with the spirit of the Wide Finder challenge, considering requirement #5.

PLINQ

Now, as currently written, this is a serial program: no matter how many cores you throw at it, it will only use one. But the whole point of the Wide Finder exercise, as Tim reminds, is to find a minimally invasive syntax change to take advantage of multiple CPUs.

As it turns out, some smart folks at Microsoft have been working on just that: enter Parallel LINQ, part of the Parallel FX library (which also includes the Task Parallel Library I blogged about earlier).

You can read all about PLINQ in this MSDN Magazine article. In a nutshell, it’s an implementation of LINQ that can transparently use multiple threads to execute the query and merge the results back together. To use PLINQ, you basically add a single method call to your LINQ query (in the code above, simply change “from line in data” to “from line in data.AsParallel()”). That’s all it takes for PLINQ to take over!

Looks very promising–once they release some bits I’ll give it a try and post an update.

About these ads


4 Responses to “Wide Finder with LINQ”

  1. Thanks for your comments on my first test post using WLW. I’ve done another test post that seemed to work correctly, with the image upload successful.

    Congrats on what seems to be a powerful blog editor. I’ll certainly put it to the test.

    To completely ‘kill off’ my desire for Performancing/ScribeFire, you need to develop an add-on for Firefox, for in-browser, in-window blog publishing. I would never have looked for your fine program if ScribeFire hadn’t bombed out when I installed Feedburner. I’ll miss the fast action ScribeFire brought to my blogging…

  2. 2 farhad

    i’m currently learning about linq and this is very helpful..


  1. 1 Scott Hanselman's Computer Zen - The Weekly Source Code 9 - WideFinder Edition
  2. 2 每周源代码9 – WideFinder版本 | Hello VisualStudio

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: