Thursday 9 July 2009

Forays into Linq - Querying a text file.

Hi! I'm Phil and this is my first ever programming-themed blog post. I have been a software developer for around 7 years and currently work in .NET 3.5. It is my intention to post about programming matters that I find interesting as I encounter them. Hopefully the result will be as enjoyable to you as it is to me!

To start off I'd like to blog about a program I wrote recently. It was interesting to me as I was able to leverage a couple of existing technologies to write something quite concise.

We had a problem with some "corrupted" text files - some MS Office documents had been converted to text files and apparently contained both the body of the text and the accompanying meta data. What was needed was a text file that contained just the body of the text.

A cursory look over some of the files showed that the meta data was always 72 characters wide and consisted of alphanumeric characters and the symbols '/' and '+'. In that case it is a simple matter to write a Regular Expression that will match a line of meta data:

// regex recognises a row of corrupted data.
Regex corruptedText = new Regex(@"^[a-zA-Z0-9+/]{72}$");
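
To see the pattern in action, here is a minimal sketch. The RegexDemo class, the IsCorrupted helper and the sample lines are all made up for illustration; only the pattern itself comes from the program.

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexDemo
{
    // Same pattern as above: exactly 72 characters drawn from A-Z, a-z, 0-9, '+' and '/'.
    static readonly Regex CorruptedText = new Regex(@"^[a-zA-Z0-9+/]{72}$");

    // Hypothetical helper: true when the whole line looks like meta data.
    public static bool IsCorrupted(string line)
    {
        return CorruptedText.IsMatch(line);
    }

    static void Main()
    {
        // Sample lines invented for this sketch.
        string metaData = new string('A', 70) + "+/";   // 72 characters from the allowed set
        string tooShort = new string('A', 71);          // only 71 characters
        string bodyText = "This is an ordinary sentence from the body of the document.";

        Console.WriteLine(IsCorrupted(metaData)); // True
        Console.WriteLine(IsCorrupted(tooShort)); // False
        Console.WriteLine(IsCorrupted(bodyText)); // False
    }
}
```

One caveat worth keeping in mind: a line of body text that happened to be exactly 72 characters long with no spaces or punctuation would match as well, so the pattern is only as good as the assumption that real text never looks like that.
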
So, we have a method for spotting lines of meta data, and now what we need to do is iterate over every line of the file and copy all the lines that don't match to a new text file. Or do we? It would be nice if we could just ask the file to give us all the lines of text that don't match the regular expression, wouldn't it? It turns out that using Linq we can do exactly that!

Linq can be used to query any collection that implements the IEnumerable<> interface. If we had such a collection that had an item for each line of text we could create a Linq expression that used the regular expression to give us all the lines of text that weren't meta data:

IEnumerable<string> goodlines = textFile.Where(dataLine => !corruptedText.IsMatch(dataLine)).Select(dataLine => dataLine);
Here textFile represents a collection implementing IEnumerable<string>. Enumerating goodlines yields every dataLine that isn't matched by the regular expression. Now that we have a complete method for extracting the data we want from a collection, all that's left is to get the text from the file into said collection. It turns out that this is also almost trivially easy:

/// <summary>
/// File reader that enumerates every line of text in a file.
/// </summary>
class StreamReaderEx : StreamReader, IEnumerable<string>
{
    public StreamReaderEx(string path)
        : base(path)
    { }

    #region IEnumerable<string> Members
    // Lazily yields each remaining line of the file.
    public IEnumerator<string> GetEnumerator()
    {
        string dataLine;
        while ((dataLine = base.ReadLine()) != null)
            yield return dataLine;
    }
    #endregion

    #region IEnumerable Members
    // The non-generic version simply defers to the generic one.
    System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
    #endregion
}
Let's take a minute to look at what we have here. The StreamReaderEx class inherits from the StreamReader class and implements the IEnumerable<string> interface. Inheriting from StreamReader gives us all the functionality we need to read text files from the file system, and implementing IEnumerable<string> lets Linq in on the action too. We want our Linq expression to query each line of data, so our implementation of IEnumerable<string>.GetEnumerator() does a yield return on each line of text from the text file. Using the yield return keywords means that the compiler will generate the enumerator class for us in the background.
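
The yield return idea can be tried out without touching the file system. The sketch below is not part of the program: LineDemo and its Lines helper are hypothetical names, and a StringReader stands in for a real file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class LineDemo
{
    // Hypothetical helper: lazily yields each line from any TextReader.
    public static IEnumerable<string> Lines(TextReader reader)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
            yield return line;
    }

    static void Main()
    {
        // StringReader stands in for a file so the sketch runs without touching disk.
        using (TextReader reader = new StringReader("first\nsecond\nthird"))
        {
            // Nothing is read yet: the query only runs when it is enumerated.
            IEnumerable<string> fiveLetterLines = Lines(reader).Where(l => l.Length == 5);

            foreach (string l in fiveLetterLines)
                Console.WriteLine(l); // prints "first" then "third"
        }
    }
}
```

Because the lines are pulled from the reader as the query is enumerated, the sequence can only be walked once; a second foreach over the same reader would find it already at the end of the text. The same applies to StreamReaderEx.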

So there we have it! The result is an application that is mostly declarative rather than imperative. Rather than writing detailed instructions for every step of the process, we state the result we want (every line that doesn't match the pattern) and leave Linq to work out the looping and filtering for us. The advantages of being declarative rather than imperative are that the code is faster to write, easier to maintain and more concise, and the underlying implementation is free to make optimisations as it sees fit. The disadvantage is that we lose control of the fine details of the execution.
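
To make the contrast concrete, here is a rough side-by-side sketch of the two styles applied to an in-memory array. DeclarativeDemo, FilterImperative, FilterDeclarative and the sample lines are invented for illustration; only the pattern is taken from the program.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class DeclarativeDemo
{
    static readonly Regex CorruptedText = new Regex(@"^[a-zA-Z0-9+/]{72}$");

    // Imperative: spell out the loop, the test and the accumulation.
    public static List<string> FilterImperative(IEnumerable<string> lines)
    {
        List<string> good = new List<string>();
        foreach (string line in lines)
        {
            if (!CorruptedText.IsMatch(line))
                good.Add(line);
        }
        return good;
    }

    // Declarative: describe the lines we want and let Where do the iteration.
    public static List<string> FilterDeclarative(IEnumerable<string> lines)
    {
        return lines.Where(line => !CorruptedText.IsMatch(line)).ToList();
    }

    static void Main()
    {
        // Hypothetical input: a body line, a 72-character meta data line, another body line.
        string[] lines = { "Dear Sir,", new string('x', 72), "Yours faithfully," };

        List<string> a = FilterImperative(lines);
        List<string> b = FilterDeclarative(lines);

        Console.WriteLine(a.SequenceEqual(b));            // True
        Console.WriteLine(string.Join("|", b.ToArray())); // Dear Sir,|Yours faithfully,
    }
}
```

Both methods return the same lines; the difference is only in how much of the mechanics we have to write down ourselves.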

The full code listing is shown below for the sake of completeness. Unfortunately all the formatting around the code gets messed up by the blog editor; I'm looking into a way of tidying that up.

using System;
using System.Collections.Generic;
using System.Linq;
using System.IO;
using System.Text.RegularExpressions;

namespace CleanTextFiles
{
    class Program
    {
        static void Main(string[] args)
        {
            // regex recognises a row of corrupted data.
            Regex corruptedText = new Regex(@"^[a-zA-Z0-9+/]{72}$");
            foreach (string corruptedFilePath in Directory.GetFiles(@"D:\Data\"))
            {
                // cleaned copies go into a "clean" sub-directory next to the originals.
                string cleanedFilePath = String.Concat(Path.GetDirectoryName(corruptedFilePath),
                    @"\clean\",
                    Path.GetFileName(corruptedFilePath));
                Directory.CreateDirectory(Path.GetDirectoryName(cleanedFilePath));
                using (StreamReaderEx reader = new StreamReaderEx(corruptedFilePath))
                using (StreamWriter writer = new StreamWriter(cleanedFilePath))
                {
                    WriteLines(reader, writer, corruptedText);
                }
            }
        }

        /// <summary>
        /// Write every line of text from reader into writer that doesn't match the corruptedText regex pattern.
        /// </summary>
        static void WriteLines(StreamReaderEx reader, StreamWriter writer, Regex corruptedText)
        {
            IEnumerable<string> goodlines = reader.Where(dataLine => !corruptedText.IsMatch(dataLine)).Select(dataLine => dataLine);
            foreach (string goodLine in goodlines)
                writer.WriteLine(goodLine);
        }
    }

    /// <summary>
    /// File reader that enumerates every line of text in a file.
    /// </summary>
    class StreamReaderEx : StreamReader, IEnumerable<string>
    {
        public StreamReaderEx(string path)
            : base(path)
        { }

        #region IEnumerable<string> Members
        // Lazily yields each remaining line of the file.
        public IEnumerator<string> GetEnumerator()
        {
            string dataLine;
            while ((dataLine = base.ReadLine()) != null)
                yield return dataLine;
        }
        #endregion

        #region IEnumerable Members
        // The non-generic version simply defers to the generic one.
        System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
        {
            return GetEnumerator();
        }
        #endregion
    }
}
