To start off I'd like to blog about a program I wrote recently. It was interesting to me because I was able to leverage a couple of existing technologies to write something quite concise.
We had a problem with some "corrupted" text files - some MS Office documents had been converted to text files and apparently contained both the body of the text and the accompanying meta data. What was needed was a text file that contained just the body of the text.
A cursory look over some of the files showed that the meta data was always 72 characters wide and consisted of alphanumeric characters and the symbols '/' and '+'. In that case it is a simple matter to write a Regular Expression that will match a line of meta data:
// regex recognises a row of corrupted data.
Regex corruptedText = new Regex(@"^[a-zA-Z0-9+/]{72}$");
So, we have a method for spotting lines of meta data and now what we need to do is iterate over every line of the file and copy all the lines that don't match to a new text file. Or do we? It would be nice if we could just ask the file to give us all the lines of text that don't match the regular expression, wouldn't it? It turns out that using Linq we can do exactly that!
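As a quick sanity check of the pattern, the sketch below runs it against three sample lines. The sample strings and the MetadataPattern helper are made up for illustration; only the regular expression itself comes from the program described here.

```csharp
using System;
using System.Text.RegularExpressions;

static class MetadataPattern
{
    // Same pattern as in the post: exactly 72 characters drawn from
    // A-Z, a-z, 0-9, '+' and '/'.
    public static readonly Regex CorruptedText = new Regex(@"^[a-zA-Z0-9+/]{72}$");

    public static bool IsMetadata(string line)
    {
        return CorruptedText.IsMatch(line);
    }
}

static class Program
{
    static void Main()
    {
        string metadataLine = new string('A', 70) + "+/";  // 72 allowed characters (made-up sample)
        string shortLine = new string('A', 71);            // one character too short
        string bodyLine = "This is an ordinary sentence from the document body.";

        Console.WriteLine(MetadataPattern.IsMetadata(metadataLine)); // True
        Console.WriteLine(MetadataPattern.IsMetadata(shortLine));    // False
        Console.WriteLine(MetadataPattern.IsMetadata(bodyLine));     // False
    }
}
```

Note that a genuine line of body text that happened to be exactly 72 characters long and contained no spaces or punctuation would also match, so the pattern is a heuristic rather than a guarantee.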
Linq can be used to query any collection that implements the IEnumerable<> interface. If we had such a collection that had an item for each line of text we could create a Linq expression that used the regular expression to give us all the lines of text that weren't meta data:
IEnumerable<string> goodlines = textFile.Where(dataLine => !corruptedText.IsMatch(dataLine)).Select(dataLine => dataLine);
Here textFile represents a collection implementing IEnumerable<string>.
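To see the query in isolation, here is a minimal sketch that uses an in-memory array in place of a file. The FilterDemo class, the GoodLines method and the sample lines are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class FilterDemo
{
    // Returns every line that does NOT look like a row of meta data.
    public static List<string> GoodLines(IEnumerable<string> textFile)
    {
        Regex corruptedText = new Regex(@"^[a-zA-Z0-9+/]{72}$");
        return textFile.Where(dataLine => !corruptedText.IsMatch(dataLine)).ToList();
    }

    static void Main()
    {
        string[] textFile =
        {
            "Dear Sir,",
            new string('x', 72),   // 72 alphanumeric characters, stands in for a meta data row
            "Yours faithfully"
        };

        foreach (string line in GoodLines(textFile))
            Console.WriteLine(line);
        // prints:
        // Dear Sir,
        // Yours faithfully
    }
}
```

The trailing Select(dataLine => dataLine) in the query above is an identity projection, so this sketch leaves it out; Where on its own already returns the matching lines.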
Let's take a minute to look at what we have here. The StreamReaderEx class inherits from the StreamReader class and implements the IEnumerable<string> interface, yielding one line of the file per iteration:
/// <summary>
/// File reader that enumerates every line of text in a file.
/// </summary>
class StreamReaderEx : StreamReader, IEnumerable<string>
{
    public StreamReaderEx(string path)
        : base(path)
    { }

    #region IEnumerable<string> Members
    // Lazily yields one line at a time until the end of the file.
    public IEnumerator<string> GetEnumerator()
    {
        string dataLine;
        while ((dataLine = ReadLine()) != null)
            yield return dataLine;
    }
    #endregion

    #region IEnumerable Members
    // The non-generic enumerator simply forwards to the generic one.
    System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
    #endregion
}
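One consequence of this design is worth spelling out: the enumerator reads from the underlying stream, so a reader can only be walked once. The following sketch demonstrates that under stated assumptions: SinglePassDemo, CountTwice and the temporary sample file are invented for the demonstration, and the class is repeated so the example compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Same shape as the class in the post, repeated so the sketch is self-contained.
class StreamReaderEx : StreamReader, IEnumerable<string>
{
    public StreamReaderEx(string path) : base(path) { }

    public IEnumerator<string> GetEnumerator()
    {
        string dataLine;
        while ((dataLine = ReadLine()) != null)
            yield return dataLine;
    }

    System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}

static class SinglePassDemo
{
    // Returns the line counts seen by two consecutive enumerations of one reader.
    public static int[] CountTwice(string path)
    {
        using (StreamReaderEx reader = new StreamReaderEx(path))
        {
            int first = reader.Count();   // consumes the stream to the end
            int second = reader.Count();  // nothing left to read
            return new[] { first, second };
        }
    }

    static void Main()
    {
        string path = Path.GetTempFileName();   // throwaway sample file
        try
        {
            File.WriteAllLines(path, new[] { "one", "two", "three" });
            int[] counts = CountTwice(path);
            Console.WriteLine(counts[0]); // 3
            Console.WriteLine(counts[1]); // 0
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

For the program in this post that is fine, because each file is filtered exactly once. On .NET Framework 4.0 and later, File.ReadLines(path) provides a built-in lazy IEnumerable<string> over the lines of a file, which may make a custom reader unnecessary.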
So there we have it! The result is an application that is mostly declarative rather than imperative. This means that rather than writing detailed instructions for every step of the process, we can just ask for a specific result and leave it to the compiler and the Linq library to turn that request into detailed instructions for us. The advantages of being declarative rather than imperative are that the code is faster to write, easier to maintain, and more concise, and the compiler and library are free to make optimisations as they see fit. The disadvantage is that we lose control of the fine details of the execution.
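To make the contrast concrete, here is a small side-by-side sketch of the same filter written both ways. The StyleComparison class and its method names are hypothetical; the pattern and the Where call mirror the ones used in this post.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class StyleComparison
{
    static readonly Regex CorruptedText = new Regex(@"^[a-zA-Z0-9+/]{72}$");

    // Imperative: spell out the loop, the test and the accumulation.
    public static List<string> CleanImperative(IEnumerable<string> lines)
    {
        List<string> result = new List<string>();
        foreach (string line in lines)
        {
            if (!CorruptedText.IsMatch(line))
                result.Add(line);
        }
        return result;
    }

    // Declarative: state the condition and let Linq do the iterating.
    public static List<string> CleanDeclarative(IEnumerable<string> lines)
    {
        return lines.Where(line => !CorruptedText.IsMatch(line)).ToList();
    }

    static void Main()
    {
        string[] lines = { "first", new string('Z', 72), "second" };

        List<string> a = CleanImperative(lines);
        List<string> b = CleanDeclarative(lines);

        Console.WriteLine(a.SequenceEqual(b)); // True
        Console.WriteLine(b.Count);            // 2
    }
}
```

Both methods return the same lines; the declarative version simply leaves the looping and bookkeeping to the library.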
The full code listing is shown below for the sake of completeness. Unfortunately all the formatting around the code gets messed up by the blog editor; I'm looking into a way of tidying that up.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

namespace CleanTextFiles
{
    class Program
    {
        static void Main(string[] args)
        {
            // regex recognises a row of corrupted data.
            Regex corruptedText = new Regex(@"^[a-zA-Z0-9+/]{72}$");

            foreach (string corruptedFilePath in Directory.GetFiles(@"D:\Data\"))
            {
                // Cleaned copies go into a "clean" sub-folder next to the originals.
                string cleanedFilePath = String.Concat(Path.GetDirectoryName(corruptedFilePath),
                                                       @"\clean\",
                                                       Path.GetFileName(corruptedFilePath));
                Directory.CreateDirectory(Path.GetDirectoryName(cleanedFilePath));

                using (StreamReaderEx reader = new StreamReaderEx(corruptedFilePath))
                using (StreamWriter writer = new StreamWriter(cleanedFilePath))
                {
                    WriteLines(reader, writer, corruptedText);
                }
            }
        }

        /// <summary>
        /// Write every line of text from reader into writer that doesn't match the corruptedText regex pattern.
        /// </summary>
        static void WriteLines(StreamReaderEx reader, StreamWriter writer, Regex corruptedText)
        {
            IEnumerable<string> goodlines = reader.Where(dataLine => !corruptedText.IsMatch(dataLine)).Select(dataLine => dataLine);
            foreach (string goodLine in goodlines)
                writer.WriteLine(goodLine);
        }
    }

    /// <summary>
    /// File reader that enumerates every line of text in a file.
    /// </summary>
    class StreamReaderEx : StreamReader, IEnumerable<string>
    {
        public StreamReaderEx(string path)
            : base(path)
        { }

        #region IEnumerable<string> Members
        // Lazily yields one line at a time until the end of the file.
        public IEnumerator<string> GetEnumerator()
        {
            string dataLine;
            while ((dataLine = ReadLine()) != null)
                yield return dataLine;
        }
        #endregion

        #region IEnumerable Members
        // The non-generic enumerator simply forwards to the generic one.
        System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
        {
            return GetEnumerator();
        }
        #endregion
    }
}