Sunday, 15 April 2012

Not SP-Related: Searching for a Proper Name Within Range of a Word

This is not related to the day job or to SharePoint, but it is coding stuff, so here is as good a place as any to put it. I rarely do geek stuff in my free time, but I thought this might be of some interest.

I was having a discussion with a very nice journalist on the tweet machine a few weeks ago and offered to write some code for her. She wanted to search a Very Large CSV file, about 2GB in size, for names appearing near various words (I won't name the file cos it's a bit political and this blog is not about politics :-). One of these words was "whistleblower".

I came up with an algorithm that would search a 50-character radius either side of the instance of the word or, if there weren't enough characters, the whole line (I was asking the stream reader to ReadLine() each time). It would look for a word with a capital letter that was not preceded by a full stop and a space, since a capital there is probably just the start of a sentence. In order to make it run, I downloaded the freebie edition of Microsoft Visual Studio C# Express, which had everything I needed for the purpose.
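
The windowing and capital-letter check described above can be sketched as a pair of small helpers. This is only a rough sketch and not the listing further down: the names NameSearch, Excerpt and PossibleNames are made up for illustration, and I am assuming a "word" starts wherever the previous character is not a letter. It is the same crude heuristic, so expect plenty of false positives.

```csharp
using System;
using System.Collections.Generic;

static class NameSearch
{
    // Returns 'radius' characters either side of the word at 'position',
    // or the whole line if there are not enough characters on both sides.
    public static string Excerpt(string line, int position, int wordLength, int radius)
    {
        if (position - radius >= 0 && position + wordLength + radius <= line.Length)
        {
            // Substring takes a start index and a length, not an end index.
            return line.Substring(position - radius, wordLength + 2 * radius);
        }
        return line;
    }

    // Collects words that start with a capital letter, skipping any word that
    // follows a full stop and a space (probably just the start of a sentence).
    // A word runs up to the next whitespace, so trailing punctuation is kept.
    public static List<string> PossibleNames(string text)
    {
        var names = new List<string>();
        for (int i = 0; i < text.Length; i++)
        {
            // Assumption: a word starts where the previous character is not a letter.
            bool startsWord = i == 0 || !Char.IsLetter(text[i - 1]);
            if (!startsWord || !Char.IsUpper(text[i]))
            {
                continue;
            }

            bool afterFullStop = i >= 2 && text[i - 2] == '.' && text[i - 1] == ' ';

            // Find the end of this word (next whitespace or end of text).
            int end = i;
            while (end < text.Length && !Char.IsWhiteSpace(text[end]))
            {
                end++;
            }

            if (!afterFullStop)
            {
                names.Add(text.Substring(i, end - i));
            }
            i = end; // carry on after this word
        }
        return names;
    }

    static void Main()
    {
        string line = "He said the whistleblower was John Smith. Then he left.";
        int position = line.IndexOf("whistleblower", StringComparison.OrdinalIgnoreCase);
        string excerpt = Excerpt(line, position, "whistleblower".Length, 50);
        foreach (string name in PossibleNames(excerpt))
        {
            Console.WriteLine("Found possible name string: " + name);
        }
    }
}
```

The sample line is shorter than the 50-character radius, so the whole line gets searched. Note that "Then" is skipped because it follows a full stop and a space, while "He" at the very start of the line is still reported, which shows how rough the heuristic is.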

Now this is not perfect and it needs quite a bit of work. For a start, I need to write some extra code to grab the initial code of each cable. (Yeah, it's becoming more obvious what I'm referring to here. For the record, allow me to say that I have no interest in the cause this organisation espouses, or affection for its leaders, but given it is now in the public domain... ::shrugs::) But if I wait till I do that, I might never get it done. Also, the file has the annoying habit of containing a lot of acronyms and randomly capitalised words, a problem which I can't get around at present.

What this code does: Examines the file line by line, looks for the word "whistleblower" (the word can be changed), then writes the possible names it finds to a file called results.txt, which will end up in whichever folder you set in the code (C:\YourFolder in the listing below).
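
One thing to be aware of: the listing below uses IndexOf, so it only picks up the first match on each line. If you want every instance on a line, something along these lines might do it. Again this is just a sketch, and OccurrenceSearch and AllPositions are names I have made up for illustration.

```csharp
using System;
using System.Collections.Generic;

static class OccurrenceSearch
{
    // Returns the start index of every occurrence of 'word' in 'line', ignoring case.
    public static List<int> AllPositions(string line, string word)
    {
        var positions = new List<int>();
        if (string.IsNullOrEmpty(word))
        {
            return positions; // nothing sensible to search for
        }

        int position = line.IndexOf(word, StringComparison.OrdinalIgnoreCase);
        while (position != -1)
        {
            positions.Add(position);

            // Carry on searching just past the end of this match.
            int next = position + word.Length;
            if (next >= line.Length)
            {
                break;
            }
            position = line.IndexOf(word, next, StringComparison.OrdinalIgnoreCase);
        }
        return positions;
    }

    static void Main()
    {
        string line = "A whistleblower met a Whistleblower";
        foreach (int position in AllPositions(line, "whistleblower"))
        {
            Console.WriteLine("Found the word at position " + position);
        }
    }
}
```

Each position returned could then be fed through the same 50-character search as the first match.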

What you need to do: download VS, create a console application, go to Program.cs and copy this in wholesale. Sorry about the brackets; it should format them properly for you once it's pasted in.

Oh and you need to download the source file. There are plenty of web resources telling you how to do that :)


using System;
using System.Collections.Generic;
using System.Text;
using System.IO;
using System.Reflection;

namespace ReadFromFile
{
class Program
{
static void Main(string[] args)
{
StreamReader reader = null;

//write to file - change YourFolder to your actual folder!
FileStream fs = new FileStream("C:\\YourFolder\\results.txt", FileMode.Create);
StreamWriter sw = new StreamWriter(fs);
TextWriter tmp = Console.Out;

try
{
//read from file - change YourFolder to your actual folder!
string fileName = "C:\\YourFolder\\cables.csv";
reader = File.OpenText(fileName);
Console.SetOut(sw);

while (!reader.EndOfStream)
{
string line = reader.ReadLine();
//change "whistleblower" to the word you want; the search ignores case
int position = line.IndexOf("whistleblower", StringComparison.OrdinalIgnoreCase);
if (position == -1)
{
//no instances, next row
continue;
}

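//we have a value for required string
string subLine = line;
//length of the search word - change to match your word
int wordLength = "whistleblower".Length;
if ((position - 50) >= 0 && ((position + wordLength + 50) <= line.Length))
{
//Substring takes a start index and a length, not an end index
subLine = line.Substring(position - 50, wordLength + 100);
}
else
{
//not enough characters either side, so use the whole line
subLine = line;
}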

//search the excerpt around the word, not the whole line
char[] characters = subLine.ToCharArray();

for (int i = 0; i < characters.Length; i++)
{
if (Char.IsLetter(characters[i]) && (i >= 2))
{

//compare with the char '.', because Equals(".") compares a char with a string and is always false
if (characters[i - 2] != '.')
{

if (Char.IsUpper(characters[i]))
{ //do stuff
//start from i and go to next space
Console.Write("Found possible name string: ");
for (int j = i; j < characters.Length && !Char.IsWhiteSpace(characters[j]); j++)
{
Console.Write(characters[j]);
}
//always end the line, even when the word runs to the end of the text
Console.WriteLine();
}
}
}

}
}

}
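finally
{
//reset console output before closing the file it is writing to
Console.SetOut(tmp);
//close in and out files
sw.Close();
fs.Close();
//reader is still null if the source file could not be opened
if (reader != null)
{
reader.Close();
}
}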
}
}
}

