
Text analysis in Lapsang

Lapsang is my personal programming project: it downloads article titles from an RSS feed and recommends which ones are probably the most interesting for the user. Initially it will only use the title for analysis (later versions may use the actual article content, URL, poster, etc.). For every title the user indicates whether it sounds attractive to read, and based on this input the program learns the user’s interests. As soon as the user rates a title, the scores of the individual words of this title are adjusted: for attractive titles the word scores increase, and for uninteresting titles they decrease. Based on these word scores the program can give a recommendation for new titles: as soon as a title contains words the program has seen before, it uses their scores to calculate a recommendation.
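To make the idea more concrete, below is a minimal sketch of such a word scorer. The class name TitleScorer, the fixed +1/-1 adjustment per word, and the averaging in Recommend are my own assumptions for illustration, not the final design:

using System.Collections.Generic;

// Minimal sketch of the word scoring idea; the +1/-1 adjustment
// and the averaging are assumptions, not the final design.
class TitleScorer
{
	readonly Dictionary<string, double> wordScores = new Dictionary<string, double>();

	// Called when the user rates a title; 'words' are the words of that title.
	public void Rate(IEnumerable<string> words, bool attractive)
	{
		foreach (string word in words)
		{
			double score;
			wordScores.TryGetValue(word, out score); // 0 for unseen words
			wordScores[word] = score + (attractive ? 1.0 : -1.0);
		}
	}

	// Recommendation for a new title: the average score of the words seen before.
	public double Recommend(IEnumerable<string> words)
	{
		double total = 0;
		int known = 0;
		foreach (string word in words)
		{
			double score;
			if (wordScores.TryGetValue(word, out score))
			{
				total += score;
				known++;
			}
		}
		return known == 0 ? 0.0 : total / known;
	}
}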

Not all words in a title will be used for scoring. Stop words like ‘a’, ‘the’, ‘and’, etc. can be ignored, as they add no significant value. What about singular and plural: should ‘language’ be considered the same word, and thus get the same score, as ‘languages’? And how should I handle verbs, in present and past tense? For simplicity I will assume all titles are in English, as multilingual text analysis is way over my head for now :-) It is probably sufficient to store only the stem of a word. There are various stemming algorithms available, and I found a C# implementation of Porter stemming that I could integrate into my own tokenizer. As this text analysis was getting more and more complex, I decided to take a step back, stop my own implementation, and look at the open source libraries already available.

Lucene.NET (a port of the Java search engine Lucene) is a mature project and contains various text analyzers. Maybe other parts of this library will turn out to be useful in my program too, but first I will focus on the text analysis. I downloaded the binaries from the Lucene.NET website to start experimenting. The library is ported almost literally from the Java version (easier for the team to maintain, they say), so the API feels very Java-ish rather than .NET-ish. The download contains the core Lucene.Net.dll and various additional libraries for more advanced analysis purposes.

I created a temporary project and added references to Lucene.Net.dll and Snowball.Net.dll (for language-specific analysis using stemming). The following code shows the results of using various analyzers to tokenize a text:

using System;
using System.IO;

using Lucene.Net.Analysis;

namespace TryTextAnalysis
{
	class MainClass
	{
		static void Main(string[] args)
		{
			string title = "My husband is a programmer; I have no idea what that means.";
			Console.WriteLine(title);

			// Compare how the various analyzers tokenize the same title.
			ShowTokens(title, new WhitespaceAnalyzer());
			ShowTokens(title, new SimpleAnalyzer());
			ShowTokens(title, new StopAnalyzer());
			ShowTokens(title, new Lucene.Net.Analysis.Standard.StandardAnalyzer());
			ShowTokens(title, new Lucene.Net.Analysis.Snowball.SnowballAnalyzer("English", StopAnalyzer.ENGLISH_STOP_WORDS));
		}

		static void ShowTokens(string text, Analyzer analyzer)
		{
			Console.WriteLine(analyzer.GetType());

			// An analyzer turns a stream of characters into a stream of tokens.
			TokenStream stream = analyzer.TokenStream("text", new StringReader(text));
			Token token;
			while ((token = stream.Next()) != null)
			{
				Console.Write(" [{0}]", token.TermText());
			}
			stream.Close();
			Console.WriteLine();
		}
	}
}

The results when you run this program are as follows:

My husband is a programmer; I have no idea what that means.
Lucene.Net.Analysis.WhitespaceAnalyzer
 [My] [husband] [is] [a] [programmer;] [I] [have] [no] [idea] [what] [that] [means.]
Lucene.Net.Analysis.SimpleAnalyzer
 [my] [husband] [is] [a] [programmer] [i] [have] [no] [idea] [what] [that] [means]
Lucene.Net.Analysis.StopAnalyzer
 [my] [husband] [programmer] [i] [have] [idea] [what] [means]
Lucene.Net.Analysis.Standard.StandardAnalyzer
 [my] [husband] [programmer] [i] [have] [idea] [what] [means]
Lucene.Net.Analysis.Snowball.SnowballAnalyzer
 [my] [husband] [programm] [i] [have] [idea] [what] [mean]

The SnowballAnalyzer suits my needs, so now I can start coding the word scoring algorithm.
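As a first step towards that, a small helper can turn a title into stemmed tokens and feed them to a word scorer like the TitleScorer sketched earlier. This is just a sketch using the same Lucene.NET calls as above; the Tokenizer name and the wiring are my own:

using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Snowball;

static class Tokenizer
{
	// Turns a title into a list of lower-cased, stop-word-filtered, stemmed tokens.
	public static List<string> Tokenize(string text)
	{
		Analyzer analyzer = new SnowballAnalyzer("English", StopAnalyzer.ENGLISH_STOP_WORDS);
		TokenStream stream = analyzer.TokenStream("text", new StringReader(text));
		List<string> tokens = new List<string>();
		Token token;
		while ((token = stream.Next()) != null)
		{
			tokens.Add(token.TermText());
		}
		stream.Close();
		return tokens;
	}
}

// Usage:
//   TitleScorer scorer = new TitleScorer();
//   scorer.Rate(Tokenizer.Tokenize("My husband is a programmer; I have no idea what that means."), true);
//   double suggestion = scorer.Recommend(Tokenizer.Tokenize("Some new title"));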

The birth of Lapsang

In my previous post I described a personal programming project I was planning. I wanted to use Ruby as the programming language of choice, but there are a couple of reasons for going back to C#. First, I couldn’t find a working GUI toolkit for Ruby (at least not one that works on my OSX machine); I spent a full day on it and decided to move back to System.Windows.Forms (I can use the Mono implementation). Another reason is, as Michel already commented, that a new language will slow me down too much, and at this moment I prefer a working application over a new language.

The hardest part of a new project is the name, and after deep thought I came up with Lapsang (a black Chinese tea).

When downloading the data from the HackerNews RSS feed I saw that it contains only 25 items. I need more items to train the network, so I decided to make a quick-and-dirty screen scraper that browses the website and downloads the items from multiple pages. This will probably not be part of the final product; I prefer a pure RSS feed (or multiple feeds, maybe via an OPML file) as my data source.
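For comparison, reading the titles from the feed itself takes only a few lines with LINQ to XML. This is a sketch; the feed location (news.ycombinator.com/rss) is my assumption of where HackerNews publishes its RSS:

using System;
using System.Xml.Linq;

class FeedReader
{
	static void Main()
	{
		// Assumed feed URL; this feed contains only about 25 items,
		// which is why the scraper mentioned above is needed for training data.
		XDocument feed = XDocument.Load("http://news.ycombinator.com/rss");

		foreach (XElement item in feed.Descendants("item"))
		{
			Console.WriteLine((string)item.Element("title"));
		}
	}
}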

My personal programming project

Introduction

Lately I have been following the news feed Hacker News (http://news.ycombinator.com), where members can post links to interesting (IT-related) articles. There are a couple of new items per hour, and members can vote these items up or down the list. Not all articles are interesting to me and my time is limited, so I must choose which articles to read. The site shows 30 items per page: only the title and the website it refers to, no description. Based on the title I select the most interesting items and open them in new browser tabs, read them, close the tabs, and move on to the next page.

A cunning plan

My idea is to make an application that helps me by suggesting interesting articles based on the items I selected before. Over time it should get better at giving suggestions. I finally have a reason to dive into neural networks. Ten years ago, during my traineeship in Bangkok, I worked in the Software and Language Laboratory of NECTEC, where colleagues were working on Thai OCR and text-to-speech software using neural networks. I was fascinated, but never found a reason to use them in my professional developer career.

I have been developing applications in C# for the last decade, so it’s time for something new! The application will be built using Ruby, as this is a language that has been on my ‘learning’ wishlist for a couple of years. I am not sure yet which libraries to use for the user interface. At home I use OSX and at work I use Windows, and it would be nice if the application runs on both platforms. I am still having trouble getting wxRuby up and running at home, so I will have a look at IronRuby in combination with Mono (WinForms), or maybe another cross-platform GUI toolkit. At least this is definitely a reason to use a pattern like Model-View-Controller or Model-View-Presenter and keep the UI layer as thin as possible.
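To give an idea of what such a thin UI layer looks like, here is a rough Model-View-Presenter sketch, written in C# for familiarity (the pattern translates directly to Ruby); all names are placeholders:

using System;
using System.Collections.Generic;

// The view is reduced to an interface, so the presenter holds the logic
// and does not depend on any particular GUI toolkit.
public interface IItemListView
{
	void ShowTitles(IList<string> titles);
	event Action<string> TitleSelected; // raised when the user picks a title
}

public class ItemListPresenter
{
	readonly IItemListView view;

	public ItemListPresenter(IItemListView view)
	{
		this.view = view;
		view.TitleSelected += OnTitleSelected;
	}

	public void Load(IList<string> titles)
	{
		view.ShowTitles(titles);
	}

	void OnTitleSelected(string title)
	{
		// Here the application would open the article in a browser
		// and record the selection for the learning component.
		Console.WriteLine("Selected: " + title);
	}
}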

The user interface will have a list of titles and a button to open a web browser with the selected article. I want to rate every title, probably by dragging new items into one of three buckets/lists (must-read, maybe, not-worth-my-time) or by giving 1 to 3 stars. The title and the score will be used in the learning part of the application. When the application downloads new items from the news feed, it will give each of them a suggested score that should be visible to me (e.g. colors, size, stars, sorting). First I will make it work, then I will make it beautiful.
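The underlying model can stay simple. A sketch, where the names and the 1-3 scale are my own assumptions:

using System;

// Sketch of the item model; names and the 1-3 scale are assumptions.
enum Rating
{
	NotWorthMyTime = 1,
	Maybe = 2,
	MustRead = 3
}

class FeedItem
{
	public string Title;
	public Uri Link;
	public Rating? UserRating;    // null until the user rates the item
	public double SuggestedScore; // filled in by the learning part
}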

Conclusion

This is a project with a lot of new things to learn (for me and for the application), but it is definitely a challenge :-)
