Monday, 19. December 2005

A lightweight language guesser in Java

Thomas released his Java Language Guesser (which is basically a Java port of libTextCat) under the LGPL:
-> http://textcat.sourceforge.net/

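// Server-side JavaScript in Helma: Packages.* gives access to Java classes,
// and res.writeln() writes a line to the response.

// Sample texts in German, English and Finnish.
var de = "Hallo! Das ist ein deutscher Text.";
var en = "Hi! This is some english text.";
var fi = "Moi joulupukki!";

// Create the categorizer and print its guess for each sample.
var guesser = new Packages.org.knallgrau.utils.textcat.TextCategorizer();
res.writeln("de: " + guesser.categorize(de));
res.writeln("en: " + guesser.categorize(en));
res.writeln("fi: " + guesser.categorize(fi));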


I haven't read through the whole paper, but the idea behind the algorithm seems pretty simple, yet it turns out to be quite powerful, i.e. accurate.
For each category (in our case: each language) a fingerprint is provided that simply contains the frequencies of all n-grams. The n-gram frequencies of a new document are then compared to these fingerprints, and whichever category/language comes closest is the guess.
If you want to use textcat for something other than languages, the library also provides a handy method to create such fingerprints (i.e. to train it on your own dataset).
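
To make the idea more concrete, here is a minimal sketch in plain JavaScript of how such fingerprints could be built and compared. It is not the textcat API: countNgrams, fingerprint, distance and guess are hypothetical names, and the n-gram length (1 to 5), the fingerprint size (300) and the penalty for missing n-grams are assumptions chosen for illustration. The comparison works on frequency ranks, in the spirit of the rank-differences mentioned in the addendum.

```javascript
// Sketch only: hypothetical helpers, NOT the textcat API.

// Count all character n-grams of length 1..maxN, word by word.
function countNgrams(text, maxN) {
  var counts = new Map();
  var words = text.toLowerCase().split(/[^\p{L}]+/u).filter(Boolean);
  words.forEach(function (word) {
    // Pad with "_" so that word boundaries become part of the n-grams.
    var padded = "_" + word + "_";
    for (var n = 1; n <= maxN; n++) {
      for (var i = 0; i + n <= padded.length; i++) {
        var gram = padded.slice(i, i + n);
        counts.set(gram, (counts.get(gram) || 0) + 1);
      }
    }
  });
  return counts;
}

// A fingerprint is the list of the `size` most frequent n-grams,
// most frequent first (ties broken alphabetically).
function fingerprint(text, maxN, size) {
  return Array.from(countNgrams(text, maxN).entries())
    .sort(function (a, b) { return b[1] - a[1] || (a[0] < b[0] ? -1 : 1); })
    .slice(0, size)
    .map(function (entry) { return entry[0]; });
}

// Sum of rank differences between two fingerprints. An n-gram missing
// from the category gets a fixed penalty (assumption: the category's size).
function distance(docPrint, categoryPrint) {
  var total = 0;
  docPrint.forEach(function (gram, docRank) {
    var catRank = categoryPrint.indexOf(gram);
    total += catRank === -1 ? categoryPrint.length : Math.abs(docRank - catRank);
  });
  return total;
}

// Return the name of the category whose fingerprint is closest.
function guess(text, fingerprints, maxN, size) {
  var docPrint = fingerprint(text, maxN, size);
  var best = null;
  var bestDistance = Infinity;
  Object.keys(fingerprints).forEach(function (name) {
    var d = distance(docPrint, fingerprints[name]);
    if (d < bestDistance) { best = name; bestDistance = d; }
  });
  return best;
}

// "Training" here just means fingerprinting one sample text per category.
var fingerprints = {
  de: fingerprint("Hallo! Das ist ein deutscher Text.", 5, 300),
  en: fingerprint("Hi! This is some english text.", 5, 300)
};
console.log(guess("This is another text.", fingerprints, 5, 300));
```

Real fingerprints would of course be trained on far more text than a single sentence per language; the tiny samples above only show the mechanics.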

I bet this would also work to determine whether a text was written by an Austrian, a Swiss or a German (which would be interesting to find out here at twoday.net). I would even go so far as to say that it would probably work to determine the gender of the author. And I am afraid that "fingerprint" isn't such a far-fetched analogy, and that such an algorithm will, at some not-so-distant time in the future, also work to determine the actual author of a text, which is a frightening thought.

Addendum: Actually, after reading the paper, I must withdraw my statement that it should be possible to detect different writing styles between different people, at least with the distance measure the authors proposed. A distance measure that effectively takes the "long tail" of the Zipf distribution into account (i.e. one that doesn't just "cut it off") and applies different weightings to the rank differences (i.e. rank differences for top-ranked n-grams carry more weight) would probably help.
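
As a rough illustration of what such a weighted measure could look like, here is a small JavaScript sketch. It is only one possible reading of the idea: weightedDistance and rankWeight are hypothetical names, the 1/(rank + 1) weighting is an assumption picked for simplicity, and a fingerprint is assumed to be an array of n-grams ordered from most to least frequent.

```javascript
// Sketch only: one possible weighting, not taken from the paper or from textcat.

// Hypothetical weight: a rank difference near the top of the list
// counts more than the same difference far down in the tail.
function rankWeight(rank) {
  return 1 / (rank + 1);
}

// Weighted sum of rank differences between two ranked n-gram lists.
// An n-gram missing from the category gets a fixed difference
// (assumption: the length of the category fingerprint).
function weightedDistance(docPrint, categoryPrint) {
  var total = 0;
  docPrint.forEach(function (gram, docRank) {
    var catRank = categoryPrint.indexOf(gram);
    var diff = catRank === -1 ? categoryPrint.length : Math.abs(docRank - catRank);
    total += rankWeight(docRank) * diff;
  });
  return total;
}

// Swapping the two top-ranked n-grams costs more than swapping the last two.
var topSwap = weightedDistance(["a", "b", "c"], ["b", "a", "c"]);
var tailSwap = weightedDistance(["a", "b", "c"], ["a", "c", "b"]);
console.log(topSwap > tailSwap);
```

Because no n-gram is dropped, the whole tail still contributes to the distance, just with less and less weight per rank.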

About michi

Michi a.k.a. 'Michael Platzer' is one of the Knallgraus, a Vienna-based New Media Agency that deals more and more with 'stuff' that is commonly termed Social Software.
