A lightweight Language Guesser in Java
Thomas released his Java Language Guesser (basically a Java port of libTextCat) under the LGPL:
-> http://textcat.sourceforge.net/
I haven't read through the whole paper yet, but the idea behind the algorithm seems pretty simple, yet turns out to be quite powerful, i.e. accurate.
For each category (in our case: for each language) a fingerprint is provided that simply contains the frequencies of its n-grams. The n-gram frequencies of a new document are then compared against these fingerprints, and whichever category/language comes closest becomes the guess.
If you want to use textcat for something other than languages, the library also provides a handy method to create such fingerprints (i.e. to train it on your own data).
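To make the idea concrete, here is a self-contained sketch of that fingerprint-and-compare scheme, in the spirit of the Cavnar & Trenkle paper the library is based on. The class and method names are my own, not the library's API; this is an illustration of the technique, not the actual implementation.

```java
import java.util.*;

// Sketch of rank-ordered n-gram fingerprints plus the "out-of-place"
// distance measure. Names and details are assumptions for illustration.
public class NGramSketch {

    // Collect 1- to 3-gram frequencies of a text and return the top
    // n-grams ordered by descending frequency (the "fingerprint").
    static List<String> fingerprint(String text, int size) {
        Map<String, Integer> freq = new HashMap<>();
        // pad word boundaries with '_', as is common in n-gram models
        String padded = "_" + text.toLowerCase().replaceAll("\\s+", "_") + "_";
        for (int n = 1; n <= 3; n++) {
            for (int i = 0; i + n <= padded.length(); i++) {
                freq.merge(padded.substring(i, i + n), 1, Integer::sum);
            }
        }
        List<String> ranked = new ArrayList<>(freq.keySet());
        ranked.sort((a, b) -> freq.get(b) - freq.get(a));
        return ranked.subList(0, Math.min(size, ranked.size()));
    }

    // Out-of-place distance: sum over the document's n-grams of the
    // rank difference against the category fingerprint; n-grams missing
    // from the category get a fixed maximum penalty.
    static int distance(List<String> doc, List<String> category) {
        int sum = 0;
        for (int i = 0; i < doc.size(); i++) {
            int j = category.indexOf(doc.get(i));
            sum += (j < 0) ? category.size() : Math.abs(i - j);
        }
        return sum;
    }
}
```

To classify, you would build one fingerprint per language from training text, fingerprint the new document, and pick the language with the smallest distance.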
I bet this would also work to determine whether a text was written by an Austrian, a Swiss or a German (which would be interesting to determine here at twoday.net). I would even go so far as to say that it would probably work to determine the gender of the author. And I am afraid that "fingerprint" isn't such a far-fetched analogy, and that at some not-so-distant time in the future such an algorithm will also work to determine the actual author of a text, which is a frightening thought.
Addendum: Actually, after reading the paper, I must withdraw my claim that it should be possible to detect the different writing styles of different people. At least not with the distance measure they propose. A distance measure that also effectively considers the "long tail" of the Zipf distribution (i.e. one that doesn't just "cut it off") by weighting the rank differences (rank differences for top-ranked n-grams should carry more weight) would probably help.
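That weighting idea could be sketched like so. This is my own hedged variant, not something from the paper or the library: divide each rank difference by the document rank, so mismatches among the top-ranked n-grams dominate while the tail still contributes.

```java
import java.util.List;

// Sketch of a weighted variant of the out-of-place measure: rank
// differences for top-ranked n-grams count more. The 1/(rank+1)
// weighting is my own assumption, purely for illustration.
public class WeightedDistance {
    static double distance(List<String> doc, List<String> category) {
        double sum = 0;
        for (int i = 0; i < doc.size(); i++) {
            int j = category.indexOf(doc.get(i));
            int outOfPlace = (j < 0) ? category.size() : Math.abs(i - j);
            sum += outOfPlace / (double) (i + 1); // top ranks weigh most
        }
        return sum;
    }
}
```

Whether such a weighting actually helps with style detection would of course have to be tested empirically.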
// a quick test from Helma (server-side Rhino JavaScript):
var de = "Hallo! Das ist ein deutscher Text.";
var en = "Hi! This is some english text.";
var fi = "Moi joulupukki!";
// instantiate the categorizer from the Java library
var guesser = new Packages.org.knallgrau.utils.textcat.TextCategorizer();
res.writeln("de: " + guesser.categorize(de));
res.writeln("en: " + guesser.categorize(en));
res.writeln("fi: " + guesser.categorize(fi));
michi - 19.Dec 2005 10:28 - technical