Thursday, 22. December 2005

twoday 1.0.0 (beta)

As I already announced last week on helma-dev, we will provide twoday (pretty much as it can currently be enjoyed here at twoday.net) under an open-source license (BSD-style!), and plan to "establish an active user and developer community around the twoday-software". A first step in this direction is the release of twoday-1_0_0beta, the setup of the corresponding sourceforge.net project, and the setup of the twoday.org wiki.

We already made a similar announcement over a year ago, when we planned to release twoday under the GPL, which we never actually did. This time we are more aware of the consequences, and should have the time to maintain the project and to support both users and developers.

This is a pretty big step for us. twoday has become a core business for us over the past two years, and we invest quite some effort in its ongoing development. Projects like twoday.net, weblife.at, moday.at, twoday.tuwien.ac.at, ... should speak for themselves. Nevertheless, I am fully convinced that this is the right step, and I am very happy to work in an environment where my partners share the same opinion on open source.

So, go ahead and start your blog community today! (Urgh, saying this actually hurt a bit :-)) No matter whether it's just for you, for your school or university, for your company, or whether you want to beat twoday.net at free hosting for everyone.
Well, actually, better wait until twoday 1.0.x is declared stable, which should happen soon, once we have received some initial feedback on this release.

-> http://twoday.org
Still under construction (like every wiki), but at least the start page looks fine to me.

What web2.0 is NOT about

  1. Privacy
  2. Anonymity
  3. Open-Source (?)
Google is good, too good actually. And it's not just Google that is going to become the bad guy; everybody will turn mean as soon as more VC money hits the scene.

With the information currently available on the net, anybody can find out pretty damn much about me (and very likely also about you, the reader). You will find out which company I work for (OK, that's easy), where I live and what I do. You can read all my stories and comments at twoday.net, all the emails I posted to various mailing lists, and all my bookmarks. Furthermore, you will find plenty of images and even some video footage. So, after digging through all these megabytes of data, you will get a pretty solid picture of my person. And my personality! You will know how I handle criticism, you will know that I turn cynical from time to time, you will know what my political views are, you will know how good my English is! You will know much more than I ever wanted to give away.

But lately I realised that you can find out even more than I deliberately gave away, simply by collecting information about my "social network". Just look at who is writing comments here (ignore the spam for the moment :-)), look at where I am writing comments, look at who is linking to me and vice versa. Even if these friends only use their nicknames, you will find more images and stories about me on their blogs, their flickr accounts and their delicious accounts.

So, who cares? It's just me anyway. Who would be so interested as to invest that much effort into analyzing my person? Well, the point is that at some (near) point in the future it will take no effort at all. So many companies are investing in information retrieval that it is a matter of sooner rather than later before a search engine specializes in "person(ality) retrieval". A search for my name will bring you all of the above information within a split second. My skills, my personality, my feelings, my thoughts, my health, my social status (PersonRank?), everything neatly summarized and categorized. A bit scary, wouldn't you say?

Well, it gets even scarier if one takes a look at Google's latest acquisitions:
  • Google bought Riya, an automated face-recognition company! (Update: Actually, this is just a rumour. Sorry for spreading it again.)
  • Google bought Urchin, a web-stats analysis tool.
  • Google bought 5% of AOL, and, well, you know what they do.
The Riya acquisition is scary for obvious reasons, but the other two are not as apparent at first sight. The point is: Google is buying traffic information! Just take a look at Google Analytics and you will see how smart they already are (our company IP address resolves to 62-99-202-58.zollergasse.xdsl-line.inode.at, and I have no idea how they manage to map this to our correct(!) company name). Pretty soon, Google will be able to link persons and IP addresses, giving them nearly unlimited information about each one of us.

Ha, and we were afraid of desktop music players broadcasting some information to their servers, or of the police performing dragnet investigations ("Rasterfahndung"). Peanuts. George Orwell has already arrived.

If you still haven't realised why all this information about my/your person is such a bad thing, read this story and then think again:
-> Blogger Blocked at U.S. Border

Sorry to spoil all the cheers about web2.0, but I would really like to see some of these things cleared up before everybody starts undressing themselves on the net. But I guess it's already too late (at least for me) anyway.

Note 1: Regarding item 3 in the list at the beginning, I might write a follow-up story at some later time.

Note 2: You might consider this article a bit ironic, considering that we are actually hosting thousands of blogs ourselves, and that we are also investing in information retrieval. Nonetheless, I am a web2.0 user just like you are, and in that respect I am very interested in raising awareness of these issues.

PageRank 10

According to this list, StatCounter is one of only 16 websites that score a PageRank of 10. Wow, that's something! Just build a free stats service, boost the PageRank of the corresponding site, and then go ahead and sell advertising space.

Wednesday, 21. December 2005

Chris Kröll

One of the world's best snowboarders. Live and uncensored.
-> Chris Kröll

pimp

Gregor Seberg

Gregor Seberg reads mindestenshaltbar articles! That is just amazing. I really appreciate his way of reading texts aloud. For everyone who can't remember anymore, or who is too young to remember, here once again are the links to the README reading from back then:
-> http://readme.twoday.net/stories/190352/
and
-> http://readme.twoday.net/stories/190374/

Tuesday, 20. December 2005

a knallgrau Christmas

Sensational pictures from our Christmas party yesterday can be found at Helge's:
-> http://www.helge.at/photos/xmas-knallgrau/dsc_5289.php
-> http://www.helge.at/photos/xmas-knallgrau/dsc_5321.php
-> http://www.helge.at/photos/xmas-knallgrau/

Barbara also has photos galore.

Addendum: And Markus is responsible for the nice Christmas animation on our Knallgrau start page.

eval is evil

JSON is a handy, comprehensible format for storing and transmitting data, especially when working with JavaScript. The syntax looks something like this:
var obj = {"key1": "value1", "key2": ["value2", "value3"]};
And the nice thing in JavaScript is that all you need to do to convert such a string into the corresponding object or array is to call eval() on it.
Well, well, this "good" thing is actually a very "bad" thing (especially on the server side, i.e. in helma), since it might encourage programmers to call eval on all kinds of unsafe strings. Fortunately, there is a free JavaScript library available at http://www.crockford.com/JSON/js.html which parses such strings in a safe way. This library can and should always be used instead of eval!
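To illustrate the danger, here is a minimal sketch in Helma-style JavaScript (res.writeln as in the textcat example further down this page). JSON.parse merely stands in for the library's safe parser; the exact entry point of Crockford's library is an assumption on my part:

// an attacker-controlled string: valid JavaScript, but not valid JSON
var evil = 'res.write("owned")';
// eval(evil) would execute the call on the server!
try {
    JSON.parse(evil);   // a strict parser throws instead of executing
} catch (e) {
    res.writeln("rejected unsafe input: " + e);
}
// well-formed JSON, on the other hand, parses into a plain object
var obj = JSON.parse('{"key1": "value1", "key2": ["value2", "value3"]}');
res.writeln(obj.key2[0]);   // prints: value2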

RubyOnRails (or Ruby in general?) heavily uses YAML as a data format, and I must say that I really like that format. But go ahead and judge for yourself:
-> http://www.yaml.org/start.html
On the website you will also find a Java as well as a JavaScript implementation of that format.
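For comparison, the object from the JSON example above would look roughly like this in YAML (my own sketch of the basic syntax, not an example taken from that site):

key1: value1
key2:
  - value2
  - value3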

Monday, 19. December 2005

a light-weight Language Guesser in Java

Thomas released his Java Language Guesser (which is basically a Java port of libTextCat) under the LGPL:
-> http://textcat.sourceforge.net/

// Helma server-side JavaScript: the Java library is reachable via Packages
var de = "Hallo! Das ist ein deutscher Text.";
var en = "Hi! This is some english text.";
var fi = "Moi joulupukki!";

var guesser = new Packages.org.knallgrau.utils.textcat.TextCategorizer();
res.writeln("de: " + guesser.categorize(de));
res.writeln("en: " + guesser.categorize(en));
res.writeln("fi: " + guesser.categorize(fi));


I haven't read through the whole paper yet, but it seems that the idea of the algorithm is pretty simple, yet turns out to be quite powerful, i.e. accurate.
For each category (in our case here: for each language) a fingerprint is provided that simply contains the frequencies of all n-grams. The n-gram frequencies of a new document are then compared to these fingerprints, and whichever category/language comes closest becomes the guess.
If you want to use textcat for something other than languages, the library also provides a handy method to create such fingerprints (i.e. to train on your own dataset).
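For illustration, here is a rough JavaScript sketch of that ranking idea. The functions fingerprint and distance are invented for this post; the real library derives its fingerprints from training files and is not called like this:

// Rough sketch of the n-gram ranking idea (NOT the textcat API itself):
// a fingerprint ranks the most frequent n-grams of a text; a document is
// scored against it by summing rank differences ("out-of-place" measure).
function fingerprint(text, n, limit) {
    var counts = {};
    var s = "_" + text.toLowerCase().replace(/\s+/g, "_") + "_";
    for (var i = 0; i + n <= s.length; i++) {
        var gram = s.substring(i, i + n);
        counts[gram] = (counts[gram] || 0) + 1;
    }
    var grams = [];
    for (var g in counts) {
        grams.push(g);
    }
    grams.sort(function(a, b) { return counts[b] - counts[a]; });
    var ranks = {};
    for (var r = 0; r < grams.length && r < limit; r++) {
        ranks[grams[r]] = r;
    }
    return ranks;
}

function distance(docRanks, catRanks, limit) {
    var d = 0;
    for (var g in docRanks) {
        // n-grams missing from the category fingerprint get the maximum
        // penalty, which effectively "cuts off" the long tail
        d += (g in catRanks) ? Math.abs(docRanks[g] - catRanks[g]) : limit;
    }
    return d;
}

Whichever category fingerprint yields the smallest distance to the document's own ranks becomes the guess.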

I bet this would also work for determining whether a text was written by an Austrian, a Swiss or a German (which would be interesting to find out here at twoday.net). I would even go so far as to say that it would probably work for determining the gender of the author. And I am afraid that "fingerprint" isn't such a far-fetched analogy, and that such an algorithm will, at some not-so-distant point in the future, also be able to determine the actual author of a text, which is a frightening thought.

Addendum: Actually, after reading the paper, I must withdraw my statement that it should be possible to detect the different writing styles of different people. At least not with the distance measure they propose. A distance measure that also effectively considers the "long tail" of the Zipf distribution (i.e. one that doesn't just "cut it off") by applying different weights to the rank differences (i.e. rank differences for top-ranked n-grams should carry more weight) would probably help.
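Purely as a sketch of that idea, building on the hypothetical distance function sketched above, such a weighting could look like this:

// Hypothetical variant: divide each rank difference by the document rank,
// so differences among the top-ranked n-grams dominate the score while
// tail n-grams still contribute a little instead of being cut off.
function weightedDistance(docRanks, catRanks, limit) {
    var d = 0;
    for (var g in docRanks) {
        var catRank = (g in catRanks) ? catRanks[g] : limit;
        d += Math.abs(docRanks[g] - catRank) / (1 + docRanks[g]);
    }
    return d;
}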

Wednesday, 14. December 2005

Handy one-liner to find out the peak demand of your server

# count the requests per second during the 17:00 hour ($5 is the timestamp field)
grep '2005:17:' access.log |  \
  awk '{ print $5 }' |  \
  sort |  \
  uniq -c |  \
  sort -nr |  \
  head

    143 [14/Dec/2005:17:15:43
    138 [14/Dec/2005:17:40:13
    134 [14/Dec/2005:17:39:15
    127 [14/Dec/2005:17:35:17
    127 [14/Dec/2005:17:27:48
    127 [14/Dec/2005:17:07:21
    123 [14/Dec/2005:17:01:36
    118 [14/Dec/2005:17:24:37
    117 [14/Dec/2005:17:20:48
    116 [14/Dec/2005:17:13:09
That is, up to 143 requests per second. Maybe it really is time for the much-praised lighttpd? Has anybody gathered experience with it yet?

Addendum: Equally interesting are the peak loads in terms of bandwidth:
# sum the transferred bytes ($11) per second ($5)
grep '2005:17:' access.log |  \
  awk '{ print $5 " " $11 }' |  \
  awk '{a[$1]+=$2} END {for (j in a) print a[j] " " j}' |  \
  sort -nr |  \
  head

1365589 [14/Dec/2005:17:40:57
1245322 [14/Dec/2005:17:40:12
1129162 [14/Dec/2005:17:41:04
1080311 [14/Dec/2005:17:41:35
1075507 [14/Dec/2005:17:43:31
991233 [14/Dec/2005:17:41:55
980098 [14/Dec/2005:17:41:12
967750 [14/Dec/2005:17:40:10
936295 [14/Dec/2005:17:41:50
929714 [14/Dec/2005:17:41:52
Alternatively, if you write print $2 " " $11 instead of print $5 " " $11, you can very easily find the biggest bandwidth hog (= IP address). With print $8 " " $11 you get the corresponding file.

Fuck you, Arnold!

That needed to be said for once. Otherwise, best regards from the homeland.

Tuesday, 13. December 2005

Statistical Computing with R

Wow, I just discovered that R is featured in a big article on onlamp.com:
-> http://www.onlamp.com/pub/a/onlamp/2005/11/17/r_for_statistics.html

I don't want to brag, but I already got to work with R back in 1998 or so :-)

And that not out of some early foresight, but simply because I happened to attend quite a few courses held by Prof. Leisch. Meanwhile I am writing my diploma thesis (in irregular but nonetheless noticeable steps) with the "Treasurer" himself. Or, more precisely, most of my work on it is done with R.

For the R newcomer, the following reference card is highly recommended:
-> http://cran.r-project.org/doc/contrib/Short-refcard.pdf

And for those still interested in R after that, this book:
-> http://www.amazon.de/...9E5BVH%26camp=2025%26link_code=xm2
