Friday, 31. March 2006

HowtoForge

-> http://www.howtoforge.com/
-> http://www.howtoforge.com/statistics

HowtoForge is the source for Linux tutorials.

Highly recommended!

Thursday, 30. March 2006

Amazon's Simple Storage Service a.k.a. S3

-> http://aws.amazon.com/s3
-> http://unicast.org/archives/004073.html [via lng]

Truly amazing! And very competitive prices.
  • $0.15 per stored GB per month
  • $0.20 per transferred GB
At €0.16 per GB, the traffic costs are even lower than Hetzner's €0.19, so that's a definite win for Amazon.

Regarding the storage prices, the situation is not that clear. 1 GB of stored data will cost you €1.50 per year at Amazon. A SATA hard disk, on the other hand, costs about €0.50 per GB, once. Let's say we double that for backup, add some additional RAID fault tolerance, add the necessary hardware around all that, and we still remain somewhat short of €1.50 per GB (still just one-time costs). But as soon as we start considering the personnel expenses for maintaining the hardware (and software such as MogileFS), and take into account that Amazon's costs scale perfectly with demand (no setup fees, no sunk costs) and that you never pay for unused storage, then Amazon should win this comparison as well.
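
As a quick back-of-the-envelope check, here is that arithmetic in R (the exchange rate of roughly 0.8 EUR per USD is my assumption):

R> usd2eur <- 0.8           # assumed USD-to-EUR exchange rate
R> 0.15 * 12 * usd2eur      # S3 storage: ~1.44 EUR per stored GB per year
R> 0.20 * usd2eur           # S3 traffic: ~0.16 EUR per transferred GB
R> 0.5 * 2                  # own SATA disk incl. backup copy: ~1 EUR per GB, one-time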

Getting started with S3 is dead simple with a command line tool named jSh3ll:
-> http://jroller.com/page/silvasoftinc
-> http://www.theserverside.com/...thread_id=39613

Check this out:
-> http://s3.amazonaws.com/michi/
-> http://s3.amazonaws.com/michi/helloworld
-> http://s3.amazonaws.com/michi/test.jpg
-> http://s3.amazonaws.com/michi/test.jpg?torrent (a bittorrent seed!)
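
Since every public object is just a plain HTTP resource, fetching one takes nothing more than a GET; for instance from R (a minimal sketch, using the helloworld object listed above):

R> readLines("http://s3.amazonaws.com/michi/helloworld")   # plain HTTP GET of the object body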

So, maybe we will seriously start considering this option, even if the only reason is that I am just fed up with sitting between loud & noisy racks :-)
[photo: Interxion]
Oh, and speaking of traffic costs: I hope our newest Knallgrau baby, the freshly launched videoblog for mini, does not become successful at all. The Hetzner bill just keeps getting higher and higher :-)
-> http://www.vlogbymini.de/

Wednesday, 29. March 2006

Twoday's Top 10 Stories (by the number of distinct commenters)

  1. 57 http://fsarch.twoday.net/stories/97355
  2. 43 http://missunderstood.twoday.net/stories/485393
  3. 41 http://humanarystew.twoday.net/stories/1618003
  4. 40 http://missunderstood.twoday.net/stories/476093
  5. 38 http://aurisa.twoday.net/stories/1343238
  6. 37 http://missunderstood.twoday.net/stories/184513
  7. 37 http://gluecklich.twoday.net/stories/285501
  8. 35 http://derbaron.twoday.net/stories/1655343
  9. 34 http://missunderstood.twoday.net/stories/250233
  10. 34 http://missunderstood.twoday.net/stories/595975
wow! 5 out of 10. little missy wins it all.

Until today I didn't know that there is a "select count(distinct value) from ..", which comes in very handy for generating such a list. Since stories and their comments live in the same AV_TEXT table, a self-join counts the distinct comment authors per story:

mysql> select t1.TEXT_ID, t1.TEXT_F_SITE,
    ->        count(distinct t2.TEXT_F_USER_CREATOR) as cnt
    ->   from AV_TEXT t1, AV_TEXT t2
    ->  where t2.TEXT_F_TEXT_STORY = t1.TEXT_ID
    ->  group by t1.TEXT_ID
    ->  order by cnt desc
    ->  limit 10;

Tuesday, 28. March 2006

moblogs in a comparison test

-> http://www.xonio.com/artikel/x_artikel_19200194.html

A laudable exception
We made our first attempt to post a blog entry from a mobile phone with free blog providers such as 20six.de, Blogger.com and Twoday.net. However, posting to a blog via mobile phone worked completely smoothly with only one provider. At Twoday.net, registration and setting up a weblog were done quickly, and we immediately received precise instructions. We were also assigned our own e-mail address to which we could send the files from the phone. We tried it out, and after only a few minutes the picture and comment were online.
[via basic thinking]

Monday, 27. March 2006

video surveillance - angela merkel

Four weeks ago, I suggested a video surveillance project targeting the private homes of politicians
-> http://michi.knallgrau.at/blog/stories/1638172/
in order to raise lawmakers' awareness of the problem.

And now it turns out that Angela Merkel's private apartment wasn't all that "private" over the past eight years. As far as I understand, this has not been exploited at all, except for some security staff zooming in from time to time. Still, this incident might trigger some thought process among politicians and help protect privacy a bit more.
-> http://www.spiegel.de/...408015,00.html
-> http://www.bild.t-online.de/.../merkel-wohnzimmer-kamera.html
[links via orf.at]

Update: another Schneier article on "the future of privacy" (already a couple of weeks old, though)
-> http://www.schneier.com/blog/archives/2006/03/the_future_of_p.html

Wednesday, 22. March 2006

digg.com algorithm - the holy grail?

Question du jour: How does digg.com actually sort their articles on their frontpage?

Despite the number of digg clones out there, I couldn't really find anything particularly useful for reproducing their ordering (see [1] or [2] for example; neither provided an answer to my question).

Well, how about some simple reverse engineering? A couple of hours ago I took the top 40 stories together with their
  • number of diggs
  • number of comments
  • time in minutes since the article has been posted
Some other possible factors in the digg algorithm, to which I don't have access, might be the
  • number of click-throughs
  • number of views
  • number of non-diggs (that is, a view by a registered user who does not digg the item)
You can download the sorted list from here: digg (csv, 2 KB)
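
For reference, a sketch of how the data frame used below could be set up; the file name and the derived columns logdiggs, timepenalty and score are my assumptions, based on how they are used in the snippets that follow:

R> data <- read.csv("digg.csv")             # assumed file name for the list above
R> data$logdiggs    <- log(data$diggs)      # assumption: log-valued diggs (see observation 2 below)
R> data$timepenalty <- 1/data$time          # assumption: inverse of the story age in minutes
R> data$score       <- nrow(data):1         # assumption: score derived from the observed rank (rank 1 = highest)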

First, some illustrative charts and some observations (which might be obvious to regular digg users):
R> plot(data$diggs, type="l", main="Top 40 Stories at digg.com", ylab="nr of diggs", lwd=2)
R> points(data$comments*10, typ="l", col=4)
R> legend(1, 2800, legend=c("diggs", "comments * 10"), text.col=c(1,4))
[chart: number of diggs and comments × 10 for the top 40 stories]
1. Stories with very, very few diggs make it to the absolute top.
2. Story "30" has six times more diggs than story "28". Therefore I assume that the "value" of a digg decreases with each additional digg; ln(diggs)?
3. Astonishingly, the number of comments per story is almost exactly one tenth of the number of diggs. Due to this linear dependency between diggs and comments, this variable probably won't be useful for our model.
R> plot(data$time, type="l", main="Top 40 Stories at digg.com", ylab="time in minutes", lwd=2, ylim=c(0,3000))
R> points(data$diggs, col=4, cex=0.5)
R> legend(1, 2800, legend=c("time in min  ", "diggs"), text.col=c(1,4))
[chart: story age in minutes and diggs for the top 40 stories]
1. There is not a single story younger than 9 hours in this list!
2. There is not a single story older than 40 hours in this list.
3. This narrow bandwidth lets one assume that digg.com uses some sort of time thresholds for entering the list.
4. Stories 21 and 22 are nearly the same age, but story 21 is still ranked higher despite having 25% fewer diggs than story 22!?
5. Stories 20 and 22 have nearly the same number of diggs, but story 20 is nearly twice as old as story 22 and is still ranked higher!? This and the previous finding strongly indicate that the model includes additional variables beyond diggs and time, but unfortunately I don't have access to them.
6. Stories 21 to 40 are nicely sorted by age (with just a few exceptions). Maybe this is a coincidence, but it seems that age plays a more important role in the sorting of the older articles.

Then I tried to factor these findings into a simple model, and came up with the following:
rank = log(diggs) / time^2
R> plot(data$logdiggs * data$timepenalty^2, main="log(diggs) / time^2", ylab="", lwd=2, cex=1, pch=4)
[chart: log(diggs) / time^2 for the top 40 stories]
With a true model, the chart above would show a strictly decreasing path. Obviously I am unable to explain why the first five or six stories made it to the top with so few diggs.
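
As a quick, regression-free sanity check, one could compare the model score with the observed front-page order via a Spearman rank correlation (a sketch using the data frame from above; a value close to -1 would mean the model reproduces digg's ordering):

R> modelscore <- data$logdiggs * data$timepenalty^2            # the model: log(diggs) / time^2
R> cor(modelscore, seq_len(nrow(data)), method = "spearman")   # -1 = perfect match with the observed order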

Regardless, let's see how good our model actually is:
R> lm1 <- lm(log(data$score) ~ data$logdiggs + log(1/data$timepenalty^2))
R> summary(lm1)

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                -9.3065     1.7274  -5.388 4.24e-06 ***
data$logdiggs               0.8771     0.1032   8.502 3.14e-10 ***
log(1/data$timepenalty^2)   0.4521     0.1130   4.003 0.000289 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.4859 on 37 degrees of freedom
Multiple R-Squared: 0.7063,     Adjusted R-squared: 0.6905 
F-statistic:  44.5 on 2 and 37 DF,  p-value: 1.429e-10 
Well, it's still far from perfect, but according to these results I can at least explain 69% of the variance in digg's sorting. I guess most of the remaining 31% depends on the number of non-diggs. I would highly appreciate it if anyone could point me to articles/papers that provide further insight into digg's/reddit's/yigg's mechanisms.

Maybe it's all not that complex anyway.

Update: Shame on me. Actually it is really not that complex at all, as joshrt pointed out in the comments here. And obviously I am not a regular digg user, otherwise I would have known better. The sort order always stays the same! So, once an article makes it to the frontpage, digging doesn't change anything? Then why should anyone digg anything on the frontpage? Just for the top stories? Does reddit work the same way?
So the new question is: what determines whether an article makes it to the frontpage or not? Is it really just an absolute number of diggs? Does it matter who diggs it? Does a fixed number of articles enter the frontpage within a certain time period? Why do some articles take so long to make it to the frontpage? What is the algorithm behind that?
As you can see, I am still interested in finding out how this works.
