technisches

Wednesday, 3. May 2006

mod_cband

-> Apache 2 Bandwidth Quota and Throttling

useful for:
* throttling googlebots
* rate-limiting overeager downloaders
* containing attacks from a single IP
* ...
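mod_cband is configured per virtual host; a minimal sketch of such a throttling setup might look like this (directive names as documented for mod_cband; all limit values below are made-up examples):

```apache
# Hypothetical example: cap a vhost at 10GB of traffic per 4 weeks,
# throttle it to 1024 kbps / 10 requests per second / 30 open connections,
# and limit each single remote IP to 50 kbps and 3 connections.
<VirtualHost *:80>
    ServerName example.org
    CBandLimit 10G
    CBandPeriod 4W
    CBandSpeed 1024 10 30
    CBandRemoteSpeed 50kb/s 3 3
</VirtualHost>
```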

also recently discovered:
-> flv streaming with lighttpd
-> mod_secdownload (for lighttpd)

Thursday, 20. April 2006

MySQL linkdump

as a follow-up to my previous post, here are some more links:

-> http://meta.wikimedia.org/wiki/Wikimedia_servers#Server_list
1 active master, 6 active slaves; most of the machines have fast disks, RAID-0, and 16GB RAM

-> Status of all Wikipedia MySQL Servers
-> Status of a single Wikipedia MySQL Slave
no swapping! A very low "cached" memory figure, indicating that InnoDB behaves completely differently from MyISAM (or rather from our setup) regarding disk caching. The average system load is below 1!

memory usage with our setup:
[chart: mem-month]

-> Brad Fitzpatrick's Notes on LiveJournal at the MySQLcon 2005
(note: you might have already read his notes from 2004, but these slides are updated and provide some more valuable information)
* "User-Clusters", i.e. livejournal-users are split into several clusters.
* master-master replication-setup
* Use InnoDB. Really. Little bit more config work, but worth it...fast as hell.

Friday, 14. April 2006

MySQL Troubles at twoday.net

To say the least, I've gone through hell over the last couple of days :-)
It started Monday night, when I once again tried to nail down the performance troubles we have been experiencing here at twoday.net. Performance tuning requires knowledge and experience in so many different areas (hardware, Debian, the MySQL database, Helma, twoday) that this task still quite often falls to me within the company.

So, Monday night, with the help of Helma's SQL log files, I could quickly establish that the troubles resided solely on the database side, and that all other troubles (the infamous "maximum thread count reached" message) were just symptoms of a slow database server. I started out from the assumption that the db indexes were not working correctly anymore, since some of the standard SQL statements took an insane amount of time (~8sec). So I tried to rebuild these indexes, to add some new ones and to remove some unused ones: an iterative task that takes hours with large databases. Somewhere around 4a.m. I got the deceptive impression that everything was working fine again, and I went to sleep.

Well, only for a couple of hours, since I was awakened by alerts that twoday.net was extremely slow again, even worse than before. That everything had worked at 4a.m. was simply because there is not much traffic at that hour. Panic! Since we had switched to MySQL 5 (from dotdeb.org) a while ago, I started blaming this move. So I tried all kinds of MySQL 5 binaries from the MySQL website. I then switched back to MySQL 4.0. Nothing helped at all; the SQL statements still took far too long. And it wasn't just a certain kind of statement: even a simple SELECT count(*) FROM AV_TEXT took a second. SHOW PROCESSLIST showed that most of the connections were in the "Locked" state, i.e. waiting to be processed, and that around 3-4 statements were actually being processed, but each taking a couple of seconds. As I mentioned a while ago, we have about 200-300 SQL statements per second, so this meant trouble for sure.

Switching database versions obviously didn't help, so I started to blame the hardware. I moved the database completely to another server. It didn't help either. Same symptoms. More panic!

After I had read through tons of MySQL pages (it was already getting late again), I tried fiddling around with all kinds of server system variables. It had no effect at all, nada.

Finally I gave in and started setting up a MySQL cluster, in order to distribute the queries evenly across two servers. Again it was 4a.m., my mind was not working anymore, and I went to sleep. Just to be awakened a couple of hours later and hear that it hadn't helped either. The cluster setup was fine, but the queries still took 4 seconds instead of 0.04 seconds, so I would have needed a hundred servers.

What really shocked me was that I was not able to bring back a twoday.net setup that worked at least slowly. Whatever I tried ended up in "maximum thread count reached", and lots of unhappy users, whom I can fully understand. Generally speaking, I am just not the type of guy who gives up easily. Hey, I have finished six marathons so far, and none of them was a piece of cake. But this time I gave in.

I went back to the office Wednesday morning with absolutely no idea of what to do next. I started telling Matthias and Axel the full story. And while telling them everything chronologically, with them just asking the right questions, and with a sudden common inspiration, it all became obvious! The twoday database keeps growing and growing, and is now about 1.7GB in size, with the single AV_TEXT table accounting for most of this space (by the way, Wikipedia's content is not much bigger). Hmm, and we've got 2GB RAM at the db server (and also on the other machine I tried out). About 300KB of that is used for key caching, which is why I had always assumed that we have plenty of RAM for the database.

But what is hardly mentioned anywhere, not even in the otherwise excellent O'Reilly book "High Performance MySQL", is that it makes a HUGE difference whether the OS is able to cache your data files or not. With Oracle (according to Axel) this kind of disk caching is handled by the db server itself. With MySQL it is left to the operating system! Therefore you will never find a configuration parameter to adjust it, nor does any MySQL status variable indicate that the OS can no longer cache your files! I was just never aware that the full database had always been completely in memory, and that as soon as this is not possible anymore, there is just no chance of handling 300 queries per second. And honestly, I blame MySQL (or Zawodny, author of the aforementioned book) for not making this point clearer.
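With hindsight, the situation is easy to spot from MySQL itself: compare the total size of your data and index files against the RAM the OS has left for its page cache. A quick check (the table name is taken from this post; the schema name 'twoday' is an assumption, and `information_schema` requires MySQL 5.0+):

```sql
-- How big are data and indexes for the suspect table? If the sum of
-- Data_length and Index_length across all tables exceeds the RAM that
-- is available to the OS page cache, MyISAM reads start hitting the
-- disk and throughput collapses.
SHOW TABLE STATUS LIKE 'AV_TEXT';

-- Or sum it up for the whole database (MySQL 5.0+ only):
SELECT SUM(DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024 AS total_mb
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'twoday';
```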

Solution 1: Make your database smaller!
Solution 2: Order more RAM!
Solution 3: Start thinking about partitioning (a new feature in MySQL 5.1)

@1: Thanks to Axel, who gave me the decisive hint, it was damn easy to cut the twoday database down to nearly half of its size: simply drop the TEXT_RAWCONTENT column and perform the search on the TEXT_TEXT column instead. That is the reason why everything is running so smoothly here at twoday.net again.
@2: An easy solution, but in our case the RAM will not arrive before the middle of next week. So we have to wait for it, and hope solution 1 is good enough until then.
@3: Next Wednesday there is a (free) web presentation on partitioning, which might be interesting: see here.
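Solution 1, as described above, boils down to a single statement (the column and table names are the ones from this post; make a backup first, since the change is irreversible):

```sql
-- Drop the redundant raw-content column; searches run against the
-- TEXT_TEXT column instead, roughly halving the data that the OS
-- page cache has to hold.
ALTER TABLE AV_TEXT DROP COLUMN TEXT_RAWCONTENT;
```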

So, to sum things up, the (painful) lessons learned from this nightmare:
* Always keep your database in memory!
* Start acting more like a team player than a marathon runner! :-)

Thursday, 13. April 2006

tag the net !

We are happy to announce that we were able to increase the quality and performance of our tag extraction algorithm. The improvements have especially benefited the quality of topics for German texts, but the extracted topics for English texts are even more accurate now as well. You might think that this is the old "20% better taste!" blabla, but you need not take my word for it. Go and see for yourself.
-> http://www.tagthe.net/blog/stories/1720596/
-> http://tagthe.net

So, in case you checked out that service a while ago and thought its speed and/or tag quality sucked, please check it out again.

A neat use case is the delicious plugin, for example, which has become a lot more useful than before.
-> http://www.tagthe.net/blog/stories/1342788/

Congrats to the boyz!

Wednesday, 5. April 2006

(Helma) Developer wanted

Well, once again. The company keeps growing and growing.

We offer:
* an exciting, dynamic working environment
* nice colleagues [1,2], and (sometimes) nice bosses [3]
* a company-owned basketball court

We are looking for:
* a talented, smart, communicative developer with several years of experience in web development
* Helma experience is not a knock-out criterion :-)

If you are interested, the best way to apply is to send me your CV.

Friday, 31. March 2006

HowtoForge

-> http://www.howtoforge.com/
-> http://www.howtoforge.com/statistics

HowtoForge is the source for Linux tutorials.

Highly recommended!

Thursday, 30. March 2006

Amazon's Simple Storage Service a.k.a. S3

-> http://aws.amazon.com/s3
-> http://unicast.org/archives/004073.html [via lng]

Truly amazing! And very competitive prices.
  • $0.15 per stored GB per month
  • $0.20 per transferred GB
At €0.16 per GB, the traffic costs are even lower than Hetzner's €0.19. So that's a definite win for Amazon.

Regarding the storage prices, the situation is not that clear. 1 GB of stored data will cost you €1.5 per year at Amazon. Hmm, a SATA hard disk costs about €0.5 per GB, once. Let's say we double that for backup, add some additional RAID fault tolerance, add the necessary hardware around all that, and we will still remain somewhat short of €1.5 per GB (still just one-time costs). But as soon as we start considering personnel expenses for maintaining the hardware (and software such as MogileFS), and consider the fact that Amazon's costs scale perfectly along with your demand (no setup fees, no sunk costs) and that you never pay for any unused storage, then Amazon should win this comparison as well.
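The back-of-the-envelope math above, written out as a quick sketch (the prices are the ones from this post; the exchange rate and the 20% RAID overhead are assumptions):

```javascript
// S3 storage (2006 pricing): $0.15 per stored GB per month.
const usdPerGbMonth = 0.15;
const usdToEur = 0.83; // assumed 2006-ish exchange rate
const s3EurPerGbYear = usdPerGbMonth * 12 * usdToEur;
console.log(s3EurPerGbYear.toFixed(2) + " EUR/GB/year"); // roughly the 1.5 EUR from the post

// Self-hosted: ~0.5 EUR/GB for a SATA disk (one-time), doubled for backup,
// plus an assumed 20% RAID overhead; still a one-time cost below 1.5 EUR/GB.
const diskEurPerGb = 0.5 * 2 * 1.2;
console.log(diskEurPerGb.toFixed(2) + " EUR/GB, one-time");
```

The point of the comparison: the self-hosted number is a one-time cost, while S3's is recurring, so the break-even really hinges on personnel costs and on never paying for unused capacity.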

Getting started with S3 is dead simple with a command line tool named jSh3ll:
-> http://jroller.com/page/silvasoftinc
-> http://www.theserverside.com/...thread_id=39613

Check this out:
-> http://s3.amazonaws.com/michi/
-> http://s3.amazonaws.com/michi/helloworld
-> http://s3.amazonaws.com/michi/test.jpg
-> http://s3.amazonaws.com/michi/test.jpg?torrent (a bittorrent seed!)

So, maybe we will seriously start considering this option. Even if the only reason is that I am just fed up with sitting in between loud & noisy racks :-)
[photo: interxion]
Oh, and speaking of traffic costs: I hope our newest Knallgrau baby, the freshly launched videoblog for mini, does not become successful at all. The Hetzner bill just keeps getting higher and higher :-)
-> http://www.vlogbymini.de/

Wednesday, 22. March 2006

digg.com algorithm - the holy grail?

Question du jour: How does digg.com actually sort their articles on their frontpage?

Despite the number of digg clones out there, I couldn't really find anything particularly useful for reproducing their ordering (see [1] or [2] for example, but neither provided an answer to my question).

Well, how about some simple reverse engineering? A couple of hours ago I took the top 40 stories together with their
  • number of diggs
  • number of comments
  • time in minutes since the article has been posted
Some other possible factors in the digg algorithm, to which I don't have access, might be the
  • number of click-throughs
  • number of views
  • number of non-diggs (that is, a view by a registered user who does not digg the item)
You can download the sorted list here: digg (csv, 2 KB)

First some illustrative charts, and some observations (which might be obvious to regular digg-users):
R> plot(data$diggs, type="l", main="Top 40 Stories at digg.com", ylab="nr of diggs", lwd=2)
R> points(data$comments*10, type="l", col=4)
R> legend(1, 2800, legend=c("diggs", "comments * 10"), text.col=c(1,4))
[chart: digg-diggs]
1. Stories with very very few diggs make it to the absolute top.
2. Story "30" has 6 times more diggs than story "28". Therefore I assume that the "value" of a digg decreases with each additional digg. ln(diggs)?
3. Astonishingly, the number of comments per story is almost exactly one tenth of the number of diggs. Due to this linear dependency between diggs and comments, this variable probably won't be useful for our model.
R> plot(data$time, type="l", main="Top 40 Stories at digg.com", ylab="time in minutes", lwd=2, ylim=c(0,3000))
R> points(data$diggs, col=4, cex=0.5)
R> legend(1, 2800, legend=c("time in min  ", "diggs"), text.col=c(1,4))
[chart: digg-time]
1. There is not a single story younger than 9 hours in this list!
2. There is not a single story older than 40 hours in this list.
3. This narrow bandwidth lets one assume that digg.com uses some sort of time thresholds for entering the list.
4. Stories 21 and 22 are nearly the same age, but story 21, with 25% fewer diggs than story 22, is still ranked closer to the top!?
5. Stories 20 and 22 have nearly the same number of diggs, but story 20 is nearly twice as old as story 22 and is still ranked closer to the top!? This and the previous finding strongly indicate that the model contains additional variables beyond diggs and time, but unfortunately I don't have access to them.
6. Stories 21 to 40 are nicely sorted according to their age (with just a few exceptions). Maybe this is a coincidence, but it seems that age plays a more important role in the sorting of the older articles.

Then I tried to factor these findings into a simple model, and came up with the following:
rank = log(diggs) / time^2
R> plot(data$logdiggs * data$timepenalty^2, main="log(diggs) / time^2", ylab="", lwd=2, cex=1, pch=4)
[chart: digg-model]
In a true model the chart above would show a strictly decreasing path. Obviously I am unable to explain why the first five or six stories made it to the top with that few diggs.
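For the record, the toy model rank = log(diggs) / time^2 can be written down in a few lines (the stories below are made-up illustrations, not the actual digg data):

```javascript
// Hypothetical scoring model: the value of each additional digg shrinks
// logarithmically, while age penalizes the score quadratically.
function score(diggs, ageHours) {
  return Math.log(diggs) / (ageHours * ageHours);
}

// Made-up sample stories: diggs and age in hours.
const stories = [
  { id: "a", diggs: 900, age: 30 },
  { id: "b", diggs: 1500, age: 12 },
  { id: "c", diggs: 400, age: 10 },
];

// Sort descending by score: the youngest story wins despite having
// the fewest diggs, because the time penalty dominates.
stories.sort((x, y) => score(y.diggs, y.age) - score(x.diggs, x.age));
console.log(stories.map(s => s.id).join(",")); // → "c,b,a"
```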

Despite that, let's see how good our model actually is:
R> lm1 <- lm(log(data$score) ~ data$logdiggs + log(1/data$timepenalty^2))
R> summary(lm1)

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                -9.3065     1.7274  -5.388 4.24e-06 ***
data$logdiggs               0.8771     0.1032   8.502 3.14e-10 ***
log(1/data$timepenalty^2)   0.4521     0.1130   4.003 0.000289 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.4859 on 37 degrees of freedom
Multiple R-Squared: 0.7063,     Adjusted R-squared: 0.6905 
F-statistic:  44.5 on 2 and 37 DF,  p-value: 1.429e-10 
Well, it's still far from perfect, but according to these results I am at least able to explain 69% of digg's sorting. I guess most of the remaining 31% depends on the number of non-diggs. I would highly appreciate it if anyone could point me to articles/papers that provide some further insight into digg's/reddit's/yigg's mechanism.

Maybe it's all not that complex anyways.

Update: Shame on me. Actually it is really not that complex at all, as joshrt pointed out in the comments here. And obviously I am not a regular digg user, otherwise I would have known better. The sort order always stays the same! So, as soon as an article makes it to the frontpage, digging does not change anything? Then why should anyone digg anything on the frontpage? Just for the top stories? Does reddit work the same way?
So the new question is: what determines whether an article makes it to the frontpage or not? Is it really just an absolute number of diggs? Does it matter who diggs it? Does a fixed number of articles enter the frontpage within a certain time period? Why do some articles take so long to make it to the frontpage? What is the algorithm behind that?
As you can see, I am still interested in finding out how that works.

Friday, 17. March 2006

utf-8 urls

soon available at your favorite local weblog hoster
-> https://github.com/antville/helma/pipermail/helma-user/2006-March/006428.html

I'm as happy as a little kid right now

Addendum:
utf-8 urls have long been a non-issue in a mod_jk setup, and nobody told me :-)
-> https://github.com/antville/helma/pipermail/helma-user/2006-March/006437.html
I'm still very happy about it anyway.

Thursday, 16. March 2006

Scaling with Ruby on Rails (and Helma)

Part I of an interesting series of 4 articles on scaling eins.de, which is powered by Ruby on Rails:
-> the adventures of scaling

1. The old codebase roughly consisted of around 50.000 lines of PHP code (plus a closed-source CMS that’s not included in this calculation). We’ve rewritten most of it (some features were left out on purpose) in about 5.000 lines of Rails code.

2. eins.de serves about 1.2 million dynamic page impressions on a good day.

3. The (4) application servers are dual Xeon 3.06GHz, 2GB RAM, SCSI U320 HDDs RAID-1. The (2) database servers are dual Xeon 3.06GHz, 4GB RAM, SCSI U320 HDDs RAID-1. The proxy server is a single P4 3.0GHz, 2GB RAM, SCSI U320 HDDs RAID-1.

4. At peak times about 20Mbit/s leave the proxy server’s ethernet interface.

I'm seriously impressed by these traffic-numbers, and also by the refactoring. Congrats to eins.de!
In order to make a comparison with Helma's performance, I will provide the corresponding numbers for twoday.net:

1. The codebase consists of 18.000 lines of JS-code (ongoing refactoring brought it currently down to 12.000 lines; hopefully this number will decline some more).

2. On a "good" day twoday.net serves about 400.000 (true) PIs; together with 150.000 RSS requests and the (also dynamically served) css and js files, we have 1.6 million dynamic requests per day, all handled by Helma.

3. 1 application server, a dual Xeon 2.8GHz with 2GB RAM, and 1 database server, a dual Xeon 3.06GHz with 2GB RAM. Additionally there is now a third server for serving the static content.

4. Since twoday.net consists mostly of textual content, our throughput never exceeds 1.5 MBit/s.
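To put point 2 into perspective, 1.6 million dynamic requests per day is a deceptively small per-second figure on average (peak load is of course much higher):

```javascript
// Average request rate implied by the daily total from the post.
const requestsPerDay = 1600000;
const secondsPerDay = 24 * 60 * 60;
const avg = requestsPerDay / secondsPerDay;
console.log(avg.toFixed(1) + " requests/second on average"); // ≈ 18.5
```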

Currently we haven't reached our limit with this hardware setup. (As a side note: the troubles we experienced over the last two months were related to some nightly cronjob activities, like logfile splitting and generating statistics.) The maximum system load of the web server is 5, that of the db server 6 (and that of the static server 0.2 :-)).

I know that each web application has its own characteristics, and that it is simplistic to compare these numbers without taking a closer look at the applications themselves. Still, I wanted to make a clear statement that Helma's performance is absolutely amazing! We here at Knallgrau can concentrate on implementing new functionality without having to worry about performance/caching at all [as long as we follow my five golden rules :-)]. It makes development so much more fun, and saves us lots of time.

My five golden rules for Helma developers
  • Cover each and every collection with a database index.
  • Stay away from large JavaScript arrays (use Java Sets or Hashtables for such purposes).
  • Stay away from massive string concatenations; '+=' in particular seems to perform poorly.
  • Stay away from looping through collections just to apply some filtering or sorting. Try to use distinct collections instead.
  • Keep the read/write ratio for SQL operations high. I.e. writing something to the database on every request is a very bad idea. If it is really necessary, try to do the updates/inserts in bulk (as in twoday).
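As an illustration of the third rule, the usual remedy is to collect fragments in an array and join once at the end, instead of growing a string with '+=' (a generic sketch, not actual twoday code):

```javascript
// Naive: each '+=' may copy the whole accumulated string again,
// which gets quadratically expensive for long outputs.
function renderNaive(items) {
  var html = "";
  for (var i = 0; i < items.length; i++) {
    html += "<li>" + items[i] + "</li>";
  }
  return html;
}

// Better: push fragments and concatenate once with join().
function renderJoined(items) {
  var parts = [];
  for (var i = 0; i < items.length; i++) {
    parts.push("<li>", items[i], "</li>");
  }
  return parts.join("");
}

console.log(renderJoined(["a", "b"])); // → "<li>a</li><li>b</li>"
```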
So, that is actually my version of a 4-part article series on performance-tuning a Helma application :-)

About michi

Michi, a.k.a. Michael Platzer, is one of the Knallgraus, a Vienna-based new media agency that deals more and more with 'stuff' commonly termed social software.

Meet my fellow bloggers at Planet Knallgrau.

