Extracting Information from StudiVZ

Some time ago somebody stole 1 million data records from StudiVZ, the German Facebook clone. I’m not exactly sure why people call the person who took the data a hacker, because it appears he simply wrote a tool that harvested data that is publicly available on StudiVZ (everyone with an account can view it).

People on StudiVZ share all their data by default, unlike on Facebook, which values a person’s private data a lot more. So by simply opening each profile as a dummy user and processing the HTML that StudiVZ returns, one can extract a great deal of information about random people who probably don’t even know about it or don’t care. That is why I’m not sure about the “stealing” part.

Apparently there are some captchas when you start browsing search results beyond a few pages. I guess that is where the hacking part comes in, because getting around a captcha probably constitutes hacking. Maybe?

Anyway, I think part of the media coverage is a bit ridiculous, because anyone can write a simple harvester in an hour or two. It took me one and a half hours, and I didn’t really have a clue about this stuff beforehand, so I think I’m on the safe side with this estimate.

Since I don’t want to “hack”, I’ve only written a very tame harvester. It connects to your personal StudiVZ account, and retrieves the name and profile ID (and thus profile URL) of all your friends in the “Meine Freunde” pages.

It could do a lot more, like retrieving everybody’s birthday or random pictures, but I’m too lazy to code that: you use the same pattern for extracting data over and over again, and it stops being interesting quite fast.
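To give an idea of what that extraction pattern looks like, here is a minimal sketch. It is not the code from the download. The link format `/Profile/<id>` and the names `FriendListSketch`, `Friend` and `ExtractFriends` are assumptions made up for illustration; the real “Meine Freunde” markup may look different, in which case only the regular expression needs to change.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical container for one entry of the friends list.
public sealed class Friend
{
    public string Id { get; }
    public string Name { get; }

    public Friend(string id, string name)
    {
        Id = id;
        Name = name;
    }
}

public static class FriendListSketch
{
    // ASSUMPTION: each friend appears as <a href="/Profile/ID" ...>Name</a>.
    // This format is invented for the example, not taken from StudiVZ.
    private static readonly Regex FriendLink = new Regex(
        "<a href=\"/Profile/(?<id>[A-Za-z0-9_-]+)\"[^>]*>(?<name>[^<]+)</a>");

    // Returns one Friend per matching link, in document order.
    public static List<Friend> ExtractFriends(string html)
    {
        var friends = new List<Friend>();
        foreach (Match match in FriendLink.Matches(html))
        {
            friends.Add(new Friend(
                match.Groups["id"].Value,
                match.Groups["name"].Value.Trim()));
        }
        return friends;
    }
}
```

Birthdays, pictures and so on would each need just another expression of this kind applied to another page, which is exactly why it gets boring quickly.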

You can download the project here. It is a one-file C# project. I’m releasing it under the GPL (whatever).

It’s really easy to explain how it works:

It uses System.Net’s HttpWebRequest and HttpWebResponse to get (and post) web pages.
StudiVZ (like every other portal) uses cookies, so I create a CookieContainer and use it in every HTTP request.
There are a few hidden values that StudiVZ expects during login. I retrieve them from the main page using custom-built regular expressions. I’ve found a handy AJAX tester for .NET regular expressions, which was really useful for building and debugging the expressions. (BTW, you can find all the URLs I used in the comments.)
After login I use the same pattern: get page & parse using regex for everything.
Visual Studio has an awesome “HTML Visualizer” for strings. It displays the content of a string as an HTML page, which is really nifty if you’re doing anything related to HTML processing.
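Roughly, the pieces above fit together like the following sketch. Again, this is an illustration under assumptions, not the downloadable code: the class and method names are made up, and the hidden-input expression assumes double-quoted attributes in the order type, name, value, which may not match the real login page.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

public static class HarvesterSketch
{
    // One container shared by every request, so session cookies survive between pages.
    private static readonly CookieContainer Cookies = new CookieContainer();

    // GET a page and return its body as a string.
    public static string GetPage(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.CookieContainer = Cookies;
        return ReadBody(request);
    }

    // POST form fields (URL-encoded) and return the response body.
    public static string PostForm(string url, IDictionary<string, string> form)
    {
        var pairs = new List<string>();
        foreach (var field in form)
        {
            pairs.Add(Uri.EscapeDataString(field.Key) + "=" + Uri.EscapeDataString(field.Value));
        }
        byte[] body = Encoding.UTF8.GetBytes(string.Join("&", pairs));

        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.ContentLength = body.Length;
        request.CookieContainer = Cookies;
        using (var stream = request.GetRequestStream())
        {
            stream.Write(body, 0, body.Length);
        }
        return ReadBody(request);
    }

    private static string ReadBody(HttpWebRequest request)
    {
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    // Collect name/value pairs of hidden inputs.
    // ASSUMPTION: attributes are double-quoted and ordered type, name, value.
    public static Dictionary<string, string> ExtractHiddenFields(string html)
    {
        var fields = new Dictionary<string, string>();
        var pattern = new Regex(
            "<input[^>]*type=\"hidden\"[^>]*name=\"(?<name>[^\"]*)\"[^>]*value=\"(?<value>[^\"]*)\"",
            RegexOptions.IgnoreCase);
        foreach (Match match in pattern.Matches(html))
        {
            fields[match.Groups["name"].Value] = match.Groups["value"].Value;
        }
        return fields;
    }
}
```

A login would then be: call `GetPage` on the main page, run `ExtractHiddenFields` on the result, add the e-mail and password fields to the dictionary, and hand it to `PostForm`. Because every request shares the same `CookieContainer`, the pages fetched afterwards are served as the logged-in user.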

The code is quite ugly. Well, it’s not production code and this is only meant as a proof of concept.

Also note that I have at most violated StudiVZ’s terms and conditions (AGB) and have not committed any criminal acts. I’m not planning to sell my friends’ profile IDs or data either :-)

Maybe someone can extend the code and make it more useful. I guess it would be fun to automatically download all your pictures (including tags) and feed them into Flickr or Picasa… but someone else can do that.

Cheers, Andreas