<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>BlackHC's Adventures in the Dev World &#187; StudiVZ</title>
	<atom:link href="http://blog.blackhc.net/tag/studivz/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.blackhc.net</link>
	<description>Just another weblog</description>
	<lastBuildDate>Mon, 06 Sep 2010 11:19:23 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Extracting Information from StudiVZ</title>
		<link>http://blog.blackhc.net/2009/10/extracting-information-from-studivz/</link>
		<comments>http://blog.blackhc.net/2009/10/extracting-information-from-studivz/#comments</comments>
		<pubDate>Mon, 19 Oct 2009 18:34:42 +0000</pubDate>
		<dc:creator>BlackHC</dc:creator>
				<category><![CDATA[Coding]]></category>
		<category><![CDATA[Personal Rantings]]></category>
		<category><![CDATA[University]]></category>
		<category><![CDATA[Web]]></category>
		<category><![CDATA[Profiles]]></category>
		<category><![CDATA[Regex]]></category>
		<category><![CDATA[StudiVZ]]></category>

		<guid isPermaLink="false">http://blog.blackhc.net/?p=678</guid>
		<description><![CDATA[Some time ago somebody stole 1 million data records from StudiVZ, the German Facebook clone. I&#8217;m not exactly sure why people call the person a hacker who stole data, because it appears he simply wrote a tool that harvested the &#8230; <a href="http://blog.blackhc.net/2009/10/extracting-information-from-studivz/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Some time ago somebody stole 1 million data records from <a href="http://www.studivz.net/" target="_blank">StudiVZ</a>, the German Facebook clone. I&#8217;m not exactly sure why people call the person who took the data a hacker, because it appears he simply wrote a tool that harvested data that anyone with a StudiVZ account can view.</p>
<p>People on StudiVZ share all their data by default, unlike on Facebook, which values a person&#8217;s private data a lot more. So by simply opening each profile from a dummy account and processing the HTML that StudiVZ returns, one can extract a lot of information about random people who probably don&#8217;t even know about it or don&#8217;t care, so I&#8217;m not sure about the stealing part.</p>
<p>Apparently there are some captchas when you browse search results beyond a few pages. I guess that is where the hacking part comes in, because getting around a captcha probably constitutes hacking, maybe?</p>
<p>Anyway, I think part of the media coverage is a bit ridiculous because anyone can write a simple harvester in an hour or two. It took me one and a half hours, and I didn&#8217;t really have a clue about this stuff before, so I think I&#8217;m on the safe side with this estimate.</p>
<p>Since I don&#8217;t want to &#8220;hack&#8221;, I&#8217;ve only written a very tame harvester. It logs into your personal StudiVZ account and retrieves the name and profile ID (and thus profile URL) of all your friends from the &#8220;Meine Freunde&#8221; (&#8220;My Friends&#8221;) pages.</p>
<p>It could do a lot more with that, like retrieving everybody&#8217;s birthday or random pictures, but I&#8217;m too lazy to code that: you use the same pattern for extracting data over and over again, and it stops being interesting quite fast.</p>
<p>You can download the project <a href="http://blog.blackhc.net/wp-content/uploads/2009/10/StudiVZExtractor.zip" target="_blank">here</a>. It is a single-file C# project. I&#8217;m releasing it under the GPL (whatever).</p>
<p>It&#8217;s really easy to explain how it works:</p>
<ul>
<li>It uses <strong>System.Net</strong>&#8217;s <strong>HttpWebRequest</strong> and <strong>HttpWebResponse</strong> to get (and post) web pages.</li>
<li>StudiVZ (like every other portal) uses cookies, so I create a <strong>CookieContainer</strong> and use it in every HTTP request.</li>
<li>There are a few hidden values that StudiVZ expects during login. I&#8217;m retrieving them from the main page using custom-built regular expressions. I&#8217;ve found a <a href="http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx" target="_blank">handy AJAX tester for .NET regular expressions</a>. It was really useful for building the expressions and debugging them. (BTW you can find all URLs I used in the comments.)</li>
<li>After login I use the same pattern for everything: get the page &amp; parse it using regex.</li>
<li>Visual Studio has an awesome &#8220;HTML Visualizer&#8221; for strings. It displays the content of a string as an HTML page, which is really nifty if you&#8217;re doing anything related to HTML processing.</li>
</ul>
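<p>The steps in the list above can be sketched roughly as follows. This is a minimal Python stand-in, not the C# code from the project: <strong>CookieJar</strong> plays the role of <strong>CookieContainer</strong>, and the sample page, the field names (<strong>formkey</strong>, <strong>iv</strong>) and the regular expression are made up for illustration and do not reflect the real StudiVZ markup. It makes no network requests.</p>

```python
import re
import urllib.request
from http.cookiejar import CookieJar

# Hypothetical login-page fragment; field names and values are invented
# for illustration and are NOT StudiVZ's real markup.
SAMPLE_LOGIN_PAGE = '''
<form action="/Login" method="post">
  <input type="hidden" name="formkey" value="a1b2c3" />
  <input type="hidden" name="iv" value="42" />
  <input type="text" name="email" value="" />
</form>
'''

# Matches one hidden input whose attributes appear in the order
# type, name, value (an assumption about the page layout).
HIDDEN_INPUT = re.compile(
    r'<input\s+type="hidden"\s+name="(?P<name>[^"]+)"\s+value="(?P<value>[^"]*)"'
)


def make_opener():
    """Build a URL opener that keeps cookies across requests."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def extract_hidden_fields(html):
    """Return a dict mapping each hidden input's name to its value."""
    return {m.group("name"): m.group("value") for m in HIDDEN_INPUT.finditer(html)}


# The opener would be reused for every request so the session cookie sticks.
opener, jar = make_opener()

# Parse the sample page instead of fetching a real one.
fields = extract_hidden_fields(SAMPLE_LOGIN_PAGE)
print(fields)  # {'formkey': 'a1b2c3', 'iv': '42'}
```

<p>In a real run, the extracted values would be posted back together with the login credentials through the same opener, and every later page would go through the same get-and-parse step with a different expression.</p>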
<p>The code is quite ugly. Well, it&#8217;s not production code and this is only meant as a proof of concept.</p>
<p>Also note that I have at most violated StudiVZ&#8217;s terms and conditions (AGB) and not committed any criminal acts, and I&#8217;m not planning to sell my friends&#8217; profile IDs or data either <img src='http://blog.blackhc.net/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>Maybe someone can extend the code and make it more useful. I guess it would be fun to automatically download all your pictures (including tags) and feed them into Flickr or Picasa&#8230; but someone else can do that.</p>
<p>Cheers,<br />
Andreas</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.blackhc.net/2009/10/extracting-information-from-studivz/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
