Category Archives: Web

T61 Extractor

I’ve been an avid fan of thesixtyone.com – until they changed the design. The old site can still be found here: http://old.thesixtyone.com/.

Many of my favorite songs are from this site and I have only been able to listen to them through the site. However, since the aforementioned design change, the site has really died down in my opinion, and I have been afraid ever since that it would simply go down some day and vanish – taking my songs and playlists with it.

As a countermeasure I’ve written a Java console application to extract playlists and songs from thesixtyone. It uses Selenium to remote-control a Firefox instance, which goes through the normal user interface to play songs and read in playlists.

I’ve uploaded the code to launchpad: https://launchpad.net/t61extractor/trunk
I’ve tested and used it myself and it runs alright.

This was a small weekend project (or rather a two-weekend project) that I did a few months ago, but I only found time now to write about it.

The code should be mostly self-explanatory and it’s not a lot of code either.

Cheers,
Andreas

Extracting Information from StudiVZ

Some time ago somebody stole 1 million data records from StudiVZ, the German Facebook clone. I’m not exactly sure why people call the person who stole the data a hacker, because it appears he simply wrote a tool that harvested the publicly available data from StudiVZ (which everyone with an account can view).

People on StudiVZ share all their data by default, contrary to Facebook, which values a person’s private data a lot more. Thus, by simply opening each profile as a dummy user and processing the HTML from StudiVZ, one can extract a great deal of information about random people who probably don’t even know about it or don’t care. So I’m not sure about the stealing part.

Apparently there are some captchas when you start browsing searches beyond a few pages. I guess that is where the hacking part comes in, because getting around a captcha probably constitutes hacking, maybe?

Anyway, I think part of the media coverage is a bit ridiculous, because anyone can write a simple harvester in an hour or two. It took me one and a half hours, so I think I’m on the safe side with this estimate, and I didn’t really have a clue about this stuff before either.

Since I don’t want to “hack”, I’ve only written a very tame harvester. It connects to your personal StudiVZ account, and retrieves the name and profile ID (and thus profile URL) of all your friends in the “Meine Freunde” pages.

It could do a lot more, like retrieving everybody’s birthday or random pictures, but I’m too lazy to code that: you use the same pattern for extracting data over and over again and it stops being interesting quite fast.

You can download the project here. It is a single-file C# project. I’m releasing it under the GPL (whatever).

It’s really easy to explain how it works:

  • It uses System.Net’s HttpWebRequest and HttpWebResponse to get (and post) web pages.
  • StudiVZ (like every other portal) uses cookies, so I create a CookieContainer and use it in every HTTP request.
  • There are a few hidden values that StudiVZ expects during login. I’m retrieving them from the main page using custom-built regular expressions. I’ve found a handy AJAX tester for .NET regular expressions. It was really useful for building the expressions and debugging them. (BTW you can find all URLs I used in the comments.)
  • After login I use the same pattern: get page & parse using regex for everything.
  • Visual Studio has an awesome “HTML Visualizer” for strings. It displays the content of a string as HTML page, which is really nifty if you’re doing anything related to HTML processing.
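
The pattern from the list above can be sketched roughly like this. Note that this is a Python stand-in and not the C# project itself: the field name `formkey`, the `/Profile/...` link format and the sample HTML are made up for illustration, and the regular expressions assume a fixed attribute order.

```python
import re
import urllib.request
from http.cookiejar import CookieJar

# Matches <input type="hidden" name="..." value="..."> (attribute order assumed).
HIDDEN_INPUT = re.compile(
    r'<input[^>]*type="hidden"[^>]*name="([^"]+)"[^>]*value="([^"]*)"',
    re.IGNORECASE,
)
# Hypothetical friend-list markup: <a href="/Profile/ID">Name</a>
FRIEND_LINK = re.compile(r'<a href="/Profile/([^"/]+)"[^>]*>([^<]+)</a>')


def make_opener():
    """Build an opener that keeps cookies across requests (the CookieContainer role)."""
    return urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar())
    )


def hidden_fields(html):
    """Return the hidden form fields of a page as a dict."""
    return dict(HIDDEN_INPUT.findall(html))


def friends(html):
    """Return (profile_id, name) pairs found in a friend-list page."""
    return [(pid, name.strip()) for pid, name in FRIEND_LINK.findall(html)]


# Made-up sample pages instead of real network requests.
login_page = '<form><input type="hidden" name="formkey" value="abc123"></form>'
friends_page = '<li><a href="/Profile/x1Y2">Erika Mustermann</a></li>'
print(hidden_fields(login_page))
print(friends(friends_page))
```

In a real run the pages would come from `make_opener().open(url)`, so that the login cookies are sent along with every later request.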

The code is quite ugly. Well, it’s not production code and this is only meant as a proof of concept.

Also note that I have at most violated the AGB (terms and conditions) of StudiVZ and not committed any criminal acts, and I’m not planning to sell my friends’ profile IDs or data either :-)

Maybe someone can extend the code and make it more useful. I guess it would be fun to automatically download all your pictures (including tags) and feed them into flickr or picasa… but someone else can do that.

Cheers,
Andreas

How to make WordPress display Full Text in RSS Feeds

… and why WordPress Moderators are obviously idiots

Today I got an email asking me to enable full-text RSS feeds. So far so good, only when I activated the “Full Text” option nothing changed – neither in Firefox, nor in Google Reader, nor anywhere else. Only in Feed Proxy did I see a <content:encoded> tag appear that contained the full text (including HTML tags), but the description still only contained the excerpt (without HTML tags).

Before I link you to the thread that made me conclude the statement in the post title, let me quote from the RSS 2.0 specs:

A channel may contain any number of <item>s. An item may represent a “story” -- much like a story in a newspaper or magazine; if so its description is a synopsis of the story, and the link points to the full story. An item may also be complete in itself, if so, the description contains the text (entity-encoded HTML is allowed; see examples), and the link and title may be omitted. All elements of an item are optional, however at least one of title or description must be present.

That said take a look at http://wordpress.org/support/topic/190901 – having the problem described above and reading the replies, it just left me with one question:

If Otto42 is a moderator, where does WordPress get its trolls from?

The RSS specs are pretty vague on this issue; at most one can point to http://www.feedvalidator.org/docs/warning/DuplicateDescriptionSemantics.html. But my main problem is that the rants of this “moderator” don’t help me fix it, because Google Reader and Firefox still don’t display the full entries.
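
To make the two places for the text visible, here is a small Python sketch that reads both the description and the namespaced full-text element from an item. The sample feed is made up, and I’m assuming the full text lives in `content:encoded` from the RSS content module.

```python
import xml.etree.ElementTree as ET

# Namespace of the RSS content module (content:encoded).
CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"

# Made-up feed: excerpt in <description>, full text in <content:encoded>.
FEED = """<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Example</title>
    <item>
      <title>Post</title>
      <description>Short excerpt...</description>
      <content:encoded><![CDATA[<p>The <b>full</b> text.</p>]]></content:encoded>
    </item>
  </channel>
</rss>"""


def item_texts(feed_xml):
    """Return (description, content_encoded) per item; a missing part is None."""
    root = ET.fromstring(feed_xml)
    return [
        (item.findtext("description"),
         item.findtext("{%s}encoded" % CONTENT_NS))
        for item in root.iter("item")
    ]


for description, full in item_texts(FEED):
    print("description:    ", description)
    print("content:encoded:", full)
```

A reader that only looks at the description will show the excerpt, no matter what the other element contains.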

A fix for it

If you want to fix it, you can do the following (in WP 2.7):

Open up wp-includes/feed-rss2.php and change

to

Similarly, if you want to fix your comments, too, open wp-includes/feed-rss2-comments.php and change

to

WordPress Hacking II

I have a few private pages that I use to store random stuff and ideas, and private pages (for a reason I don’t understand) don’t show up in the pages widget.

It turns out that WordPress’s get_pages function always filters them out, while get_posts doesn’t (it actually has some logic to figure out whether to show private pages or not).
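
The kind of logic I mean looks roughly like this. It is a Python sketch and not WordPress code; the page dictionaries, the `can_read_private` flag and the rule itself (author or privileged user) are my assumptions for illustration.

```python
def visible_pages(pages, user_id=None, can_read_private=False):
    """Published pages for everyone; private pages only for their
    author or for users who may read private content (assumed rule)."""
    result = []
    for page in pages:
        if page["status"] == "publish":
            result.append(page)
        elif page["status"] == "private" and (
            can_read_private
            or (user_id is not None and page["author"] == user_id)
        ):
            result.append(page)
    return result


# Hypothetical sample data.
pages = [
    {"title": "About", "status": "publish", "author": 1},
    {"title": "Ideas", "status": "private", "author": 1},
]
print([p["title"] for p in visible_pages(pages)])
print([p["title"] for p in visible_pages(pages, user_id=1)])
```

An anonymous visitor only gets the published page, while the author also gets the private one.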

I’ve decided to fix this. A real fix would probably be merging get_pages and get_posts because both seem to do pretty much the same except that get_posts is a tad bit more advanced, but I’m all for quick fixes at the moment, because I don’t have much time and I don’t want to end up considering myself a PHP developer ;-)

To make get_pages return private pages, too, you have to open the file wp-includes/post.php and change the following lines near the bottom of the get_pages function:

to:

If you want to make it easier to distinguish private pages from normal ones, you can also change the following bits in wp-includes/classes.php in the method Walker_Page::start_el:

to:

Best-of-Explosm Web 2.0

Yahoo Pipes are an interesting concept, as are the other existing mashup tools (like Microsoft’s Popfly or Ubiquity), and it is amazing what can be done with a few clicks with them.

Because I’ve wanted to learn how to work with Yahoo Pipes (before the Google Mashup Editor is released to the public), I’ve decided to take my best-of-explosm viewer to a different level and prototype it with Yahoo Pipes.

This just shows the first page (it’s easier to view – and create – the feed in groups of 20 pictures than all at once). You can follow the link if you want to view more pages.

Stay tuned for more.
Cheers,
Andreas

WordPress Hacking

Since I’ve been working on another blog (http://info1.blackhc.net) for my tutorial at university, I’ve played around with using Microsoft Word for publishing and editing content on the blog. I gotta say that SmartArts are a really nice and easy way to lighten up blog posts.

However, Word’s support for the current version of WordPress is sub-optimal: it totally messes up the formatting when retrieving existing blog posts (that is, Word doesn’t recognize any paragraphs or line breaks). I spent some time digging around in the blog’s PHP code, and after an hour or so I was able to fix it.
It’s a one-liner that I want to share with you:

The function wpautop adds <p> tags to your text (and by default <br> tags, too).

This will fix the formatting bug in Word. This is a hack, of course, but it works for Word.
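
To illustrate what such an auto-paragraph step does, here is a much simplified Python sketch. It is not WordPress’s wpautop, which handles many more cases; the function name `simple_autop` and its `br` parameter are made up.

```python
import re


def simple_autop(text, br=True):
    """Wrap blank-line-separated blocks in <p> tags; with br=True,
    single newlines inside a block become <br /> line breaks."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    if br:
        paragraphs = [p.replace("\n", "<br />\n") for p in paragraphs]
    return "\n".join("<p>%s</p>" % p for p in paragraphs)


print(simple_autop("First line\nsecond line\n\nNew paragraph"))
```

With explicit paragraph and line-break tags in the text, an editor no longer has to guess where paragraphs begin and end.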

WordPress has a bug in its Atom Publishing Protocol code, too:
It messes up the status code header when asking for user credentials, which prevents your browser from displaying the login form.
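
For context: a browser only shows its login form when the response has a 401 status and a WWW-Authenticate header. A minimal Python sketch of such a response, assuming HTTP Basic authentication (the helper `auth_challenge` and the realm name are made up, and this is not the WordPress code):

```python
from http import HTTPStatus


def auth_challenge(realm):
    """Return (status_line, headers) for a response that asks the
    client for HTTP Basic credentials."""
    status = HTTPStatus.UNAUTHORIZED
    status_line = "HTTP/1.1 %d %s" % (status.value, status.phrase)
    headers = {"WWW-Authenticate": 'Basic realm="%s"' % realm}
    return status_line, headers


print(auth_challenge("blog"))
```

If the status line is wrong, the client never sees a proper 401 and has no reason to ask for credentials.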
Again only one line needs to be changed:

Stay tuned for more updates in the next months.. :-)
Cheers,
Andreas

PS: University is really keeping me busy..