Creating My Own Idea

posted on 2005-08-19 at 23:35:09 by Joel Ross

Well, I've been harping on an idea for a while now, and instead of waiting for others to get on board, I decided to see what I could do myself. I started writing the code, and getting a quick and dirty version up and running was actually pretty simple.

Here's the high level logic of my initial run:

1. Get an OPML file.
2. Load each feed for that OPML file.
3. Parse the content of each item in each feed and pull out each URL.
4. For each URL, keep track of the domain. If the domain isn't in the list, add it. If it's in the list, then add one to its link count.
5. Once you're done, sort it by the link count (descending) and show me.
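
For illustration only, here is a minimal Python sketch of those five steps. It is not the RSS.NET code described in this post: the function names, the URL regular expression, and the choice to read only each item's RSS 2.0 description element are all assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter
from urllib.parse import urlparse
from urllib.request import urlopen

# Assumption: a simple pattern is good enough to pull links out of item HTML.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def feed_urls_from_opml(opml_text):
    """Steps 1-2: collect the xmlUrl attribute of every outline in the OPML."""
    root = ET.fromstring(opml_text)
    return [o.get("xmlUrl") for o in root.iter("outline") if o.get("xmlUrl")]


def item_bodies(feed_text):
    """Step 3 (first half): yield the description text of each RSS item."""
    root = ET.fromstring(feed_text)
    for item in root.iter("item"):
        yield item.findtext("description") or ""


def count_domains(bodies):
    """Steps 3-4: pull out each URL and add one to its domain's link count."""
    counts = Counter()
    for body in bodies:
        for url in URL_PATTERN.findall(body):
            domain = urlparse(url).hostname  # lowercased, port stripped
            if domain:
                counts[domain] += 1
    return counts


def top_domains(opml_text, fetch):
    """Step 5: total the counts across feeds and sort by link count, descending.

    `fetch` is any callable that takes a feed URL and returns the feed XML.
    """
    counts = Counter()
    for feed_url in feed_urls_from_opml(opml_text):
        counts.update(count_domains(item_bodies(fetch(feed_url))))
    return counts.most_common()


def fetch_over_http(url):
    """One possible `fetch`: download the feed with the standard library."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling top_domains(opml_text, fetch_over_http) would hit every feed on every run. Passing the fetcher in as a parameter leaves room to swap in a cached one later.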

There's a lot of room for improvement, but it worked - for a first cut. It took about 15 minutes so far. I ran it through an OPML listing of my top ten blogs, and here are the top domains linked to, followed by the number of links:

www.microsoft.com (43)
weblogs.asp.net (36)
blogs.msdn.com (28)
scottonwriting.net (25)
spaces.msn.com (23)
msdn.microsoft.com (21)
www.hanselman.com (13)
www.mikeswanson.com (13)
channel9.msdn.com (13)
dbvt.com (10)
en.wikipedia.org (10)
www.aisto.com (10)

As you can see, it's not very accurate yet. It does reflect where people are linking to, but here's how it could be improved.

1. Add caching of feeds. This would make the retrieval faster - you only need to retrieve feeds that are updated. (and for those in my top ten, sorry for banging on your feed!)
2. Make the updating of feeds happen in the background. By doing this, the feeds are loaded separately from your request to see your "Top Blogs List." I would also store the parsed URLs by feed, so I have that information at hand.
3. Let this handle multiple users. It could right now, but not smartly. If the feeds are stored generically, that data can be used across users. This would obviously be key for large aggregator companies like Newsgator or Bloglines.
4. Add a list of invalid URLs. This could be used for two purposes. First, sites like www.microsoft.com aren't exactly blogs, and, if this is a top blog list, then they should be excluded. Second, notice that three of the top four domains linked to are aggregate blogs. Adding a site like weblogs.asp.net to the invalid URL list would force the software to dig further - looking for something like weblogs.asp.net/rhoward, which is a valid blog URL.
5. Exclude links to themselves. 6 of my 12 top linked domains are domains in my top ten blog list. I'm not sure right now if it's because they are linking to each other, or because they're linking to themselves. The latter should be excluded.
6. Exclude links to blogs already in your OPML. This way, you only see the top blogs outside of your network. Maybe another option would be to see only the blogs in your OPML - who's the most influential among your circle.
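
Items 4 through 6 could be sketched roughly as follows, again in Python and purely as an illustration. The two host sets contain only the examples named above, and the rule of using the first path segment to identify a blog on an aggregate host is an assumption. The names blog_key and count_blogs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse

# Only the examples named in the post; a real list would be longer.
NOT_BLOGS = {"www.microsoft.com"}
AGGREGATE_HOSTS = {"weblogs.asp.net"}


def blog_key(url):
    """Map a URL to the blog it belongs to, or None if it should be ignored."""
    parts = urlparse(url)
    host = parts.hostname or ""
    if not host or host in NOT_BLOGS:
        return None
    if host in AGGREGATE_HOSTS:
        # Dig further: assume the first path segment names the blog,
        # e.g. weblogs.asp.net/rhoward.
        segments = [s for s in parts.path.split("/") if s]
        if not segments:
            return None
        return f"{host}/{segments[0].lower()}"
    return host


def count_blogs(links, subscribed=frozenset(), skip_subscribed=False):
    """Count links per blog.

    `links` is an iterable of (source_blog_key, linked_url) pairs.
    Links from a blog to itself are always skipped (item 5). Links to
    blogs already in the OPML are skipped when skip_subscribed is True
    (item 6).
    """
    counts = Counter()
    for source, url in links:
        key = blog_key(url)
        if key is None or key == source:
            continue
        if skip_subscribed and key in subscribed:
            continue
        counts[key] += 1
    return counts.most_common()
```

Keeping the source blog next to each link is what makes the self-link check possible, which is one more reason to store parsed URLs by feed.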

Since I've started thinking about this, I've seen a few downsides of this approach. First, non-full text feeds are all but useless (but you knew that already, didn't you?). Most don't provide enough content to be useful, let alone include any links. No links - no influence! The other thing that's a little more subtle is that feeds that provide more items will get more influence than feeds that only include a few items. Obviously, the latter can be solved by only using the latest x number of items from each feed, so that every feed contributes the same number of items. This can also be handled with caching - if you only provide 7 items in your feed, and I want to base my Top Blogs on 15 items, then I can cache 8 items that aren't in your feed anymore.
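
The caching idea might look something like this minimal Python sketch. It assumes each item has a stable identifier (a guid or link) and that feeds list their newest items first. The ItemCache class and its method names are hypothetical.

```python
from collections import OrderedDict


class ItemCache:
    """Remember the most recent items seen for each feed, up to a fixed limit."""

    def __init__(self, per_feed_limit=15):
        self.per_feed_limit = per_feed_limit
        self._items = {}  # feed URL -> OrderedDict(item id -> body), oldest first

    def update(self, feed_url, fetched):
        """Merge freshly fetched (item_id, body) pairs, given newest first."""
        store = self._items.setdefault(feed_url, OrderedDict())
        for item_id, body in reversed(fetched):  # walk oldest to newest
            store[item_id] = body  # new ids append; known ids keep their place
        while len(store) > self.per_feed_limit:
            store.popitem(last=False)  # drop the oldest remembered item

    def items(self, feed_url):
        """Return the remembered item bodies, newest first."""
        return list(self._items.get(feed_url, {}).values())[::-1]
```

With a limit of 15 and a feed that only exposes 7 items, the cache ends up holding the 7 current items plus the 8 most recent ones that have already dropped out of the feed.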

I may spend some more time refining this, but I'm not sure. For now, it was a good chance to mess around with RSS and the wonderful open source RSS.NET framework.

Categories: Blogging