Building a Dynamic Feed Aggregator With Yahoo Pipes

For those of you who don’t know Yahoo Pipes:

Pipes is a powerful composition tool to aggregate, manipulate, and mash-up content from around the web.

So basically, do you love the Unix pipe operator? Now imagine using your mouse to drag components (processes) and connect them with lines (pipes), all in your regular browser, defining a visually attractive workflow without any external software. Now that’s Web 2.0!

After you create a pipe, it becomes accessible via a URL, and you can integrate it into your web page or, better, download the results in XML (RSS) or JSON format and consume whatever data your pipe produces! For free! Now my mind starts thinking evil… :-)

Requirements and motivation

First, you need a valid Yahoo account :-). What do we want to achieve? Imagine you want to build a feed aggregator (maybe to build a planet site). You need to parse, collect and process all members’ feeds, watching for errors, duplicates… You have to know how to process all the RSS and Atom specifications and find the common denominator of these formats. Worse than that, you will need the CPU and network power to cope with more users and more posts.

You can use one of the gazillion libraries out there to do this job and waste your CPU cycles and network bandwidth… or you can do it with Yahoo Pipes :-)

To achieve that, we need to pass Yahoo a list of the blogs we want to aggregate. One way of doing that is to create a CSV file that Yahoo can access every time it runs your pipe. The CSV could have just one column with the URL of the blog (it does not have to be the actual feed URL, more on that later):

"URL" 
"http://www.digg.com" 
"http://www.slashdot.org" 
"http://www.osnews.com"
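
If your list of blogs already lives in code, a file like this can be generated rather than typed by hand. Here is a minimal Python sketch; the function name build_blogs_csv and the BLOG_URLS list are my own illustrative names, not anything Yahoo Pipes requires:

```python
import csv
import io

# Blog home pages to aggregate (the same URLs as in the listing above).
BLOG_URLS = [
    "http://www.digg.com",
    "http://www.slashdot.org",
    "http://www.osnews.com",
]


def build_blogs_csv(urls):
    """Return CSV text with a single, fully quoted "URL" column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(["URL"])  # header row, used later as the column name
    for url in urls:
        writer.writerow([url])
    return buffer.getvalue()


if __name__ == "__main__":
    # Write the file that you would then publish somewhere Yahoo can reach.
    with open("blogs.csv", "w", newline="") as handle:
        handle.write(build_blogs_csv(BLOG_URLS))
```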

Make this file (blogs.csv) accessible on your web server, and we can start to build our pipe!

Building the pipe

Now we can get our hands dirty!

Fetch CSV

Start by creating a new pipe. The first element we need is the “Source → Fetch CSV” component. Drag it to the pipe edit pane. In the URL field, enter the address of your CSV file. You can accept the default option to use the first row as the column names. Yahoo uses those names later, when you need to select which columns you want to extract. The component should look like this:

[Figure: the configured “Fetch CSV” module]
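
In code terms, this step turns each CSV row into an item whose fields are named after the header row. Here is a rough Python equivalent, as a sketch only: it parses an inline sample instead of downloading the file, and fetch_csv_items is a name I made up for illustration.

```python
import csv
import io

# Inline sample standing in for the published blogs.csv.
SAMPLE_CSV = (
    '"URL"\n'
    '"http://www.digg.com"\n'
    '"http://www.slashdot.org"\n'
    '"http://www.osnews.com"\n'
)


def fetch_csv_items(csv_text):
    """Parse CSV text into a list of dict items keyed by the header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]


items = fetch_csv_items(SAMPLE_CSV)
# Each item exposes its column by name, much like "item.URL" inside the pipe.
for item in items:
    print(item["URL"])
```

In a real script the CSV text would come from an HTTP request to wherever you published the file.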

Fetch Site Feed (Loop)

Now, for each blog in our CSV file, you want to fetch the site’s RSS or Atom feed. Remember that I told you that you didn’t have to know the exact URL of the feed? That’s right, Yahoo uses the standards to find the correct feed by looking at the blog content, much like the way Firefox shows that a website has a feed available to subscribe to.
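
I don’t know how Yahoo implements this internally, but the usual convention (the one browsers rely on) is feed autodiscovery: the page’s HTML carries a link tag with rel="alternate" and an RSS or Atom MIME type. Here is a minimal Python sketch of that convention; FeedLinkFinder, discover_feeds and the sample page are my own illustrative names and data:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types conventionally used to advertise RSS and Atom feeds.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect the href of every <link rel="alternate"> that advertises a feed."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = {name: (value or "") for name, value in attrs}
        rel = attributes.get("rel", "").lower().split()
        mime = attributes.get("type", "").lower()
        href = attributes.get("href", "")
        if "alternate" in rel and mime in FEED_TYPES and href:
            # Resolve relative hrefs against the page URL.
            self.feeds.append(urljoin(self.base_url, href))


def discover_feeds(page_url, html_text):
    """Return the feed URLs advertised in the given HTML page."""
    finder = FeedLinkFinder(page_url)
    finder.feed(html_text)
    return finder.feeds


# Hypothetical blog page, inlined so the example needs no network access.
SAMPLE_HTML = """
<html><head>
<title>Example blog</title>
<link rel="stylesheet" href="/style.css">
<link rel="alternate" type="application/rss+xml" title="RSS" href="/feed.xml">
</head><body></body></html>
"""

print(discover_feeds("http://blog.example.org/", SAMPLE_HTML))
```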

For this you’ll want the module “Source → Fetch Site Feed” that knows how to do just that. But wait, don’t we have a list of blogs to parse? That’s right, we need to apply the “Fetch Site Feed” to all blogs. To do that, we use the “Operators → Loop” component!

So we drag a “Loop” module and put a “Fetch Site Feed” inside it. Now, we need to configure these components. Yahoo can do much of the configuration by itself when you connect the blocks. So just connect the two components, creating a pipe! Then the loop and site feed will react, allowing you to consume only what the previous module produces. It is a kind of “strongly typed pipes” :-)

Basically you have to configure:

The URL of the “Fetch Site Feed” module should be “item.URL” (i.e., the URL column from the CSV file).
The “emit all” option should be selected on the “Loop” module, because we are interested in all the results.
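
If it helps to think of this in code, a “Loop” with “emit all” behaves roughly like a flat map: run the inner module once per input item and concatenate everything it returns. In the sketch below the inner module is a stub returning canned entries (no network access), and all the names and URLs are hypothetical:

```python
# Hypothetical canned feeds, standing in for real network fetches.
CANNED_FEEDS = {
    "http://a.example.org": [
        {"title": "A first", "link": "http://a.example.org/1"},
        {"title": "A second", "link": "http://a.example.org/2"},
    ],
    "http://b.example.org": [
        {"title": "B first", "link": "http://b.example.org/1"},
    ],
}


def fake_fetch_site_feed(url):
    """Stand-in for "Fetch Site Feed": return the entries for one blog URL."""
    return CANNED_FEEDS.get(url, [])


def loop_emit_all(items, sub_module, field):
    """Apply sub_module to item[field] for every item and emit all results."""
    emitted = []
    for item in items:
        emitted.extend(sub_module(item[field]))
    return emitted


csv_items = [{"URL": "http://a.example.org"}, {"URL": "http://b.example.org"}]
entries = loop_emit_all(csv_items, fake_fetch_site_feed, "URL")
print([entry["title"] for entry in entries])
```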

The end result should be this:

[Figure: the “Loop” module containing “Fetch Site Feed”, connected to “Fetch CSV”]

| sort | uniq > output

Now we have our feed items almost ready! However, we should process them to produce a more “user-friendly” result. For me, this includes sorting the feed items by publication date (newest first) and removing duplicates (who knows…).

This couldn’t be simpler with Yahoo Pipes. Just drag an “Operators → Sort” module and connect it to the previous “Loop” module. Tell it to sort by “item.pubDate” (remember to connect the module before selecting “item.pubDate”) in descending order.

Then you can drag and connect an “Operators → Unique” component and tell it to filter out items with a duplicated “item.link”! The result should be like this figure:

[Figure: the “Sort” and “Unique” modules connected after the “Loop” module]
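
For comparison, here is what the same “| sort | uniq” step could look like in plain Python. It is only a sketch: I assume RSS-style date strings in pubDate and that the first item seen for a given link is the one kept. The sample items are made up.

```python
from email.utils import parsedate_to_datetime


def sort_by_pubdate_desc(items):
    """Sort items by their RSS-style pubDate string, newest first."""
    return sorted(
        items,
        key=lambda item: parsedate_to_datetime(item["pubDate"]),
        reverse=True,
    )


def unique_by_link(items):
    """Keep only the first item seen for each link."""
    seen = set()
    kept = []
    for item in items:
        if item["link"] not in seen:
            seen.add(item["link"])
            kept.append(item)
    return kept


# Made-up feed items, including one duplicated link.
ITEMS = [
    {"title": "Older post", "link": "http://a.example.org/1",
     "pubDate": "Mon, 05 Mar 2007 10:00:00 +0000"},
    {"title": "Newest post", "link": "http://b.example.org/9",
     "pubDate": "Wed, 07 Mar 2007 08:30:00 +0000"},
    {"title": "Older post (duplicate)", "link": "http://a.example.org/1",
     "pubDate": "Mon, 05 Mar 2007 10:00:00 +0000"},
    {"title": "Middle post", "link": "http://c.example.org/4",
     "pubDate": "Tue, 06 Mar 2007 12:00:00 +0000"},
]

result = unique_by_link(sort_by_pubdate_desc(ITEMS))
print([item["title"] for item in result])
```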

Connect to output and have fun

The final step is to connect the output of the “Unique” component to the “Pipe Output”. You’re done :-) Now you can play with the debugger and see the output of your pipe. You can click on any of your components and get the intermediate results too!

Now that you’ve tested your pipe, save it and click “Run Pipe” at the top; Yahoo redirects you to a page where you can see the results of running your pipe :-) From there, you can publish (make the pipe public), edit, delete or clone it, add the results to your Google account, see a bunch of statistics about your pipe, etc…

For me, the best feature is under the “More Options” menu. You can download the results of your pipe as XML or (even better) as JSON! You can then use this in your favorite application to do… well… whatever you want :-) Using the pipe this way follows the “pull model”, because you poll the pipe every time you want results.

The other option is to use the “Operators → Web Service” component, which pushes the pipe results to your application’s URL, so you don’t have to poll Yahoo anymore… (well, you still have to run the pipe by invoking its URL, so…).
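
To give an idea of the pull model, here is a small Python sketch that reads the items out of a JSON response. Take the payload layout (a “value” object holding an “items” list) as an assumption on my part and check it against what your own pipe actually returns; the sample response is inlined so nothing is downloaded, and extract_items is a made-up name.

```python
import json

# Inlined sample response; the layout is an assumption, not a documented format.
SAMPLE_RESPONSE = """
{
  "count": 2,
  "value": {
    "title": "My planet",
    "items": [
      {"title": "Newest post", "link": "http://b.example.org/9"},
      {"title": "Older post", "link": "http://a.example.org/1"}
    ]
  }
}
"""


def extract_items(response_text):
    """Return the list of feed items from a JSON response, or [] if absent."""
    payload = json.loads(response_text)
    return payload.get("value", {}).get("items", [])


for entry in extract_items(SAMPLE_RESPONSE):
    print(entry["title"], entry["link"])
```

In a real application the response text would come from an HTTP request to your pipe’s JSON URL, repeated every time you want fresh results.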

Conclusion and final thoughts

I’ve just scratched the surface of “Yahoo Pipes”. There are tons of other features that allow you to build a pipe with whatever complexity you want. But we can already raise some questions:

How much easier could it be?
Does it scale? (imagine you have 1000 blogs to aggregate…)
Will it always be free (regardless of the number of times you run the pipe)?
Are they using Plagger? :-)
Is there any technical information about the Yahoo implementation of this service?
Is Yahoo hiring in Europe? This is a seriously interesting project…

Questions or corrections are welcome :-) Start playing with your pipes today!
