Building a dynamic feed aggregator with Yahoo Pipes
Wed 12, 2007 19:54
Today I discovered Yahoo Pipes after reading this post. A friend of mine told me the service is already one year old. It seems I’m a little bit distracted.
First, I have to say this: IMHO, this wins the “best web service of the year” award, and by a large margin! Google has some serious competition with this kind of service…
So in this post I decided to give a brief introduction to this Yahoo service and demonstrate how easily you can build a dynamic feed aggregator using Yahoo Pipes!
For those of you that do not know Yahoo Pipes:
Pipes is a powerful composition tool to aggregate, manipulate, and mash-up content from around the web.
So basically, do you love the Unix pipe operator? Now imagine using your mouse to drag the components (processes) and connect them together using lines (pipes), all in your regular browser, defining a visually attractive workflow without requiring any external software. Now that’s Web 2.0!
After you create a pipe, it becomes accessible via a URL, and you can integrate it with your web page or, better, download the results in XML (RSS) or JSON format, consuming whatever data your pipe produces! For free! Now my mind starts thinking evil… :-)
Requirements and motivation
First you need a valid Yahoo account :-). What do we want to achieve? Imagine you want to build a feed aggregator (maybe to build a planet site). You need to parse, collect and process all members’ feeds, watch for errors, duplicates… You have to know how to process all the RSS and ATOM specifications and find the common denominator of these formats. Worse than that, you will need the CPU and network power to cope with more users and more posts.
You can use one of the gazillion libraries to do this job and waste your CPU cycles and network bandwidth… or you can do it with Yahoo Pipes :-)
To achieve that we need to pass Yahoo a list of the blogs we want to aggregate. One way of doing that is by creating a CSV file that Yahoo can access every time it runs your pipe. The CSV could have just one column with the URL of the blog (it does not have to be the actual feed URL, more on that later):
"URL"
"http://www.digg.com"
"http://www.slashdot.org"
"http://www.osnews.com"
Make this file (blogs.csv) accessible on your web server, and we can start to build our pipe!
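If you would rather generate the file than type it by hand, here is a minimal Python sketch. The function names and the BLOGS list are my own illustration (nothing here comes from Yahoo); it simply writes the single quoted "URL" column shown above.

```python
import csv
import io

# Hypothetical list of sites to aggregate (the same ones used in this post).
BLOGS = [
    "http://www.digg.com",
    "http://www.slashdot.org",
    "http://www.osnews.com",
]


def build_blogs_csv(urls):
    """Return CSV text with one quoted "URL" column, one site per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(["URL"])  # header row, later used as the column name
    for url in urls:
        writer.writerow([url])
    return buffer.getvalue()


def write_blogs_csv(path, urls):
    """Write the CSV text to disk, ready to be uploaded to your server."""
    with open(path, "w", newline="") as handle:
        handle.write(build_blogs_csv(urls))
```

Calling write_blogs_csv("blogs.csv", BLOGS) produces the file; where you upload it is up to you.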
Building the pipe
Now we can get our hands dirty!
Fetch CSV
Start by creating a new pipe. The first element we need is the “Source → Fetch CSV” component. Drag it to the pipe edit pane. In the URL field, put the address of your CSV file. You can accept the default option to use the first row as the column names. Yahoo uses them later when you need to select which columns you want to extract. The component should look like this:
Fetch Site Feed (Loop)
Now, for each blog in our CSV file, you want to fetch the RSS or ATOM feed of the site. Remember that I told you that you didn’t have to know the exact URL of the feed? That’s right, Yahoo uses the standards to find the correct feed by looking at the blog content, much like the way Firefox shows that a website has a feed available to subscribe to.
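I don’t know how Yahoo implements this, but the usual convention browsers follow is to look for link tags with rel="alternate" and a feed MIME type in the page. A rough Python sketch of that idea, assuming this convention (the class and function names are mine):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types commonly used to advertise RSS and Atom feeds.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect hrefs of <link rel="alternate"> tags that advertise a feed."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = {name: (value or "") for name, value in attrs}
        rel = attributes.get("rel", "").lower().split()
        mime = attributes.get("type", "").lower()
        href = attributes.get("href", "")
        if "alternate" in rel and mime in FEED_TYPES and href:
            # Resolve relative hrefs against the page URL.
            self.feeds.append(urljoin(self.base_url, href))


def discover_feeds(html, base_url):
    """Return the feed URLs advertised in an HTML page, in document order."""
    finder = FeedLinkFinder(base_url)
    finder.feed(html)
    return finder.feeds
```

You would pass it the HTML of a blog’s front page together with that page’s URL, and it returns the feed addresses the page advertises.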
For this you’ll want the module “Source → Fetch Site Feed” that knows how to do just that. But wait, don’t we have a list of blogs to parse? That’s right, we need to apply the “Fetch Site Feed” to all blogs. To do that, we use the “Operators → Loop” component!
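Conceptually, the Loop is just a for-each over the CSV rows that collects every item the inner module returns. A small Python sketch of that idea, where fetch_site_feed is a hypothetical stand-in for the “Fetch Site Feed” module (you supply it; it is not a real API):

```python
import csv
import io


def loop_fetch(csv_text, fetch_site_feed):
    """Call fetch_site_feed for every row's URL and collect all the items."""
    rows = csv.DictReader(io.StringIO(csv_text))  # first row gives column names
    items = []
    for row in rows:
        # Keep every item each call returns, not only the first one.
        items.extend(fetch_site_feed(row["URL"]))
    return items
```

The result is one flat list of feed items, whatever blog they came from.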
So we drag a “Loop” module and put a “Fetch Site Feed” inside it. Now, we need to configure these components. Yahoo can do much of the configuration by itself when you connect the blocks. So just connect the two components, creating a pipe! Then the loop and site feed will react, allowing you to consume only what the previous module produces. It is a kind of “strongly typed pipes” :-)
Basically you have to configure:
- The URL on the “Fetch Site Feed” should be “item.URL” (i.e., the URL column from the CSV file).
- “Emit all” should be selected on the “Loop” module, since we are interested in all the results.
The end result should be this:
| sort | uniq > output
Now we have our feed items almost ready! However, we should process them in order to output a more “user friendly” result. For me, this includes sorting the feed items by publication date (newest first) and removing duplicated items (who knows…).
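If you did this by hand, that post-processing could look roughly like the Python sketch below. It assumes each item is a dict with a “link” and an RSS-style “pubDate” string that includes a timezone, and that every item actually has a pubDate; the function names are mine.

```python
from email.utils import parsedate_to_datetime


def sort_newest_first(items):
    """Sort items by their RSS-style pubDate string, newest first."""
    # Assumes every item carries a parseable pubDate with a timezone.
    return sorted(
        items,
        key=lambda item: parsedate_to_datetime(item["pubDate"]),
        reverse=True,
    )


def unique_by_link(items):
    """Keep only the first item seen for each link, preserving order."""
    seen = set()
    kept = []
    for item in items:
        if item["link"] not in seen:
            seen.add(item["link"])
            kept.append(item)
    return kept
```

Chaining them as unique_by_link(sort_newest_first(items)) mirrors the “| sort | uniq” idea in the heading.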
This couldn’t be simpler with Yahoo Pipes. Just drag an “Operators → Sort” module and connect it to the previous “Loop” module. Tell it to order by “item.pubDate” (remember to connect the module before selecting “item.pubDate”) in descending order.
Then you can drag and connect an “Operators → Unique” component and say you want to reject items with a duplicated “item.link”! The result should be like this figure:
Connect to output and have fun
The final step is to connect the output of the “Unique” component to the “Pipe Output”. You’re done :-) Now you can play with the debugger and see the output of your pipe. You can click on any of your components and get the intermediate results too!
Now that you’ve tested your pipe, you should save it and click “Run Pipe” at the top; Yahoo redirects you to a page where you can see the results of running your pipe :-) Then you can publish (make the pipe public), edit, delete, or clone it, add the results to your Google account, see a bunch of statistics about your pipe, etc…
For me, the best feature is under the “More Options” menu. You can download the results of your pipe as XML or (even better) as JSON! You can then use this in your favorite application to do… well… whatever you want :-) Using the pipe this way follows the “pull model”, because you poll the pipe every time you want results.
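Polling the JSON output from your own application could look something like this Python sketch. The shape of the JSON (a “value” object holding an “items” list) is an assumption on my part, so check it against what your pipe really returns; pipe_url is whatever JSON address the “More Options” menu gives you.

```python
import json
from urllib.request import urlopen


def fetch_pipe_json(pipe_url):
    """Download the raw JSON text produced by one run of the pipe."""
    with urlopen(pipe_url) as response:
        return response.read().decode("utf-8")


def extract_items(payload):
    """Pull the list of items out of the JSON text.

    Assumes a {"value": {"items": [...]}} layout; returns [] otherwise.
    """
    data = json.loads(payload)
    return data.get("value", {}).get("items", [])
```

Each call to fetch_pipe_json runs the pipe again, which is exactly the pull model: you ask whenever you want fresh results.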
The other option is to use the “Operators → Web Service” component that pushes the pipe results to your application URL, and you don’t have to poll Yahoo anymore.. (well, you have to run the pipe by invoking its URL so…).
Conclusion and final thoughts
I’ve just scratched the surface of “Yahoo Pipes”. There are tons of other features that allow you to build a pipe with whatever complexity you want. But we can already raise some questions:
- How much easier could it be?
- Does it scale? (imagine you have 1000 blogs to aggregate…)
- Will it always be free (regardless of the number of times you run the pipe)?
- Are they using plagger? :-)
- Is there any technical information about the Yahoo implementation of this service?
- Is Yahoo hiring in Europe? This is a seriously interesting project…
Questions or corrections are welcome :-) Start playing with your pipes today!
4 Comments
Nice article. With your final questions you reach the point: the reason for building Yahoo Pipes. I know my reasons, because I started to use Yahoo Pipes more than a month ago on my web site MacrosReader (http://reader.macrostandard.com), but I cannot imagine Yahoo’s reasons. Meanwhile, take a look at my Yahoo Pipes: http://pipes.yahoo.com/macrostandard
Just one word: powerful!
I tried something very much like that right after Yahoo!Pipes came out.
Problem is, not everything has a pubDate. So, that breaks your sort.
Another problem is that many sites link to the same article(s), but provide their own, slightly different, comments.
The purpose of doing a sort | uniq is to get rid of the duplicates. So, to make this work in a proper semantic sense, you need to somehow parse the body of the article and pull out the meaning, then compare that against the meaning of other similar articles. For any two articles that are too similar, you probably want to drop the newer ones (penalizing those sites who just copy content from others and spew it back out again).
It’s a pain. I never really got anywhere with it.
To answer one of your questions, I don’t think this scales very well. I have a blogroll of about 250 blogs. I set up an XML file, and used a structure similar to what you set up. When I used a test version of my XML file with 10 entries, it worked great. When I used the full XML file, it just churned for a long time and then reported “The Pipes engine request failed”.
Thanks for the pointers anyway, I got to my failure faster with your help :)