Sunday, April 13, 2008

WtoTconvUtil

This is another utility program I wrote in Ruby + RubyCocoa. It does one simple thing: extract text from a web page. To be honest, I'm not sure how useful it is; I just wanted to experiment. It is based on PtoTconvUtil, my PDF-to-text converter utility. After I wrote that one, I wondered how I could extract text from a web page. I experimented for a while but couldn't figure it out. Then last night, when I couldn't concentrate on my paper, I found a clue on a web site and spent 15-20 minutes working out how to do it.

This program still has many issues because the browser part is built entirely with Cocoa bindings, which means there is no code behind it. The only scripts I wrote are the ones that extract the text and save it as a text file. Still, thanks to built-in Cocoa functions, it recognizes a web address typed in the text box (though you always need to type "http://"), the reload, forward and back buttons work, and it accepts Safari Webarchive files and HTML files by drag and drop. I also found that this program can extract text from a PDF file that is displayed in the browser view with a plug-in. So if you know the web address of a PDF file with embedded text, you can show it in the browser view and extract its text. This is all very nice, but because I only use built-in functions, it's not flexible. I want to add a function to read bookmarks from other browsers, so you don't have to type an address every time. Safari bookmarks might be the easiest to read, so I might try those first, though I'm not sure when that will happen.

Anyway, this program lets you compare the original page and the extracted text side by side in one window. So if you are building a corpus from web pages, you can either extract the entire text or simply copy and paste part of it.
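
As a rough illustration of the extract-and-save idea outside Cocoa, here is a minimal plain-Ruby sketch. It is not the code WtoTconvUtil uses (the program relies on Cocoa's built-in functions); the helper name html_to_text is hypothetical, and the regular expressions are a crude assumption that will not handle every page.

```ruby
require "cgi"

# Hypothetical helper, not part of WtoTconvUtil: a crude HTML-to-text
# conversion using only the Ruby standard library.
def html_to_text(html)
  # Drop <script> and <style> blocks together with their contents.
  text = html.gsub(/<(script|style)\b[^>]*>.*?<\/\1>/mi, " ")
  # Replace every remaining tag with a space.
  text = text.gsub(/<[^>]+>/, " ")
  # Decode basic entities such as &amp; and &lt;.
  text = CGI.unescapeHTML(text)
  # Collapse runs of whitespace and trim both ends.
  text.gsub(/\s+/, " ").strip
end
```

To save the result for a corpus, something like File.write("page.txt", html_to_text(html)) would do, where html holds the page source and page.txt is just an example file name.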

So again, if you somehow found this page and are reading this, AND if you use Leopard, try it and let me know what you think.

EDIT: This program has been discontinued and integrated into CasualTextractor, which is available on the CasualConc site under Utility Programs.
