Showing posts with label utility program. Show all posts

Sunday, April 13, 2008

WtoTconvUtil

This is another utility program I wrote in Ruby + RubyCocoa. What it does is very simple: it extracts text from a web page. To be honest, I'm not sure how useful it is; I just wanted to experiment. It is based on PtoTconvUtil, my PDF-to-text converter utility. After I wrote that, I wondered how I could extract text from a web page, so I experimented for a while but couldn't figure it out. Then last night, when I couldn't concentrate on my paper, I found a clue on a web site and spent 15-20 minutes working out how to do it.

This program still has many limitations because the browser part is built entirely with Cocoa bindings, which means I wrote no code for it. The only code I wrote extracts the text and saves it as a text file. Still, thanks to the built-in Cocoa functions, it recognizes a web address typed in the text box (though you always need to type "http://"), the reload, forward, and back buttons work, and it accepts Safari Web Archive files and HTML files by drag and drop. I also found that it can extract text from a PDF file displayed in the browser view through a plug-in. So if you know the web address of a PDF file with embedded text, you can display it in the browser view and extract the text. This is all very nice, but because I only use built-in functions, it's not flexible. I want to add a function to read bookmarks from other browsers so you don't have to type an address every time. Safari bookmarks might be the easiest to read, so I might try those first, though I'm not sure when that will happen.

Anyway, this program lets you compare the original page and the extracted text in one window. So if you are building a corpus from web pages, you can either extract the entire text or simply copy and paste part of it.
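For anyone who wants to try the basic idea without the GUI, here is a minimal sketch in plain Ruby. It is not how WtoTconvUtil works (that relies on the Cocoa web view); it is a crude regex-based approximation that assumes reasonably simple HTML. The helper name html_to_text and the sample markup are made up for illustration.

```ruby
# Hypothetical helper, NOT the code used in WtoTconvUtil:
# a crude regex-based HTML-to-text conversion for simple pages.
ENTITIES = {
  "&amp;" => "&", "&lt;" => "<", "&gt;" => ">",
  "&quot;" => '"', "&#39;" => "'", "&nbsp;" => " "
}.freeze

def html_to_text(html)
  # Drop script and style blocks entirely, including their contents.
  text = html.gsub(%r{<(script|style)\b[^>]*>.*?</\1>}mi, " ")
  # Drop HTML comments.
  text = text.gsub(/<!--.*?-->/m, " ")
  # Turn line breaks and the ends of common block elements into newlines.
  text = text.gsub(%r{<br\s*/?>|</(p|div|h[1-6]|li|tr)>}i, "\n")
  # Strip every remaining tag.
  text = text.gsub(/<[^>]+>/, "")
  # Decode a handful of common entities in a single pass.
  text = text.gsub(/&(?:amp|lt|gt|quot|#39|nbsp);/, ENTITIES)
  # Collapse runs of spaces and tabs, and drop blank lines.
  text.lines.map { |l| l.gsub(/[ \t]+/, " ").strip }.reject(&:empty?).join("\n")
end

# Made-up sample markup for illustration.
sample = "<html><body><h1>Title</h1><p>Hello &amp; <b>world</b></p></body></html>"
puts html_to_text(sample)
```

To keep the result for a corpus, the returned string could be written out with File.write. A real HTML parser would be more robust on messy pages than regular expressions.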

So again, if you somehow found this page and are reading this, AND if you use Leopard, try it and let me know what you think.

EDIT: This program has been discontinued and integrated into CasualTextractor, which is available on the CasualConc site under Utility Programs.

Saturday, April 12, 2008

CasualMecab

is the name I gave to a utility program based on MeCab. It performs POS/morphological analysis of Japanese text. At the moment, all it does is produce MeCab output. The choices are MeCab output, ChaSen-like output, wakachi-gaki (words separated by spaces), and yomi (readings in katakana). The output can be saved as a text file. I want to add other output formats, but probably not in the near future. The program can also handle batch processing, although I haven't tested it extensively. The output file is encoded in UTF-8, mainly because that's what CasualConc can handle. I want to add a Japanese concordancing feature to CasualConc in the future. If anyone ever finds this blog and is interested, please go to the CasualConc site and download it. By the way, this program requires MeCab and MeCab-Ruby. Instructions for installing them are also on the CasualConc site. The installation is not simple (you need to use Terminal and the command line), but the instructions are step by step, so I hope anyone can follow them. As always, this program is free and runs only on Leopard.
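To give a rough idea of how the wakachi-gaki and yomi formats relate to the default MeCab output, here is a small sketch in plain Ruby. It does not call MeCab and is not the code inside CasualMecab. It only parses text already in the default layout (surface form, a tab, then comma-separated features), assuming the IPAdic field order in which the first feature is the part of speech and the eighth is the katakana reading. Other dictionaries order their fields differently. The sample lines, the Token struct, and the helper names are illustrative.

```ruby
# Illustrative sample in MeCab's default layout, assuming IPAdic-style features.
SAMPLE = <<~MECAB
  すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ
  も\t助詞,係助詞,*,*,*,*,も,モ,モ
  もも\t名詞,一般,*,*,*,*,もも,モモ,モモ
  EOS
MECAB

# Hypothetical container for the fields this sketch cares about.
Token = Struct.new(:surface, :pos, :reading)

# Parse default-format output into tokens, skipping blank lines and EOS markers.
def parse_mecab(output)
  output.each_line(chomp: true).reject { |l| l.empty? || l == "EOS" }.map do |line|
    surface, feature = line.split("\t", 2)
    fields = feature.to_s.split(",")
    # Assumption: fields[0] is the POS and fields[7] the reading (IPAdic order).
    # Unknown words may have no reading, so fall back to the surface form.
    Token.new(surface, fields[0], fields[7] || surface)
  end
end

# wakachi-gaki: surface forms separated by spaces.
def wakachi(tokens)
  tokens.map(&:surface).join(" ")
end

# yomi: katakana readings joined together.
def yomi(tokens)
  tokens.map(&:reading).join
end

tokens = parse_mecab(SAMPLE)
puts wakachi(tokens)
puts yomi(tokens)
```

MeCab can produce these formats directly, so the parsing above is only meant to show what each format contains.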