Saturday, April 19, 2008

Another file format support update in CasualConc

After I updated CasualConc last night, I realized I could add HTML and WebKit Webarchive support, so I added these two to the supported file formats. Now it can read various kinds of files that contain text, but processing them is slower than processing plain text files. So if speed is important, convert the files to plain text. If you want faster searches, create a database file from the plain text files. I will try to write a utility program to convert CasualConc-supported files to UTF-8 plain text files.

Well, it's kind of sad to keep writing blog posts knowing nobody is reading...

Friday, April 18, 2008

CasualConc update

With my recent discoveries about file handling on the Objective-C side (or the RubyCocoa side?), I decided to add a new feature to CasualConc. Originally, CasualConc was only able to handle UTF-8 or ASCII encoded files because I used Ruby's file handling methods. Now I have switched to Objective-C methods, so it can handle a few more encodings. The added encodings are UTF-16, Windows Latin 1, Windows Latin 2, Mac Roman, Shift-JIS, EUC-JP, and ISO-2022-JP. The last three encodings are all for Japanese. These are limited to the ones Objective-C can handle by default. I wasn't able to find Chinese or Korean encoding settings, so they are not included. But CasualConc cannot handle 2-byte characters properly (in Concordance) anyway, so this shouldn't be an issue. I haven't really tested all the encodings, so if someone happens to find this blog and would like to try, let me know. I might add a download link to the CasualConc site soon (hopefully).
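For anyone who would rather convert files to UTF-8 before loading them, here is a minimal sketch in plain Ruby. It is not how CasualConc does it (CasualConc uses Objective-C methods through RubyCocoa); it assumes a newer Ruby with built-in encoding support (1.9 or later), and the table and method names (`RUBY_ENCODING_NAMES`, `to_utf8`, `read_as_utf8`) are made up for illustration.

```ruby
# Hypothetical table: encoding names as listed in the post, mapped to the
# names Ruby's Encoding class uses. This is an illustration only, not
# CasualConc's actual code.
RUBY_ENCODING_NAMES = {
  "UTF-16"          => "UTF-16",       # assumed to start with a byte order mark
  "Windows Latin 1" => "Windows-1252",
  "Windows Latin 2" => "Windows-1250",
  "Mac Roman"       => "macRoman",
  "Shift-JIS"       => "Shift_JIS",
  "EUC-JP"          => "EUC-JP",
  "ISO-2022-JP"     => "ISO-2022-JP"
}.freeze

# Interpret raw bytes in the named encoding and return a UTF-8 string.
def to_utf8(raw_bytes, label)
  name = RUBY_ENCODING_NAMES.fetch(label) do
    raise ArgumentError, "unsupported encoding: #{label}"
  end
  # dup so the caller's string is not relabeled in place
  raw_bytes.dup.force_encoding(name).encode("UTF-8")
end

# Read a whole file as bytes and convert it to UTF-8.
def read_as_utf8(path, label)
  to_utf8(File.binread(path), label)
end
```

For example, `read_as_utf8("corpus.txt", "Shift-JIS")` would return the text of a hypothetical Shift-JIS file as a UTF-8 string, which could then be written out as a plain text file.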

Also experimentally implemented is support for other file formats. This still returns errors from time to time. Personally, all my corpus files are in plain text, so this is not for myself, but I just thought someone might be interested. The problem is that no one is checking this blog or the CasualConc site, so I highly doubt anyone will even use this function. By the way, the added file formats are .doc, .docx, .rtf, .rtfd, and .pdf.

Well, I really hope someone will try CasualConc, though...

Sunday, April 13, 2008

WtoTconvUtil

This is another utility program I wrote in Ruby + RubyCocoa. What you can do with this program is very simple: extract text from a web page. In fact, I'm not sure how useful this is, but I just wanted to experiment. This is based on PtoTconvUtil, a PDF to text converter utility program. After I wrote that, I wondered how I could extract text from a web page, so I experimented for a while, but couldn't figure it out. But last night, when I was not able to think about my paper, I found a clue on a web site, and then spent 15-20 minutes figuring out how.

This program still has many issues because the browser part is simply made with Cocoa bindings, which means no scripting. I only wrote scripts to extract text and save it as a text file. But thanks to built-in Cocoa functions, it recognizes a web address in a text box (though you always need to type "http://"), the reload, forward, and back buttons work, and it accepts Safari Webarchive files and HTML files by drag and drop. I also found that this program can extract text from a PDF file that is displayed in the browser with a plug-in. So if you know the web address of a text-embedded PDF file, you can show it in the browser box and extract the text. Yes, this is very good, but because I just use built-in functions, it's not flexible. I want to add a function to read bookmarks from other browsers, so you don't have to type an address every time. It might be easier to read Safari bookmarks, so I might try that first, though I'm not sure when that will happen.

Anyway, this program helps you compare the original and the extracted text in one window. So if you build a corpus from web pages, you can either extract the entire text or simply copy and paste a part of it.

So again, if you somehow found this page and are reading this, AND if you use Leopard, try it and let me know what you think.

EDIT: This program has been discontinued and integrated into CasualTextractor, which is available on the CasualConc site under Utility Programs.

Saturday, April 12, 2008

CasualMecab

is the name I gave to a utility program that is based on MeCab. What this program does is POS/morphological analysis of Japanese text. At this moment, it simply produces MeCab output. The choices are MeCab output, Chasen-like output, wakachi-gaki (words with spaces in between), and yomi (in katakana). The output can be saved as a text file. I want to add other output formats, but probably not in the near future. This program can also handle batch processing, although I haven't tested it extensively. The output file is encoded in UTF-8, mainly because that's what CasualConc can handle. I want to add a Japanese concordancing feature to CasualConc in the future. If anyone ever finds this blog and is interested, please go to the CasualConc site and download it. By the way, this program requires MeCab and MeCab-Ruby. The instructions for installing these are also on the CasualConc site. The installation is not simple (you need to use Terminal and the command line), but the instructions are step-by-step, so I hope anyone can follow them. As always, this is a Leopard-only program and free.
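The four output choices above roughly correspond to MeCab's output-format option. Here is a minimal sketch of how they could be selected through MeCab-Ruby. It is not CasualMecab's actual code: the names `MECAB_OUTPUT_OPTIONS`, `mecab_option`, and `analyze` are made up for illustration, and it assumes the installed dictionary defines the `chasen`, `wakati`, and `yomi` formats (as the IPA dictionary does).

```ruby
# Hypothetical names for the four output choices, mapped to the argument
# string passed to MeCab. An empty string means MeCab's default output.
MECAB_OUTPUT_OPTIONS = {
  mecab:   "",
  chasen:  "-Ochasen",   # Chasen-like output
  wakachi: "-Owakati",   # wakachi-gaki: words separated by spaces
  yomi:    "-Oyomi"      # readings in katakana
}.freeze

# Look up the MeCab argument string for an output choice.
def mecab_option(choice)
  MECAB_OUTPUT_OPTIONS.fetch(choice) do
    raise ArgumentError, "unknown output choice: #{choice}"
  end
end

# Run MeCab on the text and return its output as a string.
# Requires MeCab and MeCab-Ruby to be installed.
def analyze(text, choice = :mecab)
  require "MeCab" # loaded here so the table above works without MeCab
  MeCab::Tagger.new(mecab_option(choice)).parse(text)
end
```

With MeCab installed, `analyze(text, :wakachi)` should return the input with spaces between words. The encoding of the result depends on how the dictionary was built, so a UTF-8 dictionary is assumed if the output is meant for CasualConc.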

Friday, April 11, 2008

MeCab-Ruby

I finally found a way to successfully install MeCab (a Japanese parser) and MeCab-Ruby, the Ruby binding for MeCab, on Leopard. I added a page about this to the CasualConc web site. It's only in Japanese at this moment because I'm not sure how many people actually check the site and how many of the very limited visitors are interested in installing MeCab-Ruby on their Leopard machines. If anyone is interested, I can translate the page into English, but there are probably many better sites somewhere.

But now that I have installed it, I might add Japanese concordancing features to CasualConc, if I ever have time. At least, I can try it now. Also, if people can understand how to install MeCab-Ruby on their computers, I might add a parsing feature (Japanese) to CasualConc, assuming people are willing to install it on their own. But I'll probably first work on a GUI for MeCab-Ruby to create wakachi-gaki files or syntactically parsed files. But when do I have time???