Showing posts with label casualconc. Show all posts

Monday, April 21, 2008

Japanese Support

This will probably be the last update for a while. While I was stuck on the idea for my dissertation, I spent some time here and there over the last few days making minor fixes and feature enhancements to CasualConc. As I posted over the last couple of days, I finally made the download page available to the public, though I'm not sure how many people will find it, and added support for several different character encodings and file types. Finally, I added very limited Japanese support.

Now CasualConc can read Japanese (and possibly other East Asian language) files in two formats. One is a plain format without any spaces between words. The other is wakachi-gaki, which has a 1-byte space between words. Wakachi-gaki files can be created with the jparser utility program. To analyze Japanese texts, the proper mode has to be selected in the preferences. Select Japanese (plain) under Concord options in the preferences for the former and Japanese (wakachi) for the latter. If the proper mode is not selected, CasualConc cannot search for words/characters. Wildcard search is implemented, but not tested thoroughly. Because of the way wakachi-gaki is written, a 1-byte space should be inserted between words in a phrase search. Because this is also experimental, CasualConc might crash when you try to analyze Japanese text. Japanese is only available in Text Mode. Once the features are set, I will add database file support.
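To illustrate why the mode matters and why a phrase query needs spaces in wakachi mode, here is a minimal Python sketch. It is not CasualConc's actual code (CasualConc is written with RubyCocoa); the function names `to_units` and `find_phrase` and the mode strings `"plain"` and `"wakachi"` are made up for this example. The assumption is that plain mode treats each character as a search unit, while wakachi mode treats each space-separated chunk as a unit.

```python
def to_units(text, mode):
    """Split text into search units for a (hypothetical) mode name."""
    if mode == "plain":
        # No word boundaries are marked, so every character is its own unit.
        return [ch for ch in text if not ch.isspace()]
    if mode == "wakachi":
        # Words are already separated by 1-byte (ASCII) spaces.
        return [unit for unit in text.split(" ") if unit]
    raise ValueError(f"unknown mode: {mode}")


def find_phrase(units, query_units):
    """Return the start index of every place the query units occur in a row."""
    n = len(query_units)
    if n == 0:
        return []
    return [i for i in range(len(units) - n + 1) if units[i:i + n] == query_units]


wakachi_units = to_units("私 は 本 を 読む", "wakachi")
plain_units = to_units("私は本を読む", "plain")

# In wakachi mode the query must be space-separated like the text itself.
print(find_phrase(wakachi_units, to_units("本 を", "wakachi")))  # -> [2]
print(find_phrase(wakachi_units, to_units("本を", "wakachi")))   # -> []

# In plain mode the same query works without spaces.
print(find_phrase(plain_units, to_units("本を", "plain")))       # -> [2]
```

The second query finds nothing because "本を" becomes a single unit that never appears in the space-separated text, which is the same kind of mismatch you get when the wrong mode is selected.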

If you happen to find this blog or the CasualConc page and are willing to try it, please do so and let me know what you think.

Sunday, April 20, 2008

CasualConc open to public

Well, I finally decided to make CasualConc public. This just means I added a link on the home page to the download page, which was already active. I highly doubt anyone is visiting the CasualConc site, so this doesn't make much difference, but I'm hoping someone might somehow find the page and try it. When I googled it, it didn't come up, so the only way to find the page is from a BBS (or user group?) post I wrote a while ago (which I mistakenly posted several times because I thought I was editing my post, when in fact each attempt went up as a new post...) or from a link on my personal schedule page at my work. Or possibly from this blog, if it is googlable.

Anyway, because I don't get any feedback on existing features, I decided to work on something not currently implemented: Japanese (or Asian language) support. This is going to be highly experimental and I don't have much time now, so I can't tell when I will release it. So far, I can display KWIC results for Japanese text in plain format (no spaces) and wakachi-gaki format (space-separated). The former can be sorted by L5-R5 context characters and the latter by L5-R5 context words (or whatever the separated units are). In the future (only for Japanese), I want to include MeCab (which you would need to install following the instructions on the CasualConc page) to process plain texts, but this won't happen in the near future.
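As a rough picture of what sorting KWIC lines by L5-R5 context means, here is a small Python sketch. It is only an illustration under assumptions, not how CasualConc does it: the names `kwic`, `context_unit`, and `sort_kwic` are hypothetical, the keyword is limited to a single unit, and "units" are characters for plain text or space-separated chunks for wakachi-gaki text.

```python
def kwic(units, keyword, span=5):
    """Collect (left context, keyword, right context) for every hit."""
    rows = []
    for i, unit in enumerate(units):
        if unit == keyword:
            left = units[max(0, i - span):i]     # up to `span` units before the hit
            right = units[i + 1:i + 1 + span]    # up to `span` units after the hit
            rows.append((left, unit, right))
    return rows


def context_unit(row, position):
    """Return the unit at a position such as 'L1' or 'R3' ('' if out of range)."""
    left, _, right = row
    side, distance = position[0], int(position[1:])
    if side == "L":
        # L1 is the unit immediately to the left of the keyword.
        return left[-distance] if distance <= len(left) else ""
    # R1 is the unit immediately to the right of the keyword.
    return right[distance - 1] if distance <= len(right) else ""


def sort_kwic(rows, position):
    """Sort KWIC rows by the context unit at the given position."""
    return sorted(rows, key=lambda row: context_unit(row, position))


# Wakachi-gaki text: units are the space-separated chunks.
word_units = "猫 が 好き 犬 が 嫌い".split(" ")
for left, key, right in sort_kwic(kwic(word_units, "が"), "R1"):
    print(" ".join(left), f"[{key}]", " ".join(right))

# Plain text: units are single characters.
char_units = list("猫が好き犬が嫌い")
for left, key, right in sort_kwic(kwic(char_units, "が"), "L1"):
    print("".join(left), f"[{key}]", "".join(right))
```

The same functions serve both formats because only the way the text is split into units changes. The sort here compares raw code points, which is an assumption of the sketch and not necessarily a natural Japanese ordering.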

If you ever find this blog, use Mac OS X Leopard, and are interested in corpus analysis, check out CasualConc and let me know what you think. The link to the CasualConc site is on the right, or click this link.

Friday, April 18, 2008

CasualConc update

With my recent discoveries about file handling on the Objective-C side (or the RubyCocoa side?), I decided to add a new feature to CasualConc. Originally, CasualConc could only handle UTF-8 or ASCII encoded files because I used Ruby's file handling methods. Now I have switched to Objective-C methods, so it can handle a few more encodings. The added encodings are UTF-16, Windows Latin 1, Windows Latin 2, Mac Roman, Shift-JIS, EUC-JP, and ISO-2022-JP. The last three are all for Japanese. These are limited to the ones Objective-C can handle by default. I wasn't able to find Chinese or Korean encoding settings, so they are not included. But CasualConc cannot handle 2-byte characters properly (in Concordance) anyway, so this shouldn't be an issue. I haven't really tested all the encodings, so if someone happens to find this blog and would like to try, let me know. I might add a link to the download page on the CasualConc site soon (hopefully).
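For readers who want to see what decoding with an explicitly chosen encoding looks like, here is a minimal Python sketch. It does not show the Objective-C methods CasualConc uses; the `CODECS` table and the `decode_corpus_bytes` function are hypothetical names for this example, and the mapping from the encoding names above to Python codec names (for instance Windows Latin 1 to `cp1252` and Windows Latin 2 to `cp1250`) is my assumption about the closest equivalents.

```python
# Display name -> Python codec name (the mapping is an assumption of this sketch).
CODECS = {
    "UTF-8": "utf-8",
    "UTF-16": "utf-16",
    "Windows Latin 1": "cp1252",
    "Windows Latin 2": "cp1250",
    "Mac Roman": "mac_roman",
    "Shift-JIS": "shift_jis",
    "EUC-JP": "euc_jp",
    "ISO-2022-JP": "iso2022_jp",
}


def decode_corpus_bytes(data, label):
    """Decode raw file bytes using the encoding the user selected."""
    try:
        codec = CODECS[label]
    except KeyError:
        raise ValueError(f"unsupported encoding: {label}") from None
    # Raises UnicodeDecodeError if the bytes are not valid in this encoding.
    return data.decode(codec)


# Round trip: the same Japanese text stored in the three Japanese encodings.
sample = "日本語のテキスト"
for label in ("Shift-JIS", "EUC-JP", "ISO-2022-JP"):
    raw = sample.encode(CODECS[label])
    print(label, len(raw), decode_corpus_bytes(raw, label) == sample)
```

The byte lengths differ between encodings even though the text is identical, which is why a file has to be read with the encoding it was saved in; choosing the wrong one either fails or produces garbled text.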

Also experimentally implemented is support for other file formats. This still returns an error from time to time. Personally, all my corpus files are plain text, so this is not for myself, but I just thought someone might be interested. The problem is that no one is checking this blog or the CasualConc site, so I highly doubt anyone will even use this function. By the way, the added file formats are .doc, .docx, .rtf, .rtfd, and .pdf.

Well, I really hope someone will try CasualConc, though...