Sunday, March 29, 2009

CasualPConc update

I'm almost certain no one has tried it yet, but I spent a little more time adding features to CasualPConc, a new parallel concordancer. The application is available at the CasualConc main site under Utility Programs, but the documentation is not up to date.

I don't think I'm going to make this as fancy as CasualConc, but I'm trying to use many more RubyCocoa (or Cocoa) features (I'm learning...).

CasualPConc originally had just KWIC and word frequency count features. Now it has word cluster and collocation features, just like CasualConc. One feature specific to CasualPConc is finding keywords in the matched corpus after running a KWIC search. When you run a KWIC search, you get the matched portion of the matched corpus (paragraphs or sentences), which includes words that are equivalent or similar to the one you searched for. CasualPConc goes through the matched portion of text and calculates the keyness of each word against the entire corpus. I'm not sure if I've explained this clearly, and I'm also not sure if it works as intended, but it's there.
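
The post does not say which statistic CasualPConc uses for keyness. As a rough sketch only, assuming the log-likelihood measure that is common in corpus tools, the idea could look like this in Python (this is not the application's RubyCocoa code, and the names `log_likelihood` and `keywords` are made up for illustration):

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness for one word.

    a: count of the word in the matched portion
    b: count of the word in the reference corpus
    c: total tokens in the matched portion
    d: total tokens in the reference corpus
    """
    # Expected counts if the word were equally frequent in both texts.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def keywords(matched_tokens, corpus_tokens, top=10):
    """Rank words that are relatively more frequent in the matched portion.

    Assumption: the whole corpus (which still contains the matched
    portion) is used as the reference, as a simplification.
    """
    n_matched = len(matched_tokens)
    n_corpus = len(corpus_tokens)
    if n_matched == 0 or n_corpus == 0:
        return []
    matched = Counter(matched_tokens)
    corpus = Counter(corpus_tokens)
    scores = {}
    for word, a in matched.items():
        b = corpus.get(word, 0)
        # Keep only words that are overrepresented in the matched portion.
        if a / n_matched > b / n_corpus:
            scores[word] = log_likelihood(a, b, n_matched, n_corpus)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top]
```

Given the tokens of the sentences returned by a search and the tokens of the whole corpus, `keywords` returns (word, score) pairs with the highest-scoring words first. Those are the candidates for equivalents of the search word.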

I also added stop word and skip character functions. My understanding is that stop words are words that are very frequent in a language, and eliminating them helps people see what they are looking for more clearly. You can create stop word lists for any number of languages or corpora. The skip character function is for two-byte languages, like East Asian languages or, more specifically, Japanese, because the Japanese characters for period, comma, brackets, etc. are not treated as punctuation by regular expressions. They are treated as regular characters, like letters of the alphabet, so they end up in word lists and context words and contaminate the results. Both functions are experimental and not fully tested. They are separate at the moment, but I might combine them into a single function or list.
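
To make the two filters concrete, here is a minimal sketch in Python. It assumes the text has already been split into tokens. The function name `clean_tokens` and the sample lists are hypothetical, and this is not how CasualPConc itself is written:

```python
def clean_tokens(tokens, stop_words=frozenset(), skip_chars=""):
    """Remove skip characters from tokens, then drop stop words.

    stop_words: a set of lower-case words to leave out of the results.
    skip_chars: a string of characters (e.g. Japanese punctuation)
                to delete from every token.
    """
    # Translation table that deletes every skip character.
    table = str.maketrans("", "", skip_chars)
    cleaned = []
    for token in tokens:
        token = token.translate(table)
        # Skip tokens that became empty or that are stop words.
        if token and token.lower() not in stop_words:
            cleaned.append(token)
    return cleaned
```

For example, calling `clean_tokens(["「犬」", "が", "走る", "。"], stop_words={"が"}, skip_chars="。、「」")` leaves only the content words. Keeping the two lists as separate arguments mirrors the current design, where stop words and skip characters are separate functions.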

If you are interested, please try it and give me some feedback. I personally don't use parallel concordancers much and I don't have good parallel corpora, so I can't really test it. Any feedback is welcome (functionality, usability, bug reports, etc.). The current version is 0.3, but this simply means I have made two major changes/enhancements, with some testing and bug fixing, since version 0.1.

Also, if you use any of my applications, I'd really appreciate your feedback on them.
