Tuesday, April 29, 2008

Documentation

Over the last couple of weeks, I've worked on the documentation for CasualConc. It now covers most of the basic functions. I also started a more step-by-step set of instructions with a lot of images and named it Getting Started with CasualConc. So far, I have only finished basic file management and database creation along with the KWIC concordance function, which I think will be the most frequently used function (it is for me).

Now, my only hope is that someone will find the site or this blog and start using it. Somehow, I can't find the CasualConc main site on Google. It doesn't show up in the results. When I add a post to this blog, the post shows up for the next 20 hours or so and then disappears. Well, maybe I should add one post per day until some more people find this blog and CasualConc.

If you happen to find this blog, please try it (if you use Leopard) or tell your friends who use Leopard to try it. I know it still has bugs and a lot of limitations, but I really want other people's opinions to improve it (it serves most of my current uses, so I don't have much motivation to make a lot of changes). Well, even if I hear from people, I might not be able to work on it for a while, but at least it's good to hear from them, especially if they like it.

Monday, April 28, 2008

A quick fix

I'm almost certain nobody has downloaded CasualConc since my last post. But anyway, I accidentally introduced a bug into the database creation function. It was caused by the new tag-deletion code, which I forgot to apply to the database creation part. So if you find that the database creation function does not work properly (the bug crashes CasualConc), please go to the site and download the latest beta.

If you find any other bugs, please report them to me. The email address is on the main site. Or you can leave a comment on this blog.

Thanks!!

Sunday, April 27, 2008

A few more fixes

I found a few minor bugs which I introduced with the last changes, so I fixed them. I also found that the default font of CasualConc, Courier, is not monospace in Greek, which is the language my very first user is testing CasualConc with (I guess), so I added a function to select Courier or Courier New, which is monospace in Greek.

I didn't mention this in the last post, but I also made some changes to the Concord code, which only improved speed by about 2-3%.

Now I hope more people find this blog or the main site and test CasualConc. So if you happen to find this blog or the main site and you know someone who uses Leopard and is interested in corpus analysis, please tell him/her to test CasualConc, even if you don't use Mac OS X Leopard. If you do, please try it!!

Saturday, April 26, 2008

A minor update

As I reported a couple of days ago, at least a few people around the world are testing CasualConc and I've already got a report of a bug... This very minor update is partly based on the report, which I don't really know the source of, and a minor change to a setting.

What I found was an inconsistency in the handling of special characters, such as curly quotes or curly apostrophes. These look like single-byte characters on web pages or in Word documents, but they are in fact multi-byte characters in UTF-8 (three bytes each), so CasualConc replaces them with a single-byte quote or apostrophe. Recently I added .doc and .pdf support, and these documents often contain their own special characters (arrows, etc.). I only replaced these in some parts and not others, which caused the inconsistency. Now I think I have applied the same rule to all the tools, but I'm not sure.
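
For illustration, the kind of replacement described above could look something like this in Ruby. This is only a sketch, not the actual CasualConc code: the constant names, the method name `normalize_quotes`, and the exact set of characters in the table are assumptions. The point is that one shared table, used by every tool, avoids the inconsistency.

```ruby
# Hypothetical replacement table: curly quotes/apostrophes mapped to ASCII.
# The characters CasualConc actually replaces are not listed in this post.
QUOTE_MAP = {
  "\u2018" => "'",  # left single quotation mark
  "\u2019" => "'",  # right single quotation mark (curly apostrophe)
  "\u201C" => '"',  # left double quotation mark
  "\u201D" => '"'   # right double quotation mark
}.freeze

# One pattern that matches any key in the table.
QUOTE_PATTERN = Regexp.union(QUOTE_MAP.keys)

# Return a copy of text with every curly quote replaced by its ASCII form.
def normalize_quotes(text)
  text.gsub(QUOTE_PATTERN, QUOTE_MAP)
end

puts normalize_quotes("\u201CIt\u2019s fine\u201D")
```

Calling the same method from every tool (concordance, word count, database creation, and so on) keeps the rule in one place.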

Another change is that I dropped ASCII mode in concordance. In the Concordance Preferences, CasualConc has 4 ways to handle texts. Originally, the two European-language modes were ASCII and With Accented Characters. The former assumed the corpus files do not contain any multi-byte characters (in UTF-8). The latter assumed only a few accented (multi-byte) characters in a context (the context words shown in the concordance result table). But after the very first person downloaded CasualConc, I realized that he uses Greek, which, I think, consists almost entirely of multi-byte characters in UTF-8. So the two new modes for European languages are A and B. A is the same as the previous With Accented Characters mode, and B is for languages written entirely in multi-byte characters, though it still assumes the text does not contain many of the 3-byte characters used in East Asian languages. If the text contains many 3-byte characters, such as Japanese characters, which appear as 2-byte characters on screen but are processed as 3-byte characters in UTF-8, CasualConc might not be able to display the concordance results or the full context view properly. If there are any languages that are full of 2- and 3-byte characters, let me know. I'll see what I can do.
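
To make the byte counts concrete, here is a small Ruby sketch that counts how many characters in a string take 1, 2, or 3 bytes in UTF-8. The method name `utf8_byte_profile` is made up for this example, and this is not how CasualConc chooses a mode; it only shows why English, Greek, and Japanese text behave differently.

```ruby
# Count the characters of a UTF-8 string by how many bytes each one takes.
# Returns a hash such as {1 => 3, 2 => 1} (byte size => number of characters).
def utf8_byte_profile(text)
  counts = Hash.new(0)
  text.each_char { |ch| counts[ch.bytesize] += 1 }
  counts
end

p utf8_byte_profile("cafe")                            # ASCII only
p utf8_byte_profile("caf\u00E9")                       # one accented character
p utf8_byte_profile("\u03BB\u03CC\u03B3\u03BF\u03C2")  # a Greek word
p utf8_byte_profile("\u65E5\u672C\u8A9E")              # a Japanese word
```

A text whose profile is mostly 1-byte characters with a few 2-byte ones fits mode A, while a text that is almost all 2-byte characters, like Greek, is the case mode B is meant for.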

By the way, I decided to add a 'Getting Started' section to the CasualConc site. The current 'How to use' is more like a manual or a list of the functions CasualConc has, so it's not really a how-to. So far the site only has a basic file handling entry, of the 'how to select files for your analysis' type. I'll try to add more when I can find time.

Anyway, the current version is 0.9.4. If it gets up to 0.9.9 and is still not ready for version 1, I might go for 0.9.9.1..., but if enough people use it and it does not have any major problems, I might call it Version 1.

Thursday, April 24, 2008

Old stuff

Now that I've learned I can add an HTML page with JavaScript to the Google Page Creator site by simply uploading it as a file and linking to it, I added an old JavaScript-based concordancer/word counter to the CasualConc site. This is probably useless for other people and I'm not sure if I need it on the site, but I just wanted to keep it somewhere, and because this old script is the basis of CasualConc, I think it's the right place (for me).

I wrote this script about 2 years ago when I was playing with JavaScript, which I had just started to learn a few months earlier. Before that, I only knew MS-BASIC. When I started to play with JavaScript, I figured the best way to learn it was to write something with it. I first wrote a few scripts for a colleague at work to automate a repetitive task. Then I wanted to do something for myself. I had always wanted to do something with corpus linguistics. I found a few sites that did it with JavaScript and many scripts in Perl. With trial and error, and a lot of revisions, this JavaScript page was written. The page says it's version 6, but the script version is 61 (it's in the file name of the script file).

But then I learned the limitations of JavaScript as a tool for corpus analysis. I tried Perl because that seemed to be what everyone used (and a lot of people still use it for text analysis), but somehow it didn't appeal to me (or I wasn't, and still am not, smart enough to learn it). Then a year later, I used Ruby for something at work and somehow I liked it (I still do). I didn't know about Python, which I learned about when I was learning Ruby. Another big plus was that because Ruby was originally, and still is, developed mainly by Japanese people, I found a lot of documentation in Japanese. This and the inclusion of RubyCocoa in Leopard are why CasualConc exists now. I think I wrote something like this in the very first post on this blog, but anyway, it's fun to use Ruby though my scripts are still primitive. I hope I can learn more about Ruby and improve CasualConc. What I need is time, but right now I have to spend more of it on other, more important stuff...

Wednesday, April 23, 2008

CasualConc launched!!

I finally found someone who got interested in using CasualConc!! I was just surfing the web looking for info on concordancing on the Mac. Though I'm developing CasualConc, if I can find a better, more flexible concordancer for the Mac, I'd be happy to use it. The only problem would be that I couldn't make the changes I want. Anyway, I found a blog that was describing the poor concordance software situation on the Mac, so I posted a comment, and he replied to it and wrote that he had downloaded CasualConc. He wrote that he would post his impressions of CasualConc on his blog, so I'm really looking forward to it and at the same time I'm a bit nervous. I think CasualConc in its beta state works OK for my casual use right now (mostly searching for collocations of words I want to use in my paper). With database mode, it's fast enough to use regularly. And because I wrote the program, I know how to use it, but I'm wondering how easy or difficult CasualConc is for others. I've been adding content to the documentation. But at some point, I might need to work on step-by-step instructions on how to use it. Well, this will only happen if more people are interested and start using CasualConc.

If you ever find CasualConc and use it, any comments are welcome!

Monday, April 21, 2008

Japanese Support

This will probably be the last update for a while. While I was stuck on the idea for my dissertation, I simply spent some time here and there over the last few days making minor fixes and feature enhancements to CasualConc. As I posted in the last couple of days, I finally made the download page available to the public, though I'm not sure how many people will find it, and added support for several different character encodings and file types. Finally, I added very limited Japanese support.

Now CasualConc can read Japanese (and possibly other East Asian language) files in two formats. One is a plain format without any spaces between words. The other is wakachi-gaki, which has a 1-byte space between words. Wakachi-gaki files can be created with the jparser utility program. To analyze Japanese texts, the proper mode should be selected in the preferences. Select Japanese (plain) under Concord options in the preferences for the former and Japanese (wakachi) for the latter. If the proper mode is not selected, CasualConc cannot search for words/characters. Wildcard search is implemented, but not tested thoroughly. Because of the way wakachi-gaki is written, a 1-byte space should be inserted between words in a phrase search. Because this is also experimental, CasualConc might crash when you try to analyze Japanese text. Japanese is only available in Text Mode. Once the features are set, I will add database file support.

If you happen to find this blog or CasualConc page and are willing to try, please do so and let me hear what you think.

Sunday, April 20, 2008

CasualConc open to public

Well, I finally decided to make CasualConc public. This just means I added a link to the download page, which was already active, to the home page. I highly doubt anyone is visiting the CasualConc site, so this doesn't make much difference, but I'm hoping somehow someone might find the page and try it. When I googled, it didn't come up, so the only way to find the page is from a bbs (or usergroup?) post I wrote a while ago (which I mistakenly posted multiple times because I thought I was editing my post, but it turned out I was posting it again each time...) or from a link on my personal schedule page at my work. Or possibly from this blog, if it is googlable.

Anyway, because I don't get any feedback on existing features, I decided to work on something not currently implemented: Japanese (or Asian language) support. This is going to be highly experimental and I don't have much time now, so I can't tell when I will release it. So far, I can display KWIC results for Japanese text in plain format (no spaces) and wakachi-gaki format (space-separated). The former can be sorted by L5-R5 context characters and the latter can be sorted by L5-R5 context words (or whatever the separated units are). In the future (only for Japanese), I want to include MeCab (which you need to install following the instructions on the CasualConc page) to process plain texts, but this won't happen in the near future.
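
As a rough illustration of the difference between the two formats, here is a minimal Ruby sketch of a KWIC search that treats plain text as a sequence of characters and wakachi-gaki text as a sequence of space-separated units. It is not the CasualConc implementation: the method names, the `:plain`/`:wakachi` mode symbols, and the hash layout are all made up for this example, and it only sorts by the first unit to the right (R1).

```ruby
# Split text into context units: characters for plain text,
# space-separated tokens for wakachi-gaki text.
def units_for(text, mode)
  mode == :wakachi ? text.split(" ") : text.chars
end

# Return one hash per hit, with up to `span` units of left and right context.
# The keyword is split the same way as the text, so a phrase in wakachi mode
# must be typed with spaces between its words.
def kwic(text, keyword, mode: :wakachi, span: 5)
  units = units_for(text, mode)
  key = units_for(keyword, mode)
  return [] if key.empty?

  hits = []
  (0..units.length - key.length).each do |i|
    next unless units[i, key.length] == key
    left_start = [i - span, 0].max
    hits << {
      left: units[left_start...i],
      key: key,
      right: units[i + key.length, span]
    }
  end
  hits
end

# Sort hits by the first unit to the right of the keyword (R1).
def sort_by_r1(hits)
  hits.sort_by { |hit| hit[:right].first.to_s }
end

text = "the cat sat on the mat and the dog sat"
sort_by_r1(kwic(text, "the")).each do |hit|
  puts "#{hit[:left].join(' ')} [#{hit[:key].join(' ')}] #{hit[:right].join(' ')}"
end
```

The English sample only stands in for space-separated text; a wakachi-gaki file would go through the same `:wakachi` path, while `mode: :plain` gives character-by-character context for text without spaces.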

If you ever find this blog and use Mac OS X Leopard and are interested in corpus analysis, check CasualConc and let me know what you think. The link to the CasualConc site is on the right or click this link.

Saturday, April 19, 2008

Another file format support update in CasualConc

After I updated CasualConc last night, I realized I could add HTML and WebKit Webarchive support. So I added these two to the supported file formats. Now it can read various kinds of files that contain text. But the process will be slower than with plain text files. So if speed is important, convert the files to plain text. If you want faster searches, then create a database file from plain text files. I will try to write a utility program to convert CasualConc-supported files to UTF-8 plain text files.

Well, it's kind of sad to keep writing blogs knowing nobody is reading...

Friday, April 18, 2008

CasualConc update

With my recent discoveries about file handling on the Objective-C side (or the RubyCocoa side?), I decided to add a new feature to CasualConc. Originally, CasualConc was only able to handle UTF-8 or ASCII encoded files because I used Ruby's file handling methods. Now I have switched to Objective-C methods, so it can handle a few more encodings. The added encodings are UTF-16, Windows Latin 1, Windows Latin 2, Mac Roman, Shift-JIS, EUC-JP, and ISO-2022-JP. The last three encodings are all for Japanese. These are limited to the ones Objective-C can handle by default. I wasn't able to find Chinese or Korean encoding settings, so they are not included. But CasualConc cannot handle 2-byte characters properly (in Concordance) anyway, so this shouldn't be an issue. I haven't really tested all the encodings, so if someone happens to find this blog and would like to try, let me know. I might add a link to the download page to the CasualConc site soon (hopefully).
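
For comparison, here is how the same kind of conversion could be sketched in plain Ruby with `String#encode`. This is not what CasualConc does (it uses the Objective-C methods through RubyCocoa, as described above), and it needs a Ruby version with built-in encoding support. The table that maps the labels in this post to Ruby encoding names, and the method names, are assumptions for this example.

```ruby
# Hypothetical table: encoding labels from this post => Ruby encoding names.
ENCODINGS = {
  "UTF-16"          => "UTF-16",
  "Windows Latin 1" => "Windows-1252",
  "Windows Latin 2" => "Windows-1250",
  "Mac Roman"       => "macRoman",
  "Shift-JIS"       => "Shift_JIS",
  "EUC-JP"          => "EUC-JP",
  "ISO-2022-JP"     => "ISO-2022-JP"
}.freeze

# Interpret raw bytes in the labelled encoding and convert them to UTF-8.
# Raises KeyError for an unknown label, and an Encoding error if the bytes
# are not valid in the chosen encoding.
def decode_to_utf8(bytes, label)
  source = ENCODINGS.fetch(label)
  bytes.dup.force_encoding(source).encode("UTF-8")
end

# Read a whole file as raw bytes and return its content as a UTF-8 string.
def read_as_utf8(path, label)
  decode_to_utf8(File.binread(path), label)
end

puts decode_to_utf8("caf\xE9".b, "Windows Latin 1")
```

With a real file, the call would look like `read_as_utf8("sample_sjis.txt", "Shift-JIS")`, where the file name is only a placeholder.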

Also experimentally implemented is support for other file formats. This still returns an error from time to time. Personally, all my corpus files are in plain text, so this is not for myself, but I just thought someone might be interested. The problem is that no one is checking this blog or the CasualConc site, so I highly doubt anyone will even use this function. By the way, the added file formats are .doc, .docx, .rtf, .rtfd, and .pdf.

Well, I really hope someone would try to use CasualConc, though...

Sunday, April 13, 2008

WtoTconvUtil

This is another utility program I wrote in Ruby + RubyCocoa. What you can do with this program is very simple: extract text from a web page. In fact, I'm not sure how useful this is, but I just wanted to experiment. This is based on PtoTconvUtil, a PDF to text converter utility program. After I wrote that, I wondered how I could extract text from a web page, so I experimented for a while, but couldn't figure it out. But last night, when I was not able to think about my paper, I found a clue on a web site, and then it took me 15-20 minutes to figure out how.

This program still has many issues because the browser part is simply made with Cocoa bindings, which means no scripting. I only wrote scripts to extract text and save it as a text file. But thanks to built-in Cocoa functions, it recognizes a web address in a text box (though you always need to type "http://"), the reload, forward and back buttons work, and it accepts Safari Webarchive files and HTML files by drag & drop. I also found that this program can extract text from a PDF file which is displayed in the browser with a plug-in. So if you know the web address of a text-embedded PDF file, you can show it in the browser box and extract the text. Yes, this is very good, but because I just use built-in functions, it's not flexible. I want to add a function to read bookmarks from other browsers, so you don't have to type an address every time. It might be easier to read Safari bookmarks, so I might try that first, though I'm not sure when that will happen.

Anyway, this program helps you compare the original and the extracted text in one window. So if you build a corpus from web pages, you can either extract the entire text or simply copy and paste a part of it.

So again, if you somehow found this page and are reading this, AND if you use Leopard, try it and let me know what you think.

EDIT: This program is discontinued and integrated into CasualTextractor, which is available on CasualConc site under Utility Programs.

Saturday, April 12, 2008

CasualMecab

is the name I gave to a utility program that is based on MeCab. What this program does is POS/morphological analysis of Japanese text. At this moment, it simply produces MeCab output. The choices are MeCab output, Chasen-like output, wakachi-gaki (words with spaces in between), and yomi (in katakana). The output can be saved as a text file. I want to add other output formats, but probably not in the near future. This program can also handle batch processing, although I haven't tested it extensively. The output file is encoded in UTF-8, mainly because that's what CasualConc can handle. I want to add a Japanese concordancing feature to CasualConc in the future. If anyone ever finds this blog and is interested, please go to the CasualConc site and download it. By the way, this program requires MeCab and MeCab-Ruby. The instructions for installing these are also on the CasualConc site. The installation is not simple (you need to use Terminal and the command line), but the instructions are step-by-step. I hope anyone can understand them. As always, this is a Leopard-only program and free.
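
To show how the wakachi-gaki and yomi outputs relate to the MeCab output, here is a small Ruby sketch that works on MeCab-style text without calling MeCab itself. It is not the CasualMecab code. The sample lines are hand-written, and they assume the default output format with the IPA dictionary (the surface form, a tab, then comma-separated features with the reading in the eighth field, and "EOS" at the end of a sentence). The method names are also made up for this example.

```ruby
# Hand-written sample in the assumed default MeCab (IPA dictionary) format.
SAMPLE_OUTPUT = [
  "猫\t名詞,一般,*,*,*,*,猫,ネコ,ネコ",
  "が\t助詞,格助詞,一般,*,*,*,が,ガ,ガ",
  "好き\t名詞,形容動詞語幹,*,*,*,*,好き,スキ,スキ",
  "EOS"
].join("\n")

# Turn MeCab-style output into an array of token hashes.
# Empty lines and the "EOS" marker are skipped.
def parse_mecab_output(output)
  lines = output.each_line.map(&:chomp)
  lines.reject { |line| line.empty? || line == "EOS" }.map do |line|
    surface, feature = line.split("\t", 2)
    fields = feature.to_s.split(",")
    { surface: surface, pos: fields[0], reading: fields[7] }
  end
end

# Wakachi-gaki: surface forms separated by a 1-byte space.
def to_wakachi(tokens)
  tokens.map { |token| token[:surface] }.join(" ")
end

# Yomi: katakana readings, falling back to the surface form if none is given.
def to_yomi(tokens)
  tokens.map { |token| token[:reading] || token[:surface] }.join
end

tokens = parse_mecab_output(SAMPLE_OUTPUT)
puts to_wakachi(tokens)
puts to_yomi(tokens)
```

With other dictionaries the feature fields can be in a different order, so the index used for the reading would need to be adjusted.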

Friday, April 11, 2008

MeCab-Ruby

I finally found a way to successfully install MeCab (a Japanese parser) and MeCab-Ruby, the Ruby binding for MeCab, on Leopard. I added a page about this to the CasualConc web site. It's only in Japanese at this moment because I'm not sure how many people actually check the site and how many of the very limited number of visitors are interested in installing MeCab-Ruby on their Leopard machines. If anyone is interested, I can translate the page into English, but there are probably many better sites somewhere.

But now that I have installed it, I might add Japanese concordancing features to CasualConc, if I ever have time. At least, I can try it now. Also, if anyone can understand how to install MeCab-Ruby on their computer, I might add a parsing feature (Japanese) to CasualConc, assuming people are willing to install it on their own. But I'll probably first work on a GUI for MeCab-Ruby to create wakachi-gaki files or syntactically parsed files. But when will I have time???