Monday, September 28, 2009

CasualPConc 0.7 and a new application

I made a few changes and fixed a few bugs in CasualPConc, a simple parallel concordancer for Mac OS X. It is a little more stable now. I also worked on the documentation, which now covers most of the features.

The new application, CasualMultiPConc, is based on CasualPConc. When I first released CasualPConc, someone asked if I would make it handle more than two corpora. This is kind of my answer to that. CasualMultiPConc has limited features, but it can handle up to 5 parallel corpora.

This new application simply does KWIC concordancing of up to 5 parallel corpora. The future of this application is up to users (if there are any). I don't have any experience in parallel corpus analysis and I don't need to use it right now, so I'm not even sure how well this works. I have only tested it with small corpora. If you are interested in testing it, any feedback is welcome. This one is also for Mac OS X 10.5 Leopard or later, though I only tested it on 10.6 Snow Leopard.
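For readers new to the idea, here is a minimal Python sketch of a KWIC search over parallel corpora. It is not CasualMultiPConc's code. It assumes the corpora are aligned line by line (line n of each file is a translation of line n of the first), and the function name `parallel_kwic` and the sample sentences are made up for illustration.

```python
import re

def parallel_kwic(corpora, keyword, width=20):
    """Search the first corpus and return (kwic_line, aligned_lines) per hit.

    corpora: list of lists of strings. corpora[i][n] is ASSUMED to be the
    translation of corpora[0][n] (line-level alignment).
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    hits = []
    for n, line in enumerate(corpora[0]):
        for m in pattern.finditer(line):
            # Fixed-width context on both sides of the keyword.
            left = line[max(0, m.start() - width):m.start()]
            right = line[m.end():m.end() + width]
            kwic = f"{left:>{width}} [{m.group()}] {right:<{width}}"
            # The same line number in every other corpus.
            aligned = [corpus[n] for corpus in corpora[1:]]
            hits.append((kwic, aligned))
    return hits

# Hypothetical sample data: three tiny aligned corpora.
english = ["The cat sat on the mat.", "A dog barked."]
french = ["Le chat était assis sur le tapis.", "Un chien a aboyé."]
german = ["Die Katze saß auf der Matte.", "Ein Hund bellte."]

for kwic, aligned in parallel_kwic([english, french, german], "cat"):
    print(kwic)
    for other in aligned:
        print("    " + other)
```

Only the first corpus is searched here; the other corpora are simply displayed alongside each hit.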

This application and CasualPConc are only on the English site (Main Site link on the right). Both of them are under Other Applications.

Wednesday, September 23, 2009

CasualConc updated, but still not 1.0

I fixed a couple of bugs and added a few features. Yes, I wrote that I wouldn't spend time on East Asian language support, but I somehow figured out how to handle coloring in Concord with the 2-byte character modes (Japanese (plain) and Japanese (wakachi)). Well, it's the middle of a 5-day weekend in Japan...

Bug fixes
- crashed in Word Count when an n-gram list was created in File Mode
- File Information counted word lengths in bytes, not in characters

Feature Improvements
- added two new Word Count sort options: Word Length and Reverse Word Length
- added Character as a search word choice
- full regular expression search
- added a progress bar at the bottom of the main window, though it doesn't indicate actual progress (it only shows that CasualConc is processing your request)
- much better East Asian Language support (Japanese (plain) and Japanese (wakachi))


Along with better East Asian Language support, I made a few other changes. The two new Word Count sort options go with File Information. File Information tells you how many words have a given number of characters/letters, and the new sort options let you check which words are the longest or shortest.
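The bytes-versus-characters bug and the two sort orders can be illustrated with a short Python sketch. This is not CasualConc's code. The word list is made up, UTF-8 is used only as an example encoding, and breaking ties alphabetically is an assumption.

```python
# Hypothetical word list mixing English and Japanese.
words = ["concordance", "語", "corpus", "コーパス", "a"]

# Length in characters vs. length in UTF-8 bytes: the two differ for
# Japanese, which is why counting bytes gives wrong word lengths.
for w in words:
    print(w, len(w), len(w.encode("utf-8")))

# Word Length: shortest first (ties broken alphabetically, an assumption).
by_length = sorted(words, key=lambda w: (len(w), w))

# Reverse Word Length: longest first.
by_reverse_length = sorted(words, key=lambda w: (-len(w), w))

print(by_length)
print(by_reverse_length)
```

For example, "コーパス" is 4 characters long but takes 12 bytes in UTF-8.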

The new search word mode, Character, is for searching for the characters that are otherwise used for wildcard search. Now you can search for * ? ! and other non-word characters.

The change in regular expression search is that, before this update, all regular expressions were word level. In other words, the actual regular expression processed inside CasualConc was wrapped in \b ... \b (word boundaries). Now this limitation has been lifted. So if you want the same results as before, simply put \b before and after the regular expression.
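The difference can be seen in a small Python sketch (Python's `re` module, not CasualConc's own regular expression engine). The helper `word_level` is hypothetical, and the non-capturing group it adds is an extra precaution so that alternations such as `cat|each` stay inside the boundaries.

```python
import re

def word_level(pattern):
    """Wrap a pattern in word boundaries, mimicking the old behaviour."""
    return r"\b(?:" + pattern + r")\b"

text = "the cat will concatenate each category"

# New behaviour: the pattern may match anywhere, even inside words.
print(re.findall(r"cat", text))            # three matches

# Old behaviour: only whole words match.
print(re.findall(word_level("cat"), text))  # one match
```

Here the unrestricted pattern also finds "cat" inside "concatenate" and "category", while the word-level version only finds the standalone word.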

The progress bar added to the main window only shows that CasualConc is processing. It doesn't show how much processing has been done.

East Asian Language support is much better now. Context word coloring has been added, and you can now use the database mode. But because of the nature of the texts (no spaces between words), some of the functions behave differently. More detailed information about East Asian Language support is documented on the site (only on the English site at the moment).

Now I will finally focus on bug fixes and minor changes.  I won't make any more major changes before 1.0, or so I think...

The documentation for CasualConc and CasualTagger has been updated. It should reflect the latest versions.

I'd appreciate it if you could report any bugs as soon as you find them.

Sunday, September 20, 2009

CasualConc update

Just because I wanted to get file information, I added it to CasualConc. 

New feature
- File Information

It is very basic and only returns the number of types, the number of tokens, the type-token ratio, and the number of n-letter words.
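As a rough illustration of these figures, here is a minimal Python sketch. It is not CasualConc's implementation. The tokenizer (`\w+` on lowercased text) and the choice to count n-letter words over tokens rather than types are both assumptions, and the function name `file_information` is made up.

```python
import re
from collections import Counter

def file_information(text):
    """Return basic counts for a text: tokens, types, TTR, n-letter words."""
    # ASSUMPTION: a token is any run of word characters, case-insensitive.
    tokens = re.findall(r"\w+", text.lower())
    types = set(tokens)
    # ASSUMPTION: n-letter words are counted over tokens, not types.
    length_counts = Counter(len(t) for t in tokens)
    return {
        "tokens": len(tokens),
        "types": len(types),
        "ttr": len(types) / len(tokens) if tokens else 0.0,
        "n_letter_words": dict(sorted(length_counts.items())),
    }

info = file_information("The cat saw the other cat.")
print(info)
```

For this sample sentence there are 6 tokens and 4 types, so the type-token ratio is 4/6.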

Improved
- Fisher's Exact Test calculation speed

I also rewrote the algorithm for Fisher's Exact Test and it is much, much faster, though I'm sure no one has used this since I added it last time. But I haven't fully tested this, so I'd appreciate it if anyone could test its accuracy.
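For anyone who wants something to compare against when testing, here is one common way to compute a two-sided Fisher's exact p-value for a 2x2 table, sketched in Python. This is not the algorithm used in CasualConc, which is not described here. It works in log space with `math.lgamma` to avoid huge factorials, and the function names are made up.

```python
from math import exp, lgamma

def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    log_denom = log_choose(row1 + row2, col1)

    def log_p(x):
        # Log-probability of a table whose top-left cell is x.
        return log_choose(row1, x) + log_choose(row2, col1 - x) - log_denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    log_observed = log_p(a)
    tolerance = 1e-7  # guards against floating-point ties
    total = sum(
        exp(log_p(x))
        for x in range(lowest, highest + 1)
        if log_p(x) <= log_observed + tolerance
    )
    return min(1.0, total)

print(fisher_exact_two_sided(3, 1, 1, 3))  # 34/70, about 0.486
```

Other definitions of the two-sided p-value exist, so small differences from another program do not necessarily mean that either one is wrong.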

Now the version is 0.9.9.9. Well, it's almost 1. If I can't find any other bugs when I use it in the next couple of weeks and I don't get any bug reports, I'll simply make it 1.0. I'm sure it will still have bugs even if I make it 1.0, but that's the nature of computer programs.

So, my decision for now is that I won't spend any more time on East Asian Language support because I haven't heard from anyone who uses CasualConc with East Asian languages. I personally don't use CasualConc even with Japanese, so I don't see any necessity. I'll work on this later if I have time, but I'll focus more on other programs once this hits 1.0. So if you use this with East Asian languages and would like to have better support for them, please let me know.

If I think of other nice features, I'll probably add them to other programs first and see whether they would work with CasualConc before I add them to it. In fact, I added File Info to CasualTagger and it wasn't too complicated, so I decided to add it to CasualConc (well, I wanted this feature, but hadn't tried to write the scripts).

Anyway, if you use CasualConc or other programs, please, please let me know what you think. 

Thursday, September 17, 2009

CasualConc and CasualTagger updates

Tonight, I uploaded newer versions of CasualConc and CasualTagger.

CasualConc's update is minor.

New features:
- Fisher's Exact Test in Collocation Stats Calculators (experimental)
- Calculator for 2x2 contingency table

The first one was added at the request of someone who kindly checked the accuracy of the stats calculations. Thank you, Sebastian! It looks like most of the stats are reasonably accurate. Anyway, because the calculation of Fisher's p-value is CPU intensive (esp. with large N), I made it an option. To include Fisher's p-value, go to Preferences -> Other and check 'Include Fisher's Exact Test'. I haven't received any reports on its accuracy, so I'm not sure how well it works. If anyone can test it, I'd appreciate it. The contingency table calculator uses the same formulas as the other stats calculations. It returns Log-Likelihood, chi-square and Fisher's Exact Test (optional). I hope this is useful for someone.
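As a reference point for checking the calculator, here is a Python sketch of the standard textbook formulas for a 2x2 table: Pearson's chi-square without continuity correction and the log-likelihood ratio (G2). Whether CasualConc uses exactly these variants is not stated here, so treat this as an assumption. The function name is made up, and the sketch assumes that no row or column total is zero.

```python
from math import log

def contingency_stats(a, b, c, d):
    """Chi-square and log-likelihood for the 2x2 table [[a, b], [c, d]].

    ASSUMES every row and column total is greater than zero.
    """
    observed = ((a, b), (c, d))
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    n = a + b + c + d

    chi_square = 0.0
    ll_sum = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence.
            expected = rows[i] * cols[j] / n
            o = observed[i][j]
            chi_square += (o - expected) ** 2 / expected
            if o > 0:  # a zero cell contributes nothing to G2
                ll_sum += o * log(o / expected)
    return {"log_likelihood": 2 * ll_sum, "chi_square": chi_square}

print(contingency_stats(10, 20, 30, 40))
```

A table whose rows are exactly proportional, such as [[10, 20], [20, 40]], gives 0 for both statistics.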

The update to CasualTagger also consists of feature enhancements.

Enhanced features:
- word count now works with untagged text ('None' is added to options)
- kwic search for specified word(s)/phrase(s) is available (it was only possible from a word list)
- simple sort in kwic
- word count and kwic with multiple files (optional)
- editor now accepts text files in encodings other than UTF-8 (set in Preferences)
- ignore specified tags or file information (by specifying an end marker/tag) in word count and kwic

Because I made so many changes, there might be many bugs. I'm now trying to tag my own corpus, so I've been making changes to suit my needs. I'll make more changes as I need them, but if you ever try CasualTagger and have nice ideas, please let me know. I'll try to include them if they are not too complicated or if they look useful for my work. I'll also try to update the documentation.


I also checked CasualMecab on Snow Leopard, but it doesn't work. I installed MeCab and MeCab-Ruby on Snow Leopard and they work fine from Ruby scripts (I tested the exact same script). But somehow MeCab-Ruby doesn't work inside an application. I'll try to fix it if I can find a solution.

Sunday, September 13, 2009

CasualPConc update

Somehow, CasualPConc didn't run well on Snow Leopard. This could be because Ruby on Snow Leopard was updated from 1.8.6 to 1.8.7, and this change might have caused errors.

Anyway, I fixed some major bugs. There might be some other bugs with the same cause (related to Array Controller). I also updated the how-to on the site.

If you find any bugs, please let me know.

Saturday, September 5, 2009

CasualConc bug fix

I found a bug in Word Count when I was cleaning up the code for it.

Bug fix
- crashed when creating an n-gram list in Database mode

This bug was introduced when I added a warning message for missing files in the last update.

As for the clean-up, the problem was that in Cluster and Word Count, separate code was written for each table (right and left). This was because of my lack of scripting skill (I still don't have much). I couldn't think of a good way to identify which button was clicked and process the request accordingly. If you know how Cocoa works, this should be obvious, but when I started this project, I had no experience in Cocoa.

The new version is 0.9.9.7. It's almost 1.0, so I'll try to wrap things up and make it 1.0 soon. This means no more major features before 1.0, and I'll focus on bug fixes. But unless I hear from a lot of users about whether it is mostly bug-free or still has many bugs, I'm not confident enough to take it out of beta, though beta simply means (to me) that it hasn't been tested enough. Computer programs will never be bug-free.

Anyway, if you find any bugs, esp. in Word Count and Cluster, please let me know.