Sunday, October 25, 2009

CasualConc minor bug fix

I found a bug in CasualConc while working on the beta, which shares the same script.  It's a minor one, in a feature I believe not many people use.

Bug fix
- crashed when creating a database file in Advanced Corpora Database mode using tag deletion.

If you are one of the rare people who use this function, please download the latest version.

Monday, October 19, 2009

CasualConc 1.0 and more...

This post will be a long one.

I've decided to make the latest build of CasualConc version 1.0, mostly because I didn't get any feedback or bug reports (well, I guess not many people ever tried the latest beta, or people don't bother to report bugs...).  Anyway, I made a few changes to it and here's the list.  The bug fixes are very minor.

CasualConc version 1.0

Bug Fix
- timer for File Info now working
- moving tables in Cluster now moves everything, including span and type
- word list import now functions

Enhancement
- creating File Info table is much faster
- progress bar in File mode progresses based on the number of files processed
- now including help files (the same content as you find on the site; English only)


From now on, this version will only be maintained if I get bug reports.  I've already started adding new features and I'm planning to make more fundamental changes to it.  I'll release it as version 1.x some time in the future, but there is no time frame.  If you have any feature requests, please send them to me.  I'll try to add them once I've added what I want, though whether I can add what you want depends on my time and scripting skills.  I might release it as a beta (or alpha) even before it becomes stable if people are interested.  If you are, please let me know.


In addition to this, I've also updated some of my other applications.  I don't know if anyone ever uses them, but I've really started using some of them personally, so I've been trying to include what I want.  Well, the reasons I started to write the applications vary.  Some were written on request and some were just experimental (to try out what I had learned in scripting).  Now I want to make them more like real applications.

Anyway, the updated applications and the details of updates are below.  I don't think people are interested, but these are for the record, so I can keep track of what I've done.


CasualMecab

- Aozora Bunko Kanji substitute handling
- experimental Word List function using Mecab output
- Snow Leopard Support (separate build)

Aozora Bunko Kanji substitute handling picks up a Kanji substitute note, such as ※[#「てへん+劣」、第3水準1-84-77], and replaces it with the real Kanji (this is now possible with Unicode characters).  The Word List function uses Mecab format output and creates a word list from any of the info available (not just the word as it appears in the text).  You can create a word list of base form and part-of-speech combinations, etc.  Snow Leopard Support is just a workaround.  If you use Snow Leopard, you need to download the build made for it.
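
The substitution described above can be sketched in Python.  This is only an illustration, not CasualMecab's own script: it assumes the note carries a JIS X 0213 plane-row-cell code (the 1-84-77 part), and it relies on Python's built-in euc_jis_2004 codec to map that code to Unicode.  The names GAIJI_NOTE, jisx0213_to_char and replace_gaiji are made up for this example.

```python
import re

# Matches a gaiji note that carries a JIS X 0213 code, e.g.
# ※[#「てへん+劣」、第3水準1-84-77]  (plane 1, row 84, cell 77).
# Both ASCII and full-width brackets / number signs are accepted.
GAIJI_NOTE = re.compile(
    r"※[\[［][#＃][^\]］]*?第[34]水準(\d)-(\d{1,2})-(\d{1,2})[\]］]"
)


def jisx0213_to_char(plane, row, cell):
    """Convert a JIS X 0213 plane-row-cell triple to a Unicode string."""
    # In EUC-JIS-2004, plane 1 is encoded as two bytes (0xA0 + row, 0xA0 + cell);
    # plane 2 uses the same two bytes preceded by the single-shift byte 0x8F.
    body = bytes([0xA0 + row, 0xA0 + cell])
    if plane == 2:
        body = b"\x8f" + body
    return body.decode("euc_jis_2004")


def replace_gaiji(text):
    """Replace every convertible gaiji note in text with the real character."""
    def substitute(match):
        plane, row, cell = (int(group) for group in match.groups())
        try:
            return jisx0213_to_char(plane, row, cell)
        except ValueError:
            # Out-of-range or unassigned code: keep the original note untouched.
            return match.group(0)

    return GAIJI_NOTE.sub(substitute, text)


if __name__ == "__main__":
    print(replace_gaiji("草を※[#「てへん+劣」、第3水準1-84-77]る"))
```

Notes that give only a description of the character, without a plane-row-cell code, are left as they are in this sketch.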


CasualTagger

- support rbtagger if installed
- better regex search/replace
- delete punctuation tags
- ignore header section (info part?)
- skip bracketed tags from tagging
- progress bar in batch process
- run tagger in editor mode

CasualTagger now supports rbtagger.  You can find more information about rbtagger here.  rbtagger, written by Todd A. Fisher, is a tagger based on Eric Brill's tagger.  You need to install it yourself, but it's very simple.  Just type sudo gem install rbtagger in Terminal.app.  It still has some issues, but it's good to have alternatives.

Regex replace was supported before, but now you can use regex for search, too.  Delete punctuation tags deletes the tags put on punctuation characters (not words).  Ignore header section is for my own purposes.  Some of my corpus files have a header section ~ and I don't want to add tags to the text in this section.  So now CasualTagger can ignore this part (keep the original text).  Skip bracketed tags ignores the section tags I have in some files, such as ~, etc.  A progress bar has been added to batch processing.  Finally, you can apply tags (engtagger/rbtagger) to a single file in Editor mode.


CasualTextractor

- PDF mode
  - search in PDF
  - enlarge/reduce in PDF
  - go to selected text (from PDF to extracted text)
  - delete selected text (on PDF from extracted text)
  - split word search and replace (PDF artifacts)
  - replace character list

- Web mode
  - open web files (html/htm/webarchive)
  - clear open page/file
  - web history
  - source view

- Document
  - split word search (for PDF)
  - replace character list

- Overall
  - open recent files
  - regular expression search
  - simple tagging support
  - text format options (replace certain text/characters)

I've made so many changes that I'll only make notes on some of them.

In PDF mode, with delete selected text on PDF, you select text in the PDF view and delete that section.  The text will be deleted from the text view and the text on the PDF will be struck through.  This is handy if you want to delete headers or footers from the PDF text.  Split word search finds words that were split when the PDF file was created, such as interest- ing, due to a line break.

In Web mode, you can open web files (not just drag&drop them) and clear the page so that you can drag&drop another file.  Web history is what you usually see in a browser, though it's limited.  You can view the source of the page and make changes to it (and see the result of the change).
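
A split word search of this kind can be approximated with a regular expression.  The following Python sketch is only an illustration under a simple assumption (a run of letters, a hyphen, some whitespace, then a lowercase continuation); it is not how CasualTextractor does it, and the names SPLIT_WORD, find_split_words and join_split_words are made up for this example.

```python
import re

# A run of letters ending in a hyphen, then whitespace (the former line
# break), then a lowercase continuation: "interest- ing".
SPLIT_WORD = re.compile(r"\b([A-Za-z]+)-\s+([a-z]+)\b")


def find_split_words(text):
    """Return every candidate split word found in text."""
    return [match.group(0) for match in SPLIT_WORD.finditer(text)]


def join_split_words(text):
    """Rejoin the two halves of every candidate split word."""
    return SPLIT_WORD.sub(r"\1\2", text)


if __name__ == "__main__":
    sample = "This is an interest- ing result."
    print(find_split_words(sample))
    print(join_split_words(sample))
```

A pattern like this also matches genuine hyphenated compounds that happen to break at the hyphen (well- known would become wellknown), so the matches are better reviewed one by one than replaced blindly.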

Overall, it has regex search/replace.


The information on the site is still based on the previous beta.  I probably won't update it until I'm certain the features are set.  But if you try them and wonder how a function works, please feel free to contact me.  Bug reports are also welcome.

Monday, September 28, 2009

CasualPConc 0.7 and a new application

I made a few changes and fixed a few bugs on CasualPConc, a simple parallel concordancer for Mac OS X.  It is a little more stable now.  I also worked on the documentation.  Now it covers most of the features.

A new application is based on CasualPConc.  When I first released CasualPConc, someone asked if I would make it handle more than two corpora.  This is kind of my answer to that.  CasualMultiPConc has limited features, but it can handle up to 5 parallel corpora.

This new application simply does KWIC concordance of up to 5 parallel corpora.  The future of this application is up to its users (if there are any).  I don't have any experience in parallel corpus analysis and I don't need to use it right now, so I'm not even sure how well it works.  I have only used small corpora to test it.  If you are interested in testing it, any feedback is welcome.  This one is also for Mac OS X 10.5 Leopard or later, though I have only tested it on 10.6 Snow Leopard.

This application and CasualPConc are only on the English site (Main Site link on the right).  Both of them are under Other Applications.

Wednesday, September 23, 2009

CasualConc updated, but still not 1.0

I fixed a couple of bugs and added a few features.  Yes, I wrote that I wouldn't spend time on East Asian language support, but I somehow figured out how to handle coloring in Concord with the 2-byte character modes (Japanese (plain) and Japanese (wakachi)).  Well, it's the middle of a 5-day weekend in Japan...

Bug fixes
- crashes in Word Count when n-gram list is created in File Mode.
- File Information treated word lengths in bytes not in characters
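
The difference between the two ways of measuring word length is easy to see in a short Python sketch.  UTF-8 is used here only as an example encoding; which encoding CasualConc works with internally is not stated, and the function names are made up for this example.

```python
def length_in_characters(word):
    """Number of characters (Unicode code points) in the word."""
    return len(word)


def length_in_bytes(word, encoding="utf-8"):
    """Number of bytes the word occupies in the given encoding."""
    return len(word.encode(encoding))


if __name__ == "__main__":
    for word in ["corpus", "naïve", "言語"]:
        print(word, length_in_characters(word), length_in_bytes(word))
```

For plain ASCII words the two counts agree, which is why a bug like this tends to show up only with accented or East Asian text.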

Feature Improvements
- added two new Word Count sort options: Word Length and Reverse Word Length
- added Character as a search word choice
- full regular expression search
- a progress bar is added at the bottom of the main window, though it doesn't indicate the progress (it shows CasualConc is processing your request)
- much better East Asian Language support (Japanese (plain) and Japanese (wakachi))


Along with better East Asian Language support, I made a few other changes.  The two new Word Count sort options go with File Information.  File Information tells you how many words have a certain number of characters/letters, and the new sort options let you check which words are the longest/shortest.

The new search word mode lets you search for the characters that are otherwise used for wildcard search.  Now you can search for * ? ! and other non-word characters.

The change in regular expression search is this: before, all regular expressions were word level.  In other words, the regular expression actually processed inside CasualConc was wrapped in \b~\b (word boundaries).  Now this limitation has been lifted.  So if you want the same results as before, simply put \b before and after the regular expression.

The progress bar added to the main window only shows that CasualConc is processing.  It doesn't show how much of the processing has been done.
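
The effect of the change can be illustrated with Python's re module.  This is only a sketch of the behavior described above, not CasualConc's own code, and its regex engine may differ from Python's in details; the function names are made up for this example.

```python
import re


def word_level_search(pattern, text):
    """Old behavior: the pattern is wrapped in word boundaries."""
    return re.findall(rf"\b(?:{pattern})\b", text)


def full_search(pattern, text):
    """New behavior: the pattern is used exactly as typed."""
    return re.findall(pattern, text)


if __name__ == "__main__":
    text = "the cat sat in the category"
    print(word_level_search("cat", text))  # whole words only
    print(full_search("cat", text))        # also matches inside "category"
    print(full_search(r"\bcat\b", text))   # \b added by hand, same as before
```

The non-capturing group (?:...) keeps the boundaries applied to the whole pattern even when it contains alternation such as cat|dog.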

East Asian Language support is much better now.  Context word coloring is added and now you can use the database mode.  But because of the nature of texts (no spaces between words), some of the functions behave differently.  More detailed information about East Asian Language support is documented on the site (only on the English site at this moment).

Now I will finally focus on bug fixes and minor changes.  I won't make any more major changes before 1.0, or so I think...

The documentation for CasualConc and CasualTagger has been updated.  It should reflect the latest versions.

I'd appreciate it if you could report any bugs as soon as you find them.

Sunday, September 20, 2009

CasualConc update

Just because I wanted to get file information, I added it to CasualConc. 

New feature
- File Information

It is very basic and only returns type, token, type-token ratio and number of n-letter words.
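
For readers unfamiliar with these measures, here is a minimal Python sketch of the same kind of summary.  It is not CasualConc's script: the tokenization rule (lowercased runs of letters, with an optional internal apostrophe) is an assumption made for this example, and the name file_info is made up.

```python
import re
from collections import Counter

# Assumed tokenization: runs of letters, optionally joined by one apostrophe.
WORD = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)?")


def file_info(text):
    """Return token count, type count, type-token ratio and word-length counts."""
    words = WORD.findall(text.lower())
    tokens = len(words)              # running words
    types = len(set(words))          # distinct words
    ratio = types / tokens if tokens else 0.0
    # Lengths are counted in characters, not bytes.
    n_letter_words = Counter(len(word) for word in words)
    return {
        "tokens": tokens,
        "types": types,
        "type_token_ratio": ratio,
        "n_letter_words": dict(n_letter_words),
    }


if __name__ == "__main__":
    print(file_info("The cat saw the other cat."))
```

Different tokenization choices (case folding, hyphens, numbers) change all four figures, so results from two tools are only comparable when their settings match.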

Improved
- Fisher's Exact Test calculation speed

I also rewrote the algorithm for Fisher's Exact Test and it is much, much faster, though I'm sure no one has used this since I added it last time.  But I haven't fully tested this, so I'd appreciate it if anyone could test its accuracy.

Now the version is 0.9.9.9.  Well, it's almost 1.  If I can't find any other bugs when I use it in the next couple of weeks and I don't get any bug reports, I'll simply make it 1.0.  I'm sure it will still have bugs even if I make it 1.0, but that's the nature of computer programs.

So, my decision for now is that I won't spend any more time on East Asian Language support because I haven't heard from anyone who uses CasualConc with East Asian languages.  I personally don't use CasualConc even with Japanese, so I don't see any necessity.  I'll work on this later if I have time, but I'll focus more on other programs once this hits 1.0.  So if you use this with East Asian languages and would like to have better support for them, please let me know.

If I think of other nice features, I'll probably add them to other programs first and see if they work before I add them to CasualConc.  In fact, I added File Info to CasualTagger and it wasn't too complicated, so I decided to add it to CasualConc (well, I had wanted this feature, but hadn't tried to write the scripts).
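
For anyone who wants to check the results against an independent calculation, a straightforward reference version of the two-sided test for a 2x2 table can be written in Python.  This is the textbook definition (sum the hypergeometric probabilities of all tables that are no more likely than the observed one), not the rewritten algorithm in CasualConc, and the function name is made up for this example.  It needs Python 3.8 or later for math.comb.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    # With the margins fixed, the top-left cell can only take these values.
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)

    def weight(x):
        # Unnormalized hypergeometric probability of a table whose top-left
        # cell is x; kept as an integer so comparisons are exact.
        return comb(row1, x) * comb(row2, col1 - x)

    observed = weight(a)
    extreme = sum(
        weight(x) for x in range(lowest, highest + 1) if weight(x) <= observed
    )
    return extreme / comb(total, col1)


if __name__ == "__main__":
    print(fisher_exact_two_sided(8, 2, 1, 5))
```

Working with integer weights avoids the floating-point ties that can make two implementations disagree in the last few digits; it is slow for very large counts, where log-factorials are the usual alternative.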

Anyway, if you use CasualConc or other programs, please, please let me know what you think.