Friday, December 18, 2009

CasualTagger 0.8

I'm not sure if there is anyone who has even tried CasualTagger, but because I've been using it to tag my own corpus, I've been adding features to it.  I don't have time to work on the documentation with screenshots, so I'll describe the new features and how to use them (to some extent) on this blog.  But because I don't remember all the changes I've made since the last update, I'll only mention the more significant ones here.

General feature
- memo panel

You can keep notes on what you are doing, and CasualTagger saves your memo.  The format of this might change in the future, though.

In Editor mode
- coloring of specified strings in kwic context (up to 2)
- specifying left/right kwic context span separately
- adding xml type tag in addition to pos-tags (with shortcuts)
- more versatile tag coloring
- search word/coloring string history

The coloring is simple.  You just specify word/phrase/whatever to add colors in kwic context.  You can specify if you add colors to left/right/both context.  The mode of search word (wildcard/character/regular expressions) applies to context string coloring.

For kwic span, you can now specify it for left and right context separately.

The xml tagging feature is added.  So now you can add two different types of tag formats (one for pos-tags and the other for xml type tags).  Both can be done with shortcuts.  For example, you can add pos-tags in _XXX format and at the same time, you can work on xml type tags ~.

Tag coloring is more versatile now.  You can specify different types of tags including xml type tags.

Search words and context coloring strings have history features, just like the search word/context word history in CasualConc.

New modes
- Item Counter
- Custom File Info

Item Counter simply counts occurrences of strings in your corpus that match a regular expression.  To use this, first add files to the file list table on the left.  Then open the Option panel (Menu -> Window -> Counter Option Panel).  You can specify the end of the file information part (just like you can in CasualConc).  You can also specify any strings to ignore in counting (with a regular expression).  If the files have any strings that match the specified regular expression, they will be deleted before CasualTagger counts what you want to count.  You can have any number of items on a table and check the ones you want to apply.  You can also specify them for each table.  If you use () for a back reference (capture group), only the part inside the parentheses will appear on the table.
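To make the counting order concrete, here is a minimal sketch in Python (not the RubyCocoa code CasualTagger actually uses).  The function name count_items, its parameters, and the sample text are all made up for illustration; it only assumes the behavior described above: cut off the file information part, delete ignored strings, then count, listing only the captured group when the expression has ().

```python
import re

def count_items(text, item_pattern, ignore_patterns=(), info_end=None):
    """Count strings matching item_pattern after removing ignored strings."""
    # Drop the file information part: everything up to and including the marker.
    if info_end and info_end in text:
        text = text.split(info_end, 1)[1]
    # Delete every string that matches an "ignore" expression before counting.
    for pattern in ignore_patterns:
        text = re.sub(pattern, "", text)
    counts = {}
    for match in re.finditer(item_pattern, text):
        # With a () group, only the captured part is listed; otherwise the whole match.
        key = match.group(1) if match.re.groups else match.group(0)
        counts[key] = counts.get(key, 0) + 1
    return counts

# Hypothetical sample: a header, then text tagged in _XXX format.
sample = "<header>id_NN 42</header>\nThe_DT cat_NN sat_VBD on_IN the_DT mat_NN ._."
print(count_items(sample, r"_([A-Z]+)", ignore_patterns=[r"_\."], info_end="</header>"))
```

In this sample the tag inside the header is not counted, and the punctuation tag is deleted before counting, so only the word tags remain.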

Custom File Info is basically multiple Item Counter.  To use this, add files to the file table on the left.  Then click Settings button on the top right corner.  You can specify end of the file info part and any string to ignore in all the counts.

On the setting table, add items to count.  Label is a label for table columns.  Check "U" if you want to count only unique occurrences.  "C" is case sensitivity for regular expression.  Check "M" to allow multiple line matches.  Then specify items to count and items to ignore in regular expressions.  You can specify multiple items for Items to ignore.  Just use a comma [,] to separate regular expressions.  You can export/import that list for later use.  Drag and drop to change the order.
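The settings table can be pictured as a list of small counters.  The sketch below is Python and only an approximation: build_counter and run are hypothetical names, "C" is assumed to mean case-sensitive when checked, "M" is assumed to map to letting expressions match across line breaks, and unique counting is assumed to fold case when "C" is off.  Splitting the ignore list on commas is naive, so an expression that itself contains a comma (such as a{1,2}) would be cut in two.

```python
import re

def build_counter(label, item, ignore="", unique=False,
                  case_sensitive=True, multiline=False):
    """Return (label, count_function) for one row of the settings table."""
    flags = 0
    if not case_sensitive:
        flags |= re.IGNORECASE             # "C" unchecked (assumed meaning)
    if multiline:
        flags |= re.DOTALL | re.MULTILINE  # "M" checked (assumed meaning)
    # Items to ignore: several expressions separated by commas.
    ignore_res = [re.compile(p, flags) for p in ignore.split(",") if p]
    item_re = re.compile(item, flags)

    def count(text):
        for ignore_re in ignore_res:
            text = ignore_re.sub("", text)
        found = [m.group(0) for m in item_re.finditer(text)]
        if not case_sensitive:
            found = [s.lower() for s in found]
        return len(set(found)) if unique else len(found)  # "U" = unique only

    return label, count

def run(texts, counters):
    """Build a tab-delimited table: one row per file, one column per label."""
    rows = ["\t".join(["File"] + [label for label, _ in counters])]
    for name, text in texts.items():
        rows.append("\t".join([name] + [str(fn(text)) for _, fn in counters]))
    return "\n".join(rows)

counters = [
    build_counter("Tokens", r"\b\w+\b", ignore=r"<[^>]+>"),
    build_counter("Types", r"\b\w+\b", ignore=r"<[^>]+>",
                  unique=True, case_sensitive=False),
]
table = run({"a.txt": "<p>The cat saw the dog</p>"}, counters)
print(table)
```

Writing the returned string to a file opened with encoding="utf-8" gives the kind of tab-delimited text a spreadsheet application can open.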

Once you set everything, close the setting window and click Run.  You can export the result as a tab-delimited text (in UTF-8) to open in Excel/Numbers or any spreadsheet application.

You can use Item Counter to check how your regular expressions work in Custom File Info, though it's not perfect (Item Counter doesn't have the ignore case/multiple line options for ignore items).

Anyway, I don't know if there's anyone to try CasualTagger, but if you are interested, you can download it from the CasualConc site.  If you ever use it, please let me know what you think.

Sunday, November 8, 2009

CasualConc update and CasualTranscriber

Well, I stated here that there would be no more new features in version 1.0, but I changed my mind...  This is due to a couple of reasons.  One is that I found a couple of bugs and I had to fix them.  Another is that adding the features was not so time consuming.  I added them to the current beta first and then mostly just copied and pasted the items and scripts into version 1.0.  In particular, one of the new features is something I wanted to include in version 1.0, but hadn't figured out how.

Anyway, the current version is 1.0.2 and here's the list of bug fixes and added features.

CasualConc Version 1.0.2

Bug fix
- In Cluster, the same cluster was counted twice if a search word/phrase appears twice in a cluster (such as 'is that is')
- Related to the above one: in Cluster, if a search word appears twice in a cluster, only one word was colored
- In Cluster and Collocation with Lemma search is on, not all words with the same lemma appear in the list (right most column)

The bug in Cluster was a little serious.  In the original script, clusters were collected on every search word.  This means that if there is a sequence 'is that is' and you search 'is', 'is that is' was counted twice (once with the first 'is' and then with the second 'is' in the cluster).  Now, when this happens, duplicates are eliminated, so 'is that is' is counted only once.  And because it was assumed the search word appears only once in a cluster, only one of them was colored.  Now if there are two occurrences of the search word in a cluster, both should be colored.
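One way to picture the fix is to remember the start position of every cluster already collected, so the same stretch of text can't be collected a second time from another occurrence of the search word.  This is a Python sketch under that assumption, not the actual CasualConc script; the function name clusters and the sample sentence are made up.

```python
from collections import Counter

def clusters(tokens, word, n):
    """Count n-word clusters containing word, each text position only once."""
    starts = set()  # start positions of clusters already collected
    for i, token in enumerate(tokens):
        if token != word:
            continue
        # Every window of n tokens that contains position i.
        for start in range(i - n + 1, i + 1):
            if start >= 0 and start + n <= len(tokens):
                starts.add(start)  # a set ignores a window seen before
    return Counter(" ".join(tokens[s:s + n]) for s in starts)

tokens = "what it is that is wrong".split()
result = clusters(tokens, "is", 3)
print(result["is that is"])
```

Here 'is that is' is reached from both occurrences of 'is', but its start position is stored only once, so it is counted once.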

The bug with Lemma is a minor one, which just means I don't expect many people to use this function.  When you run Cluster or Collocation with the Lemma feature on, CasualConc shows the frequencies of words in the same lemma (or clusters with them) in the rightmost column.  But when a lemma contained only one word, it wasn't displayed correctly.  Now all the words should be displayed.


New features
- Concordance Plot -- you need to set 'Scope of Context' to 'File' and check 'Create Concordance Plot'
- Search word/context search word history

Concordance Plot shows where in a file the search word appears.  The plots are generated when you run Concord with 'Scope of Context' set as 'File' and 'Create Concordance Plot' is checked in Preferences.
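The idea behind a plot like this is just to map each hit's position in the file onto a fixed-width bar.  Below is a small text-only sketch in Python; concordance_plot and its width parameter are hypothetical, and CasualConc draws its plots graphically rather than with characters.

```python
import re

def concordance_plot(text, pattern, width=40):
    """Return a one-line plot: '|' where a hit falls, '-' elsewhere."""
    cells = ["-"] * width
    length = max(len(text), 1)  # avoid dividing by zero on an empty file
    for match in re.finditer(pattern, text):
        # Scale the character offset of the hit to a cell on the bar.
        cell = min(match.start() * width // length, width - 1)
        cells[cell] = "|"
    return "".join(cells)

# Made-up file with 'the' at the beginning, the middle and the end.
text = "the " + "x " * 50 + "the " + "y " * 48 + "the"
plot = concordance_plot(text, r"\bthe\b", width=20)
print(plot)
```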

Another new feature is search word/context search word history.  CasualConc remembers the words you searched in Concord/Cluster/Collocation and you can select one from the pull-down menu.  You can set how many search/context search words to remember in Preferences (in General [Search Word] and Concord [Context Word]).

I hope these new features didn't introduce new bugs.


I also updated CasualTranscriber, a transcription helper application.  Now it should run on Mac OS X 10.6 Snow Leopard.  I also fixed some bugs and added new features.  It should be a little more stable.

The new/enhanced features are insert tag function and much more powerful regular expression search/replace.

To use the tag function, go to Menu -> Window -> Tag Panel.  Then enter tags under Tag.  To insert a tag, type Command + Option + (the number on the left).  So to insert the tag in 1, type Command + Option + 1.  If Tag is laugh, the tag will be inserted as <laugh></laugh>.  If any text is selected on the text view, it appears between the tags.  If you check the box and enter options, they appear like attributes in XML.  Options will be divided by a comma (,) so if you enter option1,option2 in the Options box, the inserted tag will be .  The selected text appears between tags.

If you don't want to type the combination, you can click a button to insert a tag.  On the main window, you will see a button on the top right corner (the one to show tool bar on any Cocoa application).  Click it to show an icon like a gear.  Clicking the gear icon shows a drawer on the right.  Click a button next to the tag you want to insert.  You can change the tag on the drawer (the change will be reflected on the panel).  But if you want to change the options, you need to do that on the panel.

I fixed some bugs, but I can't tell which ones.  This is because so many things broke in Snow Leopard and I couldn't tell which bugs were in the previous version and which ones were due to change in the OS.

Anyway, if you try either of them and find any bug, please let me know.

Sunday, November 1, 2009

Another CasualConc bug fix

While developing the next version, I found a bug that affects version 1.0 and fixed it.

Bug
- the context text (in context view) is not properly displayed when the search word appears in the first paragraph of a file.

This only affected you if you used Database mode and the search word appeared in the very first paragraph of a file (the original file, not the database file).


The development of the next version is slow.  Because I started to change the fundamentals, only some of the functions work now (file management and a part of Concord).  Anyway, here's the list of tentative new/revised features (they might not show up in the next version).

- stop word list
- skip character list
- experimental pos tag search/count (only in European Language modes)

If you have any suggestions, please let me know.  I'll see if I can include them in the next version.

Sunday, October 25, 2009

CasualConc minor bug fix

I found a bug in CasualConc when I was working on the beta, which has the same script.  It's a minor one (a feature I believe not a lot of people use).

Bug fix
- crashed when creating a database file in Advanced Corpora Database mode using tag deletion.

If you are one of the rare people who use this function, please download the latest version.

Monday, October 19, 2009

CasualConc 1.0 and more...

This post will be a long one.

I've decided to make the latest build of CasualConc version 1.0, mostly because I didn't get any feedback or bug reports (well, I guess not a lot of people ever tried the latest beta, or people don't bother to report bugs...).  Anyway, I made a few changes to it and here's the list.  The bug fixes are very minor.

CasualConc version 1.0

Bug Fix
- timer for File Info now working
- move tables in Cluster moves everything including span and type
- word list import now functions

Enhancement
- creating File Info table is much faster
- progress bar in File mode progresses based on the number of files processed
- now including help files (the same content as you find on the site; English only)


From now on, this version will only be maintained if I get bug reports.  I've already started adding new features and I'm planning to make more fundamental changes to it.  I'll release it as version 1.x some time in the future, but there is no time frame.  If you have any feature requests, please send them to me.  I'll try to add them once I've added what I want, though whether I can add what you want depends on my time and scripting skills.  I might release it as a beta (or alpha) even before it becomes stable if people are interested.  If you are, please let me know.


In addition to this, I've also updated some of other applications.  I don't know if anyone ever uses them, but I really started using some of them personally, so I've been trying to include what I want.  Well, the reasons I started to write the applications vary.  Some were upon requests and some were just experimental (to try what I learned in scripting).  Now, I want to make them more like real applications.

Anyway, the updated applications and the details of updates are below.  I don't think people are interested, but these are for the record, so I can keep track of what I've done.


CasualMecab

- Aozora Bunko Kanji substitute handling
- experimental Word List function using Mecab output
- Snow Leopard Support (separate build)

Aozora Bunko Kanji substitute handling picks up Kanji substitute notes, such as ※[#「てへん+劣」、第3水準1-84-77], and replaces them with the real Kanji (this is now possible with Unicode characters).  The Word List function uses Mecab format output and creates a word list with any of the info available (not just the word as it appears in the text).  You can create a word list of base form and part-of-speech combinations, etc.  Snow Leopard Support is just a workaround.  If you use Snow Leopard, you need to download the build for it.
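For the curious, here is roughly how such a note can be turned into a character.  This is a Python sketch, not what CasualMecab does internally: it assumes the three numbers in the note are the JIS X 0213 plane, row and cell, and it relies on Python's euc_jis_2004 codec to do the lookup.  The names GAIJI, menkuten_to_char and replace_gaiji are made up, and notes that don't fit the pattern or can't be decoded are left as they are.

```python
import re

# Matches notes like ※[#「...」、第3水準1-84-77]; also accepts full-width ［＃ and ］.
GAIJI = re.compile(r"※[\[［][#＃][^\]］]*?第[34]水準(\d)-(\d+)-(\d+)[\]］]")

def menkuten_to_char(plane, row, cell):
    """Convert a JIS X 0213 plane-row-cell code to a Unicode character."""
    # EUC-JIS-2004 stores row and cell as 0xA0 + value;
    # plane 2 characters take a 0x8F prefix byte.
    data = bytes([0xA0 + row, 0xA0 + cell])
    if plane == 2:
        data = b"\x8f" + data
    return data.decode("euc_jis_2004")

def replace_gaiji(text):
    """Replace every Kanji substitute note that can be resolved."""
    def repl(match):
        try:
            return menkuten_to_char(*(int(g) for g in match.groups()))
        except (UnicodeDecodeError, ValueError):
            return match.group(0)  # unknown code: keep the original note
    return GAIJI.sub(repl, text)

print(replace_gaiji("草を※[#「てへん+劣」、第3水準1-84-77]る"))
```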


CasualTagger

- support rbtagger if installed
- better regex search/replace
- delete punctuation tags
- ignore header section (info part?)
- skip bracketed tags from tagging
- progress bar in batch process
- run tagger in editor mode

CasualTagger now supports rbtagger.  You can find more information about rbtagger here.  rbtagger is a tagger by Todd A. Fisher, based on Eric Brill's tagger.  You need to install it yourself, but it's very simple.  Just type sudo gem install rbtagger in Terminal.app.  It still has some issues, but it's good to have alternatives.

Regex replace was supported before, but now you can use regex for search as well.  Delete punctuation tags deletes tags put on punctuation characters (not words).  Ignore header section is for my own purposes.  Some of my corpus files have a header section ~ and I don't want to add tags to the text in this section.  So now CasualTagger can ignore this part (keep the original text).  Skip bracketed tags is to ignore section tags I have on some files, such as ~, etc.  And a progress bar is added to batch processing.  Finally, you can apply tags (engtagger/rbtagger) on a single file in Editor mode.


CasualTextractor

- PDF mode
  - search in PDF
  - enlarge/reduce in PDF
  - go to selected text (from PDF to extracted text)
  - delete selected text (on PDF from extracted text)
  - split word search and replace (PDF artifacts)
  - replace character list

- Web mode
  - open web files (html/htm/webarchive)
  - clear open page/file
  - web history
  - source view

- Document
  - split word search (for PDF)
  - replace character list

Overall
  - open recent files
  - regular expression search
  - simple tagging support
  - text format options (replace certain text/characters)

I've made so many changes that I'll only make notes on some of them.

In PDF mode, with delete selected text on PDF, you select text on the PDF view and delete the section.  The text will be deleted from the text view and the text on the PDF will be struck through.  This is handy if you want to delete a header or footer from the PDF text.  Split word search finds words that were split when the PDF file was created, such as interest- ing due to a line break.
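A split word search like this can be approximated with one regular expression: letters, a hyphen, some whitespace, then lowercase letters.  The Python sketch below uses made-up names (SPLIT_WORD, find_split_words, join_split_words) and is not the expression CasualTextractor uses.  It can't tell a line-break artifact from a real hyphenated compound, which is why searching and checking each hit is safer than replacing everything blindly.

```python
import re

# Letters, a hyphen, whitespace (the old line break), then lowercase letters.
SPLIT_WORD = re.compile(r"\b([A-Za-z]+)-\s+([a-z]+)\b")

def find_split_words(text):
    """List candidates such as 'interest- ing' for manual checking."""
    return [match.group(0) for match in SPLIT_WORD.finditer(text)]

def join_split_words(text):
    """Join both halves; only safe after the candidates have been checked."""
    return SPLIT_WORD.sub(r"\1\2", text)

text = "This is an interest- ing result."
print(find_split_words(text))
print(join_split_words(text))
```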

In Web mode, you can open web files (not just drag and drop them) and clear the page so that you can drag and drop another file.  Web history is what you usually see in a browser, though it's limited.  You can see the source of the page and make changes to it (you can see the result of the change).

Overall, it has regex search/replace.


The information on the site is still based on the previous beta.  I probably won't update it until I'm certain the features are set.  But if you try and wonder how a function works, please feel free to contact me.  Also any bug report is welcome.  

Monday, September 28, 2009

CasualPConc 0.7 and a new application

I made a few changes and fixed a few bugs on CasualPConc, a simple parallel concordancer for Mac OS X.  It is a little more stable now.  I also worked on the documentation.  Now it covers most of the features.

The new application is based on CasualPConc.  When I first released CasualPConc, someone asked if I would make it handle more than two corpora.  This is kind of my answer to that.  CasualMultiPConc has limited features, but it can handle up to 5 parallel corpora.

This new application simply does kwic concordance of up to 5 parallel corpora.  The future of this application is up to users (if there's any).  I don't have any experience in parallel corpus analysis and I don't need to use it right now, so I'm not even sure how well this works.  I only use small corpora to test it.  If you are interested in testing it, any feedback is welcome.  This one is also for Mac OS X 10.5 Leopard or later, though I only tested it on 10.6 Snow Leopard.

This application and CasualPConc are only on the English site (Main Site link on the right).  Both of them are under Other Applications.

Wednesday, September 23, 2009

CasualConc updated, but still not 1.0

I fixed a couple of bugs and added a few features.  Yes, I wrote I wouldn't spend time on East Asian language support, but I somehow figured out how to handle coloring in Concord with 2-byte character modes (Japanese (plain) and Japanese (wakachi)).  Well, it's in the middle of 5-day weekend in Japan...

Bug fixes
- crashes in Word Count when n-gram list is created in File Mode.
- File Information treated word lengths in bytes not in characters

Feature Improvements
- added two new Word Count sort options: Word Length and Reverse Word Length
- added Character as a search word choice
- full regular expression search
- a progress bar is added at the bottom of the main window, though it doesn't indicate the progress (it shows CasualConc is processing your request)
- much better East Asian Language support (Japanese (plain) and Japanese (wakachi))


Along with better East Asian Language support, I made a few other changes.  The two new Word Count sort options go with File Information.  File Information tells you the number of words with a certain number of characters/letters, and these sort options let you check which words are the longest/shortest.

An addition of a new search word mode is to search for the characters used for wildcard search.  Now you can search * ? ! and other non-word characters.

The change in regular expression search is this: before, all regular expressions were word level.  In other words, the actual regular expression processed inside CasualConc was wrapped in \b~\b (word boundaries).  Now this limitation is lifted.  So if you want the same results as before, simply put \b before and after the regular expression.
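Here is a quick illustration of the difference, using Python's re module rather than the regular expression engine inside CasualConc (the two may differ in details).  The helper name word_level and the sample text are made up.

```python
import re

def word_level(pattern):
    """Wrap a pattern in word boundaries, like the old behavior."""
    # The non-capturing group keeps alternations such as cat|dog intact.
    return r"\b(?:" + pattern + r")\b"

text = "the cat concatenated categories"

old_style = re.findall(word_level("cat"), text)  # whole words only
new_style = re.findall("cat", text)              # matches inside words too

print(old_style)  # ['cat']
print(new_style)  # ['cat', 'cat', 'cat']
```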

The progress bar added to the main window only shows that CasualConc is processing.  It doesn't show how much processing has been done.

East Asian Language support is much better now.  Context word coloring is added and now you can use the database mode.  But because of the nature of texts (no spaces between words), some of the functions behave differently.  More detailed information about East Asian Language support is documented on the site (only on the English site at this moment).

Now I will finally focus on bug fixes and minor changes.  I won't make any more major changes before 1.0, or so I think...

The documentation for CasualConc and CasualTagger is updated.  It should reflect the latest versions.

I'd appreciate it if you could report any bugs as soon as you find them.

Sunday, September 20, 2009

CasualConc update

Just because I wanted to get file information, I added it to CasualConc. 

New feature
- File Information

It is very basic and only returns type, token, type-token ratio and number of n-letter words.
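In case the terms are unfamiliar: tokens are all the running words, types are the distinct words, and the type-token ratio is types divided by tokens.  The Python sketch below shows the idea; file_info is a made-up name, words are simply taken as runs of word characters in lowercased text, and CasualConc's own tokenization may well differ.

```python
import re
from collections import Counter

def file_info(text):
    """Return tokens, types, type-token ratio and word-length counts."""
    words = re.findall(r"\w+", text.lower())
    tokens = len(words)
    types = len(set(words))
    return {
        "tokens": tokens,
        "types": types,
        "ttr": types / tokens if tokens else 0.0,
        # len() counts characters, not bytes, so 2-byte scripts are measured correctly.
        "lengths": Counter(len(word) for word in words),
    }

info = file_info("The cat saw the other cat")
print(info["tokens"], info["types"], round(info["ttr"], 3))
print(dict(info["lengths"]))
```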

Improved
- Fisher's Exact Test calculation speed

I also rewrote the algorithm for Fisher's Exact Test and it is much, much faster, though I'm sure no one has used this since I added it last time.  But I haven't fully tested this, so I'd appreciate it if anyone can test its accuracy.

Now the version is 0.9.9.9.  Well, it's almost 1.  If I can't find any other bugs when I use it in the next couple of weeks or I don't get any bug report, I'll simply make it 1.0.  I'm sure it still has bugs even if I make it 1.0, but that's the nature of computer programs. 

So, my decision for now is that I don't spend any more time on East Asian Language support because I haven't heard from anyone who uses CasualConc with East Asian languages.  I personally don't use CasualConc even with Japanese, so I don't see any necessity.  I'll work on this later if I have time, but I'll focus more on other programs once this hits 1.0.  So if you use this with East Asian languages and would like to have better support for East Asian languages, please let me know. 

If I think of other nice features, I'll probably add them to other programs first and see if they work before I add them to CasualConc.  In fact, I added File Info to CasualTagger and it wasn't too complicated, so I decided to add it to CasualConc (well, I wanted this feature, but hadn't tried to write the scripts).

Anyway, if you use CasualConc or other programs, please, please let me know what you think. 

Thursday, September 17, 2009

CasualConc and CasualTagger updates

Tonight, I uploaded newer versions of CasualConc and CasualTagger.

CasualConc's update is minor.

New features:
- Fisher's Exact Test in Collocation Stats Calculators (experimental)
- Calculator for 2x2 contingency table

The first one was added upon request by someone who kindly checked the accuracy of the stats calculations.  Thank you, Sebastian!  It looks like most of the stats are reasonably accurate.  Anyway, because the calculation of Fisher's p-value is CPU intensive (esp. with large N), I made it an option.  To include Fisher's p-value, go to Preferences -> Other and check 'Include Fisher's Exact Test'.  I haven't received any reports on the accuracy of this, so I'm not sure how well it works.  If anyone can test it, I'd appreciate it.  The contingency table calculator is based on the same formulas as the other stats calculations.  It returns Log-Likelihood, chi-square and Fisher's Exact Test (optional).  I hope this is useful for someone.
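If you want to check the numbers by hand, the three statistics for a 2x2 table with cells a, b (first row) and c, d (second row) can be computed as below.  This Python sketch uses the standard textbook formulas, which may not be exactly what CasualConc implements, and the function names are made up.  It assumes no row or column total is zero, and it needs Python 3.8 or later for math.comb.  The Fisher p-value here is two-sided: the sum of the probabilities of all tables that are no more likely than the observed one.

```python
import math

def chi_square(a, b, c, d):
    """Pearson chi-square for a 2x2 table, without continuity correction."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def log_likelihood(a, b, c, d):
    """Log-likelihood ratio (G2): 2 * sum of observed * ln(observed/expected)."""
    n = a + b + c + d
    observed = [a, b, c, d]
    expected = [(a + b) * (a + c) / n, (a + b) * (b + d) / n,
                (c + d) * (a + c) / n, (c + d) * (b + d) / n]
    # Cells with an observed count of 0 contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

def fisher_exact(a, b, c, d):
    """Two-sided Fisher's exact test p-value for a 2x2 table."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of a table whose top-left cell is x.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(n, col1)

    p_observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # A small tolerance keeps equally likely tables from being dropped by rounding.
    return sum(p for p in (prob(x) for x in range(low, high + 1))
               if p <= p_observed * (1 + 1e-9))

print(chi_square(3, 1, 1, 3), log_likelihood(3, 1, 1, 3), fisher_exact(3, 1, 1, 3))
```

Looping over every possible table is what makes Fisher's test slow with large N, which is the reason for making it optional.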

The update to CasualTagger is also feature enhancements.

Enhanced features:
- word count now works with untagged text ('None' is added to options)
- kwic search for specified word(s)/phrase(s) is available (it was only possible from a word list)
- simple sort in kwic
- word count and kwic with multiple files (optional)
- editor now accepts text files encoded other than UTF-8 (set in Preferences)
- ignore specified tags or file information (by specifying an end marker/tag) in word count and kwic

Because I made so many changes, there might be many bugs.  Now I'm trying to tag my own corpus, so I've been making changes to suit my needs.  I'll make more changes as I need them, but if you ever try CasualTagger and have nice ideas, please let me know.  I'll try to include them if they are not too complicated or they look useful for my work.  Also I'll try to update the documentation.


Also I checked CasualMecab on Snow Leopard, but it doesn't work.  I installed MeCab and MeCab-Ruby on Snow Leopard and it works fine from Ruby scripts (I tested the exact same script).  But somehow MeCab-Ruby doesn't work in an application.  I'll try to fix it if I can find any solution.

Sunday, September 13, 2009

CasualPConc update

Somehow, CasualPConc didn't run well in Snow Leopard.  This could be because Ruby in Snow Leopard is updated to 1.8.7 from 1.8.6 and this change might have caused errors. 

Anyway, I fixed some major bugs.  There might be some other bugs which are caused by the same source (related to Array Controller).  I also updated the how-to on the site.

If you find any bugs, please let me know.

Saturday, September 5, 2009

CasualConc bug fix

I found a bug in Word Count when I was cleaning up the code for it.

Bug fix
- crashed when creating n-gram list in the Database mode.

This bug was introduced when I added a warning message for missing files in the last update.

As for the clean-up, the problem was that in Cluster and Word Count, separate code was written for each table (right and left).  This was because of my lack of scripting skill (I still don't have much).  I couldn't think of a good way to identify which button was clicked and process the tables accordingly.  If you know how Cocoa works, this should be obvious, but when I started this project, I had no experience in Cocoa.

The new version is 0.9.9.7.  It's almost 1.0, so I'll try to wrap up to make it 1.0 soon.  This means no more major feature before 1.0 and I'll focus on bug fixes.  But unless I hear a lot from users whether it is mostly bug free or still has many bugs, I'm not confident enough to make it out of beta, though beta simply means (to me) it's not tested enough.  Computer programs will never be bug-free.

Anyway, if you find any bug, esp. in Word Count and Cluster, please let me know.

Saturday, August 29, 2009

Snow Leopard

Apple's new OS, Mac OS X 10.6 Snow Leopard, was released yesterday. I installed it on my Mac mini and did some tests with CasualConc. All the basic functions seem to work fine. I haven't checked every single feature, but I don't expect to see any serious issues with this upgrade.

Yet, if you find any broken feature or any other bugs, please let me know.

Thursday, August 27, 2009

CasualConc minor update

I uploaded a newer version of CasualConc last week. I haven't had time to work on this for a while, but I got one bug report and one minor feature request, so I fixed the bug and added the feature. The new version is 0.9.9.6.

Bug fix
- crashed when a lemma file is selected and the Lemma mode is on but CasualConc can't find the selected lemma file (it was moved, deleted, etc.).

New feature
- file name (FN) can be selected in sorting Concord results


If you find any other bugs, send them to me at casualconc (at) gmail.com. If a bug is very serious, I'll try to fix it as soon as possible.

Friday, May 8, 2009

CasualConc update

I got a couple of bug reports and a feature request, so I fixed the bugs and added the feature. I also found some other bugs related to the reported ones and fixed them too. In addition, I made one minor change.

Bug fixes
- crash in Concord with Scope of Context set as Sentence in the Database mode.
- corrupted export CSV files from Collocation
- crash in saving Collocation table

Addition
- Reverse Alphabetical sorting in Word Count

What this does (if it works) is to sort words in alphabetical order, but from the last letter to the first letter. So in the normal alphabetical order, a, an, that, the, this are ordered in this order, but in Reverse Alphabetical order, the order will be a, the, an, this, that.
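In other words, each word is compared as if it were spelled backwards. A one-line Python sketch of the idea (reverse_alphabetical is a made-up name, and this is not the CasualConc script itself):

```python
def reverse_alphabetical(words):
    """Sort words by their letters read from the last to the first."""
    return sorted(words, key=lambda word: word[::-1])

print(reverse_alphabetical(["a", "an", "that", "the", "this"]))
# ['a', 'the', 'an', 'this', 'that']
```

The words are compared as a, eht, na, siht, taht, which is handy for grouping words with the same ending.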

Change
- settings of minimum frequency in Cluster, Collocation/Cooccurrence, and Word Count are moved to Preferences

Now you can't set Min Freq. for each table in Cluster and Word Count, but CasualConc remembers the Min Freq. for each tool.


I also got a request to support exporting results in Excel format. I've been experimenting with this and it is on the to-do list (I can't tell you when I will add it because I need to figure out how to implement it first). This would probably require you to install a Ruby module in Terminal (a single-line command). Are there any other people who are interested?

Monday, March 30, 2009

CasualConc quick bug fix

I just found a bug in CasualConc. When it opens kwic result in a new window, it crashes. If you use this function, please download version 0.9.9.3 from the site. I think this was introduced when I made a few changes last time.

If you find any other bugs, please let me know. If they are minor and easy to fix, I'll try to fix them in a day or two.

CasualPConc more updates

Today, I learned that at least one person in the world other than myself knows CasualPConc exists. I'm really glad about that.

I added a few more features to CasualPConc today. Now almost all the functions I can think of and I wanted to add are there. I might add a function to export results if anyone is interested. Or if anyone has a good idea, I might consider that. But from now on, I'll focus on bug fixing and documentation. I'll update the CasualPConc page on the CasualConc main site in the coming weeks.

I got one request to make CasualPConc be able to handle more than two parallel corpora. But I think it's hard to add that function to CasualPConc. It would probably be easier to write a new program based on CasualPConc. I might work on this once I finalize CasualPConc and if I have time to focus on its development.

Anyway, if you happen to be reading this blog and are interested, please try it and give me some feedback. Using the basic functions should not be difficult. Or you can wait for a few days or weeks until I update the documentation (how to use).

Sunday, March 29, 2009

CasualPConc update

I'm almost certain no one has tried it yet, but I spent a little more time to add some more features to CasualPConc, a new parallel concordancer. This application is available at the CasualConc main site under Utility Programs, but the documentation is not up-to-date.

I don't think I'm going to make this as fancy as CasualConc, but I'm trying to use many more RubyCocoa (or Cocoa) features (I'm learning...).

CasualPConc originally had just kwic and word frequency count features. Now it has word cluster and collocation features just like CasualConc. One feature specific to CasualPConc is finding keywords in the matched corpus after running a kwic search. When you run a kwic search, you get the matched portion of the matched corpus (paragraphs or sentences), which includes words that are equivalent or similar to the one you searched. CasualPConc goes through the matched portion of text and calculates the keyness of words against the entire corpus. I'm not sure if I've explained this clearly, and I'm also not sure if it works as intended, but it's there.

I also added stop word/skip character functions. My understanding is that stop words are the ones that are very frequent in a language, and eliminating them helps people see what they are looking for more clearly. You can create stop word lists for any number of languages or corpora. The skip characters function is for two-byte languages, like East Asian languages or more specifically Japanese, because the characters for period, comma, brackets, etc. in Japanese are not treated as such by regular expressions. They are treated as regular characters like letters, so they are included in word lists and context words and they contaminate results. Both of these functions are experimental and not fully tested, and they are separate at this moment, but I might combine them into a single function or list.
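Here is a rough Python sketch of the two lists working together. The contents of SKIP_CHARS and STOP_WORDS and the function name tokens are made up for illustration, and the Japanese sample assumes the text is already split into words with spaces (wakachi); it is not how CasualPConc itself is written.

```python
# Hypothetical lists; in the application these would be user-defined.
SKIP_CHARS = "。、「」（）・"
STOP_WORDS = {"the", "a", "of"}

def tokens(text, skip_chars=SKIP_CHARS, stop_words=STOP_WORDS):
    """Split text into words, dropping skip characters and stop words."""
    # Turn every skip character into a space so it never ends up in a word.
    table = str.maketrans({char: " " for char in skip_chars})
    cleaned = text.translate(table)
    return [word for word in cleaned.lower().split() if word not in stop_words]

print(tokens("the cat of a friend"))
print(tokens("私 は 「 猫 」 です 。"))
```

Keeping the two lists separate means the skip characters can be shared across corpora while the stop words stay language-specific.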

If you are interested, please try it and give me some feedback. I personally don't use parallel concordancers much and I don't have good parallel corpora, so I can't really test it. Any feedback is welcome (functionality, usability, bug reports, etc.). The current version is 0.3, but this simply means I have made two major changes/enhancements with some testing and bug fixing since version 0.1.

Also if you use any of my applications, I'd really appreciate your feedback on them.

Thursday, March 26, 2009

A new project

As I didn't get much feedback and I was kind of busy, I didn't touch any of the programs (scripting) for a while. But I recently did small translation work and thought a parallel concordancer might help in that situation. So I spent the last few days to start a new project. It is a simple parallel concordancer for Mac OS X Leopard and I named it CasualPConc.

Currently it doesn't do much (possibly many bugs) and because I don't really use parallel corpora, I don't have a good idea about how to develop it. So I'd really appreciate any feedback. I used to work as a translator for a short period of time, so if I just follow my intuition, it will be more like a database program for a translator or language learner. The program is available on the CasualConc main site (direct link).

I don't expect many people use it, so if you give me feedback, it is likely that the functions you request will be added (as long as I can handle them). Please email me directly, or leave comment here, or post on the Discussion Board.