Friday, April 9, 2010

The current status of CasualConc beta - General/Global

I'll start with some features that are common across tools.

First is Search Word Choice.


Experimental tag search options have been added.  This feature is only supported in European Language A/B mode and works in Concord, Cluster, Collocation, and Word Count.  What it does in each tool will be explained (if I have time) in the post for each tool.  Generally, though, if you select Tag(s) in Concord, Cluster, or Collocation, you can search a tagged corpus just by tag.  For example, if your corpus is POS-tagged and jj is used for adjectives and nn for nouns, searching 'jj nn' returns all sequences of 'jj nn' in your corpus, such as 'beautiful_jj day_nn'.  In Word Count, you can choose to display words and tags in separate columns.  Tag Only does not work in Concord, but in Cluster, Collocation, and Word Count, you can create lists only with tags.
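To make the 'jj nn' example concrete, here is a minimal Python sketch of a tag-sequence search over a word_tag corpus.  This is not how CasualConc does it internally; the function name `find_tag_sequences` and the sample sentence are made up for illustration, and the underscore separator is taken from the 'beautiful_jj day_nn' example above.

```python
import re

def find_tag_sequences(text, tags):
    """Return every run of word_tag tokens whose tags match `tags` in order."""
    # One sub-pattern per tag: any word, an underscore, then the exact tag.
    parts = [r"\S+_" + re.escape(tag) for tag in tags]
    # The lookarounds anchor the match on whole tokens, so 'nn' will not
    # match a longer tag such as 'nns'.
    pattern = re.compile(r"(?<!\S)" + r"\s+".join(parts) + r"(?!\S)")
    return pattern.findall(text)

# Hypothetical POS-tagged sample text.
corpus = "It_pp was_vbd a_at beautiful_jj day_nn in_in a_at small_jj town_nn ._."
print(find_tag_sequences(corpus, ["jj", "nn"]))
# ['beautiful_jj day_nn', 'small_jj town_nn']
```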

If you select Regular Expression, you can set case sensitivity.


Another new feature is character replacement settings.  In the current official version (1.0.2), some characters (mostly symbols and other non-alphabetic characters) are automatically replaced.  For example, smart quotes (“”), which are usually used in word processing applications, are treated as regular characters (like letters) because of their assignments in Unicode (double byte?).  Also, if you copy/paste or convert text from MSWord, or read MSWord documents directly, there will be many multi-byte characters.  In the new version (or beta), you can specify (in other words, you need to specify) the replacements yourself.

To enable this feature, in Preferences -> General, check Replace Characters.


Then, click Show List to open the Replace Characters panel.  Click the Add button to add a new entry and enter the characters directly in the table.  To apply any of the replacement pairs, check the box on the left before you run the analysis.  This feature is not fully functional, so if you enter a weird character in the table, it might not accept it (this happened when I entered a garbage character [half of the Unicode code of a character]).
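As a rough illustration of what a list of replacement pairs with check boxes amounts to, here is a small Python sketch.  The data layout (a list of enabled/find/replace tuples) and the function name `apply_replacements` are my assumptions for the example, not CasualConc's actual format.

```python
# Each entry: (checked, character to find, replacement).
# The layout is hypothetical; it only mirrors the check box and the two
# columns of the Replace Characters table.
replacements = [
    (True, "\u201c", '"'),   # left double smart quote
    (True, "\u201d", '"'),   # right double smart quote
    (False, "\u2019", "'"),  # right single smart quote, left unchecked
]

def apply_replacements(text, pairs):
    """Apply only the checked replacement pairs, in list order."""
    for enabled, old, new in pairs:
        if enabled:
            text = text.replace(old, new)
    return text

sample = "\u201cIt\u2019s fine,\u201d she said."
print(apply_replacements(sample, replacements))
```

With these settings, the double smart quotes become straight quotes, while the unchecked apostrophe entry is left alone.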



Next is Include Words.  You can specify a sequence of characters to be treated as a word.  In the current official version, you can specify these in the Preferences, but they are always applied to every file.  With this new function, you can create different groups of such items.  To enable this, check Include Items on 'Include Words' list in Preferences -> General.


Clicking the Include Words List button displays a panel, but it is integrated with the Stop Word/Skip Character list, so I will explain it together with them below.

You can also specify any character to be treated as part of a word (if this works properly).  Check Other characters to be included and enter any characters you want.  If you check Includes word initial, a word that starts with one of the specified characters should be treated as a word including that character (I'm not sure how well this works, though).
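The idea behind these two options can be sketched with a simple regular-expression tokenizer in Python.  This is only my approximation of the behavior described above; the function `tokenize` and its parameters `extra_chars` and `allow_initial` are hypothetical names, not CasualConc settings.

```python
import re

def tokenize(text, extra_chars="", allow_initial=False):
    """Split text into words, optionally treating extra characters as word characters."""
    extra = re.escape(extra_chars)
    if allow_initial:
        # Extra characters may appear anywhere, including at the start of a word.
        pattern = rf"[\w{extra}]+"
    else:
        # A word must start with a normal word character.
        pattern = rf"\w[\w{extra}]*"
    return re.findall(pattern, text)

print(tokenize("#tag and e-mail", "#-"))
# ['tag', 'and', 'e-mail']
print(tokenize("#tag and e-mail", "#-", allow_initial=True))
# ['#tag', 'and', 'e-mail']
```

Note that with `allow_initial=True`, a stand-alone extra character (such as a lone hyphen) would also come out as a "word" in this sketch.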



Stop Word/Skip Character function.  This is a totally new function.  You can create and manage stop words and skip characters.  The latter is for languages written with multi-byte characters, such as Japanese, so that punctuation characters can be ignored.  In these languages, punctuation characters are also multi-byte, so they are treated as regular characters (like letters).  This function is meant to avoid that.
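In other words, skip characters are simply dropped before the text is analyzed.  A minimal Python sketch of that idea, assuming a hand-picked set of Japanese punctuation characters (the set and the function name `remove_skip_characters` are mine, for illustration only):

```python
# Hypothetical skip character group for Japanese text.
skip_chars = {"、", "。", "「", "」"}

def remove_skip_characters(text, skip):
    """Drop every character that is listed as a skip character."""
    return "".join(ch for ch in text if ch not in skip)

print(remove_skip_characters("今日は、晴れです。", skip_chars))
# 今日は晴れです
```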

To use this function, go to Main Menu -> Window -> Stop Word/Skip Character List Panel.


A panel appears.  On the left, you manage groups.  First, enter a group name and click the Add button to create a new group.  Then select a group in the left table.  You can delete or rename a group.  On the right, you have three choices: Stop Words, Skip Characters, and Include Words.  Select the appropriate tab, enter a word/character, and click the Add button.  You can remove any word by clicking the Remove button.  You can also import/export the list.  Import accepts a plain text file, and you can select an encoding.  The format is one word/character per line.  Exported text will be in the same format (one word/character per line).
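If you want to prepare or check such a list outside CasualConc, the one-item-per-line format is easy to read and write.  Here is a small Python sketch; the function names and the file name `stopwords.txt` are just examples, and the encoding should match whatever you select in the panel.

```python
import tempfile
from pathlib import Path

def export_list(items, path, encoding="utf-8"):
    """Write one word/character per line."""
    Path(path).write_text("\n".join(items) + "\n", encoding=encoding)

def import_list(path, encoding="utf-8"):
    """Read one word/character per line, ignoring empty lines."""
    lines = Path(path).read_text(encoding=encoding).splitlines()
    return [line for line in lines if line]

# Round trip through a temporary folder (file name is an example).
with tempfile.TemporaryDirectory() as folder:
    list_path = Path(folder) / "stopwords.txt"
    export_list(["the", "a", "of"], list_path)
    print(import_list(list_path))
# ['the', 'a', 'of']
```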


You can set whether to apply stop words in particular tools.  Go to Preferences -> Others and check the tools in which stop words should be deleted.



The language/group choice and stop word/skip character status, as well as the current search word mode, are now displayed on the main window.  SW means stop words are enabled, and SK means Skip Characters is enabled.  You can switch the Search Word mode and the stop word/skip character group on the main window.



Well, this post is getting long, so I'll stop here.
