Friday, April 30, 2010

A CasualConc bug fix

I got a bug report and fixed it (thank you, Adriano).

Bug fix
- CasualConc crashed when searching with a wildcard in Database mode.

The report was about Concord, but I think it also happened in Cluster and Collocation/Cooccurrences.

This is not a problem in the beta version.  I fixed it in the beta when I found the bug, but forgot to apply the fix to the working version.

You can download CasualConc from the CasualConc site.

If you find any other bugs in the working version (1.0.2), please send me a bug report.  Currently, I'm only using the beta version, so unless I receive a report, I will only fix bugs that are common to both the working version and the beta version.

Wednesday, April 21, 2010

The current status of CasualConc beta

Since it's hard to follow what I've written in the series of posts on the current status of CasualConc beta, I will put the links in this post.

The current status of CasualConc Beta
The current status of CasualConc beta - General/Global
The current status of CasualConc beta - General/Global part 2
The current status of CasualConc beta - Concord
The current status of CasualConc beta - Cluster/Collocation/Cooccurrence
The current status of CasualConc beta - Word Count
The current status of CasualConc beta - Corpus File Information
The current status of CasualConc beta - Interface
The current status of CasualConc beta - experimental features
The current status of CasualConc beta - experimental features 2

I might update these pages or add a new post if I find other features I've added (but couldn't remember when I wrote these posts).

You can download the beta and the current working version from the CasualConc site (English and Japanese).  Please follow the link on the right.

Saturday, April 10, 2010

The current status of CasualConc beta - experimental features 2

This post will be the last in the series 'The current status of CasualConc beta'.  The last feature is also an experimental one: gap n-gram list creation (for lack of a better term).

What this does is simple: you can create an n-gram list in which one of the words in each n-gram (3 to 5 words) is treated as a gap, or wildcard, or whatever you want to call it.  In the experimental beta version, when you select 3-gram, 4-gram, or 5-gram in Word Count, a check box appears.


Check this box and click Count.


Because this process can take a long time and needs a lot of memory if your corpus is large, a warning message appears.  When I tried this with a corpus size of 500,000, the process took almost 10 minutes.

If you are brave enough, this is what you get.  The corpus I used is the Inaugural Addresses of the Presidents of the United States corpus, prepared by Prof. Tabata at Osaka University for a workshop I attended.  As you can see, the gap is represented by * and the words that appear in that slot are listed in the Gap words column with frequency information.
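To make the idea concrete, here is a rough sketch in Python of how a gap n-gram list could be built. This is not CasualConc's actual code; the function name `gap_ngrams` is made up for illustration, and it assumes that every position in the n-gram (including the first and last) can serve as the gap, which may differ from what CasualConc does.

```python
from collections import Counter, defaultdict


def gap_ngrams(tokens, n=3):
    """Return {pattern: Counter of gap words} for all n-grams in tokens.

    A pattern is a tuple of n items with exactly one slot replaced by "*".
    Assumption: every position of the n-gram is tried as the gap.
    """
    patterns = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        gram = tokens[i:i + n]
        for gap in range(n):
            # Replace the word at position `gap` with the wildcard marker.
            pattern = tuple(gram[:gap] + ["*"] + gram[gap + 1:])
            # Remember which word filled the gap and how often.
            patterns[pattern][gram[gap]] += 1
    return patterns


if __name__ == "__main__":
    text = "we the people of the united states we the citizens of the united states"
    result = gap_ngrams(text.split(), n=3)
    for pattern, gap_words in result.items():
        total = sum(gap_words.values())
        if total > 1:
            print(" ".join(pattern), total, dict(gap_words))
```

Every n-gram produces n patterns, each with its own list of gap words, so memory use grows quickly with corpus size. That is consistent with the warning about time and memory above.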


You can select one of the lines and see the entire list of gap words in a table.


Select a line and right click the table.


A panel with a table appears with the list.


You can copy the list and paste it into other applications.  In this case all the gap words appeared on the Word Count table, but this panel is basically designed to let you see all the words when they are not all displayed.


You can see all of them on the table.



OK, that's it.  I think I covered almost all the new and enhanced features for the next version (current beta) of CasualConc.  The current beta has all these new features except for the last two experimental features.  You can download the beta version from the Download CasualConc Beta page (Japanese page).

If you are interested in the experimental beta build, please contact me directly at casualconc (at) gmail.com. 

Since this is a beta version, CasualConc can be unstable and might have bugs related to the changes I have made.  If you ever try this beta version, I'd appreciate your feedback/bug reports.

The current status of CasualConc beta - experimental features

In this post (and possibly the next) I will present two new experimental features of CasualConc.  These two features are still too experimental to be included in a beta on the site.  So if you become interested after reading this post, please contact me directly.  The email address is on the CasualConc main site.

The first one is visualization of collocations and the second is n-gram search with a gap.  I will explain what these mean.

The visualization of collocations is an idea proposed by Prof. Tabata at Osaka University, Japan.  This is simply a realization of his idea (not mine).  Now, let me show you the current implementation.

To use this function, you need a word list and collocation results from the same corpus.  So run Word Count and Collocation.  The Visual (name is tentative) button is on the Collocation tool.  Click it to display the Visualizer panel.


The settings of the Visualizer are as follows:


The top one is which statistic to use for visualization.  The choices are shown below.


Then select the span/position of collocated words.  The upper choice is a position from L5 to R5.  If you select R1, only the information on the R1 position (frequency) is used to calculate the selected statistic.  If you enable Span, the calculation will be based on the tally of frequencies up to the selected position.  For example, if you select R3 and check Span, the information in the R1, R2, and R3 positions will be used to calculate the statistic.  The lower choice is a span to the left and right of the keyword.  You can select from L1~R1 to L5~R5.
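The position and span options described above can be sketched in Python as follows. This is only an illustration, not CasualConc's code: the function names and the dictionary layout (position labels such as "L1" or "R3" mapped to frequencies) are assumptions made for the example.

```python
def position_frequency(counts, position, span=False):
    """Frequency used for one collocate at a position such as "R3".

    counts: dict mapping labels like "L1".."L5", "R1".."R5" to frequencies.
    Without span, only the selected position is used.
    With span, positions 1..distance on the same side are tallied.
    """
    side, distance = position[0], int(position[1:])
    if not span:
        return counts.get(position, 0)
    return sum(counts.get(f"{side}{d}", 0) for d in range(1, distance + 1))


def symmetric_span_frequency(counts, width):
    """Tally for the lower choice, e.g. width=2 means L2~R2."""
    return sum(
        counts.get(f"{side}{d}", 0)
        for side in ("L", "R")
        for d in range(1, width + 1)
    )


if __name__ == "__main__":
    counts = {"L1": 7, "R1": 5, "R2": 3, "R3": 2}
    print(position_frequency(counts, "R3"))             # R3 only
    print(position_frequency(counts, "R3", span=True))  # R1 + R2 + R3
    print(symmetric_span_frequency(counts, 1))          # L1 + R1
```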


Take xxx words means the first xxx words on the Collocation table will be used for visualization.  So if you sort the results on the Collocation table, the words taken from the list will be affected.

Now let's see what this does.


With this setting, the result will look like this.  The larger the number, the bigger the font size.
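As a rough illustration of "larger number, bigger font", the values can be mapped to font sizes with a simple linear scale. This sketch is an assumption, not how CasualConc actually sizes the words; the function name `font_sizes` and the 10 to 48 point range are made up for the example.

```python
def font_sizes(values, min_pt=10.0, max_pt=48.0):
    """Map each word's value linearly onto a font size in points.

    values: dict mapping word -> statistic (e.g. frequency).
    The smallest value gets min_pt and the largest gets max_pt.
    """
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # All values are equal, so use one middle size for every word.
        return {word: (min_pt + max_pt) / 2 for word in values}
    return {
        word: min_pt + (value - lo) / (hi - lo) * (max_pt - min_pt)
        for word, value in values.items()
    }


if __name__ == "__main__":
    # The third entry is a made-up value added to complete the example.
    for word, size in font_sizes({"our": 80, "american": 40, "example": 0}).items():
        print(word, size)
```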


This simply reflects the frequency of words in the L1 position.  Our (80) is the most frequent and American (40) follows.  But the picture is quite different with MI (Mutual Information).


You can also combine frequency information with the result of another statistic.  If you enable Include Freq Info, the frequency information will be added as a gray scale.


The result will look like this:


If you click the Stats button, you can see the actual numbers in a table.  You can sort by alphabetical order of words or by stats values.


If you check Ignore zero occurrence, words with zero frequency will be removed from the display.





If you choose Log-Likelihood, the higher values can become extreme, so you can convert the LL values with a base-10 logarithm.  To enable this, click Convert LL val to log.


With the original LL values, the image will look like this:


With the conversion, it will look like this:


The choice is up to you.
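To see why the conversion tames the extremes, here is a minimal Python sketch. It assumes the conversion is a plain base-10 logarithm of each LL value; how CasualConc treats zero or negative values is not stated, so mapping them to 0.0 here is purely an assumption for the example.

```python
import math


def convert_ll(values):
    """Convert Log-Likelihood values to log10 to compress large values.

    values: dict mapping word -> LL value.
    Assumption: values that are not positive are mapped to 0.0.
    """
    return {
        word: math.log10(value) if value > 0 else 0.0
        for word, value in values.items()
    }


if __name__ == "__main__":
    original = {"a": 1000.0, "b": 10.0, "c": 0.0}
    print(convert_ll(original))
```

In this made-up example, the gap between 1000 and 10 (a factor of 100) shrinks to a gap between 3 and 1, so a single very high value no longer dwarfs everything else in the image.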

The most experimental part is visualization with four statistics.  By clicking Use Multiple info, you can incorporate three additional statistic values.

The current implementation is highly experimental and not tuned to display the most effective color scheme, but the basic idea is that the value of each statistic is assigned to one of the RGB channels.  The higher the value, the lower the color value.  So if the value for a certain word is high in all three, the font color should be closer to black.  If, in the above example, the Log-Likelihood value is very low and the z-score and Log-log values are very high, the font color should be close to red.  These are primary colors, so 100% on all of them means white and 0% on all of them means black.
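The color idea can be sketched in Python like this. It is only an illustration under stated assumptions, not the actual implementation: the channel assignment (Log-Likelihood to red, z-score to green, Log-log to blue) is inferred from the example in the text, and scaling each statistic linearly between its own minimum and maximum is a guess. The function names are hypothetical.

```python
def stat_to_channel(value, lo, hi):
    """Map a statistic onto a 0-255 color channel, inverted.

    The higher the value (relative to the lo..hi range), the lower the
    channel value, so high values push the color toward black.
    """
    if hi == lo:
        return 0
    relative = (value - lo) / (hi - lo)
    relative = min(1.0, max(0.0, relative))  # clamp to the 0..1 range
    return round(255 * (1.0 - relative))


def font_color(ll, z_score, log_log, ranges):
    """Return an (R, G, B) tuple for one word.

    ranges: dict with keys "ll", "z", "loglog", each a (min, max) pair.
    Assumed channel assignment: LL -> red, z-score -> green, Log-log -> blue.
    """
    return (
        stat_to_channel(ll, *ranges["ll"]),
        stat_to_channel(z_score, *ranges["z"]),
        stat_to_channel(log_log, *ranges["loglog"]),
    )


if __name__ == "__main__":
    ranges = {"ll": (0, 10), "z": (0, 10), "loglog": (0, 10)}
    print(font_color(10, 10, 10, ranges))  # high in all three
    print(font_color(0, 10, 10, ranges))   # low LL, high z-score and Log-log
    print(font_color(0, 0, 0, ranges))     # low in all three
```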

When the above three statistics are applied with MI as the primary statistic, the image will look something like this:


Bluish or greenish font colors mean the relative values of z-score and Log-log are low compared to the relative value of Log-Likelihood.  But because the actual values of each statistic can vary a lot, the displayed color scheme may not reflect the true relationships among the statistics.  I need to figure out the best way to visualize the relationships among the statistic values.  If you have any suggestions, I'd most appreciate them.


Finally, the statistic values of all four indicators can be checked on the stats value table.



You can sort the items by clicking the header of columns.



That's about it.  As I mentioned at the beginning, this feature is not available on the current beta.  If you'd like to try this, please contact me directly.  My email address is on the CasualConc sites (the links are in the right-side column of this blog).

The current status of CasualConc beta - Interface

The most salient difference in the new version for Japanese users is the interface.  If you use CasualConc in a Japanese language environment, interface items and messages will be displayed in Japanese.  Here is an example from Concord.




Messages are also in Japanese (unless I forgot to change them).


Of course, Preferences are also in Japanese.


When this new version is out of beta, I will include help files in Japanese (the current beta does not have help files included).


OK, this is the end of the new feature showcase.  These should be available in the most up-to-date beta.

For the next couple of posts, I will show you some experimental features.  Those are not enabled in the beta on the site, but if you are interested in testing the features, please contact me directly.

The current status of CasualConc beta - Corpus File Information

Perhaps Corpus File Information has received the most enhancements.

In the current working version, Corpus File Info only creates basic file information and the number of n-letter words, which is not very interesting.  In the new version, you can do a lot more with Corpus File Info.

Corpus File Info now has three modes. 


Basic Info is what the current working version does: basic frequency stats with frequencies of n-letter words.  Word Freq Info creates a frequency matrix (frequencies of specified words in each file).  TF-IDF is a measure of the prominence or importance of words in a file.  For more information, read this Wikipedia entry.  To run a TF-IDF analysis, you need to run Word Count or import a word list that includes the number of files in the corpus in which each word appears.

Let me start with Word Freq Info.  If you select Word Freq Info, the following items appear on the window.


First, select how to create the list of words to count in the selected corpus.


You can select one of the three sources.


Once you select the source, click the Import button.  You can limit the number of words you import in Preferences -> File Info.  If you uncheck this, CasualConc tries to import all the words available in the source.


If you select From Word Count, the words in the left table of Word Count are imported, from the first one up to the number specified in the limit (or all of them if no limit is set).  The words are imported in whatever order they appear in Word Count.  So if you sort the list alphabetically, the imported list is in that order.

If you select From File and click Import, you can import a word list from a file, and you will be prompted to specify the format of the word list.  You can only import a plain text file with a certain format (CSV or tab-delimited).  You can specify how many columns from the left or rows from the beginning.  You can click the Check button to check what will be imported.


If you select From Import Panel and click Import, the import panel appears.


You can directly enter or copy/paste a list (one item/word per line).  Once you finish creating a list, click the Read button.

You can check what was imported by clicking the Check button on the main window.


You can sort words in alphabetical order by clicking the column header.  The last column is the original order of the imported words.  You can delete words from the list if you want.

Once you are sure about the list, you can specify the range of words to count.  If you want to count only the first 20, you enter 1 and 20 in the boxes.


Then click Get File Info.  With the default settings, the result will look like this.  The headers are the words that were counted and the numbers are the frequencies of each word in each file (and Total).
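The frequency matrix itself is a simple structure: one row per file, one column per word, plus a Total row. Here is a minimal Python sketch of that idea. The function name and the input layout (file names mapped to token lists) are assumptions for illustration, and tokenization is left out entirely.

```python
from collections import Counter


def frequency_matrix(files, words):
    """Build {file name: [count of each word]} plus a "Total" row.

    files: dict mapping file name -> list of tokens in that file.
    words: list of words to count, in column order.
    """
    rows = {}
    for name, tokens in files.items():
        counts = Counter(tokens)
        rows[name] = [counts[word] for word in words]
    # Column-wise sums over all files give the Total row.
    rows["Total"] = [sum(column) for column in zip(*(rows[name] for name in files))]
    return rows


if __name__ == "__main__":
    files = {
        "a.txt": "the cat and the dog".split(),
        "b.txt": "the bird".split(),
    }
    words = ["the", "cat", "fish"]
    print("file", *words)
    for name, row in frequency_matrix(files, words).items():
        print(name, *row)
```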


If you check Normalize word freq in Preferences -> File Info, you can convert the frequency to percent or per xxx words.
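As a small worked example of the normalization, the sketch below converts a raw count to a percentage or a per-N-words rate. It assumes the denominator is the total number of tokens in the file, which is the usual choice but is not confirmed here; the function name is hypothetical.

```python
def normalize(count, file_tokens, per=None):
    """Normalize a raw frequency.

    count: raw frequency of the word in the file.
    file_tokens: total number of tokens in the file (assumed denominator).
    per: None for percent, or a number such as 1000 for "per 1000 words".
    """
    if file_tokens == 0:
        return 0.0
    base = 100 if per is None else per
    return count / file_tokens * base


if __name__ == "__main__":
    print(normalize(5, 1000))             # percent
    print(normalize(5, 1000, per=10000))  # per 10,000 words
```

Normalizing this way makes counts comparable across files of different lengths.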


If you select %, the result will look like this.


If you enable Sort frequency list of each file by word frequency, you can sort the result for each file in order of frequency.


If you create a list with percent and sort the result by frequency, you will get a result like this:


The process of importing a word list is the same for TF-IDF.  But to run TF-IDF, you need to have a word list in Word Count, and the list should include the number of files in the corpus in which each word appears.  If you run Word Count, this information is in the table.  Here is the result with the default settings.  .00 means the word appears in all the files in the corpus.
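The reason a word that appears in every file gets .00 follows from the usual textbook definition of TF-IDF: term frequency multiplied by log(N / df), where N is the number of files and df is the number of files containing the word. When df equals N, the logarithm is zero. The Python sketch below uses that textbook form; whether CasualConc uses exactly this variant (relative term frequency, natural logarithm) is an assumption.

```python
import math


def tf_idf(term_count, file_tokens, n_files, doc_freq):
    """Textbook TF-IDF for one word in one file.

    term_count: occurrences of the word in the file.
    file_tokens: total tokens in the file.
    n_files: number of files in the corpus.
    doc_freq: number of files in which the word appears.
    Assumptions: relative term frequency and natural log for IDF.
    """
    if file_tokens == 0 or doc_freq == 0:
        return 0.0
    tf = term_count / file_tokens
    idf = math.log(n_files / doc_freq)
    return tf * idf


if __name__ == "__main__":
    # A word found in all 10 files: IDF is log(1), so the score is zero.
    print(tf_idf(5, 100, 10, 10))
    # The same count for a word found in only 1 of 10 files.
    print(tf_idf(5, 100, 10, 1))
```

This is also why the number of files each word appears in has to be available from Word Count before TF-IDF can be calculated.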


You can select how to sort the results in Preferences -> File Info.


If you select Sum of all files, the TF-IDF values for a word will be added up and the sorting is based on the sum of the value of TF-IDF on each file.


If you select Each file, the sorting will be done for each file based on the TF-IDF values.
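The two sorting options can be sketched as follows. This is an illustration only: the function names, the nested-dictionary layout, and the descending sort order are all assumptions.

```python
def sort_by_sum(scores):
    """Order words by the sum of their TF-IDF values over all files.

    scores: dict mapping file name -> dict mapping word -> TF-IDF value.
    Returns one word order shared by every file (highest sum first).
    """
    totals = {}
    for per_file in scores.values():
        for word, value in per_file.items():
            totals[word] = totals.get(word, 0.0) + value
    return sorted(totals, key=lambda word: totals[word], reverse=True)


def sort_each_file(scores):
    """Order words separately for each file by that file's TF-IDF values."""
    return {
        name: sorted(per_file, key=lambda word: per_file[word], reverse=True)
        for name, per_file in scores.items()
    }


if __name__ == "__main__":
    scores = {
        "a.txt": {"x": 0.1, "y": 0.3, "z": 0.0},
        "b.txt": {"x": 0.5, "y": 0.1, "z": 0.0},
    }
    print(sort_by_sum(scores))
    print(sort_each_file(scores))
```

With Sum of all files, every file shares one column order; with Each file, each file gets its own ranking.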


Personally, I've never used this for my research, but it seems to be a well-known indicator in text mining.

In any case, if you want to calculate TF-IDF values for all the words but only display a limited number of them, uncheck Limit the number of words to import to and import all the words from Word List.  Then set Limit result table columns to a reasonable number.


I once tried this with no limits on either setting, using a corpus of about one hundred thousand tokens that had several thousand unique words.  This means the table had several thousand columns.  When I tried to scroll the table, even scrolling only one row took several minutes.  So don't try it!