Saturday, April 10, 2010

The current status of CasualConc beta - Corpus File Information

Corpus File Information is perhaps the most enhanced tool in this beta.

In the current working version, Corpus File Info only produces basic file information and the number of n-letter words, which is not very interesting.  In the new version, you can do a lot more with Corpus File Info.

Corpus File Info now has three modes. 


Basic Info is what the current working version does: basic frequency statistics with frequencies of n-letter words.  Word Freq Info creates a frequency matrix (the frequencies of specified words in each file).  TF-IDF is a measure of the prominence or importance of words in a file.  For more information, read this Wikipedia entry.  To run a TF-IDF analysis, you need to run Word Count or import a word list that includes, for each word, the number of files in the corpus in which it appears.

Let me start with Word Freq Info.  If you select Word Freq Info, the following items appear in the window.


First, you select how to create the list of words to count in the selected corpus.


You can select one of three sources.


Once you select the source, click the Import button.  You can limit the number of words you import in Preferences -> File Info.  If you uncheck this option, CasualConc tries to import all the words available in the source.


If you select From Word Count, the words in the left table of Word Count are imported, from the first one up to the number specified as the limit, or all of them if no limit is set.  The words are imported in whatever order Word Count currently shows.  So if you sort the list alphabetically, the imported list is in that order.

If you select From File and click Import, you will be prompted to specify the format of the word list.  You can only import a plain text file in a delimited format (CSV or tab-delimited).  You can specify how many columns from the left or rows from the beginning.  You can click the Check button to check what will be imported.
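To make the file import step concrete, here is a minimal Python sketch of reading a delimited word list.  It is not CasualConc's code.  The function name and its parameters (skip_rows, column, limit) are hypothetical, and it assumes one possible reading of the options above: some rows at the top are skipped, the word is taken from one chosen column, and an optional limit caps the number of words.

```python
import csv
import io

def read_word_list(text, delimiter="\t", skip_rows=0, column=0, limit=None):
    """Read words from delimited text (hypothetical helper, not CasualConc's API)."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    words = []
    for i, row in enumerate(reader):
        # Skip header rows and rows too short to contain the chosen column.
        if i < skip_rows or len(row) <= column:
            continue
        word = row[column].strip()
        if word:
            words.append(word)
        # Stop once the optional import limit is reached.
        if limit is not None and len(words) >= limit:
            break
    return words

# Made-up tab-delimited sample with one header row.
sample = "rank\tword\tfreq\n1\tthe\t120\n2\tof\t75\n3\tand\t60\n"
words = read_word_list(sample, skip_rows=1, column=1, limit=2)
print(words)
```

For a CSV file, the same function would be called with delimiter=",".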


If you select From Import Panel and click Import, the import panel appears.


You can enter a list directly or copy and paste one (one item/word per line).  Once you finish creating the list, click the Read button.

You can check what was imported by clicking the Check button on the main window.


You can sort words in alphabetical order by clicking the column header.  The last column is the original order of the imported words.  You can delete words from the list if you want.

Once you are happy with the list, you can specify the range of words to count.  If you want to count only the first 20, enter 1 and 20 in the boxes.


Then click Get File Info.  With the default settings, the result will look like this.  The column headers are the words that were counted, and the numbers are the frequencies of each word in each file (and Total).
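For readers who want to see the idea behind the table, here is a small Python sketch of a frequency matrix.  It is only an illustration under stated assumptions: the tokenizer is a simplistic one of my own, the file names and texts are made up, and CasualConc's own tokenization will differ.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase and split on runs of letters/apostrophes (an assumed, simplistic tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_matrix(files, words):
    """Return {file name: [count of each word]} plus a final 'Total' row."""
    matrix = {}
    totals = [0] * len(words)
    for name, text in files.items():
        counts = Counter(tokenize(text))
        row = [counts[w] for w in words]  # Counter gives 0 for absent words
        matrix[name] = row
        totals = [t + c for t, c in zip(totals, row)]
    matrix["Total"] = totals
    return matrix

# Made-up two-file corpus and word list.
files = {
    "a.txt": "The cat sat on the mat.",
    "b.txt": "The dog chased the cat and the bird.",
}
words = ["the", "cat", "dog"]
matrix = frequency_matrix(files, words)
for name, row in matrix.items():
    print(name, row)
```

Each row corresponds to one file and each column to one word in the imported list, which mirrors the layout of the result table.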


If you check Normalize word freq in Preferences -> File Info, you can convert the frequencies to percentages or to a rate per xxx words.


If you select %, the result will look like this.
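The conversion itself is simple arithmetic.  The sketch below assumes the denominator is the total number of tokens in the file, which is my assumption and not something confirmed from the CasualConc source; the function name is hypothetical.

```python
def normalize(count, total_tokens, per=100):
    """Scale a raw count to a rate per `per` tokens (per=100 gives percent)."""
    if total_tokens == 0:
        return 0.0  # avoid dividing by zero for an empty file
    return count * per / total_tokens

# A word that occurs 3 times in an 8-token file.
percent = normalize(3, 8)
per_thousand = normalize(3, 8, per=1000)
print(percent, per_thousand)
```

Normalizing this way makes files of different lengths comparable, which raw counts are not.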


If you enable Sort frequency list of each file by word frequency, the result for each file is sorted in order of frequency.


If you create a list with percent and sort the result by frequency, you will get a result like this:


The process of importing a word list is the same for TF-IDF.  But to run TF-IDF, you need to have a word list in Word Count, and the list must include the number of files in the corpus in which each word appears.  If you run Word Count, this information is in the table.  Here is the result with the default settings.  A value of .00 means the word appears in all the files in the corpus.
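A short Python sketch shows why a word that appears in every file scores .00.  It assumes the common textbook definition, term count times log(N / df), where N is the number of files and df is the number of files containing the word.  CasualConc's exact formula (log base, any normalization of the term count) may differ, and the numbers are made up.

```python
import math

def tf_idf(term_count, doc_freq, n_docs):
    """term_count: occurrences in this file; doc_freq: number of files containing the word."""
    # Assumed definition: tf * log(N / df). log(1) == 0 when df == N.
    return term_count * math.log(n_docs / doc_freq)

n_docs = 4
common = tf_idf(term_count=12, doc_freq=4, n_docs=n_docs)  # word found in every file
rare = tf_idf(term_count=3, doc_freq=1, n_docs=n_docs)     # word found in one file only
print(f"{common:.2f}", f"{rare:.2f}")
```

Under this definition, however frequent a word is within a file, its score is zero if every file in the corpus contains it.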


You can select how to sort the results in Preferences -> File Info.


If you select Sum of all files, the TF-IDF values of each word are added up across files, and the words are sorted by that sum.


If you select Each file, the words are sorted separately for each file, based on the TF-IDF values in that file.
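The difference between the two sort options can be sketched in Python as follows.  The scores are made-up values and the function names are hypothetical; this only illustrates the two orderings and is not how CasualConc implements them.

```python
# Hypothetical TF-IDF scores: {file name: {word: score}}
scores = {
    "a.txt": {"corpus": 2.0, "token": 0.5, "lemma": 1.0},
    "b.txt": {"corpus": 0.1, "token": 3.0, "lemma": 1.0},
}

def sort_by_sum(scores):
    """One shared word order, ranked by the sum of each word's scores across files."""
    sums = {}
    for per_file in scores.values():
        for word, value in per_file.items():
            sums[word] = sums.get(word, 0.0) + value
    return sorted(sums, key=sums.get, reverse=True)

def sort_each_file(scores):
    """A separate word order for every file, ranked by that file's own scores."""
    return {
        name: sorted(per_file, key=per_file.get, reverse=True)
        for name, per_file in scores.items()
    }

by_sum = sort_by_sum(scores)
per_file_order = sort_each_file(scores)
print(by_sum)
print(per_file_order)
```

With the sum, all files share one column order; with per-file sorting, each file gets its own ranking, so the same word can be first in one file and last in another.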


Personally, I've never used this for my research, but it seems to be a well-known indicator in text mining.

In any case, if you want to calculate TF-IDF values for all the words but display only a limited number of them, uncheck Limit the number of words to import to and import all the words from Word List.  Then set Limit result table columns to a reasonable number.


I once tried this with no limit on either setting, using a corpus of about one hundred thousand tokens that had several thousand unique words.  This means the table had several thousand columns.  When I tried to scroll the table, scrolling even one row took several minutes.  So don't try it!
