Friday, September 19, 2008

TextExtractor

I haven't had time to write any scripts at all for a while, but I've found some time to experiment with RubyCocoa. The stuff I'm working on won't change anything on the surface of CasualConc. The changes will be mostly internal and slow.

The very few feature requests I've received are for XML file handling and parallel concordancing. But I don't have any experience with either, so I need more information. I added two threads about them to the Google Groups discussion board. If anyone reads this and gives me information about these two, I might add them or write a separate program for them (a parallel concordancer).

Another thing I'm considering about CasualConc is dropping East Asian language support. I don't hear from anybody who uses CasualConc for Japanese/Korean/Chinese, so I don't know if I really want to keep trying to accommodate these functions in CasualConc. It would probably be easier for me to maintain if I separated the East Asian language concordancer (esp. KWIC) into another program. I'll think about this more, but if you have any suggestions, let me know.

Anyway, as part of my experiment with RubyCocoa, I updated TextExtractor, a utility program that extracts text data from various files with embedded text and converts the text encoding of plain text files to UTF-8. I'm not sure if you have looked at the utility program section of the CasualConc site, but I have a few utility programs there that deal with text files. I combined two of them (the PDF to Text and HTML to Text converters) and added a few extra functions. The first version (0.1) of TextExtractor included jparser (a Japanese parsing program using MeCab), but it didn't run without MeCab and MeCab-Ruby, so I dropped this function.

Instead, that part now simply converts non-UTF-8 text files (.txt) to UTF-8 text files, and MS Word, PDF, HTML, and OpenOffice documents to UTF-8 text files or Rich Text Format (RTF) files. All other parts (PDF to Text, Web file to Text, and batch process) can save files as RTF files. When you convert files to RTF, you can either keep the font information of the original files (fonts, font styles, etc.) or throw this info away and save plain text in an RTF file.
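For anyone curious what the encoding conversion step looks like, here is a minimal sketch in plain Ruby. It is not the TextExtractor code: it assumes a current Ruby (1.9 or later), where strings carry an encoding and `String#encode` is available, and it assumes the source encoding is already known. The helper name `to_utf8` is made up for this example.

```ruby
# Hypothetical helper (not from TextExtractor): convert raw bytes whose
# encoding is already known into a UTF-8 string.
def to_utf8(bytes, source_encoding)
  # Label the bytes with their real encoding, then transcode to UTF-8.
  # Bytes that cannot be converted are replaced instead of raising an error.
  bytes.dup.force_encoding(source_encoding)
       .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end

# Simulate reading a Shift_JIS file as raw bytes.
original   = "日本語"
sjis_bytes = original.encode("Shift_JIS").b

utf8 = to_utf8(sjis_bytes, "Shift_JIS")
puts utf8            # prints the original text
puts utf8.encoding   # prints UTF-8
```

Guessing the encoding of an unknown file is a separate problem and is not covered by this sketch.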

I also added basic instructions in English (not translated into Japanese yet). So if you are interested, please try it and let me know what you think.

EDIT: This program has been renamed CasualTextractor.

Sunday, September 7, 2008

A few very minor changes

Well, obviously, I haven't done anything with CasualConc for the last three months. I finally announced it on the Corpora List, and some people got interested and started testing CasualConc. But I heard from only a few people. Still, it's good to know someone uses it and likes it.

I made a few minor changes/bug fixes to CasualConc. The only one that's worth mentioning here is that now CasualConc remembers the files you selected in File Mode when you quit the program. The next time you start CasualConc, the files you selected last time should be on the file list. Now the version number is 0.9.8 beta.
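The idea of remembering the file list can be sketched in plain Ruby as below. This is only an illustration under my own assumptions, not how CasualConc stores it: the JSON state file and the method names `save_file_list` and `load_file_list` are made up for the example.

```ruby
require "json"
require "tmpdir"

# Hypothetical: write the selected file paths to a small state file on quit.
def save_file_list(state_path, files)
  File.write(state_path, JSON.generate(files))
end

# Hypothetical: restore the list on launch, skipping files that have been
# moved or deleted since the last session.
def load_file_list(state_path)
  return [] unless File.file?(state_path)
  JSON.parse(File.read(state_path)).select { |f| File.file?(f) }
rescue JSON::ParserError
  [] # a damaged state file just means starting with an empty list
end

Dir.mktmpdir do |dir|
  corpus = File.join(dir, "corpus.txt")
  File.write(corpus, "some text")
  state = File.join(dir, "selected_files.json")

  save_file_list(state, [corpus, File.join(dir, "deleted.txt")])
  p load_file_list(state)   # only the file that still exists comes back
end
```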

As always, I'd like to know what you think about the program.

Sunday, June 8, 2008

Bug fixes

I finally got a bug report. Now I know at least a few more people are using CasualConc.

The bugs are related to the recent changes I made to Lemmatization and Collocation.

The bug related to lemmatization was that when lemmatization was activated without specifying a lemma file, CasualConc crashed. This was because CasualConc looked for a lemma file when it started or returned from the preferences and if the file was not found, it crashed.
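The kind of guard that avoids this crash can be sketched as follows. This is not the actual CasualConc code: the method name `load_lemmas` and the lemma file layout (one lemma per line, a tab, then comma-separated word forms) are assumptions made just for this example.

```ruby
require "tempfile"

# Hypothetical loader: returns an empty table instead of crashing when no
# lemma file is specified or the file cannot be found.
def load_lemmas(path)
  return {} if path.nil? || path.empty? || !File.file?(path)

  lemmas = {}
  File.foreach(path, encoding: "UTF-8") do |line|
    # Assumed layout: lemma<TAB>form1,form2,...
    lemma, forms = line.chomp.split("\t", 2)
    next if lemma.nil? || forms.nil?   # skip malformed lines
    forms.split(",").each { |form| lemmas[form.strip] = lemma }
  end
  lemmas
end

# No file specified: lemmatization is simply skipped.
p load_lemmas(nil)

# A small temporary lemma file for demonstration.
Tempfile.create("lemmas") do |f|
  f.write("go\twent,gone\n")
  f.flush
  p load_lemmas(f.path)
end
```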

The two bugs related to collocation were 1) it didn't run in File Mode, and 2) searching in Concord didn't work when the 'Treat Keywords as One Word' option was activated in the preferences. These should be fixed and work properly now.

I would appreciate any bug reports. And I'd like to know how you like CasualConc.

Friday, May 30, 2008

A minor change

I found a bug (sort of) in Concord a couple of days ago. It's a minor bug, and it happens only when you use the database mode in Concord. Well, it's more of a memory leak. I implemented a forced garbage collection when the full text is displayed in the context view of Concord, but somehow the memory was not released. So I changed the way the text is read from a database file. Now it should not keep using additional memory when you select a different concordance line to show the full text.

I use the same technique to read data from a database file when CasualConc searches for a string, but when I made the same change to the search function, it used more memory because the search returns more hits. What this means is that if you search for words or phrases in any of the tools many times, CasualConc keeps using more memory. I haven't tested whether it uses up all the available memory and starts using virtual memory, or whether Ruby starts GC once all the available physical memory is used up. In any case, until I can find a way to solve this problem, you might want to quit CasualConc after a while and restart it.

Tuesday, May 20, 2008

A few more fixes again

I fixed a few more bugs this weekend that are mainly related to lemmatization and collocation statistics. I also added some more documentation to the main site (some in Japanese). The latest version is still 0.9.7 but the date is 05192008.

Now, most of the features I wanted to include in CasualConc are there and mostly functioning. I don't have time to improve the Japanese KWIC feature now, so that should wait until sometime in summer or fall. And unless I find or someone reports any major bugs, I will try not to spend too much time on this for a while. I don't know how many people actually downloaded CasualConc and are using it, but I guess there aren't many. If you happen to be one of them, I'd like to hear what you think about it.

Well, I might need to publicize this a bit more, so I might start trying to get more beta testers somewhere.