Friday, March 14, 2008

Garbage Collection again

After some experiments, some of my efforts paid off, but not all. Then I realized it was not just Ruby that used memory. Because it's written in RubyCocoa, the Cocoa or Objective-C part must use some memory too. I believe Objective-C 2.0 has GC, but OS X uses as much memory as it has and manages it. I might be wrong, but using more memory might not be that bad in itself.

Concord, Cluster, and Collocation might be usable with a modest amount of memory, but Word Count (n-gram) requires a lot of memory. This is because it creates a huge array (all of them create arrays, though). I know my current implementation is not ideal, but maybe I have to improve Word Count first. When I was testing the original Ruby scripts, I only used smaller corpora (far less than 100 mil.). Now I need to figure out a way to reduce memory usage, but how? Does anyone have a good idea? My implementation uses a hash to count, just as any basic Ruby book shows, but I tweaked it a bit to increase processing speed.
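
For reference, here is a minimal sketch of hash-based n-gram counting that reads its input one line at a time instead of first building one big array of every word. This is not the actual CasualConc code. The method name `count_ngrams`, the simple `[a-z']+` tokenizer, and the choice to let n-grams run across line breaks are all assumptions made for illustration.

```ruby
require "stringio"

# Count n-grams from any IO-like object, one line at a time.
# Only the hash of counts and a window of the last n words stay in memory.
def count_ngrams(io, n)
  counts = Hash.new(0)   # missing keys start at 0
  window = []            # the most recent n words
  io.each_line do |line|
    # Assumed tokenizer: lowercase ASCII letters and apostrophes only.
    line.downcase.scan(/[a-z']+/) do |word|
      window << word
      window.shift if window.size > n
      counts[window.join(" ")] += 1 if window.size == n
    end
  end
  counts
end

sample = StringIO.new("the cat sat on the mat\nand the cat slept\n")
bigrams = count_ngrams(sample, 2)

# Print the three most frequent bigrams (ties broken alphabetically).
bigrams.sort_by { |gram, count| [-count, gram] }.first(3).each do |gram, count|
  puts "#{count}\t#{gram}"
end
```

The hash itself still grows with the number of distinct n-grams, so this only removes the intermediate word array. It probably does not solve the memory problem for a really large corpus.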

Anyway, this is partly why I wrote that CasualConc can handle a 1-million-word corpus at reasonable speed. Well, I need time.

Wednesday, March 12, 2008

Garbage Collection

I use CasualConc regularly to look up how certain words are used in context. While using it, I realized CasualConc is a memory hog. I knew Word List, especially when used for an n-gram list, needs a lot of memory because it keeps counting new items while it stores the ones already counted (not exactly, though). I knew Ruby has garbage collection built in, but it didn't seem to run when I wanted it to (maybe because there was still a lot of unused memory). So I decided to force GC to start at some points (GC.start).
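
As a rough illustration of the idea (not the CasualConc code), the sketch below calls `GC.start` right after a method that builds a large temporary array has returned, and counts live String objects before and after. The helper names `live_strings` and `build_word_list` are made up for this example.

```ruby
# Hypothetical helper: count the String objects currently alive.
# ObjectSpace.each_object returns the number of objects it visited.
def live_strings
  ObjectSpace.each_object(String) { }
end

# Hypothetical workload: build a big temporary array and return only its size,
# so the array and its strings become garbage once the method returns.
def build_word_list
  words = Array.new(100_000) { |i| "word#{i}" }
  words.size
end

before     = live_strings
size       = build_word_list
after_work = live_strings
GC.start                      # ask Ruby to run a garbage collection pass now
after_gc   = live_strings

puts "built #{size} words"
puts "live strings: before=#{before} after_work=#{after_work} after_gc=#{after_gc}"
```

The numbers differ from run to run, and Ruby may already have collected some of the garbage on its own before `GC.start` is reached, so this only gives a rough picture. Calling `GC.start` after a big temporary structure is no longer referenced seems like the most promising place to try it.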

But when?

I've been trying several different points in each tool and its associated methods and monitoring the differences. But because I've never seriously studied programming (I'm not and have never been in computer science), I don't think I understand how GC works (in fact, I'm still not sure what exactly an OO language entails). If you are brave enough to take a look at the Ruby/RubyCocoa source code of CasualConc, you can see my scripts are not written in the Ruby way. I hope I have some time to learn to program a little more seriously someday, but for now, CasualConc works OK (at least for me).

Anyway, I'm not sure if anyone will ever read this entry or any entry on this blog, but I'll try to keep a record of this. I want to add some notes on Ruby/RubyCocoa code to this blog if I can.

Sunday, March 9, 2008

PDF to Text converter

In the last post, I wrote that I found a way to extract embedded text from PDF files. I wanted to do something with it before I forgot, so I wrote a simple utility program in Ruby+RubyCocoa and posted it to the CasualConc site. The system requirement is a Mac with Leopard. I simply named it PDFtoTextConverter. What it does is open a PDF and show its embedded text in the text box in the same window. The extracted text can be saved as a .txt file. It also has a batch process mode. You can add PDF files to the list and either select a folder to save the text as .txt files or save the .txt files to the same folder where the original PDF files are stored. If you are interested, please try it. You can go to the CasualConc site by following the link on the right.

EDIT: This program has been discontinued and integrated into CasualTextractor, which is available on the CasualConc Main site under Utility Programs.

Friday, March 7, 2008

PDF

I finally found a way to extract text from text-embedded PDF files in RubyCocoa. I personally don't care about this much, but I guess it might be useful for some people. The problem with handling PDF text is not the extracting part. I mean, the real issues with implementing it in CasualConc are:

1. each line of text ends with a line break character (LF, \n, or maybe CRLF, \r\n?)
2. page headers/footers, etc. that are not the main text are also included
3. embedded text often includes extra spaces, garbled characters (often with ligatures), etc.

1 is probably the main issue. Currently, the basic unit of analysis in CasualConc is the paragraph, which means text separated by LF characters. So it cannot handle text files that break each line with an LF character, such as the Brown Corpus files. This requires some coding (meaning not just adding a few lines), and I can't find time to do it now. I'll try to implement this feature in the future, but I don't know when.
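
One possible approach, sketched below, is to join hard-wrapped lines back into paragraphs before analysis. It assumes that paragraphs in the file are separated by at least one blank line, which is not true of every corpus or every PDF, and the method name `unwrap_paragraphs` is made up for this example.

```ruby
# Join hard-wrapped lines into paragraphs.
# Assumption: a blank line marks a paragraph break.
def unwrap_paragraphs(text)
  text.gsub(/\r\n?/, "\n")        # normalize CRLF and CR line endings to LF
      .split(/\n\s*\n/)           # split at blank lines
      .map { |para| para.split("\n").map(&:strip).reject(&:empty?).join(" ") }
      .reject(&:empty?)           # drop paragraphs that were only whitespace
end

wrapped = "The quick\r\nbrown fox.\r\n\r\nSecond\nparagraph.\n"
paragraphs = unwrap_paragraphs(wrapped)
paragraphs.each { |para| puts para }
```

Files that use some other paragraph marker (indentation, for example) would need a different split rule.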

2 and 3 cannot be avoided, I guess. So I might try to add a feature to extract text from PDF files within CasualConc, but this also requires a certain amount of time.
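
Part of issue 3 can at least be reduced after extraction. The sketch below replaces the five common Latin ligature characters (U+FB00 to U+FB04) with plain letters and collapses runs of spaces. It is only an illustration: the method name `clean_pdf_text` and the ligature table are assumptions, and it does nothing for garbled characters that are not in the table.

```ruby
# Unicode ligature characters and their plain-letter equivalents.
LIGATURES = {
  "\uFB00" => "ff",
  "\uFB01" => "fi",
  "\uFB02" => "fl",
  "\uFB03" => "ffi",
  "\uFB04" => "ffl"
}

# Hypothetical cleanup step for text extracted from a PDF.
def clean_pdf_text(text)
  text.gsub(/[\uFB00-\uFB04]/, LIGATURES)   # expand ligatures via the table
      .gsub(/[ \t]+/, " ")                  # collapse repeated spaces and tabs
      .strip
end

puts clean_pdf_text("the  \uFB01rst   o\uFB03ce ")
```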

But at least I know how to extract text from PDF files. So the feature will be included in a future version of CasualConc. If many people are interested, I might prioritize this (but it probably won't happen until summer at the earliest).

Thursday, March 6, 2008

Google Search

I've been adding documentation to the CasualConc site, although I haven't added a download page yet. Now it has pages for Concordance and Word Cluster along with Basic File Handling.

But now I'm wondering how Google works. I mean, the Google Page Creator Help says a page created by it "can be crawled by Google within a few hours of publication". Well, it says "can be", so the actual time might be longer than a few hours. In fact, the CasualConc site was searchable on Google a couple of days after I published it. BUT now it's not in the search results. It disappeared!!

Maybe I should tell my friends to check this first...

Monday, March 3, 2008

CasualConc

I started this blog to keep track of what I do for CasualConc, experimental concordancing software for Mac OS X 10.5 Leopard (and possibly later versions of OS X).

I started to learn a scripting language called Ruby, which is similar to Perl or Python, last summer. The main reason I chose Ruby was that there is a lot of documentation in Japanese. I don't know if I made the right decision, but I tried both Perl and Ruby and liked Ruby better for no particular reason (Perl simply didn't appeal to me when I tried it). Another reason was that I read somewhere that Apple had decided to include software (?) in Leopard that bridges Ruby and Cocoa, Mac OS X's GUI framework (?). It's called RubyCocoa, and it allows users to add a Mac GUI to Ruby scripts (btw, there's a similar one for Python). Isn't this cool?

At first, I used Ruby for my work (I'm working as an Instructional Technology Consultant at my school), but later decided to learn it more seriously. I'm interested in corpus linguistics and want to do some corpus-based/driven research, so I decided to write some scripts for basic corpus analyses. When OS X 10.5 Leopard came out, I had a few simple scripts for kwic, word count, etc., so I tried to add a GUI to them. It wasn't very easy because there isn't much documentation for RubyCocoa. So I had to learn both Ruby and Cocoa and combine them to make the GUI work.

Now, I have added some more features to kwic and word count and named it CasualConc. It is Mac GUI-based software written in Ruby+RubyCocoa. Because the development environment is OS X 10.5 Leopard, it only runs on Leopard. There might be a way to make it run on Tiger, but I don't want to spend time on it simply because I don't have time (and I don't have the expertise). The current version is 0.9 and still in beta (well, beta simply means I call it so). I don't have much time to make a lot of changes now. From now on, I'll try to fix major bugs and write up some documentation. And now I'd like someone to test it.

There is no guarantee that this works for you, but if you are interested, I'm happy to have you as a beta tester. Here's basic info:

System requirement: a Mac with a lot of memory (at least 1GB) that runs Mac OS X 10.5 Leopard (Universal, well, this is mostly written in Ruby...), optimized for a screen at least 1280px wide (13.3 inch or larger on a notebook or 17 inch or larger on a desktop LCD)
Acceptable file format: text files (.txt) encoded in ASCII or UTF-8 (Ruby is not good at handling character encodings)
Acceptable languages: any single-byte character language (double-byte character languages (East Asian languages) can be analyzed, except for kwic concordancing, as long as words are separated by a single-byte space)
Target User: Mac users who don't want to start up a Windows machine, switch to Boot Camp, or run Virtual PC/Parallels/VMware for simple concordancing for preliminary analysis, preparing teaching materials, learning, etc. (CasualConc is probably not good enough as your primary research tool)

I use CasualConc on my Mac mini (1.86GHz Core 2 Duo) and have used it on a G4 (1.5GHz) machine. It works fine for me, but with a faster CPU and more memory, performance is better. With a 1-million-word corpus, it works at reasonable speed (not as fast as WordSmith Tools). With a corpus larger than that, well, you can try.

If you are interested, check out the CasualConc site. The documentation is not complete (far from it), so if you have never used a concordancer, you might find it difficult to use. But if you have, you can probably use most of the basic features.

By the way, this is freeware.