Friday, September 19, 2008

TextExtractor

I haven't had time to write any scripts at all for a while, but I've found some time to experiment with RubyCocoa. The stuff I'm working on won't change anything on the surface of CasualConc. The changes will be mostly internal and will come slowly.

The few feature requests I've received are for XML file handling and parallel concordancing. I don't have any experience with either, so I need more information. I started two threads on the Google Groups discussion board about them. If anyone reads this and can give me information on these two, I might add them to CasualConc or write a separate program (a parallel concordancer).

Another thing I'm considering for CasualConc is dropping East Asian language support. I haven't heard from anybody who uses CasualConc for Japanese/Korean/Chinese, so I don't know if I really want to keep trying to accommodate these functions in CasualConc. It would probably be easier for me to maintain if I split off the East Asian language concordancer (esp. KWIC) as a separate program. I'll think about this more, but if you have any suggestions, let me know.

Anyway, as part of my experiment with RubyCocoa, I updated TextExtractor, a utility program that extracts text data from various files containing text and converts the text encoding of plain text files to UTF-8. I'm not sure if you've looked at the utility program section of the CasualConc site, but I have a few utility programs there that deal with text files. I combined two of them (the PDF to Text and HTML to Text converters) and added a few extra functions. The first version (0.1) of TextExtractor included jparser (a Japanese parsing program using MeCab), but it didn't run without MeCab and MeCab-Ruby installed, so I dropped this function.

Instead, I changed that part so it simply converts non-UTF-8 text files (.txt) to UTF-8 text files, and MS Word, PDF, HTML, and OpenOffice documents to UTF-8 text files or Rich Text Format (RTF) files. All the other parts (PDF to Text, Web file to Text, and batch processing) can also save files as RTF. When you convert files to RTF, you can either keep the formatting information of the original files (fonts, font styles, etc.) or throw it away and save plain text in an RTF file.
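
For readers curious what the "non-UTF-8 to UTF-8" step involves, here is a minimal sketch in Python. It is not TextExtractor's actual code (which is written with RubyCocoa); the function names and the list of candidate encodings are assumptions made for illustration. It simply tries each candidate encoding in order and writes the result back out as UTF-8. This is only a heuristic, since some byte sequences are valid in more than one encoding, so the order of the candidates matters.

```python
from pathlib import Path

# Hypothetical candidate list; a real tool would let the user choose.
DEFAULT_ENCODINGS = ("utf-8", "shift_jis", "euc_jp")


def decode_with_fallback(raw: bytes, encodings=DEFAULT_ENCODINGS) -> str:
    """Return the text decoded with the first candidate encoding that works."""
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate did not fit, try the next one
    raise ValueError("none of the candidate encodings matched")


def convert_file_to_utf8(src: Path, dst: Path, encodings=DEFAULT_ENCODINGS) -> None:
    """Read src in an unknown encoding and save it to dst as UTF-8."""
    text = decode_with_fallback(src.read_bytes(), encodings)
    dst.write_text(text, encoding="utf-8")
```

For example, `convert_file_to_utf8(Path("old.txt"), Path("new.txt"))` would read a Shift_JIS file named old.txt (a made-up file name) and save a UTF-8 copy as new.txt.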

I also added basic instructions in English (not translated into Japanese yet). So if you are interested, please try it and let me know what you think.

EDIT: This program has been renamed CasualTextractor.
