Friday, March 7, 2008

PDF

I finally found a way to extract text from PDF files with embedded text in RubyCocoa. I personally don't care about this much, but I guess it might be useful for some people. The hard part of handling PDF text is not the extraction itself. The real issues with implementing it in CasualConc are:

1. each line of text is separated by a line feed character LF (\n or \r\n?)
2. page headers/footers, etc. that are not the main text are also included
3. embedded text often includes extra spaces, garbled characters (often with ligatures), etc.

1 is probably the main issue. Currently, the basic unit of analysis in CasualConc is the paragraph, which means text separated by LF characters. So it cannot properly handle text files that end every line with an LF character, such as the Brown Corpus files. Fixing this requires some real coding (not just adding a few lines), and I can't find time to do it now. I'll try to implement this feature in the future, but I don't know when.
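As a rough illustration of what handling issue 1 might involve, here is a minimal plain-Ruby sketch that merges hard-wrapped lines back into paragraphs. It is not CasualConc's code. The method name `join_wrapped_lines` is hypothetical, and it assumes paragraphs are separated by at least one blank line, which many PDFs will not guarantee.

```ruby
# Hypothetical helper, not part of CasualConc.
# Assumption: paragraphs are separated by one or more blank lines,
# and every other line break is just a hard wrap.
def join_wrapped_lines(text)
  text
    .gsub(/\r\n?/, "\n")      # normalize CRLF and bare CR to LF
    .split(/\n[ \t]*\n+/)     # split into paragraphs at blank lines
    .map do |para|
      # rejoin the wrapped lines of one paragraph with single spaces
      para.split("\n").map(&:strip).reject(&:empty?).join(" ")
    end
    .reject(&:empty?)
end

paragraphs = join_wrapped_lines("The quick\nbrown fox.\r\n\r\nSecond\nparagraph.")
puts paragraphs
```

Words hyphenated across a line break are not repaired here, so this would only be a starting point.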

2 and 3 cannot be avoided, I guess. So I might try to add a feature to extract text from PDF files within CasualConc, but this also requires a certain amount of time.
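The artifacts in issue 3 come from the PDF itself, but some of them could probably be tidied up after extraction. The following plain-Ruby sketch shows one possible cleanup step. The names `LIGATURES` and `clean_extracted_text` are made up for illustration, and the table covers only the five common Latin f-ligatures (U+FB00 to U+FB04). Truly garbled characters would still get through.

```ruby
# Hypothetical cleanup step, not part of CasualConc.
# Maps the common Latin f-ligature code points to plain letters.
LIGATURES = {
  "\uFB00" => "ff",
  "\uFB01" => "fi",
  "\uFB02" => "fl",
  "\uFB03" => "ffi",
  "\uFB04" => "ffl"
}.freeze

def clean_extracted_text(line)
  # replace each ligature character with its letter sequence
  expanded = line.gsub(/[\uFB00-\uFB04]/, LIGATURES)
  # collapse runs of spaces and tabs, then trim both ends
  expanded.gsub(/[ \t]+/, " ").strip
end

puts clean_extracted_text("  The  \uFB01rst   \uFB02oor o\uFB03ce ")
```

An explicit table is used here rather than full Unicode normalization so that no other characters in the text are changed.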

But at least I know how to extract text from PDF files, so the feature will be included in a future version of CasualConc. If many people are interested, I might prioritize this (but it probably won't happen until summer at the earliest).
