Sinking in a sea of documents

Lately, because of one of my projects, I have had to work my way through a pile of long, semi-structured documents: requests for proposals, government program reports, threat models and the like. They are written in what I would call techno-legalese: highly structured, with section numbering and three, four or five levels of nesting. They are all in English, and they read as if the writers were being paid by the word, if you know what I mean.

I need a more efficient way to locate the paragraphs that contain the nuggets that matter to me. What I'd like is a kind of local document index or repository that lets me keep some standing queries and easily locate the sections of each document that discuss them. Here's an example:
  • I'd like to load 10 large PDF files of, say, 100 pages each. Each PDF contains English text, formatted very nicely into paragraphs and sections.
  • I'd like to specify that I am interested in "blogging platforms", "weaknesses in Ruby", and "localization and internationalization".
  • Ideally, I'd then get a list showing each matching section of text, the name of the document it came from, and any other information related to the words and phrases I specified.
I am sure something like this exists. I would call it something like document indexing, document comprehension or structured searching.
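
To make the idea concrete, here is a minimal sketch of the kind of tool I am picturing, in Python, using only the standard library. It makes several assumptions: the PDFs have already been converted to plain text by some other tool, a blank line marks a section boundary, and relevance is a simple TF-IDF-style score with a bonus for exact phrase matches. The function names (`split_sections`, `search`) and the sample documents are made up for illustration; this is not an existing product.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_sections(doc_name, text):
    """Split plain text into sections on blank lines (an assumption)."""
    parts = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [(doc_name, idx, body) for idx, body in enumerate(parts)]


def search(sections, queries, top_n=3):
    """Return the best-matching sections for each standing query."""
    tokenized = [tokenize(body) for _, _, body in sections]
    total = len(sections)

    # Document frequency: in how many sections does each word appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    results = {}
    for query in queries:
        terms = tokenize(query)
        scored = []
        for (doc, idx, body), tokens in zip(sections, tokenized):
            if not tokens:
                continue
            counts = Counter(tokens)
            # TF-IDF-style score: frequent in this section, rare overall.
            score = sum(
                (counts[t] / len(tokens)) * math.log(1 + total / doc_freq[t])
                for t in terms
                if t in counts
            )
            # Double the score when the exact phrase occurs in the section.
            if query.lower() in body.lower():
                score *= 2
            if score > 0:
                scored.append((score, doc, idx, body))
        scored.sort(key=lambda hit: hit[0], reverse=True)
        results[query] = scored[:top_n]
    return results


# Hypothetical sample documents, standing in for text extracted from PDFs.
documents = {
    "rfp.txt": (
        "1.1 Scope\nThe vendor shall evaluate blogging platforms for the agency."
        "\n\n1.2 Budget\nFunding is limited to one fiscal year."
    ),
    "threat.txt": (
        "2.1 Language risks\nKnown weaknesses in Ruby deployments include unsafe "
        "deserialization.\n\n2.2 Localization\nLocalization and "
        "internationalization are out of scope."
    ),
}

sections = []
for name, text in documents.items():
    sections.extend(split_sections(name, text))

standing_queries = [
    "blogging platforms",
    "weaknesses in Ruby",
    "localization and internationalization",
]
results = search(sections, standing_queries)

for query, hits in results.items():
    print(query)
    for score, doc, idx, body in hits:
        print(f"  {doc}, section {idx} (score {score:.3f}): {body.splitlines()[0]}")
```

Each hit carries the score, the document name, the section's position in that document, and the section text, which is roughly the list I described above. A real tool would need proper PDF text extraction and a section splitter that understands the numbered headings, neither of which this sketch attempts.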

Any suggested leads or ideas?


Posted on May 6, 2010 and filed under Technology.