TWiki, KinoSearch and Office 2007 documents (20 Jul 2009)

Both at work and on this site, I use TWiki as my wiki engine of choice. TWiki has managed to attract a fair share of plugin and add-on writers, resulting in wonderful tools like an add-on which integrates KinoSearch, a Perl library on top of the Lucene search engine.

This month, I installed the add-on at work. It turns out that in its current state, it does not support Office 2007 document types yet, such as .docx, .pptx and .xlsx, i.e. the so-called "Office OpenXML" formats. That's a pity, of course, since these days, most new Office documents tend to be provided in those formats.

The KinoSearch add-on doesn't try to parse (non-trivial) documents on its own, but rather relies on external helper programs which extract indexable text from documents. So the task at hand is to write such a text extractor.

Fortunately, the Apache POI project just released a version of their libraries which now also support OpenXML formats, and with those libraries, it's a piece of cake to build a simple text extractor! Here's the trivial Java driver code:

package de.clausbrod.openxmlextractor;

import java.io.File;

import org.apache.poi.POITextExtractor;
import org.apache.poi.extractor.ExtractorFactory;

public class Main {
    public static String extractOneFile(File f) throws Exception {
        POITextExtractor extractor = ExtractorFactory.createExtractor(f);
        String extracted = extractor.getText();
        return extracted;
    }

    public static void main(String[] args) throws Exception {
        if (args.length <= 0) {
            System.err.println("ERROR: No filename specified.");
            return;
        }

        for (String filename : args) {
            File f = new File(filename);
            System.out.println(extractOneFile(f));
        }
    }
}

Full Java 1.6 binaries are attached; Apache POI license details apply. Copy the ZIP archive to your TWiki server and unzip it in a directory of your choice.

With this tool in place, all we need to do is provide a stringifier plugin to the add-on. This is done by adding a file called OpenXML.pm to the lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/StringifierPlugins directory in the TWiki server installation:

# For licensing info read LICENSE file in the TWiki root.
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details, published at 
# http://www.gnu.org/copyleft/gpl.html

package TWiki::Contrib::SearchEngineKinoSearchAddOn::StringifyPlugins::OpenXML;
use base 'TWiki::Contrib::SearchEngineKinoSearchAddOn::StringifyBase';
use File::Temp qw/tmpnam/;

__PACKAGE__->register_handler(
  "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", ".xlsx");
__PACKAGE__->register_handler(
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document", ".docx");
__PACKAGE__->register_handler(
  "application/vnd.openxmlformats-officedocument.presentationml.presentation", ".pptx");

sub stringForFile {
  my ($self, $file) = @_;
  my $tmp_file = tmpnam();

  my $text;
  my $cmd = 
    "java -jar /www/twiki/local/bin/openxmlextractor/openxmlextractor.jar '$file' > $tmp_file";
  if (0 == system($cmd)) {
    $text = TWiki::Contrib::SearchEngineKinoSearchAddOn::Stringifier->stringFor($tmp_file);
  }

  unlink($tmp_file);
  return $text;  # undef signals failure to caller
}
1;

This script assumes that the openxmlextractor.jar helper is located at /www/twiki/local/bin/openxmlextractor; you'll have to tweak this path to reflect your local settings.

I haven't figured out yet how to correctly deal with encodings in the stringifier code, so non-ASCII characters might not work as expected.

To verify local installation, change into /www/twiki/kinosearch/bin (this is where my TWiki installation is, YMMV) and run the extractor on a test file:

  ./ks_test stringify foobla.docx

And in a final step, enable index generation for Office documents by adding .docx, .pptx and .xlsx to the Main.TWikiPreferences topic:

   * KinoSearch settings
      * Set KINOSEARCHINDEXEXTENSIONS = .pdf, .xml, .html, .doc, .xls, .ppt, .docx, .pptx, .xlsx



When asked for a TWiki account, use your own or the default TWikiGuest account.



to top

You are here: Blog > DefinePrivatePublic20090720TWikiKinoSearch

r1.2 - 20 Jul 2009 - 10:13 - ClausBrod to top

Blog
RSS
  2009: 11 - 9 - 8 - 7 -
     6 - 5 - 4 - 3
  2008: 5 - 4 - 3 - 1
  2007: 12 - 8 - 7 - 6 -
     5 - 4 - 3 - 1
  2006: 4 - 3 - 2 - 1
  2005: 12 - 6 - 5 - 4
  2004: 12 - 11 - 10
  C++
  COM & .NET
  Lisp
  Scripting
  Windows
  CoCreate Modeling
  OpenSource
  Java
  Stuff
Changes
Index
Search
Maintenance
Impressum
Home



  • My links
  • Show me topics of interest

TWiki

Welcome


TWiki Web TWiki Web Home Changes Topics Index Search


TWiki Webs Atari Blog Main OneSpaceModeling? Sandbox TWiki TWiki Webs Atari Blog Main OneSpaceModeling? Sandbox TWiki

Jump:

Copyright © 1999-2010 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback