TWiki, KinoSearch and Office 2007 documents (20 Jul 2009)

Both at work and on this site, I use TWiki as my wiki engine of choice. TWiki has managed to attract a fair share of plugin and add-on writers, resulting in wonderful tools like an add-on which integrates KinoSearch, a Perl library on top of the Lucene search engine.

This month, I installed the add-on at work. It turns out that in its current state, it does not support Office 2007 document types yet, such as .docx, .pptx and .xlsx, i.e. the so-called "Office OpenXML" formats. That's a pity, of course, since these days, most new Office documents tend to be provided in those formats.

The KinoSearch add-on doesn't try to parse (non-trivial) documents on its own, but rather relies on external helper programs which extract indexable text from documents. So the task at hand is to write such a text extractor.

Fortunately, the Apache POI project just released a version of their libraries which now also support OpenXML formats, and with those libraries, it's a piece of cake to build a simple text extractor! Here's the trivial Java driver code:

package de.clausbrod.openxmlextractor;

import java.io.File;

import org.apache.poi.POITextExtractor;
import org.apache.poi.extractor.ExtractorFactory;

public class Main {
    public static String extractOneFile(File f) throws Exception {
        POITextExtractor extractor = ExtractorFactory.createExtractor(f);
        String extracted = extractor.getText();
        return extracted;
    }

    public static void main(String[] args) throws Exception {
        if (args.length <= 0) {
            System.err.println("ERROR: No filename specified.");
            return;
        }

        for (String filename : args) {
            File f = new File(filename);
            System.out.println(extractOneFile(f));
        }
    }
}

Full Java 1.6 binaries are attached; Apache POI license details apply. Copy the ZIP archive to your TWiki server and unzip it in a directory of your choice.

With this tool in place, all we need to do is provide a stringifier plugin to the add-on. This is done by adding a file called OpenXML.pm to the lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/StringifierPlugins directory in the TWiki server installation:

# For licensing info read LICENSE file in the TWiki root.
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details, published at 
# http://www.gnu.org/copyleft/gpl.html

package TWiki::Contrib::SearchEngineKinoSearchAddOn::StringifyPlugins::OpenXML;
use base 'TWiki::Contrib::SearchEngineKinoSearchAddOn::StringifyBase';
use File::Temp qw/tmpnam/;

__PACKAGE__->register_handler(
  "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", ".xlsx");
__PACKAGE__->register_handler(
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document", ".docx");
__PACKAGE__->register_handler(
  "application/vnd.openxmlformats-officedocument.presentationml.presentation", ".pptx");

sub stringForFile {
  my ($self, $file) = @_;
  my $tmp_file = tmpnam();

  my $text;
  my $cmd = 
    "java -jar /www/twiki/local/bin/openxmlextractor/openxmlextractor.jar '$file' > $tmp_file";
  if (0 == system($cmd)) {
    $text = TWiki::Contrib::SearchEngineKinoSearchAddOn::Stringifier->stringFor($tmp_file);
  }

  unlink($tmp_file);
  return $text;  # undef signals failure to caller
}
1;

This script assumes that the openxmlextractor.jar helper is located at /www/twiki/local/bin/openxmlextractor; you'll have to tweak this path to reflect your local settings.

I haven't figured out yet how to correctly deal with encodings in the stringifier code, so non-ASCII characters might not work as expected.

To verify local installation, change into /www/twiki/kinosearch/bin (this is where my TWiki installation is, YMMV) and run the extractor on a test file:

  ./ks_test stringify foobla.docx

And in a final step, enable index generation for Office documents by adding .docx, .pptx and .xlsx to the Main.TWikiPreferences topic:

   * KinoSearch settings
      * Set KINOSEARCHINDEXEXTENSIONS = .pdf, .xml, .html, .doc, .xls, .ppt, .docx, .pptx, .xlsx


Java-Forum Stuttgart (06 Jul 2009)

At this year's Java forum in Stuttgart, I was one of 1100 geeks who divulged in Suebian Brezeln and presentations on all things Java.

After a presentation on Scala, I passed by a couple of flipcharts which were set aside for birds-of-a-feather (BoF) sessions. On a whim, I grabbed a free flipchart and scribbled one word: Clojure. In the official program, there was no presentation covering Clojure, but I thought it'd be nice to meet a few people who, like me, are interested in learning this new language and its concepts!

Since I had suggested the topic, I became the designated moderator for this session. It turned out that most attendees didn't really know all that much about Clojure or Lisp - and so I gravitated, a bit unwillingly at first, into presentation mode. Boy, was I glad that right before the session, I had refreshed the little Clojure-fu I have by reading an article or two.

In fact, some of the folks who showed up had assumed the session was on closures (the programming concept) rather than Clojure, the language big grin But the remaining few of us still had a spirited discussion, covering topics such as dynamic versus static typing, various Clojure language elements, Clojure's Lisp heritage, programmimg for concurrency, web frameworks, Ruby on Rails, and OO databases.

To those who stopped by, thanks a lot for this discussion and for your interest. And to the developer from Bremen whose name I forgot (sorry): As we suspected, there is indeed an alternative syntax for creating Java objects in Clojure.

  (.show (new javax.swing.JFrame)) ;; probably more readable for Java programmers

  (.show (javax.swing.JFrame.)) ;; Clojure shorthand



to top


You are here: Blog > WebLeftBar > BlogOnSoftwareJava

r1.1 - 20 Jul 2009 - 10:15 - ClausBrod to top

Blog
This site
RSS

  2017: 12 - 11 - 10
  2016: 10 - 7 - 3
  2015: 11 - 10 - 9 - 4 - 1
  2014: 5
  2013: 9 - 8 - 7 - 6 - 5
  2012: 2 - 10
  2011: 1 - 8 - 9 - 10 - 12
  2010: 11 - 10 - 9 - 4
  2009: 11 - 9 - 8 - 7 -
     6 - 5 - 4 - 3
  2008: 5 - 4 - 3 - 1
  2007: 12 - 8 - 7 - 6 -
     5 - 4 - 3 - 1
  2006: 4 - 3 - 2 - 1
  2005: 12 - 6 - 5 - 4
  2004: 12 - 11 - 10
  C++
  CoCreate Modeling
  COM & .NET
  Java
  Mac
  Lisp
  OpenSource
  Scripting
  Windows
  Stuff
Changes
Index
Search
Maintenance
Impressum
Datenschutzerklärung
Home



Jump:

Copyright © 1999-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback