#define private public - Claus Brod on Java

Catching up with web frameworks (07 May 2013)
TWiki, KinoSearch and Office 2007 documents (20 Jul 2009)
Java-Forum Stuttgart (06 Jul 2009)

Catching up with web frameworks (07 May 2013)

Just watched demos for Vaadin and Play to get first impressions.

In both cases, I liked what I saw, but of course I need to learn a lot more about those frameworks to see when and for which kind of projects I would choose either Vaadin or Play over Spring MVC, which I have been using professionally for a while now.

It's obvious, however, that Spring MVC and Play share a similar, web-oriented notion of the MVC pattern, while I am not sure whether MVC plays much of a role in Vaadin at all, with its focus on RIA and one-page web applications. (In Vaadin, I guess MVC would be applied in a way similar to desktop apps.)

TWiki, KinoSearch and Office 2007 documents (20 Jul 2009)

Both at work and on this site, I use TWiki as my wiki engine of choice. TWiki has managed to attract a fair share of plugin and add-on writers, resulting in wonderful tools like an add-on which integrates KinoSearch, a Perl library on top of the Lucene search engine.

This month, I installed the add-on at work. It turns out that in its current state, it does not support Office 2007 document types yet, such as .docx, .pptx and .xlsx, i.e. the so-called "Office OpenXML" formats. That's a pity, of course, since these days, most new Office documents tend to be provided in those formats.

The KinoSearch add-on doesn't try to parse (non-trivial) documents on its own, but rather relies on external helper programs which extract indexable text from documents. So the task at hand is to write such a text extractor.

Fortunately, the Apache POI project just released a version of their libraries which now also support OpenXML formats, and with those libraries, it's a piece of cake to build a simple text extractor! Here's the trivial Java driver code:

package de.clausbrod.openxmlextractor;

import java.io.File;

import org.apache.poi.POITextExtractor;
import org.apache.poi.extractor.ExtractorFactory;

public class Main {
    public static String extractOneFile(File f) throws Exception {
        POITextExtractor extractor = ExtractorFactory.createExtractor(f);
        String extracted = extractor.getText();
        return extracted;
    }

    public static void main(String[] args) throws Exception {
        if (args.length <= 0) {
            System.err.println("ERROR: No filename specified.");
            return;
        }

        for (String filename : args) {
            File f = new File(filename);
            System.out.println(extractOneFile(f));
        }
    }
}

Full Java 1.6 binaries are attached; Apache POI license details apply. Copy the ZIP archive to your TWiki server and unzip it in a directory of your choice.

With this tool in place, all we need to do is provide a stringifier plugin to the add-on. This is done by adding a file called OpenXML.pm to the lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/StringifierPlugins directory in the TWiki server installation:

# For licensing info read LICENSE file in the TWiki root.
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details, published at 
# http://www.gnu.org/copyleft/gpl.html

package TWiki::Contrib::SearchEngineKinoSearchAddOn::StringifyPlugins::OpenXML;
use base 'TWiki::Contrib::SearchEngineKinoSearchAddOn::StringifyBase';
use File::Temp qw/tmpnam/;

__PACKAGE__->register_handler(
  "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", ".xlsx");
__PACKAGE__->register_handler(
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document", ".docx");
__PACKAGE__->register_handler(
  "application/vnd.openxmlformats-officedocument.presentationml.presentation", ".pptx");

sub stringForFile {
  my ($self, $file) = @_;
  my $tmp_file = tmpnam();

  my $text;
  my $cmd = 
    "java -jar /www/twiki/local/bin/openxmlextractor/openxmlextractor.jar '$file' > $tmp_file";
  if (0 == system($cmd)) {
    $text = TWiki::Contrib::SearchEngineKinoSearchAddOn::Stringifier->stringFor($tmp_file);
  }

  unlink($tmp_file);
  return $text;  # undef signals failure to caller
}
1;

This script assumes that the openxmlextractor.jar helper is located at /www/twiki/local/bin/openxmlextractor; you'll have to tweak this path to reflect your local settings.

I haven't figured out yet how to correctly deal with encodings in the stringifier code, so non-ASCII characters might not work as expected.

To verify local installation, change into /www/twiki/kinosearch/bin (this is where my TWiki installation is, YMMV) and run the extractor on a test file:

  ./ks_test stringify foobla.docx

And in a final step, enable index generation for Office documents by adding .docx, .pptx and .xlsx to the Main.TWikiPreferences topic:

   * KinoSearch settings
      * Set KINOSEARCHINDEXEXTENSIONS = .pdf, .xml, .html, .doc, .xls, .ppt, .docx, .pptx, .xlsx

Java-Forum Stuttgart (06 Jul 2009)

At this year's Java forum in Stuttgart, I was one of 1100 geeks who divulged in Suebian Brezeln and presentations on all things Java.

After a presentation on Scala, I passed by a couple of flipcharts which were set aside for birds-of-a-feather (BoF) sessions. On a whim, I grabbed a free flipchart and scribbled one word: Clojure. In the official program, there was no presentation covering Clojure, but I thought it'd be nice to meet a few people who, like me, are interested in learning this new language and its concepts!

Since I had suggested the topic, I became the designated moderator for this session. It turned out that most attendees didn't really know all that much about Clojure or Lisp - and so I gravitated, a bit unwillingly at first, into presentation mode. Boy, was I glad that right before the session, I had refreshed the little Clojure-fu I have by reading an article or two.

In fact, some of the folks who showed up had assumed the session was on closures (the programming concept) rather than Clojure, the language big grin But the remaining few of us still had a spirited discussion, covering topics such as dynamic versus static typing, various Clojure language elements, Clojure's Lisp heritage, programmimg for concurrency, web frameworks, Ruby on Rails, and OO databases.

To those who stopped by, thanks a lot for this discussion and for your interest. And to the developer from Bremen whose name I forgot (sorry): As we suspected, there is indeed an alternative syntax for creating Java objects in Clojure.

  (.show (new javax.swing.JFrame)) ;; probably more readable for Java programmers

  (.show (javax.swing.JFrame.)) ;; Clojure shorthand

to top

End of topic
Skip to action links | Back to top

You are here: Blog > WebLeftBar > BlogOnSoftwareJava

r1.2 - 07 May 2013 - 13:25 - ClausBrod to top

Blog