Apache Tika based binary file indexer

UNIX name: eztika
Owner: Paul Borgermans
Status: stable
Version: 1.3
Compatible with: 4.x
A wrapper script for the standalone Tika toolkit that allows indexing of a large variety of binary file types, such as MS Word, MS Office, PDF, Excel and ODF.
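
To illustrate what wrapping the standalone Tika toolkit can look like, here is a minimal Python sketch that builds and runs a plain-text extraction command. It is not the eztika wrapper itself: the function names, the jar location and the `java` executable name are assumptions made for illustration. The `--text` and `--encoding=` options are those of the Tika command-line application; check them against the tika.jar version you actually ship.

```python
import subprocess


def build_tika_command(jar_path, document_path, encoding="UTF-8", java_bin="java"):
    """Return the argument list for a plain-text extraction run.

    jar_path and java_bin are assumptions: adapt them to your environment.
    """
    return [
        java_bin,
        "-jar", str(jar_path),
        "--text",                  # ask Tika for plain text output
        f"--encoding={encoding}",  # output encoding of the extracted text
        str(document_path),
    ]


def extract_text(jar_path, document_path, encoding="UTF-8"):
    """Run Tika on one file and return the extracted text as a string."""
    command = build_tika_command(jar_path, document_path, encoding)
    # check=True raises CalledProcessError if the Java process fails.
    result = subprocess.run(command, capture_output=True, check=True)
    return result.stdout.decode(encoding)
```

Calling `extract_text("/path/to/tika.jar", "report.pdf")` would return the text Tika extracts from the file, provided a Java runtime is on the PATH.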

Requirements

  • Sun Java VM (JRE 1.5 or higher)
  • eZ Publish 4.x

Supported binary file formats

[application/pdf]
[application/msword]
[application/vnd.ms-excel]
[application/vnd.ms-powerpoint]
[application/vnd.visio]
[application/vnd.ms-outlook]
[application/xml]
[application/rtf]
[application/vnd.oasis.opendocument.text]
[application/vnd.oasis.opendocument.presentation]
[application/vnd.oasis.opendocument.spreadsheet]
[application/vnd.oasis.opendocument.formula]
[application/vnd.openxmlformats-officedocument.wordprocessingml.document]
[application/vnd.openxmlformats-officedocument.spreadsheetml.sheet]
[application/vnd.openxmlformats-officedocument.presentationml.presentation]
[application/octet-stream] (anything Tika recognizes)
[application/zip]
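
As a rough illustration of how a caller might check a file against the list above, the following Python sketch guesses a MIME type from the file name and falls back to application/octet-stream when the type is unknown. The function names and the fallback rule are assumptions for this example and do not describe how eztika or eZ Publish select a handler.

```python
import mimetypes

# MIME types copied from the list above.
SUPPORTED_MIME_TYPES = {
    "application/pdf",
    "application/msword",
    "application/vnd.ms-excel",
    "application/vnd.ms-powerpoint",
    "application/vnd.visio",
    "application/vnd.ms-outlook",
    "application/xml",
    "application/rtf",
    "application/vnd.oasis.opendocument.text",
    "application/vnd.oasis.opendocument.presentation",
    "application/vnd.oasis.opendocument.spreadsheet",
    "application/vnd.oasis.opendocument.formula",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "application/octet-stream",
    "application/zip",
}


def resolve_mime(filename):
    """Guess the MIME type from the file name; unknown types become octet-stream."""
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or "application/octet-stream"


def is_supported(filename):
    """Return True if the resolved MIME type appears in the list above."""
    return resolve_mime(filename) in SUPPORTED_MIME_TYPES
```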

Installation

See the included INSTALL.txt and make sure you adapt the various paths to your environment.

Known issues

  • Keynote presentation files are still not supported; this will be addressed in the next iteration of eztika, as Tika itself is gaining support for them

Changelog

Changelog eztika 1.2 to 1.3

  • updated tika.jar to 0.8-snapshot (rev 933934); CJK PDFs are now supported correctly
  • updated meta-info

Changelog eztika 1.1 to 1.2

  • updated tika.jar to 0.6-dev (rev 897576), which has better support for MS Excel, a reduced footprint and output encoding options
  • OOo and MSxx formats with Asian content are now correctly converted to UTF-8
  • added encoding option --encoding=utf8 to eztika wrapper script
  • updated ezinfo.php
  • changed structure for building with http://projects.ez.no/ezextensionbuilder

Changelog eztika 1.0 to 1.1

  • (added 2010-01-17) let ezpdftotext specify the no page breaks option, since page breaks can potentially break the UTF payload sent to eZ Find
  • updated tika.jar to 0.5-dev (rev 814142), including more support for office xml formats and various bugfixes
  • added office xml format mimetypes to binaryfile.ini
  • updated ezinfo.php
