Apache Tika based binary file indexer
Last updated: Tuesday 18 May 2010 16:03
Version | Compatible with
1.3 | 4.x
A wrapper script for the standalone Apache Tika toolkit that enables indexing of a wide variety of binary file types, such as MS Word, Excel and other MS Office formats, PDF, and ODF.
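To illustrate what such a wrapper does, here is a minimal sketch in Python. It is not the script shipped with eztika. It assumes the standalone jar accepts the `--text` and `--encoding=X` options of the Tika command-line application; the jar path and the function names are hypothetical.

```python
import subprocess
from typing import List, Optional

# Hypothetical locations; adapt them to your environment.
JAVA_BIN = "java"
TIKA_JAR = "/path/to/tika.jar"


def build_tika_command(path: str,
                       encoding: Optional[str] = "UTF-8",
                       java_bin: str = JAVA_BIN,
                       tika_jar: str = TIKA_JAR) -> List[str]:
    """Build the argument list for a plain-text extraction run."""
    cmd = [java_bin, "-jar", tika_jar, "--text"]
    if encoding:
        # Assumed Tika CLI option for the output encoding.
        cmd.append("--encoding=" + encoding)
    cmd.append(path)
    return cmd


def extract_text(path: str, encoding: str = "UTF-8") -> str:
    """Run Tika on one file and return the extracted text.

    Raises subprocess.CalledProcessError if the Java process fails.
    """
    result = subprocess.run(
        build_tika_command(path, encoding),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    # Decode with the same encoding that was requested from Tika.
    return result.stdout.decode(encoding, errors="replace")
```

Passing the command as a list rather than a shell string avoids quoting problems with file names that contain spaces.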
Requirements
- Sun Java VM (JRE 1.5 or higher)
- eZ Publish 4.x
Supported binary file formats
[application/pdf]
[application/msword]
[application/vnd.ms-excel]
[application/vnd.ms-powerpoint]
[application/vnd.visio]
[application/vnd.ms-outlook]
[application/xml]
[application/rtf]
[application/vnd.oasis.opendocument.text]
[application/vnd.oasis.opendocument.presentation]
[application/vnd.oasis.opendocument.spreadsheet]
[application/vnd.oasis.opendocument.formula]
[application/vnd.openxmlformats-officedocument.wordprocessingml.document]
[application/vnd.openxmlformats-officedocument.spreadsheetml.sheet]
[application/vnd.openxmlformats-officedocument.presentationml.presentation]
[application/octet-stream] (anything Tika recognizes)
[application/zip]
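The list above can be treated as a lookup table when deciding whether a file should be passed to the indexer. The following Python sketch shows one way to do that. The names `SUPPORTED_MIME_TYPES` and `is_indexable` are hypothetical and not part of eztika, which configures these types through eZ Publish settings.

```python
# MIME types copied from the list of supported formats.
SUPPORTED_MIME_TYPES = frozenset([
    "application/pdf",
    "application/msword",
    "application/vnd.ms-excel",
    "application/vnd.ms-powerpoint",
    "application/vnd.visio",
    "application/vnd.ms-outlook",
    "application/xml",
    "application/rtf",
    "application/vnd.oasis.opendocument.text",
    "application/vnd.oasis.opendocument.presentation",
    "application/vnd.oasis.opendocument.spreadsheet",
    "application/vnd.oasis.opendocument.formula",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "application/octet-stream",
    "application/zip",
])


def is_indexable(mime_type: str) -> bool:
    """Return True if the MIME type appears in the supported list.

    Parameters such as "; charset=utf-8" are ignored and the
    comparison is case-insensitive.
    """
    normalized = mime_type.split(";", 1)[0].strip().lower()
    return normalized in SUPPORTED_MIME_TYPES
```

Because `application/octet-stream` is in the list, files with an unknown type are still handed to Tika, which then attempts its own detection.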
Installation
See the included INSTALL.txt, and make sure you adapt the various paths to your environment.
Known issues
- Keynote presentation files are not yet supported; this will be addressed in the next iteration of eztika, as Tika itself is gaining support for them
Changelog
Changelog eztika 1.2 to 1.3
- updated tika.jar to 0.8-snapshot (rev 933934); CJK PDFs are now supported correctly
- updated meta-info
Changelog eztika 1.1 to 1.2
- updated tika.jar to 0.6-dev (rev 897576), which has better support for MS Excel, a reduced footprint and output encoding options
- OOo and MS Office formats with Asian content are now correctly converted to UTF-8
- added encoding option --encoding=utf8 to eztika wrapper script
- updated ezinfo.php
- changed structure for building with http://projects.ez.no/ezextensionbuilder
Changelog eztika 1.0 to 1.1
- (added 2010-01-17) let ezpdftotext specify the no-pagebreaks option, since page breaks can potentially break the UTF payload sent to eZ Find
- updated tika.jar to 0.5-dev (rev 814142), including more support for office xml formats and various bugfixes
- added office xml format mimetypes to binaryfile.ini
- updated ezinfo.php
eZ Tika 1.3 released (binary file indexing)
Tuesday 18 May 2010 14:46
Paul Borgermans