GNU Libextractor
GNU Libextractor is a library used to extract meta data from files. The goal is to provide developers of file-sharing networks, browsers or WWW-indexing bots with a universal library to obtain simple keywords and meta data to match against queries and to show to users instead of relying only on filenames. libextractor also ships the shell command extract which, similar to the well-known file command, can extract meta data from a file and print the results to stdout.
Currently, libextractor supports the following formats: HTML, MAN, PS, DVI, OLE2 (DOC, XLS, PPT), OpenOffice (sxw), StarOffice (sdw), FLAC, MP3 (ID3v1 and ID3v2), OGG, WAV, S3M (Scream Tracker 3), XM (eXtended Module), IT (Impulse Tracker), NSF(E) (NES music), SID (C64 music), EXIV2, JPEG, GIF, PNG, TIFF, DEB, RPM, TAR(.GZ), LZH, LHA, RAR, ZIP, CAB, 7-ZIP, AR, MTREE, PAX, CPIO, ISO9660, SHAR, RAW, XAR, FLV, REAL, RIFF (AVI), MPEG, QT and ASF. Also, various additional MIME types are detected.
GNU libextractor uses helper libraries (plugins) to perform the actual extraction. As a result, GNU libextractor can be extended simply by installing additional plugins. Writing robust parsers can be difficult, so GNU libextractor protects the main application from hanging or crashing plugins by executing all plugins out-of-process.
GNU libextractor is a GNU package. Our official GNU website can be found at http://www.gnu.org/software/libextractor/.
Downloading Libextractor
- Source Code
- Libextractor can be found on the main GNU ftp server: http://ftp.gnu.org/gnu/libextractor/ (via HTTP) and ftp://ftp.gnu.org/gnu/libextractor/ (via FTP). It can also be found on the GNU mirrors; please use a mirror if possible.
- Debian .deb package
- The Debian package can be downloaded from the official Debian archive. The extract package can be found under Utilities and the library under Libraries. The respective packages are extract and libextractor, plus libextractor-dev for development. Backports for Debian Stable are also available.
- Tar Package
-
The latest version can be found on GNU mirrors.
If the mirror does not work, you should be able to find it on the main FTP server at
ftp://ftp.gnu.org/gnu/libextractor/.
Latest release is libextractor-1.1.tar.gz.
Latest Java-binding is libextractor-java-1.0.0.tar.gz.
Latest Mono-binding is libextractor-mono-0.5.23.tar.gz.
Latest Python-binding is libextractor-python-0.5.tar.gz.
- RPM Package
- RPMs for SuSE 9.3 can be found here (i386, x86_64, SRPM)
- Windows
- Latest Windows binary is libextractor-0.5.23-w32.zip.
Documentation
Documentation for Libextractor is available online, as is documentation for most GNU software. You may also find more information about Libextractor by running info libextractor or man libextractor, or man extract, or by looking at /usr/share/doc/libextractor/, /usr/local/doc/libextractor/, or similar directories on your system. A brief summary is available by running extract --help. You might also be interested in an API compatibility report comparing the various Libextractor versions.
Articles related to libextractor:
- Reading File Metadata with extract and libextractor
- How to recover lost files after you accidentally wipe your hard drive
- All your Metadata are belong to Us
Mailing lists
Libextractor has the following mailing lists:
- bug-libextractor is used to discuss most aspects of Libextractor, including development and enhancement requests, as well as bug reports.
- help-libextractor is for general user help and discussion.
Announcements about Libextractor and most other GNU software are made on info-gnu (archive). If you only want to get notifications about Libextractor, we suggest you subscribe to the project at freshmeat.
Security reports that should not be made immediately public can be sent directly to the maintainer. If there is no response to an urgent issue, you can escalate to the general security mailing list for advice.
Getting involved
Development of Libextractor, and GNU in general, is a volunteer effort, and you can contribute. For information, please read How to help GNU. If you'd like to get involved, it's a good idea to join the discussion mailing list (see above).
- Development
- Known bugs and open feature requests are tracked in our bugtracker.
- Subversion access
-
You can access the current development version of libextractor using
$ svn checkout https://gnunet.org/svn/Extractor
A Java binding for libextractor is in
$ svn checkout https://gnunet.org/svn/Extractor-java
A Mono binding for libextractor is in
$ svn checkout https://gnunet.org/svn/Extractor-mono
A Python binding can be found under
$ svn checkout https://gnunet.org/svn/Extractor-python
A source package is here. This binding has been packaged as a python egg, available here. A second Python binding that includes a binding for doodle can be found here.
A Perl binding is in CPAN. The latest version of the Perl binding is available using
$ git clone git://git.perldition.org/File-Extractor.git/
A Ruby binding has been published here (mirror). Another Ruby binding has been published here (mirror).
An initial draft of a PHP binding can be found under
$ svn checkout https://gnunet.org/svn/Extractor-php
- Translating Libextractor
- To translate Libextractor's messages into other languages, please see the Translation Project page for Libextractor. If you have a new translation of the message strings, or updates to the existing strings, please have the changes made in this repository. Only translations from this site will be incorporated into Libextractor. For more information, see the Translation Project.
- Maintainer
- Libextractor is currently being maintained by Christian Grothoff.
Quick Introduction
- Installation
-
The simplest way to install GNU libextractor is to use one of the binary
packages which are available online for many distributions. Note that
under Debian, the extract tool is in a separate
package extract and headers required to compile other
applications against libextractor are in libextractor-dev.
Thus, under Debian, you should use:
# apt-get install libextractor-dev extract
Compiling by hand follows the usual sequence:
$ tar xzvf libextractor.x.y.z.tar.gz
$ cd libextractor.x.y.z
$ ./configure
$ make
# make install
Note that you need various dependencies (read README for an up-to-date list) in order to compile all of the plugins.
- Using the extract tool
-
After installing GNU libextractor, the extract tool can be used to obtain
meta data from documents. By default, the extract tool uses the
canonical set of plugins, which consists of all format-specific
plugins supported by the current version of libextractor together with
the mime-type detection plugin. If you are a user
of BibTeX, the option -b is likely to come in handy: it automatically
creates BibTeX entries from documents that have been properly equipped
with meta data.
Further options are described in the extract manpage (man 1 extract).
- Example Output
-
$ extract libextractor-0.1.3-1.src.rpm
Keywords for file libextractor-0.1.3-1.src.rpm:
os - linux
resource-identifier - http://ovmj.org/libextractor/
group - System Environment/Libraries
license - LGPL
copyright - LGPL
size - 251545
build-host - wedge.cs.purdue.edu
creation date - Wed Dec 25 07:50:07 2002
description - libextractor is a simple library...
summary - keyword extraction library
release - 1
version - 0.1.3
title - libextractor
unknown - SOURCE RPM 3.0
mimetype - application/x-rpm
$ extract extractor_logo.png
Keywords for file extractor_logo.png:
image dimensions - 272x188
thumbnail - (binary, 5932 bytes)
image dimensions - 272x188
thumbnail - (binary, 6427 bytes)
image dimensions - 272x188
thumbnail - (binary, 6427 bytes)
mimetype - image/png
mimetype - image/png
image dimensions - 272x188
keywords - The libextractor logo
- Using the GNU libextractor library in your programs
-
The following listing shows the code of a minimalistic program that
uses GNU libextractor. Compiling the fragment requires passing the
option -lextractor to gcc. For details and additional
functions for loading plugins and manipulating the keyword list, see
the libextractor manpage (man 3 libextractor).
Java programmers should note that a Java class that uses JNI to
communicate with libextractor is also available. Python programmers
will find that libextractor (since 0.5.0) can also be used from
Python; just import Extractor.
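#include <stdio.h>
#include <extractor.h>

int
main (int argc, char *argv[])
{
  struct EXTRACTOR_PluginList *plugins;

  /* Require the name of the file to inspect. */
  if (argc < 2)
    return 1;
  /* Load the default set of plugins. */
  plugins = EXTRACTOR_plugin_add_defaults (EXTRACTOR_OPTION_DEFAULT_POLICY);
  /* Run all plugins on the file and print each meta data item to stdout. */
  EXTRACTOR_extract (plugins, argv[1], NULL, 0,
                     &EXTRACTOR_meta_data_print, stdout);
  /* Unload the plugins again. */
  EXTRACTOR_plugin_remove_all (plugins);
  return 0;
}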
- Writing new Plugins for GNU libextractor
-
The hardest part of writing a new plugin for GNU
libextractor is writing the actual parser for a specific
format. Nevertheless, the basic pattern is always the same. The
plugin library must be called libextractor_XXX.so where XXX
denotes the file format supported by the plugin and must be placed in
the plugin directory (typically $PREFIX/lib/libextractor/).
The library must export a method EXTRACTOR_XXX_extract_method
with the following signature:
void EXTRACTOR_XXX_extract_method (struct EXTRACTOR_ExtractContext *ec);
ec provides a callback to invoke with meta data as well as functions for reading data from the file that is being processed. Most plugins start by reading the first bytes of the file and checking that the header matches the specific format. The extract function is expected to call ec->proc with each meta data item found. ec->cls must be passed as the first argument to proc and to the other functions invoked through ec. Finally, ec->config is an arbitrary string of options that the plugin is free to interpret. Most plugins ignore config.
If the meta data extracted is a string, it is supposed to be converted into the UTF-8 character set by the plugin. However, in cases where the character encoding used in the document is unknown, no conversion should be done. Binary meta data can also be extracted. Plugins indicate the format of the meta data using the format argument to proc. Supported formats are UTF-8 strings, C strings (for strings of unknown encoding) and binary data. In addition to this rough categorization, the plugin is also supposed to indicate the mime type of the meta data. For strings, that mime type is most often text/plain. Finally, the plugin must specify the meta data type. Common meta data types are "author", "title" and "mime-type". The full signature of the "proc" callback is:
typedef int (*EXTRACTOR_MetaDataProcessor)(void *cls,
                                           const char *plugin_name,
                                           enum EXTRACTOR_MetaType type,
                                           enum EXTRACTOR_MetaFormat format,
                                           const char *data_mime_type,
                                           const char *data,
                                           size_t data_len);
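The plugin pattern described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from libextractor: so that it compiles without the library installed, every sketch_ and SKETCH_ name is a hypothetical stand-in for the corresponding declaration in extractor.h, the signature of the read function is assumed, and the "XXX" file format (the four bytes "XXX1" followed by a NUL-terminated UTF-8 title) is invented. The plugin checks the header, reports a mime type and a title through proc, and stops as soon as proc returns non-zero.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the enums declared in <extractor.h>. */
enum sketch_type { SKETCH_TYPE_MIMETYPE = 1, SKETCH_TYPE_TITLE = 2 };
enum sketch_format { SKETCH_FORMAT_UTF8 = 1 };

/* Mirrors the shape of the "proc" callback signature. */
typedef int (*sketch_processor) (void *cls, const char *plugin_name,
                                 enum sketch_type type,
                                 enum sketch_format format,
                                 const char *data_mime_type,
                                 const char *data, size_t data_len);

/* Stand-in for the extract context; the read signature is an assumption. */
struct sketch_context
{
  void *cls;
  const char *config;
  long (*read) (void *cls, void **data, size_t size);
  sketch_processor proc;
};

/* Plugin entry point for the invented "XXX" format. */
void
sketch_xxx_extract_method (struct sketch_context *ec)
{
  static const char mime[] = "application/x-xxx";
  void *buf;
  long n = ec->read (ec->cls, &buf, 64);
  const char *p = buf;

  if ((n < 5) || (0 != memcmp (p, "XXX1", 4)))
    return;                     /* header does not match: not our format */
  /* Lengths passed here include the terminating NUL (a sketch assumption). */
  if (0 != ec->proc (ec->cls, "xxx", SKETCH_TYPE_MIMETYPE,
                     SKETCH_FORMAT_UTF8, "text/plain", mime, sizeof (mime)))
    return;                     /* the consumer asked us to abort */
  const char *title = p + 4;
  const char *end = memchr (title, '\0', (size_t) n - 4);
  if (NULL == end)
    return;                     /* title is not terminated: give up */
  ec->proc (ec->cls, "xxx", SKETCH_TYPE_TITLE, SKETCH_FORMAT_UTF8,
            "text/plain", title, (size_t) (end - title) + 1);
}

/* Shared closure: an in-memory "file" plus a counter of items seen. */
struct sketch_state
{
  const char *bytes;
  size_t size;
  size_t pos;
  int items;
  int limit;
};

/* Hands out a pointer into the buffer and advances the read position. */
static long
sketch_read (void *cls, void **data, size_t size)
{
  struct sketch_state *s = cls;
  size_t left = s->size - s->pos;

  if (size > left)
    size = left;
  *data = (void *) (s->bytes + s->pos);
  s->pos += size;
  return (long) size;
}

/* Prints UTF-8 items; returns non-zero once 'limit' items were seen. */
static int
sketch_proc (void *cls, const char *plugin_name, enum sketch_type type,
             enum sketch_format format, const char *data_mime_type,
             const char *data, size_t data_len)
{
  struct sketch_state *s = cls;

  (void) data_mime_type;
  (void) data_len;
  if (SKETCH_FORMAT_UTF8 == format)
    printf ("%s: type %d - %s\n", plugin_name, (int) type, data);
  s->items++;
  return s->items >= s->limit;
}

/* Runs the plugin over a buffer; returns how many items it reported. */
int
sketch_run (const char *bytes, size_t size, int limit)
{
  struct sketch_state s = { bytes, size, 0, 0, limit };
  struct sketch_context ec = { &s, NULL, &sketch_read, &sketch_proc };

  sketch_xxx_extract_method (&ec);
  return s.items;
}
```

Calling sketch_run ("XXX1Hello", 10, 10) makes the plugin report two items, while a limit of 1 stops it after the mime type. A real plugin would include extractor.h, use the library's own types and be named EXTRACTOR_XXX_extract_method.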
If "proc" returns non-zero, the plugin should abort processing the current file and return.
- Related projects and useful resources
- Projects that use libextractor
Licensing
Libextractor is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.