Details

This virtual, hands-on workshop introduces the capabilities of the tika-eval module. It is designed for those interested in:

  1. profiling files (digests, mime types)
  2. profiling text extracted from files (number of tokens, automatic language detection, out-of-vocabulary statistics for junk detection)
  3. comparing the text produced by different text extractors.
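
To give a rough feel for the three kinds of profiling listed above, here is a minimal Python sketch. It is not tika-eval code and does not use its API: the function names, the tiny vocabulary, and the Jaccard overlap measure are all assumptions chosen for illustration, and the MIME type is guessed from the file name only (Tika itself inspects the content). Language detection is left out.

```python
import hashlib
import mimetypes
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def profile_file(name, data):
    """Profile raw bytes: SHA-256 digest, length, and a MIME type guessed from the name."""
    mime, _encoding = mimetypes.guess_type(name)
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "length": len(data),
        "mime": mime,  # assumption: name-based guess, not content detection
    }


def profile_text(text, vocabulary):
    """Count tokens and compute the share of tokens missing from a known vocabulary."""
    tokens = tokenize(text)
    unknown = [t for t in tokens if t not in vocabulary]
    oov = len(unknown) / len(tokens) if tokens else 0.0
    return {"num_tokens": len(tokens), "oov": oov}


def compare_extracts(text_a, text_b):
    """Jaccard overlap of the unique tokens in two extracts (1.0 means identical sets)."""
    a, b = set(tokenize(text_a)), set(tokenize(text_b))
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

A high out-of-vocabulary share is one hint that an extract is junk, for example text pulled from a PDF with a broken font mapping. A low overlap between two extracts of the same file suggests the extractors disagree and the file deserves a closer look.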

There will be a heavy emphasis on processing PDF files.

Attendees should be comfortable running tika-app from the command line or sending requests with curl to a local tika-server. See the link below for the prerequisites, which are still a work in progress.
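
As a quick self-check of the tika-server prerequisite, the sketch below sends a file to a local server and asks for plain text back. It assumes a tika-server is already running on localhost at its default port 9998 and uses the `/tika` endpoint with an HTTP PUT; the helper names and `sample.pdf` are hypothetical.

```python
import urllib.request

# Assumption: a local tika-server listening on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_extract_request(data, url=TIKA_URL):
    """Build a PUT request that uploads the file bytes and asks for plain text."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path, url=TIKA_URL):
    """Send the file at `path` to tika-server and return the extracted text."""
    with open(path, "rb") as handle:
        request = build_extract_request(handle.read(), url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With a server running, `extract_text("sample.pdf")` should return the extracted text. The equivalent curl call is `curl -T sample.pdf http://localhost:9998/tika -H "Accept: text/plain"`.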

https://cwiki.apache.org/confluence/display/TIKA/Apache+Tika+Meetups
