Last Updated: September 09, 2019

· danielpetisme

Extract content of .doc(x), .ppt(x), etc.

As simple as 1, 2, 3: Apache Tika does the job, and does it well. This library can extract the content and metadata of a wide range of structured documents, such as Microsoft Office files. Search tools in the Apache Lucene ecosystem, such as Apache Solr, build on it to extract document contents and make them searchable.

Apache Tika offers a standalone version of the application with an easy CLI (you can also launch a GUI).

To extract the content of a PowerPoint file as HTML, you just have to run the following command:

java -jar tika-app.jar --html Test.pptx > Test.html
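
If you want to script that call from Java rather than from a shell, a minimal sketch could look like the one below. It assumes that `java` is on the PATH and that `tika-app.jar` sits in the working directory. The class name `TikaCliRunner`, its two helper methods, and the file names `slides.pptx` and `slides.html` are made up for illustration. The option is placed before the input file, since the CLI reads its arguments in order.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;

// Hypothetical helper: wraps the Tika CLI call in a small Java class.
public class TikaCliRunner {

    // Builds the argument list for the Tika CLI; the option comes before the input file.
    static List<String> buildCommand(String tikaJar, String option, String inputFile) {
        return List.of("java", "-jar", tikaJar, option, inputFile);
    }

    // Runs the CLI, writes its standard output to outputFile, and returns the exit code.
    static int extract(String tikaJar, String option, String inputFile, File outputFile)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(buildCommand(tikaJar, option, inputFile));
        builder.redirectOutput(outputFile);                      // plays the role of the shell's ">"
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);  // keep Tika's messages on the console
        return builder.start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder paths: adjust them to your own jar and document.
        int exitCode = extract("tika-app.jar", "--html", "slides.pptx", new File("slides.html"));
        System.out.println("Tika exited with code " + exitCode);
    }
}
```

Swapping `--html` for `--text` or `--metadata` should give plain text or the document metadata instead.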

Written by Daniel PETISME
