Last Updated: September 09, 2019
· danielpetisme

Extract content of .doc(x), .ppt(x), etc.

As simple as 1, 2, 3: Apache Tika does the job, and does it well. This library extracts the content and metadata of structured documents such as Microsoft Office files. Search tools in the Apache Lucene ecosystem, such as Apache Solr, rely on it to extract document contents and make them searchable.

Apache Tika offers a standalone version of the application (tika-app) with an easy CLI (you can also invoke a GUI).
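As a rough sketch of what the CLI offers, the wrappers below assume the standalone jar has been saved as tika-app.jar in the current directory (adjust TIKA_JAR otherwise). The function names are hypothetical; --text, --metadata and --gui are options of the tika-app command line.

```shell
#!/bin/sh
# Assumption: the standalone jar is named tika-app.jar and sits in the
# current directory. Override TIKA_JAR to point somewhere else.
TIKA_JAR="${TIKA_JAR:-tika-app.jar}"

# Hypothetical helper: print the plain-text content of a document.
tika_text() {
  java -jar "$TIKA_JAR" --text "$1"
}

# Hypothetical helper: print only the metadata (author, dates, ...).
tika_metadata() {
  java -jar "$TIKA_JAR" --metadata "$1"
}

# Hypothetical helper: open the Tika GUI instead of using the CLI.
tika_gui() {
  java -jar "$TIKA_JAR" --gui
}
```

For instance, `tika_metadata report.docx` should list the document properties, one per line.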

To extract the content of a PowerPoint file as HTML, you just have to run the following command:

java -jar tika-app.jar --html Test.pptx > Test.html
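The same command can be applied to several files at once. The sketch below is one possible way to do it, not an official Tika script: html_name and convert_all are hypothetical helpers, and the jar location is again an assumption.

```shell
#!/bin/sh
# Assumption: tika-app.jar is in the current directory unless TIKA_JAR says otherwise.
TIKA_JAR="${TIKA_JAR:-tika-app.jar}"

# Hypothetical helper: replace the last extension with .html
# ("Test.pptx" becomes "Test.html").
html_name() {
  printf '%s.html\n' "${1%.*}"
}

# Hypothetical helper: convert every file given as argument to HTML,
# writing each result next to its source file.
convert_all() {
  for f in "$@"; do
    java -jar "$TIKA_JAR" --html "$f" > "$(html_name "$f")"
  done
}
```

After sourcing the file, something like `convert_all *.pptx *.docx` would produce one HTML file per document.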