Extract content of .doc(x), .ppt(x), etc.
It's as simple as 1, 2, 3: Apache Tika does the job, and does it well. This library can extract the content and metadata of many document formats, such as Microsoft Office documents. Apache Solr, the search server built on the outstanding Apache Lucene, relies on it to extract document contents and make them searchable.
Apache Tika offers a standalone version of the application with an easy CLI (you can also launch a GUI with the --gui option).
To extract the content of a PowerPoint file as HTML, you just have to run the following command:
java -jar tika-app.jar --html Test.pptx > Test.html
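If you want to run the same extraction from Java code rather than from a shell, here is a minimal sketch that simply wraps the command above with the JDK's ProcessBuilder. The class name TikaCli and its two helper methods are hypothetical names made up for this example, and it assumes that a java executable is on the PATH and that you pass the path to your own copy of tika-app.jar. The format flag is passed through untouched, so --html works as above; tika-app also accepts --text and --metadata for plain text and metadata output.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;

// Hypothetical helper for this tip: it only shells out to the tika-app CLI.
final class TikaCli {

    /** Builds the command line; options go before the input file name. */
    static List<String> buildCommand(String tikaJar, String formatFlag, String input) {
        return List.of("java", "-jar", tikaJar, formatFlag, input);
    }

    /** Runs tika-app and writes its standard output to the given file. Returns the exit code. */
    static int extract(String tikaJar, String formatFlag, String input, File output)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(tikaJar, formatFlag, input));
        pb.redirectOutput(output);                          // plays the role of "> output"
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);  // keep Tika's warnings visible
        return pb.start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 4) {
            System.out.println("usage: TikaCli <tika-app.jar> <flag> <input> <output>");
            return;
        }
        int exit = extract(args[0], args[1], args[2], new File(args[3]));
        System.out.println("tika-app exited with code " + exit);
    }
}
```

For example, calling it with the arguments tika-app.jar --html slides.pptx slides.html (slides.pptx being a placeholder file name) should leave the HTML in slides.html. If you need the extracted text inside your program rather than in a file, the Tika library itself can be used directly instead of going through the CLI.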
Written by Daniel PETISME