Apache Tika
A Poor Journalist’s Text-Mining Toolkit
|
How can you search and analyze collections of documents on your own computers with simple tools? At DataHarvest, Robert Gebeloff and I ran a workshop to answer that question. As people were seemed interested, here’s a write-up of the two key tools we worked with: Apache Tika for content extraction and regular expressions in Sublime Text as an advanced search tool.