OK, this was a very simple problem (in hindsight), but it took me quite some time to figure out:
We use Elasticsearch to search through PDFs and the Elasticsearch Attachment Mapper to index them.
However, we have over 500k documents, and we noticed that after one or two days Elasticsearch tended to spike CPU usage to 100% because of the attachment mapper. Because even Elasticsearch experts could not find the problem (btw, try to find an expert first ;)), we decided to use Tika (the library the attachment mapper uses under the hood) directly.
It seemed pretty straightforward: add tika-core to the pom.xml, change two lines of code, and away you go...
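For reference, that first attempt amounted to a dependency entry along these lines. This is only a sketch: the version is assumed to match the 1.7 used further down, and your pom may pin a different one.

```xml
<!-- First attempt: tika-core only. Version 1.7 is an assumption; adjust to your setup. -->
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-core</artifactId>
  <version>1.7</version>
</dependency>
```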
At least so we thought, because all our test code worked flawlessly. However, running it in Tomcat did not extract any text from the PDFs.
After lots of debugging, we noticed that the parsers called from the tests were different from the ones invoked on the server. Further investigation revealed that my Java IDE, IntelliJ, contained the tika-parsers library as well as the tika-core library, while the server only had tika-core.
So after checking the documentation (yeah, probably a bit late) I found a comment that the tika-core library can identify a document's type but cannot parse its contents.
After replacing tika-core with tika-parsers in the pom.xml, I got an error that pointed to a library incompatibility:
Handler processing failed; nested exception is java.lang.VerifyError: class net.sf.cglib.core.DebuggingClassWriter overrides final method visit.(IILjava/lang/String;Ljava/lang/String;Ljava/lang/String;[Ljava/lang/String;)V
This error took some more research, but after a lot of coffee and even more strong words I found that many of the libraries pulled in by tika-parsers were already in our own pom, and that the culprit was the asm library.
So here is what worked for us. Depending on your pom, you may need to exclude other libraries as well.
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.7</version>
  <exclusions>
    <exclusion>
      <groupId>org.ow2.asm</groupId>
      <artifactId>asm-debug-all</artifactId>
    </exclusion>
  </exclusions>
</dependency>
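If other libraries clash in your project, the same pattern extends to more exclusions. Running mvn dependency:tree lists the transitive dependencies that tika-parsers brings in, which helps to spot duplicates. The sketch below is hypothetical: com.example:conflicting-lib is a placeholder, not a real artifact, and stands in for whatever your own dependency tree reveals.

```xml
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.7</version>
  <exclusions>
    <!-- The exclusion that fixed the VerifyError for us -->
    <exclusion>
      <groupId>org.ow2.asm</groupId>
      <artifactId>asm-debug-all</artifactId>
    </exclusion>
    <!-- Hypothetical placeholder: replace with the library that conflicts in your pom -->
    <exclusion>
      <groupId>com.example</groupId>
      <artifactId>conflicting-lib</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```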
3 February 2015