How to implement free full text search in StreamServe Collector in 5 minutes (a step by step guide)

Wednesday 30 July, 2014
Hipolito Jimenez
6 likes 4184 views

Using the Archive Web Service SDK and the Apache Solr search platform to provide a simple solution that enables you to search freely for any text in any PDF file archived in StreamServe Collector.

Some years ago a customer asked me if there was any way to search for any text inside the PDF documents archived in StreamServe Collector.

(The customer generates invoices, delivery notes, packing lists, etc. from different legacy applications, and uses “post-processing” to compose a single dossier document with all the related invoices, delivery notes, etc. for every new shipment.)

You know the answer: it is impossible to search for arbitrary text inside the PDF documents archived in StreamServe Collector using the usual StreamStudio interface.

But now we have the Archive Web Service SDK, so we no longer need to rely on the StreamStudio interface alone.

With this SDK, the Apache Solr search platform, and some spare time during the latest StreamServe migration, we (my customer and I) have implemented a very simple solution (a proof of concept) that enables you to search freely for any text in any PDF file archived in StreamServe Collector.
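The idea behind the proof of concept is simple: read each archived PDF's text (via the Archive Web Service SDK plus a PDF text extractor) and post it to Solr as an indexable document. As a rough illustration of the second half of that flow, here is a minimal sketch of building a Solr JSON "add" payload; the field names `id` and `text` are illustrative assumptions, not necessarily the fields used by the real Collector2Solr application:

```java
public class SolrPayload {
    // Build a minimal Solr JSON "add" payload from a Collector document id
    // and the text extracted from its PDF (field names are illustrative).
    static String toSolrAdd(String docId, String pdfText) {
        return "{\"add\":{\"doc\":{\"id\":\"" + escapeJson(docId)
                + "\",\"text\":\"" + escapeJson(pdfText) + "\"}}}";
    }

    // Escape the two characters that would break a JSON string literal.
    static String escapeJson(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        System.out.println(toSolrAdd("42", "Invoice 2014/123"));
    }
}
```

Such a payload would be POSTed to the Solr core's update handler; in a real indexer you would of course use a proper JSON library and the SDK's document metadata instead of hand-built strings.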

 

These are the steps you need to follow to try this “proof of concept”:

  1. You need a StreamServe Collector database up and running on your laptop (we have tried StreamServe version 5.6.1, but it should also work with versions 5.5 and 5.6)
  2. You need the StreamServe Service Gateway up and running on your laptop (again, we have tried StreamServe version 5.6.1, but it should also work with versions 5.5 and 5.6)
  3. You need Apache Tomcat up and running on your laptop (we have Tomcat 7 running with Java 8)
  4. Download the preconfigured version of the Apache Solr web application from http://goo.gl/MXhRnF
  5. Deploy “solr.war” in the “webapps” folder of your Apache Tomcat
  6. Make sure Apache Solr is up and running by accessing http://localhost:8080/solr/core/browse
  7. Download the “Collector2Solr” web application (this application simply reads documents from Collector and indexes them with Solr) from http://goo.gl/pbqwIx
  8. Deploy “Collector2Solr.war” in the “webapps” folder of your Apache Tomcat. This will index all the documents up to (but not including) today. The original documents and metadata remain in the Collector database; only the new text indexes are created in Solr
  9. Search for something in Solr using http://localhost:8080/solr/core/browse (* is a wildcard)
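Besides the browse page in step 9, you can also query the core programmatically. A minimal Java sketch that builds a query URL for Solr's standard `/select` handler (the base URL matches the core from step 6; adjust host and port to your own Tomcat):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SolrSearchUrl {
    // Base URL of the local Solr core from step 6; adjust host/port as needed.
    static final String BASE = "http://localhost:8080/solr/core";

    // Build a standard Solr /select query URL; '*' acts as a wildcard
    // in the query term, exactly as in the browse interface.
    static String buildSearchUrl(String term) {
        String q = URLEncoder.encode(term, StandardCharsets.UTF_8);
        return BASE + "/select?q=" + q + "&wt=json";
    }

    public static void main(String[] args) {
        System.out.println(buildSearchUrl("invoice*"));
    }
}
```

Fetching that URL (with any HTTP client) returns the matching documents as JSON, which makes it easy to build your own search front end instead of the default browse page.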

 

Yes, it is that easy; but you need to remember that this is just a proof of concept: the “Solr” search web interface is ugly and “Collector2Solr” needs to be fine-tuned.

 

Hope this helps.

 

Comments (4)

  • Hi Hipolito,

    Nice feature. Is there a way to force Collector2Solr to index?

    After I deployed both WAR files and restarted Apache Tomcat, I do see activity in my gateway, yet I am unable to find any documents.

    Regards,

     

    Tim Hageraats

    Wednesday 06 August, 2014 by Tim Hageraats
  • Have a look at the directory:

    <TOMCAT>/webapps/solr/WEB-INF/solr/core/data/index

    (these are the index files used by Solr); if you have indexed something you should see some files with names like "Lucene"

    Also have a look at the Solr log file; it is in <TOMCAT>/logs/solr.log

    You can also try to reindex Collector: just stop Tomcat, delete the file <TOMCAT>/webapps/Collector2Solr/WEB-INF/TimeStamp.txt (this file contains the last date that was indexed), and restart Tomcat.

     

    Best Regards.

     

    Wednesday 06 August, 2014 by Hipolito Jimenez
  • Hi Hipolito,

    The timestamp file contains 2014-06-11, which is quite strange as the system dates are correct.

    The index directory contains write.lock, segments.gen and segments_1, but no Lucene files.

     

    The log file gives me nothing really :(

    The only noticeable thing is this, skipping the commit:

    org.apache.solr.update.DirectUpdateHandler2; start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
    INFO - 2014-08-05 17:04:56.356; org.apache.solr.update.DirectUpdateHandler2; No uncommitted changes. Skipping IW.commit.
    INFO - 2014-08-05 17:04:56.371; org.apache.solr.core.SolrCore; SolrIndexSearcher has not changed - not re-opening: org.apache.solr.search.SolrIndexSearcher
    INFO - 2014-08-05 17:04:56.371; org.apache.solr.update.DirectUpdateHandler2; end_commit_flush
    INFO - 2014-08-05 17:04:56.371; org.apache.solr.update.processor.LogUpdateProcessor; [core] webapp=/solr path=/update params={waitSearcher=true&commit=true&wt=javabin&version=2&softCommit=false} {commit=} 0 46
    INFO - 2014-08-05 17:05:43.375; org.apache.solr.servlet.SolrDispatchFilter; [admin] webapp=null path=/admin/cores params={indexInfo=false&_=1407251145791&wt=json} status=0 QTime=16
    INFO - 2014-08-05 17:05:43.406;

    Thursday 07 August, 2014 by Tim Hageraats
  • It looks like there is something wrong when trying to index the first PDF.

    Are you archiving "Device Independent Copy" documents in Collector? (At this time it only works with PDF files.)

    It could also be a problem with the JVM memory settings (just follow the recommended settings for StreamStudio).

     

    Have a look at the log files:

    - catalina.log

    - commons-daemon.log

    - localhost.log

    - tomcat7-stderr.log

    - tomcat7-stdout.log

     

    Best Regards.

     

    Thursday 07 August, 2014 by Hipolito Jimenez
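The reindex trick from the comments above (deleting TimeStamp.txt while Tomcat is stopped) can be sketched in a few lines of Java. Note the assumptions: the file path comes from the comment, the file is assumed to hold a single ISO date such as 2014-06-11, and the helper names here are hypothetical:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDate;

public class ReindexHelper {
    // Read the last-indexed date from Collector2Solr's TimeStamp.txt
    // (assumed to contain a single ISO date such as 2014-06-11).
    static LocalDate readLastIndexed(Path timeStamp) throws IOException {
        return LocalDate.parse(Files.readString(timeStamp).trim());
    }

    // Deleting the file makes Collector2Solr reindex everything the next
    // time Tomcat starts (stop Tomcat before doing this).
    static void forceReindex(Path timeStamp) throws IOException {
        Files.deleteIfExists(timeStamp);
    }

    public static void main(String[] args) throws IOException {
        // Demo against a temporary file rather than a live install.
        Path ts = Files.createTempFile("TimeStamp", ".txt");
        Files.writeString(ts, "2014-06-11");
        System.out.println("Last indexed: " + readLastIndexed(ts));
        forceReindex(ts);
        System.out.println("Deleted: " + !Files.exists(ts));
    }
}
```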

   

