Saturday, April 9, 2011

FS4SP PDF issues and Adobe iFilter

Subject:  Extending the FAST pipeline (FS4SP), user_converter_rules.xml, Advanced Filter Pack, Adobe iFilter 9  for 64-bit platforms.
Problem:  Why aren’t my PDFs indexed correctly using the built-in Advanced Filter Pack with FS4SP?
Response: I suspect I see this issue more often than most, because I have helped a number of people index content from Exchange into the FS4SP index.  It turns out that the built-in filter converters for FS4SP do not work correctly when indexing PDF attachments.  I see this mostly with attachments to email messages, which most people will not run into, but the problem also exists with Zip files, because FS4SP treats files contained within a Zip file as attachments.
Solution:
Extend the FS4SP pipeline and use Adobe’s PDF iFilter 9 for 64-bit platforms, which is available for free at http://www.adobe.com/support/downloads/detail.jsp?ftpID=4025.  Upon realizing that the built-in converter that ships with FS4SP does not handle attachments correctly, I decided to see what would happen if I switched to Adobe’s iFilter.  Sure enough, extending the pipeline to bypass the built-in converter fixed the issue.  Not wanting to hurt customers by replacing built-in functionality, I tested the performance of the built-in converter against Adobe’s version.  Testing against a corpus of 2,500 PDFs, I found that the performance was virtually identical.  With this in mind, here are the steps to determine whether you have the same issue and, if so, to fix it.

1.      Place a Word document and a PDF into a folder that is accessible to the SharePoint crawler.
2.      Add both files to a compressed (Zip) folder within the same folder.
3.      Set up a new file share content source within the FAST Content SSA.
4.      Run a full crawl.
5.      Go to the default Search Center and search for text that appears within the body of the PDF.
        a.      Only one result is found: the stand-alone PDF.
        b.      The Zip file is not returned as part of the search results.
6.      Repeat the same search for the Word document.
        a.      Both the stand-alone Word document and the Zip file are returned, because the Word document is correctly indexed as an attachment contained within the Zip file.
7.      Download the Adobe PDF iFilter 9 for 64-bit platforms.
8.      Install the PDF iFilter on every FS4SP server that has document processors configured.
9.      Configure the FS4SP pipeline to use the new iFilter.
a.      Navigate to:
<FAST Install Drive>\FASTSearch\etc\config_data\DocumentProcessor\formatdetector
b.      Edit: user_converter_rules.xml
c.      Add new entries for PDFs to both the “trust” and “MimeMapping” nodes.

<ConverterRules>
  <IFilter>
    <trust>
      <!-- <ext name=".xxx" mimetype="application/xxx" /> -->
      <ext name=".pdf" mimetype="application/pdf"/>
    </trust>
  </IFilter>

  <MimeMapping>
    <!--
    A mapping between mime types and the description of them.
    <mime type="application/xxx">XXX Document</mime>
    -->
    <mime type="application/pdf">Adobe PDF</mime>
  </MimeMapping>
</ConverterRules>

10.     Reset the FAST processor servers (pipeline).
a.      Open the FAST Command Shell as Administrator.
b.      Issue the command: “psctrl reset”.  (The processor servers are tied together, so this command only needs to be run once, after all processor servers have been configured.)
11.     Reset the FAST index.  (Optional; a full crawl will replace the items within the index.)
12.     Repeat steps 4, 5, and 6.
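
If you have several document processor servers to update, the edit to user_converter_rules.xml in step 9 can be scripted instead of made by hand. What follows is only a minimal sketch in Python (3.8 or later), not a supported tool: the function name add_pdf_rules is my own, and it assumes the file has the same ConverterRules / IFilter / trust / MimeMapping layout as the fragment shown above. It adds the two PDF entries only if they are not already present.

```python
import xml.etree.ElementTree as ET

PDF_EXT = ".pdf"
PDF_MIME = "application/pdf"


def add_pdf_rules(xml_text: str) -> str:
    """Return xml_text with the PDF trust and MimeMapping entries added if missing.

    Hypothetical helper: assumes a <ConverterRules> root laid out like the
    fragment in this post.
    """
    # insert_comments=True keeps the template comments that ship in the file.
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    root = ET.fromstring(xml_text, parser=parser)

    # Find or create <IFilter><trust>, then add the .pdf extension once.
    ifilter = root.find("IFilter")
    if ifilter is None:
        ifilter = ET.SubElement(root, "IFilter")
    trust = ifilter.find("trust")
    if trust is None:
        trust = ET.SubElement(ifilter, "trust")
    if not any(e.get("name") == PDF_EXT for e in trust.findall("ext")):
        ET.SubElement(trust, "ext", {"name": PDF_EXT, "mimetype": PDF_MIME})

    # Find or create <MimeMapping>, then add the PDF description once.
    mapping = root.find("MimeMapping")
    if mapping is None:
        mapping = ET.SubElement(root, "MimeMapping")
    if not any(m.get("type") == PDF_MIME for m in mapping.findall("mime")):
        mime = ET.SubElement(mapping, "mime", {"type": PDF_MIME})
        mime.text = "Adobe PDF"

    return ET.tostring(root, encoding="unicode")
```

To use it, read the file’s text, pass it to add_pdf_rules, and write the result back after taking a backup copy. The serialized output does not carry over an XML declaration or add indentation for the new entries, so compare it against the fragment above before issuing “psctrl reset”.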

Summary:   Yes, the built-in converter for PDFs in FS4SP is broken.  After you repeat the crawl and the searches (steps 4 through 6), you will find that the PDFs are now indexed correctly, whether they are stand-alone PDFs or PDFs treated as attachments to another file such as an MSG or Zip file.
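
If you want to rerun the test on more than one file share, the folder from steps 1 and 2 can be built with a short script. This is a rough sketch only: build_test_folder and the default Zip name are names I made up for illustration, and you still have to supply a real Word document and a real PDF of your own.

```python
import shutil
import zipfile
from pathlib import Path


def build_test_folder(doc_path, pdf_path, target_dir, zip_name="attachments.zip"):
    """Copy a Word document and a PDF into target_dir and zip both alongside them.

    Hypothetical helper for the test in steps 1 and 2; returns the Zip path.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    sources = [Path(doc_path), Path(pdf_path)]

    # Step 1: stand-alone copies that the crawler indexes directly.
    for src in sources:
        shutil.copy2(src, target / src.name)

    # Step 2: the same two files inside a compressed folder.
    zip_path = target / zip_name
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for src in sources:
            archive.write(src, arcname=src.name)

    return zip_path
```

Point target_dir at the folder used by your file share content source, then continue with the full crawl in step 4.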
KORITFW

2 comments:

  1. Will this work for PDF Portfolios? We have a client that has begun consolidating a number of files into PDF "Portfolio" documents.

  2. Eric;
    Was this issue resolved with SP SP1?
