Saturday, April 30, 2011

FAST Search for SharePoint 2010 (FS4SP) and AdvancedFilterPack file types

Topic: SharePoint 2010, FAST Search for SharePoint 2010 (FS4SP), AdvancedFilterPack, SearchExportConverter
Subject:  File types included in the Advanced Filter Pack
Problem: I can’t seem to find any documentation on what file types are included OOB for IFilters or what file types are included in the AdvancedFilterPack?
Response: This is a great question that I get all the time.  I see a lot of blogs about PDFs not indexing correctly, with the advice to enable the AdvancedFilterPack. I have blogged about problems with the built-in PDF Converter ( http://fs4sp.blogspot.com/2011/04/why-arent-my-pdfs-indexed-correctly-and.html ), but you may be surprised to learn that PDFs are enabled without the Advanced Filter Pack.  The majority of issues I see with PDFs or other OOB types are permission related.  When using the FS4SP configuration wizard, always remember to “Run as Administrator”.  The configuration wizard (which can be run manually and provides a whole lot more information) will propagate permissions through the FASTSearch installation directory.  If the permissions are not set correctly, the IFilter conversion process may fail.  If you use the FAST Shell command “docpush” to test the installation, never use a text-based file such as .txt or .html, because these files do not invoke the IFilterConverter.
DocPush Example:  docpush -c <Collection Name ("sp" is default)> "C:\TEST\1.pdf" -l verbose

So let’s take a look at how to determine what is included OOB and what is included with the AdvancedFilterPack.

Solution\Example:
1.      Here is the list of OOB file types
a.      .doc,.docm,.docx,.dot,.dotx,.eml,.html,.mht,.mhtml,.msg,.nws,.odp,.ods,.odt,.one,.pdf,.pot,.pps,.ppt,.pptm,.pptx,.pub,.rtf,.txt,.vdw,.vsd,.vss,.vst,.vsx,.vtx,.xlb,.xlc,.xls,.xlsb,.xlsm,.xlsx,.xlt,.xlm,.xps,.zip
b.      Pretty easy … I stole it right off TechNet. http://technet.microsoft.com/nb-no/library/gg471168(en-us).aspx
c.      It doesn’t make any reference to Advanced Filter Pack file types, and I wouldn’t blog about it if the answer were simply a link to TechNet.

2.      Let’s ignore Step 1 and start again. How can I find what is OOB and what is in AdvancedFilterPack?

3.      Most people simply enable the AdvancedFilterPack as soon as they know about it, and that makes sense in most situations. However, if every document in the corpus to be indexed falls into the list provided in #1, there is no need to enable the Advanced Filter Pack and add overhead to the FAST pipeline.

4.      Enable the Advanced Filter Pack

a.      Open the FAST Command Shell as Administrator on the FAST Admin Node
        i.     Navigate to the <FAST Install Drive>\FASTSearch\installer\scripts directory.
        ii.    Issue: “.\AdvancedFilterPack.ps1 -enable”

5.      There are two ways to verify the Advanced Filter Pack is enabled.
a.      1st - Re-issue the “.\AdvancedFilterPack.ps1 -enable” command.
        i.     If it is already enabled it will tell you.

b.      2nd - The optional processor “SearchExportConverter” will be toggled to active="yes"
        i.     Use Windows Explorer to navigate to:
1.      <FAST Install Drive>\FASTSearch\etc\config_data\DocumentProcessor
        ii.    Open the optionalprocessing.xml file in Internet Explorer, WordPad, or Notepad.
        iii.   Locate the processor name “SearchExportConverter”

When Enabled:
<optionalprocessing>
               <processor name="SearchExportConverter" active="yes" />
</optionalprocessing>

When Disabled:
<optionalprocessing>
               <processor name="SearchExportConverter" active="no" />
</optionalprocessing>
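
The manual check in step 5b can also be scripted. Below is a minimal Python sketch, not something that ships with FS4SP: the function name processor_is_active is made up for illustration, and it assumes optionalprocessing.xml has the simple shape shown in the two snippets above.

```python
import xml.etree.ElementTree as ET

def processor_is_active(root, name="SearchExportConverter"):
    """Return True/False for the processor's active flag, or None if it is not listed."""
    for proc in root.iter("processor"):
        if proc.get("name") == name:
            # The snippets above use active="yes" / active="no".
            return proc.get("active", "").strip().lower() == "yes"
    return None

# Sample content mirroring the "When Enabled" snippet above (not read from a live install).
sample = """<optionalprocessing>
    <processor name="SearchExportConverter" active="yes" />
</optionalprocessing>"""

print(processor_is_active(ET.fromstring(sample)))
```

Against a real installation, the root element would presumably come from ET.parse(path).getroot(), where path points at optionalprocessing.xml in the DocumentProcessor directory from step 5b.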

6.      The FAST Pipeline processor stage “SearchExportConverter” is the stage which is invoked by the AdvancedFilterPack.

7.      The SearchExportConverter process uses two configuration files:
a.      user_converter_rules.xml, which is used to extend the pipeline to use additional IFilters (http://fs4sp.blogspot.com/2011/04/fs4sp-and-userconverterrulesxml.html)
b.      converter_rules.xml, which holds the information as to what is included in the OOB file types and the Advanced Filter Pack file types

8.      Inspect the converter_rules.xml
a.      Use Windows Explorer to navigate to <FAST Install Drive>\FASTSearch\etc\formatdetector
b.      Open converter_rules.xml using Internet Explorer, WordPad, or Notepad
c.      Inspect the <filetypes> node under the <OutsideIn> node
d.      Any file type where process="true" is a file type which is covered under the Advanced Filter Pack
e.      Any file type where process="false" is a file type which is covered OOB.

9.      Let’s find out if Autodesk file types and PDFs are covered.  The extension for AutoCAD is .dwg

10.   Search for “dwg” or “AutoCAD”
a.      Side Note: You may have to look at the comments as not all file types have an associated mime type.

11.   Search for “pdf” or “Adobe”

<ConverterRules>
  <OutsideIn>
    <filetypes>
      ....
      <file id="1557" process="false" mimetype="application/pdf"/> <!-- Adobe Acrobat (PDF) -->
      ....
      <file id="1552" process="true" mimetype="image/vnd.dwg"/> <!-- AutoCAD Drawing 12 -->
      <file id="1553" process="true" mimetype="image/vnd.dwg"/> <!-- AutoCAD Drawing 13 -->
      ....
    </filetypes>
    ....
  </OutsideIn>
  ....
</ConverterRules>
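
The searches in steps 10 and 11 can be sketched in Python as well. This is only an illustration under stated assumptions: the helper names (load_file_types, find_types, coverage) are hypothetical, the sample XML is just the excerpt above without the elided parts, and the sketch assumes each descriptive comment directly follows the <file> element it describes. It keeps the comments because, as the side note says, not every file type has a mime type to search on.

```python
import xml.etree.ElementTree as ET

def load_file_types(xml_text):
    """Return a list of dicts (id, mimetype, process, label) from <OutsideIn>/<filetypes>."""
    # insert_comments=True (Python 3.8+) keeps <!-- ... --> nodes in the parsed tree.
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    root = ET.fromstring(xml_text, parser=parser)
    entries = []
    for node in root.find("OutsideIn/filetypes"):
        if node.tag is ET.Comment:
            # Assumption: a comment labels the <file> element just before it.
            if entries and not entries[-1]["label"]:
                entries[-1]["label"] = (node.text or "").strip()
        elif node.tag == "file":
            entries.append({
                "id": node.get("id", ""),
                "mimetype": node.get("mimetype", ""),
                "process": node.get("process", ""),
                "label": "",
            })
    return entries

def find_types(entries, term):
    """Case-insensitive search in both the mime type and the comment label."""
    term = term.lower()
    return [e for e in entries
            if term in e["mimetype"].lower() or term in e["label"].lower()]

def coverage(entry):
    """process="true" means Advanced Filter Pack; otherwise treat the type as OOB."""
    return "Advanced Filter Pack" if entry["process"] == "true" else "OOB"

# Sample taken from the excerpt above, not the full converter_rules.xml.
sample = """<ConverterRules>
  <OutsideIn>
    <filetypes>
      <file id="1557" process="false" mimetype="application/pdf"/> <!-- Adobe Acrobat (PDF) -->
      <file id="1552" process="true" mimetype="image/vnd.dwg"/> <!-- AutoCAD Drawing 12 -->
      <file id="1553" process="true" mimetype="image/vnd.dwg"/> <!-- AutoCAD Drawing 13 -->
    </filetypes>
  </OutsideIn>
</ConverterRules>"""

entries = load_file_types(sample)
for term in ("pdf", "AutoCAD"):
    for e in find_types(entries, term):
        print(term, e["id"], e["label"], "->", coverage(e))
```

For the real file, the text would be read from converter_rules.xml in the formatdetector directory from step 8; the pairing of comments to entries should be spot-checked by eye, since the file's layout may not match the excerpt everywhere.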

12.   Note that process="false" for PDF and process="true" for dwg.

13.   Inspect the <ignore> node under the <OutsideIn> node. If it looks familiar, it is because it is essentially the same list as the one from TechNet in #1.  It also happens to closely coincide with the entries in the <filetypes> node where process="false"

        <ignore>
            <!--
            A list of extensions that OutsideIn should ignore. NOTE, the extension
            here is the format detected (or bypassed) extension of a document and may not
            necessarily correspond to the URL extension.
            -->
            <ext>.doc</ext>
            <ext>.docm</ext>
            <ext>.docx</ext>
            <ext>.dotx</ext>
            <ext>.dot</ext>
            <ext>.eml</ext>
            <ext>.html</ext>
            <ext>.mht</ext>
            <ext>.msg</ext>
            <ext>.nws</ext>
            <ext>.odp</ext>
            <ext>.ods</ext>
            <ext>.odt</ext>
            <ext>.one</ext>
            <ext>.pdf</ext>
            <ext>.pot</ext>
            <ext>.pps</ext>
            <ext>.ppt</ext>
            <ext>.pptm</ext>
            <ext>.pptx</ext>
            <ext>.pub</ext>
            <ext>.rtf</ext>
            <ext>.txt</ext>
            <ext>.vdx</ext>
            <ext>.vsd</ext>
            <ext>.vss</ext>
            <ext>.vst</ext>
            <ext>.vtx</ext>
            <ext>.xlb</ext>
            <ext>.xlc</ext>
            <ext>.xls</ext>
            <ext>.xlsb</ext>
            <ext>.xlsm</ext>
            <ext>.xlsx</ext>
            <ext>.xlt</ext>
            <ext>.xml</ext>
            <ext>.xps</ext>
            <ext>.zip</ext>
        </ignore>
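
Comparing the <ignore> list against a reference list such as the TechNet one in #1 is easy to get wrong by eye, so here is a small Python sketch of the comparison. The function names ignored_extensions and compare are hypothetical, and both the XML and the reference list below are shortened samples made up for illustration, not the full lists.

```python
import xml.etree.ElementTree as ET

def ignored_extensions(outside_in):
    """Collect the <ext> values under <ignore>, lower-cased."""
    return {ext.text.strip().lower()
            for ext in outside_in.iterfind("ignore/ext") if ext.text}

def compare(ignored, reference):
    """Return (only_in_ignore, only_in_reference) as sorted lists."""
    reference = {r.strip().lower() for r in reference}
    return sorted(ignored - reference), sorted(reference - ignored)

# Shortened sample of the <ignore> node; the real list is much longer.
sample = """<OutsideIn>
  <ignore>
    <ext>.doc</ext>
    <ext>.pdf</ext>
    <ext>.xml</ext>
  </ignore>
</OutsideIn>"""

# Shortened, made-up reference list in the same comma-separated style as #1.
reference = ".doc,.pdf,.mhtml".split(",")

only_ignore, only_ref = compare(ignored_extensions(ET.fromstring(sample)), reference)
print("Only in <ignore>:", only_ignore)
print("Only in reference:", only_ref)
```

With the real converter_rules.xml, the <OutsideIn> element would come from the parsed file, and the reference would be the full comma-separated list pasted from #1.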

Conclusion: Understanding the processor stages and how they work within the pipeline, plus a little research, can reveal a lot of information.   If you are wondering whether a file type is covered by the AdvancedFilterPack, converter_rules.xml is the place to start. There are 456 different file types covered by the Advanced Filter Pack (counting each flavor separately, e.g. WordStar 5.0, 4.0, and 2000, or, as in this example, two versions of AutoCAD), which is way too many to list in this blog.  Though converter_rules.xml provides a list of file types covered by the Advanced Filter Pack, the best way to know if a file type is covered is to try to index it with the Advanced Filter Pack enabled.

KORITFW

1 comment:

  1. Thanks for the excellent post. In Step 12, it appears to me the correct statement would be:

    Note that process="false" for PDF and process="true" for dwg.
