Wednesday, May 4, 2011

Manipulating crawled properties in the FAST Search (FS4SP) pipeline

Topic: SharePoint 2010, FAST Search for SharePoint 2010 (FS4SP), processors.Basic Module, CustomerExtensibility, <load module="processors.Basic" class="AttributeCopy" />
Subject:  Manipulating crawled properties in the FS4SP pipeline.
Problem: I have a crawled property I want to manipulate.  What is the best way to do this?
Response: All of the documentation will tell you that “CustomerExtensibility” is the only way to do this, but what happens if the problem statement is:
1.      The crawled property I need to manipulate is causing an issue in the pipeline before the “CustomerExtensibility” stage?

The answer is yes: there are other ways to manipulate crawled properties, and you may run into cases where the “CustomerExtensibility” stage will not work.  The alternative methods are not documented and are therefore deemed “Not Supported” by Microsoft.  If you have other issues later on, temporarily remove any unsupported changes from your pipeline before calling support, or they may decline to help, even though the technique uses the exact modules Microsoft already uses in the pipeline.
In the Solution/Example below I use a real problem a customer was having, for which Microsoft didn’t have a solution.  I could have rearranged the FS4SP pipeline to move “CustomerExtensibility” earlier in the processing order and fixed the issue, but what would happen if I also needed “CustomerExtensibility” further down the pipeline to manipulate crawled properties that are set by other processor stages?  I could probably duplicate the “CustomerExtensibility” stage and have two in my pipeline, but that seems like an awful lot of work to fix a simple issue.
If you have read my posts on “Implementing the Windows 2008 TIFF IFilter and FAST Search for SharePoint 2010” (http://fs4sp.blogspot.com/2011/04/implementing-windows-2008-tiff-ifilter.html) and “FS4SP and User_Converter_Rules.xml” (http://fs4sp.blogspot.com/2011/04/fs4sp-and-userconverterrulesxml.html), you may run into this issue if you are using a custom crawler or protocol handler.  User_Converter_Rules.xml uses the crawled property “FileExtension”, which the SharePoint crawler sets based on the access URL of the item being crawled.

In this case the custom crawler is indexing a records-based system which contains attachments.  The attachment is a non-OCR’d TIFF.  “User_Converter_Rules.xml” has been modified to use the Windows 2008 TIFF IFilter, which will OCR the TIFF file before indexing.
The custom crawler uses as its access URL a custom aspx page which displays the record and any associated attachments (in this case .tif).  As a result, the FileExtension crawled property is set to “ASPX”, so User_Converter_Rules.xml never matches the true “TIFF” extension.
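To make the mismatch concrete, here is a minimal Python sketch of deriving an extension from an access URL. It is only an illustration: the helper name extension_from_url is hypothetical, and I am assuming (not asserting) that the crawler's logic is roughly "take the extension of the URL path and upper-case it", which is what the Spy output in this post suggests.

```python
import os
from urllib.parse import urlparse


def extension_from_url(url: str) -> str:
    """Hypothetical helper: return the upper-cased extension of the URL path."""
    path = urlparse(url).path        # "/Record.aspx" (the query string is dropped)
    ext = os.path.splitext(path)[1]  # ".aspx"
    return ext.lstrip(".").upper()


# The launch page, not the attachment, decides the extension.
print(extension_from_url("http://myappserver/Record.aspx?ID=1"))
```

The point is simply that nothing in the launch URL mentions the TIFF attachment, so any rule keyed on the extension sees the aspx page instead.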
Solution/Example:
1.      The processors.Basic module is used in several stages of the FS4SP pipeline.  It obviously works, so why not use it to resolve the issue?

2.      Edit the pipeline configuration
a.      In Windows Explorer navigate to <FAST Search Install Drive>\FASTSearch\etc

b.      Edit pipelineconfig.xml

c.      Search for nodes in which “processors.Basic” is the load module

Results:
<processor name="DocInit" type="general">
   <load module="processors.Basic" class="DocInit"/>
</processor>
<processor name="Sizer" type="general">
   <load module="processors.Basic" class="Sizer"/>
</processor>
….
<processor name="DocumentSecurityUnknown" type="general">
   <load module="processors.Basic" class="DefaultValue"/>
  
</processor>

3.       The “processors.Basic” module is used in processor stages such as “DocInit”, “Sizer”, and “DocumentSecurityUnknown”, which load its “DocInit”, “Sizer”, and “DefaultValue” classes respectively.
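If you would rather not search the file by eye, the lookup in step 2c can be sketched in Python. This is a rough sketch, not a supported tool: the function name stages_using_module is my own, the SAMPLE string only mirrors the fragments shown above, and the stage loading “processors.Hypothetical” is made up purely to show that non-matching stages are skipped.

```python
import xml.etree.ElementTree as ET

# Inline sample mirroring the fragments above; the last stage is invented.
SAMPLE = """
<processors>
  <processor name="DocInit" type="general">
    <load module="processors.Basic" class="DocInit"/>
  </processor>
  <processor name="Sizer" type="general">
    <load module="processors.Basic" class="Sizer"/>
  </processor>
  <processor name="DocumentSecurityUnknown" type="general">
    <load module="processors.Basic" class="DefaultValue"/>
  </processor>
  <processor name="Other" type="general">
    <load module="processors.Hypothetical" class="Other"/>
  </processor>
</processors>
"""


def stages_using_module(root, module):
    """Return (stage name, class) pairs for stages that load the given module."""
    found = []
    for proc in root.iter("processor"):
        load = proc.find("load")
        if load is not None and load.get("module") == module:
            found.append((proc.get("name"), load.get("class")))
    return found


root = ET.fromstring(SAMPLE.strip())
for name, cls in stages_using_module(root, "processors.Basic"):
    print(name, cls)
```

Against the real file you would build root with ET.parse on your own copy of pipelineconfig.xml and call getroot(); work on a copy, not the live file.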

4.      The “processors.Basic” module has 19 classes available.

5.      We will use <load module="processors.Basic" class="AttributeCopy" /> to solve this problem.

6.      Modify the FS4SP pipeline to implement a new processor stage
a.      In Windows Explorer navigate to <FAST Search Install Drive>\FASTSearch\etc

b.      Edit pipelineconfig.xml

c.      In the <processors> node add:

<processor name="FixExtension" type="general">
   <load module="processors.Basic" class="AttributeCopy"/>
   <input>
      <attribute name="&lt;Input&gt;"/>
   </input>
   <output>
      <attribute name="&lt;Output&gt;"/>
   </output>
   <config>
      <!-- Input: the crawled property to copy from -->
      <param name="Input" value="2014D5E9-5DCB-43D0-BCC8-090D134A29F2:MYFILEEXTENSION:31" type="str"/>
      <!-- Output: the crawled property to overwrite -->
      <param name="Output" value="0B63E343-9CCC-11D0-BCDB-00805FCCCE04:FileExtension:31" type="str"/>
      <param name="Attributes" value="" type="str"/>
   </config>
</processor>

d.      The “Input” value is the crawled property whose value replaces the value of the “Output” crawled property.  In this case “MYFILEEXTENSION” is a custom crawled property associated with my crawler, and “FileExtension” is the built-in crawled property populated by SharePoint and used by User_Converter_Rules.xml.

e.      Side note:  The FS4SP pipeline can be very particular regarding case.  If you are not 100% sure of your crawled property names for the Input/Output parameters, run the crawl first with a Spy stage enabled, then copy the values directly from your Spy trace into the Input/Output values.
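Conceptually, the new stage copies one attribute's value onto another. The Python sketch below models that idea on a plain dictionary. It is not the code inside processors.Basic: attribute_copy is a hypothetical stand-in, and skipping the copy when the input is missing is my assumption about the behavior. It also shows why the case warning in the side note matters.

```python
INPUT = "2014D5E9-5DCB-43D0-BCC8-090D134A29F2:MYFILEEXTENSION:31"
OUTPUT = "0B63E343-9CCC-11D0-BCDB-00805FCCCE04:FileExtension:31"


def attribute_copy(document: dict, input_name: str, output_name: str) -> dict:
    """Illustrative only: copy one attribute's value onto another, if present."""
    if input_name in document:  # exact, case-sensitive match on the name
        document[output_name] = document[input_name]
    return document


# Same attribute values as the Spy trace taken before the new stage.
doc = {"url": "http://myappserver/Record.aspx?ID=1", OUTPUT: "ASPX", INPUT: "tif"}
attribute_copy(doc, INPUT, OUTPUT)
print(doc[OUTPUT])   # the custom property's value now sits in FileExtension

# A name with the wrong case matches nothing, so FileExtension is left alone.
doc2 = {OUTPUT: "ASPX", INPUT: "tif"}
attribute_copy(doc2, INPUT.lower(), OUTPUT)
print(doc2[OUTPUT])
```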


7.      Add the new Stage to the Pipeline
a.      Navigate to the <pipeline name="Office14 (webcluster)" default="1"> node

b.      Modify the <!-- Document Conversion --> section of the pipeline to look like:
      <!-- Document Conversion -->
      <processor name="FixExtension"/>
      <processor name="AttachmentsHandler"/>
      <processor name="UTFDetectorConverter"/>
      <processor name="FastFormatDetector"/>
      <processor name="FormatDetector"/>
      <processor name="XMLMapper"/>
      <processor name="SimpleConverter"/>
      <processor name="PDFConverter"/>
      <processor name="IFilterConverter"/>
      <processor name="SearchExportConverter"/> 

c.      Optional: Add two Spy stages around the new processor stage.
        i.     I added them to show the results.
        ii.     If you are unfamiliar with using the Spy stage for debugging, see my post on “Spy Stage in the FAST Search (FS4SP) pipeline” (http://fs4sp.blogspot.com/2011/04/spy-stage-in-fast-search-for-sharepoint_05.html)

     <processor name="Spy1"/>
     <processor name="FixExtension"/>
     <processor name="Spy2"/>
     <processor name="AttachmentsHandler"/>

d.      Save the changes and reset the pipeline
        i.     From the FAST Command Shell, as Administrator, issue:
1.      “psctrl reset”
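A quick sanity check, ideally run just before the reset above, is to confirm that every stage the pipeline references is actually defined and that the new stage runs ahead of the converters. The sketch below is a hedged example: check_pipeline is a hypothetical helper, and the SAMPLE layout (a processors node and a pipelines node under one pipelineconfig root) is my assumption about the file's overall shape, trimmed to a few stages.

```python
import xml.etree.ElementTree as ET

# Assumed overall layout of pipelineconfig.xml, trimmed to a few stages.
SAMPLE = """
<pipelineconfig>
  <processors>
    <processor name="FixExtension" type="general"/>
    <processor name="AttachmentsHandler" type="general"/>
    <processor name="IFilterConverter" type="general"/>
  </processors>
  <pipelines>
    <pipeline name="Office14 (webcluster)" default="1">
      <processor name="FixExtension"/>
      <processor name="AttachmentsHandler"/>
      <processor name="IFilterConverter"/>
    </pipeline>
  </pipelines>
</pipelineconfig>
"""


def check_pipeline(root, pipeline_name):
    """Return (stage order, stages referenced by the pipeline but never defined)."""
    definitions = next(root.iter("processors"))
    defined = {p.get("name") for p in definitions.iter("processor")}
    pipeline = next(p for p in root.iter("pipeline") if p.get("name") == pipeline_name)
    order = [p.get("name") for p in pipeline.iter("processor")]
    missing = [name for name in order if name not in defined]
    return order, missing


root = ET.fromstring(SAMPLE.strip())
order, missing = check_pipeline(root, "Office14 (webcluster)")
print("undefined stages:", missing)
print("FixExtension first:", order.index("FixExtension") < order.index("IFilterConverter"))
```

A stage name that is misspelled in the pipeline section shows up in the undefined list, which is cheaper to find here than after a failed crawl.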

e.      Crawl the content source

f.       Results from Spy1, which runs before the new “FixExtension” stage.  The “FileExtension” crawled property was set to “ASPX” based on the url property, which means User_Converter_Rules.xml will not fire the custom TIFF IFilter converter.


#### ATTRIBUTE url <type 'str'>: http://myappserver/Record.aspx?ID=1
#### ATTRIBUTE 0B63E343-9CCC-11D0-BCDB-00805FCCCE04:FileExtension:31 <type 'str'>: ASPX
#### ATTRIBUTE 2014D5E9-5DCB-43D0-BCC8-090D134A29F2:MYFILEEXTENSION:31 <type 'str'>: tif

g.      Results from Spy2, which runs after the “FixExtension” stage.  Note that the “FileExtension” crawled property now contains the value “User_Converter_Rules.xml” needs to identify the tif file.
#### ATTRIBUTE url <type 'str'>: http://myappserver/Record.aspx?ID=1
#### ATTRIBUTE 0B63E343-9CCC-11D0-BCDB-00805FCCCE04:FileExtension:31 <type 'str'>: tif
#### ATTRIBUTE 2014D5E9-5DCB-43D0-BCC8-090D134A29F2:MYFILEEXTENSION:31 <type 'str'>: tif
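
Comparing two Spy traces by eye gets tedious once there are more than a handful of attributes. Here is a small Python sketch that parses the trace lines shown above and reports what changed. The regular expression is fitted only to the lines quoted in this post, so treat the format as an assumption; parse_spy and changed are hypothetical helper names.

```python
import re

# Fitted to the trace lines quoted in this post; real traces may contain more.
SPY_LINE = re.compile(r"^#### ATTRIBUTE (?P<name>\S+) <type '(?P<type>\w+)'>: (?P<value>.*)$")

SPY1 = """\
#### ATTRIBUTE url <type 'str'>: http://myappserver/Record.aspx?ID=1
#### ATTRIBUTE 0B63E343-9CCC-11D0-BCDB-00805FCCCE04:FileExtension:31 <type 'str'>: ASPX
#### ATTRIBUTE 2014D5E9-5DCB-43D0-BCC8-090D134A29F2:MYFILEEXTENSION:31 <type 'str'>: tif
"""

SPY2 = """\
#### ATTRIBUTE url <type 'str'>: http://myappserver/Record.aspx?ID=1
#### ATTRIBUTE 0B63E343-9CCC-11D0-BCDB-00805FCCCE04:FileExtension:31 <type 'str'>: tif
#### ATTRIBUTE 2014D5E9-5DCB-43D0-BCC8-090D134A29F2:MYFILEEXTENSION:31 <type 'str'>: tif
"""


def parse_spy(text):
    """Map attribute name to value for every line matching the trace pattern."""
    attrs = {}
    for line in text.splitlines():
        match = SPY_LINE.match(line.strip())
        if match:
            attrs[match.group("name")] = match.group("value")
    return attrs


def changed(before, after):
    """Return {name: (old, new)} for attributes whose value differs."""
    return {k: (before.get(k), v) for k, v in after.items() if before.get(k) != v}


diff = changed(parse_spy(SPY1), parse_spy(SPY2))
for name, (old, new) in diff.items():
    print(name, old, "->", new)
```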

Conclusion: Understanding the pipeline’s OOB processor stages and how they work can be very beneficial in solving problems.

 
KORITFW

2 comments:

  1. Good post! And I agree that using the native Python stages can be useful, but doing it in an "unsupported" way increases your solution's complexity and makes it harder to manage.

    For your current example you could have had the URL point to the attachment in order to kick off the IFilter, and then used a custom managed property for your launch URL in the UI.

    You would put more logic in the UI, but in a supported manner, and it would also be easier to manage.

    When fixes and service packs arrive for FS4SP you have to make sure that your edit to the pipeline is not overwritten.

    I totally agree that more flexibility in the pipeline in a supported manner would be nice, and it will always be a question of calculating the cost of the final implementation.

  2. Nice example! I am looking to change the crawling behavior of SharePoint search to meet a business requirement. As you might be aware, SharePoint 2010 only crawls the latest approved version of a document, and only this version is visible to users in search. It neither crawls any historic version of a document nor makes them visible in search. I have a business requirement for SharePoint search to cover all versions of a document. I was trying to figure out whether I can change the crawl behavior to enable version search, but so far I have found no solution.
    I have, however, found a product which offers version search for SharePoint 2010 and FAST, and their demo at http://stoictech.sharepoint.com/Pages/VersionSearch.aspx is also impressive.
    Is there any better solution or product available to meet this requirement?

    Thanks,
    Vikas
