Thursday, April 7, 2011

File Format and FAST Search for SharePoint 2010

Subject: FastFormatDetector, FAST Search for SharePoint 2010 (FS4SP) pipeline, “mime” and “format” crawled properties.

Problem:  What is the best way to determine file format in the FAST pipeline?
Solution: Many crawled properties can be used to guess the file type of an item as it passes through the FAST pipeline, but only one is fully reliable. The crawled properties “mime” and “format” are both set by the processor stage “FastFormatDetector”. “format” is good, but it comes in as a close second to “mime”. Other crawled properties, such as “FileExtension”, are unreliable.
Question: Why is FileExtension, which is set by the SharePoint crawler, unreliable?
Answer: The SharePoint crawler doesn’t validate the crawled property. To show this, I will use a specific example.
1.      Set up two Spy stages in the “pipelineconfig.xml”: one before the “FastFormatDetector” stage and one directly after it.
a.      If you haven’t done this before, see my post at http://fs4sp.blogspot.com/2011/04/spy-stage-in-fast-search-for-sharepoint_05.html. You will quickly understand why I wrote that post first: it was intended as groundwork for this one.

2.      Enable the “AdvancedFilterPack” on FAST
a.      Remember, the “AdvancedFilterPack” must be enabled to index the physical contents of a PDF

3.      Create a folder which is accessible to your SharePoint Crawler
a.      Place a PDF in the folder
b.      Rename the extension to “.doc”
c.      You will quickly notice that Windows now treats the file as a Word document, even though its contents are still a PDF

4.      Create a new FileShare Content Source under the FAST Content SSA and Crawl the Folder

5.      Check your Crawl Log
a.      The crawl log will show no errors.

6.      Compare the two Spy stage logs produced by the pipeline.
a.      Note that the “format” and “mime” type are NOT available in the 1st Spy log but are in the 2nd Spy log as they are set by the “FastFormatDetector”.
b.      Find the following 3 crawled properties from the 2nd Spy log.

#### ATTRIBUTE 0B63E343-9CCC-11D0-BCDB-00805FCCCE04:FileExtension:31 <type 'str'>: doc
#### ATTRIBUTE format <type 'str'>: Adobe PDF
#### ATTRIBUTE mime <type 'str'>: application/pdf 

7.      The “FastFormatDetector” has done its job and determined the correct file type.

8.      Open your Search Center and search based on the physical content of the PDF.
a.      Sure enough, FAST has done its job, though the Search Center still thinks the file is a Word document in the CoreResults, based on FileExtension.
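
If you have many items to check, the comparison in step 6 can be scripted. Below is a minimal Python sketch that reads Spy log text, picks out the "#### ATTRIBUTE" lines in the shape shown above, and flags an item whose FileExtension disagrees with the detected “mime”. The function names and the small extension-to-mime table are my own illustration and are not part of FS4SP. The script is meant to run against a saved log file, outside the pipeline.

```python
import re

# Matches Spy log lines shaped like the ones shown in step 6, e.g.
#   #### ATTRIBUTE mime <type 'str'>: application/pdf
ATTRIBUTE_LINE = re.compile(
    r"^#### ATTRIBUTE (?P<name>.+?) <type '[^']+'>: ?(?P<value>.*)$"
)

# Illustrative table only (an assumption): extend it for your own content.
EXPECTED_MIME = {
    "pdf": "application/pdf",
    "doc": "application/msword",
}


def short_name(name):
    """Reduce 'propertySet:name:varType' to 'name'; leave other names as-is."""
    parts = name.split(":")
    return parts[1] if len(parts) == 3 else name


def parse_spy_attributes(log_text):
    """Return a dict of attribute name -> value found in Spy log text."""
    attributes = {}
    for line in log_text.splitlines():
        match = ATTRIBUTE_LINE.match(line.strip())
        if match:
            attributes[short_name(match.group("name"))] = match.group("value").strip()
    return attributes


def extension_matches_mime(attributes):
    """True/False if the extension is in the table, None if it is not."""
    extension = attributes.get("FileExtension", "").lower()
    expected = EXPECTED_MIME.get(extension)
    if expected is None:
        return None
    return attributes.get("mime", "").lower() == expected
```

For the renamed PDF in this example, parse_spy_attributes gives “doc” for FileExtension and “application/pdf” for mime, so extension_matches_mime reports a mismatch.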

Summary: The “FastFormatDetector” stage of the pipeline does an excellent job of determining the file type of an item because it does not rely on crawled properties that can simply be wrong, such as the file extension. Instead, it opens the item to determine the file type. So if you are going to use a feature such as the “CustomerExtensibility” stage of the pipeline and need the file type, the only reliable crawled property to use is “mime”.
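
To illustrate why looking inside the item beats trusting its name, here is a small Python sketch of content-based detection using well-known file signatures. This is not how “FastFormatDetector” is implemented (I have not seen its internals), and the function name and return labels are hypothetical. It only shows the general idea: a PDF starts with the bytes "%PDF-" no matter what its extension says.

```python
# Well-known leading byte signatures ("magic numbers").
PDF_SIGNATURE = b"%PDF-"
# OLE2 compound file, the container used by legacy Office files (.doc, .xls, .ppt).
OLE2_SIGNATURE = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"


def sniff_format(data):
    """Guess a coarse format label from the first bytes; None if unknown."""
    if data.startswith(PDF_SIGNATURE):
        return "pdf"
    if data.startswith(OLE2_SIGNATURE):
        return "ole2"
    return None


def sniff_file(path):
    """Read only the first few bytes of a file and sniff them."""
    with open(path, "rb") as handle:
        return sniff_format(handle.read(8))
```

Calling sniff_file on the renamed file from step 3 would still report "pdf", because the rename changed only the name and not the contents.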
If you are going to use the “mime” or “format” properties in the FAST pipeline “CustomerExtensibility” stage, here are the crawled property definitions:
<CrawledProperty propertySet="7262a2f9-30d0-488f-a34a-126584180f74" varType="31" propertyName="mime"/>
<CrawledProperty propertySet="7262a2f9-30d0-488f-a34a-126584180f74" varType="31" propertyName="format"/>
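
Once those properties are passed to your own program, reading the value is straightforward. The Python sketch below assumes the stage hands your program an XML document whose root contains CrawledProperty elements carrying the same propertySet and propertyName attributes as above, with the value as the element text. Treat that input shape as an assumption and check it against your own environment. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Property set used by the "mime" and "format" crawled properties above.
FORMAT_DETECTOR_PROPERTY_SET = "7262a2f9-30d0-488f-a34a-126584180f74"


def read_crawled_property(xml_text, property_name,
                          property_set=FORMAT_DETECTOR_PROPERTY_SET):
    """Return the text of the first matching CrawledProperty, or None."""
    root = ET.fromstring(xml_text)
    for element in root.iter("CrawledProperty"):
        same_name = element.get("propertyName") == property_name
        same_set = element.get("propertySet", "").lower() == property_set.lower()
        if same_name and same_set:
            return (element.text or "").strip()
    return None


if __name__ == "__main__":
    # Hand-written sample input (assumed shape, not captured from a real run).
    sample = (
        "<Document>"
        '<CrawledProperty propertySet="7262a2f9-30d0-488f-a34a-126584180f74" '
        'varType="31" propertyName="mime">application/pdf</CrawledProperty>'
        "</Document>"
    )
    print(read_crawled_property(sample, "mime"))
```

Branching on the “mime” value rather than on FileExtension means the renamed PDF from the example is still handled as a PDF.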
