Saturday, April 16, 2011

FAST Search for SharePoint 2010 (FS4SP) and Cleansing of Insignificant Data

Topic: SharePoint 2010, FS4SP and Warnings regarding insignificant data
Subject:  SharePoint 2010, FS4SP, FastHtmlParser
Problem: Why do I get the following crawl warnings? What do they mean? Did my content get indexed?
The FAST Search backend reported warnings when processing the item. ( Document 'file://server/c$/Large/1.pdf' has been truncated since size 2240743 is larger than threshold 2097152. Size is now 2097135 )
The FAST Search backend reported warnings when processing the item. ( Document 'file://server/c$/Large/1295017221.zip' has been cleansed of insignificant numbers since size 3015965 is larger than threshold 1048576. Size is now 3015233;Document 'file://server/c$/Large/1295017221.zip' has been cleansed of repeated text parts since size 3015233 is larger than threshold 2097152. Size is now 2970235;Document 'file://server/c$/Large/1295017221.zip' has been truncated since size 2970235 is larger than threshold 2097152. Size is now 2097147 )
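For triaging crawl logs, a small helper can pull the document URL, the action taken, and the sizes out of these messages. This is a hypothetical Python sketch of mine, not part of FS4SP; the regex assumes the exact message format shown above, where a single warning may contain several ';'-separated events.

import re

# Matches one cleansing/truncation event inside an FS4SP backend warning.
# The message format is taken verbatim from the crawl log samples above.
WARNING_RE = re.compile(
    r"Document '(?P<doc>[^']+)' has been (?P<action>[a-z ]+?) "
    r"since size (?P<size>\d+) is larger than threshold (?P<threshold>\d+)\. "
    r"Size is now (?P<new_size>\d+)"
)

def parse_warnings(message):
    """Yield one dict per cleansing/truncation event in a crawl warning."""
    for m in WARNING_RE.finditer(message):
        yield {
            "document": m.group("doc"),
            "action": m.group("action"),
            "original_size": int(m.group("size")),
            "threshold": int(m.group("threshold")),
            "new_size": int(m.group("new_size")),
        }

sample = ("Document 'file://server/c$/Large/1.pdf' has been truncated "
          "since size 2240743 is larger than threshold 2097152. Size is now 2097135")
for event in parse_warnings(sample):
    print(event)  # -> document, action='truncated', the sizes and the threshold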

Response: These warnings are generated by the “FastHtmlParser”. FS4SP has some default size limitations, which can be modified in the pipelineconfig.xml file. I won’t go too deep into the details (it gets complex), but I will give some examples to show how this works and what these messages mean.
There is a cleansing threshold limit determined by (max-index-size / FIXML expansion factor) * threshold value. By default:
max-index-size = 16 MB
FIXML expansion factor = 8
threshold value = 1
Threshold limit = (16 MB / 8) * 1 = 2 MB.
This means FAST wants the HTML body of crawled items (the HTML is produced by the IFilterConverter stages; see my post on the IFilterConverter at http://fs4sp.blogspot.com/2011/04/debugging-ifilterconverter-ifilter2html.html) to fit within 2 MB. The expansion factor leaves room in the final FIXML for additional data added by features such as lemmatization.
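As a quick sanity check, the default numbers line up exactly with the 2097152-byte threshold in the warnings above (a trivial sketch; the variable names are mine, not actual pipelineconfig.xml attribute names):

max_index_size = 16 * 1024 * 1024    # 16 MB default max index size
fixml_expansion_factor = 8           # headroom for lemmas and other FIXML additions
threshold_value = 1                  # default multiplier

cleansing_threshold = (max_index_size // fixml_expansion_factor) * threshold_value
print(cleansing_threshold)           # 2097152 bytes = 2 MB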
If a document exceeds the cleansing threshold limit, the “FastHtmlParser” attempts to cleanse the item content to make it fit, in the following order (a rough code sketch follows the list):
1.      First, the “FastHtmlParser” removes text chunks consisting of numbers. (Searching on numbers rarely works well in a search application anyway, since it matches far too many results.)

2.      If the body is still too large, the “FastHtmlParser” attempts to remove repeated text parts.

3.      Finally, if the body still exceeds the threshold limit, the “FastHtmlParser” truncates everything beyond it.
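A minimal sketch of that cascade (my own simplification, not FS4SP code; the real chunking and repeated-text detection inside the “FastHtmlParser” are more sophisticated):

THRESHOLD = 2097152  # the 2 MB default cleansing threshold derived above

def cleanse(body):
    # Stage 1: within each line, drop purely numeric tokens.
    if len(body) > THRESHOLD:
        body = "\n".join(
            " ".join(tok for tok in line.split() if not tok.isdigit())
            for line in body.splitlines()
        )
    # Stage 2: keep only the first occurrence of each distinct line
    # (a crude stand-in for repeated-text-part removal).
    if len(body) > THRESHOLD:
        seen, kept = set(), []
        for line in body.splitlines():
            if line not in seen:
                seen.add(line)
                kept.append(line)
        body = "\n".join(kept)
    # Stage 3: hard truncation at the threshold.
    if len(body) > THRESHOLD:
        body = body[:THRESHOLD]
    return body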

Examples:
1.      Place two PDFs in a folder.
a.      Use one that is roughly 2.1–2.5 MB.
b.      Use an 8 MB+ PDF for the second file.

2.      Set up a file share crawl of the folder within your FAST Content SSA.

3.      The smaller PDF’s raw content would exceed the threshold limit but would more than likely be cleansed enough to fit within it. In this case you would get a warning that “Document X has been cleansed of insignificant numbers”.

4.      The larger PDF will most likely be truncated, although depending on its content the “FastHtmlParser” could in theory cleanse enough data to fit within the threshold limit. The warning will probably include a combination of “Document X has been cleansed of insignificant numbers” and “Document X has been cleansed of repeated text parts”.

5.      If you don’t get the warnings you are looking for, try different documents.

6.      Open the smaller PDF.
a.      Find a unique search term that occurs at the beginning of the PDF.
b.      Find a unique search term that occurs at the end of the PDF.
c.      Open a Search Center and search for both terms.

7.      If the smaller PDF has not been truncated, both searches performed in Step 6 will find it.

8.      Open the larger PDF and perform Step 6 again.

9.      Because the larger PDF has been truncated, you will only find it with the first search term; anything beyond the threshold limit was truncated and is therefore not searchable.
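Step 9 comes down to byte offsets: a term whose first occurrence in the extracted HTML body falls past the truncation point never makes it into the index. A rough model (again my simplification, using the default threshold and ignoring the offset shifts the cleansing stages introduce):

THRESHOLD = 2097152  # 2 MB default cleansing/truncation threshold

def likely_searchable(html_body, term):
    # Findable only if the term occurs before the truncation point.
    offset = html_body.find(term)
    return offset != -1 and offset < THRESHOLD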

Conclusion: Because FS4SP has configuration settings that limit content size, you may see these messages frequently on some documents. Depending on the content sources being crawled, you may never run into the warning: a repository of typical Office documents (10–250 KB each) will not produce such warnings. The majority of the time I see these warnings is on PDF documents, which tend to be larger. Let’s hope the significant search terms do not occur at the end of the larger documents; otherwise it is time to start changing the default configuration and testing the performance of the new settings.

KORITFW

2 comments:

  1. Exactly what I was looking for, thanks! We have a custom solution to crawl a BCS component and are running into these messages. I'm glad I landed on your article.

  2. Hi Eric,
    thanks for explaining the error messages and warnings. Have you tested larger thresholds yet?
