Monday, April 18, 2011

Debugging IFilterConverter – IFilter2Html

Topic: SharePoint 2010, FS4SP, FastHtmlParser, IFilter2Html, pdftotext
Subject:  SharePoint 2010, FS4SP, Debugging IFilters
Problem: Some items that I submit to the FS4SP pipeline fail to convert. How do I debug them?
Response: In my post “FAST Search for SharePoint 2010 (FS4SP) and Cleansing of Insignificant Data” (http://fs4sp.blogspot.com/2011/04/fast-search-for-sharepoint-2010-fs4sp.html) I talk a lot about the HTML body being too large. This post explains what I mean by the HTML body. If a document errors out in the document processor pipeline because it fails to convert, you can process it step by step to try to determine why it cannot be converted.

Solution\Example:
For this example I have placed a PDF document at C:\Large\1.pdf. Replace the path with your own source file.
1.      Place a document which fails to convert in a folder on a FAST server

2.      Open the FAST Command shell as Administrator

3.      Execute: docpush -c <collection name, typically "sp"> -l verbose "C:\Large\1.pdf"
a.      If this document failed to convert when it was crawled, it should also fail at this step.

4.      Execute: doclog -a
a.      This will produce crawled properties associated with the document but will typically not give much information on why the document failed to convert.

5.      Execute: IFilter2Html "C:\Large\1.pdf" > 1.html
a.      This performs the exact step which the “IFilterConverter” stage of the pipeline is attempting to accomplish.
b.      It attempts to turn the PDF into readable text in the form of HTML.

6.      Open the file “1.html” either in Notepad (preferred) or Internet Explorer.
a.      The output is what the “IFilterConverter” stage produces.
b.      It will look like an HTML document:
                                                    i.     <html><head><body></body></head></html>

7.      Examine the <body> tag of the HTML output.
a.      Look for issues with the document

8.      If you are coming to this post from the post on “Insignificant Data”, note the structure of the output from the “IFilterConverter” stage. The body can be very large depending on the document. If the content of <body></body> exceeds the threshold limit, it will be cleansed later in the pipeline.

9.      Another, less useful, command for PDFs is:
a.      Execute: pdftotext "C:\Large\1.pdf" pdf.txt
b.      This will convert a PDF to text.
                                                    i.     This could lead you to the underlying problem but “IFilter2Html” is a much better source to peruse.
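As a rough aid for steps 7 and 8, here is a minimal Python sketch (my own helper, not part of FS4SP) that reports how large the <body> of the IFilter2Html output is and whether it is empty. The function names decode_output and body_stats are hypothetical. The script does not know the pipeline's threshold, so it only prints sizes for you to compare yourself. It also assumes that the redirect in step 5 may have written the file as UTF-16 with a byte order mark, which some shells do, and falls back to UTF-8 otherwise.

```python
import re
import sys

# Matches the first <body ...>...</body> pair, case-insensitively, across lines.
BODY_RE = re.compile(r"<body\b[^>]*>(.*?)</body\s*>", re.IGNORECASE | re.DOTALL)


def decode_output(raw: bytes) -> str:
    """Decode converter output, honouring a byte order mark if one is present.

    Assumption: a shell redirect may have produced UTF-16; otherwise treat
    the bytes as UTF-8 and replace anything undecodable.
    """
    if raw.startswith((b"\xff\xfe", b"\xfe\xff")):
        return raw.decode("utf-16")
    if raw.startswith(b"\xef\xbb\xbf"):
        return raw.decode("utf-8-sig")
    return raw.decode("utf-8", errors="replace")


def body_stats(html: str) -> dict:
    """Return basic size information about the <body> content of html."""
    match = BODY_RE.search(html)
    body = match.group(1) if match else ""
    return {
        "has_body": match is not None,
        "body_chars": len(body),
        "body_bytes_utf8": len(body.encode("utf-8")),
        "is_empty": body.strip() == "",
    }


# Only runs when a file path is passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], "rb") as handle:
        print(body_stats(decode_output(handle.read())))
```

Run it as "python body_stats.py 1.html" (the script name is arbitrary). An empty body suggests the converter extracted no text at all, while a very large one points back to the “Insignificant Data” post.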
Conclusion: Breaking down the steps performed in the pipeline may lead to a better understanding of what the “IFilterConverter” stage is actually doing. The converters turn different types of documents into text-readable HTML documents. The output will pass through other pipeline stages, such as the “FastHtmlParser”, and data might be cleansed, but the output will eventually end up in the index and fixml files.
Side Note: I have seen many installations go bad because of permissions, with the built-in IFilter failing to perform. If this is the case, grant full control to the <FAST Service User> on the <FAST Search Install Directory>. I am not sure why it happens, because it hasn’t happened to me, but I have definitely seen it, and the solution is simply to apply the permissions that should have been granted during the initial install.

KORITFW
