Wednesday, May 25, 2011

FS4SP Full and Incremental Crawls and deleted items

Topic: SharePoint 2010 and FS4SP Enterprise Search and the SharePoint Crawler
Subject: SharePoint 2010 and FS4SP and the deletion of items from the index during full and incremental crawls.
Problem: I have switched from using the SharePoint 2010 index to FS4SP as my index.  Deleted items appear to take a different path when they are removed from a content source during full and incremental crawls.  Is this an issue?
This question could just as easily be stated as:
When indexed items are removed from a content source, the access URL does not appear to be deleted from the crawl database after a full or incremental crawl.
Response:
I hear a lot of questions about FAST and item deletion not working.  The SharePoint crawler does appear to produce a slightly different end result when items are removed from the index through full or incremental crawls.  Against FS4SP there would appear to be an issue, but the real question is whether it has any deep negative effects.  In the solution\example section I will show, at a high level, how the items move around the crawl database and what to expect.
For the easiest comparison I will perform the steps against both a SharePoint Search Service Application and a FAST Content Search Service Application at the same time.  You can choose to perform all the steps individually.
Solution\Example:
1.      Choose or set up two Search Service Applications
a.      One SharePoint Search Service Application (example: SSA)
b.      One FAST Search Service Application (example: FSA)

2.      Perform an Index Reset on both SSAs

3.      Determine the crawl database associated with each SSA.  The crawl database can be obtained from the Search Administration Topology for each SSA.

a.      Example: SSA – CrawlDatabase: SSACrawlDB
b.      Example: FSA – CrawlDatabase: FSACrawlDB


4.      Create a Folder accessible to the Search Service Application(s)
a.      Place 5 documents within the folder
b.      Example
        i.     1.doc
        ii.    2.doc
        iii.   3.doc
        iv.    4.doc
        v.     5.doc

5.      In both Search Service Applications:
a.      Set up a new file share content source named “Test Delete”

6.      Execute a Full Crawl of the “Test Delete” content source in both Search Service Applications
a.      Execute the Crawl Log Report jobs
        i.     From CA -> Monitoring -> Review Job Definitions
                1.      Find the timer job “Crawl Log Report for <your SSA>”
                2.      Double-click the job to edit it and click “Run Now”
        ii.    Repeat for both Search Service Applications
**Side note: The crawl log reports now run as timer jobs in SharePoint 2010.  Executing them manually ensures the “MSSCrawlUrlReport” table has been updated.
7.      Open SQL Server Management Studio
a.      Open a new query window and execute the following SQL:
SELECT * FROM [<Your SSA Crawl DB>].dbo.MSSCrawlURL
SELECT * FROM [<Your SSA Crawl DB>].dbo.MSSCrawlUrlReport
SELECT * FROM [<Your SSA Crawl DB>].dbo.MSSCrawlDeletedURL
SELECT * FROM [<Your FSA Crawl DB>].dbo.MSSCrawlURL
SELECT * FROM [<Your FSA Crawl DB>].dbo.MSSCrawlUrlReport
SELECT * FROM [<Your FSA Crawl DB>].dbo.MSSCrawlDeletedURL
b.      You will get a record for each crawled item, including folders, in the “MSSCrawlURL” and “MSSCrawlUrlReport” tables for both SharePoint and FAST.
        i.     In SharePoint you will also get additional records for anchor points.  We will ignore the anchor point records.
        ii.    In SharePoint you will also get deleted records in the “MSSCrawlDeletedURL” table for any anchor point.  We will ignore these records as well.
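Side note: a quick sanity check at this point is to compare raw row counts.  Assuming both content sources crawled the same five files, the SharePoint crawl database should report more rows than the FAST one because of the anchor-point records:

-- Compare total MSSCrawlURL row counts between the two crawl databases
-- (the SSA count includes the extra anchor-point records)
SELECT (SELECT COUNT(*) FROM [<Your SSA Crawl DB>].dbo.MSSCrawlURL) AS SsaRowCount,
       (SELECT COUNT(*) FROM [<Your FSA Crawl DB>].dbo.MSSCrawlURL) AS FsaRowCount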
8.      Test the index.
a.      Open a Search Center for SharePoint and perform a search for “doc”
        i.     You should get back all 5 documents
b.      Open a Search Center for FAST and perform a search for “doc”
        i.     You should get back all 5 documents

9.      Delete the document “5.doc” from the File Share Folder

10.   Execute an incremental crawl and run the “Crawl Log Report” timer jobs for both Search Service Applications.

11.   Repeat step #8
a.      Both indexes should be cleaned of “5.doc”
b.      In the case of SharePoint we could have queried the property database, but with such a small corpus it is just as easy to use a Search Center

12.   Optional
a.      Change the SQL to include a WHERE clause that returns less data and focuses on the single item.

SELECT * FROM [<Your SSA Crawl DB>].dbo.MSSCrawlURL
WHERE DisplayUrl LIKE '%5.doc'
SELECT * FROM [<Your SSA Crawl DB>].dbo.MSSCrawlUrlReport
WHERE DisplayUrl LIKE '%5.doc'
SELECT * FROM [<Your SSA Crawl DB>].dbo.MSSCrawlDeletedURL
WHERE DisplayUrl LIKE '%5.doc'

SELECT * FROM [<Your FSA Crawl DB>].dbo.MSSCrawlURL
WHERE DisplayUrl LIKE '%5.doc'
SELECT * FROM [<Your FSA Crawl DB>].dbo.MSSCrawlUrlReport
WHERE DisplayUrl LIKE '%5.doc'
SELECT * FROM [<Your FSA Crawl DB>].dbo.MSSCrawlDeletedURL
WHERE DisplayUrl LIKE '%5.doc'

13.   Re-query the SharePoint and FAST crawl databases
a.      The “MSSCrawlDeletedURL” table should now have a record reflecting “5.doc”
b.      The “MSSCrawlURL” table should have been updated to reflect:
        i.     Column: “DeletePending=2”
        ii.    Column: “ContentSourceId=-1”
c.      The “MSSCrawlUrlReport” table should have been updated to reflect:
        i.     Column: “IsDeleted=1”
Everything seems to follow the same path at this point.  No surprises.
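To see only the rows flagged for deletion, a narrower query along these lines (run it against either crawl database; the flag columns are the ones listed in step #13) should do it:

-- List rows marked for deletion but not yet purged
SELECT DisplayUrl, DeletePending, ContentSourceId
FROM [<Your FSA Crawl DB>].dbo.MSSCrawlURL
WHERE DeletePending = 2
  AND ContentSourceId = -1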

14.   So if the item has been deleted from the index, why do we still have an “MSSCrawlURL” record with “DeletePending=2”?
a.      It turns out that the “MSSCrawlURL” record does not get removed until the second crawl after the delete, which is probably why “DeletePending=2”.

15.   Re-run the incremental and Crawl Reports

16.   Check the SQL and you will find no change

17.   Re-run the incremental and Crawl Reports for a third time

18.   Check the SQL and you will see where the divergence occurs
a.      After the second incremental crawl following the delete, the “MSSCrawlURL” record is removed from the SharePoint crawl database, but the record remains in the FAST crawl database.
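To see the divergence side by side, the step #12 queries can be narrowed to the flag columns.  After the second post-delete crawl, the SSA query should return no rows while the FSA query still returns the “5.doc” row:

-- SharePoint crawl DB: the delete-pending row should be gone by now
SELECT DisplayUrl, DeletePending, ContentSourceId
FROM [<Your SSA Crawl DB>].dbo.MSSCrawlURL
WHERE DisplayUrl LIKE '%5.doc'

-- FAST crawl DB: the same row lingers with DeletePending=2 and ContentSourceId=-1
SELECT DisplayUrl, DeletePending, ContentSourceId
FROM [<Your FSA Crawl DB>].dbo.MSSCrawlURL
WHERE DisplayUrl LIKE '%5.doc'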

19.   Is it an issue?
a.      Beyond the fact that, over time, the FAST crawl database will become artificially inflated in its storage requirements compared to a SharePoint crawl database, the only way I can see this being an issue is if the deleted access URL record gets re-submitted on each incremental crawl.  Let’s check.

20.   Add the following SQL query
SELECT * FROM [<Your FSA Crawl DB>].dbo.MSSCrawlQueue
SELECT * FROM [<Your FSA Crawl DB>].dbo.MSSCrawlURL
WHERE DisplayUrl LIKE '%5.doc'

21.   There is a little timing involved in executing this test.
a.      Kick off an incremental crawl against the FAST content source
b.      Repeatedly execute the query until you find the “MSSCrawlQueue” table populated
c.      The “MSSCrawlQueue” table is the queue of items that are about to be crawled.

You won’t find “5.doc” getting queued up.  I already knew this, as “DeletePending=2” and “ContentSourceId=-1” prevent it from being re-crawled.
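If polling full result sets gets tedious, a simple count is easier to watch while the incremental crawl runs; it should rise as items queue up and fall back to zero when the crawl finishes, without “5.doc” ever appearing:

-- Watch the crawl queue fill and drain during the incremental crawl
SELECT COUNT(*) AS QueuedItems
FROM [<Your FSA Crawl DB>].dbo.MSSCrawlQueue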
22.   I have tested this using/against:
a.      Full crawls instead of incremental crawls, which behave the same
b.      No CUs installed
c.      The April 2011 CUs, with the same results


Conclusion:  As far as bugs go, this isn’t much of one to write home about.  On extremely large farms (200-300 million items) with a lot of content being deleted, such as from Exchange, you could in theory experience some sluggishness and an inflated storage requirement.  This could be easily rectified with a little testing and a SQL statement targeting “ContentSourceId=-1” and “DeletePending=2”, as sketched below.
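For the curious, the statement I have in mind would look something like the following.  Fair warning: I have not run this in anger, and modifying SharePoint databases directly is unsupported by Microsoft, so treat it strictly as an illustration and test against a copy of the database first.

-- UNSUPPORTED sketch only: purge the orphaned delete-pending rows from the FAST crawl DB
-- Back up the database and verify the row set with a SELECT before deleting anything
DELETE FROM [<Your FSA Crawl DB>].dbo.MSSCrawlURL
WHERE ContentSourceId = -1
  AND DeletePending = 2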
For now we will leave it for what it is … an interesting lesson in how items move through the crawl database, with an eye on the extra storage requirements.
If anyone sees different behavior, I would be interested in hearing about it.

KORITFW

2 comments:

  1. Seems like a small bug, but it should be fixed. Did you report this to MS/FAST?

    There is also an SP1 scheduled for FS4SP sometime in the future; it will be interesting to read the errata on that.

  2. Hello Eric,

    I have checked with MSFT and they confirm the following:

    This is not a bug, but an optimization. Per our discussion with the SQL team, they recommended that we don’t delete from our tables but instead re-use rows from the table. This is not completely deterministic, but in general we try to reuse rows. Obviously, there’s no fix coming because there is no customer impact.

    Regards,

    Fadi Aqqad
