Wednesday, October 5, 2011

FS4SP Primary Indexer Redundancy

Topic: SharePoint 2010, FS4SP, Fast Search Server 2010
Subject:  Recovering from Disaster, Primary Indexer Redundancy, FS4SP Backup Strategy?
Problem:  The FS4SP server for my Primary Index Column has become corrupt and unstable, and my FS4SP farm has encountered an unrecoverable error.  What is the fastest way to recover?  What backup plan should I have put in place?
Response:
Most customers tend to focus on Search uptime and therefore set up backup columns for the primary index columns. If an FS4SP primary column server encounters an unrecoverable error, the backup columns become responsible for serving queries, but what about indexing? Content is changing and new content is being created daily.  We can’t afford to stop indexing, yet we cannot index at all once we have lost a primary column.
There are three responses to this problem and all revolve around repairing or rebuilding the primary index column.
1.      Rebuild the Primary Column Server and perform a Full Re-crawl.
a.      This would be acceptable for a small corpus, but if you are crawling millions of items it could take days or even weeks to recover.

2.      Rebuild the Primary Column Server and use the FS4SP backup and restore procedures from the last backup.
a.      This would probably take less time than option #1 but comes with a different set of issues.
                                                    i.     The Crawl databases from the FAST Content SSA would also need to be backed up and restored.

                                                   ii.     If the Crawl databases were backed up after the FS4SP index, or vice versa, the actual index could contain more or fewer items than the Crawl databases.

                                                  iii.     If the Crawl databases and index can’t be backed up at an idle time so that they have identical content, the preference would be to back up the Crawl databases earlier.  In that case, once the Content SSA is up and running, an incremental crawl would pick up items that already exist in the FS4SP index, but those items would merely replace the ones already there.

3.      Promote a backup index column to become a primary column.

a.      This allows the quickest turnaround to have the index populated with content again.
As option #3 would be the quickest road to recovery, I will use the Solution\Example section to simulate a disaster and recovery.
In this example I will use a three-server FS4SP farm: 1 Admin, 1 Primary Column, and 1 Backup Column, plus three file share folders for crawling content. I will also reset my index to make looking at the index as clean as possible.
Starting FS4SP Farm Deployment

<?xml version="1.0" encoding="utf-8" ?>
<deployment version="14">
   <instanceid>FAST Search Server POC</instanceid>
   <connector-databaseconnectionstring></connector-databaseconnectionstring>

  <!-- Admin Node -->
  <host name="fast1.mydomain.local">
    <admin />
    <webanalyzer server="true" max-targets="2" link-processing="true" lookup-db="true" redundant-lookup="true" />
  </host>

  <!-- Primary Column -->
  <host name="fast2.mydomain.local">
    <content-distributor id="0" />
    <searchengine row="0" column="0" />
    <indexing-dispatcher />
    <query />
  </host>

  <!-- Backup Column -->
  <host name="fast3.mydomain.local">
    <document-processor processes="4"/>
    <content-distributor id="1" />
    <indexing-dispatcher />
    <searchengine row="1" column="0" />
    <query />
  </host>

  <searchcluster>
    <row id="0" index="primary" search="true" />
    <row id="1" index="secondary" search="true" />
  </searchcluster>

</deployment>

Create three folders accessible to the FAST Content SSA for crawling.  Each contains 100 items (a mix of PDF, DOC, etc.):
               \\myfileshare\Folder1
               \\myfileshare\Folder2
               \\myfileshare\Folder3
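
If you need to stand up the same test content, here is a minimal sketch for creating the folders (the \\myfileshare paths are this example’s placeholders; point them at a writable share on your own file server and drop ~100 mixed documents into each):

# Create the three example folders used for crawling
1..3 | ForEach-Object { New-Item -ItemType Directory -Path "\\myfileshare\Folder$_" -Force }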



Solution\Example:

1.      Setup and crawl a File Share within the FAST Content SSA.

a.      In this example \\myfileshare\Folder1

2.      Ensure the Index is idle and the primary and backup columns are identical

a.      Open a Microsoft FAST Search Server 2010 Command Shell as Administrator

b.      Execute: indexerinfo -a status

<?xml version="1.0"?>
<indexer hostname="fast2.mydomain.local" port="13050" cluster="webcluster" column="0" row="0" ... exclusion_interval="60" time_since_last_index="0">
  <documents size="44758573.000000" total="101" indexed="101" not_indexed="0"/>
  <column_role state="Master" backups="1"/>
  <index_frequence min="0.000000" max="0.000000"/>
  <partition id="0" index_id="1315333741336367000" status="idle" type="dynamic" ...>
    <documents active="101" total="101"/>
  </partition>
  <partition id="1" index_id="1315328545925813000" status="idle" type="dynamic" ...>
    <documents active="0" total="0"/>
  </partition>
  <partition id="2" index_id="1315328542666333000" status="idle" type="dynamic" ...>
    <documents active="0" total="0"/>
  </partition>
  <partition id="3" index_id="1315328539447778000" status="idle" type="dynamic" ...>
    <documents active="0" total="0"/>
  </partition>
  <partition id="4" index_id="1315328535885114000" status="idle" type="dynamic" ...>
    <documents active="0" total="0"/>
  </partition>
  <document_api number_of_elements="0" last_sequence="92764" frequence="0.000000" ...>
    <queue_size current="0"/>
    <operations_processed api="0"/>
  </document_api>
</indexer>

<?xml version="1.0"?>
<indexer hostname="fast3.mydomain.local" port="13050" cluster="webcluster" column="0" row="1" ... exclusion_interval="60" time_since_last_index="0">
  <documents size="44758573.000000" total="101" indexed="101" not_indexed="0"/>
  <column_role state="Backup" backups="0"/>
  <index_frequence min="0.000000" max="0.000000"/>
  <partition id="0" index_id="1315333741336367000" status="idle" type="dynamic" ...>
    <documents active="101" total="101"/>
  </partition>
  <partition id="1" index_id="1315328545925813000" status="idle" type="dynamic" ...>
    <documents active="0" total="0"/>
  </partition>
  <partition id="2" index_id="1315328542666333000" status="idle" type="dynamic" ...>
    <documents active="0" total="0"/>
  </partition>
  <partition id="3" index_id="1315328539447778000" status="idle" type="dynamic" ...>
    <documents active="0" total="0"/>
  </partition>
  <partition id="4" index_id="1315328535885114000" status="idle" type="dynamic" ...>
    <documents active="0" total="0"/>
  </partition>
  <document_api number_of_elements="0" last_sequence="92764" frequence="0.000000" ...>
    <queue_size current="0"/>
    <operations_processed api="0"/>
  </document_api>
</indexer>

Side Note: I truncated the output a little for readability; the truncated spots are marked with "...".

c.      The <documents active="101" total="101"/> line under partition "0" shows that the 101 items (100 files + 1 folder) are present in both the Primary and Backup columns.

d.      The total="101" indexed="101" attributes on the <documents> element show that all 101 items have been indexed.
                                                    i.     While a crawl is active, the "not_indexed" count will not stay at zero, as batches are processed, stored, and then indexed at intervals.

e.      The status="idle" attribute shows that each partition within the column index is currently idle.

f.       In the real world we will not be that lucky: a server crash will probably not occur while the index is idle, and once we lose a server we will only be able to get this information for the backup column, not the primary.

3.      Let’s lose the Primary Column (i.e. simulate that the server has crashed and is unrecoverable)

a.      Open the FAST Command Shell as Administrator on the Primary Column Server
                                                    i.     In this Example: fast2.mydomain.local

b.      Execute: nctrl stop
                                                    i.     This will stop all FAST services

c.      Alternatively we could execute “nctrl stop indexer”, which stops just the RTS Indexer, but I want to simulate that the server is gone.

d.      Set up a 2nd File Share Crawl in the FAST Content SSA

                                                    i.     Start a Full Crawl

e.      From the FAST Shell as Administrator
                                                    i.     Execute: psctrl status

1.      Note that no items are being submitted to the FS4SP pipeline. (This will only show results for Document Processors on servers that have not been stopped.)

                                                   ii.     Execute: indexerinfo -a status

1.      Note: we have lost the Primary Column and no counts have changed on the Backup Column.

2.      The Primary and Backup Columns do NOT work as a failover in which content is indexed to the backup and then replicated to the primary once a primary column has been re-established.

3.      It would not be unexpected for the Backup and Primary Columns to have different counts if the primary server crashed while crawling was active.  Being able to recover the primary column without restoring from a backup or re-crawling millions of items, and without long downtime, is worth losing a few items that will need to be accounted for once we are back up and crawling.

                                                  iii.     Stop the Crawl from Step d.

Side Note: A reset of the OSearch14 service may be required if the crawl status does not go to idle.
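
A minimal sketch of that reset from an elevated PowerShell prompt on the server running the crawl component (OSearch14 is the SharePoint Server Search 14 service named in the side note):

# Restart the SharePoint Server Search 14 service to clear a stuck crawl status
Restart-Service -Name OSearch14 -Force
Get-Service -Name OSearch14    # confirm the service is Running again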

4.      Promote the Backup Column to become the Primary Column

a.      Stop all Crawls

b.      Stop the Web Analyzer and Relevancy Admin
                                                    i.     The WebAnalyzer runs on a schedule. To avoid any updates or processing causing changes to the index, we will suspend these services (the full command sequence is consolidated in the sketch after these steps).

                                                   ii.     Logon to the FS4SP server which hosts the WebAnalyzer.

1.      In this example: fast1.mydomain.local

                                                  iii.     Open FAST Command Shell As Administrator

                                                  iv.     Execute: waadmin showstatus

1.      The Overall Status needs to be running before we can suspend it.

                                                   v.     If the Status is paused

a.      Execute: waadmin enqueueview

b.      Repeat step iv.

                                                  vi.     Execute: waadmin AbortProcessing

                                                vii.     Execute: spreladmin AbortProcessing
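
Putting steps iv. – vii. together, the suspend sequence from an elevated FAST Command Shell on fast1 looks like this:

waadmin showstatus           # overall status must be Running before suspending
waadmin enqueueview          # only if the status shows paused; then re-check with showstatus
waadmin AbortProcessing      # suspend WebAnalyzer processing
spreladmin AbortProcessing   # suspend Relevancy Admin processing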


c.      Stop FS4SP Nodes
                                                    i.     Perform the node shutdown on each FS4SP server, as shown below
1.      In this example “fast1.mydomain.local” & “fast3.mydomain.local”
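
As in step 3.b, this is an elevated FAST Command Shell command on each host:

# On fast1.mydomain.local and fast3.mydomain.local (FAST Command Shell as Administrator)
nctrl stop     # stops all FAST services on the local node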


d.      Modify the Existing Farm Configuration
                                                    i.     On the FAST Admin Server

                                                   ii.     Use Windows Explorer to navigate to <FAST Install Drive>\FASTSearch\etc\config_data\deployment.

                                                  iii.     Edit the deployment.xml file
1.      After the initial FS4SP configuration, this becomes the master deployment file

2.      Comment out the current Primary Column.
a.      fast2.mydomain.local in this example

3.      Promote the Backup Column by changing its row from "1" to "0"
a.      fast3.mydomain.local in this example

4.      Remove the Backup Row from the farm by commenting out row="1" in the search cluster.

                                                  iv.     The new deployment file will look as follows, with the changes noted in comments.

<?xml version="1.0" encoding="utf-8" ?>
<deployment version="14">
   <instanceid>FAST Search Server POC</instanceid>
   <connector-databaseconnectionstring></connector-databaseconnectionstring>

  <!-- Admin Node -->
  <host name="fast1.mydomain.local">
    <admin />
    <webanalyzer server="true" max-targets="2" link-processing="true" lookup-db="true" redundant-lookup="true" />
  </host>

  <!-- Old Primary Column -->
  <!-- <host name="fast2.mydomain.local">
    <content-distributor id="0" />
    <searchengine row="0" column="0" />
    <indexing-dispatcher />
    <query />
  </host> -->

  <!-- New Primary Column -->
  <host name="fast3.mydomain.local">
    <document-processor processes="4"/>
    <content-distributor id="1" />
    <indexing-dispatcher />
    <!-- <searchengine row="1" column="0" /> -->
    <searchengine row="0" column="0" />
    <query />
  </host>

  <searchcluster>
    <row id="0" index="primary" search="true" />
    <!-- <row id="1" index="secondary" search="true" /> -->
  </searchcluster>
</deployment>

                                                   v.     Starting with the Admin Server

1.      In this example: “fast1.mydomain.local”

                                                  vi.     Open FAST Command Shell As Administrator

                                                vii.     Execute: Set-FASTSearchConfiguration

                                               viii.     Execute: nctrl start

                                                  ix.     Execute: nctrl status

1.      Wait for the Node to respond that all services are running.

                                                   x.     Repeat steps vi. – ix. for each non-admin node (see the consolidated sketch below)

1.      In this example only “fast3.mydomain.local”

2.      Remember “fast2.mydomain.local” crashed so we will not do anything with that server.
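
Consolidating steps vi. – ix., the per-node restart sequence looks like this (run in an elevated FAST Command Shell, admin node first):

Set-FASTSearchConfiguration   # apply the edited deployment.xml to this node
nctrl start                   # start all FAST services on this node
nctrl status                  # re-run until every service reports running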

                                                  xi.     Execute: indexerinfo -a status

1.      Note fast3 is column="0" row="0", state="standalone", backups="0"

<indexer hostname="fast3.mydomain.local" port="13050" cluster="webcluster" column="0" row="0"
Running" exclusion_interval="60" time_since_last_index="162">
  <documents size="88681405.000000" total="101" indexed="101" not_indexed="0"/>
  <column_role state="standalone" backups="0"/>

e.      Re-crawl the 2nd File Share

                                                    i.     In this example I will crawl \\myfileshare\Folder2


f.       Re-execute: indexerinfo -a status
                                                    i.     Validate that the document totals include the new crawl.
<indexer hostname="trickypoc3.trickydomain.local" port="13050" cluster="webcluster" column="0" row="0"
Running" exclusion_interval="60" time_since_last_index="162">
  <documents size="88681405.000000" total="202" indexed="202" not_indexed="0"/>
  <column_role state="standalone" backups="0"/>
 

5.      Re-configure a New FS4SP server to join the FS4SP Farm.

a.      In this example I will merely re-use fast2.mydomain.local

b.      Stop FS4SP Nodes
                                                    i.     Perform the node shutdown (nctrl stop) on each FS4SP server
1.      In this example “fast1.mydomain.local” & “fast3.mydomain.local”

c.      Modify the Existing Farm Configuration to include the new Backup Column
                                                    i.     On the FAST Admin Server

                                                   ii.     Use Windows Explorer to navigate to <FAST Install Drive>\FASTSearch\etc\config_data\deployment.

                                                  iii.     Edit the deployment.xml file

1.      Uncomment the FS4SP server to be used as the new Backup Column
a.      fast2.mydomain.local in this example

2.      Change its row from "0" to "1"
a.      fast2.mydomain.local in this example

3.      Re-implement the backup row="1" in the search cluster.

                                                  iv.     The new deployment file will look as follows, with the changes noted in comments.

<?xml version="1.0" encoding="utf-8" ?>
<deployment version="14">
   <instanceid>FAST Search Server POC</instanceid>
   <connector-databaseconnectionstring></connector-databaseconnectionstring>

  <!-- Admin Node -->
  <host name="fast1.mydomain.local">
    <admin />
    <webanalyzer server="true" max-targets="2" link-processing="true" lookup-db="true" redundant-lookup="true" />
  </host>

  <!-- New Backup Column -->
  <host name="fast2.mydomain.local">
    <content-distributor id="0" />
    <searchengine row="1" column="0" />
    <indexing-dispatcher />
    <query />
  </host>

  <!-- New Primary Column -->
  <host name="fast3.mydomain.local">
    <document-processor processes="4"/>
    <content-distributor id="1" />
    <indexing-dispatcher />
    <!-- <searchengine row="1" column="0" /> -->
    <searchengine row="0" column="0" />
    <query />
  </host>

  <searchcluster>
    <row id="0" index="primary" search="true" />
    <row id="1" index="secondary" search="true" />
  </searchcluster>
</deployment>

                                                   v.     Copy the FIXML from the new Primary Column (fast3) to the new Backup Column (fast2) with the following commands.  As I am re-using my existing server, I will delete the FIXML from “fast2” first (see the sketch after the example).

robocopy /E /MT:100 /NFL /COPYALL /LOG:\incoming\robocopy_fixml.log <source_path>\data\data_fixml <FASTSearchFolder>\data\data_fixml
copy <source_path>\data\ftStorage\processed_checkpoint.txt <FASTSearchFolder>\data\ftStorage

In this example:
robocopy /E /MT:100 /NFL /COPYALL \\fast3\c$\FASTSearch\data\data_fixml C:\FASTSearch\data\data_fixml

copy \\fast3\c$\fastsearch\data\ftStorage\processed_checkpoint.txt C:\FASTSearch\data\ftStorage
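
The delete mentioned in step v. is a straight recursive remove of the old FIXML on fast2 (a minimal sketch; C:\FASTSearch is the install location assumed throughout this example):

# On fast2.mydomain.local, with the FAST node stopped
Remove-Item -Path C:\FASTSearch\data\data_fixml\* -Recurse -Force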

Side Note:  Why copy the FIXML at all if replication between a Primary Column and a Backup Column occurs automatically once we implement the new backup column and resume indexing?  On that note, I will not use the robocopy in this example.  When we implement the backup column and resume crawling, FS4SP certainly appears to take care of the replication, as we end up with the same number of items in both the Primary and Backup Columns.  In the results further below, compare two numbers: the total counts for each column, not just the counts for each partition.

When using a primary and backup column strategy, not only is the index kept in sync but also the FIXML.  As we crawled \\myfileshare\Folder2 before re-implementing our backup column, we want to make sure our FIXML stays in sync.  If you choose not to perform the robocopy, as I did, browse the FIXML folders on the two machines and you will quickly notice they do not appear identical.  I encourage trying it both ways.

                                                  vi.     Starting with the Admin Server
1.      In this example: “fast1.mydomain.local”

                                                vii.     Open FAST Command Shell As Administrator

                                               viii.     Execute: Set-FASTSearchConfiguration

                                                  ix.     Execute: nctrl start

                                                   x.     Execute: nctrl status

1.      Wait for the Node to respond that all services are running.

                                                  xi.     Repeat steps vii. – x. for each non-admin node
1.      In this example “fast2.mydomain.local” and “fast3.mydomain.local”

2.      “fast2.mydomain.local” is our new backup server.

                                                 xii.     Execute: indexerinfo -a status

1.      Note fast3 is no longer column="0" row="0", state="standalone", backups="0" but now reports state="Master" backups="1", while fast2 is column="0" row="1", state="Backup", backups="0"

d.      Resume WebAnalyzer and Relevancy Admin

                                                    i.     Open FAST Command Shell As Administrator on the FS4SP Admin node or the Server with the WebAnalyzer service enabled.

                                                   ii.     Execute: waadmin EnqueueView

                                                  iii.     Execute: spreladmin Enqueue

e.      Set up the 3rd File Share Crawl and run it

                                                    i.     In this example I will crawl \\myfileshare\Folder3

                                                   ii.     When the Crawl has completed

                                                  iii.     Open FAST Command Shell As Administrator on any FS4SP server

                                                  iv.     Execute: indexerinfo -a status

                                                   v.     Note the following in the results below:

1.      fast3 has become the Primary Column

2.      fast2 has become the Backup Column

3.      both columns show the same expected number of documents in partition 0 (active="303")

4.      As replication does not occur immediately, you may have to give the backup column time to come in sync with the Primary Column.  Using the robocopy from the Side Note above will probably get you in sync faster if you have a large corpus of documents.  In this example, with a small corpus of 303 items, it happens relatively quickly.  (A polling sketch follows the results below.)

<indexer hostname="fast3.mydomain.local" port="13050" cluster="webcluster
Running" exclusion_interval="60" time_since_last_index="103166">
  <documents size="134200894.000000" total="303" indexed="303" not_indexed="0"/>
  <column_role state="Master" backups="1"/>
  <index_frequence min="8.538462" max="25.250000"/>
  <partition id="0" index_id="1315606415333091000" status="idle" type="dynamic" ti
    <documents active="303" total="303"/>
  </partition>
  <partition id="1" index_id="1315328545925813000" status="idle" type="dynamic" ti
    <documents active="0" total="0"/>
  </partition>
  <partition id="2" index_id="1315328542666333000" status="idle" type="dynamic" ti
    <documents active="0" total="0"/>
  </partition>
  <partition id="3" index_id="1315328539447778000" status="idle" type="dynamic" ti
    <documents active="0" total="0"/>
  </partition>
  <partition id="4" index_id="1315328535885114000" status="idle" type="dynamic" ti
    <documents active="0" total="0"/>
  </partition>
  <document_api number_of_elements="0" last_sequence="92865" frequence="0.000000"
    <queue_size current="0"/>
    <operations_processed api="101"/>
  </document_api>
</indexer>

<?xml version="1.0"?>
<indexer hostname="fast2.mydomain.local" port="13050" cluster="webcluster" ... exclusion_interval="60" time_since_last_index="0">
  <documents size="90278062.000000" total="202" indexed="202" not_indexed="0"/>
  <column_role state="Backup" backups="0"/>
  <index_frequence min="0.000000" max="0.000000"/>
  <partition id="0" index_id="1315606415333091000" status="idle" type="dynamic" ...>
    <documents active="303" total="303"/>
  </partition>
  <partition id="1" index_id="1315328545925813000" status="idle" type="dynamic" ...>
    <documents active="0" total="0"/>
  </partition>
  <partition id="2" index_id="1315328542666333000" status="idle" type="dynamic" ...>
    <documents active="0" total="0"/>
  </partition>
  <partition id="3" index_id="1315328539447778000" status="idle" type="dynamic" ...>
    <documents active="0" total="0"/>
  </partition>
  <partition id="4" index_id="1315328535885114000" status="idle" type="dynamic" ...>
    <documents active="0" total="0"/>
  </partition>
  <document_api number_of_elements="0" last_sequence="92865" frequence="0.000000" ...>
    <queue_size current="0"/>
    <operations_processed api="0"/>
  </document_api>
</indexer>
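
If you would rather watch the backup catch up than re-run the command by hand, a quick polling loop from the FAST Command Shell works (a rough sketch; it simply surfaces each column’s hostname and document-count lines every 30 seconds):

# Poll indexerinfo and surface the hostname and document-count lines for each column
while ($true) {
    indexerinfo -a status | Select-String 'hostname=|<documents '
    Start-Sleep -Seconds 30
}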

Conclusion: Though all of the services in an FS4SP farm, with the exception of the Admin service, can be implemented in a scaled/redundant fashion, there is no way to implement indexer redundancy for your Primary Columns.  If you lose a Primary Column you will not be able to index.  It would be nice if the backup columns automatically worked as a failover for both query and indexing, but they do not.  When planning a disaster recovery strategy you can choose between a couple of options: 1) use the built-in FS4SP backup and restore commands, or 2) back up the FIXML folder and rebuild the index from that restore.  But if you are already spending the cost to implement high availability of search (i.e. primary columns have backups), it may be wise to put a plan in place to promote backup columns to primary columns for disaster recovery.  There are several advantages: 1) the storage space has already been allocated for the backup columns, saving the cost of additional storage for FS4SP backups, 2) you will probably get back up and indexing faster, and 3) you can immediately resume indexing by promoting the backup column and worry about building a new backup column later.  (Standing up new hardware can take time, especially if there is no spare hardware in sight.)

KORITFW