Thursday, September 06, 2007

10 Ways to Optimize SQL Server Full-text Indexing
Hilary Cotter

In this article, Hilary Cotter shows you techniques you can use to make SQL Server's full-text indexing, or FTI, work better and faster. Also see his previous article on optimizing full-text searching in the January 2004 issue.

Indexing performance has always been the Achilles' heel of SQL Server's full-text feature. Although the search performance itself is fast (very few customers complain about the search speed), users do complain (often loudly and frequently) about how slow the indexing process is, especially for large tables.

At the moment, I'm using a 2GHz laptop with a gig of RAM, and I can hit 10,000 rows per minute (rpm) with my sample database. Ten thousand rpm means it would take me about an hour and 40 minutes to index a million-row table, or a day to index a 15-million-row table. SQL Server's indexing is slow primarily because of its asynchronous data extraction scheme. (The fact that competitors' products are also slow at indexing offers little solace to the harried DBA.) Rather than plunge into arcane details about SQL FTI internals (some of which are being revised for SQL 2005 anyway), I'll focus on what you can do to improve FTI today.

1. Use change tracking with the "update index in background" option


The easiest way to improve the performance of full-text indexing is to use change tracking with the "update index in background" option.
Here's why. When you index a table (FTI, like "standard" SQL indexes, works on a per-table basis), you specify full population, incremental population, or change tracking. When you opt for full population, every row in the table you're full-text indexing is extracted and indexed. This is a two-step process. First, you (or Enterprise Manager) run this system stored procedure:

sp_fulltext_getdata CatalogID, object_id

Note: You can get the object_id for your table by issuing SELECT object_id('TableName'), where TableName is the name of your table.
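If you want to try this call yourself, both arguments are easy to look up. Here's a minimal sketch against the pubs sample database (in SQL 2000, sysfulltextcatalogs maps catalog names to their IDs):

use pubs
go
-- the object_id argument:
select object_id('authors')
-- the CatalogID argument, listed by catalog name:
select ftcatid, name from sysfulltextcatalogs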

The sp_fulltext_getdata call returns a results set listing each primary key (or unique identifier) value for the table you're full-text indexing. So if you have a 50-million-row table, you get a result set of 50 million rows, with your primary key converted to varbinary(450). (If the table you're full-text indexing has a timestamp column, the timestamp column will be returned as well.)

This stored procedure also returns the latest timestamp for the database, which MSSearch (the Microsoft Search service engine) will store internally as a reference for doing incremental populations.

For instance, here's what SQL Server returns to MSSearch for the following statement:

sp_fulltext_getdata 10, 517576882

0000000000000490
------------------ -----------------------------------
0x000000000000048B 0x3131312D31312D31313131
0x000000000000044F 0x3131312D31312D31313132
0x0000000000000450 0x3131312D31312D31313133

The first output line, 0000000000000490, is the timestamp for the database. 0x000000000000048B is the timestamp column for the row I'm indexing, and 0x3131312D31312D31313131 is the PK value for the first row in the table I'm indexing.
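Incidentally, that database-wide timestamp is the same counter you can read yourself with @@DBTS, which is handy when you want to sanity-check what MSSearch has stored:

-- returns the current timestamp value for the database in use:
select @@DBTS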

After all the result sets of timestamps and PK values are returned to MSSearch, MSSearch will issue another sp_fulltext_getdata, but this time, once for every row in your table. It will look like this:

sp_fulltext_getdata 10, 517576882,
0x3131312D31312D31313131,1

So if you have 50 million rows in your database, this procedure will be issued 50 million times. Think about that. On the other hand, if you use an incremental population, MSSearch will issue an initial:

sp_fulltext_getdata 10, 517576882, 0000000000000490

(where 0000000000000490 is the timestamp from the last call) and then issue:

sp_fulltext_getdata 10, 517576882,
0x3131312D31312D31313131,1

for each row in the table that you're full-text indexing. Again, 0x3131312D31312D31313131 is your primary key value for the first row. So if you have 50 million rows in your database, this statement will also be issued 50 million times. Why? Because even with an incremental population, MSSearch must figure out exactly which rows have been inserted, updated, or deleted. Another problem with incremental populations is that they'll index or re-index a row even if the change was made to a column that you aren't full-text indexing. [Beginning to understand why folks complain about the speed of indexing?–Ed.]
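Remember, too, that the timestamp comparison only works if the table actually has a timestamp column; if it doesn't, a requested incremental population runs as a full population. Here's a minimal sketch of adding one (the column name is my own invention):

-- without a timestamp column, incremental population degrades to full:
alter table authors add row_ts timestamp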

Although an incremental population is generally faster than a full population, you can see that for large tables, either will be time-consuming. My rule of thumb: If you're modifying a large portion (say, more than 10 percent) of your table, do a full population rather than an incremental one.
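For reference, here's how each kind of population is kicked off at the table level; both actions are standard sp_fulltext_table options:

exec sp_fulltext_table N'[dbo].[authors]', N'start_full'
-- or, when less than ~10 percent of the table has changed:
exec sp_fulltext_table N'[dbo].[authors]', N'start_incremental'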

That's why I recommend you enable change tracking with background or scheduled updating. If you do, you'll see that MSSearch will first issue another:

sp_fulltext_getdata 10, 517576882,
0x3131312D31312D31313131,1

for every row in the table with change tracking enabled. Again, 0x3131312D31312D31313131 is the primary key value for the first row. Then, for every row that has a column that you're full-text indexing and that's modified after your initial full population, the row information will be written (in the database you're indexing) to the sysfulltextnotify table. MSSearch will then issue the following only for the rows that appear in this table, and will then remove them from the sysfulltextnotify table:

sp_fulltext_getdata 10, 517576882,
0x3131312D31312D31313131,1

To see this in action with an indexed pubs database, enable change tracking and try the following experiment:

exec sp_fulltext_database N'enable'
GO
exec sp_fulltext_catalog N'test', N'create'
GO
exec sp_fulltext_table N'[dbo].[authors]',
N'create', N'test', N'UPKCL_auidind'
GO
exec sp_fulltext_column N'[dbo].[authors]',
N'au_lname', N'add', 1040
GO
exec sp_fulltext_column N'[dbo].[authors]',
N'au_fname', N'add', 1040
GO
exec sp_fulltext_column N'[dbo].[authors]',
N'phone', N'add', 1040
GO
exec sp_fulltext_column N'[dbo].[authors]',
N'address', N'add', 1040
GO
exec sp_fulltext_table N'[dbo].[authors]',
N'activate'
GO
exec sp_fulltext_table N'[dbo].[authors]',
N'start_change_tracking'
GO
exec sp_fulltext_table N'[dbo].[authors]',
N'start_background_updateindex'
GO

Then, do the following:

update authors set au_lname=au_lname
where au_lname < 'm%'
go
select * from dbo.sysfulltextnotify

Wait a couple of seconds before issuing the second query again, and you'll see that MSSearch has picked up your changes and has removed the entries from the sysfulltextnotify table.
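If you want to double-check how a table is registered before (or after) running the experiment, sp_help_fulltext_tables and sp_help_fulltext_columns report which tables and columns belong to a catalog:

-- tables registered in the 'test' catalog:
exec sp_help_fulltext_tables N'test'
-- columns being full-text indexed on authors:
exec sp_help_fulltext_columns N'[dbo].[authors]'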

As you can see, the change tracking option offers much better performance than either incremental or full population when it only has a few updates to make. If a significant portion of your table has changed, though, you may find full population offers better efficiency.

If you're running SQL full-text indexing on a heavily updated database, it behooves you to consider scheduling the update rather than having it run continuously. This is especially true if you can schedule updating during periods of low activity (for instance, nights or weekends). To run a background update of the index on a schedule of your own choosing, issue the following (a SQL Server Agent sketch appears after the statement):

exec sp_fulltext_table N'[dbo].[authors]',
N'update_index'
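If you'd rather hand that statement to SQL Server Agent, the following is a minimal sketch; the job name is made up, and you'd still add a schedule and target server with sp_add_jobschedule and sp_add_jobserver:

exec msdb.dbo.sp_add_job @job_name = N'Nightly FTI update'
exec msdb.dbo.sp_add_jobstep @job_name = N'Nightly FTI update',
  @step_name = N'update index',
  @database_name = N'pubs',
  @command = N'exec sp_fulltext_table N''[dbo].[authors]'', N''update_index'''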

Note: Bear in mind that updates made to the text data type using WRITETEXT or UPDATETEXT will not be flagged for indexing under the change tracking option.
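One hedged workaround: after a WRITETEXT or UPDATETEXT call, touch an indexed column with a no-op update so that change tracking flags the row. This is the same set-column-to-itself trick used in the experiment above; the key value here is just one of the pubs sample rows:

-- force change tracking to notice a row modified via WRITETEXT/UPDATETEXT:
update authors set au_lname = au_lname where au_id = '172-32-1176'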

2. Consider using a separate build server


Tables that are heavily updated while you're indexing can create locking problems, so if you can live with a catalog that's periodically out of date (and an MSSearch engine that's sometimes unavailable), consider using a separate build server. You do this by making sure the indexing server has a copy of the table to be full-text indexed and exporting the catalog (see KB Article 240867 for the details). Clearly, if you need real-time or near real-time updates to your catalog, this is not a good solution.
3. Convert all of your data to text


As you know, SQL Server 2000 gives you the option of storing what are often large and format-heavy documents, such as PDFs, in their native format in an image data type column. By specifying the document extension, you ensure that the appropriate iFilter will be applied to the documents. However, as you might expect, there's a performance hit, and I've found that you can improve the raw indexing speed significantly by storing your documents as text. The FiltDump utility that ships with the Platform SDK dumps the output of your iFilter, letting you see how the iFilter interprets your document. Use the -b switch as follows:
FiltDump -b MyDocument.pdf > MyDocument.txt

Surprisingly, indexing performance isn't terribly sensitive to the size of the text documents within the columns you're indexing. In fact, I find there's little difference using varchar, char, or text for documents up to 8K. In one test, I took 40,000 documents with an average size of about 6K and stored them in a table as varchar. Then I created another table, storing the same documents as text, and the indexing speed was almost the same. (The Text in Row option didn't change the indexing speed, either.) Indexing speed is, however, quite sensitive to both the amount of formatting in your documents and the speed of your iFilter.

I did a test where I decompiled MSDN and converted it to Word documents, storing all 460,000 documents in a full-text database. When the same documents were stored as HTML, indexing was 23 percent faster than with the Word versions, and when they were stored as TXT, I realized an additional 10 percent increase in indexing speed.

If you do choose to store documents in an Image data type column, they'll be extracted to %systemroot%\temp\Gthrsvc. You can change this temporary location (try using a RAID-1 array) via this Registry key:

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Search\1.0\
Gathering Manager\TempPath

Again, I recommend you avoid the performance hit associated with Image data by storing your documents as text, char, or varchar.
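For completeness, this is how an image column gets registered along with the document-type column that tells MSSearch which iFilter to load. The table and column names here are hypothetical, but the optional type-column argument to sp_fulltext_column is standard:

-- doc_blob holds the document; doc_ext holds an extension such as '.pdf':
exec sp_fulltext_column N'[dbo].[docs]', N'doc_blob', N'add', 1033, N'doc_ext'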

4. Limit activity when population is running


When population is running, don't run Profiler, and limit other database activity as much as possible. Profiler consumes significant resources. Likewise, don't run this in a continual loop, as it has a negative impact on performance:
Select FULLTEXTCATALOGPROPERTY('CatalogName',
'PopulateStatus')
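If you genuinely need to watch a population, poll gently instead. A sketch (PopulateStatus returns 0 once the catalog is idle):

while FULLTEXTCATALOGPROPERTY('CatalogName', 'PopulateStatus') <> 0
begin
    waitfor delay '00:01:00' -- one check per minute, not a tight loop
end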

5. Increase the number of threads for the indexing process


Increase the number of threads you're running for the indexing process. The default is only five, and on quads or 8-ways, you can bump this up to much higher values. MSSearch will, however, throttle itself if it's slurping too much data from SQL Server, so avoid doing this on single- or dual-processor systems. These are the relevant Registry entries:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Search\1.0\
Gathering Manager\Servers\SQLServer\EvaluationOrder

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Search\1.0\
Gathering Manager\Servers\SQLServer\HitInterval

Another trick, if your application permits it, is to partition your data into multiple catalogs. For instance, if you split your table into two, you'll have twice as many threads working on the indexing.
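A sketch of the two-catalog split follows; the table, index, and catalog names are made up for illustration:

exec sp_fulltext_catalog N'cat_a', N'create'
exec sp_fulltext_catalog N'cat_b', N'create'
exec sp_fulltext_table N'[dbo].[docs_a]', N'create', N'cat_a', N'PK_docs_a'
exec sp_fulltext_table N'[dbo].[docs_b]', N'create', N'cat_b', N'PK_docs_b'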

6. Invest in the best equipment you can afford


Invest in a quad or 8-way with the fastest CPUs and the largest L2 cache you can afford. There's a sweet spot at eight processors. If you can't afford an 8-way, invest in the fastest disk subsystem you can afford. (Microsoft doesn't support placing SQL Server databases or full-text catalogs on RAM drives or disks, and some users have had spectacular failures using RAM drives.)
7. Stop any anti-virus or open file-agent backup software


If this isn't possible, try to prevent them from scanning the temporary directories being used by SQL FTI and the catalog directories–which default to C:\Program Files\Microsoft SQL Server\MSSQL\FTData.
8. Give the catalog its own controller


Place the catalog on its own controller, preferably on a RAID-1 array.
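You pick that location when the catalog is created: sp_fulltext_catalog accepts an optional root path (the drive letter below is illustrative):

exec sp_fulltext_catalog N'test', N'create', N'E:\FTCatalogs'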
9. Put the temp directory and pagefile on RAID-1 arrays


Place the temp directory on a RAID-1 array via SetTempPath.vbs, which can be found in C:\Program Files\Common Files\System\MSSearch\bin. Similarly, consider putting pagefile on its own RAID-1 array with its own controller.
10. Be aware of maxed-out CPU utilization


The German and (especially) the Asian word breakers have so much work to do that it's not uncommon for CPU utilization to max out at 100 percent. Although such utilization in itself isn't necessarily a bad thing, it does become a bad thing when your Server Work Queues counter climbs beyond 2 because your server will have to allocate resources to managing the queue (as opposed to doing actual work).
A note about BLOBs


Web servers are optimized for serving files to Web clients, and RDBMSs are optimized for returning results sets to clients, so Web servers offer better performance returning BLOBs to clients than RDBMSs do. In general, the more you offload from your RDBMS to your Web server, the better your overall database performance will be. Therefore, you may want to consider storing a copy of your BLOB data both in the database and via the file system. Then, when generating your Web pages, you can provide an IMG SRC link to your file system version of the BLOB–as opposed to having your RDBMS return it. (I realize that there's considerable debate in the community about where to store BLOBs and large volumes of textual data in general. SQL Server MVP Aaron Bertrand discusses some of the various arguments and lists Microsoft's KB articles related to BLOB data at www.aspfaq.com/show.asp?id=2149.)
The samples


There are two code samples. The first illustrates loading a database with files located in a directory. (There are several ways to load data into a database, including ADO's GetChunk and AppendChunk, TextCopy, and even bcp; I prefer ADO's stream object, which is what the first sample uses.) The second sample lets you monitor the progress of a full or incremental population, allowing you to "peek" at the in-process population and estimate where you are in the population process. (It does this by running Profiler momentarily.) The estimate is based on the fact that, on average, 5 percent of a catalog build is spent compiling the catalog. The actual time will, of course, depend on the number of unique words in your content and the frequency of their usage.
Download 406HILARY.ZIP

Sidebar: SQL Server 2005 Full-text Search
SQL Server 2005 promises many improvements to SQL FTS. I've broken these down into various categories:

Architecture–The most significant improvement is in the arena of raw performance, both of indexing and search; early tests show improvement by two to three orders of magnitude. The main reason is that Microsoft has integrated the search engine into the database engine. In SQL7 and SQL 2000, the search engine was a separate service that was shared by all instances not only of SQL Server, but also SharePoint Portal Server Search and Exchange Content Indexing. The search indexes have been moved into the database as well. Not only does this result in better performance, it also allows you to copy your database and full-text indexes as a single unit. Bonus: The new search engine can index up to 2 billion rows. (The limit on SQL 2000 was between three and 30 million rows.)
T-SQL improvements–Look for thesaurus support as well as a full-text index creation syntax very similar to the CREATE INDEX syntax used for tables. You may realize that the current versions of Search SQL don't handle noise words or accents well. Happily, the version of Search SQL that ships in SQL 2005 solves these problems. You can also selectively query one or more columns in your full-text query.
Reporting tools–SQL 2005 full-text search also ships with improved reporting and logging capabilities. You'll also see two new tools: Lrtest, which allows you to determine exactly how a search argument is broken by the search parser while querying and indexing, and CiDump, which allows you to dump your catalog to determine exactly what's contained in your full text catalog.
By the way, Yukon FTS provides the engine to XQuery when you store XML documents using the XML datatype.

Sidebar: Resources
Andrew Cencini's white paper, "SQL Server 'Yukon' Full-Text Search: Internals and Enhancements," at http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsql90/html/yukonftsearch.asp
323739 INF: SQL Server 2000 Full-Text Search Deployment White Paper
827449 How to manually reinstall the Microsoft Search service for an instance of SQL Server 2000
817301 PRB: Microsoft Search Full-Text Catalog Population Fails When You Upgrade SQL Server 2000 Failover Cluster to SQL Server 2000 SP3
317746 PRB: SQL Server Full-Text Search Does Not Populate Catalogs
811064 Support Webcast: Microsoft SQL Server 2000: Full-Text Search and SP3
25136 Support Webcast: SQL Server 7.0 Full-Text Search and Basic Troubleshooting Methods
308771 PRB: A Full-Text Search May Not Return Any Hits If It Fails to Index a File
240681 HOWTO: Query Index Server for Custom Properties with SQL Query Analyzer
See also the "Resources" sidebar in Hilary's January feature, included as a PDF in the accompanying Download.
