November 16th, 2009
Adding Adobe Acrobat (.pdf) Documents to Your SharePoint Search
Most organizations work with Adobe Acrobat (.pdf) documents, but SharePoint does not index this file type out of the box. Sure, you can upload .pdf documents to document libraries, but they will not be crawled, so their contents never show up in your search results.
Thankfully, there are a few resources that address this issue. They are listed at the end of this post; here is a condensed list of actions:
1. You have to get your hands on an IFilter from Adobe. Essentially, the IFilter is the component that lets SharePoint understand what a .pdf document is and how to crawl it. The easiest way to get the IFilter is to download and install the latest version of Adobe Acrobat Reader on your Central Administration server. Reader versions 8.0 and up include the IFilter.
2. You then need to add the extension to your SharePoint extension list:
- Click Start, click Run, type regedit, and then click OK.
- Locate and then click the following registry subkey:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\Search\Applications\GUID\Gather\Search\Extensions\ExtensionList
- On the Edit menu, point to New, and then click String Value.
- Type the next unused number as the name (for example, 38 if 37 is the highest existing value), and then press ENTER.
- Right-click the registry entry that you created, and then click Modify.
- In the Value data box, type pdf, and then click OK.
3. Add the PDF file type to the Extensions List for WSS search by editing the registry:
- Start regedit
- Open the key HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\Search\Applications\{Random GUID}\Gather\Search\Extensions\ExtensionList
- Add PDF to the list as a new String Value. Use the next unused number as the name, e.g. if 37 is the highest existing name, use "38" as the name with the value "pdf".
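If you would rather script the registry change from steps 2 and 3 than click through regedit, here is a minimal sketch in Python. It picks the next unused number and builds a .reg fragment that you could review before importing. The function names are my own, the GUID is left as a placeholder because it differs on every installation, and the sketch assumes the existing value names are plain numbers.

```python
def next_extension_entry(existing_names, extension="pdf"):
    """Return (name, value) for a new ExtensionList string value.

    existing_names: value names already present under ExtensionList,
    e.g. ["1", "2", ..., "37"]. Non-numeric names are ignored.
    """
    numeric = [int(n) for n in existing_names if n.isdigit()]
    next_name = str(max(numeric, default=0) + 1)
    return next_name, extension


def reg_fragment(search_app_guid, name, value):
    """Build the text of a .reg file that adds one string value."""
    key = (
        r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools"
        r"\Web Server Extensions\12.0\Search\Applications"
        "\\" + search_app_guid
        + r"\Gather\Search\Extensions\ExtensionList"
    )
    return "\n".join([
        "Windows Registry Editor Version 5.00",
        "",
        "[" + key + "]",
        '"{}"="{}"'.format(name, value),
        "",
    ])


# Hypothetical input: pretend values 1 through 37 already exist.
existing = [str(i) for i in range(1, 38)]
name, value = next_extension_entry(existing)

# "{Random GUID}" is a placeholder; substitute the GUID from your own registry.
print(reg_fragment("{Random GUID}", name, value))
```

The script only prints text and never touches the registry, so you can check the key path and value by hand before importing anything.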
4. Make a graphic for the .pdf icon to be displayed in the search results and add it to SharePoint files to be referenced:
- Create a 16x16 pixel image of the Acrobat file icon.
- Save the image as pdf16.gif
- Add the Acrobat PDF picture to the SharePoint templates directory: copy pdf16.gif into the 12 Hive\TEMPLATE\IMAGES folder, e.g. %programfiles%\Common Files\Microsoft Shared\Web Server Extensions\12\TEMPLATE\IMAGES.
- Bind the Acrobat PDF picture to the PDF file type
- Open the 12 Hive\TEMPLATE\XML\DOCICON.XML file
- Find the <ByExtension> element inside <DocIcons>
- Add the following mapping: <Mapping Key="pdf" Value="pdf16.gif" OpenControl="" />
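For anyone who prefers to script the DOCICON.XML change, the sketch below shows one way to add the mapping with Python's standard XML library. It works on a tiny stand-in document rather than the real file, and it assumes the mappings are `Mapping` elements under `ByExtension`, as in the stock file. The function name and the sample XML are illustrative only.

```python
import xml.etree.ElementTree as ET

# Stand-in for DOCICON.XML; the real file contains many more mappings.
SAMPLE_DOCICON = """<DocIcons>
  <ByExtension>
    <Mapping Key="doc" Value="icdoc.gif" />
  </ByExtension>
</DocIcons>"""


def add_icon_mapping(xml_text, extension="pdf", icon="pdf16.gif"):
    """Return xml_text with a Mapping for `extension`, added if missing."""
    root = ET.fromstring(xml_text)
    by_ext = root.find("ByExtension")
    if by_ext is None:
        raise ValueError("no <ByExtension> element found")

    # Leave the document untouched if the extension is already mapped.
    for mapping in by_ext.findall("Mapping"):
        if mapping.get("Key", "").lower() == extension.lower():
            return xml_text

    ET.SubElement(
        by_ext, "Mapping",
        {"Key": extension, "Value": icon, "OpenControl": ""},
    )
    return ET.tostring(root, encoding="unicode")


print(add_icon_mapping(SAMPLE_DOCICON))
```

Back up the real DOCICON.XML before writing any generated output over it, since a malformed file affects every icon in the farm.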
5. Run IISReset
6. For MOSS 2007 users, add the file type in Central Administration:
- Go to your SSP site
- Click on Search Settings > File Types > New File Type
- Add pdf as a file type
7. Complete a Full Crawl of your content sources. This will re-crawl pdf documents that may already be in libraries.
Some of you might see the following warning in your crawl logs:
The file reached the maximum download limit. Check that the full text of the document can be meaningfully crawled.
This is because some of your .pdf files may be too large. Also, SharePoint won't crawl .pdf documents that have other .pdf documents embedded in them.
There doesn't seem to be a fix for complex .pdf documents. The size issue, however, can be resolved by adjusting the MaxDownloadSize value in the registry:
- Start Registry Editor (Regedit.exe).
- Locate the following key in the registry: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Global\Gathering Manager
- Open Edit – New – DWORD Value. Name it MaxDownloadSize. Double-click, change the value to Decimal, and type the maximum size (in MB) for files that the gatherer downloads.
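Regedit lets you type the size in decimal, but a .reg file stores a DWORD as eight hexadecimal digits. The sketch below shows that conversion for MaxDownloadSize. It is only an illustration: the function name is mine, the default key is the Office Server path from the step above, and 64 is an arbitrary example size.

```python
OFFICE_SERVER_KEY = (
    r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server"
    r"\12.0\Search\Global\Gathering Manager"
)


def max_download_size_fragment(size_mb, key=OFFICE_SERVER_KEY):
    """Build .reg text that sets MaxDownloadSize to size_mb (decimal MB)."""
    if not 0 < size_mb <= 0xFFFFFFFF:
        raise ValueError("size_mb must fit in an unsigned 32-bit DWORD")
    return "\n".join([
        "Windows Registry Editor Version 5.00",
        "",
        "[" + key + "]",
        # .reg files write DWORDs as 8 hex digits: 64 decimal is 00000040.
        '"MaxDownloadSize"=dword:{:08x}'.format(size_mb),
        "",
    ])


print(max_download_size_fragment(64))
```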
Because you have just raised the amount of data that can be crawled, crawling may take longer, so you may also want to increase the crawl timeout settings:
- In Central Administration, on the Application Management tab, in the Search section, click Manage search service.
- On the Manage Search Service page, in the Farm-Level Search Settings section, click Farm-level search settings.
- In the Timeout Settings section, change the Connection time and Request acknowledgement time values.
- For WSS 3.0 users: the registry key for MaxDownloadSize is HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\Search\Global\Gathering Manager
By default, MaxDownloadSize (16 MB) * MaxGrowFactor (4) = 16 * 4 = 64 MB of text can be indexed. Note that MaxGrowFactor is a multiplier, not a size.
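To see how a new MaxDownloadSize changes that ceiling, the arithmetic can be sketched in a few lines of Python. The function name is hypothetical, and the defaults are simply the 16 and 4 quoted above.

```python
def max_indexed_text_mb(max_download_size_mb=16, max_grow_factor=4):
    """Upper bound, in MB, on the text indexed from one file."""
    return max_download_size_mb * max_grow_factor


print(max_indexed_text_mb())    # defaults: 16 * 4 = 64
print(max_indexed_text_mb(64))  # MaxDownloadSize raised to 64: 64 * 4 = 256
```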
After all of your changes, you should restart the CA server and run another full crawl.
Resources:
http://support.microsoft.com/default.aspx/kb/927675
http://blog.tylerholmes.com/2008/04/walkthrough-installing-adobe-v6-pdf.html
http://grounding.co.za/blogs/neil/archive/2008/12/02/working-with-pdf-s-and-sharepoint.aspx
http://fmuntean.wordpress.com/2008/05/22/increase-file-size-crawling/