Crawl a WordPress Blog with SharePoint 2013

At work we have a WordPress blog that we wanted to include in our public website’s search results.  Yep, our public website is SharePoint 2013.  We recently moved it to Azure, but that is a blog post for another day…

Anyway, I started down this path and ran into a few issues before I sorted it out.  Once I figured it out I thought it’d be a good idea to share.

The first thing I did was go to my search config and create a new content source.  I added the blog’s URL to it and then tried several different Crawl Settings.

Turns out I just needed to set it to “Only crawl within the server of each start address”.  I couldn’t tell that it worked, though, because I kept running into this warning in my crawl logs every time I did a full crawl…

Item not crawled due to one of the following reasons: Preventive crawl rule; Specified content source hops/depth exceeded; URL has query string parameter; Required protocol handler not found; Preventive robots directive. ( This item was deleted because it was excluded by a crawl rule. )

I tried to google the site and could only ever get a result if I googled the URL, blog.b2btech.com, which told me…

A description for this result is not available because of this site’s robots.txt

I went and checked the Reading settings on the blog.  Turns out the Search Engine Visibility check box, “Discourage search engines from indexing this site”, was checked.  I unchecked it and kicked off a crawl.  At this point I didn’t have the proper Crawl Setting set and was just trying to crawl the sitemap.xml file, which SharePoint can’t do.  I experimented with crawl rules for a while and then switched back to the URL of the blog in the content source.
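The robots.txt message from Google is something you can check for yourself before kicking off another crawl.  Here is a minimal Python sketch using the standard library’s urllib.robotparser.  The robots.txt contents are my assumption of what a blanket block looks like (I didn’t save the real file), and the "*" user agent is just a stand-in for whatever crawler you care about.

```python
from urllib.robotparser import RobotFileParser


def is_crawl_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Assumed example: a robots.txt that blocks every crawler from the whole site.
BLOCKING_ROBOTS = "User-agent: *\nDisallow: /\n"

# Assumed example: a robots.txt that only blocks the admin area.
OPEN_ROBOTS = "User-agent: *\nDisallow: /wp-admin/\n"

if __name__ == "__main__":
    blog_url = "http://blog.b2btech.com/"
    print("blocking file allows crawl:", is_crawl_allowed(BLOCKING_ROBOTS, blog_url))
    print("open file allows crawl:", is_crawl_allowed(OPEN_ROBOTS, blog_url))
```

To test a live site, you could download its /robots.txt and pass the text to is_crawl_allowed instead of the sample strings.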

This resulted in much more stuff coming into the index than I would ever want.

Eventually, what ended up working for me is the following:

  • Content source with the blog URL
  • Crawl Setting set to “Only crawl within the server of each start address”
  • Crawl rule set to
    • blogpath/*
    • “Use regular expression syntax for matching this rule” check box checked
    • Include all items in this path selected and all check boxes below left unchecked
    • Anonymous access
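
To illustrate what the crawl setting and the include rule above are doing together, here is a rough Python sketch of the idea.  This is not how SharePoint implements it.  The host blog.example.com is a made-up stand-in for the blog’s address, and the wildcard matching uses Python’s fnmatch rather than SharePoint’s own rule syntax.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Hypothetical values standing in for the content source and crawl rule.
START_ADDRESS = "http://blog.example.com/"
INCLUDE_PATTERN = "http://blog.example.com/*"


def same_server(start_address: str, url: str) -> bool:
    """Mimic 'Only crawl within the server of each start address'."""
    return urlsplit(start_address).hostname == urlsplit(url).hostname


def matches_include_rule(url: str, pattern: str = INCLUDE_PATTERN) -> bool:
    """Mimic an include rule with a trailing wildcard, ignoring case."""
    return fnmatchcase(url.lower(), pattern.lower())


def in_scope(url: str) -> bool:
    """A link is crawled only if it passes both checks."""
    return same_server(START_ADDRESS, url) and matches_include_rule(url)


if __name__ == "__main__":
    for link in (
        "http://blog.example.com/2015/01/some-post/",
        "http://www.example.com/about",
        "https://twitter.com/share?url=http://blog.example.com/",
    ):
        print(link, "->", "crawl" if in_scope(link) else "skip")
```

Because the blog lives on its own host, the two checks mostly agree here.  The point is that links pointing off the blog get skipped, which is what keeps the extra stuff out of the index.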

After getting my configuration sorted out as indicated above, the crawl worked as expected and I now have blog entries showing up in search.

I specifically wrote this blog post because this forum post didn’t provide a solution to the poster’s issues…

https://social.technet.microsoft.com/Forums/exchange/en-US/d0e50c07-662b-47c5-9347-b6fe44ec23ed/crawling-wordpress-blog?forum=sharepointsearch