Crawl a WordPress Blog with SharePoint 2013

At work we have a WordPress blog that we wanted to include in our public website’s search results. Yep, our public website is SharePoint 2013. We recently moved it to Azure, but that is a post for another day…

Anyway, I started down this path and ran into a few issues before I sorted it out.  Once I figured it out I thought it’d be a good idea to share.

The first thing I did was go to my search config and create a new content source. I added the blog’s URL to it and started trying several different Crawl Settings.

Turns out I just needed to set it to Only crawl within the server of each start address.  I couldn’t tell this worked though because I kept running into this warning in my crawl logs every time I did a full crawl…

Item not crawled due to one of the following reasons: Preventive crawl rule; Specified content source hops/depth exceeded; URL has query string parameter; Required protocol handler not found; Preventive robots directive. ( This item was deleted because it was excluded by a crawl rule. )

I tried to google the site and could only ever get a result if I googled the URL, blog.b2btech.com, which told me…

A description for this result is not available because of this site’s robots.txt

I went and checked the Reading settings on the blog. Turns out the Search Engine Visibility check box, Discourage search engines from indexing this site, was checked. I unchecked it and kicked off a crawl. At this point I didn’t have the proper Crawl Setting set and was just trying to crawl the sitemap.xml file, which SharePoint can’t do. I experimented with crawl rules for a while and then switched back to the URL of the blog in the content source.
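To see why that check box matters to a crawler, here is a minimal sketch using Python's standard robots.txt parser. The two robots.txt bodies and the blog.example.com host are assumptions made up for illustration; they are not copied from the actual blog, and SharePoint's crawler is not involved here.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt for a site that discourages indexing (illustrative only).
BLOCKING_ROBOTS = """User-agent: *
Disallow: /
"""

# Assumed robots.txt after the check box is cleared (illustrative only).
OPEN_ROBOTS = """User-agent: *
Disallow: /wp-admin/
"""


def is_crawlable(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    post = "http://blog.example.com/some-post/"  # hypothetical URL
    print("blocking robots.txt:", is_crawlable(BLOCKING_ROBOTS, post))
    print("open robots.txt:", is_crawlable(OPEN_ROBOTS, post))
```

A blanket Disallow like the first one would be consistent with both the Google message and the "Preventive robots directive" reason in the crawl log warning.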

This resulted in much more stuff coming into the index than I would ever want.

Eventually, what ended up working for me is the following:

  • Content source with the blog’s URL as the start address
  • Crawl Setting set to Only crawl within the server of each start address
  • Crawl rule set to
    • blogpath/*
    • Use regular expression syntax for matching this rule checkbox checked
    • Include all items in this path selected and all check boxes below left unchecked
    • Anonymous access
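As a rough illustration of what an include rule of the form blogpath/* does, the sketch below uses Python's fnmatch as a stand-in for wildcard matching. It is not SharePoint's rule engine, and the blog.example.com rule and URLs are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical stand-in for the "blogpath/*" include rule; the host is made up.
INCLUDE_RULE = "http://blog.example.com/*"


def included_by_rule(url: str, rule: str = INCLUDE_RULE) -> bool:
    """Return True if the URL matches the wildcard rule (case-insensitive)."""
    return fnmatchcase(url.lower(), rule.lower())


if __name__ == "__main__":
    for candidate in (
        "http://blog.example.com/2015/03/some-post/",
        "http://www.example.com/about/",
    ):
        print(candidate, "->", included_by_rule(candidate))
```

Everything under the blog path matches the rule, while links that lead off to other hosts do not.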

After getting my configuration sorted out as indicated above, the crawl worked as expected and I have blog entries showing up in search.

I specifically wrote this post because this forum post didn’t provide a solution to the poster’s issues…

https://social.technet.microsoft.com/Forums/exchange/en-US/d0e50c07-662b-47c5-9347-b6fe44ec23ed/crawling-wordpress-blog?forum=sharepointsearch

SharePoint Saturday Presentation…

I presented at SharePoint Saturday Atlanta today.  For now I’m just going to post the slide deck and some relevant resources.  Later I plan to put together a post about scaling out the SharePoint 2013 Service Application.

SearchTopology_Final

http://technet.microsoft.com/en-us/library/jj219705.aspx

A Tour of SharePoint 2013 Search Part 2

This is part 2 of my series on search.  Part one is linked to below…

https://sharepointv15.wordpress.com/2013/01/16/a-tour-of-sharepoint-2013-search-part-1/

Part 1 focused on the System Status and Search Application Topology.  This post will focus on Content Sources.

On the Search Administration page there are several links broken into titled categories. The second group is titled Crawling.

sp2_1

“A content source is a set of options that you use to specify what, when and how to crawl.”

When Search is initially configured the content source “Local SharePoint sites” is created, and as the name implies this includes all SharePoint sites in your farm. As you create additional web apps they are automatically added to this content source. Another thing to note is that changing your default alternate access mapping (AAM) will result in that URL being added to your content source in addition to whatever the original URL of your site was, so some cleanup may be needed. This is also good to know: “Changing a content source requires a full crawl for that content source.”

I pulled that and the content source definition from this TechNet article. It is short and worth the read…

http://technet.microsoft.com/en-us/library/jj219808.aspx

Clicking on Content Sources will bring you to the Manage Content Sources page…

sp2_2

Clicking on the dropdown will result in the following menu appearing…

sp2_3

Clicking Edit or on the Name will bring you to the edit content source page.

sp2_4

sp2_5

From this page your initial options are to name your content source, view content source details and add or remove start addresses. Keep in mind that the Edit and Add pages are basically the same. Obviously, you are going to need to click the New Content Source button to get to the Add Content Source page etc… A start address is the point from which the crawler will begin to crawl your site. Typically Local SharePoint sites is going to have all of your web apps listed by default.

Crawl Settings really only applies when you are creating a content source, because once you have selected a setting you can’t change it. When creating a content source you have the following options.

 

sp2_9

Switching between the first 4 of these really just changes the path requirement, as illustrated in the screen shot below.

 

sp2_11

However, Line of Business Data and Custom Repository require significantly different information… Line of Business Data requires you to select a BCS service application, which of course requires that you have BCS provisioned and a service application connected to some LOB system. More information on both source types can be found here…

http://technet.microsoft.com/en-us/library/jj219577.aspx#Section3

And here…

http://msdn.microsoft.com/en-us/library/ee556429.aspx

sp2_10

A Custom Repository requires that you have a Custom Connector registered.

sp2_12

Your only edit options are “Crawl everything under the hostname for each start address” or “Only crawl the Site Collection of each start address”. The second option would be used if you want to crawl some site collections in a web app less or more often than others. There are several factors that would go into a decision like this, for instance varying content change frequency between site collections.
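The difference between those two scopes can be sketched roughly as follows. This is a simplified model, not how the crawler is implemented: it approximates a site collection by its URL path prefix (a root site collection would need extra handling), and the intranet.example.com URLs are hypothetical.

```python
from urllib.parse import urlsplit


def under_hostname(start_address: str, url: str) -> bool:
    """Scope 1: anything on the same host as the start address."""
    return urlsplit(start_address).hostname == urlsplit(url).hostname


def within_site_collection(start_address: str, url: str) -> bool:
    """Scope 2 (approximation): same host and under the start address path."""
    start, target = urlsplit(start_address), urlsplit(url)
    if start.hostname != target.hostname:
        return False
    base = start.path.rstrip("/") + "/"
    return (target.path.rstrip("/") + "/").startswith(base)


if __name__ == "__main__":
    start = "http://intranet.example.com/sites/hr"  # hypothetical start address
    for candidate in (
        "http://intranet.example.com/sites/hr/docs/policy.docx",
        "http://intranet.example.com/sites/finance/budget.xlsx",
    ):
        print(candidate, under_hostname(start, candidate),
              within_site_collection(start, candidate))
```

With the site collection scope, each site collection can live in its own content source and get its own crawl schedule.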

The next section deals with Crawl Schedules.  Crawl Schedules has the new option, Enable Continuous Crawls.

“Enable continuous crawls is a crawl schedule option that is new in SharePoint 2013. It is available only for content sources that use the SharePoint sites content source type. A continuous crawl starts at set intervals. The default interval is 15 minutes, but you can set continuous crawls to occur at shorter intervals by using Windows PowerShell.”

http://technet.microsoft.com/en-us/library/jj219802.aspx

After quite a bit of digging I believe I may have found PowerShell that allows you to set the continuous crawls to shorter intervals. I say may because I haven’t tried this yet… I believe that you would change this by using Set-SPEnterpriseSearchCrawlContentSource, referencing the desired content source, and setting the -CrawlScheduleRepeatInterval to a time less than 15 minutes. I will try this and confirm that it works, and update this post with the results. The link below is to the TechNet article that covers Set-SPEnterpriseSearchCrawlContentSource…

http://technet.microsoft.com/en-us/library/ff607675.aspx

The familiar Incremental Crawl and Full Crawl scheduling options are next, both of which allow you to create a schedule. Crawl schedules require a good bit of planning and are very much dependent on the specific needs of the environment.

Last we have Content Source Priority.  Your options here are High and Normal.  The Crawl system uses this to prioritize crawling resources with High content sources being top priority.

From a Content Source’s drop down menu the View Crawl Log option is available.

sp2_6

Clicking on this will bring you to the crawl log.

sp2_7

This screen provides you with Average Crawl Duration and Summary information. This is where you go to assess the health of your content source crawls, and is going to be your first stop when you need to troubleshoot content source issues.

Note: If there are any Top Level Errors they occur only at the starting address, virtual servers or content databases. These are going to be more serious, should be addressed first, and will often greatly increase the number in the Errors column because they will be the root cause of those issues. Put another way, when on a Premier Support call a year or so ago this was the primary focus of the support engineer.

Clicking on the number of errors will bring you to this page…

sp2_8

You will be taken to this same page, filtered appropriately, if you click on Warnings or Successes.  URL View allows you to search for crawled documents.  Databases provides a list of your crawl store databases and the number of items in each.  For more information I suggest reading this TechNet article…

http://technet.microsoft.com/en-us/library/jj219611.aspx#proc3

This concludes part 2.  I will continue this series as I have time, and hope to get all the way through search, which is a massive subject, in a reasonable amount of time.

A Tour of SharePoint 2013 Search Part 1

This post covers Search after configuring the service application… see instructions for configuring the service application here…

https://sharepointv15.wordpress.com/2012/08/02/sharepoint-2013-search-service-application-configuration/

This is part one of a series of Search Posts that will cover SharePoint 2013 Search in detail.  One important note right off is that Fast Search is SharePoint 2013 Search.  Another important note is that it is very resource intensive, so you are going to want to run minimum specs or better in a production environment.

http://technet.microsoft.com/en-us/library/cc262485.aspx

Also, whatever you do, DO NOT set your virus scanner to scan search data folders in real-time and do not use dynamic RAM allocation for Search Servers. They couldn’t have been more clear about this at the conference. Another tip I picked up from the conference is that 10 million items is your threshold for needing to move from a single-server environment to one in which the search components are distributed between multiple servers.

To access the Search Service Application, click on Manage service applications under Application Management on the Central Admin home page.

From here click on your Search Service Application’s name to link to the Search Administration screen.

sp1_1

The first thing to note is that you can change your Default content access account by clicking on the account name.  This will bring this up…

sp1_2

You can change the contact email address for crawls by clicking on the current email address.

You can add a proxy server for federation by clicking on none next to Proxy server for crawling and federation… I believe this is something that you would need to set up if you wanted to set up Hybrid Search between SharePoint on-premises and SharePoint Online, but that is something I need to look into further, and set up for myself in order to confirm.

sp1_3

You can disable search alert status and query logging, and you can change your global search center URL.

If you scroll down you are presented with the Search Application Topology.

sp1_4

The Search Application Topology lists your Search Components and databases.

Starting from the top…

The Admin component runs all the system processes that search needs in order to function.  You can have multiple Admin components in your farm, but only one can be active at any given time.

The Crawl component is what is used to crawl content based off of settings stored in the crawl database/s.  You can add crawl components in order to increase crawl performance.

The Content processing component processes crawled items before passing them on to the index component.  This is where documents are parsed and properties are mapped etc…

The Analytics processing component handles search and usage analytics.

The Query processing component handles all analysis and processing of search queries and results.

Index partitions are a means by which to divide up the index.  Index partitions are stored on disk.  Index partitions collectively make up the Search Index.

Index replicas are exactly what they sound like. Really they are Index partition replicas. Each replica has an index component attached to it. Creating replicas is a means by which to achieve fault tolerance in that you have two or more replicas of an Index partition that live on different servers, so that if one server goes down that portion of the index is still available.
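The fault tolerance idea can be modeled in a few lines. This is only a toy sketch: the server names and the two-partition layout are hypothetical, and it says nothing about how SharePoint actually tracks replicas.

```python
# Hypothetical topology: index partition number -> servers hosting a replica.
TOPOLOGY = {
    0: {"SRV-A", "SRV-B"},
    1: {"SRV-A", "SRV-B"},
}


def available_partitions(topology, down_servers):
    """Return the partitions that still have a replica on a running server."""
    down = set(down_servers)
    return {part for part, servers in topology.items() if servers - down}


def index_fully_available(topology, down_servers):
    """True if every partition of the index can still be served."""
    return available_partitions(topology, down_servers) == set(topology)


if __name__ == "__main__":
    print("SRV-A down:", index_fully_available(TOPOLOGY, {"SRV-A"}))
    print("both down:", index_fully_available(TOPOLOGY, {"SRV-A", "SRV-B"}))
```

With two replicas per partition on different servers, losing one server leaves the whole index queryable; a partition with a single replica would drop out as soon as its server went down.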

The Administration database is where all of your configuration data is stored.  You will have one and only one of these.

The Analytics reporting database stores your search usage analytics results.

The Crawl database is your crawl history store and crawl operation manager.  You can have multiple crawl databases and each one can have one or more crawl components associated with it.

Finally you’ll have a Link database which stores data extracted by your content processing component as well as click-through data.

Additional components and databases must be created via PowerShell and that subject warrants its own post.

There is quite a bit of information out on TechNet, and some of the info from this post came from diagrams that can be found at…

http://technet.microsoft.com/en-us/library/cc263199#search

This ends part one… I’m hoping to have time to get the other parts of this up soon, so stay tuned.

SharePoint 2013 Search Service Application Configuration

While there have been some major changes made to Search in SharePoint 2013, the process of creating the service application really hasn’t changed much at all.

From the Central Admin home page click Manage service applications under Application Management.

In the Ribbon click New and select Search Service Application.

Name your Search Service Application and select a service account.

Next you’ll select or create an application pool for search. I’m just going to run on an existing app pool to conserve resources. If this were production I’d likely create a new app pool for both the search admin web service and the search query and site settings web service.

Click OK and wait…

Upon completion you’ll be presented with the following:

Note: the second time I configured the service app I accessed Central Admin from a computer that wasn’t part of the farm and it appeared to get hung up. I then browsed to the service applications screen, search showed up just as it does below and everything works as it should.

Stay tuned, my next search posting will cover how to configure search from this point.