Lightning FAST Enterprise Searches in SharePoint 2010 - Krasis Press

Lightning FAST Enterprise Searches in SharePoint 2010

Gustavo Vélez

© 2012 KRASIS CONSULTING, S. L. www.campusmvp.net

ALL RIGHTS RESERVED. NO PART OF THIS BOOK MAY BE REPRODUCED IN ANY FORM OR BY ANY MEANS WITHOUT PERMISSION IN WRITING FROM THE PUBLISHER.

ISBN ELECTRONIC FORMAT: 978-84-939659-1-4

Acknowledgments

In recognition and appreciation, I would like to acknowledge the people involved in making this project possible. Juan Carlos Gonzalez Martin of the Centro de Innovación en Integración (Integration and Innovation Center, CIIN, http://www.ciin.es, Cantabria, Spain) and Fabian Imaz of Siderys (http://www.siderys.com, Montevideo, Uruguay), both SharePoint MVPs, recognized SharePoint experts and amazing friends, agreed to read the text and sift out errors and inconsistencies. And Jose Manuel Alarcón, editor at Krasis, ensured the book's publication and had the patience to hear my complaints throughout the writing of the book and to wait until all problems were solved.

Gustavo Vélez

Table of Contents

ACKNOWLEDGMENTS
TABLE OF CONTENTS
PREFACE

CHAPTER 1: INTRODUCTION
1.- Search in the IT world
2.- Short history of FAST
3.- Positioning of FAST in the Microsoft stack
    3.1.- Windows, SQL, SharePoint, SCOM
    3.2.- Microsoft Search Products
4.- Some Important Documentation

CHAPTER 2: FAST IN THE CONTEXT OF SEARCH
1.- Goals of search
2.- Internet search vs. Enterprise search
3.- Search terminology and concepts
4.- FAST versions

CHAPTER 3: ARCHITECTURE AND DESIGN
1.- Conceptual Design
2.- Logical Design
3.- Physical Design
    3.1.- Extra-Small Farm
    3.2.- Medium Farm
    3.3.- Large Farm
    3.4.- Extra-Large Farm
    3.5.- Virtualization of FAST farms

CHAPTER 4: FAST REQUIREMENTS AND INSTALLATION
1.- Requirements
    1.1.- Hardware Requirements
    1.2.- Software Requirements
    1.3.- Environment preparation
2.- Manual Installation
    2.1.- Prerequisites
    2.2.- Software Installation
    2.3.- Post-Setup Configuration
        2.3.1.- Single-server FAST Post-Setup Configuration
        2.3.2.- Farm FAST Post-Setup Configuration
3.- Scripted Installation
    3.1.- Prerequisites
    3.2.- Software Installation
    3.3.- Post-Setup Configuration

CHAPTER 5: CONFIGURATION AND ADMINISTRATION
1.- Configuration
    1.1.- SharePoint 2010 Content Search Service Application
    1.2.- SharePoint 2010 Query Search Service Application
    1.3.- Certificates
        1.3.1.- Create a new Content Self-Signed Certificate
        1.3.2.- Replace a Content Self-Signed Certificate with a CA Certificate
        1.3.3.- Query Certificate
        1.3.4.- Certificate for Security Trimming
    1.4.- SharePoint 2010 Site Collection Configuration
    1.5.- Content Indexing
2.- Administration and Configuration with PowerShell
    2.1.- Administration cmdlets
    2.2.- Index Schema cmdlets
    2.3.- Installation cmdlets
    2.4.- Spell Tuning cmdlets
    2.5.- Security cmdlets
3.- Administration
    3.1.- SharePoint 2010 Central Administration
        3.1.1.- General Administration
        3.1.2.- Crawling Administration
        3.1.3.- Query Administration
        3.1.4.- Reporting and Reporting Administration
    3.2.- Administration Command-line Tools
    3.3.- Backup and Recovery
        3.3.1.- Backup and Restore Prerequisites
        3.3.2.- Configuration Backup and Restore
        3.3.3.- Full Backup and Restore
4.- Monitoring
    4.1.- FAST Logs
    4.2.- WMI for monitoring
    4.3.- Performance Counters for monitoring
    4.4.- Monitoring Command-line Tools

CHAPTER 6: USER INTERFACE
1.- FAST Search Center
2.- WebParts for SharePoint 2010
    2.1.- Search WebPart Gallery
        2.1.1.- Search Box WebPart
        2.1.2.- Core Results WebPart
        2.1.3.- Refinement Panel
    2.2.- Customizing Search WebParts
        2.2.1.- XSLT Transformations
        2.2.2.- Properties Manipulation
    2.3.- Customizing Non-sealed Search WebParts

CHAPTER 7: PROGRAMMING
1.- Working with the Search API
    1.1.- Administrating FAST Programmatically
    1.2.- Querying FAST Programmatically
        1.2.1.- The Federated Object Model
        1.2.2.- The Query Object Model
        1.2.3.- The Query WebService
    1.3.- Content API for FAST
2.- Customize The Content Pipeline with the Extensibility Stage
    2.1.- Crawled properties, Managed properties and Crawled property categories
    2.2.- Creating the Logic of the Pipeline Extension
    2.3.- Configuration of the Pipeline Stage
3.- Adding custom Refinement Panels
4.- Building Search WebParts providing FQL capabilities

INDEX

Preface

FAST is the Enterprise Search solution from Microsoft and it is quickly taking on a very important role in the company's enterprise server offering. With its integration in the SharePoint 2010 family, FAST offers a scalable, flexible and powerful search server that not only contends with similar commercial software but can pick up the gauntlet and easily surpass any other product.

This book is aimed at technical audiences who need to design, install, configure and customize a FAST Search implementation. More general themes are handled in the first chapters: wide-ranging information about search, the past and future of search, a short history of FAST and explanations of the very specific definitions and concepts used by search engines. Because search is intimately related to human linguistics and to how people organize information, special attention has been given to how the internal algorithms can be interpreted from an information technology perspective rather than from a purely technical point of view.

Installation and configuration are covered in the following chapters. Although the installation procedure follows the traditional friendly installation routines of all Microsoft products, there are some important aspects that must be taken into consideration, especially for an enterprise FAST farm. The different configuration options (SharePoint Central Administration, SharePoint Site Collection Administration, FAST Object Model and FAST PowerShell console) are reviewed to explain the several ways available to adapt the system to enterprise requirements.

Finally, the default Search User Interface is assessed. Although the SharePoint Search WebParts can be used by both SharePoint Enterprise Search and FAST, the different WebParts are analyzed and their configuration and customization possibilities are described, because they are the main components that day-to-day users will experience.

Customization of the core search engine is one of the points that make FAST different from the SharePoint Enterprise Search engine. In the current FAST version most customizations are made by modifying XML files, but some programming is allowed, and sometimes indispensable, to ensure FAST behaves as required. The last chapter of the book deals with programming and customizing the engine and is mainly aimed at developers.

All in all, the book offers a 360-degree view of FAST and is intended as a reference work for those who are curious about FAST and for those who must deal with the server for the first time.

And remember: if you cannot find it, it doesn't exist...

Gustavo Vélez

CHAPTER 1: INTRODUCTION

Wikipedia defines "Search" as "software for finding information": a short, concise definition of something that is becoming indispensable in our information-driven society, namely how to discover the necessary data and distinguish relevant from irrelevant material.

Search as an IT technology is currently one of the most important components of every information system. Because computer systems are able to generate huge amounts of information, every day it becomes more difficult (and expensive) to reach the appropriate information. Search technologies enable us to work in a smarter way, reusing the data that already exists.

On the other hand, our society is also becoming more and more addicted to and dependent on information search technologies, making the knowledge society reliant on search services and their quality in order to work correctly. In other words, if you cannot find it, it doesn't exist.

1.- SEARCH IN THE IT WORLD

Search is not a new issue in the IT world. Ever since computers began saving data electronically, it has been necessary to get the correct records back. Theoretical work started as early as 1945, when Vannevar Bush, Director of the Office of Scientific Research and Development in the United States, stressed at the end of the Second World War the necessity of creating an information device (which he called a "Memex") to provide a flexible, associative storage and retrieval system without limits.

Gerard Salton, of Harvard and later Cornell University, is considered the father of modern search technologies. After the publication of his book "A Theory of Indexing", where basic concepts such as Document Frequency, Term Frequency, Term Discrimination and Relevancy were defined, the mathematical and theoretical foundations for search algorithms were established.

With the creation of ARPANet in 1969 and the start of the Internet as we currently know it in 1993, the need for a search mechanism became urgent. In 1990 the first search engine was created: Archie, written by Alan Emtage, a student at McGill University in Montreal. Archie was merely a script-based data collector that used regular expressions to retrieve file names matching user queries.

Because Archie was a big success, new search systems started to appear to fill the gaps it left. Veronica, created at the University of Nevada, offered the same functionality as Archie and was also able to index the content of plain text files. Shortly afterwards Jughead, a clone of Veronica with a more advanced user interface, was created. At this time Gopher and FTP were the main transfer protocols used and ARPANet was principally an academic initiative.

On August 6, 1991, Tim Berners-Lee at CERN created the first page using the WWW protocol; at the same time, the Virtual Library (http://vlib.org/, still in existence), the first and oldest catalog of websites, went online. Very soon the first crawlers were implemented, and in June 1993 Matthew Gray presented the "World Wide Web Wanderer", initially intended to measure the number of active web servers but soon becoming "Wandex", the first database created to capture URLs.

By the beginning of 1994 the Internet had three search engines: the "World Wide Web Worm", JumpStation and RBSE (the "Repository Based Software Engineering" spider). The only one that had a ranking mechanism was RBSE. The other two listed their results as they were found, without any discrimination, which made them impossible to use once the WWW grew exponentially. Excite was also born in 1993, the first search engine that used statistical analysis and word relationships to improve the search mechanism. Excite was a huge success and was sold in 1999 for $6.5 billion (and sold again in 2001 for $10 million, after the Internet crash).

1994 also saw the birth of Yahoo! (David Filo and Jerry Yang), initially as a collection of web pages; shortly after its creation it made the jump to commercialization in the model that we know today. Lycos, HotBot and AltaVista went online between 1994 and 1996, making the change from web page catalogues to crawled search mechanisms and introducing new technologies such as natural language queries. All of these search engines eventually became irrelevant for technical, financial and management reasons.

Finally, in 1998 Larry Page and Sergey Brin launched Google at Stanford University, based on their earlier work, BackRub. The same year Microsoft put MSN Search online, and in 2009 Microsoft announced Bing, using its own search technologies (MSN Search was based mainly on Yahoo!, Overture, Looksmart and Inktomi).

Although web search is very important, enterprise search occupies a prominent role in the search market. Currently all the big software companies (IBM, Google, EMC, SAP and, of course, Microsoft) have one or more enterprise search offerings. The similarities and differences between web and enterprise search are analyzed in more technical detail in the second chapter.

Seen from the users' perspective, search seems to be a static world, but from the technical and business perspective it is very dynamic. On the web search front, the battle between Google and Bing is becoming legendary: the underdog against the huge establishment. On the enterprise search front, the roles are more evenly distributed, with FAST gaining momentum, especially because of the growing influence of Microsoft in the business world.

Technically the future is completely open. Currently search is still essentially about finding primary topics or noun phrases: a person's name, a city, a product and so on. The future of search should be about finding verbs, which Microsoft calls the "decision engine" (as opposed to the "search engine"): search will try to give the user the knowledge to complete tasks by doing the initial computational discernment automatically.

New classes of information are also becoming more important: social network data, for example, or location data, and the interconnection between all layers of information. Currently a user normally searches for a term or a number of terms, such as "fast search", and the result is a mash-up of information that has something to do with "fast" (any kind of fast: fast food, fast cars, FAST search) and "search". As search engines become more "intelligent", they should add other layers of information, for example the kind of user ("user is an IT pro") and his current geographical location ("user is at the office"), and filter the results to show a much more consistent and usable set of information. Additionally, the search engine could prepare the information in report form, delivering it directly in Word format, for example. The search engine should become part intelligent software and part assistant, and less of a mere information reader. The progression of search is from mere data to useful information to knowledge that answers questions.

2.- SHORT HISTORY OF FAST

Until Google provoked a landslide in the search world in 2002, Microsoft was not really aware of the importance of search for the IT industry. Until then, Microsoft used different third-party technologies for web search and had one "enterprise" search engine used locally for Windows and for some of its servers (namely Search Server for SharePoint). Gartner's Magic Quadrant for Information Access reflects this position in its 2006 report, as Figure 01 shows: Microsoft is nowhere to be found in the diagram.


Figure 01.- Gartner Magic Quadrant for Information Access 2006

In 2006 Microsoft stepped up the company's search strategy for the next few years, announcing that search would be of vital importance for the company and all its servers. Three years later, the Gartner Magic Quadrant would show a very different panorama, as shown in Figure 02: Microsoft is in the most important part of the diagram, the "Leaders" quadrant. That was possible thanks to the acquisition of FAST in 2008.

Figure 02.- Gartner Magic Quadrant for Information Access 2009

From that date until the present day, Google and Microsoft FAST have remained in approximately the same position in the quadrants. Google remains the leading player in the web search market, and Microsoft is very busy converting all the constituent base technologies originally used by FAST to Microsoft technologies and integrating FAST into the Microsoft stack, principally SharePoint.

FAST was originally a Norwegian company focused on enterprise data search technologies and their application. Microsoft bought the company on April 24, 2008. FAST was born at the desks of the Department of Computer and Information Science of the Norwegian University of Science and Technology (NTNU) in 1997 and launched the first version of its engine in 1999. Initially FAST had versions for web and enterprise search, but in 2003 the company decided to focus exclusively on enterprise search.

At the beginning of 2004 FAST launched the FAST Enterprise Search Platform (FAST ESP). The next year FAST established its reputation in the enterprise search world as probably the best and technologically most advanced engine on the market, and FAST appeared in the "Leaders" quadrant of the Gartner Magic Quadrant for Information Access Technologies for a number of years in a row. Nevertheless, FAST was almost never financially profitable and legal problems troubled the company continuously, culminating in the suspension of trading of FAST shares on the Oslo Stock Exchange in December 2007.

On January 8, 2008, Microsoft announced the acquisition of FAST Search & Transfer for $1.2 billion, creating a separate division in the company to house FAST.

FAST ESP was probably the technological leader among enterprise search engines, offering Contextual Insight (a group of technologies that add linguistic and statistical analytics to improve search precision), semantic indexes (to recognize and retain the inherent structure of documents), entity metadata, taxonomic navigation, faceted browsing and entity discovery (to extract textual entities from the results of previous searches), among other advances.

Originally, FAST ESP was a platform-agnostic system: it could be installed on Windows, Unix and Linux systems, both 32-bit and 64-bit, and it was written in Java, PHP and Python. It had its own proprietary administration interface, user interface, alerting system, connector mechanism and various other subsystems, but it was possible to integrate the query and results in SharePoint 2007 using WebParts. Since FAST was bought by Microsoft, the main change in the server has been the attempt to integrate its code base with the Microsoft toolset and make it work smoothly with the rest of the Microsoft stack. That means Java and Python code has been changed to Microsoft .NET compatible technologies, SQL is used extensively and SharePoint 2010 is becoming the default interface.

3.- POSITIONING OF FAST IN THE MICROSOFT STACK

Although Microsoft currently has different search engines and versions, FAST is its most powerful engine and the preferred offering for the enterprise. As with each of Microsoft's enterprise servers, FAST is part of an ecosystem and cannot work as a stand-alone product. FAST relies on Windows as its Operating System, SQL Server as its repository mechanism and SharePoint as its user and administration web interface. Besides that, products such as Microsoft System Center Operations Manager (SCOM) can be used to control the availability, performance, configuration and security of FAST, and Microsoft Forefront Threat Management Gateway (Forefront TMG) would be necessary to protect FAST from outside threats. Other Microsoft products such as IIS could be necessary as underlying services for one or more of the servers.

3.1.- Windows, SQL, SharePoint, SCOM

Originally, FAST was developed as a platform-agnostic system that could be installed on Windows, Unix or Linux systems. Being the key Microsoft search technology means that it must be specifically targeted at Windows, namely Windows Server 2008 (64-bit) and up. FAST 2010 can be installed only on Windows as its Operating System (FAST ESP is considered legacy software and is no longer supported). FAST doesn't demand special conditions of the Operating System; the requirements are more hardware oriented, as will be explained in the design chapter.

FAST requires a SQL Server to maintain its configuration information. SQL Server 2008 and up (64-bit) can be used, and FAST should require only a modest part of the database server's performance and capacity. The data necessary for indexing is not stored in the database.

SharePoint 2010 is the user interface and administration interface of FAST and is therefore necessary to run FAST properly; but FAST is independent of SharePoint and could (and should) be installed on separate servers. Both standalone and farm installations of SharePoint 2010 can be used, and an Enterprise license of SharePoint 2010 is indispensable. If document preview is desired, to see thumbnails of Microsoft Office Word and PowerPoint documents in the search results from FAST, Microsoft Office Web Apps must be installed on the SharePoint servers.

SCOM is not required for the normal working of FAST, SQL or SharePoint, but as Microsoft's strategic operations management product, SCOM is the recommended monitoring solution for FAST. FAST supports a number of monitoring services that provide data through standardized Windows interfaces; SCOM can consume this data, giving the required protection from the operations perspective.

3.2.- Microsoft Search Products

As of 2011, Microsoft has a variety of search offerings, ranging from the low-cost, low-functionality Search Server Express to the high-end FAST:

Microsoft SharePoint Foundation 2010 Search. Integrated in SharePoint

Foundation 2010 allows search scoped to single SharePoint Site

Collections and it cannot crawl external data sources. It has no

Introduction 17

administration User Interface and all the configurations happen

automatically. Scales to approximately 10 million items (using SQL

Server) for each search server

Microsoft Search Server Express. Free product that allows search over

enterprise content. Can crawl external data sources (web sites, file shares,

Exchange, Lotus Notes) and can federate query results from any

OpenSearch system. Deployment is limited to one server and can use SQL

Server Express (300.000 search items) or SQL Server (10 million search

items)

Microsoft Search Server 2010. Provides almost the same search

functionality of Microsoft SharePoint Server 2010 and can be deployed

across multiple servers for redundancy and increase of capacity and

performance. Supports multiple crawl servers and query servers and scales

to approximately 100 million items

Microsoft SharePoint Server 2010. The search engine embedded in

SharePoint Server 2010, making use of all social networking and managed

taxonomy features of SharePoint: indexing of people Profile database,

search in MySites, takes advantage of user-generated tags, managed

taxonomy to influence ranking, etc. Scales to approximately 100 million

items and can be installed in multiple servers and be used in multi-tenant

hosting environments

Microsoft FAST Search Server 2010 for SharePoint. Includes all the search features of the other Microsoft search systems (except the Social features of SharePoint 2010) and adds almost unlimited scalability and performance. Content processing is much more flexible and customizable. FAST is available in three different versions:

o FS4SP: FAST Search for SharePoint 2010, the version packaged

for the SharePoint 2010 environment

o FSIA: FAST Search for Internal Applications aimed at

organizations that must crawl internal content

o FSIS: FAST Search for Internet Sites that allows crawling of

online information

It is important to note that the differences between the versions are merely a licensing matter; the kernel engine and the functionality are the same in all versions. FAST ESP is considered legacy software and is no longer available, but customers who currently have Maintenance & Support contracts can upgrade to either FSIA or FSIS.


Although little is known about the technology behind Bing, the web search engine of Microsoft, it is indisputable that some aspects of Bing are directly related to FAST. MSN Search (and later Windows Live Search), the web search engine before Bing, used a mix of technologies from AltaVista, Yahoo! and Inktomi. Bing offers query suggestions and related searches based on semantic technology from Powerset, a company purchased by Microsoft in 2008, but its search algorithms are proprietary and closely guarded.

A difficult question is always which is the right choice: SharePoint Search or FAST Search? The answer is always specific to the data landscape of the organization and to its user and functionality needs. Two quick differentiators between the products are the required search capacity and the need for customization. SharePoint Search has a theoretical limit of 100 million search items, but the practical limit is likely to be much lower. The theoretical limit of FAST is 500 million items, but with the right hardware and topology it can go beyond this figure.

Customization is the second criterion, and it may be the most important. The search engine of SharePoint cannot be modified or adapted, which means that adapting the ranking mechanism or the indexing and querying tools is not possible. FAST allows many customizations, making it much more flexible and adaptable to enterprise requirements.

In any case, choosing between SharePoint Search and FAST must follow the indispensable steps of any design process: gain a full understanding of the business requirements, understand the data in terms of quality, format and volume, and capture the search needs of the customer. The analysis of these factors should indicate the right technology to use and provide estimates of the costs, the risks and the cost of ownership.

4.- SOME IMPORTANT DOCUMENTATION

Information about FAST is becoming better and more accessible. Because of the closed nature of FAST ESP, the FAST version that existed before Microsoft bought the company, it was nearly impossible to find any kind of information about installation, configuration, programming or use. Since the release of the latest version together with SharePoint 2010, the flow of information from Microsoft has improved considerably.

The following list of Microsoft documents is short, but it covers the most important information that Microsoft has published about FAST for SharePoint 2010. The list is limited to official Microsoft information, but more and more independent information is slowly appearing on the Internet from other sources.

FAST Software Development Kit (SDK) – Probably the most important technical document about FAST. Several parts of this book are based on the information provided by the SDK, especially the configuration chapter. The SDK can be found online in the TechNet Library (http://technet.microsoft.com/en-us/library/ee781286.aspx)

Microsoft FAST Search Server 2010 for SharePoint Enterprise Search Evaluation Guide - This evaluation guide is designed to give business decision makers

and IT professionals an understanding of the design goals and the details of the

enterprise search features provided by Microsoft FAST Search Server 2010 for

SharePoint

(http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=24972)

FAST Search Server 2010 for SharePoint Capacity Planning - This white paper describes the performance and capacity impacts in relation to FAST Search Server 2010 for SharePoint. It includes information about the performance and capacity characteristics of the feature and how it was tested by Microsoft (http://www.microsoft.com/downloads/details.aspx?FamilyID=65B799E3-825C-4398-8CD7-3311D3297997&displaylang=e&displaylang=en)

Download Microsoft FAST Search Server 2010 for SharePoint Trial – 120-day trial version of FAST (http://technet.microsoft.com/en-us/evalcenter/ee424282)


CHAPTER 2: FAST IN THE CONTEXT OF SEARCH

Search is intrinsic to human nature: humans search continuously and, as a consequence, the concept of search is intuitively recognized. The term search relates to the process of finding solutions to as yet unsolved problems. In computing, search is used almost as generally as in the human context: each algorithm searches for the completion of a given task.

1.- GOALS OF SEARCH

Search has been an important part of computing since its very beginning, as the core technique for solving problems. In general, search can be applied to many problems, from solving games (chess has an estimated search space of about 10^44 possibilities, yet the "Deep Fritz" chess program, evaluating about 10 million options per second, was able to find the moves needed to win against the human world champion in 2006) to the many industrial route planning systems that use search to answer shortest- and quickest-route queries in a fraction of the time needed by other algorithms.

Search algorithms can be used to optimally solve sequence alignment problems in biology, to guide industrial robots in unknown environments or to find bugs in software (using search patterns very similar to those used to find the most successful strategy to win at chess). In summary, search and search algorithms are used extensively in the real world, although the scope of this book is limited to searching for information in the search space of data stored on computers.


2.- INTERNET SEARCH VS. ENTERPRISE SEARCH

For information search purposes, a division has traditionally been made between Internet search and Database search. The difference lies mainly in unstructured information (information whose parts have no formal relationship to each other) versus information of a more relational character. Lately, however, it has become clear that both concepts are merging into something more general: Enterprise search.

Internet search is designed to go across web pages and documents, looking for new and changed information and indexing everything it can find. The engines for this kind of search follow a process (a "Pipeline") that goes from crawling, which discovers the sources of information, through indexing their content in a structured way (in a database, XML files or any other form), to the mechanism that resolves the user queries and delivers the results.
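
The three stages of such a pipeline can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the SOURCES dictionary stands in for real web pages or documents, the function names are hypothetical, and the index is a plain in-memory inverted index rather than the structure used by any real engine.

```python
from collections import defaultdict

# Hypothetical in-memory "sources"; a real crawler would fetch these.
SOURCES = {
    "doc1": "FAST is an enterprise search engine",
    "doc2": "SharePoint search crawls web pages",
    "doc3": "Enterprise search indexes documents and web pages",
}

def crawl(sources):
    """Crawling stage: discover each source and yield its content."""
    for doc_id, text in sources.items():
        yield doc_id, text

def build_index(crawled):
    """Indexing stage: map each lowercased word to the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in crawled:
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def query(index, terms):
    """Query stage: return the documents that contain every query term."""
    result = None
    for term in terms.lower().split():
        matches = index.get(term, set())
        result = matches if result is None else result & matches
    return sorted(result or [])

index = build_index(crawl(SOURCES))
print(query(index, "enterprise search"))  # ['doc1', 'doc3']
```

Keeping the stages as separate functions mirrors the pipeline idea: each stage only consumes the output of the previous one.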

Database search is an integrated mechanism of most modern databases (see for example the "Full-Text Search" functionality of Microsoft SQL Server). Technically speaking, the search mechanism also needs an index, built on one or more columns of a table. Databases that allow this type of search also provide language-specific linguistic components, including word breakers, stemmers and thesaurus files, allowing the use of queries with full-text predicates such as "contains" or "freetext", so that the user can perform a variety of searches (for a single word or for a phrase, for example).

Most of the search products currently in use are large Internet search engines optimized to crawl web pages and documents while using the capabilities of database search engines; these are the Enterprise search engines. They can search both structured and unstructured data sources: web pages and documents are discovered, crawled and indexed in separate indexes. Search results are generated on the fly for the users by querying the indexes in parallel and organizing the results according to predetermined rules.

One further difference between web page search engines (like Bing or Google) and document-oriented search engines lies in the capabilities and speed of document indexing: the second type is built and optimized to open and understand the structure of documents natively (using iFilters, for example). On the other hand, the crawlers of Internet search engines differ considerably from those of document search engines, because they must travel across a completely different environment (IP addresses and www technologies).

A main point of distinction is the Relevance of search results. The techniques applied within the context of Enterprise search are very different from those applied to Internet search. Enterprise search cannot take advantage of the very rich link structure found in the hyperlinked content of the www. Algorithms that exploit the hyperlink structure to build rankings are therefore better suited to Internet search, whereas Enterprise search relies on query-independent factors such as document date or popularity.


3.- SEARCH TERMINOLOGY AND CONCEPTS

Search engines have their own concepts and terminology. Because search has a very strong language component, almost all the terminology is borrowed from the fields of linguistics and philology.

Authoritative Page – Page designated as more relevant than other pages (for example the home page of the intranet of an organization). The higher the assigned authority level, the higher the ranking of the page in the search results.

Best Bets – Hand-picked results associated with the keywords of common queries, which can dramatically improve the search experience, particularly on information-rich sites such as intranets. Best Bets are presented prominently at the beginning of the search results, followed by the rest of the matching pages. Implementing Best Bets is an effective way to improve the quality of search results.

Content Source – A set of options that describes a specific body of content to be crawled, including its start addresses. A Content Source for SharePoint can contain up to 500 start addresses.

Crawl – Crawling is the methodical, automated way in which search engines find information. Crawlers are the computer programs in charge of the crawling. Crawlers mainly create a copy of all the visited web pages and documents for later processing by the search engine, which indexes them to serve searches. Crawlers are also often used to automate other tasks, such as the validation of links and source code (HTML), and to gather specific types of information, such as e-mail addresses. The crawlers are responsible for the freshness of the information that the search engine can use. Because crawling a huge amount of information can take weeks or months, many events (creation, update or deletion of information) may have happened by the time a crawler has finished its crawl; for the search engine there is a cost associated with not detecting these events and holding outdated copies of the information.

Crawlers can also have an impact on the performance of the servers that hold the information: if the crawlers request huge numbers of web pages or documents from a system, they can cripple the performance of its servers.

Generally speaking, the architecture of web search crawlers is largely undocumented (Yahoo! Slurp, the Yahoo Search crawler, Googlebot from Google or Bingbot from Bing), but the crawlers for Enterprise search are well documented (the FAST crawler, for example). Crawl Queue refers to the data structure that stores the list of items to be crawled. A Crawl Rule is a set of preferences that applies to a specific Content Source and is used to include or exclude items in a crawl. A Crawled Property is a type of metadata that can be discovered during a crawl, applied to one or more items and promoted to a Managed Property. A Managed Property is a specific property in the metadata schema that can be made available for queries.
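
As a rough illustration of how a crawl queue and crawl rules interact, the sketch below walks a small link graph. Everything in it is an assumption made for the example: the LINKS graph replaces real content, the rules are simple wildcard patterns where the first match wins, and the function names are hypothetical. It is not how the FAST or SharePoint crawlers are implemented.

```python
from collections import deque
from fnmatch import fnmatchcase

# Hypothetical link graph standing in for real content; keys are addresses.
LINKS = {
    "http://intranet/": ["http://intranet/docs/", "http://intranet/private/hr"],
    "http://intranet/docs/": ["http://intranet/docs/a.docx", "http://intranet/"],
    "http://intranet/docs/a.docx": [],
    "http://intranet/private/hr": [],
}

# Crawl rules as (pattern, include?) pairs; the first matching rule wins.
RULES = [("http://intranet/private/*", False), ("http://intranet/*", True)]

def allowed(url, rules):
    """Return True when the first rule matching the address includes it."""
    for pattern, include in rules:
        if fnmatchcase(url, pattern):
            return include
    return False  # addresses not covered by any rule are skipped

def crawl(start_addresses, links, rules):
    """Breadth-first crawl driven by a queue of addresses still to visit."""
    queue = deque(start_addresses)  # the crawl queue
    seen = set(start_addresses)
    crawled = []
    while queue:
        url = queue.popleft()
        if not allowed(url, rules):
            continue
        crawled.append(url)
        for target in links.get(url, []):
            if target not in seen:  # never queue the same address twice
                seen.add(target)
                queue.append(target)
    return crawled

print(crawl(["http://intranet/"], LINKS, RULES))
```

The excluded address under /private/ is discovered and queued, but the rule check keeps it out of the crawled set.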

Duplicate and Duplicate Result Removal – Refers to identical or near identical

content that should be removed from the search results.


Entity Extraction – Seeks to locate and classify elements in text into predefined categories such as names of people, organizations, locations, expressions of time, quantities, monetary values, percentages, etc.

Faceted Search – A filtering technique for accessing collections of information that are classified along shared categories. It allows users to narrow down the search results. Also known as Navigators or Refiners.
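
A minimal sketch of the idea, assuming each search result carries a few metadata properties: the refiner counts show how many results fall under each value, and choosing a value narrows the result set. The RESULTS data and the function names are invented for the example.

```python
from collections import Counter

# Hypothetical search results with metadata properties.
RESULTS = [
    {"title": "Budget 2011", "type": "xlsx", "author": "Ana"},
    {"title": "Budget notes", "type": "docx", "author": "Ana"},
    {"title": "Budget review", "type": "docx", "author": "Luis"},
]

def refiners(results, prop):
    """Count how many results fall under each value of a property."""
    return Counter(r[prop] for r in results)

def refine(results, prop, value):
    """Narrow the result set to the items with the chosen value."""
    return [r for r in results if r[prop] == value]

print(refiners(RESULTS, "type"))
print([r["title"] for r in refine(RESULTS, "author", "Ana")])
```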

Federation – Allows simultaneous search of multiple searchable resources. A

Federation establishes a collaborative link between different search systems,

allowing the systems to query other search engines without the necessity to maintain

indexes of the external systems, arranging the results from the various sources into a

useful form and presenting them to the user. When the search data model of the search

system is different from the data model of the foreign target system, the query must be

first translated and the users' credentials must be passed to maintain the appropriate

security. On the return side, the results need to be mapped back from the foreign

system to the search engine form to be rendered to the user. Scalability and

performance are always a source of concern in Federation: the query performance and

results quality are totally dependent on the foreign search engine.

High Confidence – A Managed Property identified as a good indicator of a highly relevant item.

iFilter – An iFilter is a translator that teaches the search engine the structure of

documents to be indexed. Without an appropriate iFilter, contents of a file cannot be

parsed and indexed by the search engine. Windows Indexing Service, MSN Desktop

Search, Internet Information Server, SharePoint Server, Site Server, Exchange Server,

SQL Server and all other products based on Microsoft Search technology support

indexing technology based on iFilters.

Index – Indexing is the process of extracting information from the original data source and saving it in a format that the search engine can understand. The index is structured in such a way that the engine can quickly find the information that contains a particular term. Indexing can be a complex process that uses a lot of the resources of the search servers. During indexing, not only are the constituent words of the source extracted, but the language, the boundaries of sentences and paragraphs, changes in case and the stems of the words are also determined. Normally the indexing process runs continuously so that the complete index is refreshed frequently. For Internet search it is usual to limit the amount of information indexed for each page, with an algorithm deciding which sections of the page are relevant enough to be indexed (to prevent overloading web servers that host very large pages such as technical manuals). For Document search, by contrast, it is important to index as much information as possible, and normally the limit (if it exists) is very high.


Information Extraction – The study that attempts to identify semantic structures in

order to excerpt relevant data. It describes the techniques to develop systems to index

and search vast amounts of data effectively. The goal is to automatically extract

structured information from unstructured documents.

Inverse Document Frequency (IDF) – A measure of how rare a term is in a collection of documents, calculated from the total collection size divided by the number of documents containing the term (usually on a logarithmic scale). Common terms ("the", "and" etc.) have a very low IDF and are often excluded from searches. These low IDF words are commonly referred to as "stop words".
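
The calculation can be sketched as follows. The logarithm is the common textbook variant and is an assumption here, as are the tiny DOCS collection and the choice to return 0.0 for terms that appear nowhere; real engines use their own variations.

```python
import math

# Hypothetical four-document collection.
DOCS = [
    "the fast crawler",
    "the search engine",
    "the index and the crawler",
    "ranking the results",
]

def idf(term, docs):
    """Logarithm of collection size divided by documents containing the term."""
    containing = sum(1 for d in docs if term in d.split())
    if containing == 0:
        return 0.0  # convention chosen here for terms that never occur
    return math.log(len(docs) / containing)

for term in ("the", "crawler", "ranking"):
    print(term, round(idf(term, DOCS), 3))
```

A term such as "the", present in every document, gets an IDF of zero, while a term found in a single document gets the highest value in the collection.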

Keyword – A word used in a query. In web search, Keywords are targeted based on what users are looking for in the HTML of the pages. In Enterprise search, Keywords can be configured to target terms that are relevant to the specific company.

Keyword Density – The percentage of the words in a document that are keywords, relative to the total number of words present. The ranking is based on (amongst many things) the percentage of words on a page that are similar to the words used in the query.

Latent Semantic Indexing (LSI) – Also known as Latent Semantic Analysis. An indexing technique that moves a search engine from purely lexical matching to semantic matching. It uses a mathematical technique (Singular Value Decomposition) to identify patterns and relations between the terms contained in a text. In this way it is possible for a query to return results that do not contain the keywords searched for. Search engines are moving towards LSI to provide results that are closer to human judgment.

Lemmatization – The process of grouping the different forms of a word so that they can be analyzed as a single item. A lemmatization algorithm determines the "lemma" of a word; that is, it takes into account the context of the word and its role in the sentence. Following the example given for Stemming, "playing", "player" and "play" should also be lemmatized to the lemma "play". The difference with Stemming is that the stemmer has no knowledge of the context of the word in the sentence and therefore cannot discriminate between words that have different meanings depending on their position or use in the sentence. To take a different example, the word "better" has "good" as its lemma but a completely different stem. Lemmatization can be very difficult to implement because it is not only language-dependent but also culture-dependent (a lemma can differ between countries that share the same language).

Link Map – A Link Map is a graph structure of the nodes connected by links in

Internet search. The map facilitates the fast access to the data, the popularity score of

the page and the ranking algorithm.


Natural Language Processing (NLP) – A system that allows search engine users to type a question rather than keywords. At the simplest level, this can be achieved by having the search engine remove the stop words from the question, leaving keywords that are then processed as a regular query. At the other end of the scale are advanced systems that use statistics and linguistic analysis to accurately match the available indexes to the user's question.

Partial Word Matching – Some search engines will consider not only exact

matches, but also partial matches. This means that if the search term is contained

within a word in a document in its index, the search engine considers the document a

match. Strongly related to lemmatization and Stemming.

Phrase Search – A type of search that allows users to search for documents containing an exact sentence or phrase, rather than single keywords. The important point here is that in a phrase search the words have to appear side by side in the document (exactly as in the query) to be considered a match. If the words appear dispersed, or side by side but in the wrong sequence, it is not considered a match. Phrase searching can be done on most search engines by simply enclosing the phrase in quotation marks. Anti-phrasing refers to phrases for which there is no value in indexing (for example “What xxx means”).
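
The matching rule for a phrase can be illustrated with a short sketch. It assumes whitespace tokenization and case-insensitive comparison, and the function name is made up for the example; a real engine would use word positions stored in the index instead of scanning the text.

```python
def phrase_match(document, phrase):
    """True when the words of the phrase appear side by side, in order."""
    words = document.lower().split()
    target = phrase.lower().split()
    n = len(target)
    if n == 0:
        return False
    # Slide a window of n words over the document and compare it to the phrase.
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))

text = "FAST Search Server for SharePoint"
print(phrase_match(text, "search server"))  # True: adjacent and in order
print(phrase_match(text, "server search"))  # False: wrong sequence
print(phrase_match(text, "FAST Server"))    # False: words are dispersed
```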

Pipeline – Specially tailored FAST architecture to address the challenges of

flexibility versus the inherent shortcomings of any search engine. The FAST Pipeline

(format conversion, language detection, stemming, entity extraction, lemmatization)

allows the introduction of custom plug-ins (stages) to enrich the data to be indexed; for

example, the entity extractor can be programmed to recognize entities that are

important to an organization.

Polysemy – One word can have several meanings. Language-dependent and very difficult to address in algorithms.

Precision and Recall – Strongly related to search accuracy, a simple metric that computes the fraction of instances for which the correct result is returned. Search engines often consider a document a match for a query when that document is not really relevant to the query. These mistakes happen because search engines must guess what the user means. Search engines must find a balance between recall (the ability to find all relevant documents) and precision (the ability to find only relevant documents). The aim is to retrieve all relevant documents and nothing else. Precision is calculated by dividing the number of relevant pages found by the total number of pages found. For example, if 100 documents are found in a collection of 1000 documents and 60 of them are relevant, the search engine's precision is 60/100 = 60%. In the same example, if the document collection contains 70 relevant documents but only 60 were found, the Recall is 60/70, approximately 86%.
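
Using the usual definitions (precision is the number of relevant documents found divided by the total number found; recall is the number of relevant documents found divided by the total number of relevant documents in the collection), the figures of the example can be checked with a short sketch. The function names are chosen for the illustration only.

```python
def precision(relevant_found, total_found):
    """Fraction of the returned documents that are relevant."""
    return relevant_found / total_found

def recall(relevant_found, total_relevant):
    """Fraction of all relevant documents that were returned."""
    return relevant_found / total_relevant

# 100 documents found, 60 of them relevant, 70 relevant in the collection.
print(f"precision = {precision(60, 100):.0%}")  # precision = 60%
print(f"recall    = {recall(60, 70):.0%}")      # recall    = 86%
```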


Promotion / Demotion – Promotion means moving a search result to the top of the results ranking; Demotion is the opposite. Enterprise search engines always offer a configuration for the Promotion and Demotion of terms. Internet search engines have many different security mechanisms to prevent users from promoting or demoting sites illegitimately.

Property Extraction – Allows the extraction of language-specific properties for

names (locations, company, people).

Ranking – The ordering of the search results by Relevance, so that the most relevant ones come first. Relevance ranking mainly refers to the different features and algorithms used to estimate the weight of documents and to sort them appropriately. The most basic retrieval function is a Boolean query on the occurrence of terms in the information. Given a query “word1 word2”, the Boolean AND query returns all documents containing both word1 and word2 at least once. These documents represent the set of potentially relevant documents: all documents not in this set can be considered irrelevant and ignored. This step usually reduces the number of documents to be considered for ranking, but it does not order the documents in the result set. After that, each document needs to be “scored”: the document’s relevance must be estimated as a function of its relevance features. Contemporary search engines use hundreds of features as parameters to estimate the Ranking.
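
The two steps, Boolean filtering followed by scoring, can be sketched as below. The scoring function is a deliberate simplification (it only counts occurrences of the query terms, a single feature where real engines combine hundreds), and the DOCS collection and function names are assumptions made for the example.

```python
# Hypothetical document collection.
DOCS = {
    "d1": "fast search server for sharepoint search",
    "d2": "sharepoint server farm",
    "d3": "search in sharepoint with fast search and more search",
}

def boolean_and(docs, query):
    """Step 1: keep only the documents that contain every query term."""
    terms = query.lower().split()
    return [doc_id for doc_id, text in docs.items()
            if all(t in text.split() for t in terms)]

def score(text, query):
    """Step 2: a single relevance feature, the count of query term occurrences."""
    words = text.split()
    return sum(words.count(t) for t in query.lower().split())

def rank(docs, query):
    """Filter with the Boolean AND, then order the candidates by score."""
    candidates = boolean_and(docs, query)
    return sorted(candidates, key=lambda d: score(docs[d], query), reverse=True)

print(boolean_and(DOCS, "fast search"))  # ['d1', 'd3']
print(rank(DOCS, "fast search"))         # ['d3', 'd1']
```

The Boolean step drops d2 without ordering anything; only the scoring step puts d3 ahead of d1.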

Relevance – How closely the search results that are returned to the user match what the user wanted to find. Ideally, the results that are returned at the top are the most relevant, so that users do not have to look through several pages of results to find the best matches for their search. In other words, Relevance describes how well a given search satisfies a user’s information needs. The problem that search relevance addresses is estimating how pertinent a result is to a query. Commercial search engines combine hundreds of features to estimate relevance. The specific features and the way they are combined are often kept secret to prevent users from manipulating the results. Nevertheless, the main types of features in use, as well as the methods for their combination, are publicly known and are the subject of scientific investigation.

Spelling Suggestions – ("Did you mean"). Typing mistakes are very common when users enter search terms. The linguistic capabilities of modern search engines allow the detection of these mistakes and the suggestion of related terms, improving the quality of searches. Spell checking exceptions can also be defined in FAST: these are words that are not found in the default spell checking dictionary but that are still valid.
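
A "Did you mean" mechanism with an exception list can be approximated with the standard Python difflib module. This sketch is not the FAST spell checker: the dictionary, the exception list, the cutoff value and the function name are all assumptions chosen for the illustration.

```python
import difflib

# Hypothetical spell checking dictionary and exception list.
DICTIONARY = ["search", "server", "sharepoint", "crawler", "index"]
EXCEPTIONS = {"fs4sp"}  # valid terms that are missing from the dictionary

def suggest(term, dictionary=DICTIONARY, exceptions=EXCEPTIONS):
    """Return close dictionary words for a term that looks misspelled."""
    term = term.lower()
    if term in dictionary or term in exceptions:
        return []  # known word or declared exception: nothing to correct
    return difflib.get_close_matches(term, dictionary, n=3, cutoff=0.7)

print(suggest("serach"))
print(suggest("FS4SP"))
```

The exception list is checked before any similarity comparison, so valid terms outside the dictionary never trigger a suggestion.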

Stemming – The process of reducing words to their stem or root form. An English stemming algorithm should reduce the words "playing", "player" and "play" to the root word "play". Stemming is a challenging algorithmic task and is considered a difficult field of linguistic research. Each language needs its own stemming algorithms; some are simpler than others, but the more complicated the morphology and orthography of the language, the more complex the stemming becomes. Stemming is closely related to Lemmatization. FAST maps one form of a word to its variants to enrich the query results.
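
A deliberately naive suffix-stripping stemmer shows both the idea and its limits. The suffix list is illustrative only and is not a real English rule set such as those used by production stemmers; the function name is also an assumption.

```python
SUFFIXES = ("ing", "er", "ed", "s")  # illustrative only, not a real rule set

def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ("playing", "player", "play")])  # ['play', 'play', 'play']
print(stem("better"))                                    # 'bett'
```

The last call shows why stemming alone is not enough: without any knowledge of the language, "better" is cut to a meaningless stem.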

Synonyms – Synonyms are different words with identical or very similar meanings. Depending on language, geographical origin and socio-cultural status, synonyms can carry very different meanings because of etymology, orthography, phonic qualities, ambiguous meanings, usage, etc., which makes each one unique; this makes Synonyms difficult for search engines to process. Synonyms are normally presented to the search engines as Thesauruses, lists of related words.

Term Frequency (TF) – A measure of how often a term is found in a collection of

documents. TF is combined with Inverse Document Frequency (IDF) as a means of

determining which documents are most relevant to a query. TF is also used to measure

how often a word appears in a specific document.

Tokenization – The process of splitting a text into individual words or tokens to be

indexed. All separation characters (spaces, commas, dashes, periods, etc.) are

considered delimiting characters and are excluded from the indexes. Tokenization is

dependent on the language and very important for Relevance.
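
For a language that separates words with spaces and punctuation, a first approximation of tokenization fits in one regular expression. The sketch assumes that every non-alphanumeric character is a delimiter and lowercases the tokens; it does not handle the language-dependent cases mentioned above, and the function name is chosen for the example.

```python
import re

def tokenize(text):
    """Split text into lowercase tokens, dropping all delimiting characters."""
    return [token.lower() for token in re.findall(r"\w+", text)]

print(tokenize("FAST Search, for SharePoint-2010: fast."))
# ['fast', 'search', 'for', 'sharepoint', '2010', 'fast']
```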

4.- FAST VERSIONS

FS4SP: FAST Search for SharePoint 2010 is the FAST version packaged for the SharePoint 2010 environment. Licensing is per Client Access License (CAL)/server.

FSIA: FAST Search for Internal Applications is aimed at organizations looking for

a standalone FAST implementation (not integrated with SharePoint) for internal use. It

is generally sold on a CAL/server basis.

FSIS: FAST Search for Internet Sites is aimed at online search applications. FSIS is

licensed per server.

FAST ESP: Release 5.3, the current release when Microsoft bought the product, is the last version of FAST before its integration in the Microsoft server stack. FAST customers who currently have FAST ESP Maintenance & Support can upgrade to either FSIS or FSIA.

Microsoft divides the FAST family into two groups:

Search solutions for Business Productivity:

o Microsoft FAST Search Server 2010 for SharePoint (“FS4SP”)

o Microsoft FAST Search Server 2010 for Internal Applications

(“FSIA”).


Search solutions for Internet Sites

o Microsoft FAST Search Server 2010 for Internet Sites ("FSIS").

o Microsoft SharePoint Server 2010 for Internet Sites, Enterprise (“FIS-E”). This product includes rights to Microsoft FAST Search Server 2010 for SharePoint Internet Sites (“FS4FIS”).

From the Microsoft sales information about FAST:

“FSIS and FSIA must be purchased from FAST or FAST resellers. They are offered

only from the FAST price list, under a FAST EULA, and FAST maintenance and

support options are available. FS4SP and FIS-E will be available through Microsoft VL

only. FAST maintenance and support are not available for these VL products, but

Microsoft support and SA are available.”

“All servers need license coverage, just like for SharePoint or ESP for SharePoint.

The appropriate way to achieve license coverage depends on what the server is used

for.

Production (includes active and fault-tolerance servers), staging, admin, and hot

and warm stand-by servers all require product licensing

Cold stand-by servers used for disaster recovery do not require product licenses

as long as the customer is current on M or SA. This is a benefit of M/SA and

customers who drop M/SA lose this benefit.

Development and testing servers can be covered in a few ways. Under

Microsoft VL, customers can choose to cover them with product licenses

(server/CAL for FS4SP; server for FIS-E) or via MSDN subscriptions. Under

FAST, these rights will be included in the base licenses for FSIS and FSIA.

Each user of FSIA must be covered by a CAL.

Each virtual machine on a physical server counts as a server and requires a

separate license. This matches Microsoft licensing for server technology hosted

in a virtual environment.”

(http://www.microsoft.com/pathways/fast/FAST%20License%20Grants.htm)
