Intro to Elasticsearch

Your Data, Your Search!

问志光, 2016-06-27

Outline
● Information retrieval
● Indexing & Searching
● Elasticsearch

Information retrieval

Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

A search engine is a software system designed to search for information. It is one kind of implementation of IR.

What is a search engine?

A search engine is:
● An index engine for documents
● A search engine on indexes

A search engine is more powerful at doing searches: it's designed for it!

Search Engine Architecture

Problems?
● How to store the data?
● How to index the data?
● How to search the data?

How to store the data?

INVERTED LIST

How to index the data?

Suppose you have the following two files:

File1: Students should be allowed to go out with their friends, but not allowed to drink beer.

File2: My friend Jerry went to school to see his students but found them drunk which is not allowed.

Step 1: Tokenizer
● Split the document into words
● Remove the punctuation
● Remove stop words (the, a, this, that, etc.)

"Students", "allowed", "go", "their", "friends", "allowed", "drink", "beer", "My", "friend", "Jerry", "went", "school", "see", "his", "students", "found", "them", "drunk", "allowed"
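A minimal Python sketch of the tokenizer step, for illustration only. The stop-word list is an assumption, chosen so that the two sample files yield the tokens listed above; real analyzers ship their own lists and rules.

```python
import re

# Assumed stop-word list: picked so the sample files reproduce the slide's tokens.
STOP_WORDS = {"the", "a", "this", "that", "should", "be", "to", "out",
              "with", "but", "not", "which", "is"}

def tokenize(text):
    """Split text into words, dropping punctuation and stop words."""
    words = re.findall(r"[A-Za-z]+", text)  # keeps letters only, so punctuation is removed
    return [w for w in words if w.lower() not in STOP_WORDS]

FILE1 = ("Students should be allowed to go out with their friends, "
         "but not allowed to drink beer.")
FILE2 = ("My friend Jerry went to school to see his students "
         "but found them drunk which is not allowed.")

tokens1 = tokenize(FILE1)
tokens2 = tokenize(FILE2)
print(tokens1)
print(tokens2)
```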

Step 2: Linguistic Processor
● Lowercase
● Stemming: cars -> car, etc.
● Lemmatization: drove -> drive, etc.

"student", "allow", "go", "their", "friend", "allow", "drink", "beer", "my", "friend", "jerry", "go", "school", "see", "his", "student", "find", "them", "drink", "allow"
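A toy Python sketch of the linguistic processor, under stated assumptions: the lemma table and the two suffix rules are hypothetical and only cover the sample tokens. A production stemmer (for example Porter) has many more rules.

```python
# Toy lemma table (assumption): irregular forms mapped to their base form.
LEMMAS = {"went": "go", "found": "find", "drunk": "drink", "drove": "drive"}

def stem(word):
    """Very naive suffix stripping, enough for the sample tokens only."""
    if word.endswith("ed") and len(word) > 4:
        return word[:-2]          # allowed -> allow
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]          # students -> student, cars -> car
    return word

def normalize(token):
    """Lowercase, then lemmatize irregular forms, otherwise stem."""
    word = token.lower()
    if word in LEMMAS:
        return LEMMAS[word]
    return stem(word)

tokens = ["Students", "allowed", "go", "their", "friends", "allowed", "drink", "beer",
          "My", "friend", "Jerry", "went", "school", "see", "his", "students",
          "found", "them", "drunk", "allowed"]
terms = [normalize(t) for t in tokens]
print(terms)
```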

Step 3: Index

Term       Document ID
student    1
allow      1
go         1
their      1
friend     1
allow      1
…          …

Dict -> Sort -> Posting list
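The index step can be sketched in Python as follows. This is an illustrative in-memory version, not how Lucene stores its index: it collects (term, document ID) pairs, sorts them, and merges identical terms into posting lists. The function name `build_inverted_index` is hypothetical.

```python
from collections import defaultdict

# Terms per document, as produced by the linguistic processor step.
docs = {
    1: ["student", "allow", "go", "their", "friend", "allow", "drink", "beer"],
    2: ["my", "friend", "jerry", "go", "school", "see", "his", "student",
        "find", "them", "drink", "allow"],
}

def build_inverted_index(docs):
    """Return {term: sorted list of document IDs containing the term}."""
    # Dict: one (term, doc_id) pair per occurrence.
    pairs = [(term, doc_id) for doc_id, terms in docs.items() for term in terms]
    # Sort: alphabetically by term, then by document ID.
    pairs.sort()
    # Posting list: merge identical terms, skipping duplicate document IDs.
    index = defaultdict(list)
    for term, doc_id in pairs:
        if not index[term] or index[term][-1] != doc_id:
            index[term].append(doc_id)
    return dict(index)

index = build_inverted_index(docs)
for term, postings in index.items():
    print(term, postings)
```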

How to search the data?

Step 1: User search query
● Suppose you have the following query:

lucene AND learned NOT hadoop

Step 2: Lexical & Syntax Analysis
● Identify words and keywords
  Words: lucene, learned, hadoop
  Keywords: AND, NOT

● Building a syntax tree

[Figure: syntax tree of the query, with nodes lucene, learned, hadoop, and AND]
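A minimal Python sketch of the lexical and syntax analysis for the query above. The left-to-right parse and the nested-tuple tree are assumptions made for illustration (NOT is treated as a binary "and not"); real query parsers handle precedence, parentheses, and phrases. OR is included in the keyword set as an assumption, although the sample query does not use it.

```python
KEYWORDS = {"AND", "OR", "NOT"}  # OR is an assumption; the slide lists AND and NOT

def lex(query):
    """Split the query into tokens and separate words from keywords."""
    tokens = query.split()
    words = [t for t in tokens if t not in KEYWORDS]
    keywords = [t for t in tokens if t in KEYWORDS]
    return tokens, words, keywords

def parse(tokens):
    """Left-to-right parse into nested (operator, left, right) tuples."""
    tree = tokens[0]
    for i in range(1, len(tokens) - 1, 2):
        operator, operand = tokens[i], tokens[i + 1]
        tree = (operator, tree, operand)
    return tree

tokens, words, keywords = lex("lucene AND learned NOT hadoop")
tree = parse(tokens)
print(words, keywords)
print(tree)
```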

Step 3: Search
● Search in the inverted list
● Sort, conjunction, disjunction
● Scorer
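A sketch of the search step in Python, assuming a tiny hypothetical inverted list and the nested-tuple syntax tree form `(operator, left, right)`. Conjunction maps to set intersection, disjunction to union, and NOT to set difference. Scoring is left out here; the result is only sorted by document ID.

```python
# Hypothetical posting lists, for illustration only.
index = {
    "lucene": [1, 2, 3, 5],
    "learned": [2, 3, 4],
    "hadoop": [3, 6],
}

def evaluate(node, index):
    """Evaluate a syntax tree against the inverted list, returning a set of doc IDs."""
    if isinstance(node, str):
        return set(index.get(node, []))   # leaf: look up the posting list
    operator, left, right = node
    left_ids = evaluate(left, index)
    right_ids = evaluate(right, index)
    if operator == "AND":
        return left_ids & right_ids       # conjunction
    if operator == "OR":
        return left_ids | right_ids       # disjunction
    if operator == "NOT":
        return left_ids - right_ids       # left side, excluding the right side
    raise ValueError("unknown operator: " + operator)

# Tree for: lucene AND learned NOT hadoop
tree = ("NOT", ("AND", "lucene", "learned"), "hadoop")
result = sorted(evaluate(tree, index))
print(result)
```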

Elasticsearch
● Full-text search
● RESTful API
● Real-time search and analytics engine
● Open source
● High availability
● Schema free
● JSON over HTTP
● Lucene based
● Distributed

Elasticsearch

Distributed and Highly Available Search Engine.
● Each index is fully sharded with a configurable number of shards.
● Each shard can have one or more replicas.
● Read / search operations are performed on any one of the replica shards.
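To make sharding concrete, here is a rough Python sketch of how a document could be assigned to a shard and which shard copies could serve a read. It is a simplification under stated assumptions: the shard and replica counts are example settings, `zlib.crc32` stands in for the hash function Elasticsearch actually uses, and the names `shard_for` and `copies_of` are hypothetical.

```python
import zlib

NUM_PRIMARY_SHARDS = 5   # assumed index setting
NUM_REPLICAS = 1         # assumed index setting

def shard_for(doc_id, num_shards=NUM_PRIMARY_SHARDS):
    """Pick a shard by hashing the document ID (crc32 is a stand-in hash)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

def copies_of(shard):
    """List the shard copies that can serve a read: the primary and its replicas."""
    copies = [f"shard{shard}-primary"]
    copies += [f"shard{shard}-replica{r}" for r in range(1, NUM_REPLICAS + 1)]
    return copies

shard = shard_for("1")
print(shard, copies_of(shard))
```

Because the shard is derived from the hash modulo the number of primary shards, the same document ID always lands on the same shard as long as that number does not change.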

Multi Tenant with Multi Types.
● Support for more than one index.
● Support for more than one type per index.
● Index level configuration (number of shards, index storage, ...).

Document oriented
● No need for upfront schema definition.
● Schema can be defined per type for customization of the indexing process.

Various set of APIs
● HTTP RESTful API
● Native Java API.
● All APIs perform automatic node operation rerouting.

(Near) Real Time Search.

Reliable, Asynchronous Write Behind for long term persistency.

Built on top of Lucene
● Each shard is a fully functional Lucene index
● All the power of Lucene easily exposed through simple configuration / plugins.

Per operation consistency
● Single document level operations are atomic, consistent, isolated and durable.

Open Source under the Apache License, version 2 ("ALv2")

Terminologies of Elasticsearch
● Cluster
● Node
● Index
● Shard

Cluster

● A cluster is a collection of one or more nodes (servers) that together hold your entire data and provide federated indexing and search capabilities across all nodes

● A cluster is identified by a unique name which by default is "elasticsearch"

Terminologies of Elasticsearch

Node

● A node is an Elasticsearch instance (a Java process)

● A node is created when an Elasticsearch instance is started

● A random Marvel character name is allocated by default

Index

● An index is a collection of documents that have somewhat similar characteristics, e.g. customer data, product catalog

● The index name is used to refer to the index when performing indexing, search, update, and delete operations against the documents in it

● You can define as many indexes as you want in a single cluster

Document

● It is the most basic unit of information which can be indexed

● It is expressed as JSON (key: value pairs), e.g. '{"user": "nullcon"}'

● Every Document gets associated with a type and a unique id.

Shard

● Every index can be split into multiple shards to be able to distribute data.

● The shard is the atomic part of an index, which can be distributed over the cluster if you add more nodes.

A terminology comparison

Relational database     Elasticsearch
Database                Index
Table                   Type
Row                     Document
Column                  Field
Schema                  Mapping
Index                   Everything is indexed
SQL                     Query DSL
SELECT * FROM tb …      GET http://
UPDATE tb SET …         PUT http://

Playing with Elasticsearch

REST API: http://host:port/[index]/[type]/[_action/id]

HTTP Methods: GET, POST, PUT, DELETE

Playing with Elasticsearch

• Search
– curl -XGET http://localhost:9200/my_index/test/_search
– curl -XGET http://localhost:9200/my_index/_search
– curl -XGET http://localhost:9200/_search

• Meta Data
– curl -XGET http://localhost:9200/my_index/_status

• Documents:
– curl -XPUT http://localhost:9200/my_index/test/1
– curl -XGET http://localhost:9200/my_index/test/1
– curl -XDELETE http://localhost:9200/my_index/test/1

Example: Index

curl -XPUT http://localhost:9200/my_index/test/1 -d '{"name": "joeywen", "value": 100}'

Example: Search

curl -XGET http://localhost:9200/my_index/_search -d '{"query": {"match_all": {}}}'
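The same two requests can be sketched in Python with only the standard library. This is an illustration, not an official client: `build_request` and `send` are hypothetical helper names, the host and port are taken from the curl examples, and nothing is sent unless `send` is called against a running node. The search body is sent with POST here, which Elasticsearch also accepts for `_search`.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # same host and port as the curl examples

def build_request(method, path, body=None):
    """Build (but do not send) an HTTP request with an optional JSON body."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data else {}
    return urllib.request.Request(BASE_URL + path, data=data,
                                  headers=headers, method=method)

def send(request):
    """Send the request and decode the JSON response (needs a running node)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))

# Index a document with ID 1 into my_index/test.
index_req = build_request("PUT", "/my_index/test/1",
                          {"name": "joeywen", "value": 100})
# Search all documents in my_index.
search_req = build_request("POST", "/my_index/_search",
                           {"query": {"match_all": {}}})

print(index_req.get_method(), index_req.full_url)
print(search_req.get_method(), search_req.full_url)
```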

[Figure: search response annotated with total number of docs, relevance, search time, and max score]
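A short Python sketch showing where these four values sit in a search response. The response below is made up for illustration (it is not output from a real cluster) and follows the response layout of Elasticsearch versions from the time of this talk; newer versions report `hits.total` as an object rather than a number.

```python
import json

# Illustrative response (assumption): all values are invented for this example.
SAMPLE_RESPONSE = """
{
  "took": 3,
  "timed_out": false,
  "hits": {
    "total": 1,
    "max_score": 1.0,
    "hits": [
      {"_index": "my_index", "_type": "test", "_id": "1", "_score": 1.0,
       "_source": {"name": "joeywen", "value": 100}}
    ]
  }
}
"""

response = json.loads(SAMPLE_RESPONSE)

search_time_ms = response["took"]             # search time, in milliseconds
total_docs = response["hits"]["total"]        # total number of matching docs
max_score = response["hits"]["max_score"]     # highest relevance score
# Relevance: each hit carries its own score.
relevance = {hit["_id"]: hit["_score"] for hit in response["hits"]["hits"]}

print(search_time_ms, total_docs, max_score, relevance)
```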

Creating, indexing, or deleting a single document

Plugins: Kopf

Plugins: head
