Intro to Elasticsearch
Uploaded by joey-wen
Your Data, Your Search!
问志光, 2016-06-27
Outline
● Information retrieval
● Indexing & Searching
● Elasticsearch
Information retrieval
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
A search engine is a software system designed to search for information. It is one kind of implementation of IR.
What is a search engine?
A search engine is:
• An index engine for documents
• A search engine on indexes
A search engine is more powerful at searching: it is designed for it!
Search Engine Architecture
Problems?
• How to store the data?
• How to index the data?
• How to search the data?
How to store the data?
INVERTED LIST
How to INDEX the data?
Suppose we have the following two files:
File 1: Students should be allowed to go out with their friends, but not allowed to drink beer.
File 2: My friend Jerry went to school to see his students but found them drunk which is not allowed.
Step 1: Tokenizer
• Split the document into words
• Remove the punctuation
• Remove stop words (the, a, this, that, etc.)
“Students”, “allowed”, “go”, “their”, “friends”, “allowed”, “drink”, “beer”, “My”, “friend”, “Jerry”, “went”, “school”, “see”, “his”, “students”, “found”, “them”, “drunk”, “allowed”
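The tokenizer step can be sketched in a few lines of Python. This is only an illustration: the `tokenize` helper and the stop-word list are assumptions, chosen so that File 1 yields the tokens listed above. Real analyzers ship their own stop-word lists.

```python
import re

# Illustrative stop-word list (an assumption, picked to match the slide).
STOP_WORDS = {"the", "a", "this", "that", "should", "be", "to", "out",
              "with", "but", "not", "which", "is"}

def tokenize(text):
    """Split text into words, dropping punctuation and stop words."""
    words = re.findall(r"[A-Za-z]+", text)   # punctuation is never matched
    return [w for w in words if w.lower() not in STOP_WORDS]

file1 = ("Students should be allowed to go out with their friends, "
         "but not allowed to drink beer.")
tokens = tokenize(file1)
print(tokens)
```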
Step 2: Linguistic Processor
• Lowercase
• Stemming: cars -> car, etc.
• Lemmatization: drove -> drive, etc.
“student”, “allow”, “go”, “their”, “friend”, “allow”, “drink”, “beer”, “my”, “friend”, “jerry”, “go”, “school”, “see”, “his”, “student”, “find”, “them”, “drink”, “allow”
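A toy version of the linguistic processor might look like the following. It is not a real stemmer (production systems typically use something like a Porter stemmer): the `LEMMAS` table and the two suffix rules are hypothetical and only cover the words in this example.

```python
# Toy lemma table for irregular forms (hypothetical, example words only).
LEMMAS = {"went": "go", "found": "find", "drunk": "drink", "drove": "drive"}

def normalize(token):
    """Lowercase a token, then apply a lemma lookup or naive suffix stripping."""
    word = token.lower()
    if word in LEMMAS:                       # lemmatization: drove -> drive
        return LEMMAS[word]
    if len(word) > 4 and word.endswith("ed"):
        return word[:-2]                     # allowed -> allow
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]                     # cars -> car
    return word

tokens = ["Students", "allowed", "friends", "went", "found", "drunk", "his", "cars"]
terms = [normalize(t) for t in tokens]
print(terms)
```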
Step 3: Index
Term       Document ID
student    1
allow      1
go         1
their      1
friend     1
allow      1
…          …
Sort the terms, then merge them into a dictionary (Dict) that points to a posting list for each term.
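The indexing step can be sketched as building a sorted dictionary whose entries point to posting lists. This is a minimal in-memory illustration, not how Lucene stores its index. The `build_inverted_index` name and the (document ID, term frequency) posting format are assumptions made for the example.

```python
from collections import defaultdict

# Terms produced by Step 2 for the two example files.
docs = {
    1: ["student", "allow", "go", "their", "friend", "allow", "drink", "beer"],
    2: ["my", "friend", "jerry", "go", "school", "see", "his", "student",
        "find", "them", "drink", "allow"],
}

def build_inverted_index(docs):
    """Map each term to a posting list of (doc_id, term_frequency) pairs."""
    postings = defaultdict(dict)
    for doc_id, terms in docs.items():
        for term in terms:
            postings[term][doc_id] = postings[term].get(doc_id, 0) + 1
    # Sort the dictionary by term and each posting list by document ID.
    return {term: sorted(postings[term].items()) for term in sorted(postings)}

index = build_inverted_index(docs)
print(index["allow"])   # [(1, 2), (2, 1)]
print(index["beer"])    # [(1, 1)]
```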
How to SEARCH the data?
Step 1: User search query
• Suppose you have the following query:
lucene AND learned NOT hadoop
Step 2: Lexical & syntax analysis
• Identify words and keywords
Words: lucene, learned, hadoop
Keywords: AND, NOT
• Build a syntax tree
[Figure: syntax tree of the query, with operator nodes AND and NOT and word nodes lucene, learned, hadoop]
Step 3: Search
• Search in the inverted list
• Combine the posting lists: sort, conjunction (AND), disjunction (OR)
• Score the results (scorer)
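To make the search step concrete, here is a small sketch that evaluates the example query against a made-up inverted list. The document IDs are invented for illustration, and scoring is left out. Conjunction is done by merging two sorted posting lists, and NOT by removing the excluded document IDs.

```python
# Hypothetical posting lists (sorted document IDs) for the three query words.
index = {
    "lucene":  [1, 2, 3, 5],
    "learned": [2, 3, 4, 5],
    "hadoop":  [3, 6],
}

def intersect(a, b):
    """Conjunction (AND): merge two sorted posting lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def difference(a, b):
    """NOT: keep the documents in a that do not appear in b."""
    excluded = set(b)
    return [doc_id for doc_id in a if doc_id not in excluded]

# lucene AND learned NOT hadoop
result = difference(intersect(index["lucene"], index["learned"]), index["hadoop"])
print(result)   # [2, 5]
```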
Elasticsearch
● Full-text search
● RESTful API
● Real-time search and analytics engine
● Open source
● High availability
● Schema free
● JSON over HTTP
● Lucene based
● Distributed
Elasticsearch
● Distributed and highly available search engine.
– Each index is fully sharded with a configurable number of shards.
– Each shard can have one or more replicas.
– Read / search operations are performed on any of the replica shards.
● Multi-tenant with multi-types.
– Support for more than one index.
– Support for more than one type per index.
– Index-level configuration (number of shards, index storage, ...).
● Document oriented.
– No need for upfront schema definition.
– Schema can be defined per type to customize the indexing process.
● Various sets of APIs.
– HTTP RESTful API.
– Native Java API.
– All APIs perform automatic node operation rerouting.
● (Near) real-time search.
● Reliable, asynchronous write-behind for long-term persistency.
● Built on top of Lucene.
– Each shard is a fully functional Lucene index.
– All the power of Lucene is easily exposed through simple configuration / plugins.
● Per-operation consistency.
– Single-document-level operations are atomic, consistent, isolated and durable.
● Open source under the Apache License, version 2 ("ALv2").
Terminologies of Elasticsearch
● Cluster
● Node
● Index
● Shard
Cluster
● A cluster is a collection of one or more nodes (servers) that together hold your entire data and provide federated indexing and search capabilities across all nodes.
● A cluster is identified by a unique name, which by default is "elasticsearch".
Node
● A node is an Elasticsearch instance (a Java process).
● A node is created when an Elasticsearch instance is started.
● By default, a node is assigned a random Marvel character name.
Index
● An index is a collection of documents that have somewhat similar characteristics, e.g. customer data or a product catalog.
● An index is identified by a name, which is used to refer to it when performing indexing, search, update, and delete operations against the documents in it.
● You can define as many indexes as you want in a single cluster.
Document
● A document is the most basic unit of information that can be indexed.
● It is expressed in JSON as key: value pairs, e.g. '{"user": "nullcon"}'.
● Every document is associated with a type and a unique id.
Shard
● Every index can be split into multiple shards so that its data can be distributed.
● The shard is the atomic part of an index, which can be distributed over the cluster if you add more nodes.
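As a rough illustration of how documents end up spread across shards, the sketch below assigns each document ID to a shard by hashing it and taking the remainder modulo the number of primary shards. Elasticsearch uses its own hash function for routing. `zlib.crc32` is only a stand-in here, and `pick_shard` is a hypothetical helper.

```python
import zlib

def pick_shard(doc_id, num_primary_shards):
    """Map a document ID to a shard number (CRC32 is a stand-in hash)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards

NUM_SHARDS = 5   # assumed shard count for the example
shards = {n: [] for n in range(NUM_SHARDS)}
for doc_id in ["1", "2", "3", "4", "5", "6"]:
    shards[pick_shard(doc_id, NUM_SHARDS)].append(doc_id)

print(shards)
```

Because the shard number depends on the shard count, the same document ID always lands on the same shard only as long as the number of primary shards stays fixed.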
A terminology comparison
Relational database     Elasticsearch
Database                Index
Table                   Type
Row                     Document
Column                  Field
Schema                  Mapping
Index                   Everything is indexed
SQL                     Query DSL
SELECT * FROM tb …      GET http://…
UPDATE tb SET …         PUT http://…
Playing with Elasticsearch
REST API: http://host:port/[index]/[type]/[_action/id]
HTTP methods: GET, POST, PUT, DELETE
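The URL pattern above can be expressed as a tiny helper that joins whichever parts are present. `es_url` is a hypothetical function written for this example, not part of any Elasticsearch client.

```python
def es_url(host, port, index=None, doc_type=None, action_or_id=None):
    """Build http://host:port/[index]/[type]/[_action/id], skipping missing parts."""
    parts = [p for p in (index, doc_type, action_or_id) if p]
    return "http://%s:%d/%s" % (host, port, "/".join(parts))

print(es_url("localhost", 9200, "my_index", "test", "1"))
# http://localhost:9200/my_index/test/1
print(es_url("localhost", 9200, "my_index", action_or_id="_search"))
# http://localhost:9200/my_index/_search
```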
• Search
– curl -XGET http://localhost:9200/my_index/test/_search
– curl -XGET http://localhost:9200/my_index/_search
– curl -XGET http://localhost:9200/_search
• Meta Data
– curl -XGET http://localhost:9200/my_index/_status
• Documents
– curl -XPUT http://localhost:9200/my_index/test/1
– curl -XGET http://localhost:9200/my_index/test/1
– curl -XDELETE http://localhost:9200/my_index/test/1
Example: Index
curl -XPUT http://localhost:9200/my_index/test/1 -d '{"name": "joeywen", "value": 100}'
Example: Search
curl -XGET http://localhost:9200/my_index/_search -d '{"query": {"match_all": {}}}'
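The same two requests can be prepared from Python using only the standard library. The sketch below builds the requests without sending them, so it runs without a cluster. `build_request` is a hypothetical helper, and the `Content-Type` header is an addition that the curl examples do not set.

```python
import json
import urllib.request

BASE = "http://localhost:9200"   # same host and port as the curl examples

def build_request(method, path, body=None):
    """Prepare an HTTP request with an optional JSON body (nothing is sent)."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return urllib.request.Request(
        BASE + path, data=data, method=method,
        headers={"Content-Type": "application/json"})

index_req = build_request("PUT", "/my_index/test/1",
                          {"name": "joeywen", "value": 100})
search_req = build_request("GET", "/my_index/_search",
                           {"query": {"match_all": {}}})

print(index_req.get_method(), index_req.full_url)
print(search_req.get_method(), search_req.full_url)
# To actually send one, a node must be running:
#   urllib.request.urlopen(index_req)
```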
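[Screenshot: search response, annotated with the following fields]
● Total number of docs: hits.total
● Relevance: _score of each hit
● Search time: took (in milliseconds)
● Max score: hits.max_score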
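To show where these four values sit in a response, the sketch below parses a hand-written sample body. The sample is invented for illustration and is not real output from a cluster. Only the field names (`took`, `hits.total`, `hits.max_score`, `_score`) follow the response format of Elasticsearch versions from the time of this talk.

```python
import json

# Hand-written sample response (illustrative values, not real output).
sample_response = """
{
  "took": 3,
  "timed_out": false,
  "hits": {
    "total": 1,
    "max_score": 1.0,
    "hits": [
      {"_index": "my_index", "_type": "test", "_id": "1",
       "_score": 1.0, "_source": {"name": "joeywen", "value": 100}}
    ]
  }
}
"""

def summarize(response_text):
    """Pull out search time, total docs, max score and per-hit relevance."""
    body = json.loads(response_text)
    hits = body["hits"]
    return {
        "search_time_ms": body["took"],
        "total_docs": hits["total"],
        "max_score": hits["max_score"],
        "scores": [(h["_id"], h["_score"]) for h in hits["hits"]],
    }

summary = summarize(sample_response)
print(summary)
```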
Creating, indexing, or deleting a single document
Plugins: Kopf
Plugins: head
Web
Q&A