特點
- 分散式的即時文件儲存,每個字段都可被索引並可搜索
- 分散式即時分析搜索引擎
- 不規則查詢
- 高擴展,可處裡 PB 級結構或非結構化數據
- Lucene 實現索引和搜索功能
透過簡單的 RESTful API 來隱藏 Lucene 的複雜性,讓搜索變簡單
ES 能做什麼
- 全文檢索
- 模糊查詢
- 數據分析
- 聚合等
Elasticsearch 的交互方式
- 基於 HTTP 協定,以 JSON 為數據交互格式的 RESTful API
_cat
$ curl 'http://192.168.227.141:9200/_cat'
^.^
/_cat/allocation
/_cat/shards
/_cat/shards/{index}
/_cat/master
/_cat/nodes
/_cat/tasks
/_cat/indices
/_cat/indices/{index}
/_cat/segments
/_cat/segments/{index}
/_cat/count
/_cat/count/{index}
/_cat/recovery
/_cat/recovery/{index}
/_cat/health
/_cat/pending_tasks
/_cat/aliases
/_cat/aliases/{alias}
/_cat/thread_pool
/_cat/thread_pool/{thread_pools}
/_cat/plugins
/_cat/fielddata
/_cat/fielddata/{fields}
/_cat/nodeattrs
/_cat/repositories
/_cat/snapshots/{repository}
/_cat/templates
$ curl 'http://192.168.227.141:9200/_cat/health'
1595807442 23:50:42 docker-cluster green 3 2 4 2 0 0 0 0 - 100.0%
$ curl 'http://192.168.227.141:9200/_cat/nodes'
172.18.0.4 25 85 2 0.18 0.53 0.58 mr * es-master
172.18.0.2 12 85 2 0.18 0.53 0.58 dmr - es-node2
172.18.0.3 8 85 2 0.18 0.53 0.58 dmr - es-node3
ElasticSearch 儲存
- 面向文檔
表示可以儲存整個對象或
document
。並還可以索引(index
)每個文檔都可搜索。在 ES 中,可以對文檔進行索引、搜索、排序、過濾等。 傳統資料庫以一張表來存儲,並用SQL
指令進行查詢 在 ES 中每一條數據為一個document
- JSON
從上面圖中的 ES 來說,每一條
document
會以JSON
進行存儲
Index
lab702/doc/1 –> index/type/id
$ curl -XPOST 'http://192.168.227.141:9200/lab702/doc/1' -i -H "Content-Type:application/json" -d '{"name":"itacho", "age": "26", "sex":"boy"}'
HTTP/1.1 201 Created
Location: /lab702/doc/1
Warning: 299 Elasticsearch-7.7.0-81a1e9eda8e6183f5237786246f6dced26a10eaf "[types removal] Specifying types in document index requests is deprecated, use the typeless endpoints instead (/{index}/_doc/{id}, /{index}/_doc, or /{index}/_create/{id})."
content-type: application/json; charsetUTF-8
content-length: 153
{"_index":"lab702","_type":"doc","_id":"1","_version":1,"result":"created","_shards":{"total":2,"successful":2,"failed":0},"_seq_no":0,"_primary_term":1}
_search
$ curl 'http://192.168.227.141:9200/lab702/_search' -i -H "Content-Type:application/json" -d '{"query":{"match_all":{}}}'
HTTP/1.1 200 OK
content-type: application/json; charsetUTF-8
content-length: 269
{"took":2,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":1,"relation":"eq"},"max_score":1.0,"hits":[{"_index":"lab702","_type":"doc","_id":"1","_score":1.0,"_source":{"name":"itacho", "age": "26", "sex":"boy"}}]}}
ES 數據結構
在資料庫中,會有 database 在定義 table 在定義屬性最會插入數據。
而在 ES 中一個 index 相當於一個 database。結構如下圖,其中 “id” 和 “name” 可以稱為 field
。
$ curl 'http://192.168.227.141:9200/lab702/doc/1?prettytrue'
{
"_index" : "lab702", // 索引
"_type" : "doc", // 類型
"_id" : "1", // document 唯一 ID
"_version" : 1, // 更新時會遞增,document 只要發生變化都會使此關鍵字遞增,確保程序的一部分不會覆蓋另一部分所做的更改
"_seq_no" : 0,
"_primary_term" : 1,
"found" : true,
"_source" : { // 查詢數據
"name" : "itacho",
"age" : "26",
"sex" : "boy"
}
}
詳細介紹可查看此篇
檢索
http://192.168.227.141:9200/lab702/doc/1
該語句為檢索在 lab702 index 中的 doc type 且編號為 1 的數據,相當與 SQL 指令 SELECT * FROM table WHERE id1
簡單檢索
MySQL:SELECT * FROM table
ES:GET /index/type/_search
$ curl 'http://192.168.227.141:9200/lab702/doc/_search?prettytrue'
{
"took" : 418, # 查詢時間
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : { # 查詢內容
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "lab702",
"_type" : "doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"name" : "itacho",
"age" : "26",
"sex" : "boy",
"about" : "I Love MLB",
"interests" : [
"baseball",
"cooking",
"music"
]
}
},
{
"_index" : "lab702",
"_type" : "doc",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"name" : "naruto",
"age" : "18",
"sex" : "boy",
"about" : "I Love NBA",
"interests" : [
"cooking",
"music"
]
}
},
{
"_index" : "lab702",
"_type" : "doc",
"_id" : "3",
"_score" : 1.0,
"_source" : {
"name" : "madara",
"age" : "82",
"sex" : "boy",
"about" : "I Hate world",
"interests" : [
"running",
"LOL"
]
}
}
]
}
}
回應內容有被匹配到的 document,在此範例中是將所有 document
取出
全文檢索
此範例是找出 field 值有 madara 的相關 document。
$ curl 'http://192.168.227.141:9200/lab702/doc/_search?qmadara&prettytrue'
{
"took" : 6,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.9808291,
"hits" : [
{
"_index" : "lab702",
"_type" : "doc",
"_id" : "3",
"_score" : 0.9808291,
"_source" : {
"name" : "madara",
"age" : "82",
"sex" : "boy",
"about" : "I Hate world",
"interests" : [
"running",
"LOL"
]
}
}
]
}
}
搜索(模糊查詢)
查詢所有 document 中 field 值分詞包含 love 的 document
$ curl 'http://192.168.227.141:9200/lab702/doc/_search?qlove&prettytrue'
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 0.4700036,
"hits" : [
{
"_index" : "lab702",
"_type" : "doc",
"_id" : "1",
"_score" : 0.4700036,
"_source" : {
"name" : "itacho",
"age" : "26",
"sex" : "boy",
"about" : "I Love MLB",
"interests" : [
"baseball",
"cooking",
"music"
]
}
},
{
"_index" : "lab702",
"_type" : "doc",
"_id" : "2",
"_score" : 0.4700036,
"_source" : {
"name" : "naruto",
"age" : "18",
"sex" : "boy",
"about" : "I Love NBA",
"interests" : [
"cooking",
"music"
]
}
}
]
}
}
聚合
MySQL:Group By
ES:aggs
ES 預設將聚合禁止,用以下方式將聚合的 field 添加優化
$ curl -X PUT "localhost:9200/lab702/_mapping?pretty" -H 'Content-Type: application/json' -d'
{
"properties": {
"interests": {
"type": "text",
"fielddata": true
}
}
}
'
{
"acknowledged" : true
}
$ curl -XPOST 'http://192.168.227.141:9200/lab702/doc/_search?prettytrue' -i -H "Content-Type:application/json" -d '{"aggs":{"all_inter":{"terms":{"field":"interests"}}}}'
HTTP/1.1 200 OK
Warning: 299 Elasticsearch-7.7.0-81a1e9eda8e6183f5237786246f6dced26a10eaf "[types removal] Specifying types in search requests is deprecated."
content-type: application/json; charsetUTF-8
content-length: 1839
{
"took" : 29,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "lab702",
"_type" : "doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"name" : "itacho",
"age" : "26",
"sex" : "boy",
"about" : "I Love MLB",
"interests" : [
"baseball",
"cooking",
"music"
]
}
},
{
"_index" : "lab702",
"_type" : "doc",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"name" : "naruto",
"age" : "18",
"sex" : "boy",
"about" : "I Love NBA",
"interests" : [
"cooking",
"music"
]
}
},
{
"_index" : "lab702",
"_type" : "doc",
"_id" : "3",
"_score" : 1.0,
"_source" : {
"name" : "madara",
"age" : "82",
"sex" : "boy",
"about" : "I Hate world",
"interests" : [
"running",
"LOL"
]
}
}
]
},
"aggregations" : {
"all_inter" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "cooking",
"doc_count" : 2
},
{
"key" : "music",
"doc_count" : 2
},
{
"key" : "baseball",
"doc_count" : 1
},
{
"key" : "lol",
"doc_count" : 1
},
{
"key" : "running",
"doc_count" : 1
}
]
}
}
}
搜索原理
正排索引和倒排索引
- 正排索引
- 紀錄 document ID 倒 document 內容、單詞的關聯關係
- 分詞過程
正排索引
ID | Content |
---|---|
1 | 火影行動 |
2 | 探索火影行動 |
3 | 火影特別行動 |
4 | 火影紀錄篇 |
5 | 特工火影特別探索 |
- 倒排索引
- 單詞到 document ID 的關係
- 分詞
- 將句子拆分為單詞
- 檢索
- B+tree 資料結構
- 火影特工行動
- 對應到倒排索引表中"火影"、“行動"和"特工”,因此會將紀錄的對應回傳
- 會利用相關性得分來尋找最符合
- 其紀錄中的 3 和 5 命中的多,從 3 的值來說分詞為三個但命中兩個機率為 2/3,而 5 則為 2/4
- Posting List
- Document ID
- Term Frequency
- 單詞在 document 中出現次數,與相關性得分有關
- Position
- 單詞在 document 的位置,與搜尋有關
- 從 0 開始
- Offset
- 單詞的開始與結束位置
- 相關性得分
- 每個 document 都會有一個倒排索引
正排索引
ID | Content |
---|---|
1 | 火影行動 |
2 | 探索火影行動 |
3 | 火影特別行動 |
4 | 火影紀錄篇 |
5 | 特工火影特別探索 |
Posting List,以火影為例,並用2個詞作為分詞
ID | TF | Position | Offset |
---|---|---|---|
1 | 1 | 0 | <0,1> |
2 | 1 | 1 | <2,3> |
3 | 1 | 0 | <0,1> |
4 | 1 | 0 | <0,1> |
5 | 1 | 1 | <2,3> |
ES 維護的倒排索引表
詞 | 紀錄 |
---|---|
火影 | 1,2,3,4,5 |
行動 | 1,2,3 |
探索 | 2,5 |
特別 | 3,5 |
記錄篇 | 4 |
特工 | 5 |
分詞
- 在 Elasticsearch 中稱為 Analysis
- 將文本轉換成一系列單詞的過程
以"特工火影特別探索"為例,可分成"特工"、“火影”、“特別探索”、“探索”
分詞機制
- Character Filter
- 對原始文本進行處理
- 特殊符號或 Html 標籤等
- Tokenizer
- 對原始文本進行分詞
- “火影忍者”,可分成"火影"、“忍者”
- 對原始文本進行分詞
- Token Filters
- 分詞後關鍵字進行加工
- 轉小寫、刪除語氣詞、同意詞等
- 分詞後關鍵字進行加工
- ES 自帶分詞器
- Standard
- Simple
- Whitespace
- Stop
- Keyword
- 不分詞
- Pattern
- Language
- …
- 中文分詞
- IK
- Jieba
- Hanlp
- THULAC
API _analyze
$ curl -X GET "localhost:9200/_analyze?prettytrue" -H 'Content-Type: application/json' -d'{"analyzer" : "standard","text" : "Hello {orld"}'
"tokens" : [
{
"token" : "hello",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "world",
"start_offset" : 6,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 1
}
]
}
# standard 不支持中文
對索引字段進行分詞測試
$ curl -X GET "localhost:9200/lab702/_analyze?prettytrue" -H 'Content-Type: application/json' -d'{"field" : "about","text" : "Hello world!! I Love JAVA"}'
{
"tokens" : [
{
"token" : "hello",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "world",
"start_offset" : 6,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "i",
"start_offset" : 14,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "love",
"start_offset" : 16,
"end_offset" : 20,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "java",
"start_offset" : 21,
"end_offset" : 25,
"type" : "<ALPHANUM>",
"position" : 4
}
]
}
自定義分詞器
$ curl -X GET "localhost:9200/_analyze?prettytrue" -H 'Content-Type: application/json' -d'{"tokenizer" : "standard","filter" : ["uppercase"], "text" : "Hello JAVA"}'
{
"tokens" : [
{
"token" : "HELLO",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "JAVA",
"start_offset" : 6,
"end_offset" : 10,
"type" : "<ALPHANUM>",
"position" : 1
}
]
}
IK 分詞
在 docker-compose 中的 elasticsearch 服務新增一個掛載 plugins 的 volume,之後的套件透過本機的資料夾就可以連接到該 elasticsearch 的容器中。
- IK 提供 ik_smart 和 ik_max_word
- 前者最少切分
- 後者最細粒度劃分
$ curl -X GET "localhost:9200/_analyze?prettytrue" -H 'Content-Type: application/json' -d'{"analyzer" : "ik_smart","text" : "火影忍者它有夠難看"}'
{
"tokens" : [
{
"token" : "火影忍者",
"start_offset" : 0,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "它有",
"start_offset" : 4,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "夠",
"start_offset" : 6,
"end_offset" : 7,
"type" : "CN_CHAR",
"position" : 2
},
{
"token" : "難",
"start_offset" : 7,
"end_offset" : 8,
"type" : "CN_CHAR",
"position" : 3
},
{
"token" : "看",
"start_offset" : 8,
"end_offset" : 9,
"type" : "CN_CHAR",
"position" : 4
}
]
}
# 相比 standard 來得更好些
$ curl -X GET "localhost:9200/_analyze?prettytrue" -H 'Content-Type: application/json' -d'{"analyzer" : "ik_max_word","text" : "火影忍者它有夠難看"}'
{
"tokens" : [
{
"token" : "火影忍者",
"start_offset" : 0,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "火影",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "忍者",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "它有",
"start_offset" : 4,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "夠",
"start_offset" : 6,
"end_offset" : 7,
"type" : "CN_CHAR",
"position" : 4
},
{
"token" : "難",
"start_offset" : 7,
"end_offset" : 8,
"type" : "CN_CHAR",
"position" : 5
},
{
"token" : "看",
"start_offset" : 8,
"end_offset" : 9,
"type" : "CN_CHAR",
"position" : 6
}
]
}
自定義分類器
- Character Filter
- HTML strip
- 去 HTML 標籤
- Mapping
- 字符串替換操作
- Pattern Replace
- 正規表示替換
- HTML strip
$ curl -X GET "localhost:9200/_analyze?prettytrue" -H 'Content-Type: application/json' -d'{"tokenizer": "keyword", "char_filter": ["html_strip"], "text": "<div><b>火影忍者</b></div>"}'
{
"tokens" : [
{
"token" : "\n火影忍者\n",
"start_offset" : 0,
"end_offset" : 22,
"type" : "word",
"position" : 0
}
]
}
document 操作
- 簡單的 CRUD
創建
- 索引一個 document
document
透過index API
被索引。使數據能夠被儲存與搜索。但一開始須先知道document
,而document
透過其_index
、_type
、_id
這些唯一值確定。我們可以自己決定一個_id
,或者使用index API
自動生成。
$ curl -X POST "localhost:9200/test/doc/1?prettytrue" -H 'Content-Type: application/json' -d'{"text": "<div><b>火影忍者</b></div>"}'
{
"_index" : "test",
"_type" : "doc",
"_id" : "1",
"_version" : 1,
"result" : "created",
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 0,
"_primary_term" : 1
}
# localhost:9200/index/type/id
# id 不指定會自動生成
取得
同樣使用 _index
、_type
、_id
,這邊就用 GET
去獲取。當不指定 id
時會獲取全部。
- 指定 id
$ curl -X GET "localhost:9200/test/doc/1?prettytrue"
{
"_index" : "test",
"_type" : "doc",
"_id" : "1",
"_version" : 1,
"_seq_no" : 0,
"_primary_term" : 1,
"found" : true,
"_source" : {
"text" : "<div><b>火影忍者</b></div>"
}
}
- 搜尋全部
$ curl -X GET "localhost:9200/lab702/doc/_search?prettytrue"
# localhost:9200/index/type/_search
# _search API
- 檢索 document 一部分
通常,
GET
請求會返回 document 全部,並儲存在_source
參數中。但可以利用_source
針對要的field
進行過濾。
$ curl -X GET "localhost:9200/lab702/doc/2?_sourcename,about&prettytrue"
{
"_index" : "lab702",
"_type" : "doc",
"_id" : "2",
"_version" : 1,
"_seq_no" : 2,
"_primary_term" : 1,
"found" : true,
"_source" : {
"name" : "naruto",
"about" : "I Love NBA"
}
}
只想得到 _source
字段而不要其它數據,可以如下請求
$ curl -X GET "localhost:9200/lab702/doc/2/_source?prettytrue"
{
"name" : "naruto",
"age" : "18",
"sex" : "boy",
"about" : "I Love NBA",
"interests" : [
"cooking",
"music"
]
}
$ curl -X GET "localhost:9200/lab702/doc/2/_source?_sourcename,about&prettytrue"
{
"name" : "naruto",
"about" : "I Love NBA"
}
刪除
使用 DELETE
,並指定要刪除的 _index
$ curl -X DELETE "localhost:9200/test/doc/1/?prettytrue"
{
"_index" : "test",
"_type" : "doc",
"_id" : "1",
"_version" : 2,
"result" : "deleted",
"_shards" : {
"total" : 2,
"successful" : 2,
"failed" : 0
},
"_seq_no" : 1,
"_primary_term" : 1
}
# 在搜尋刪除 id 驗證
$ curl -X GET "localhost:9200/test/doc/1?prettytrue"
{
"_index" : "test",
"_type" : "doc",
"_id" : "1",
"found" : false
}
# test 的索引還在
$ curl -XGET 127.0.0.1:9200/_cat/indices
green open lab702 gm1but48QvSphA3-zCTtmw 1 1 3 0 12.8kb 6.4kb
green open test JkovYmndSLumkIT2v5cxbA 1 1 1 0 7.4kb 3.7kb
green open cch 4wwHrlA_TwG2REimQGzdrw 1 1 1 0 8.3kb 4.1kb
green open .kibana_1 M2_Fwd1TSqemehM846kb0Q 1 1 52 1 91.4kb 30.2kb
局部更新
_update
關鍵字,但通常覆蓋即可
批量插入 _bulk
每個 json 之間不能有換行,當資料量很大時很適合此方式,可借由 split
指令去切檔案大小在用 _bulk
上傳至 ES 中。
方式有以下
POST /_bulk
POST /<index>/_bulk
# 官方範例
curl -X POST "localhost:9200/_bulk?pretty" -H 'Content-Type: application/json' -d'
{ "index" : { "_index" : "test", "_id" : "1" } } # 指定 index 與 id
{ "field1" : "value1" }
{ "delete" : { "_index" : "test", "_id" : "2" } }
{ "create" : { "_index" : "test", "_id" : "3" } }
{ "field1" : "value3" }
{ "update" : {"_id" : "1", "_index" : "test"} }
{ "doc" : {"field2" : "value2"} }
'
$ curl -X GET "localhost:9200/test/doc/1?prettytrue"
{
"_index" : "test",
"_type" : "doc",
"_id" : "1",
"_version" : 2, # 先前有定義過此 id,這次將它覆蓋
"_seq_no" : 5,
"_primary_term" : 1,
"found" : true,
"_source" : {
"field1" : "value1",
"field2" : "value2"
}
}
$ curl -X POST "localhost:9200/index/doc/_bulk" -s -H "Content-Type: application/x-ndjson" --data-binary "@file.json"
檢索多 document
合併多個請求可避免每個請求單獨的網路開銷。使用方式是在請求中使用 multi-get
或 mget API
。
mget API
是一個 docs
的數組,數組中每個節點定義一個 document 的_index
、_type
和_id
數據,其中也可以搭配 _source
參數。
$ curl -X GET "localhost:9200/_mget?prettytrue" -H 'Content-Type: application/json' -d'{"docs":[{"_index": "test", "_type": "doc", "_id": 3},{"_index": "lab702", "_type": "doc", "_id": 1, "_source":"about"}]}'
{
"docs" : [
{
"_index" : "test",
"_type" : "doc",
"_id" : "3",
"_version" : 1,
"_seq_no" : 4,
"_primary_term" : 1,
"found" : true,
"_source" : {
"field1" : "value3"
}
},
{
"_index" : "lab702",
"_type" : "doc",
"_id" : "1",
"_version" : 2,
"_seq_no" : 1,
"_primary_term" : 1,
"found" : true,
"_source" : {
"about" : "I Love MLB"
}
}
]
}
# 回應時也是一個 docs 的數組,每個 document 包含一個回應,且按照順序。而此效果與使用 GET 方式是回應相同內容
檢所同一 _index
中或 _type
,可在請求時定義 _index
或 _index/_type
$ curl -X GET "localhost:9200/lab702/doc/_mget?pretty" -H 'Content-Type: application/json' -d'{ "ids":["3", "1"]}'
{
"docs" : [
{
"_index" : "lab702",
"_type" : "doc",
"_id" : "3",
"_version" : 1,
"_seq_no" : 3,
"_primary_term" : 1,
"found" : true,
"_source" : {
"name" : "madara",
"age" : "82",
"sex" : "boy",
"about" : "I Hate world",
"interests" : [
"running",
"LOL"
]
}
},
{
"_index" : "lab702",
"_type" : "doc",
"_id" : "1",
"_version" : 2,
"_seq_no" : 1,
"_primary_term" : 1,
"found" : true,
"_source" : {
"name" : "itacho",
"age" : "26",
"sex" : "boy",
"about" : "I Love MLB",
"interests" : [
"baseball",
"cooking",
"music"
]
}
}
]
}
然而,當請求的資訊有錯誤,其也會反映在回應內容中。
Mapping
定義資料庫中的表的結構的定義,透過 mapping
來控制索引儲存數據的設置
- 定義
index
下的field
名稱 - 定義
field
的類型- 數值
- 字串
- 布林
- …
- 定義倒排索引相關配置
- document id
- postion
- 得分
- …
獲取索引 mapping
$ curl -X GET "localhost:9200/lab702/_mapping?pretty"
{
"lab702" : { # index 名稱
"mappings" : {
"properties" : {
"about" : { # field
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"age" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"interests" : {
"type" : "text",
"fields" : { # 子字段
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
},
"fielddata" : true # 可做聚合操作
},
"name" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"sex" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
自定義 mapping
$ curl -X PUT "localhost:9200/custom-index?pretty" -H 'Content-Type: application/json' -d'
> {
> "mappings": {
> "properties": {
> "title": { "type": "text" },
> "name": { "type": "keyword" },
> "age": { "type": "integer" }
> }
> }
> }
> '
{
"acknowledged" : true,
"shards_acknowledged" : true,
"index" : "custom-index"
}
$ curl -X POST "localhost:9200/custom-index/_doc?pretty" -H 'Content-Type: application/json' -d'{"age": 22, "name": "naruto", "title": "Itachi"}'
{
"_index" : "custom-index",
"_type" : "_doc",
"_id" : "N8LTonMBzMtUWLBQRrNR",
"_version" : 1,
"result" : "created",
"_shards" : {
"total" : 2,
"successful" : 2,
"failed" : 0
},
"_seq_no" : 0,
"_primary_term" : 1
}
# 下面多 sex field,它會透過 dynamic Mapping 自動識別
$ curl -X POST "localhost:9200/custom-index/_doc?pretty" -H 'Content-Type: application/json' -d'{"sex": "boy", "age": 22, "name": "naruto", "title": "Itachi"}'
{
"_index" : "custom-index",
"_type" : "_doc",
"_id" : "OMLUonMBzMtUWLBQD7PN",
"_version" : 1,
"result" : "created",
"_shards" : {
"total" : 2,
"successful" : 2,
"failed" : 0
},
"_seq_no" : 1,
"_primary_term" : 1
}
mapping 只可新增,無法修改。
Dynamic Mapping
ES 依靠 json document中 field 來實現自動識別字段 dynamic-field-mapping
- dynamic 設置
- true
- 允許自動新增字段
- false
- 不允許自動新增字段,但 document 能夠正常寫入,無法對字段進行查詢操作
- static
- document 不能寫入
- true
Search API(URI)
GET /_search # 查詢所有 document
GET /index/_search # 查詢指定索引 document
GET /index, index2/_search # 查詢多索引 document
GET /index_*/_search # 正規表示過濾索引並查詢
查詢 q
$ curl -X GET "localhost:9200/packets/_doc/_search?qlayers.ip.ip_ip_ttl:51&pretty"
完整查詢
$ curl -X GET "localhost:9200/packets/_search?q6158&dflayers.udp.udp_udp_dstport&from0&size3&pretty"
# 從 udp_udp_dstport 字段找 6158,其中顯示分頁 0 前 3 個資訊
q
- 查詢語句
df
- 預設查詢的 field,若不指定則會對所有的 field 進行查詢
sort
- 排序,asc 升序,desc 降序
timeout
- 指定超時時間,默認不超時
from, size
- 分頁用
term 和 phrase
- term 單詞查詢
- phrase 詞語查詢
- 要求先後順序
$ curl -X GET "localhost:9200/packets/_search?q6158
# q 為詞語,term
# 當 qlove you,則會做分詞查詢,phrase
泛查詢
- 對所有的 term 進行查詢
- 全文檢索
指定字段
qfield:value
Group
- 分組設定
- 表達式
$ curl -X GET "localhost:9200/packets/_search?qjob:(JAVA and K8s)
profile
- 查詢過程
$ curl -X GET "localhost:9200/packets/_search?q6158&dflayers.udp.udp_udp_dstport&from0&size1&pretty" -H 'Content-Type: application/json' -d'{"profile": true}'
...
"profile" : {
"shards" : [
{
"id" : "[z2HVKUcpTzed_-VfHGlLSA][packets][0]",
"searches" : [
{
"query" : [
{
"type" : "TermQuery",
"description" : "layers.udp.udp_udp_dstport:6158",
"time_in_nanos" : 1239816,
"breakdown" : {
"set_min_competitive_score_count" : 8,
"match_count" : 0,
"shallow_advance_count" : 0,
"set_min_competitive_score" : 5643,
"next_doc" : 17092,
"match" : 0,
"next_doc_count" : 10,
"score_count" : 10,
"compute_max_score_count" : 0,
"compute_max_score" : 0,
"advance" : 14305,
"advance_count" : 4,
"score" : 43095,
"build_scorer_count" : 17,
"create_weight" : 276371,
"shallow_advance" : 0,
"create_weight_count" : 1,
"build_scorer" : 883260
}
}
],
"rewrite_time" : 4903,
"collector" : [
{
"name" : "SimpleTopScoreDocCollector",
"reason" : "search_top_hits",
"time_in_nanos" : 83802
}
]
}
],
"aggregations" : [ ]
}
]
...
boolean 操作
- AND(&&)
- OR(||)
- NOT(!)
- + must
- 需編碼
%2B
- - must_not
- 需編碼
範圍查詢
- 支援數值與日期
- 區間
- 閉區間 []
- 開區間 {}
age:[1 TO 10] # 1< age < 10
age:[1 TO 10} # 1< age < 10
age:[1 TO ] # 1< age
age:[* TO 10] # age < 10
- 算術符號
age:>1
age:(>1 && <10)
萬用字元表示查詢
?:1
多個字符*:0
0 或多個字符- 執行效率低
正規表達式
field:/[mb]oat/
模糊匹配
field:roam~1 # 匹配與 roam 差一個字元的詞,foam, roams ...
近似度查詢
- 以 term 為單位進行差異比較
"fpx quick"~5
search API(Request Body Search)
Match Query
對字段做全文搜索,但不支持多字段查詢。
$ curl -X GET "localhost:9200/packets/_search?pretty" -H 'Content-Type: application/json' -d'{"query":{"match":{"layers.eth.eth_eth_addr_oui_resolved": "Cisco Systems, Inc"}}}' # term 方式
$ curl -X GET "localhost:9200/packets/_search?pretty" -H 'Content-Type: application/json' -d'{
"query":{
"match":{
"layers.eth.eth_eth_addr_oui_resolved":
"query": "Cisco Systems, Inc",
"operator": "and"
}
}
}'
Match phrase
$ curl -X GET "localhost:9200/packets/_search?pretty" -H 'Content-Type: application/json' -d'{"query":{"match_phrase":{"layers.eth.eth_eth_addr_oui_resolved":"Cisco Systems, Inc"}}}'
$ curl -X GET "localhost:9200/packets/_search?pretty" -H 'Content-Type: application/json' -d'{
"query":{
"match_phrase":{
"layers.eth.eth_eth_addr_oui_resolved":
"query": "Cisco Systems, Inc",
"slop": 1 # 允許有幾個詞的差異
}
}
}'
字符查詢
$ curl -X GET "localhost:9200/packets/_search?pretty" -H 'Content-Type: application/json' -d'{"query":{"query_string":{"fields" :["layers.eth.eth_eth_addr_oui_resolved"], "query": "(Cisco OR Dell)"}}}'
# fields 可以多字段
不分詞查詢
curl -X GET "localhost:9200/packets/_search?pretty" -H 'Content-Type: application/json' -d'{"query":{"term":{"layers.eth.eth_eth_addr_oui_resolved":{"value": "Cisco Systems, Inc"}}}}' # 不分詞
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
}
}
使用 Cisco Systems, Inc
匹配該字段值,然而因為分詞導致無法匹配成功。
預設情況下該字段值為 Cisco Systems, Inc
將被分成
Cisco
Systems
Inc
因為這樣,導致每個分詞不等於 Cisco Systems, Inc
Range 查詢
Bool 查詢
欄位 | 描述 |
---|---|
filter | 只過濾符合條件的 document,不計算相關性得分 |
must | document 必須符合 must 中的所有條件,會影響相關性得分 |
must_not | document 必須不符合 must_not 中所有條件 |
should | document 可以符合 should 中的條件,會影響相關性得分 |
相關性算分
指 document 查詢據間的相關度,透過到排索引可以獲取與查詢語句相匹配的 document 列表。
影響參數
- TF:詞頻,即單詞在 document 中出現的次數,詞頻越高相關度越高
- DF:document 詞頻,即單詞出現的 document 數
- IDF:與 document 詞頻相反,即
1/DF
。即單詞出現的 document數越少,相關度越高(如果一個單詞在 document集出現越少,算為越重要單詞) - Firld-length Norm:document 越短,相關度越高
TF-IDE
BM25
ES 集群
集群分片
- index 只是一個用來指向一個或多個分片的羅集命名空間
- 分片是一個最小級別工作單元
- 保存索引中所有數據的一部分,是一個 Lucene 實例
- 本身是一個完整的搜索引擎
- document 儲存在分片中,並在分片中被索引
- 直接與索引通訊
- 分片是 ES 在集群中分發數據的關鍵
- 分片分配到集群中的節點上
- 當作水平擴展時,ES 會自動在節點間遷移分片,同時保持平衡
- Index
- 相當於 MySQL 中的 Insert 或 Database
- Type
- 在 Index 中,可以定義一個或多個類型
- 類似于 MySQL 中的 Table,每一種類型的數據放在一起
- Document
- 保存在某個 index 下,某種 Type 的一個數據
- 以 json 格式呈現
- 相似于 Table 中的內容
- 分片包含
- indices
- 主分片
- 負載
- 索引中每個文檔屬於一個單獨的主分片,所以主分片的數量決定了索引最多能儲存多少數據
- 主分片
- replicate
- 副分片
- 高可用
- 主分片的一個副本,可防止硬體故導致數據的丟失,同時也可提供請求
- 副分片
- indices
# 預設
PUT /index
{
"settings":{
"number_of_shards": 3,
"number_of_replicas": 1
}
}
# 修改方式
PUT /index/_settings
{
"settings":{
"number_of_replicas": 3
}
}
路由
ES 如何知道文檔屬於哪個分片的呢?創建時又是如何知道儲存在哪個分片上 ?透過以下算法決定
shard hash(routing)%number_of_primary_shard
# routing 值適任一字串,默認 _id 但也可以自定義
如果主分片的數量在未來改變了,所有先前路由的值就失效,文檔也就永遠找不到了,因此該數量無法在定義後修改