Elasticsearch 筆記

June 12, 2020

—

特點

分散式的即時文件儲存，每個字段都可被索引並可搜索
分散式即時分析搜索引擎
- 不規則查詢
高擴展，可處裡 PB 級結構或非結構化數據
Lucene 實現索引和搜索功能

透過簡單的 RESTful API 來隱藏 Lucene 的複雜性，讓搜索變簡單

ES 能做什麼

全文檢索
模糊查詢
數據分析
- 聚合等

Elasticsearch 的交互方式

基於 HTTP 協定，以 JSON 為數據交互格式的 RESTful API

_cat

$ curl 'http://192.168.227.141:9200/_cat'
^.^
/_cat/allocation
/_cat/shards
/_cat/shards/{index}
/_cat/master
/_cat/nodes
/_cat/tasks
/_cat/indices
/_cat/indices/{index}
/_cat/segments
/_cat/segments/{index}
/_cat/count
/_cat/count/{index}
/_cat/recovery
/_cat/recovery/{index}
/_cat/health
/_cat/pending_tasks
/_cat/aliases
/_cat/aliases/{alias}
/_cat/thread_pool
/_cat/thread_pool/{thread_pools}
/_cat/plugins
/_cat/fielddata
/_cat/fielddata/{fields}
/_cat/nodeattrs
/_cat/repositories
/_cat/snapshots/{repository}
/_cat/templates

$ curl 'http://192.168.227.141:9200/_cat/health'
1595807442 23:50:42 docker-cluster green 3 2 4 2 0 0 0 0 - 100.0%

$ curl 'http://192.168.227.141:9200/_cat/nodes'
172.18.0.4 25 85 2 0.18 0.53 0.58 mr  * es-master
172.18.0.2 12 85 2 0.18 0.53 0.58 dmr - es-node2
172.18.0.3  8 85 2 0.18 0.53 0.58 dmr - es-node3

ElasticSearch 儲存

面向文檔表示可以儲存整個對象或 document。並還可以索引(index)每個文檔都可搜索。在 ES 中，可以對文檔進行索引、搜索、排序、過濾等。傳統資料庫以一張表來存儲，並用 SQL 指令進行查詢在 ES 中每一條數據為一個 document
JSON 從上面圖中的 ES 來說，每一條 document 會以 JSON 進行存儲

Index

lab702/doc/1 –> index/type/id

$ curl -XPOST 'http://192.168.227.141:9200/lab702/doc/1' -i -H "Content-Type:application/json" -d '{"name":"itacho", "age": "26", "sex":"boy"}'
HTTP/1.1 201 Created
Location: /lab702/doc/1
Warning: 299 Elasticsearch-7.7.0-81a1e9eda8e6183f5237786246f6dced26a10eaf "[types removal] Specifying types in document index requests is deprecated, use the typeless endpoints instead (/{index}/_doc/{id}, /{index}/_doc, or /{index}/_create/{id})."
content-type: application/json; charsetUTF-8
content-length: 153

{"_index":"lab702","_type":"doc","_id":"1","_version":1,"result":"created","_shards":{"total":2,"successful":2,"failed":0},"_seq_no":0,"_primary_term":1}

_search

$ curl 'http://192.168.227.141:9200/lab702/_search' -i -H "Content-Type:application/json" -d '{"query":{"match_all":{}}}'
HTTP/1.1 200 OK
content-type: application/json; charsetUTF-8
content-length: 269

{"took":2,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":1,"relation":"eq"},"max_score":1.0,"hits":[{"_index":"lab702","_type":"doc","_id":"1","_score":1.0,"_source":{"name":"itacho", "age": "26", "sex":"boy"}}]}}

ES 數據結構

在資料庫中，會有 database 在定義 table 在定義屬性最會插入數據。而在 ES 中一個 index 相當於一個 database。結構如下圖，其中 “id” 和 “name” 可以稱為 field。

$ curl 'http://192.168.227.141:9200/lab702/doc/1?prettytrue'
{
  "_index" : "lab702", // 索引
  "_type" : "doc", // 類型
  "_id" : "1", // document 唯一 ID
  "_version" : 1, // 更新時會遞增，document 只要發生變化都會使此關鍵字遞增，確保程序的一部分不會覆蓋另一部分所做的更改
  "_seq_no" : 0, 
  "_primary_term" : 1,
  "found" : true,
  "_source" : { // 查詢數據
    "name" : "itacho",
    "age" : "26",
    "sex" : "boy"
  }
}

詳細介紹可查看此篇

檢索

http://192.168.227.141:9200/lab702/doc/1 該語句為檢索在 lab702 index 中的 doc type 且編號為 1 的數據，相當與 SQL 指令 SELECT * FROM table WHERE id1

簡單檢索

MySQL：SELECT * FROM table ES：GET /index/type/_search

$ curl 'http://192.168.227.141:9200/lab702/doc/_search?prettytrue'
{
  "took" : 418, # 查詢時間
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : { # 查詢內容
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "lab702",
        "_type" : "doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "name" : "itacho",
          "age" : "26",
          "sex" : "boy",
          "about" : "I Love MLB",
          "interests" : [
            "baseball",
            "cooking",
            "music"
          ]
        }
      },
      {
        "_index" : "lab702",
        "_type" : "doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "name" : "naruto",
          "age" : "18",
          "sex" : "boy",
          "about" : "I Love NBA",
          "interests" : [
            "cooking",
            "music"
          ]
        }
      },
      {
        "_index" : "lab702",
        "_type" : "doc",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "name" : "madara",
          "age" : "82",
          "sex" : "boy",
          "about" : "I Hate world",
          "interests" : [
            "running",
            "LOL"
          ]
        }
      }
    ]
  }
}

回應內容有被匹配到的 document，在此範例中是將所有 document 取出

全文檢索

此範例是找出 field 值有 madara 的相關 document。

$ curl 'http://192.168.227.141:9200/lab702/doc/_search?qmadara&prettytrue'
{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.9808291,
    "hits" : [
      {
        "_index" : "lab702",
        "_type" : "doc",
        "_id" : "3",
        "_score" : 0.9808291,
        "_source" : {
          "name" : "madara",
          "age" : "82",
          "sex" : "boy",
          "about" : "I Hate world",
          "interests" : [
            "running",
            "LOL"
          ]
        }
      }
    ]
  }
}

搜索（模糊查詢）

查詢所有 document 中 field 值分詞包含 love 的 document

$ curl 'http://192.168.227.141:9200/lab702/doc/_search?qlove&prettytrue'
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.4700036,
    "hits" : [
      {
        "_index" : "lab702",
        "_type" : "doc",
        "_id" : "1",
        "_score" : 0.4700036,
        "_source" : {
          "name" : "itacho",
          "age" : "26",
          "sex" : "boy",
          "about" : "I Love MLB",
          "interests" : [
            "baseball",
            "cooking",
            "music"
          ]
        }
      },
      {
        "_index" : "lab702",
        "_type" : "doc",
        "_id" : "2",
        "_score" : 0.4700036,
        "_source" : {
          "name" : "naruto",
          "age" : "18",
          "sex" : "boy",
          "about" : "I Love NBA",
          "interests" : [
            "cooking",
            "music"
          ]
        }
      }
    ]
  }
}

聚合

MySQL：Group By ES：aggs

ES 預設將聚合禁止，用以下方式將聚合的 field 添加優化

$ curl -X PUT "localhost:9200/lab702/_mapping?pretty" -H 'Content-Type: application/json' -d'
{
  "properties": {
    "interests": {
      "type":     "text",
      "fielddata": true
    }
  }
}
'
{
  "acknowledged" : true
}

$ curl -XPOST 'http://192.168.227.141:9200/lab702/doc/_search?prettytrue' -i -H "Content-Type:application/json" -d '{"aggs":{"all_inter":{"terms":{"field":"interests"}}}}'
HTTP/1.1 200 OK
Warning: 299 Elasticsearch-7.7.0-81a1e9eda8e6183f5237786246f6dced26a10eaf "[types removal] Specifying types in search requests is deprecated."
content-type: application/json; charsetUTF-8
content-length: 1839

{
  "took" : 29,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "lab702",
        "_type" : "doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "name" : "itacho",
          "age" : "26",
          "sex" : "boy",
          "about" : "I Love MLB",
          "interests" : [
            "baseball",
            "cooking",
            "music"
          ]
        }
      },
      {
        "_index" : "lab702",
        "_type" : "doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "name" : "naruto",
          "age" : "18",
          "sex" : "boy",
          "about" : "I Love NBA",
          "interests" : [
            "cooking",
            "music"
          ]
        }
      },
      {
        "_index" : "lab702",
        "_type" : "doc",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "name" : "madara",
          "age" : "82",
          "sex" : "boy",
          "about" : "I Hate world",
          "interests" : [
            "running",
            "LOL"
          ]
        }
      }
    ]
  },
  "aggregations" : {
    "all_inter" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "cooking",
          "doc_count" : 2
        },
        {
          "key" : "music",
          "doc_count" : 2
        },
        {
          "key" : "baseball",
          "doc_count" : 1
        },
        {
          "key" : "lol",
          "doc_count" : 1
        },
        {
          "key" : "running",
          "doc_count" : 1
        }
      ]
    }
  }
}

搜索原理

正排索引和倒排索引

正排索引
- 紀錄 document ID 倒 document 內容、單詞的關聯關係
- 分詞過程

正排索引

ID	Content
1	火影行動
2	探索火影行動
3	火影特別行動
4	火影紀錄篇
5	特工火影特別探索

倒排索引
- 單詞到 document ID 的關係
- 分詞
  - 將句子拆分為單詞
- 檢索
  - B+tree 資料結構
  - 火影特工行動
    - 對應到倒排索引表中"火影"、“行動"和"特工”，因此會將紀錄的對應回傳
    - 會利用相關性得分來尋找最符合
      - 其紀錄中的 3 和 5 命中的多，從 3 的值來說分詞為三個但命中兩個機率為 2/3，而 5 則為 2/4
- Posting List
  - Document ID
  - Term Frequency
    - 單詞在 document 中出現次數，與相關性得分有關
  - Position
    - 單詞在 document 的位置，與搜尋有關
    - 從 0 開始
  - Offset
    - 單詞的開始與結束位置
- 相關性得分
- 每個 document 都會有一個倒排索引

正排索引

ID	Content
1	火影行動
2	探索火影行動
3	火影特別行動
4	火影紀錄篇
5	特工火影特別探索

Posting List，以火影為例，並用2個詞作為分詞

ID	TF	Position	Offset
1	1	0	<0,1>
2	1	1	<2,3>
3	1	0	<0,1>
4	1	0	<0,1>
5	1	1	<2,3>

ES 維護的倒排索引表

詞	紀錄
火影	1,2,3,4,5
行動	1,2,3
探索	2,5
特別	3,5
記錄篇	4
特工	5

分詞

在 Elasticsearch 中稱為 Analysis
將文本轉換成一系列單詞的過程

以"特工火影特別探索"為例，可分成"特工"、“火影”、“特別探索”、“探索”

分詞機制

Character Filter
- 對原始文本進行處理
- 特殊符號或 Html 標籤等
Tokenizer
- 對原始文本進行分詞
  - “火影忍者”，可分成"火影"、“忍者”
Token Filters
- 分詞後關鍵字進行加工
  - 轉小寫、刪除語氣詞、同意詞等

ES 自帶分詞器
- Standard
- Simple
- Whitespace
- Stop
- Keyword
  - 不分詞
- Pattern
- Language
- …
中文分詞
- IK
- Jieba
- Hanlp
- THULAC

API _analyze

$ curl -X GET "localhost:9200/_analyze?prettytrue" -H 'Content-Type: application/json' -d'{"analyzer" : "standard","text" : "Hello {orld"}'
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "world",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
 # standard 不支持中文

對索引字段進行分詞測試

$ curl -X GET "localhost:9200/lab702/_analyze?prettytrue" -H 'Content-Type: application/json' -d'{"field" : "about","text" : "Hello world!! I Love JAVA"}'
{
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "world",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "i",
      "start_offset" : 14,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "love",
      "start_offset" : 16,
      "end_offset" : 20,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "java",
      "start_offset" : 21,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 4
    }
  ]
}

自定義分詞器

$ curl -X GET "localhost:9200/_analyze?prettytrue" -H 'Content-Type: application/json' -d'{"tokenizer" : "standard","filter" : ["uppercase"], "text" : "Hello JAVA"}'
{
  "tokens" : [
    {
      "token" : "HELLO",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "JAVA",
      "start_offset" : 6,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}

IK 分詞

在 docker-compose 中的 elasticsearch 服務新增一個掛載 plugins 的 volume，之後的套件透過本機的資料夾就可以連接到該 elasticsearch 的容器中。

IK 提供 ik_smart 和 ik_max_word
- 前者最少切分
- 後者最細粒度劃分

$ curl -X GET "localhost:9200/_analyze?prettytrue" -H 'Content-Type: application/json' -d'{"analyzer" : "ik_smart","text" : "火影忍者它有夠難看"}'
{
  "tokens" : [
    {
      "token" : "火影忍者",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "它有",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "夠",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "難",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "看",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "CN_CHAR",
      "position" : 4
    }
  ]
}
# 相比 standard 來得更好些

$ curl -X GET "localhost:9200/_analyze?prettytrue" -H 'Content-Type: application/json' -d'{"analyzer" : "ik_max_word","text" : "火影忍者它有夠難看"}'
{
  "tokens" : [
    {
      "token" : "火影忍者",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "火影",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "忍者",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "它有",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "夠",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "難",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "CN_CHAR",
      "position" : 5
    },
    {
      "token" : "看",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "CN_CHAR",
      "position" : 6
    }
  ]
}

自定義分類器

Character Filter
- HTML strip
  - 去 HTML 標籤
- Mapping
  - 字符串替換操作
- Pattern Replace
  - 正規表示替換

$ curl -X GET "localhost:9200/_analyze?prettytrue" -H 'Content-Type: application/json' -d'{"tokenizer": "keyword", "char_filter": ["html_strip"], "text": "<div><b>火影忍者</b></div>"}'
{
  "tokens" : [
    {
      "token" : "\n火影忍者\n",
      "start_offset" : 0,
      "end_offset" : 22,
      "type" : "word",
      "position" : 0
    }
  ]
}

document 操作

簡單的 CRUD

創建

索引一個 document document 透過 index API 被索引。使數據能夠被儲存與搜索。但一開始須先知道 document，而 document 透過其 _index、_type、_id 這些唯一值確定。我們可以自己決定一個_id，或者使用 index API 自動生成。

$ curl -X POST "localhost:9200/test/doc/1?prettytrue" -H 'Content-Type: application/json' -d'{"text": "<div><b>火影忍者</b></div>"}'
{
  "_index" : "test",
  "_type" : "doc",
  "_id" : "1",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}
# localhost:9200/index/type/id
# id 不指定會自動生成

取得

同樣使用 _index、_type、_id，這邊就用 GET 去獲取。當不指定 id 時會獲取全部。

指定 id

$ curl -X GET "localhost:9200/test/doc/1?prettytrue"
{
  "_index" : "test",
  "_type" : "doc",
  "_id" : "1",
  "_version" : 1,
  "_seq_no" : 0,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "text" : "<div><b>火影忍者</b></div>"
  }
}

搜尋全部

$ curl -X GET "localhost:9200/lab702/doc/_search?prettytrue"
# localhost:9200/index/type/_search
#  _search API

檢索 document 一部分通常，GET 請求會返回 document 全部，並儲存在 _source 參數中。但可以利用 _source 針對要的 field 進行過濾。

$ curl -X GET "localhost:9200/lab702/doc/2?_sourcename,about&prettytrue"
{
  "_index" : "lab702",
  "_type" : "doc",
  "_id" : "2",
  "_version" : 1,
  "_seq_no" : 2,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "name" : "naruto",
    "about" : "I Love NBA"
  }
}

只想得到 _source 字段而不要其它數據，可以如下請求

$ curl -X GET "localhost:9200/lab702/doc/2/_source?prettytrue"
{
  "name" : "naruto",
  "age" : "18",
  "sex" : "boy",
  "about" : "I Love NBA",
  "interests" : [
    "cooking",
    "music"
  ]
}
$ curl -X GET "localhost:9200/lab702/doc/2/_source?_sourcename,about&prettytrue"
{
  "name" : "naruto",
  "about" : "I Love NBA"
}

刪除

使用 DELETE，並指定要刪除的 _index

$ curl -X DELETE "localhost:9200/test/doc/1/?prettytrue"
{
  "_index" : "test",
  "_type" : "doc",
  "_id" : "1",
  "_version" : 2,
  "result" : "deleted",
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "_seq_no" : 1,
  "_primary_term" : 1
}
# 在搜尋刪除 id 驗證
$ curl -X GET "localhost:9200/test/doc/1?prettytrue"
{
  "_index" : "test",
  "_type" : "doc",
  "_id" : "1",
  "found" : false
}
# test 的索引還在
$ curl -XGET 127.0.0.1:9200/_cat/indices
green open lab702    gm1but48QvSphA3-zCTtmw 1 1  3 0 12.8kb  6.4kb
green open test      JkovYmndSLumkIT2v5cxbA 1 1  1 0  7.4kb  3.7kb
green open cch       4wwHrlA_TwG2REimQGzdrw 1 1  1 0  8.3kb  4.1kb
green open .kibana_1 M2_Fwd1TSqemehM846kb0Q 1 1 52 1 91.4kb 30.2kb

局部更新

_update 關鍵字，但通常覆蓋即可

批量插入 _bulk

每個 json 之間不能有換行，當資料量很大時很適合此方式，可借由 split 指令去切檔案大小在用 _bulk 上傳至 ES 中。

方式有以下

POST /_bulk
POST /<index>/_bulk

# 官方範例
curl -X POST "localhost:9200/_bulk?pretty" -H 'Content-Type: application/json' -d'
{ "index" : { "_index" : "test", "_id" : "1" } } # 指定 index 與 id
{ "field1" : "value1" }
{ "delete" : { "_index" : "test", "_id" : "2" } }
{ "create" : { "_index" : "test", "_id" : "3" } }
{ "field1" : "value3" }
{ "update" : {"_id" : "1", "_index" : "test"} }
{ "doc" : {"field2" : "value2"} }
'

$ curl -X GET "localhost:9200/test/doc/1?prettytrue"
{
  "_index" : "test",
  "_type" : "doc",
  "_id" : "1",
  "_version" : 2, # 先前有定義過此 id，這次將它覆蓋
  "_seq_no" : 5,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "field1" : "value1",
    "field2" : "value2"
  }
}

$ curl -X POST "localhost:9200/index/doc/_bulk" -s -H "Content-Type: application/x-ndjson"  --data-binary "@file.json"

檢索多 document

合併多個請求可避免每個請求單獨的網路開銷。使用方式是在請求中使用 multi-get 或 mget API。

mget API 是一個 docs 的數組，數組中每個節點定義一個 document 的_index、_type和_id 數據，其中也可以搭配 _source 參數。

$ curl -X GET "localhost:9200/_mget?prettytrue" -H 'Content-Type: application/json' -d'{"docs":[{"_index": "test", "_type": "doc", "_id": 3},{"_index": "lab702", "_type": "doc", "_id": 1, "_source":"about"}]}'
{
  "docs" : [
    {
      "_index" : "test",
      "_type" : "doc",
      "_id" : "3",
      "_version" : 1,
      "_seq_no" : 4,
      "_primary_term" : 1,
      "found" : true,
      "_source" : {
        "field1" : "value3"
      }
    },
    {
      "_index" : "lab702",
      "_type" : "doc",
      "_id" : "1",
      "_version" : 2,
      "_seq_no" : 1,
      "_primary_term" : 1,
      "found" : true,
      "_source" : {
        "about" : "I Love MLB"
      }
    }
  ]
}
# 回應時也是一個 docs 的數組，每個 document 包含一個回應，且按照順序。而此效果與使用 GET 方式是回應相同內容

檢所同一 _index 中或 _type，可在請求時定義 _index 或 _index/_type

$ curl -X GET "localhost:9200/lab702/doc/_mget?pretty" -H 'Content-Type: application/json' -d'{ "ids":["3", "1"]}'
{
  "docs" : [
    {
      "_index" : "lab702",
      "_type" : "doc",
      "_id" : "3",
      "_version" : 1,
      "_seq_no" : 3,
      "_primary_term" : 1,
      "found" : true,
      "_source" : {
        "name" : "madara",
        "age" : "82",
        "sex" : "boy",
        "about" : "I Hate world",
        "interests" : [
          "running",
          "LOL"
        ]
      }
    },
    {
      "_index" : "lab702",
      "_type" : "doc",
      "_id" : "1",
      "_version" : 2,
      "_seq_no" : 1,
      "_primary_term" : 1,
      "found" : true,
      "_source" : {
        "name" : "itacho",
        "age" : "26",
        "sex" : "boy",
        "about" : "I Love MLB",
        "interests" : [
          "baseball",
          "cooking",
          "music"
        ]
      }
    }
  ]
}

然而，當請求的資訊有錯誤，其也會反映在回應內容中。

Mapping

定義資料庫中的表的結構的定義，透過 mapping 來控制索引儲存數據的設置

定義 index 下的 field 名稱
定義 field 的類型
- 數值
- 字串
- 布林
- …
定義倒排索引相關配置
- document id
- postion
- 得分
- …

獲取索引 mapping

$ curl -X GET "localhost:9200/lab702/_mapping?pretty"
{
  "lab702" : { # index 名稱
    "mappings" : {
      "properties" : {
        "about" : { # field
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "age" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "interests" : {
          "type" : "text",
          "fields" : { # 子字段
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          },
          "fielddata" : true # 可做聚合操作
        },
        "name" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "sex" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}

keyword

fielddata

自定義 mapping

custom mapping

$ curl -X PUT "localhost:9200/custom-index?pretty" -H 'Content-Type: application/json' -d'
> {
>   "mappings": {
>     "properties": {
>       "title":    { "type": "text" },
>       "name":  { "type": "keyword"  },
>       "age":   { "type": "integer"  }
>     }
>   }
> }
> '
{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "custom-index"
}

$ curl -X POST  "localhost:9200/custom-index/_doc?pretty" -H 'Content-Type: application/json' -d'{"age": 22, "name": "naruto", "title": "Itachi"}'
{
  "_index" : "custom-index",
  "_type" : "_doc",
  "_id" : "N8LTonMBzMtUWLBQRrNR",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}
# 下面多 sex field，它會透過 dynamic Mapping 自動識別
$ curl -X POST  "localhost:9200/custom-index/_doc?pretty" -H 'Content-Type: application/json' -d'{"sex": "boy", "age": 22, "name": "naruto", "title": "Itachi"}'
{
  "_index" : "custom-index",
  "_type" : "_doc",
  "_id" : "OMLUonMBzMtUWLBQD7PN",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "_seq_no" : 1,
  "_primary_term" : 1
}

mapping 只可新增，無法修改。

_doc 相關資訊

Dynamic Mapping

ES 依靠 json document中 field 來實現自動識別字段 dynamic-field-mapping

dynamic 設置
- true
  - 允許自動新增字段
- false
  - 不允許自動新增字段，但 document 能夠正常寫入，無法對字段進行查詢操作
- static
  - document 不能寫入

Search API(URI)

GET /_search # 查詢所有 document
GET /index/_search # 查詢指定索引 document
GET /index, index2/_search # 查詢多索引 document
GET /index_*/_search # 正規表示過濾索引並查詢

查詢 q

$ curl -X GET "localhost:9200/packets/_doc/_search?qlayers.ip.ip_ip_ttl:51&pretty"

完整查詢

$ curl -X GET "localhost:9200/packets/_search?q6158&dflayers.udp.udp_udp_dstport&from0&size3&pretty" 
# 從 udp_udp_dstport 字段找 6158，其中顯示分頁 0 前 3 個資訊

q
- 查詢語句
df
- 預設查詢的 field，若不指定則會對所有的 field 進行查詢
sort
- 排序，asc 升序，desc 降序
timeout
- 指定超時時間，默認不超時
from, size
- 分頁用
term 和 phrase
- term 單詞查詢
- phrase 詞語查詢
  - 要求先後順序

$ curl -X GET "localhost:9200/packets/_search?q6158 
# q 為詞語，term
# 當 qlove you，則會做分詞查詢，phrase

泛查詢

對所有的 term 進行查詢
全文檢索

指定字段

qfield:value

Group

分組設定
表達式

$ curl -X GET "localhost:9200/packets/_search?qjob:(JAVA and K8s)

profile

查詢過程

$ curl -X GET "localhost:9200/packets/_search?q6158&dflayers.udp.udp_udp_dstport&from0&size1&pretty" -H 'Content-Type: application/json' -d'{"profile": true}'
...
"profile" : {
    "shards" : [
      {
        "id" : "[z2HVKUcpTzed_-VfHGlLSA][packets][0]",
        "searches" : [
          {
            "query" : [
              {
                "type" : "TermQuery",
                "description" : "layers.udp.udp_udp_dstport:6158",
                "time_in_nanos" : 1239816,
                "breakdown" : {
                  "set_min_competitive_score_count" : 8,
                  "match_count" : 0,
                  "shallow_advance_count" : 0,
                  "set_min_competitive_score" : 5643,
                  "next_doc" : 17092,
                  "match" : 0,
                  "next_doc_count" : 10,
                  "score_count" : 10,
                  "compute_max_score_count" : 0,
                  "compute_max_score" : 0,
                  "advance" : 14305,
                  "advance_count" : 4,
                  "score" : 43095,
                  "build_scorer_count" : 17,
                  "create_weight" : 276371,
                  "shallow_advance" : 0,
                  "create_weight_count" : 1,
                  "build_scorer" : 883260
                }
              }
            ],
            "rewrite_time" : 4903,
            "collector" : [
              {
                "name" : "SimpleTopScoreDocCollector",
                "reason" : "search_top_hits",
                "time_in_nanos" : 83802
              }
            ]
          }
        ],
        "aggregations" : [ ]
      }
    ]
...

boolean 操作

AND(&&)
OR(||)
NOT(!)
+ must
- 需編碼
- %2B
- must_not
- 需編碼

範圍查詢

支援數值與日期
區間
- 閉區間 []
- 開區間 {}

age:[1 TO 10] # 1< age < 10
age:[1 TO 10} # 1< age < 10
age:[1 TO ] # 1< age 
age:[* TO 10] # age < 10

算術符號

age:>1
age:(>1 && <10)

萬用字元表示查詢

?:1 多個字符
*:0 0 或多個字符
執行效率低

正規表達式

field:/[mb]oat/

模糊匹配

field:roam~1 # 匹配與 roam 差一個字元的詞，foam, roams ...

近似度查詢

以 term 為單位進行差異比較

"fpx quick"~5

search API(Request Body Search)

Match Query

對字段做全文搜索，但不支持多字段查詢。

$ curl -X GET "localhost:9200/packets/_search?pretty" -H 'Content-Type: application/json' -d'{"query":{"match":{"layers.eth.eth_eth_addr_oui_resolved": "Cisco Systems, Inc"}}}' # term 方式

$ curl -X GET "localhost:9200/packets/_search?pretty" -H 'Content-Type: application/json' -d'{
    "query":{
        "match":{
            "layers.eth.eth_eth_addr_oui_resolved": 
                "query": "Cisco Systems, Inc",
                "operator": "and"
        }
    }
}'

Match phrase

$ curl -X GET "localhost:9200/packets/_search?pretty" -H 'Content-Type: application/json' -d'{"query":{"match_phrase":{"layers.eth.eth_eth_addr_oui_resolved":"Cisco Systems, Inc"}}}'

$ curl -X GET "localhost:9200/packets/_search?pretty" -H 'Content-Type: application/json' -d'{
    "query":{
        "match_phrase":{
            "layers.eth.eth_eth_addr_oui_resolved": 
                "query": "Cisco Systems, Inc",
                "slop": 1 # 允許有幾個詞的差異
        }
    }
}'

字符查詢

$ curl -X GET "localhost:9200/packets/_search?pretty" -H 'Content-Type: application/json' -d'{"query":{"query_string":{"fields" :["layers.eth.eth_eth_addr_oui_resolved"], "query": "(Cisco OR Dell)"}}}' 

# fields 可以多字段

不分詞查詢

curl -X GET "localhost:9200/packets/_search?pretty" -H 'Content-Type: application/json' -d'{"query":{"term":{"layers.eth.eth_eth_addr_oui_resolved":{"value": "Cisco Systems, Inc"}}}}' # 不分詞

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

使用 Cisco Systems, Inc 匹配該字段值，然而因為分詞導致無法匹配成功。

預設情況下該字段值為 Cisco Systems, Inc 將被分成 Cisco Systems Inc 因為這樣，導致每個分詞不等於 Cisco Systems, Inc

terms查詢

Range 查詢

Bool 查詢

欄位	描述
filter	只過濾符合條件的 document，不計算相關性得分
must	document 必須符合 must 中的所有條件，會影響相關性得分
must_not	document 必須不符合 must_not 中所有條件
should	document 可以符合 should 中的條件，會影響相關性得分

Bool 查詢

ES 集群

集群分片

index 只是一個用來指向一個或多個分片的羅集命名空間
分片是一個最小級別工作單元
- 保存索引中所有數據的一部分，是一個 Lucene 實例
- 本身是一個完整的搜索引擎
- document 儲存在分片中，並在分片中被索引
- 直接與索引通訊
分片是 ES 在集群中分發數據的關鍵
分片分配到集群中的節點上
當作水平擴展時，ES 會自動在節點間遷移分片，同時保持平衡

Index

相當於 MySQL 中的 Insert 或 Database

Type

在 Index 中，可以定義一個或多個類型
類似于 MySQL 中的 Table，每一種類型的數據放在一起

Document

保存在某個 index 下，某種 Type 的一個數據
以 json 格式呈現
相似于 Table 中的內容
分片包含
- indices
  - 主分片
    - 負載
  - 索引中每個文檔屬於一個單獨的主分片，所以主分片的數量決定了索引最多能儲存多少數據
- replicate
  - 副分片
    - 高可用
  - 主分片的一個副本，可防止硬體故導致數據的丟失，同時也可提供請求

# 預設
PUT /index
{
    "settings":{
        "number_of_shards": 3,
        "number_of_replicas": 1
    }
}
# 修改方式
PUT /index/_settings
{
    "settings":{
        "number_of_replicas": 3
    }
}

路由

ES 如何知道文檔屬於哪個分片的呢?創建時又是如何知道儲存在哪個分片上 ?透過以下算法決定

shard  hash(routing)%number_of_primary_shard
# routing 值適任一字串，默認 _id 但也可以自定義

如果主分片的數量在未來改變了，所有先前路由的值就失效，文檔也就永遠找不到了，因此該數量無法在定義後修改

Home

Blog

Project

Life

Kubernetes

Coding

Elasticsearch 筆記

ES 能做什麼

Elasticsearch 的交互方式

_cat

ElasticSearch 儲存

Index

_search

ES 數據結構

檢索

簡單檢索

全文檢索

搜索（模糊查詢）

聚合

搜索原理

正排索引和倒排索引

分詞

分詞機制

API _analyze

對索引字段進行分詞測試

自定義分詞器

IK 分詞

自定義分類器

document 操作

創建

取得

刪除

局部更新

批量插入 _bulk

檢索多 document

Mapping

獲取索引 mapping

自定義 mapping

Dynamic Mapping

Search API(URI)

查詢 q

完整查詢

泛查詢

指定字段

Group

profile

boolean 操作

範圍查詢

萬用字元表示查詢

正規表達式

模糊匹配

近似度查詢

search API(Request Body Search)

Match Query

Match phrase

字符查詢

不分詞查詢

Range 查詢

Bool 查詢

相關性算分

TF-IDE

BM25

ES 集群

集群分片

路由