Elasticsearch 筆記

June 12, 2020

特點

透過簡單的 RESTful API 來隱藏 Lucene 的複雜性,讓搜索變簡單

ES 能做什麼

Elasticsearch 的交互方式

_cat
$ curl 'http://192.168.227.141:9200/_cat'
^.^
/_cat/allocation
/_cat/shards
/_cat/shards/{index}
/_cat/master
/_cat/nodes
/_cat/tasks
/_cat/indices
/_cat/indices/{index}
/_cat/segments
/_cat/segments/{index}
/_cat/count
/_cat/count/{index}
/_cat/recovery
/_cat/recovery/{index}
/_cat/health
/_cat/pending_tasks
/_cat/aliases
/_cat/aliases/{alias}
/_cat/thread_pool
/_cat/thread_pool/{thread_pools}
/_cat/plugins
/_cat/fielddata
/_cat/fielddata/{fields}
/_cat/nodeattrs
/_cat/repositories
/_cat/snapshots/{repository}
/_cat/templates
$ curl 'http://192.168.227.141:9200/_cat/health'
1595807442 23:50:42 docker-cluster green 3 2 4 2 0 0 0 0 - 100.0%
$ curl 'http://192.168.227.141:9200/_cat/nodes'
172.18.0.4 25 85 2 0.18 0.53 0.58 mr  * es-master
172.18.0.2 12 85 2 0.18 0.53 0.58 dmr - es-node2
172.18.0.3  8 85 2 0.18 0.53 0.58 dmr - es-node3

ElasticSearch 儲存

  1. 面向文檔 表示可以儲存整個對象或 document。並還可以索引(index)每個文檔都可搜索。在 ES 中,可以對文檔進行索引、搜索、排序、過濾等。 傳統資料庫以一張表來存儲,並用 SQL 指令進行查詢 在 ES 中每一條數據為一個 document
  2. JSON 從上面圖中的 ES 來說,每一條 document 會以 JSON 進行存儲
Index

lab702/doc/1 –> index/type/id

$ curl -XPOST 'http://192.168.227.141:9200/lab702/doc/1' -i -H "Content-Type:application/json" -d '{"name":"itacho", "age": "26", "sex":"boy"}'
HTTP/1.1 201 Created
Location: /lab702/doc/1
Warning: 299 Elasticsearch-7.7.0-81a1e9eda8e6183f5237786246f6dced26a10eaf "[types removal] Specifying types in document index requests is deprecated, use the typeless endpoints instead (/{index}/_doc/{id}, /{index}/_doc, or /{index}/_create/{id})."
content-type: application/json; charsetUTF-8
content-length: 153

{"_index":"lab702","_type":"doc","_id":"1","_version":1,"result":"created","_shards":{"total":2,"successful":2,"failed":0},"_seq_no":0,"_primary_term":1}
$ curl 'http://192.168.227.141:9200/lab702/_search' -i -H "Content-Type:application/json" -d '{"query":{"match_all":{}}}'
HTTP/1.1 200 OK
content-type: application/json; charsetUTF-8
content-length: 269

{"took":2,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":1,"relation":"eq"},"max_score":1.0,"hits":[{"_index":"lab702","_type":"doc","_id":"1","_score":1.0,"_source":{"name":"itacho", "age": "26", "sex":"boy"}}]}}

ES 數據結構

在資料庫中,會有 database 在定義 table 在定義屬性最會插入數據。 而在 ES 中一個 index 相當於一個 database。結構如下圖,其中 “id” 和 “name” 可以稱為 field

$ curl 'http://192.168.227.141:9200/lab702/doc/1?prettytrue'
{
  "_index" : "lab702", // 索引
  "_type" : "doc", // 類型
  "_id" : "1", // document 唯一 ID
  "_version" : 1, // 更新時會遞增,document 只要發生變化都會使此關鍵字遞增,確保程序的一部分不會覆蓋另一部分所做的更改
  "_seq_no" : 0, 
  "_primary_term" : 1,
  "found" : true,
  "_source" : { // 查詢數據
    "name" : "itacho",
    "age" : "26",
    "sex" : "boy"
  }
}

詳細介紹可查看此篇

檢索

http://192.168.227.141:9200/lab702/doc/1 該語句為檢索在 lab702 index 中的 doc type 且編號為 1 的數據,相當與 SQL 指令 SELECT * FROM table WHERE id1

簡單檢索

MySQL:SELECT * FROM table ES:GET /index/type/_search

$ curl 'http://192.168.227.141:9200/lab702/doc/_search?prettytrue'
{
  "took" : 418, # 查詢時間
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : { # 查詢內容
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "lab702",
        "_type" : "doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "name" : "itacho",
          "age" : "26",
          "sex" : "boy",
          "about" : "I Love MLB",
          "interests" : [
            "baseball",
            "cooking",
            "music"
          ]
        }
      },
      {
        "_index" : "lab702",
        "_type" : "doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "name" : "naruto",
          "age" : "18",
          "sex" : "boy",
          "about" : "I Love NBA",
          "interests" : [
            "cooking",
            "music"
          ]
        }
      },
      {
        "_index" : "lab702",
        "_type" : "doc",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "name" : "madara",
          "age" : "82",
          "sex" : "boy",
          "about" : "I Hate world",
          "interests" : [
            "running",
            "LOL"
          ]
        }
      }
    ]
  }
}

回應內容有被匹配到的 document,在此範例中是將所有 document 取出

全文檢索

此範例是找出 field 值有 madara 的相關 document。

$ curl 'http://192.168.227.141:9200/lab702/doc/_search?qmadara&prettytrue'
{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.9808291,
    "hits" : [
      {
        "_index" : "lab702",
        "_type" : "doc",
        "_id" : "3",
        "_score" : 0.9808291,
        "_source" : {
          "name" : "madara",
          "age" : "82",
          "sex" : "boy",
          "about" : "I Hate world",
          "interests" : [
            "running",
            "LOL"
          ]
        }
      }
    ]
  }
}
搜索(模糊查詢)

查詢所有 document 中 field 值分詞包含 love 的 document

$ curl 'http://192.168.227.141:9200/lab702/doc/_search?qlove&prettytrue'
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.4700036,
    "hits" : [
      {
        "_index" : "lab702",
        "_type" : "doc",
        "_id" : "1",
        "_score" : 0.4700036,
        "_source" : {
          "name" : "itacho",
          "age" : "26",
          "sex" : "boy",
          "about" : "I Love MLB",
          "interests" : [
            "baseball",
            "cooking",
            "music"
          ]
        }
      },
      {
        "_index" : "lab702",
        "_type" : "doc",
        "_id" : "2",
        "_score" : 0.4700036,
        "_source" : {
          "name" : "naruto",
          "age" : "18",
          "sex" : "boy",
          "about" : "I Love NBA",
          "interests" : [
            "cooking",
            "music"
          ]
        }
      }
    ]
  }
}
聚合

MySQL:Group By ES:aggs

ES 預設將聚合禁止,用以下方式將聚合的 field 添加優化

$ curl -X PUT "localhost:9200/lab702/_mapping?pretty" -H 'Content-Type: application/json' -d'
{
  "properties": {
    "interests": {
      "type":     "text",
      "fielddata": true
    }
  }
}
'
{
  "acknowledged" : true
}
$ curl -XPOST 'http://192.168.227.141:9200/lab702/doc/_search?prettytrue' -i -H "Content-Type:application/json" -d '{"aggs":{"all_inter":{"terms":{"field":"interests"}}}}'
HTTP/1.1 200 OK
Warning: 299 Elasticsearch-7.7.0-81a1e9eda8e6183f5237786246f6dced26a10eaf "[types removal] Specifying types in search requests is deprecated."
content-type: application/json; charsetUTF-8
content-length: 1839

{
  "took" : 29,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "lab702",
        "_type" : "doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "name" : "itacho",
          "age" : "26",
          "sex" : "boy",
          "about" : "I Love MLB",
          "interests" : [
            "baseball",
            "cooking",
            "music"
          ]
        }
      },
      {
        "_index" : "lab702",
        "_type" : "doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "name" : "naruto",
          "age" : "18",
          "sex" : "boy",
          "about" : "I Love NBA",
          "interests" : [
            "cooking",
            "music"
          ]
        }
      },
      {
        "_index" : "lab702",
        "_type" : "doc",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "name" : "madara",
          "age" : "82",
          "sex" : "boy",
          "about" : "I Hate world",
          "interests" : [
            "running",
            "LOL"
          ]
        }
      }
    ]
  },
  "aggregations" : {
    "all_inter" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "cooking",
          "doc_count" : 2
        },
        {
          "key" : "music",
          "doc_count" : 2
        },
        {
          "key" : "baseball",
          "doc_count" : 1
        },
        {
          "key" : "lol",
          "doc_count" : 1
        },
        {
          "key" : "running",
          "doc_count" : 1
        }
      ]
    }
  }
}

搜索原理

正排索引和倒排索引

正排索引

IDContent
1火影行動
2探索火影行動
3火影特別行動
4火影紀錄篇
5特工火影特別探索

正排索引

IDContent
1火影行動
2探索火影行動
3火影特別行動
4火影紀錄篇
5特工火影特別探索

Posting List,以火影為例,並用2個詞作為分詞

IDTFPositionOffset
110<0,1>
211<2,3>
310<0,1>
410<0,1>
511<2,3>

ES 維護的倒排索引表

紀錄
火影1,2,3,4,5
行動1,2,3
探索2,5
特別3,5
記錄篇4
特工5

分詞

以"特工火影特別探索"為例,可分成"特工"、“火影”、“特別探索”、“探索”

分詞機制

API _analyze
$ curl -X GET "localhost:9200/_analyze?prettytrue" -H 'Content-Type: application/json' -d'{"analyzer" : "standard","text" : "Hello {orld"}'
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "world",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
 # standard 不支持中文
對索引字段進行分詞測試
$ curl -X GET "localhost:9200/lab702/_analyze?prettytrue" -H 'Content-Type: application/json' -d'{"field" : "about","text" : "Hello world!! I Love JAVA"}'
{
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "world",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "i",
      "start_offset" : 14,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "love",
      "start_offset" : 16,
      "end_offset" : 20,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "java",
      "start_offset" : 21,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 4
    }
  ]
}

自定義分詞器
$ curl -X GET "localhost:9200/_analyze?prettytrue" -H 'Content-Type: application/json' -d'{"tokenizer" : "standard","filter" : ["uppercase"], "text" : "Hello JAVA"}'
{
  "tokens" : [
    {
      "token" : "HELLO",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "JAVA",
      "start_offset" : 6,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}

IK 分詞

在 docker-compose 中的 elasticsearch 服務新增一個掛載 plugins 的 volume,之後的套件透過本機的資料夾就可以連接到該 elasticsearch 的容器中。

$ curl -X GET "localhost:9200/_analyze?prettytrue" -H 'Content-Type: application/json' -d'{"analyzer" : "ik_smart","text" : "火影忍者它有夠難看"}'
{
  "tokens" : [
    {
      "token" : "火影忍者",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "它有",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "夠",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "難",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "看",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "CN_CHAR",
      "position" : 4
    }
  ]
}
# 相比 standard 來得更好些
$ curl -X GET "localhost:9200/_analyze?prettytrue" -H 'Content-Type: application/json' -d'{"analyzer" : "ik_max_word","text" : "火影忍者它有夠難看"}'
{
  "tokens" : [
    {
      "token" : "火影忍者",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "火影",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "忍者",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "它有",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "夠",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "難",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "CN_CHAR",
      "position" : 5
    },
    {
      "token" : "看",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "CN_CHAR",
      "position" : 6
    }
  ]
}

自定義分類器

$ curl -X GET "localhost:9200/_analyze?prettytrue" -H 'Content-Type: application/json' -d'{"tokenizer": "keyword", "char_filter": ["html_strip"], "text": "<div><b>火影忍者</b></div>"}'
{
  "tokens" : [
    {
      "token" : "\n火影忍者\n",
      "start_offset" : 0,
      "end_offset" : 22,
      "type" : "word",
      "position" : 0
    }
  ]
}

document 操作

創建
  1. 索引一個 document document 透過 index API 被索引。使數據能夠被儲存與搜索。但一開始須先知道 document,而 document 透過其 _index_type_id 這些唯一值確定。我們可以自己決定一個_id,或者使用 index API 自動生成。
$ curl -X POST "localhost:9200/test/doc/1?prettytrue" -H 'Content-Type: application/json' -d'{"text": "<div><b>火影忍者</b></div>"}'
{
  "_index" : "test",
  "_type" : "doc",
  "_id" : "1",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}
# localhost:9200/index/type/id
# id 不指定會自動生成
取得

同樣使用 _index_type_id,這邊就用 GET 去獲取。當不指定 id 時會獲取全部。

$ curl -X GET "localhost:9200/test/doc/1?prettytrue"
{
  "_index" : "test",
  "_type" : "doc",
  "_id" : "1",
  "_version" : 1,
  "_seq_no" : 0,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "text" : "<div><b>火影忍者</b></div>"
  }
}

$ curl -X GET "localhost:9200/lab702/doc/_search?prettytrue"
# localhost:9200/index/type/_search
#  _search API
$ curl -X GET "localhost:9200/lab702/doc/2?_sourcename,about&prettytrue"
{
  "_index" : "lab702",
  "_type" : "doc",
  "_id" : "2",
  "_version" : 1,
  "_seq_no" : 2,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "name" : "naruto",
    "about" : "I Love NBA"
  }
}

只想得到 _source 字段而不要其它數據,可以如下請求

$ curl -X GET "localhost:9200/lab702/doc/2/_source?prettytrue"
{
  "name" : "naruto",
  "age" : "18",
  "sex" : "boy",
  "about" : "I Love NBA",
  "interests" : [
    "cooking",
    "music"
  ]
}
$ curl -X GET "localhost:9200/lab702/doc/2/_source?_sourcename,about&prettytrue"
{
  "name" : "naruto",
  "about" : "I Love NBA"
}

刪除

使用 DELETE,並指定要刪除的 _index

$ curl -X DELETE "localhost:9200/test/doc/1/?prettytrue"
{
  "_index" : "test",
  "_type" : "doc",
  "_id" : "1",
  "_version" : 2,
  "result" : "deleted",
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "_seq_no" : 1,
  "_primary_term" : 1
}
# 在搜尋刪除 id 驗證
$ curl -X GET "localhost:9200/test/doc/1?prettytrue"
{
  "_index" : "test",
  "_type" : "doc",
  "_id" : "1",
  "found" : false
}
# test 的索引還在
$ curl -XGET 127.0.0.1:9200/_cat/indices
green open lab702    gm1but48QvSphA3-zCTtmw 1 1  3 0 12.8kb  6.4kb
green open test      JkovYmndSLumkIT2v5cxbA 1 1  1 0  7.4kb  3.7kb
green open cch       4wwHrlA_TwG2REimQGzdrw 1 1  1 0  8.3kb  4.1kb
green open .kibana_1 M2_Fwd1TSqemehM846kb0Q 1 1 52 1 91.4kb 30.2kb

局部更新

_update 關鍵字,但通常覆蓋即可

批量插入 _bulk

每個 json 之間不能有換行,當資料量很大時很適合此方式,可借由 split 指令去切檔案大小在用 _bulk 上傳至 ES 中。

方式有以下

# 官方範例
curl -X POST "localhost:9200/_bulk?pretty" -H 'Content-Type: application/json' -d'
{ "index" : { "_index" : "test", "_id" : "1" } } # 指定 index 與 id
{ "field1" : "value1" }
{ "delete" : { "_index" : "test", "_id" : "2" } }
{ "create" : { "_index" : "test", "_id" : "3" } }
{ "field1" : "value3" }
{ "update" : {"_id" : "1", "_index" : "test"} }
{ "doc" : {"field2" : "value2"} }
'

$ curl -X GET "localhost:9200/test/doc/1?prettytrue"
{
  "_index" : "test",
  "_type" : "doc",
  "_id" : "1",
  "_version" : 2, # 先前有定義過此 id,這次將它覆蓋
  "_seq_no" : 5,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "field1" : "value1",
    "field2" : "value2"
  }
}

$ curl -X POST "localhost:9200/index/doc/_bulk" -s -H "Content-Type: application/x-ndjson"  --data-binary "@file.json"
檢索多 document

合併多個請求可避免每個請求單獨的網路開銷。使用方式是在請求中使用 multi-getmget API

mget API 是一個 docs 的數組,數組中每個節點定義一個 document 的_index_type_id 數據,其中也可以搭配 _source 參數。

$ curl -X GET "localhost:9200/_mget?prettytrue" -H 'Content-Type: application/json' -d'{"docs":[{"_index": "test", "_type": "doc", "_id": 3},{"_index": "lab702", "_type": "doc", "_id": 1, "_source":"about"}]}'
{
  "docs" : [
    {
      "_index" : "test",
      "_type" : "doc",
      "_id" : "3",
      "_version" : 1,
      "_seq_no" : 4,
      "_primary_term" : 1,
      "found" : true,
      "_source" : {
        "field1" : "value3"
      }
    },
    {
      "_index" : "lab702",
      "_type" : "doc",
      "_id" : "1",
      "_version" : 2,
      "_seq_no" : 1,
      "_primary_term" : 1,
      "found" : true,
      "_source" : {
        "about" : "I Love MLB"
      }
    }
  ]
}
# 回應時也是一個 docs 的數組,每個 document 包含一個回應,且按照順序。而此效果與使用 GET 方式是回應相同內容

檢所同一 _index 中或 _type,可在請求時定義 _index_index/_type

$ curl -X GET "localhost:9200/lab702/doc/_mget?pretty" -H 'Content-Type: application/json' -d'{ "ids":["3", "1"]}'
{
  "docs" : [
    {
      "_index" : "lab702",
      "_type" : "doc",
      "_id" : "3",
      "_version" : 1,
      "_seq_no" : 3,
      "_primary_term" : 1,
      "found" : true,
      "_source" : {
        "name" : "madara",
        "age" : "82",
        "sex" : "boy",
        "about" : "I Hate world",
        "interests" : [
          "running",
          "LOL"
        ]
      }
    },
    {
      "_index" : "lab702",
      "_type" : "doc",
      "_id" : "1",
      "_version" : 2,
      "_seq_no" : 1,
      "_primary_term" : 1,
      "found" : true,
      "_source" : {
        "name" : "itacho",
        "age" : "26",
        "sex" : "boy",
        "about" : "I Love MLB",
        "interests" : [
          "baseball",
          "cooking",
          "music"
        ]
      }
    }
  ]
}

然而,當請求的資訊有錯誤,其也會反映在回應內容中。

Mapping

定義資料庫中的表的結構的定義,透過 mapping 來控制索引儲存數據的設置

獲取索引 mapping
$ curl -X GET "localhost:9200/lab702/_mapping?pretty"
{
  "lab702" : { # index 名稱
    "mappings" : {
      "properties" : {
        "about" : { # field
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "age" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "interests" : {
          "type" : "text",
          "fields" : { # 子字段
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          },
          "fielddata" : true # 可做聚合操作
        },
        "name" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "sex" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}

keyword

fielddata

自定義 mapping

custom mapping

$ curl -X PUT "localhost:9200/custom-index?pretty" -H 'Content-Type: application/json' -d'
> {
>   "mappings": {
>     "properties": {
>       "title":    { "type": "text" },
>       "name":  { "type": "keyword"  },
>       "age":   { "type": "integer"  }
>     }
>   }
> }
> '
{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "custom-index"
}

$ curl -X POST  "localhost:9200/custom-index/_doc?pretty" -H 'Content-Type: application/json' -d'{"age": 22, "name": "naruto", "title": "Itachi"}'
{
  "_index" : "custom-index",
  "_type" : "_doc",
  "_id" : "N8LTonMBzMtUWLBQRrNR",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}
# 下面多 sex field,它會透過 dynamic Mapping 自動識別
$ curl -X POST  "localhost:9200/custom-index/_doc?pretty" -H 'Content-Type: application/json' -d'{"sex": "boy", "age": 22, "name": "naruto", "title": "Itachi"}'
{
  "_index" : "custom-index",
  "_type" : "_doc",
  "_id" : "OMLUonMBzMtUWLBQD7PN",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "_seq_no" : 1,
  "_primary_term" : 1
}

mapping 只可新增,無法修改。

_doc 相關資訊

Dynamic Mapping

ES 依靠 json document中 field 來實現自動識別字段 dynamic-field-mapping

Search API(URI)

GET /_search # 查詢所有 document
GET /index/_search # 查詢指定索引 document
GET /index, index2/_search # 查詢多索引 document
GET /index_*/_search # 正規表示過濾索引並查詢
查詢 q
$ curl -X GET "localhost:9200/packets/_doc/_search?qlayers.ip.ip_ip_ttl:51&pretty"
完整查詢
$ curl -X GET "localhost:9200/packets/_search?q6158&dflayers.udp.udp_udp_dstport&from0&size3&pretty" 
# 從 udp_udp_dstport 字段找 6158,其中顯示分頁 0 前 3 個資訊
$ curl -X GET "localhost:9200/packets/_search?q6158 
# q 為詞語,term
# 當 qlove you,則會做分詞查詢,phrase
泛查詢
指定字段
Group
$ curl -X GET "localhost:9200/packets/_search?qjob:(JAVA and K8s)
profile
$ curl -X GET "localhost:9200/packets/_search?q6158&dflayers.udp.udp_udp_dstport&from0&size1&pretty" -H 'Content-Type: application/json' -d'{"profile": true}'
...
"profile" : {
    "shards" : [
      {
        "id" : "[z2HVKUcpTzed_-VfHGlLSA][packets][0]",
        "searches" : [
          {
            "query" : [
              {
                "type" : "TermQuery",
                "description" : "layers.udp.udp_udp_dstport:6158",
                "time_in_nanos" : 1239816,
                "breakdown" : {
                  "set_min_competitive_score_count" : 8,
                  "match_count" : 0,
                  "shallow_advance_count" : 0,
                  "set_min_competitive_score" : 5643,
                  "next_doc" : 17092,
                  "match" : 0,
                  "next_doc_count" : 10,
                  "score_count" : 10,
                  "compute_max_score_count" : 0,
                  "compute_max_score" : 0,
                  "advance" : 14305,
                  "advance_count" : 4,
                  "score" : 43095,
                  "build_scorer_count" : 17,
                  "create_weight" : 276371,
                  "shallow_advance" : 0,
                  "create_weight_count" : 1,
                  "build_scorer" : 883260
                }
              }
            ],
            "rewrite_time" : 4903,
            "collector" : [
              {
                "name" : "SimpleTopScoreDocCollector",
                "reason" : "search_top_hits",
                "time_in_nanos" : 83802
              }
            ]
          }
        ],
        "aggregations" : [ ]
      }
    ]
...
boolean 操作
範圍查詢
age:[1 TO 10] # 1< age < 10
age:[1 TO 10} # 1< age < 10
age:[1 TO ] # 1< age 
age:[* TO 10] # age < 10
age:>1
age:(>1 && <10)
萬用字元表示查詢
正規表達式
field:/[mb]oat/
模糊匹配
field:roam~1 # 匹配與 roam 差一個字元的詞,foam, roams ...
近似度查詢
"fpx quick"~5
Match Query

對字段做全文搜索,但不支持多字段查詢。

$ curl -X GET "localhost:9200/packets/_search?pretty" -H 'Content-Type: application/json' -d'{"query":{"match":{"layers.eth.eth_eth_addr_oui_resolved": "Cisco Systems, Inc"}}}' # term 方式
$ curl -X GET "localhost:9200/packets/_search?pretty" -H 'Content-Type: application/json' -d'{
    "query":{
        "match":{
            "layers.eth.eth_eth_addr_oui_resolved": 
                "query": "Cisco Systems, Inc",
                "operator": "and"
        }
    }
}' 
Match phrase
$ curl -X GET "localhost:9200/packets/_search?pretty" -H 'Content-Type: application/json' -d'{"query":{"match_phrase":{"layers.eth.eth_eth_addr_oui_resolved":"Cisco Systems, Inc"}}}'
$ curl -X GET "localhost:9200/packets/_search?pretty" -H 'Content-Type: application/json' -d'{
    "query":{
        "match_phrase":{
            "layers.eth.eth_eth_addr_oui_resolved": 
                "query": "Cisco Systems, Inc",
                "slop": 1 # 允許有幾個詞的差異
        }
    }
}' 
字符查詢
$ curl -X GET "localhost:9200/packets/_search?pretty" -H 'Content-Type: application/json' -d'{"query":{"query_string":{"fields" :["layers.eth.eth_eth_addr_oui_resolved"], "query": "(Cisco OR Dell)"}}}' 

# fields 可以多字段
不分詞查詢
curl -X GET "localhost:9200/packets/_search?pretty" -H 'Content-Type: application/json' -d'{"query":{"term":{"layers.eth.eth_eth_addr_oui_resolved":{"value": "Cisco Systems, Inc"}}}}' # 不分詞

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

使用 Cisco Systems, Inc 匹配該字段值,然而因為分詞導致無法匹配成功。

預設情況下該字段值為 Cisco Systems, Inc 將被分成 Cisco Systems Inc 因為這樣,導致每個分詞不等於 Cisco Systems, Inc

terms查詢

Range 查詢

Range 查詢

Bool 查詢
欄位描述
filter只過濾符合條件的 document,不計算相關性得分
mustdocument 必須符合 must 中的所有條件,會影響相關性得分
must_notdocument 必須不符合 must_not 中所有條件
shoulddocument 可以符合 should 中的條件,會影響相關性得分

Bool 查詢

相關性算分

指 document 查詢據間的相關度,透過到排索引可以獲取與查詢語句相匹配的 document 列表。

影響參數

  1. TF:詞頻,即單詞在 document 中出現的次數,詞頻越高相關度越高
  2. DF:document 詞頻,即單詞出現的 document 數
  3. IDF:與 document 詞頻相反,即 1/DF。即單詞出現的 document數越少,相關度越高(如果一個單詞在 document集出現越少,算為越重要單詞)
  4. Firld-length Norm:document 越短,相關度越高
TF-IDE
BM25

ES 集群

集群分片

  1. Index
  1. Type
  1. Document
# 預設
PUT /index
{
    "settings":{
        "number_of_shards": 3,
        "number_of_replicas": 1
    }
}
# 修改方式
PUT /index/_settings
{
    "settings":{
        "number_of_replicas": 3
    }
}

路由

ES 如何知道文檔屬於哪個分片的呢?創建時又是如何知道儲存在哪個分片上 ?透過以下算法決定

shard  hash(routing)%number_of_primary_shard
# routing 值適任一字串,默認 _id 但也可以自定義

如果主分片的數量在未來改變了,所有先前路由的值就失效,文檔也就永遠找不到了,因此該數量無法在定義後修改