An Analysis of Elasticsearch Relevance Scoring Algorithms

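The TF algorithm

TF stands for term frequency. As the name suggests, it scores a document by how often the query term occurs in the field being searched: the more often the term appears, the higher the score.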

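Examples:

Search for hello

Two documents: document A is "hello world!" and document B is "hello hello hello".

Document B scores higher than document A, because hello occurs three times in B and only once in A.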

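Search for hello world

Two documents: document A is "hello world!" and document B is "hello, are you ok?".

Document A scores higher than document B, because A matches both query terms while B matches only hello.

When a query contains several terms, the score takes all of them into account.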
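The two examples above can be reproduced with a small sketch that only counts raw occurrences. It is an illustration of the idea, not the formula Elasticsearch uses: the regex tokenizer is a simplified stand-in for a real analyzer, and the names term_frequencies and raw_tf_score are made up for this example.

```python
import re
from collections import Counter

def term_frequencies(text):
    """Lowercase the text, split it into word tokens and count each token."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)

def raw_tf_score(query, doc):
    """Sum the occurrence counts of every query term found in the document."""
    counts = term_frequencies(doc)
    return sum(counts[term] for term in term_frequencies(query))

# First example: search for "hello"
doc_a = "hello world!"
doc_b = "hello hello hello"
print(raw_tf_score("hello", doc_a), raw_tf_score("hello", doc_b))

# Second example: search for "hello world" (document B is a different text here)
doc_b2 = "hello,are you ok?"
print(raw_tf_score("hello world", doc_a), raw_tf_score("hello world", doc_b2))
```

In both cases the document with the larger count is the one the article ranks first.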

If you do not care how often a term occurs in a field, only whether it occurs at all, you can disable term frequencies in the field mapping:

{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "index_options": "docs"
      }
    }
  }
}

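Setting index_options to docs disables term frequencies and term positions. A field mapped this way does not count how many times a term occurs, and it cannot be used for phrase or proximity queries. Exact-value fields (keyword fields, known as not_analyzed string fields in older versions) use this setting by default.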

The IDF algorithm

IDF stands for inverse document frequency. The more documents in the index contain a query term, the less weight that term contributes to the score.

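Example:

Search for hello world, where hello appears in 1000 documents of the index and world appears in 100. Three documents: document A is "hello, are you ok?", document B is "The world is interesting!", and document C is "hello world!".

Result: C > B > A

Because hello is so common, a match on hello alone earns less score than a match on world alone.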
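To put numbers on this ordering, here is a rough sketch using the idf formula that Elasticsearch prints in its explain output, log(1 + (N - n + 0.5) / (n + 0.5)). The figures 1000 and 100 are treated as document counts, and the total of 10,000 documents is an assumption made only for this illustration, since the example does not state the index size. Term frequency and field length are ignored here.

```python
import math

def bm25_idf(n, N):
    """idf for a term found in n documents out of N documents in total."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

N = 10_000  # assumed index size, not given in the article

idf_hello = bm25_idf(1000, N)  # common term, lower weight
idf_world = bm25_idf(100, N)   # rarer term, higher weight

# Score each document by the idf of the query terms it contains.
score_a = idf_hello              # "hello, are you ok?"
score_b = idf_world              # "The world is interesting!"
score_c = idf_hello + idf_world  # "hello world!"

print(sorted({"A": score_a, "B": score_b, "C": score_c}.items(),
             key=lambda item: item[1], reverse=True))
```

The sorted list comes out in the order C, B, A, matching the result above.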

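The field-length norm algorithm

How long is the field?

The shorter the field, the higher its weight. A term that appears in a short field such as title is more relevant than the same term appearing in a long field such as body.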

Example:

Search for hello world! Two documents: document A is "hello world!" and document B is "hello world, I'm xxx!".

Result: A > B
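The effect of field length can be sketched with the tf formula that appears in Elasticsearch's explain output, freq / (freq + k1 * (1 - b + b * dl / avgdl)), where dl is the field length and avgdl the average field length. The token counts below (2 for document A, 4 for document B) and the resulting average of 3 are assumptions about how the two texts would be analyzed.

```python
def bm25_tf(freq, dl, avgdl, k1=1.2, b=0.75):
    """Length-normalized term frequency; k1 and b are the usual BM25 defaults."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

dl_a = 2  # "hello world!"          -> assumed 2 terms
dl_b = 4  # "hello world,I'm xxx!"  -> assumed 4 terms
avgdl = (dl_a + dl_b) / 2

# Each query term occurs once in both documents, so only the length differs.
tf_a = bm25_tf(freq=1, dl=dl_a, avgdl=avgdl)
tf_b = bm25_tf(freq=1, dl=dl_b, avgdl=avgdl)

print(tf_a > tf_b)
```

With the same frequency, the shorter field yields the larger tf component, which is why A ranks above B.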

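These three factors (term frequency, inverse document frequency, and the field-length norm) are calculated and stored at index time. Together they determine the weight of a single term in a particular document.

Queries usually contain more than one term, so a way of combining the weights of multiple terms is needed: the vector space model.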

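Combining the three algorithms

(What follows is a conceptual analysis; it is not how the score is actually computed.)

TF looks at how often the query term occurs in the field, IDF looks at how often the query term occurs across the whole index, and the field-length norm looks at the length of the field.

Seen this way, the field-length norm does not produce a score of its own, so it is applied last and conceptually acts like a divisor. TF and IDF sit at the same level: IDF determines a unit score for each query term, and TF adds up those unit scores over every occurrence of a query term in the document.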

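That gives the following calculation:

IDF: compute a unit score for each query term, for example hello=0.1 and world=0.2.
TF: compute sum(score) over the whole document; "hello world! I'm xxx." gives 0.1+0.2=0.3.
Field-length norm: divide sum(score) by the length of the field; the result is the score.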
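The three steps can be written out as a toy function. This follows the conceptual model only and is not how Elasticsearch computes scores; the function name toy_score and the token list (4 terms for the sample text) are assumptions for the illustration.

```python
import math

def toy_score(doc_tokens, unit_scores):
    """Sum the unit score of every token occurrence, then divide by field length."""
    total = sum(unit_scores.get(token, 0.0) for token in doc_tokens)  # IDF and TF steps
    return total / len(doc_tokens)                                    # field-length norm step

unit_scores = {"hello": 0.1, "world": 0.2}   # unit scores from the IDF step
doc_tokens = ["hello", "world", "i'm", "xxx"]  # assumed tokens of "hello world!I'm xxx."

score = toy_score(doc_tokens, unit_scores)
print(round(score, 3))  # (0.1 + 0.2) / 4
```

A shorter document with the same matches, such as ["hello", "world"], would divide the same sum by 2 and score higher.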

Analyzing score calculation with the explain API

Creating sample data

PUT /test-7

{
  "settings": {
    "index": {
      "number_of_shards": 3,
      "number_of_replicas": 1
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      }
    }
  }
}

PUT /test-7/_doc/1

{
    "name": "li feng"
}

PUT /test-7/_doc/2

{
    "name": "li er"
}

Analysis with explain

GET /test-7/_search?explain=true

{
    "query": {
        "match": {
            "name": "li"
        }
    }
}

Response:

{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 3,
        "successful": 3,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 2,
            "relation": "eq"
        },
        "max_score": 0.2876821,
        "hits": [
            {
                "_shard": "[test-7][1]",
                "_node": "DpJZ5rhrStKpiur5hZ_ilw",
                "_index": "test-7",
                "_type": "_doc",
                "_id": "2",
                "_score": 0.2876821,
                "_source": {
                    "name": "li er"
                },
                // the score comes first
                "_explanation": {
                    "value": 0.2876821,

                    // what the score is made of; details breaks it down
                    "description": "weight(name:li in 0) [PerFieldSimilarity], result of:",

                    // explanation of the score
                    "details": [
                        {
                            "value": 0.2876821,
                            "description": "score(freq=1.0), computed as boost * idf * tf from:",
                            "details": [
                                {
                                    "value": 2.2,
                                    "description": "boost",
                                    "details": []
                                },
                                {
                                    "value": 0.2876821,
                                    "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                                    // inverse document frequency (idf) calculation
                                    "details": [
                                        {
                                            "value": 1,
                                            // n: the number of documents in the current shard that contain the term
                                            "description": "n, number of documents containing term",
                                            "details": []
                                        },
                                        {
                                            "value": 1,
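                                            // N: the number of documents in the current shard that have this field. With multiple shards the index's documents are spread across them, so the per-shard counts add up to the total document count of the index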
                                            "description": "N, total number of documents with field",
                                            "details": []
                                        }
                                    ]
                                },
                                {
                                    "value": 0.45454544,
                                    "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                                    // term frequency (tf) calculation
                                    "details": [
                                        {
                                            "value": 1.0,
                                            // freq: how many times the search term occurs in the searched field of this document; here li occurs once in name, so the value is 1
                                            "description": "freq, occurrences of term within document",
                                            "details": []
                                        },
                                        {
                                            "value": 1.2,
                                            // k1: the term-frequency saturation parameter, 1.2 by default
                                            "description": "k1, term saturation parameter",
                                            "details": []
                                        },
                                        {
                                            "value": 0.75,
                                            // b: the length normalization parameter, 0.75 by default
                                            "description": "b, length normalization parameter",
                                            "details": []
                                        },
                                        {
                                            "value": 2.0,
                                            // dl: the length of the searched field in this document, counted in terms after analysis
                                            "description": "dl, length of field",
                                            "details": []
                                        },
                                        {
                                            "value": 2.0,
                                            // avgdl: the average number of terms in the searched field across the shard
                                            "description": "avgdl, average length of field",
                                            "details": []
                                        }
                                    ]
                                }
                            ]
                        }
                    ]
                }
            },
            {
                "_shard": "[test-7][2]",
                "_node": "DpJZ5rhrStKpiur5hZ_ilw",
                "_index": "test-7",
                "_type": "_doc",
                "_id": "1",
                "_score": 0.2876821,
                "_source": {
                    "name": "li feng"
                },
                "_explanation": {
                    "value": 0.2876821,
                    "description": "weight(name:li in 0) [PerFieldSimilarity], result of:",
                    "details": [
                        {
                            "value": 0.2876821,
                            "description": "score(freq=1.0), computed as boost * idf * tf from:",
                            "details": [
                                {
                                    "value": 2.2,
                                    "description": "boost",
                                    "details": []
                                },
                                {
                                    "value": 0.2876821,
                                    "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                                    "details": [
                                        {
                                            "value": 1,
                                            "description": "n, number of documents containing term",
                                            "details": []
                                        },
                                        {
                                            "value": 1,
                                            "description": "N, total number of documents with field",
                                            "details": []
                                        }
                                    ]
                                },
                                {
                                    "value": 0.45454544,
                                    "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                                    "details": [
                                        {
                                            "value": 1.0,
                                            "description": "freq, occurrences of term within document",
                                            "details": []
                                        },
                                        {
                                            "value": 1.2,
                                            "description": "k1, term saturation parameter",
                                            "details": []
                                        },
                                        {
                                            "value": 0.75,
                                            "description": "b, length normalization parameter",
                                            "details": []
                                        },
                                        {
                                            "value": 2.0,
                                            "description": "dl, length of field",
                                            "details": []
                                        },
                                        {
                                            "value": 2.0,
                                            "description": "avgdl, average length of field",
                                            "details": []
                                        }
                                    ]
                                }
                            ]
                        }
                    ]
                }
            }
        ]
    }
}
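The numbers in this response can be checked by hand. The sketch below plugs the values reported in the explain output (boost, n, N, freq, k1, b, dl, avgdl) into the two formulas quoted in the description fields and multiplies the parts together. Python uses double precision while the response shows single-precision floats, so the last digits may differ slightly.

```python
import math

def bm25_idf(n, N):
    """idf, computed as log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

def bm25_tf(freq, k1, b, dl, avgdl):
    """tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

# Values copied from the explain output for document 2 ("li er").
boost = 2.2
idf = bm25_idf(n=1, N=1)
tf = bm25_tf(freq=1.0, k1=1.2, b=0.75, dl=2.0, avgdl=2.0)

score = boost * idf * tf
print(idf, tf, score)
```

Both hits live on different shards that each hold a single document, so n, N, dl and avgdl are identical for them and the two scores come out equal.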

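The output above also contains a boost value, which deserves an explanation.

boost is the weight applied to each term. The 2.2 shown here is the default query boost of 1.0 multiplied by (k1 + 1), with k1 = 1.2. A boost can be specified per field in the mapping when the index is created, although recent Elasticsearch versions deprecate mapping-level boost in favor of setting it at query time. More commonly, boost is used to tune search results: for example, give the title field a larger boost and the body field a smaller one, so that matches in different fields contribute differently to the score and the ranking can be adjusted dynamically.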
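As a sketch of query-time boosting, the helper below builds a multi_match request body that uses the field^boost syntax. The helper name boosted_multi_match and the fields title and body are hypothetical; they do not exist in the test-7 index above, and the boost values are arbitrary.

```python
import json

def boosted_multi_match(text, field_boosts):
    """Build a multi_match query body with a per-field boost (field^boost)."""
    fields = [f"{field}^{boost}" for field, boost in field_boosts.items()]
    return {"query": {"multi_match": {"query": text, "fields": fields}}}

# Hypothetical fields: matches in title count three times as much as in body.
body = boosted_multi_match("hello world", {"title": 3, "body": 1})
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a _search request; changing the numbers changes the ranking without reindexing.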