Notes on running into too_many_clauses in Elasticsearch

These are notes from investigating a too_many_clauses error that Elasticsearch returned when searching under certain conditions.

Assume the index definition and data look like this:

$ curl "localhost:9200/tests?pretty"
{
  "tests" : {
    "aliases" : { },
    "mappings" : {
      "test" : {
        "properties" : {
          "title" : {
            "type" : "text"
          }
        }
      }
    },
    "settings" : {
      "index" : {
        "number_of_shards" : "5",
        "provided_name" : "tests",
        "creation_date" : "1508801834006",
        "analysis" : {
          "filter" : {
            "ngram" : {
              "type" : "nGram",
              "min_gram" : "3",
              "max_gram" : "25"
            }
          },
          "analyzer" : {
            "default" : {
              "filter" : [
                "ngram"
              ],
              "type" : "custom",
              "tokenizer" : "keyword"
            }
          }
        },
        "number_of_replicas" : "1",
        "uuid" : "Fsi43Wx2S8uyyf8wtn59Xg",
        "version" : {
          "created" : "5060199"
        }
      }
    }
  }
}
$ curl "localhost:9200/tests/_search?pretty"
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "tests",
        "_type" : "test",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "title" : "fugafuga"
        }
      },
      {
        "_index" : "tests",
        "_type" : "test",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "title" : "hogehoge"
        }
      },
      {
        "_index" : "tests",
        "_type" : "test",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "title" : "foobar"
        }
      }
    ]
  }
}

Doing everything with curl alone is tedious, so I use elasticsearch-ruby as the Elasticsearch client.
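
The original notes do not show how the index was created, so the following is only a reconstruction from the curl output above: a minimal sketch that builds the same settings and mapping and indexes the three sample documents. The helper names `tests_index_body` and `setup_tests_index` are made up for this example, and the client URL in the usage comment is an assumption.

```ruby
# Index body mirroring the settings and mapping shown in the curl output.
def tests_index_body
  {
    settings: {
      number_of_shards: 5,
      number_of_replicas: 1,
      analysis: {
        filter: {
          ngram: { type: 'nGram', min_gram: 3, max_gram: 25 }
        },
        analyzer: {
          # The default analyzer keeps the whole input as one token
          # (keyword tokenizer) and then expands it into n-grams.
          default: { type: 'custom', tokenizer: 'keyword', filter: ['ngram'] }
        }
      }
    },
    mappings: {
      test: { properties: { title: { type: 'text' } } }
    }
  }
end

# Creates the index and the three sample documents.
# `client` is expected to be an Elasticsearch::Client.
def setup_tests_index(client)
  client.indices.create index: 'tests', body: tests_index_body
  { 1 => 'hogehoge', 2 => 'fugafuga', 3 => 'foobar' }.each do |id, title|
    client.index index: 'tests', type: 'test', id: id, body: { title: title }
  end
  client.indices.refresh index: 'tests'
end

# Usage (needs the elasticsearch gem and a node on localhost:9200):
#   require 'elasticsearch'
#   client = Elasticsearch::Client.new(url: 'http://localhost:9200')
#   setup_tests_index(client)
```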

First, confirm that an ordinary search works with the index definition above.

[29] pry(main)> client.search index: "tests",
[29] pry(main)* body: {
[29] pry(main)*   query: {
[29] pry(main)*     match: { title: "hoge" }
[29] pry(main)*   }
[29] pry(main)* }
=> {"took"=>2,
 "timed_out"=>false,
 "_shards"=>{"total"=>5, "successful"=>5, "skipped"=>0, "failed"=>0},
 "hits"=>{"total"=>1, "max_score"=>0.59868973, "hits"=>[{"_index"=>"tests", "_type"=>"test", "_id"=>"1", "_score"=>0.59868973, "_source"=>{"title"=>"hogehoge"}}]}}

However, searching with an extremely long string raises a too_many_clauses error.

[30] pry(main)> client.search index: "tests",
[30] pry(main)* body: {
[30] pry(main)*   query: {
[30] pry(main)*     match: { title: "hoge" * 20 }
[30] pry(main)*   }
[30] pry(main)* }
Elasticsearch::Transport::Transport::Errors::BadRequest: [400] {"error":{"root_cause":[{"type":"query_shard_exception","reason":"failed to create query: {\n  \"match\" : {\n    \"title\" : {\n      \"query\" : \"hogehogehogehogehogehogehogehogehogehogehogehogehogehogehogehogehogehogehogehoge\",\n      \"operator\" : \"OR\",\n      \"prefix_length\" : 0,\n      \"max_expansions\" : 50,\n      \"fuzzy_transpositions\" : true,\n      \"lenient\" : false,\n      \"zero_terms_query\" : \"NONE\",\n      \"boost\" : 1.0\n    }\n  }\n}","index_uuid":"Fsi43Wx2S8uyyf8wtn59Xg","index":"tests"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"tests","node":"8ggISAxpTti1T1y_YxqBIQ","reason":{"type":"query_shard_exception","reason":"failed to create query: {\n  \"match\" : {\n    \"title\" : {\n      \"query\" : \"hogehogehogehogehogehogehogehogehogehogehogehogehogehogehogehogehogehogehogehoge\",\n      \"operator\" : \"OR\",\n      \"prefix_length\" : 0,\n      \"max_expansions\" : 50,\n      \"fuzzy_transpositions\" : true,\n      \"lenient\" : false,\n      \"zero_terms_query\" : \"NONE\",\n      \"boost\" : 1.0\n    }\n  }\n}","index_uuid":"Fsi43Wx2S8uyyf8wtn59Xg","index":"tests","caused_by":{"type":"too_many_clauses","reason":"maxClauseCount is set to 1024"}}}]},"status":400}
from /Users/syuta.ogido/works/minne/minne-app/vendor/bundle/ruby/2.4.0/gems/elasticsearch-transport-5.0.4/lib/elasticsearch/transport/transport/base.rb:202:in `__raise_transport_error'

First, find out what the error message means.

I found the explanation in the documentation for Lucene, the library Elasticsearch is built on: https://lucene.apache.org/core/6_0_0/core/org/apache/lucene/search/BooleanQuery.TooManyClauses.html

Thrown when an attempt is made to add more than BooleanQuery.getMaxClauseCount() clauses. This typically happens if a PrefixQuery, FuzzyQuery, WildcardQuery, or TermRangeQuery is expanded to many terms during search.

So it is an exception thrown when a boolean query is built with more clauses than BooleanQuery.getMaxClauseCount() allows.

The query I wrote does not use a boolean query explicitly, so the match query is presumably generating one internally. The error message says maxClauseCount is set to 1024, so it looks like the boolean query that the match query built internally ended up with more than 1024 clauses.

Next, look at how the match query behaves. Its documentation says the following: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html#query-dsl-match-query-boolean

The match query is of type boolean. It means that the text provided is analyzed and the analysis process constructs a boolean query from the provided text

In other words, the match query runs the search text through the analyzer and builds a boolean query from the resulting tokens.
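
As a rough mental model of that behaviour, the sketch below builds a bool query with one `term` clause per analyzed token. This is an illustration only: the helper `bool_query_for` is hypothetical, and the Lucene query that Elasticsearch actually constructs for a match query may be structured differently. The point is just that the clause count grows with the token count.

```ruby
# Builds a bool query with one `term` clause per analyzed token.
# Only a mental model of `match` with operator OR, not the real internals.
def bool_query_for(field, tokens)
  { bool: { should: tokens.map { |token| { term: { field => token } } } } }
end

# With min_gram 3 and max_gram 25, the text "hoge" yields these n-grams.
tokens = %w[hog hoge oge]
query = bool_query_for(:title, tokens)

# One clause per token.
puts query[:bool][:should].size
```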

The analyzer uses an nGram token filter, so the search text is probably being split into a large number of tokens. Let's check how many.

[37] pry(main)> client.indices.analyze(index: 'tests', text: "hoge" * 20)["tokens"].count
=> 1541

The search text that caused the error is split into 1541 tokens.

So how long can the search text be before it fails? Let's find out.
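
To see where 1541 comes from without a running cluster, here is a small pure-Ruby sketch that imitates this analyzer: the keyword tokenizer keeps the whole input as a single token, and the nGram filter emits every substring of 3 to 25 characters. The `ngrams` helper is written for this note and only approximates the real filter (for example, it does not reproduce token order or offsets).

```ruby
# Emulates a keyword tokenizer followed by an nGram token filter:
# every substring whose length is between min_gram and max_gram is a token.
def ngrams(text, min_gram: 3, max_gram: 25)
  chars = text.chars
  (0...chars.size).flat_map do |start|
    # The longest gram starting here is limited by max_gram and by the
    # number of characters remaining in the text.
    longest = [max_gram, chars.size - start].min
    (min_gram..longest).map { |len| chars[start, len].join }
  end
end

puts ngrams('hoge').inspect    # ["hog", "hoge", "oge"]
puts ngrams('hoge' * 20).size  # 1541, the same count _analyze reported
```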

[56] pry(main)> client.indices.analyze(index: 'tests', text: "h" * 100)["tokens"].count
=> 2001
[57] pry(main)> client.indices.analyze(index: 'tests', text: "h" * 900)["tokens"].count
=> 20401
[58] pry(main)> client.indices.analyze(index: 'tests', text: "h" * 100)["tokens"].count
=> 2001
[59] pry(main)> client.indices.analyze(index: 'tests', text: "h" * 90)["tokens"].count
=> 1771
[60] pry(main)> client.indices.analyze(index: 'tests', text: "h" * 80)["tokens"].count
=> 1541
[61] pry(main)> client.indices.analyze(index: 'tests', text: "h" * 70)["tokens"].count
=> 1311
[62] pry(main)> client.indices.analyze(index: 'tests', text: "h" * 60)["tokens"].count
=> 1081
[63] pry(main)> client.indices.analyze(index: 'tests', text: "h" * 59)["tokens"].count
=> 1058
[64] pry(main)> client.indices.analyze(index: 'tests', text: "h" * 58)["tokens"].count
=> 1035
[65] pry(main)> client.indices.analyze(index: 'tests', text: "h" * 57)["tokens"].count
=> 1012
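
The counts above follow a simple pattern: for each gram size k, a string of n characters has n - k + 1 substrings, so the total is the sum of that over k from min_gram to max_gram. A sketch of that arithmetic, with the hypothetical helpers `ngram_count` and `max_safe_length`, assuming a single keyword token as input and the 1024 limit from the error message:

```ruby
# Number of n-grams produced from a single token of `length` characters.
def ngram_count(length, min_gram: 3, max_gram: 25)
  longest = [max_gram, length].min
  return 0 if longest < min_gram

  # For each gram size k there are (length - k + 1) substrings.
  (min_gram..longest).sum { |k| length - k + 1 }
end

# Largest input length whose n-gram count stays within max_clauses.
def max_safe_length(max_clauses: 1024, min_gram: 3, max_gram: 25)
  length = 0
  while ngram_count(length + 1, min_gram: min_gram, max_gram: max_gram) <= max_clauses
    length += 1
  end
  length
end

puts ngram_count(57)   # 1012
puts ngram_count(58)   # 1035
puts max_safe_length   # 57
```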

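If the guess is right, 57 characters (1012 tokens) should work and 58 characters (1035 tokens) should fail, because the limit is 1024. Let's try it.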

[67] pry(main)> client.search index: "tests",
[67] pry(main)* body: {
[67] pry(main)*   query: {
[67] pry(main)*     match: { title: "a" * 57 }
[67] pry(main)*   }
[67] pry(main)* }
=> {"took"=>55, "timed_out"=>false, "_shards"=>{"total"=>5, "successful"=>5, "skipped"=>0, "failed"=>0}, "hits"=>{"total"=>0, "max_score"=>nil, "hits"=>[]}}
[68] pry(main)> client.search index: "tests",
[68] pry(main)* body: {
[68] pry(main)*   query: {
[68] pry(main)*     match: { title: "a" * 58 }
[68] pry(main)*   }
[68] pry(main)* }
Elasticsearch::Transport::Transport::Errors::BadRequest: [400] {"error":{"root_cause":[{"type":"query_shard_exception","reason":"failed to create query: {\n  \"match\" : {\n    \"title\" : {\n      \"query\" : \"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\",\n      \"operator\" : \"OR\",\n      \"prefix_length\" : 0,\n      \"max_expansions\" : 50,\n      \"fuzzy_transpositions\" : true,\n      \"lenient\" : false,\n      \"zero_terms_query\" : \"NONE\",\n      \"boost\" : 1.0\n    }\n  }\n}","index_uuid":"Fsi43Wx2S8uyyf8wtn59Xg","index":"tests"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"tests","node":"8ggISAxpTti1T1y_YxqBIQ","reason":{"type":"query_shard_exception","reason":"failed to create query: {\n  \"match\" : {\n    \"title\" : {\n      \"query\" : \"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\",\n      \"operator\" : \"OR\",\n      \"prefix_length\" : 0,\n      \"max_expansions\" : 50,\n      \"fuzzy_transpositions\" : true,\n      \"lenient\" : false,\n      \"zero_terms_query\" : \"NONE\",\n      \"boost\" : 1.0\n    }\n  }\n}","index_uuid":"Fsi43Wx2S8uyyf8wtn59Xg","index":"tests","caused_by":{"type":"too_many_clauses","reason":"maxClauseCount is set to 1024"}}}]},"status":400}
from /Users/syuta.ogido/works/minne/minne-app/vendor/bundle/ruby/2.4.0/gems/elasticsearch-transport-5.0.4/lib/elasticsearch/transport/transport/base.rb:202:in `__raise_transport_error'

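The 57-character query succeeded and the 58-character query failed with too_many_clauses. Here the text is analyzed with nGram using min_gram 3 and max_gram 25, which is why up to 57 characters is fine; the acceptable length changes with the analyzer settings.

Depending on the analyzer settings and the query, this is something to watch out for.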
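
One way to act on this, sketched here rather than taken from the investigation above, is to count the analyzed tokens before searching and reject or shorten input that would exceed the limit. `within_clause_limit?` is a hypothetical helper; it uses the same `client.indices.analyze` call as earlier and assumes the limit is the 1024 reported in the error message. Note that it costs an extra request per search.

```ruby
MAX_CLAUSE_COUNT = 1024 # value reported in the error message (assumed unchanged)

# Returns true when the analyzed text would stay within the clause limit.
# `client` is expected to respond to indices.analyze like Elasticsearch::Client.
def within_clause_limit?(client, index:, text:, limit: MAX_CLAUSE_COUNT)
  tokens = client.indices.analyze(index: index, text: text)['tokens']
  tokens.size <= limit
end

# Usage sketch (user_input is a placeholder for the search text):
#   if within_clause_limit?(client, index: 'tests', text: user_input)
#     client.search index: 'tests', body: { query: { match: { title: user_input } } }
#   else
#     # reject or shorten the input before searching
#   end
```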