nosql_101

NoSQL tutorial – Elasticsearch, MongoDB

Mappings & Stempel analysis for Polish

The Stempel plugin integrates Lucene's Stempel analysis module for Polish into Elasticsearch. The plugin provides the polish analyzer and the polish_stem token filter, neither of which is configurable.

elasticsearch-plugin install analysis-stempel
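Although polish and polish_stem themselves take no options, the filter can still be wired into a custom analyzer via index settings. A sketch (the analyzer name polish_custom is made up for illustration), combining the standard tokenizer, lowercase, and polish_stem:

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "polish_custom": {
          "tokenizer": "standard",
          "filter": ["lowercase", "polish_stem"]
        }
      }
    }
  }
}
```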

We will delete and recreate the index/type several times.

We will import the data into Elasticsearch using the Bulk API, so the lists of JSON documents must first be converted to the "interleaved" bulk format:

{"index": {"_index": "db", "_type": "collection", "_id": "id"}}   # action line
... document JSON (on one line) ...                               # source line

How? One option is the jq program:

< steinhaus.json  jq -c '{"index": {"_type": "steinhaus"}}, .'  > steinhaus.bulk
< szymborska.json jq -c '{"index": {"_type": "szymborska"}}, .' > szymborska.bulk
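To see what this transform produces, we can pipe a single (made-up) document through the same jq filter; each input document comes out as an action line followed by the document itself, both compacted to one line:

```shell
# Hypothetical one-document input, same filter as above:
echo '{"aphorism": "Myśl to skrót.", "tags": ["math"]}' |
  jq -c '{"index": {"_type": "steinhaus"}}, .'
# {"index":{"_type":"steinhaus"}}
# {"aphorism":"Myśl to skrót.","tags":["math"]}
```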

To recreate authors/steinhaus and authors/szymborska we use these commands:

curl -s -XDELETE localhost:9200/authors
curl -s -XPOST   localhost:9200/authors/_bulk --data-binary @data/steinhaus.bulk
curl -s -XPOST   localhost:9200/authors/_bulk --data-binary @data/szymborska.bulk

Check that nothing went wrong:

curl localhost:9200/authors/steinhaus/_count
curl localhost:9200/authors/szymborska/_count # count: 4

curl -s 'localhost:9200/authors/_search?q=snippet:myli'  | jq .hits.hits
curl -s 'localhost:9200/authors/_search?q=aphorism:brak' | jq .hits.hits

Mappings

You can think of a mapping as the equivalent of a schema in a relational database: it holds type information about each field. For example, «artist» would be a string, while «price» could be an integer.
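As a sketch, a mapping for that hypothetical example (an «album» type with those two fields) might look like:

```json
{
  "mappings": {
    "album": {
      "properties": {
        "artist": { "type": "text" },
        "price":  { "type": "integer" }
      }
    }
  }
}
```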

Here is the mapping Elasticsearch generated automatically for the steinhaus and szymborska types.

curl http://localhost:9200/authors/_mapping
{
  "authors": {
    "mappings": {
      "steinhaus": {
        "properties": {
          "aphorism": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "tags": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      },
      "szymborska": {
        "properties": {
          "snippet": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "tags": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    }
  }
}

Testing Analyzers

It is sometimes difficult to understand what is actually being tokenized and stored in your index. We can use the _analyze API to see how text is analyzed.

Testing the standard analyzer.

curl -s localhost:9200/_analyze --data-binary '{
  "analyzer": "standard",
  "text": "Ludzie MYŚLĄ & mówią"
}' | jq .
{
  "tokens": [
    {
      "token": "ludzie",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "myślą",
      "start_offset": 7,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "mówią",
      "start_offset": 15,
      "end_offset": 20,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}

Customizing Field Mappings

The two most important mapping attributes for string fields are index and analyzer. For example, a text field analyzed with the Polish analyzer:

{
  "tweet": {
    "type":     "text",
    "analyzer": "polish"
  }
}

curl -s -XDELETE localhost:9200/authors
curl -s -XPUT    localhost:9200/authors --data-binary @data/authors.mappings      # STEMPEL
curl -s -XPOST   localhost:9200/authors/_bulk --data-binary @data/steinhaus.bulk
curl -s -XPOST   localhost:9200/authors/_bulk --data-binary @data/szymborska.bulk

where authors.mappings contains:

{
  "mappings": {
    "steinhaus": {
      "properties": {
        "aphorism": {
          "type":     "text",
          "analyzer": "polish"
        },
        "tags": {
          "type": "keyword"
        }
      }
    },
    "szymborska": {
      "properties": {
        "snippet": {
          "type":     "text",
          "analyzer": "polish"
        }
      }
    }
  }
}

Testing the Polish analyzer

curl -s localhost:9200/authors/_analyze --data-binary '{
  "analyzer": "polish",
  "text": "Ludzie MYŚLĄ & mówią"
}' | jq .
{
  "tokens": [
    {
      "token": "lud",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "myśleć",
      "start_offset": 7,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "mówić",
      "start_offset": 15,
      "end_offset": 20,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}

Lucene Query Syntax

Example queries using the Lucene query syntax:

curl -s -XDELETE localhost:9200/authors/lec
curl -s -XPOST   localhost:9200/authors/_bulk --data-binary @data/lec.bulk

curl -s 'localhost:9200/authors/_search?q=aphorism:skar*'   | jq '.hits.hits[]'
curl -s 'localhost:9200/authors/_search?q=aphorism:duchy'   | jq '.hits.hits[]'
curl -s 'localhost:9200/authors/_search?q=tags:math*'       | jq '.hits.hits[]'
curl -s 'localhost:9200/authors/_search?q=aphorism:chocia*' | jq '.hits.hits[]'

curl -s 'localhost:9200/authors/lec/_search?q=milk'               | jq '.hits.hits[]'
curl -s 'localhost:9200/authors/lec/_search?q=aphorism:czekolada' | jq '.hits.hits[]'
curl -s 'localhost:9200/authors/lec/_search?q=tags:women'         | jq '.hits.hits[]'

What mapping did Elasticsearch generate automatically for authors/lec?

curl -s http://localhost:9200/authors/_mapping/lec
curl -s http://localhost:9200/authors/_mapping     | jq .authors.mappings.lec
{
  "properties": {
    "aphorism": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      },
      "analyzer": "polish"
    },
    "tags": {
      "type": "keyword"
    }
  }
}

TODO: write a Bash script wrapping this command:

curl -s localhost:9200/authors/_analyze \
  --data-binary '{"analyzer": "polish", "text": "$*"}' | jq .
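One possible sketch of that script (the jq-based payload construction is my addition: note that "$*" inside single quotes in the command above is never expanded by the shell, and splicing it in directly would break on quotes in the input):

```shell
#!/usr/bin/env bash
# analyze.sh -- send its arguments through the "polish" analyzer.
# Usage: ./analyze.sh Ludzie myślą i mówią
# Assumes Elasticsearch on localhost:9200 with the authors index.

# Build the JSON body with jq rather than interpolating "$*" into a
# quoted string, so quotes/backslashes in the input stay valid JSON.
payload=$(jq -n --arg text "$*" '{analyzer: "polish", text: $text}')

curl -s localhost:9200/authors/_analyze --data-binary "$payload" | jq .
```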