[ES] Elasticsearch Text Analysis

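1. Inverted index

Elasticsearch builds and stores a structure called an inverted index.
An inverted index is comparable to the index pages at the back of a book.
Given a keyword, the inverted index returns the ids of the documents that contain it.
As data grows, the number of rows to scan does not grow; new ids are simply appended to the id list that each inverted index entry points to, so searches stay fast without a significant slowdown.
Because the inverted index is built while the data is being stored, Elasticsearch describes ingesting data as indexing rather than storing.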
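The idea can be sketched in a few lines of Python. This is only a toy model, not how Elasticsearch or Lucene store data internally: the function name `build_inverted_index` and the sample documents are made up for illustration, and the "analysis" here is just lowercasing plus splitting on whitespace.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Toy analysis: lowercase, then split on whitespace only.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical sample documents keyed by document id.
docs = {
    1: "The quick brown fox",
    2: "The lazy dog",
    3: "Quick brown dogs",
}
index = build_inverted_index(docs)
print(index["quick"])  # [1, 3]
print(index["the"])    # [1, 2]
```

Looking up a term is a single dictionary access no matter how many documents exist; adding a document only appends its id to the lists of the terms it contains.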

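Term: each keyword that Elasticsearch extracts from the text and stores in the inverted index

Full-text search: searching on the terms that were extracted into the inverted index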

2. Text analysis

Text analysis is the multi-step process that turns the data of a string field into searchable tokens when the field is indexed.
The component that performs this process is called an analyzer.

3. Analyzer

An analyzer consists of zero or more character filters, exactly one tokenizer, and zero or more token filters, applied in that order.
Elasticsearch ships with built-in analyzers, and you can also build a custom analyzer by combining character filters, a tokenizer, and token filters.
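As a rough mental model, the three stages compose like the Python sketch below. The function `analyze` and the three stand-in components are hypothetical names for illustration only; real Elasticsearch components are configured in index settings, not passed around as functions.

```python
def analyze(text, char_filters, tokenizer, token_filters):
    """Run text through char filters, then one tokenizer, then token filters."""
    for char_filter in char_filters:      # zero or more, applied to the raw string
        text = char_filter(text)
    tokens = tokenizer(text)              # exactly one
    for token_filter in token_filters:    # zero or more, applied to the token list
        tokens = token_filter(tokens)
    return tokens


# Hypothetical stand-ins for real components.
def replace_ampersand(text):
    return text.replace("&", " and ")


def whitespace_tokenizer(text):
    return text.split()


def lowercase_filter(tokens):
    return [token.lower() for token in tokens]


tokens = analyze(
    "Tom&Jerry RUN",
    [replace_ampersand],
    whitespace_tokenizer,
    [lowercase_filter],
)
print(tokens)  # ['tom', 'and', 'jerry', 'run']
```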

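1) Character filter

The first step of text analysis: a preprocessing tool that replaces or removes specific characters across the whole input as needed.
The built-in types are html_strip, mapping, and pattern_replace (three types as of version 8.12). An analyzer can use zero or more character filters, including several of the same type, and they are applied in the order listed.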

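a. html_strip character filter

Strips HTML elements and decodes HTML entities such as &amp;.

b. mapping character filter

Replaces every occurrence of the specified strings with the specified replacements.

c. pattern_replace character filter

Replaces every match of a regular expression with the specified replacement string.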
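The three character filters can be approximated in Python as follows. These are simplified stand-ins, not the actual Elasticsearch implementations: the real html_strip also handles cases such as replacing block-level tags with newlines, and the real mapping filter matches greedily rather than replacing keys one after another. All function names and sample strings are assumptions for illustration.

```python
import html
import re


def html_strip(text):
    """Rough stand-in for html_strip: drop tags, then decode entities."""
    without_tags = re.sub(r"<[^>]+>", "", text)
    return html.unescape(without_tags)


def mapping(text, mappings):
    """Rough stand-in for mapping: replace each key with its value."""
    for source, target in mappings.items():
        text = text.replace(source, target)
    return text


def pattern_replace(text, pattern, replacement):
    """Rough stand-in for pattern_replace: regex substitution."""
    return re.sub(pattern, replacement, text)


print(html_strip("<p>Tom &amp; <b>Jerry</b></p>"))              # Tom & Jerry
print(mapping("I feel :) today", {":)": "_happy_"}))            # I feel _happy_ today
print(pattern_replace("123-456-789", r"(\d+)-(?=\d)", r"\1_"))  # 123_456_789
```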

2) Tokenizer

Splits the text into individual terms.
This is the indexing step that affects search behavior the most.
Exactly one tokenizer must be configured per analyzer.

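a) Categories

i) Word-oriented tokenizers

Generally used to tokenize full text into individual words.
Types: Standard, Letter, Lowercase, Whitespace, UAX URL Email, Classic, and Thai tokenizers.
Commonly used tokenizers
- Standard: splits on word boundaries (Unicode text segmentation rules) and drops most punctuation
- Letter: splits on any non-letter character (whitespace, digits, symbols) / can broaden what a search matches
- Whitespace: splits on whitespace only / punctuation stays attached to the term, so queries must match it exactly
- UAX URL Email: keeps email addresses and web URLs as single terms instead of splitting them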
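The difference between the Whitespace and Letter tokenizers is easy to see with a small Python approximation. The two functions below are hypothetical stand-ins written for this note; the Standard and UAX URL Email tokenizers follow more involved rules and are not reproduced here.

```python
import re


def whitespace_tokenize(text):
    """Split on whitespace only; punctuation stays attached."""
    return text.split()


def letter_tokenize(text):
    """Keep runs of letters; any digit, symbol, or space ends a term."""
    # [^\W\d_]+ matches one or more Unicode letters.
    return re.findall(r"[^\W\d_]+", text)


sample = "Email me at user-01@example.com, OK?"
print(whitespace_tokenize(sample))
# ['Email', 'me', 'at', 'user-01@example.com,', 'OK?']
print(letter_tokenize(sample))
# ['Email', 'me', 'at', 'user', 'example', 'com', 'OK']
```

With whitespace splitting, a query has to contain `OK?` including the question mark to match that term; with letter splitting, the email address is broken into pieces that match far more queries.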

ii) Partial word tokenizers

Break text or words into small fragments for partial-word matching.
Types: N-Gram and Edge N-Gram tokenizers.
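A minimal Python sketch of what the two produce for a single word is shown below. The parameter names `min_gram` and `max_gram` mirror the Elasticsearch settings of the same name, but the functions themselves are illustrative assumptions and ignore options such as `token_chars`.

```python
def ngrams(text, min_gram, max_gram):
    """All substrings of length min_gram..max_gram, ordered by start position."""
    grams = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams


def edge_ngrams(text, min_gram, max_gram):
    """Only the substrings anchored at the beginning of the text."""
    longest = min(max_gram, len(text))
    return [text[:size] for size in range(min_gram, longest + 1)]


print(ngrams("house", 2, 3))
# ['ho', 'hou', 'ou', 'ous', 'us', 'use', 'se']
print(edge_ngrams("house", 1, 3))
# ['h', 'ho', 'hou']
```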

iii) Structured text tokenizers

Generally used with structured text such as identifiers, email addresses, zip codes, and addresses rather than with full text.
Types: Keyword, Pattern, Simple Pattern, Char Group, Simple Pattern Split, and Path Hierarchy tokenizers.
Commonly used tokenizers
- pattern: splits terms on a delimiter given as a single character or a regular expression
- path hierarchy: stores path data level by level
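The behavior of these two tokenizers can be imitated in Python as below. Both functions are simplified, hypothetical stand-ins; the default pattern `\W+` matches the documented default of the pattern tokenizer, while the sample inputs are made up.

```python
import re


def path_hierarchy_tokenize(path, delimiter="/"):
    """Emit one term per level of the path, each including its parents."""
    parts = path.split(delimiter)
    tokens = []
    for end in range(1, len(parts) + 1):
        token = delimiter.join(parts[:end])
        if token:  # skip the empty prefix before a leading delimiter
            tokens.append(token)
    return tokens


def pattern_tokenize(text, pattern=r"\W+"):
    """Split on every match of the regular expression, dropping empty terms."""
    return [token for token in re.split(pattern, text) if token]


print(path_hierarchy_tokenize("/usr/share/elasticsearch"))
# ['/usr', '/usr/share', '/usr/share/elasticsearch']
print(pattern_tokenize("red,green;blue", r"[,;]"))
# ['red', 'green', 'blue']
```

Because every parent path is stored as its own term, a search for `/usr/share` can match any document located underneath it.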

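3) Token filter

Processes the terms produced by the tokenizer one by one according to the configured rules.
Zero or more can be applied.
When several token filters are listed, their order matters.
They are specified as an array under the filter key (use an array even for a single filter).
Commonly used token filters
- lowercase: converts uppercase to lowercase
- uppercase: converts lowercase to uppercase
- stop: removes stop words (the, a, etc.)
- snowball: stems terms; for English it strips suffixes such as ~s and ~ing so that variants collapse into a single term (the filter itself does not lowercase or remove stop words, so it is usually combined with the lowercase and stop filters)
- synonym: adds synonyms as needed (example: treating Amazon and aws as synonyms)
- ngram: an n-gram is a fragment of a term that is split out and stored ahead of time; used when searches must match on part of a text or word
- edge_ngram: stores only the n-grams anchored at the start of a term; useful for prefix matching such as search-as-you-type
- shingle: a shingle is a group of adjacent words; stores terms joined into word groups
- unique: stores duplicate terms only once / with a match query each term is stored only once, so be careful about its effect on scoring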
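Why the order of token filters matters can be shown with a small Python model. The functions are toy stand-ins, and `STOP_WORDS` is a tiny made-up list rather than the real default. The one behavior borrowed from Elasticsearch is that the stop filter compares case-sensitively unless `ignore_case` is enabled. The real shingle filter also emits single words by default, which this sketch omits.

```python
STOP_WORDS = {"the", "a", "is"}  # tiny illustrative list, not the real default


def lowercase(tokens):
    return [token.lower() for token in tokens]


def stop(tokens):
    # Case-sensitive comparison: "The" is not the same as "the".
    return [token for token in tokens if token not in STOP_WORDS]


def unique(tokens):
    # dict.fromkeys keeps the first occurrence of each term, in order.
    return list(dict.fromkeys(tokens))


def shingle(tokens, size=2):
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


tokens = "The fox is the Fox".split()
print(stop(lowercase(tokens)))             # ['fox', 'fox']
print(lowercase(stop(tokens)))             # ['the', 'fox', 'fox']
print(unique(stop(lowercase(tokens))))     # ['fox']
print(shingle(["quick", "brown", "fox"]))  # ['quick brown', 'brown fox']
```

Running stop before lowercase lets the capitalized The slip through, which is why lowercase is normally listed first.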

4. Checking an analyzer with the _analyze API

Elasticsearch provides the _analyze API for combining analysis components and inspecting exactly how they behave.

GET _analyze
{
  "text": "Anyone who stops learning is old, whether at twenty or eighty. Anyone who keeps learning stays young. The greatest thing in life is to keep your mind young.",
  "tokenizer": "whitespace",
  "filter": [
    "lowercase",
    "stop",
    "snowball"
    ]
}

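The example above gives roughly the same result as applying the snowball analyzer.

The snowball analyzer combines the lowercase, stop, and snowball token filters. Older Elasticsearch documentation describes it as using the standard tokenizer rather than whitespace, so punctuation handling can differ from the whitespace-based request above.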

GET _analyze
{
  "text": "Anyone who stops learning is old, whether at twenty or eighty. Anyone who keeps learning stays young. The greatest thing in life is to keep your mind young.",
  "analyzer": "snowball"
}

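5. Custom analyzer

For an indexed field, text processing can only be configured by assigning an analyzer; a tokenizer or token filter cannot be set on the field directly.
In practice a custom analyzer is usually used rather than one of the built-in analyzers.
For reference, a built-in analyzer is applied simply by naming it in the analyzer setting of a text field in the mapping.
If nothing is set in the mapping, the standard analyzer is applied automatically.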

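1) Creating a custom analyzer

Define the analyzer under "index": {"analysis": {"analyzer": {...}}} in the index settings. In the requests in this post, 인덱스 is a placeholder index name.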

PUT 인덱스
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_custom_analyzer": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": [
              "lowercase",
              "stop",
              "snowball"
            ]
          }
        }
      }
    }
  }
}

2) Applying the custom analyzer

Run it against that index with GET or POST <index name>/_analyze.

GET 인덱스/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": [
    "Anyone who stops learning is old, whether at twenty or eighty. Anyone who keeps learning stays young. The greatest thing in life is to keep your mind young."
  ]
}

Tokens returned for the custom analyzer

{
  "tokens": [
    {
      "token": "anyon",
      "start_offset": 0,
      "end_offset": 6,
      "type": "word",
      "position": 0
    },
    {
      "token": "who",
      "start_offset": 7,
      "end_offset": 10,
      "type": "word",
      "position": 1
    },
    {
      "token": "stop",
      "start_offset": 11,
      "end_offset": 16,
      "type": "word",
      "position": 2
    },
    {
      "token": "learn",
      "start_offset": 17,
      "end_offset": 25,
      "type": "word",
      "position": 3
    },
    {
      "token": "old,",
      "start_offset": 29,
      "end_offset": 33,
      "type": "word",
      "position": 5
    },
    {
      "token": "whether",
      "start_offset": 34,
      "end_offset": 41,
      "type": "word",
      "position": 6
    },
    {
      "token": "twenti",
      "start_offset": 45,
      "end_offset": 51,
      "type": "word",
      "position": 8
    },
    {
      "token": "eighty.",
      "start_offset": 55,
      "end_offset": 62,
      "type": "word",
      "position": 10
    },
    {
      "token": "anyon",
      "start_offset": 63,
      "end_offset": 69,
      "type": "word",
      "position": 11
    },
    {
      "token": "who",
      "start_offset": 70,
      "end_offset": 73,
      "type": "word",
      "position": 12
    },
    {
      "token": "keep",
      "start_offset": 74,
      "end_offset": 79,
      "type": "word",
      "position": 13
    },
    {
      "token": "learn",
      "start_offset": 80,
      "end_offset": 88,
      "type": "word",
      "position": 14
    },
    {
      "token": "stay",
      "start_offset": 89,
      "end_offset": 94,
      "type": "word",
      "position": 15
    },
    {
      "token": "young.",
      "start_offset": 95,
      "end_offset": 101,
      "type": "word",
      "position": 16
    },
    {
      "token": "greatest",
      "start_offset": 106,
      "end_offset": 114,
      "type": "word",
      "position": 18
    },
    {
      "token": "thing",
      "start_offset": 115,
      "end_offset": 120,
      "type": "word",
      "position": 19
    },
    {
      "token": "life",
      "start_offset": 124,
      "end_offset": 128,
      "type": "word",
      "position": 21
    },
    {
      "token": "keep",
      "start_offset": 135,
      "end_offset": 139,
      "type": "word",
      "position": 24
    },
    {
      "token": "your",
      "start_offset": 140,
      "end_offset": 144,
      "type": "word",
      "position": 25
    },
    {
      "token": "mind",
      "start_offset": 145,
      "end_offset": 149,
      "type": "word",
      "position": 26
    },
    {
      "token": "young.",
      "start_offset": 150,
      "end_offset": 156,
      "type": "word",
      "position": 27
    }
  ]
}

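Original text

Anyone who stops learning is old, whether at twenty or eighty. Anyone who keeps learning stays young. The greatest thing in life is to keep your mind young.

Result (tokens in order)

anyon who stop learn old, whether twenti eighty. anyon who keep learn stay young. greatest thing life keep your mind young.

The lowercase filter converted Anyone to anyone. The does not appear in the result, but it was lowercased to the before the stop filter removed it.

The stop filter removed the stop words is, at, or, the, in, and to.

The snowball filter stemmed the terms: it stripped ~s and ~ing and changed a trailing ~y to ~i.

* Stop words: words with little value as search terms, such as articles and prepositions

Puzzling points

Why was anyone turned into anyon?
Most likely because the snowball stemmer is rule-based and strips the trailing e. A stem only has to be the same at index time and search time; it does not have to be a dictionary word.

anyon, who, learn, keep, and young. each appear twice.
The input contains each of these words twice, and _analyze returns one token per occurrence, each with its own position.

The period at the end of a sentence is not handled.
The whitespace tokenizer splits on whitespace only, so punctuation stays attached to the term (old, / eighty. / young.). This is probably also why eighty. kept its y while twenty became twenti.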
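Two details of the token list, the gaps in the position numbers and the punctuation that stays attached, can be reproduced with a short Python model of whitespace tokenizing followed by lowercase and stop filtering. This is a sketch only: stemming is left out, the function name `analyze` is made up, and the stop word set is Lucene's default English list as I understand it, so treat its exact contents as an assumption.

```python
# Default English stop words (assumed to match Lucene's English stop set).
ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def analyze(text):
    """Whitespace tokenizer, then lowercase, then stop; positions are kept."""
    tokens = []
    for position, raw in enumerate(text.split()):
        term = raw.lower()
        if term in ENGLISH_STOP_WORDS:
            continue  # the term is dropped, but its position is not reused
        tokens.append((position, term))
    return tokens


text = "Anyone who stops learning is old, whether at twenty or eighty."
result = analyze(text)
print([position for position, _ in result])
# [0, 1, 2, 3, 5, 6, 8, 10]
print([term for _, term in result])
# ['anyone', 'who', 'stops', 'learning', 'old,', 'whether', 'twenty', 'eighty.']
```

The missing positions 4, 7, and 9 are where is, at, and or were removed, which matches the position values in the _analyze output.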

6. Custom tokenizers and token filters

1) Creating a custom token filter

# Create a stop filter whose only stop word is 'your'

PUT 인덱스
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "커스텀애널라이저": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": [
              "lowercase",
              "my_stop_filter",
              "snowball"
            ]
          }
        },
        "filter": {
          "my_stop_filter": {
            "type": "stop",
            "stopwords": [
              "your"
            ]
          }
        }
      }
    }
  }
}

2) Applying the custom filter and analyzer

Tokens returned for the custom analyzer

{
  "tokens": [
    {
      "token": "anyon",
      "start_offset": 0,
      "end_offset": 6,
      "type": "word",
      "position": 0
    },
    {
      "token": "who",
      "start_offset": 7,
      "end_offset": 10,
      "type": "word",
      "position": 1
    },
    {
      "token": "stop",
      "start_offset": 11,
      "end_offset": 16,
      "type": "word",
      "position": 2
    },
    {
      "token": "learn",
      "start_offset": 17,
      "end_offset": 25,
      "type": "word",
      "position": 3
    },
    {
      "token": "is",
      "start_offset": 26,
      "end_offset": 28,
      "type": "word",
      "position": 4
    },
    {
      "token": "old,",
      "start_offset": 29,
      "end_offset": 33,
      "type": "word",
      "position": 5
    },
    {
      "token": "whether",
      "start_offset": 34,
      "end_offset": 41,
      "type": "word",
      "position": 6
    },
    {
      "token": "at",
      "start_offset": 42,
      "end_offset": 44,
      "type": "word",
      "position": 7
    },
    {
      "token": "twenti",
      "start_offset": 45,
      "end_offset": 51,
      "type": "word",
      "position": 8
    },
    {
      "token": "or",
      "start_offset": 52,
      "end_offset": 54,
      "type": "word",
      "position": 9
    },
    {
      "token": "eighty.",
      "start_offset": 55,
      "end_offset": 62,
      "type": "word",
      "position": 10
    },
    {
      "token": "anyon",
      "start_offset": 63,
      "end_offset": 69,
      "type": "word",
      "position": 11
    },
    {
      "token": "who",
      "start_offset": 70,
      "end_offset": 73,
      "type": "word",
      "position": 12
    },
    {
      "token": "keep",
      "start_offset": 74,
      "end_offset": 79,
      "type": "word",
      "position": 13
    },
    {
      "token": "learn",
      "start_offset": 80,
      "end_offset": 88,
      "type": "word",
      "position": 14
    },
    {
      "token": "stay",
      "start_offset": 89,
      "end_offset": 94,
      "type": "word",
      "position": 15
    },
    {
      "token": "young.",
      "start_offset": 95,
      "end_offset": 101,
      "type": "word",
      "position": 16
    },
    {
      "token": "the",
      "start_offset": 102,
      "end_offset": 105,
      "type": "word",
      "position": 17
    },
    {
      "token": "greatest",
      "start_offset": 106,
      "end_offset": 114,
      "type": "word",
      "position": 18
    },
    {
      "token": "thing",
      "start_offset": 115,
      "end_offset": 120,
      "type": "word",
      "position": 19
    },
    {
      "token": "in",
      "start_offset": 121,
      "end_offset": 123,
      "type": "word",
      "position": 20
    },
    {
      "token": "life",
      "start_offset": 124,
      "end_offset": 128,
      "type": "word",
      "position": 21
    },
    {
      "token": "is",
      "start_offset": 129,
      "end_offset": 131,
      "type": "word",
      "position": 22
    },
    {
      "token": "to",
      "start_offset": 132,
      "end_offset": 134,
      "type": "word",
      "position": 23
    },
    {
      "token": "keep",
      "start_offset": 135,
      "end_offset": 139,
      "type": "word",
      "position": 24
    },
    {
      "token": "mind",
      "start_offset": 145,
      "end_offset": 149,
      "type": "word",
      "position": 26
    },
    {
      "token": "young.",
      "start_offset": 150,
      "end_offset": 156,
      "type": "word",
      "position": 27
    }
  ]
}

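Using my own filter (with your as its only stop word) in place of the built-in stop filter removed your, but is, at, or, the, in, and to came back as tokens, because a custom stopwords list replaces the default English list instead of adding to it.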
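The effect is the same as swapping one set for another, as in the Python sketch below. `DEFAULT_STOP_WORDS` here holds only the six default stop words observed in this post, not the full default list, and `stop_filter` is a hypothetical stand-in for the real filter.

```python
# Only the six default stop words observed in this post, not the full list.
DEFAULT_STOP_WORDS = {"is", "at", "or", "the", "in", "to"}
CUSTOM_STOP_WORDS = {"your"}


def stop_filter(tokens, stop_words):
    return [token for token in tokens if token not in stop_words]


tokens = "the greatest thing in life is to keep your mind young.".split()

print(stop_filter(tokens, DEFAULT_STOP_WORDS))
# ['greatest', 'thing', 'life', 'keep', 'your', 'mind', 'young.']
print(stop_filter(tokens, CUSTOM_STOP_WORDS))
# ['the', 'greatest', 'thing', 'in', 'life', 'is', 'to', 'keep', 'mind', 'young.']
print(stop_filter(tokens, DEFAULT_STOP_WORDS | CUSTOM_STOP_WORDS))
# ['greatest', 'thing', 'life', 'keep', 'mind', 'young.']
```

To keep the default behavior and also drop your, the custom list has to contain the default words as well, as the last call does with a set union.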

7. Applying a custom analyzer in the mapping

Apply the custom analyzer, which uses the custom filter defined in settings{}, to the 필드명 field (a placeholder field name) in mappings{}.

PUT 인덱스
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "커스텀애널라이저": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": [
              "lowercase",
              "커스텀필터",
              "snowball"
            ]
          }
        },
        "filter": {
          "커스텀필터": {
            "type": "stop",
            "stopwords": [
              "young"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "필드명": {
        "type": "text",
        "analyzer": "커스텀애널라이저"
      }
    }
  }
}

Index a document

PUT 인덱스/_doc/1
{
  "필드명": "Anyone who stops learning is old, whether at twenty or eighty. Anyone who keeps learning stays young. The greatest thing in life is to keep your mind young."
}

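Search

Searching for young, which was registered as a stop word

Searching for mind, which was not registered as a stop word

To inspect the inverted index of an indexed document, use the _termvectors API.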

License: Attribution, NonCommercial, NoDerivatives (CC BY-NC-ND)
