Adding Chinese Word Segmentation to Elasticsearch

Installing the IK analysis plugin

Download the project from GitHub (I downloaded it to /tmp) and unzip it

cd /tmp
wget https://github.com/medcl/elasticsearch-analysis-ik/archive/master.zip
unzip master.zip

Enter the extracted directory, elasticsearch-analysis-ik-master

cd elasticsearch-analysis-ik-master/

Then build the jar, elasticsearch-analysis-ik-1.4.0.jar, with mvn. The build may need several attempts before it succeeds

mvn package
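
Since the build may need a few attempts, a small retry wrapper can save some typing. This is only a sketch: retry is a hypothetical helper name of my own, not part of Maven or the IK project, and the attempt count is arbitrary.

```shell
#!/bin/sh
# Sketch: run a command up to N times, stopping at the first success.
# "retry" is a hypothetical helper, not a Maven feature.
retry() {
    max="$1"; shift          # first argument: maximum number of attempts
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        # Run the remaining arguments as the command; stop on success.
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt of $max failed" >&2
        attempt=$((attempt + 1))
    done
    return 1                 # every attempt failed
}

# Assumed usage, from inside elasticsearch-analysis-ik-master/:
#   retry 3 mvn package
```

The wrapper only repeats the command; it does not fix whatever made the build fail, so a persistent error will still surface after the last attempt.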

By the way, the mvn command requires Maven to be installed. On Ubuntu, install it as follows

apt-cache search maven
sudo apt-get install maven
mvn -version

Copy the ik folder under elasticsearch-analysis-ik-master/ to ${ES_HOME}/config/

Copy elasticsearch-analysis-ik-1.4.0.jar from elasticsearch-analysis-ik-master/target to ${ES_HOME}/lib
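
The two copy steps above can be sketched as a small shell function. The function name install_ik and its two arguments are my own assumptions for illustration; the paths simply mirror the ones described in this post.

```shell
#!/bin/sh
# Sketch: copy the IK dictionaries and the built jar into an Elasticsearch install.
# install_ik is a hypothetical helper; paths follow the steps described above.
install_ik() {
    src_dir="$1"   # e.g. /tmp/elasticsearch-analysis-ik-master
    es_home="$2"   # e.g. the value of ${ES_HOME}
    jar="$src_dir/target/elasticsearch-analysis-ik-1.4.0.jar"

    # Fail early if the ik folder or the built jar is missing.
    [ -d "$src_dir/ik" ] || { echo "missing $src_dir/ik" >&2; return 1; }
    [ -f "$jar" ] || { echo "missing $jar (did mvn package succeed?)" >&2; return 1; }

    mkdir -p "$es_home/config" "$es_home/lib"
    cp -r "$src_dir/ik" "$es_home/config/"   # ends up as config/ik
    cp "$jar" "$es_home/lib/"
}

# Assumed usage:
#   install_ik /tmp/elasticsearch-analysis-ik-master "$ES_HOME"
```

Checking for the jar first gives a clearer message than a bare cp error when the Maven build has not actually finished.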

Add the IK configuration to the end of the elasticsearch.yml configuration file under ${ES_HOME}/config/

index:
  analysis:
    analyzer:
      ik:
        alias: [ik_analyzer]
        type: org.elasticsearch.index.analysis.IkAnalyzerProvider
      ik_max_word:
        type: ik
        use_smart: false
      ik_smart:
        type: ik
        use_smart: true
index.analysis.analyzer.default.type: ik

In addition, httpclient-4.3.5.jar and httpcore-4.3.2.jar must be added to ${ES_HOME}/lib
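
Before restarting Elasticsearch, it can help to confirm that all three jars mentioned in this post are actually in place. The following is a minimal sketch; check_ik_jars is a hypothetical helper name, and it only checks that the files exist, not that they are valid jars.

```shell
#!/bin/sh
# Sketch: report which of the jars named in this post are present in ES_HOME/lib.
# check_ik_jars is a hypothetical helper; it takes the ES home directory as $1.
check_ik_jars() {
    lib_dir="$1/lib"
    missing=0
    for jar in elasticsearch-analysis-ik-1.4.0.jar \
               httpclient-4.3.5.jar \
               httpcore-4.3.2.jar; do
        if [ -f "$lib_dir/$jar" ]; then
            echo "ok      $jar"
        else
            echo "missing $jar"
            missing=1
        fi
    done
    # Exit status 0 only when every jar was found.
    return "$missing"
}

# Assumed usage:
#   check_ik_jars "$ES_HOME"
```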

Testing IK segmentation

Create an index named index

curl -XPUT http://localhost:9200/index

Create a mapping for the index named index

curl -XPOST http://localhost:9200/index/fulltext/_mapping -d '
{
    "fulltext": {
        "_all": {
            "analyzer": "ik"
        },
        "properties": {
            "content": {
                "type": "string",
                "boost": 8.0,
                "term_vector": "with_positions_offsets",
                "analyzer": "ik",
                "include_in_all": true
            }
        }
    }
}'

Test the analyzer. The request comes first, followed by the response it returned

curl -XGET 'localhost:9200/index/_analyze?analyzer=ik&pretty=true' -d '
{
     测试Elasticsearch分词器
}'

{
  "tokens" : [ {
    "token" : "测试",
    "start_offset" : 9,
    "end_offset" : 11,
    "type" : "CN_WORD",
    "position" : 1
  }, {
    "token" : "elasticsearch",
    "start_offset" : 11,
    "end_offset" : 24,
    "type" : "ENGLISH",
    "position" : 2
  }, {
    "token" : "分词器",
    "start_offset" : 24,
    "end_offset" : 27,
    "type" : "CN_WORD",
    "position" : 3
  }, {
    "token" : "分词",
    "start_offset" : 24,
    "end_offset" : 26,
    "type" : "CN_WORD",
    "position" : 4
  }, {
    "token" : "词",
    "start_offset" : 25,
    "end_offset" : 26,
    "type" : "CN_WORD",
    "position" : 5
  }, {
    "token" : "器",
    "start_offset" : 26,
    "end_offset" : 27,
    "type" : "CN_CHAR",
    "position" : 6
  } ]
}
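
The full JSON response is verbose when you only want to see how the text was split. As a rough sketch, the token values can be pulled out with grep and sed. The helper name list_tokens is my own, and the pattern assumes the pretty-printed layout shown above; a real JSON parser would be more robust.

```shell
#!/bin/sh
# Sketch: print one token per line from an _analyze response read on stdin.
# list_tokens is a hypothetical helper; it relies on the "token" : "..." layout
# of the pretty-printed response and is not a general JSON parser.
list_tokens() {
    grep -o '"token" *: *"[^"]*"' | sed 's/^"token" *: *"//; s/"$//'
}

# Assumed usage against the index created above, comparing the two analyzers
# configured in elasticsearch.yml:
#   curl -s -XGET 'localhost:9200/index/_analyze?analyzer=ik_smart&pretty=true' -d '测试Elasticsearch分词器' | list_tokens
#   curl -s -XGET 'localhost:9200/index/_analyze?analyzer=ik_max_word&pretty=true' -d '测试Elasticsearch分词器' | list_tokens
```

Running both commands side by side is a quick way to see the effect of the use_smart setting on the same input.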