Getting Started with rasa, Continued: Language Models and Tokenizers

Jul 30, 2021 · 2 min read · Tech · python · rasa

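rasa supports several language models and tokenizers. The commonly used language models are MitieNLP and SpacyNLP; the commonly used tokenizers include WhitespaceTokenizer, JiebaTokenizer, MitieTokenizer, and SpacyTokenizer.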

WhitespaceTokenizer

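The whitespace tokenizer turns every whitespace-separated piece of text into one token, which is the typical way to tokenize English sentences. It does not support Chinese. It is configured as follows: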

pipeline:
- name: "WhitespaceTokenizer"
  # Flag to check whether to split intents
  "intent_tokenization_flag": False
  # Symbol on which intent should be split
  "intent_split_symbol": "+"
  # Regular expression to detect tokens
  "token_pattern": None

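To see why whitespace splitting works for English but not for Chinese, here is a minimal sketch. The helper `whitespace_tokenize` is a hypothetical name for illustration, not rasa's implementation (the real component also deals with punctuation and other details); it only uses Python's built-in `str.split`.

```python
from typing import List


def whitespace_tokenize(text: str) -> List[str]:
    """Split text on runs of whitespace (a simplified stand-in for the real tokenizer)."""
    return text.split()


english = whitespace_tokenize("How do I get there?")
chinese = whitespace_tokenize("我要怎么去那里")

print(english)  # ['How', 'do', 'I', 'get', 'there?']
print(chinese)  # ['我要怎么去那里']
```

The Chinese sentence contains no spaces, so the whole sentence comes back as a single token, which is useless for downstream featurizers.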
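intent_tokenization_flag and intent_split_symbol are used when the NLU should return multiple intents. When intent_tokenization_flag is set to False, the NLU returns only the single intent with the highest confidence. Sometimes, though, one sentence carries more than one intent, for example: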

## intent: affirm+ask_transport
- Yes. How do I get there?
- Sounds good. Do you know how I could get there from home?

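The user's reply carries two meanings: it first agrees with my suggestion, and then asks how to get there. In this case, set intent_tokenization_flag to True and write the utterances for the combined intent in the training data, joining the individual intents with intent_split_symbol. At runtime, when the user says "Sounds good. Do you know how I could get there from home?", Rasa NLU returns the intent affirm+ask_transport.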
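As a rough sketch of what the two settings do to an intent label (the function `split_intent` is a hypothetical helper, not rasa's code), the label is either kept whole or split on the configured symbol:

```python
from typing import List


def split_intent(
    intent: str, intent_tokenization_flag: bool, intent_split_symbol: str
) -> List[str]:
    """Return the parts of an intent label, splitting only when the flag is on."""
    if intent_tokenization_flag:
        return intent.split(intent_split_symbol)
    return [intent]


print(split_intent("affirm+ask_transport", True, "+"))
# ['affirm', 'ask_transport']
print(split_intent("affirm+ask_transport", False, "+"))
# ['affirm+ask_transport']
```

Note that the split symbol should not appear inside an intent name: with "_" as the symbol, the label ask_transport on its own would be split into ask and transport.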

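token_pattern is a regular expression used to post-process the tokenizer's output. The sentence is first tokenized into a sequence, then token_pattern is applied to each token in that sequence, and the tokens it produces go into the final token list.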
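The post-processing step can be sketched roughly like this. It is an approximation under two assumptions of mine: the pieces matched by the pattern take the place of the original token, and a token with no match is kept unchanged. `apply_token_pattern` is a hypothetical name, and only the standard `re` module is used.

```python
import re
from typing import List


def apply_token_pattern(tokens: List[str], token_pattern: str) -> List[str]:
    """Re-split each token with the pattern; keep the token if nothing matches."""
    pattern = re.compile(token_pattern)
    final_tokens: List[str] = []
    for token in tokens:
        pieces = [piece for piece in pattern.findall(token) if piece]
        final_tokens.extend(pieces if pieces else [token])
    return final_tokens


tokens = "book-flight to new-york 2021".split()
print(apply_token_pattern(tokens, r"[a-z]+"))
# ['book', 'flight', 'to', 'new', 'york', '2021']
```

Here the pattern `[a-z]+` breaks hyphenated tokens into their word parts, while "2021" has no match and passes through untouched.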

JiebaTokenizer

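The jieba tokenizer can only be used for Chinese, and it supports tokenizing with a custom dictionary. The dictionary is configured as follows: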

pipeline:
- name: "JiebaTokenizer"
  dictionary_path: "path/to/custom/dictionary/dir"
  # Flag to check whether to split intents
  "intent_tokenization_flag": False
  # Symbol on which intent should be split
  "intent_split_symbol": "_"
  # Regular expression to detect tokens
  "token_pattern": None

Here, dictionary_path is the path to the directory that holds the dictionary files.
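A minimal sketch of preparing such a dictionary directory follows. It assumes the usual jieba user-dictionary format of one `word frequency part-of-speech` entry per line; the file name `custom_words.txt`, the sample words, and the helper `write_user_dict` are made up for illustration. The jieba calls at the end run only if jieba is installed.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def write_user_dict(directory: Path, entries: List[Tuple[str, int, str]]) -> Path:
    """Write entries as 'word freq pos' lines into a dictionary file."""
    directory.mkdir(parents=True, exist_ok=True)
    dict_file = directory / "custom_words.txt"  # hypothetical file name
    lines = [f"{word} {freq} {pos}" for word, freq, pos in entries]
    dict_file.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return dict_file


# A throwaway directory standing in for "path/to/custom/dictionary/dir".
dict_dir = Path(tempfile.mkdtemp()) / "dictionary"
dict_file = write_user_dict(dict_dir, [("云计算", 5, "n"), ("石墨烯", 5, "n")])

try:
    import jieba  # optional: only needed to try the dictionary out
except ImportError:
    jieba = None

if jieba is not None:
    jieba.load_userdict(str(dict_file))
    print(jieba.lcut("云计算和石墨烯"))
```

Pointing dictionary_path at `dict_dir` (the directory, not the file) would then match the configuration above.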

SpacyNLP

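spaCy is an advanced natural language processing library written in Python and Cython. It keeps up with the latest research and is built to be used in real products. spaCy ships with pretrained statistical models and word vectors and currently supports more than 60 languages. It provides convolutional neural network models for tagging, parsing, and named entity recognition that are very fast, give reasonably good results, and are easy to integrate into a product.
1. Install spaCy and download the language model packages: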

pip install -U spacy
# Chinese model
python -m spacy download zh_core_web_sm
# English model
python -m spacy download en_core_web_sm

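If the download command fails, you can download the package and install it locally, as follows:
1. Download the appropriate language model package from https://spacy.io/
2. Install it with pip, for example: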

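# Chinese model (install either the wheel or the source archive)
pip install \your-download-path\zh_core_web_sm-3.1.0-py3-none-any.whl
pip install \your-download-path\zh_core_web_sm-3.1.0.tar.gz
# English model (install either the wheel or the source archive)
pip install \your-download-path\en_core_web_sm-3.1.0-py3-none-any.whl
pip install \your-download-path\en_core_web_sm-3.1.0.tar.gz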

spaCy supports many languages, and each language comes with several different packages, so pick the one that fits your needs.
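Before wiring a model into rasa, it can help to confirm that the model package actually landed in the current Python environment. The sketch below relies on the fact that spaCy models install as ordinary Python packages named after the model; `missing_models` is a hypothetical helper and uses only the standard library.

```python
import importlib.util
from typing import List


def missing_models(model_names: List[str]) -> List[str]:
    """Return the model package names that cannot be found in this environment."""
    return [
        name for name in model_names if importlib.util.find_spec(name) is None
    ]


# An empty list means both model packages are importable.
print(missing_models(["zh_core_web_sm", "en_core_web_sm"]))
```

If a name shows up in the result, the download or the local pip install most likely went into a different environment than the one rasa runs in.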
2. Configure it in rasa:
Simple configuration:

pipeline:
    - name: "SpacyNLP"
      # "model" must be indented to line up with "name" so it belongs to the same list item
      model: "zh_core_web_sm"

For more configuration options, see https://rasa.com/docs/rasa/language-support#spacy :

pipeline:
    - name: "SpacyNLP"
      # language model to load
      model: "zh_core_web_sm"
    - name: "SpacyTokenizer"
      # Flag to check whether to split intents
      "intent_tokenization_flag": False
      # Symbol on which intent should be split
      "intent_split_symbol": "_"
      # Regular expression to detect tokens
      "token_pattern": None
    - name: "SpacyFeaturizer"
      # Specify what pooling operation should be used to calculate the vector of
      # the complete utterance. Available options: 'mean' and 'max'.
      "pooling": "mean"

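The "pooling" option of SpacyFeaturizer decides how the per-token vectors are combined into one vector for the whole utterance. A minimal sketch of the two options, with plain Python lists standing in for real word vectors (the `pool` function and the toy numbers are illustrative, not rasa's implementation):

```python
from typing import List


def pool(token_vectors: List[List[float]], pooling: str = "mean") -> List[float]:
    """Combine token vectors into one utterance vector, dimension by dimension."""
    columns = list(zip(*token_vectors))  # one tuple per vector dimension
    if pooling == "mean":
        return [sum(column) / len(column) for column in columns]
    if pooling == "max":
        return [max(column) for column in columns]
    raise ValueError(f"Unsupported pooling operation: {pooling}")


vectors = [[1.0, 4.0], [3.0, 2.0]]  # two tokens, two dimensions each
print(pool(vectors, "mean"))  # [2.0, 3.0]
print(pool(vectors, "max"))   # [3.0, 4.0]
```

Mean pooling averages every dimension over all tokens, while max pooling keeps the largest value seen in each dimension.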
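Pay attention to the quotation marks and underscores in this configuration...
When editing config.yml, don't forget to also change the language setting at the very top, language: en (for example, to zh when using a Chinese model).
!!! For some reason, once I configure spaCy, rasa shell fails with a TimeoutError. I don't know the cause yet and will come back to it later !!!
Also, to change the default timeout, go to $python install directory$\Lib\site-packages\rasa\core\channels\console.py and change DEFAULT_STREAM_READING_TIMEOUT_IN_SECONDS = 10 to the value you need.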