Concepts

Natural Language Processing (NLP) is a field of Artificial Intelligence that enables computers to analyze and understand human language.

Natural Language Understanding (NLU) is a subset of the bigger picture of NLP, just as machine learning, deep learning, NLP, and data mining are subsets of the bigger picture of Artificial Intelligence (AI), an umbrella term for any computer program that does something smart.

spaCy is an open-source software library for advanced NLP, written in Python and Cython. It provides intuitive APIs for its features, which are backed by trained deep learning models.

Cornerstones

Before we actually dive into spaCy and code snippets, let's make sure we have the necessary setup ready.

mkdir -p ~/chatBot && cd ~/chatBot
pyenv virtualenv 3.8.5 chatBot
pyenv local chatBot
pip install spacy==3.4.3

spaCy models are just like any other machine learning or deep learning models. A model is the output of an algorithm: an object created after training a machine learning algorithm on data. spaCy ships many such models, which can be used directly in our program by downloading them just like any other Python package.

python -m spacy download zh_core_web_lg
python -m spacy download en_core_web_lg
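To confirm that a model downloaded correctly, a quick sanity check is to load it and print its metadata (a minimal sketch; the version printed will depend on what you installed):

import spacy

# Load the installed pipeline and print its metadata to confirm the download.
nlp = spacy.load('en_core_web_lg')
print(nlp.meta['lang'], nlp.meta['name'], nlp.meta['version'])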

POS Tagging

Part-of-speech (POS) tagging is a process where you read some text and assign a part of speech to each word or token, such as noun, verb, or adjective.
POS tagging becomes extremely important when you want to identify some entity in a given sentence. The first step is to do POS tagging and see what our text contains.
Let's get our hands dirty with some real examples of POS tagging.

import spacy

nlp = spacy.load('zh_core_web_lg')
doc = nlp('明天的天气如何?')
for token in doc:
    print(token.text, token.pos_)


# Output #
明天 NOUN
的 PART
天气 NOUN
如何 VERB
? PUNCT
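The same API works for the English pipeline. Here is a quick sketch; the exact tags can vary slightly across model versions, but you should see the pronoun, auxiliary, verbs, and noun tagged accordingly:

import spacy

nlp = spacy.load('en_core_web_lg')
doc = nlp('I am learning how to build chatbots')
for token in doc:
    print(token.text, token.pos_)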

Stemming and Lemmatization

Stemming is the process of reducing inflected words to their word stem, or base form.
A stemming algorithm reduces the word “saying” to the root word “say”, whereas “presumable” becomes “presum”. As you can see, this may or may not always be 100% correct.
Lemmatization is closely related to stemming, but lemmatization is the algorithmic process of determining the lemma of a word based on its intended meaning.
spaCy doesn't have any in-built stemmer, as lemmatization is considered more correct and productive.
Difference between stemming and lemmatization:

  • Stemming does the job in a crude, heuristic way that chops off the ends of words, assuming that the remaining word is what we are actually looking for, but it often includes the removal of derivational affixes.
  • Lemmatization tries to do the job more elegantly with the use of a vocabulary and morphological analysis of words. It tries its best to remove inflectional endings only and return the dictionary form of a word, known as the lemma.
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for w in ['went', 'goes']:
    print(w, stemmer.stem(w))


# Output #
went went
goes goe
import spacy

nlp = spacy.load('en_core_web_lg')
for token in nlp('went goes'):
    print(token.text, token.lemma_)


# Output #
went go
goes go

Since you are now aware of what stemming and lemmatization do in NLP, you should be able to recognize that whenever you come across a situation where you need the root form of a word, lemmatization is what you need. For example, it is often used in building search engines. You must have wondered how Google gives you the articles you meant to find even when your search text was not properly formulated.
This is where one makes use of lemmatization.
Imagine you search with the text, “When will the next season of Game of Thrones be releasing?”
Now, suppose the search engine does simple document word-frequency matching to give you search results. In this case, the aforementioned query probably won't match an article with the caption “Game of Thrones next season release date”.
If we lemmatize the original question before matching the input against the documents, we may get better results.
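As a minimal sketch of that idea (the overlap score here is purely illustrative, not how a real search engine ranks), we can lemmatize both the query and a candidate caption and compare the resulting lemma sets:

import spacy

nlp = spacy.load('en_core_web_lg')

def lemma_set(text):
    # Keep content-word lemmas; drop punctuation and stop words.
    return {t.lemma_.lower() for t in nlp(text) if not (t.is_punct or t.is_stop)}

query = 'When will the next season of Game of Thrones be releasing?'
caption = 'Game of Thrones next season release date'

q, c = lemma_set(query), lemma_set(caption)
# Shared lemmas and a simple overlap ratio as a stand-in for a relevance score.
print(q & c, len(q & c) / len(q | c))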

Named-Entity Recognition

NER, also known by other names like entity identification or entity extraction, is the process of finding named entities in a given text and classifying them into pre-defined categories.
The NER task is hugely dependent on the knowledge base used to train the NE extraction algorithm, so it may or may not work well depending on the dataset it was trained on.
spaCy comes with a very fast entity recognition model that is capable of identifying entity phrases in a given document. Entities can be of different types, such as person, location, organization, date, numeral, etc. These entities can be accessed through the .ents property of the doc object.

import spacy

nlp = spacy.load('zh_core_web_lg')
doc = nlp("欧盟拟向东南亚投资100亿欧元")
for ent in doc.ents:
    print(ent.text, ent.label_)


# Output #
欧盟 ORG
东南亚 LOC
100亿欧元 MONEY
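The English pipeline works the same way. The following sketch uses the classic example from the spaCy documentation; it should pick out Apple as an organization (ORG), U.K. as a geopolitical entity (GPE), and $1 billion as money (MONEY), though labels can vary slightly between model versions:

import spacy

nlp = spacy.load('en_core_web_lg')
doc = nlp('Apple is looking at buying U.K. startup for $1 billion')
for ent in doc.ents:
    print(ent.text, ent.label_)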

Stop Words

Stop words are high-frequency words like a, an, the, and to that we sometimes want to filter out of a document before further processing. Stop words usually have little lexical content and do not carry much meaning.

from spacy.lang.zh.stop_words import STOP_WORDS

print(list(STOP_WORDS)[:10])


# Output #
['岂止', '当即', '纵', '▲', '几经', '上来', '什麽', '假使', '×', '『']

To see whether a word is a stop word or not, you can use the nlp object's vocabulary: each lexeme exposes an is_stop attribute.

nlp.vocab["的"].is_stop


# Output #
True
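A common use of this attribute is filtering stop words out of a document before further processing. A minimal sketch:

import spacy

nlp = spacy.load('zh_core_web_lg')
doc = nlp("告诉我处女座今天的运势")
# Keep only the tokens that are not stop words.
print([t.text for t in doc if not t.is_stop])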

Dependency Parsing

Dependency parsing is one of the more beautiful and powerful features of spaCy, and it is fast and accurate. The parser can also be used for sentence boundary detection and lets you iterate over base noun phrases, or “chunks”.
This feature of spaCy gives you a parsed tree that explains the parent-child relationship between the words or phrases and is independent of the order in which words occur.

import spacy

nlp = spacy.load('en_core_web_lg')
doc = nlp("Book me a flight from Bangalore to Goa")
print(doc[5], list(doc[5].ancestors))


# Output #
Bangalore [from, flight, Book]

Ancestors are the sequence of a token's syntactic parents: its head, its head's head, and so on up to the root of the sentence.
To check programmatically whether one doc item is an ancestor of another, we can do the following:
doc[3].is_ancestor(doc[5])
The above returns True because doc[3] (i.e., flight) is an ancestor of doc[5] (i.e., Bangalore).

Children are the immediate syntactic dependents of a token. We can see the children of a word by using the children attribute, just as we used ancestors.
list(doc[3].children) will output [a, from, to]
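To see the whole structure at once, here is a small sketch that prints each token with its dependency label, its head, and its children:

import spacy

nlp = spacy.load('en_core_web_lg')
doc = nlp("Book me a flight from Bangalore to Goa")
for token in doc:
    # dep_ is the relation to the head; children are the direct dependents.
    print(token.text, token.dep_, token.head.text, [c.text for c in token.children])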

Dependency parsing is one of the most important parts of building chatbots from scratch. It becomes far more important when you want to figure out the meaning of a text input from your user to your chatbot. There can be cases where you haven't trained your chatbot, but you still don't want to lose your customer or reply like a dumb machine.
In these cases, dependency parsing really helps to find the relations and explain a bit more about what the user may be asking for.
If we were to list the things for which dependency parsing helps, some might be:

  • It helps in finding relationships between the words of grammatically correct sentences.
  • It can be used for sentence boundary detection.
  • It is quite useful for finding out whether the user is talking about more than one context simultaneously (see the sketch after this list).
    You need to write your own custom NLP to understand the context of the user or your chatbot and, based on that, identify the possible grammatical mistakes a user can make.
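As a rough, hypothetical heuristic (not a built-in spaCy feature), we can count the root verb plus any verbs conjoined to it to guess whether one utterance carries more than one request:

import spacy

nlp = spacy.load('en_core_web_lg')
doc = nlp("Book me a flight to Goa and send the invoice to my email")
# The root verb plus any verb attached to it with the "conj" relation
# suggests the user is making separate requests in one sentence.
requests = [t.text for t in doc
            if t.dep_ == "ROOT" or (t.dep_ == "conj" and t.pos_ == "VERB")]
print(requests)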

All in all, you must be ready for scenarios where a user will input garbage values or grammatically incorrect sentences. You can't handle all such scenarios at once, but you can keep improving your chatbot by adding custom NLP code or by limiting user input by design.

Noun Chunks

Noun chunks, or NP-chunks, are basically “base noun phrases”. We can say they are flat phrases that have a noun as their head. You can think of a noun chunk as a noun plus the words describing the noun.

import spacy

nlp = spacy.load('en_core_web_lg')
doc = nlp("Boston Dynamics is gearing up to produce thousands of robot dogs")
print(list(doc.noun_chunks))


# Output #
[Boston Dynamics, thousands, robot dogs]

Note that the noun_chunks syntax iterator is not implemented for the language 'zh', so this feature is unavailable in the Chinese pipeline.

Finding Similarity

Finding the similarity between two words is a use case you will encounter often when working with NLP. Sometimes it becomes fairly important to find whether two words are similar. While building chatbots, you will often come across situations where you don't just have to find similar-looking words but also how closely related two words are logically.
spaCy uses high-quality word vectors to find the similarity between two words, produced by the GloVe algorithm (Global Vectors for Word Representation).
GloVe is an unsupervised learning algorithm for obtaining vector representations of words. The GloVe algorithm uses aggregated global word-word co-occurrence statistics from a corpus to train the model.

import spacy

nlp = spacy.load('zh_core_web_lg')
doc = nlp("美团小贷注册资本增至75亿元")
for token in doc:
    print(token.text, token.vector[:5])


# Output #
美团 [-0.78366 0.31008 -1.0793 -0.87563 0.56224]
小贷 [ 1.0563 -2.695 -0.44203 0.98277 -4.8158 ]
注册 [ 3.4941 -2.8909 -4.572 -2.0436 -1.9986]
资本 [ 3.8849 0.3031 -2.382 2.8471 -2.7938]
增至 [ 1.469 1.0054 5.4963 1.8396 -3.7624]
75亿 [ 0.3056 0.69498 1.4444 0.91454 -0.97031]
元 [ 2.8509 -0.53181 -2.646 -0.5862 -4.0398 ]

This output on its own doesn't convey much meaning. From an application's perspective, what matters most is how similar the vectors of different words are, that is, how close the words themselves are in meaning.
To find the similarity between two words in spaCy, we can do the following.

import spacy

nlp = spacy.load('en_core_web_lg')
print(nlp("car").similarity(nlp("truck")))
print(nlp("car").similarity(nlp("plane")))


# Output #
0.7760473774094219
0.4592993678544094

The word ‘car’ is more related and similar to the word ‘truck’ than to the word ‘plane’.
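Under the hood, similarity is the cosine of the angle between the two vectors (for a single-word doc, that is just the word's vector). A minimal sketch that reproduces the idea by hand:

import numpy as np
import spacy

nlp = spacy.load('en_core_web_lg')

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of the norms.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

car, truck = nlp("car")[0].vector, nlp("truck")[0].vector
print(cosine(car, truck))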
We can also get the similarity between sentences.

import spacy

nlp = spacy.load('en_core_web_lg')
str1 = nlp("When will next season of Game of Thrones be releasing?")
str2 = nlp("Game of Thrones next season release date?")
print(str1.similarity(str2))


# Output #
0.8226084934378249

As we can see in this example, the similarity between the two sentences is about 82%, which is good enough to say that both sentences are quite similar, which is true. This can save us a lot of time writing custom code when building chatbots.

Tokenization

Tokenization is one of the simple yet fundamental concepts of NLP, where we split a text into meaningful segments. spaCy first tokenizes the text (i.e., segments it into words, punctuation, and so on). A question might come to your mind: why can't I just use Python's built-in split method and do the tokenization? Python's split method is just a raw way to split a sentence into tokens given a separator. It doesn't take any meaning into account, whereas tokenization tries to preserve the meaning as well.
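A quick contrast makes the difference visible: split leaves punctuation glued to the words, while spaCy separates it sensibly (the expected tokens in the comments follow the spaCy documentation's own example):

import spacy

nlp = spacy.load('en_core_web_lg')
text = "Let's go to N.Y.!"

print(text.split())                 # ["Let's", 'go', 'to', 'N.Y.!']
print([t.text for t in nlp(text)])  # ['Let', "'s", 'go', 'to', 'N.Y.', '!']

The same applies even more strongly to Chinese, which has no spaces to split on at all: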

import spacy

nlp = spacy.load('zh_core_web_lg')
doc = nlp("告诉我处女座今天的运势")
for tk in doc:
    print(tk.text)


# Output #
告诉
我
处女座
今天
的
运势

If you are not satisfied with spaCy's tokenization, you can use its add_special_case method to add your own rules before relying completely on spaCy's tokenizer.
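For example, the spaCy documentation shows how to split a nonstandard form into two tokens with a special-case rule:

import spacy
from spacy.symbols import ORTH

nlp = spacy.load('en_core_web_lg')
# Teach the tokenizer to split "gimme" into two tokens.
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
print([t.text for t in nlp("gimme that")])  # ['gim', 'me', 'that']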

Regular Expressions

You must already know about regular expressions and their usage.
Text analysis and processing is a big subject in itself. Sometimes words play together in a way that makes it extremely difficult for machines to understand and be trained upon.
Regular expressions can come in handy as a fallback for a machine learning model. They have the power of pattern matching, which can verify whether the data we are processing is correct or incorrect. Most of the early chatbots were hugely dependent on pattern matching.
Given the power of machine learning these days, regular expressions and pattern matching have taken a back seat, but make sure you brush up on them a bit, as they may be needed at any time to parse specific details from words, sentences, or text documents.
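As a hypothetical fallback sketch for the horoscope bot built later in this guide (the patterns and the sign list are illustrative, not exhaustive), a couple of regular expressions can pull a zodiac sign or a date word out of free text when the model fails:

import re

# Illustrative patterns; a real bot would need a fuller list of signs and dates.
SIGN_RE = re.compile(r"(白羊|金牛|双子|巨蟹|狮子|处女|天秤|天蝎|射手|摩羯|水瓶|双鱼)座?")
DATE_RE = re.compile(r"(今天|明天|本周|本月|今年)")

text = "看一下双鱼座明天的运势"
sign = SIGN_RE.search(text)
date = DATE_RE.search(text)
print(sign and sign.group(1), date and date.group(1))  # 双鱼 明天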

The Hard Way

“Building Chatbots the Hard Way” is not too hard to learn. It is the hard way of building chatbots so that you have full control over your own chatbot. If you want to build everything yourself, then you take the hard route. The harder route is hard while you travel it but beautiful and clear when you look back.

It is a rough road that leads to the heights of greatness. – Lucius Annaeus Seneca

pip install rasa==3.4.0

Steps for Building a Rasa Bot

  • Initialize the project
  • Prepare the NLU training data
  • Prepare the stories
  • Define the domain
  • Define the rules
  • Define the actions
  • Configure the pipeline
  • Train the model
  • Test the bot
  • Deploy the bot

Building a Simple Horoscope Bot

Let’s decide the scope of this chatbot and see what it does and can do.

  • The Horoscope Bot should be able to understand greetings and reply with a greeting.
  • The bot should be able to understand if the user is asking for a horoscope.
  • The bot should be able to ask for the user's horoscope sign if the user doesn't provide it.
  • The bot should learn from existing responses to formulate a new response.

What our bot is supposed to do here is pretty simple.

Possible intents

  • Greeting Intent: the user starts with a greeting
  • Get Horoscope Intent: the user asks for their horoscope
  • User's Horoscope Intent: the user provides their horoscope sign

We’ll try to build the bot that does the basic task of giving a horoscope.
Let’s create a possible conversation script between our chatbot and the user.

User: Hello
Bot: 你好,有什么能帮到你? (Hello, how can I help you?)
User: 看一下今年的运势 (Show me this year's horoscope)
Bot: 想查哪个星座的运势? (Which sign's horoscope would you like to check?)
User: 双鱼的 (Pisces)
Bot: 由于天王星的逆行,可能会打乱双鱼座的节奏,所以双鱼座本年要懂得韬光养晦,要努力的去沉淀自己... (With Uranus in retrograde, Pisces' rhythm may be disrupted this year, so Pisces should keep a low profile and work on self-cultivation...)

This conversation is just to give us a fair idea of what our chatbot conversation is going to look like.
We can have our chatbot model trained to prepare a valid response itself instead of writing a bunch of if ... else statements.

Initializing the bot

Let’s init our bot.

cd ~/chatBot
rasa init --no-prompt

Preparing data

First, prepare the NLU data. (The NLU component is responsible for intent classification and entity extraction.)
The following is what my nlu.yml under the data folder looks like:

version: "3.1"

nlu:
- intent: greet
  examples: |
    - hi
    - hello
- intent: goodbye
  examples: |
    - bye
    - goodbye
- intent: get_horoscope
  examples: |
    - 查一下[今天](date_time)的运势
    - 看一下[狮子座](horoscope_sign)如何
- intent: info_date_time
  examples: |
    - [今日](date_time)
    - [明天](date_time)
- intent: info_horoscope_sign
  examples: |
    - [白羊](horoscope_sign)
    - [金牛](horoscope_sign)

Then, prepare the stories data. (Rasa learns dialogue management by learning from stories.)
The following is what my stories.yml under the data folder looks like:

version: "3.1"

stories:

- story: greet
  steps:
  - intent: greet
  - action: utter_greet

- story: say goodbye
  steps:
  - intent: goodbye
  - action: utter_goodbye

- story: get horoscope
  steps:
  - or:
    - intent: get_horoscope
    - intent: get_horoscope
      entities:
      - horoscope_sign: 双鱼
    - intent: get_horoscope
      entities:
      - date_time: 今天
    - intent: get_horoscope
      entities:
      - horoscope_sign: 双鱼
      - date_time: 今天
  - action: horoscope_form
  - active_loop: horoscope_form

Then, prepare the domain. (The domain defines everything the chatbot needs to know: intents, entities, slots, actions, forms, and responses.)
The following is what my domain.yml under the project root looks like:

version: "3.1"

intents:
- greet
- goodbye
- get_horoscope
- info_date_time
- info_horoscope_sign

entities:
- date_time
- horoscope_sign

slots:
  date_time:
    type: text
    influence_conversation: false
    mappings:
    - entity: date_time
      type: from_entity
      conditions:
      - active_loop: horoscope_form

  horoscope_sign:
    type: text
    influence_conversation: false
    mappings:
    - entity: horoscope_sign
      type: from_entity
      conditions:
      - active_loop: horoscope_form

responses:
  utter_greet:
  - text: "你好,有什么能帮到你?"

  utter_goodbye:
  - text: "再见"

  utter_ask_horoscope_sign:
  - text: "想查哪个星座的运势?"

  utter_ask_date_time:
  - text: "想查今天、明天、本周、本月还是今年的运势?"

actions:
- utter_greet
- utter_goodbye
- utter_ask_horoscope_sign
- utter_ask_date_time
- action_get_horoscope
- action_default_fallback

forms:
  horoscope_form:
    ignored_intents: []
    required_slots:
    - date_time
    - horoscope_sign

session_config:
  session_expiration_time: 60
  carry_over_slots_to_new_session: true

Then, prepare the rules. (Rules deterministically map classified intents to the corresponding actions.)
The following is what my rules.yml under the data folder looks like:

version: "3.1"

rules:

- rule: activate horoscope form
  steps:
  - intent: get_horoscope
  - action: horoscope_form
  - active_loop: horoscope_form

- rule: submit form
  condition:
  - active_loop: horoscope_form
  steps:
  - action: horoscope_form
  - active_loop: null
  - slot_was_set:
    - requested_slot: null
  - action: action_get_horoscope

Then, prepare the actions. (An action receives the user input and the dialogue state, applies the business logic, and outputs events that change the dialogue state as well as messages to reply to the user.)
The following is what my actions.py under the actions folder looks like. (The horoscope data source is assumed to be a local JSON file here; adjust it to your own source.)

import json
from typing import Any, Dict, List, Text

from rasa_sdk import Action, Tracker
from rasa_sdk.events import UserUtteranceReverted
from rasa_sdk.executor import CollectingDispatcher

# Assumption: horoscope texts are kept in a local JSON file keyed first by
# sign and then by date expression, e.g. {"双鱼": {"今天": "..."}}.
with open("horoscopes.json", encoding="utf-8") as f:
    json_object = json.load(f)


class ActionGetHoroscope(Action):

    def name(self) -> Text:
        return "action_get_horoscope"

    def run(self, dispatcher: CollectingDispatcher,
            tracker: Tracker,
            domain: Dict[Text, Any]) -> List[Dict[Text, Any]]:

        date_time = tracker.get_slot("date_time")
        horoscope_sign = tracker.get_slot("horoscope_sign")

        # Look up the horoscope for the requested sign and date, with a
        # friendly message as the fallback.
        dispatcher.utter_message(
            text=json_object.get(horoscope_sign, {})
                            .get(date_time, "抱歉,我目前无法找到您的运势")
        )
        return []


class ActionDefaultFallback(Action):

    def name(self) -> Text:
        return "action_default_fallback"

    def run(
        self,
        dispatcher: CollectingDispatcher,
        tracker: Tracker,
        domain: Dict[Text, Any],
    ) -> List[Dict[Text, Any]]:
        dispatcher.utter_message(text="我不明白您说的内容,请换个说法。")
        # Revert the user utterance so it does not affect the dialogue state.
        return [UserUtteranceReverted()]

Then, prepare the config.
The following is what my config.yml under the project root looks like:

recipe: default.v1
language: zh
pipeline:
- name: SpacyNLP
  model: zh_core_web_lg
- name: SpacyTokenizer
  intent_tokenization_flag: False
  intent_split_symbol: "_"
  token_pattern: None
# - name: WhitespaceTokenizer
- name: RegexFeaturizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 100
  constrain_similarities: true
- name: EntitySynonymMapper
- name: ResponseSelector
  epochs: 100
  constrain_similarities: true
- name: FallbackClassifier
  threshold: 0.3
  ambiguity_threshold: 0.1
policies:
- name: MemoizationPolicy
- name: TEDPolicy
  max_history: 5
  epochs: 200
- name: RulePolicy
  core_fallback_threshold: 0.5
  core_fallback_action_name: action_default_fallback
  enable_fallback_prediction: True

Training

rasa train

Test

rasa test

OK, if everything goes well, we now have a simple chatbot.

Run

Run an action server.

rasa run actions --actions actions.actions

Run a shell

rasa shell

Let’s play with it in that shell.

Test Case

Try it out for yourself.
