Text Classification with NLP: Comparing Three Models, Tf-Idf, Word2Vec and BERT

English original: Text Classification with NLP: Tf-Idf vs Word2Vec vs BERT

Translation: 雷鋒字幕組 (關山, wiige)

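Summary

In this article, I will use NLP and Python to explain 3 different strategies for multiclass text classification: the old-fashioned Bag-of-Words (with Tf-Idf), the famous Word Embedding (with Word2Vec) and the cutting-edge Language Models (with BERT).

NLP (Natural Language Processing) is the field of artificial intelligence that studies the interaction between computers and human languages, in particular how to program computers to process and analyze large amounts of natural language data. NLP is often applied to classifying text data. Text classification is the problem of assigning categories to text data according to its content.

There are several techniques for extracting information from raw text data and using it to train a classification model. This tutorial compares the traditional Bag-of-Words approach (used with a simple machine learning algorithm), the popular Word Embedding model (used with a deep learning neural network) and the state-of-the-art Language Models (used with transfer learning from attention-based transformers), which have completely changed the NLP landscape.

I will present some useful Python code that can be easily applied in other similar cases (just copy, paste, run) and comment every line so that you can replicate this example (link to the full code below).

mdipietro09/DataScience_ArtificialIntelligence_Utils

I will use the "News category dataset", which provides news headlines from 2012 to 2018 obtained from HuffPost. The task is to assign each headline to the right category, which makes this a multiclass classification problem (link to the dataset below).

News Category Dataset

In particular, I will go through:

Setup: import packages, read data, preprocessing, partitioning.
Bag-of-Words: feature engineering, feature selection and machine learning with scikit-learn, testing and evaluation, explainability with lime.
Word Embedding: fitting a Word2Vec with gensim, feature engineering and deep learning with tensorflow/keras, testing and evaluation, explainability with the Attention mechanism.
Language Models: feature engineering with transformers, transfer learning from a pre-trained BERT with transformers and tensorflow/keras, testing and evaluation.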
Setup

First of all, we need to import the following libraries:

## for data
import json
import pandas as pd
import numpy as np
## for processing (used by the preprocessing function below)
import re
import nltk
## for plotting
import matplotlib.pyplot as plt
import seaborn as sns
## for bag-of-words
from sklearn import feature_extraction, feature_selection, model_selection, naive_bayes, pipeline, manifold, preprocessing, metrics
## for explainer
from lime import lime_text
## for word embedding
import gensim
import gensim.downloader as gensim_api
## for deep learning
from tensorflow.keras import models, layers, preprocessing as kprocessing
from tensorflow.keras import backend as K
## for bert language model
import transformers

The dataset is contained in a json file, so I will first read it into a list of dictionaries with json and then transform it into a pandas DataFrame.

lst_dics = []
with open('data.json', mode='r', errors='ignore') as json_file:
    for dic in json_file:
        lst_dics.append( json.loads(dic) )
## print the first one
lst_dics[0]

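The original dataset contains over 30 categories, but for the purposes of this tutorial I will work with a subset of 3: Entertainment, Politics and Tech.

## create dtf
dtf = pd.DataFrame(lst_dics)
## filter categories
dtf = dtf[ dtf["category"].isin(['ENTERTAINMENT','POLITICS','TECH']) ][["category","headline"]]
## rename columns
dtf = dtf.rename(columns={"category":"y", "headline":"text"})
## print 5 random rows
dtf.sample(5)

As the class distribution plot shows, the dataset is imbalanced: the share of Tech news is very small compared to the other categories, which will make it hard for the models to recognize Tech news.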
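The imbalance can also be checked numerically. Below is a minimal sketch using only the standard library; the `labels` list is made-up illustrative data and `class_shares` is a hypothetical helper, neither comes from the original article. On the real DataFrame, `dtf["y"].value_counts(normalize=True)` should give the same kind of summary.

```python
from collections import Counter

def class_shares(labels):
    """Return {label: share of observations}, most frequent class first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.most_common()}

# hypothetical toy target, only to illustrate an imbalanced distribution
labels = ["POLITICS"] * 6 + ["ENTERTAINMENT"] * 3 + ["TECH"] * 1
shares = class_shares(labels)
for label, share in shares.items():
    print(f"{label:<15}{share:.0%}")
```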

Before explaining and building the models, I am going to give an example of preprocessing: cleaning the text, removing stop words and applying lemmatization. I will write a function and apply it to the whole dataset.

def utils_preprocess_text(text, flg_stemm=False, flg_lemm=True, lst_stopwords=None):
    '''
    Preprocess a string.
    :parameter
        :param text: string - text to clean
        :param lst_stopwords: list - list of stopwords to remove
        :param flg_stemm: bool - whether stemming is to be applied
        :param flg_lemm: bool - whether lemmatisation is to be applied
    :return
        cleaned text
    '''
    ## clean (convert to lowercase, remove punctuation and special characters, then strip)
    text = re.sub(r'[^\w\s]', '', str(text).lower().strip())

    ## Tokenize (convert from string to list)
    lst_text = text.split()

    ## remove Stopwords
    if lst_stopwords is not None:
        lst_text = [word for word in lst_text if word not in lst_stopwords]

    ## Stemming (remove -ing, -ly, ...)
    if flg_stemm == True:
        ps = nltk.stem.porter.PorterStemmer()
        lst_text = [ps.stem(word) for word in lst_text]

    ## Lemmatisation (convert the word into root word)
    if flg_lemm == True:
        lem = nltk.stem.wordnet.WordNetLemmatizer()
        lst_text = [lem.lemmatize(word) for word in lst_text]

    ## back to string from list
    text = " ".join(lst_text)
    return text

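The function removes a set of words from the corpus if one is given. We can create a list of generic English stop words with nltk (and edit it by adding or removing words).

## requires the nltk "stopwords" and "wordnet" resources, e.g. nltk.download("stopwords")
lst_stopwords = nltk.corpus.stopwords.words("english")
lst_stopwords

Now I will apply the function to the whole dataset and store the result in a new column named "text_clean", so that you can choose to work with either the raw corpus or the preprocessed text.

dtf["text_clean"] = dtf["text"].apply(lambda x:
          utils_preprocess_text(x, flg_stemm=False, flg_lemm=True,
          lst_stopwords=lst_stopwords))
dtf.head()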

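If you are interested in a deeper text analysis and preprocessing, you can check this article. I will split the dataset into a training set (70%) and a test set (30%) in order to evaluate the models' performance.

## split dataset
dtf_train, dtf_test = model_selection.train_test_split(dtf, test_size=0.3)
## get target
y_train = dtf_train["y"].values
y_test = dtf_test["y"].values

Let's get started!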

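Bag-of-Words

The Bag-of-Words model is simple: it builds a vocabulary from a corpus of documents and counts how many times each word appears in each document. In other words, each word in the vocabulary becomes a feature, and a document is represented by a vector with the same length as the vocabulary (a "bag of words"). For example, take 3 sentences and represent them with this approach:

Shape of the feature matrix: number of documents x length of the vocabulary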
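To make the representation concrete, here is a minimal standard-library sketch of a count-based bag of words. The three sentences and the `bag_of_words` helper are illustrative assumptions, not the ones used in the original figure; in practice scikit-learn's `CountVectorizer` does this job.

```python
from collections import Counter

def bag_of_words(docs):
    """Build a sorted vocabulary and a (documents x vocabulary) count matrix."""
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({word for words in tokenized for word in words})
    matrix = []
    for words in tokenized:
        counts = Counter(words)
        # one column per vocabulary word, in vocabulary order
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

# hypothetical example sentences
docs = ["I like this article", "I like this website", "this article is about NLP"]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row has as many entries as the vocabulary has words, which is why the matrix grows wide and sparse as the corpus grows.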

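As you can imagine, this approach causes a significant dimensionality problem: the more documents you have, the larger the vocabulary, so the feature matrix becomes a huge sparse matrix. For this reason, the Bag-of-Words model is usually preceded by substantial preprocessing (word cleaning, stop word removal, stemming/lemmatization) aimed at reducing the dimensionality.

Term frequency is not necessarily the best representation of a text. In fact, some common words occur very often in the corpus yet have little predictive power over the target variable. To address this, there is an advanced variant of Bag-of-Words that uses term frequency-inverse document frequency (Tf-Idf) instead of simple counts. Basically, the value of a word increases proportionally to its count in a document, but is inversely proportional to how frequently the word appears across the corpus.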
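The weighting can be sketched in a few lines. This is the plain textbook formula, tf(w, d) x log(N / df(w)), written as an assumption for illustration; scikit-learn's `TfidfVectorizer` uses a smoothed idf and normalizes each row, so its numbers will differ. The sentences and the `tf_idf` helper are made up.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for words in tokenized for word in set(words))
    weights = []
    for words in tokenized:
        counts = Counter(words)
        weights.append({word: (n / len(words)) * math.log(n_docs / doc_freq[word])
                        for word, n in counts.items()})
    return weights

# hypothetical headlines
docs = ["the senate passed the bill",
        "the actor won the award",
        "the phone has a new chip"]
weights = tf_idf(docs)
print(weights[0])
```

The word "the" appears in every document, so its idf is log(3/3) = 0 and it carries no weight, while "senate" appears in only one document and keeps a positive weight.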

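Let's start with feature engineering, the process of creating features by extracting information from the data. I will use the Tf-Idf vectorizer with a limit of 10,000 words (so the vocabulary length will be 10k), capturing unigrams (i.e. "new" and "york") and bigrams (i.e. "new york"). The code for the classic count vectorizer is shown as well:

## Count (classic BoW)
vectorizer = feature_extraction.text.CountVectorizer(max_features=10000, ngram_range=(1,2))
## Tf-Idf (advanced variant of BoW)
vectorizer = feature_extraction.text.TfidfVectorizer(max_features=10000, ngram_range=(1,2))

Now I will use the vectorizer on the preprocessed corpus of the training set to extract a vocabulary and create the feature matrix.

corpus = dtf_train["text_clean"]
vectorizer.fit(corpus)
X_train = vectorizer.transform(corpus)
dic_vocabulary = vectorizer.vocabulary_

The feature matrix X_train has a shape of 34,265 (number of documents in the training set) x 10,000 (length of the vocabulary), and it is very sparse:

sns.heatmap(X_train.todense()[:,np.random.randint(0,X_train.shape[1],100)]==0, vmin=0, vmax=1, cbar=False).set_title('Sparse Matrix Sample')

Random sample from the feature matrix (non-zero values in black)

To find the position of a certain word, look it up in the vocabulary:

word = "new york"
dic_vocabulary[word]

If the word exists in the vocabulary, this command prints a number N, meaning that the Nth feature of the matrix is that word.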

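To drop some columns and reduce the dimensionality of the matrix, we can carry out some Feature Selection, the process of selecting a subset of relevant variables. I will proceed as follows:

treat each category as binary (for example, for the "Tech" category, Tech news is 1 and everything else is 0);
perform a Chi-Square test to determine whether a feature and the (binary) target are independent;
keep only the features whose p-value from the Chi-Square test passes a chosen threshold (here score = 1 - p > 0.95).

y = dtf_train["y"]
X_names = vectorizer.get_feature_names_out()   ## get_feature_names() in scikit-learn < 1.0
p_value_limit = 0.95
dtf_features = pd.DataFrame()
for cat in np.unique(y):
    chi2, p = feature_selection.chi2(X_train, y==cat)
    ## pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    dtf_features = pd.concat([dtf_features, pd.DataFrame(
                   {"feature":X_names, "score":1-p, "y":cat})])
    dtf_features = dtf_features.sort_values(["y","score"],
                    ascending=[True,False])
    dtf_features = dtf_features[dtf_features["score"]>p_value_limit]
X_names = dtf_features["feature"].unique().tolist()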

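This reduced the number of features from 10,000 to 3,152 by keeping the most statistically relevant ones. Let's print some of them:

for cat in np.unique(y):
    print("# {}:".format(cat))
    print("  . selected features:",
          len(dtf_features[dtf_features["y"]==cat]))
    print("  . top features:", ",".join(
          dtf_features[dtf_features["y"]==cat]["feature"].values[:10]))
    print(" ")

We can refit the vectorizer on the corpus by giving this new set of words as input. That produces a smaller feature matrix and a shorter vocabulary.

vectorizer = feature_extraction.text.TfidfVectorizer(vocabulary=X_names)
vectorizer.fit(corpus)
X_train = vectorizer.transform(corpus)
dic_vocabulary = vectorizer.vocabulary_

The new feature matrix X_train has a shape of 34,265 (number of documents in the training set) x 3,152 (length of the given vocabulary). Let's see whether the matrix is less sparse:

Random sample from the new feature matrix (non-zero values in black)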

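It's time to train a machine learning model and test it. I recommend the Naive Bayes algorithm: a probabilistic classifier that uses Bayes' theorem, which estimates the probability of an event based on prior knowledge of conditions that might be related to it. The algorithm is well suited to such a large dataset because it considers each feature independently, calculates the probability of each category and then predicts the category with the highest probability.

classifier = naive_bayes.MultinomialNB()

I am going to train this classifier on the feature matrix and then test it on the transformed test set. To that end, I need a scikit-learn pipeline: a sequence of transforms followed by a final estimator. Putting the Tf-Idf vectorizer and the Naive Bayes classifier in a pipeline makes it easy to transform and predict the test data.

## pipeline
model = pipeline.Pipeline([("vectorizer", vectorizer),
                           ("classifier", classifier)])
## train classifier
model["classifier"].fit(X_train, y_train)
## test
X_test = dtf_test["text_clean"].values
predicted = model.predict(X_test)
predicted_prob = model.predict_proba(X_test)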

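We can now evaluate the Bag-of-Words model with the following metrics:

Accuracy: the fraction of predictions the model got right.
Confusion matrix: a summary table that breaks down the number of correct and incorrect predictions for each class.
ROC: a plot of the true positive rate against the false positive rate at various threshold settings. The area under the curve (AUC) is the probability that the classifier ranks a randomly chosen positive observation higher than a randomly chosen negative one.
Precision: the number of correctly retrieved instances (TP) as a fraction of all retrieved instances (TP+FP).
Recall: the number of correctly retrieved instances (TP) as a fraction of all instances that should have been retrieved (TP+FN).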
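The definitions above translate directly into counts. Here is a small sketch with made-up labels; the `precision_recall` and `accuracy` helpers are hypothetical names for illustration, and scikit-learn's `metrics` module provides the real implementations used in this tutorial.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# hypothetical true and predicted labels
y_true = ["TECH", "TECH", "TECH", "POLITICS", "POLITICS", "ENTERTAINMENT"]
y_pred = ["TECH", "POLITICS", "POLITICS", "POLITICS", "POLITICS", "TECH"]

precision, recall = precision_recall(y_true, y_pred, positive="TECH")
print("accuracy:", accuracy(y_true, y_pred))
print("TECH precision:", precision, "| TECH recall:", round(recall, 2))
```

In this toy case only one of the three Tech headlines is retrieved, so recall for Tech is low even though overall accuracy looks acceptable, which is the pattern to watch for with an imbalanced dataset.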

classes = np.unique(y_test)
y_test_array = pd.get_dummies(y_test, drop_first=False).values

## Accuracy, Precision, Recall
accuracy = metrics.accuracy_score(y_test, predicted)
auc = metrics.roc_auc_score(y_test, predicted_prob,
                            multi_class="ovr")
print("Accuracy:", round(accuracy,2))
print("Auc:", round(auc,2))
print("Detail:")
print(metrics.classification_report(y_test, predicted))

## Plot confusion matrix
cm = metrics.confusion_matrix(y_test, predicted)
fig, ax = plt.subplots()
sns.heatmap(cm, annot=True, fmt='d', ax=ax, cmap=plt.cm.Blues,
            cbar=False)
ax.set(xlabel="Pred", ylabel="True", xticklabels=classes,
       yticklabels=classes, title="Confusion matrix")
plt.yticks(rotation=0)

fig, ax = plt.subplots(nrows=1, ncols=2)
## Plot roc
for i in range(len(classes)):
    fpr, tpr, thresholds = metrics.roc_curve(y_test_array[:,i],
                           predicted_prob[:,i])
    ax[0].plot(fpr, tpr, lw=3,
               label='{0} (area={1:0.2f})'.format(classes[i],
                              metrics.auc(fpr, tpr))
               )
ax[0].plot([0,1], [0,1], color='navy', lw=3, linestyle='--')
ax[0].set(xlim=[-0.05,1.0], ylim=[0.0,1.05],
          xlabel='False Positive Rate',
          ylabel="True Positive Rate (Recall)",
          title="Receiver operating characteristic")
ax[0].legend(loc="lower right")
ax[0].grid(True)

## Plot precision-recall curve
for i in range(len(classes)):
    precision, recall, thresholds = metrics.precision_recall_curve(
                 y_test_array[:,i], predicted_prob[:,i])
    ax[1].plot(recall, precision, lw=3,
               label='{0} (area={1:0.2f})'.format(classes[i],
                                  metrics.auc(recall, precision))
              )
ax[1].set(xlim=[0.0,1.05], ylim=[0.0,1.05], xlabel='Recall',
          ylabel="Precision", title="Precision-Recall curve")
ax[1].legend(loc="best")
ax[1].grid(True)
plt.show()

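The Bag-of-Words model classified 85% of the test set correctly (accuracy of 0.85), but it struggles to recognize Tech news (only 252 predicted correctly).

Let's try to understand why the model assigns news to a certain category and assess the explainability of its predictions. The lime package can help us build an explainer. To illustrate, I will take an observation from the test set and see what the explainer reveals:

## select observation
i = 0
txt_instance = dtf_test["text"].iloc[i]

## check true value and predicted value
print("True:", y_test[i], "--> Pred:", predicted[i], "| Prob:", round(np.max(predicted_prob[i]),2))

## show explanation
explainer = lime_text.LimeTextExplainer(class_names=
             np.unique(y_train))
explained = explainer.explain_instance(txt_instance,
             model.predict_proba, num_features=3)
explained.show_in_notebook(text=txt_instance, predict_proba=False)

That makes sense: the words "Clinton" and "GOP" pointed the model in the right direction (Politics news), even though the word "stage" is more common among Entertainment news.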

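Word Embedding

Word Embedding is the collective name for feature learning techniques in which words from the vocabulary are mapped to vectors of real numbers. These vectors are calculated from the probability distribution of each word appearing before or after another. Put differently, words that share a context usually appear together in the corpus, so they will also be close in the vector space. For instance, take the 3 sentences from the previous example:

Word embeddings in a 2D vector space

In this tutorial, I am going to use the pioneer of this family of models: Google's Word2Vec (2013). Other popular word embedding models are Stanford's GloVe (2014) and Facebook's FastText (2016).

Word2Vec produces a vector space, typically of several hundred dimensions, that contains every unique word in the corpus, such that words sharing common contexts in the corpus are located close to one another in the space. There are two ways to generate the embeddings: predicting the context from a given word (Skip-gram) or predicting a word from its context (Continuous Bag-of-Words).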
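The difference between the two training schemes is easiest to see in the training pairs they generate. The sketch below only builds those pairs for a toy sentence; it does not train anything, and the function names and the sentence are assumptions made for illustration.

```python
def context_of(tokens, i, window):
    """Words within `window` positions of index i, excluding the word itself."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return [tokens[j] for j in range(lo, hi) if j != i]

def skipgram_pairs(tokens, window):
    """Skip-gram: (center word -> one context word) per pair."""
    return [(center, ctx)
            for i, center in enumerate(tokens)
            for ctx in context_of(tokens, i, window)]

def cbow_pairs(tokens, window):
    """CBOW: (all context words -> target word) per pair."""
    return [(context_of(tokens, i, window), target)
            for i, target in enumerate(tokens)]

tokens = "new york is big".split()   # hypothetical sentence
sg = skipgram_pairs(tokens, window=1)
cb = cbow_pairs(tokens, window=1)
print(sg)
print(cb)
```

Skip-gram produces one example per (word, neighbor) combination, whereas CBOW produces one example per word with its whole neighborhood as input.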

In Python, you can load a pre-trained word embedding model from gensim-data like this:

nlp = gensim_api.load("word2vec-google-news-300")

Instead of using a pre-trained model, I am going to fit my own Word2Vec on the training corpus with gensim. Before fitting the model, the corpus needs to be transformed into a list of lists of n-grams. In this case, I will try to capture unigrams ("york"), bigrams ("new york") and trigrams ("new york city").

corpus = dtf_train["text_clean"]

## create list of lists of unigrams
lst_corpus = []
for string in corpus:
    lst_words = string.split()
    lst_grams = [" ".join(lst_words[i:i+1])
                 for i in range(0, len(lst_words), 1)]
    lst_corpus.append(lst_grams)

## detect bigrams and trigrams
## (gensim 3.x API: delimiter is bytes; gensim 4.x expects a str, i.e. delimiter=" ")
bigrams_detector = gensim.models.phrases.Phrases(lst_corpus,
                   delimiter=" ".encode(), min_count=5, threshold=10)
bigrams_detector = gensim.models.phrases.Phraser(bigrams_detector)
trigrams_detector = gensim.models.phrases.Phrases(bigrams_detector[lst_corpus],
                    delimiter=" ".encode(), min_count=5, threshold=10)
trigrams_detector = gensim.models.phrases.Phraser(trigrams_detector)

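When fitting the Word2Vec, you need to specify a few parameters:

the size of the word vectors, set to 300;
the window, i.e. the maximum distance between the current and the predicted word within a sentence; here I use the mean length of the texts in the corpus;
the training algorithm; I use skip-grams (sg=1), since in general it gives better results.

## fit w2v
## (gensim 3.x argument names; gensim 4.x renamed size -> vector_size and iter -> epochs)
nlp = gensim.models.word2vec.Word2Vec(lst_corpus, size=300,
            window=8, min_count=1, sg=1, iter=30)

We now have the embedding model, so we can pick any word from the corpus and transform it into a 300-dimensional vector.

word = "data"
nlp[word].shape   ## gensim 4.x: nlp.wv[word].shape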

We can even visualize a word and its context in a lower-dimensional space (2D or 3D) by applying a dimensionality reduction algorithm such as TSNE.

word = "data"
fig = plt.figure()

## word embedding
tot_words = [word] + [tupla[0] for tupla in
                 nlp.most_similar(word, topn=20)]
X = nlp[tot_words]

## tsne to reduce dimensionality from 300 to 3
## (recent scikit-learn versions may require perplexity < number of samples, 21 here)
pca = manifold.TSNE(perplexity=40, n_components=3, init='pca')
X = pca.fit_transform(X)

## create dtf
dtf_ = pd.DataFrame(X, index=tot_words, columns=["x","y","z"])
dtf_["input"] = 0
dtf_["input"].iloc[0:1] = 1

## plot 3d
from mpl_toolkits.mplot3d import Axes3D
ax = fig.add_subplot(111, projection='3d')
ax.scatter(dtf_[dtf_["input"]==0]['x'],
           dtf_[dtf_["input"]==0]['y'],
           dtf_[dtf_["input"]==0]['z'], c="black")
ax.scatter(dtf_[dtf_["input"]==1]['x'],
           dtf_[dtf_["input"]==1]['y'],
           dtf_[dtf_["input"]==1]['z'], c="red")
ax.set(xlabel=None, ylabel=None, zlabel=None, xticklabels=[],
       yticklabels=[], zticklabels=[])
for label, row in dtf_[["x","y","z"]].iterrows():
    x, y, z = row
    ax.text(x, y, z, s=label)

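That's pretty cool, but how can word embeddings help with a task like predicting the news category? The word vectors can be used as weights in a neural network. This is how:

First, transform the corpus into padded sequences of word ids to obtain a feature matrix.
Then, create an embedding matrix so that the vector of the word with id N is located at the Nth row.
Finally, build a neural network with an embedding layer that weighs every word in the sequences with the corresponding vector.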
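The three steps above can be sketched without any deep learning library. Everything here is illustrative: the tiny vocabulary, the 2-dimensional vectors and the helper names are assumptions, whereas the tutorial itself uses 300-dimensional Word2Vec vectors and a Keras `Embedding` layer.

```python
def build_embedding_matrix(vocabulary, vectors, dim):
    """Row N holds the vector of the word with id N; row 0 is the padding row."""
    matrix = [[0.0] * dim for _ in range(len(vocabulary) + 1)]
    for word, idx in vocabulary.items():
        if word in vectors:            # words without a vector stay all zeros
            matrix[idx] = list(vectors[word])
    return matrix

def embed(sequence, matrix):
    """What an embedding layer does at lookup time: one row per id."""
    return [matrix[idx] for idx in sequence]

# hypothetical vocabulary (word -> id) and word vectors
vocabulary = {"i": 1, "like": 2, "this": 3}
vectors = {"i": [0.1, 0.2], "like": [0.3, 0.4]}   # "this" has no vector

matrix = build_embedding_matrix(vocabulary, vectors, dim=2)
embedded = embed([1, 2, 3, 0], matrix)             # padded sequence of ids
print(embedded)
```

The output has shape (sequence length x vector size): a padded sequence of ids becomes a small 2D matrix that the next layers of the network consume.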

Let's start with feature engineering again, transforming the same preprocessed corpus given to the Word2Vec (the list of lists of n-grams) into a list of sequences with tensorflow/keras:

## tokenize text
tokenizer = kprocessing.text.Tokenizer(lower=True, split=' ',
                     oov_token="NaN",
                     filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')
tokenizer.fit_on_texts(lst_corpus)
dic_vocabulary = tokenizer.word_index

## create sequence
lst_text2seq = tokenizer.texts_to_sequences(lst_corpus)

## padding sequence
X_train = kprocessing.sequence.pad_sequences(lst_text2seq,
                    maxlen=15, padding="post", truncating="post")

The feature matrix X_train has a shape of 34,265 x 15 (number of sequences x maximum sequence length). Let's visualize it:

sns.heatmap(X_train==0, vmin=0, vmax=1, cbar=False)
plt.show()

Feature matrix (34,265 x 15)

Every text in the corpus is now an id sequence of length 15. For instance, if a text has 10 tokens, the sequence is made of 10 ids followed by 5 zeros, where 0 is the padding element (while the id for words not in the vocabulary is 1). Let's print how a text from the training set has been transformed into a padded sequence of ids:

i = 0

## list of text: ["I like this", ...]
len_txt = len(dtf_train["text_clean"].iloc[i].split())
print("from: ", dtf_train["text_clean"].iloc[i], "| len:", len_txt)

## sequence of token ids: [[1, 2, 3], ...]
len_tokens = len(X_train[i])
print("to: ", X_train[i], "| len:", len(X_train[i]))

## vocabulary: {"I":1, "like":2, "this":3, ...}
print("check: ", dtf_train["text_clean"].iloc[i].split()[0],
      " -- idx in vocabulary -->",
      dic_vocabulary[dtf_train["text_clean"].iloc[i].split()[0]])

print("vocabulary: ", dict(list(dic_vocabulary.items())[0:5]), "... (padding element, 0)")
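For intuition, the padding step with `padding="post"` and `truncating="post"` behaves roughly like the following standard-library sketch. `pad_post` is a hypothetical helper written for illustration and the ids are made up; it is not the Keras implementation.

```python
def pad_post(sequence, maxlen, value=0):
    """Cut the sequence at maxlen from the end, then fill the tail with `value`."""
    trimmed = sequence[:maxlen]
    return trimmed + [value] * (maxlen - len(trimmed))

# a short sequence gets zeros appended, a long one loses its last ids
padded = pad_post([12, 7, 1, 45], maxlen=6)
truncated = pad_post(list(range(1, 11)), maxlen=6)
print(padded)
print(truncated)
```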

Remember to apply the same feature engineering to the test set:

corpus = dtf_test["text_clean"]

## create list of n-grams
lst_corpus = []
for string in corpus:
    lst_words = string.split()
    lst_grams = [" ".join(lst_words[i:i+1]) for i in range(0,
                 len(lst_words), 1)]
    lst_corpus.append(lst_grams)

## detect common bigrams and trigrams using the fitted detectors
lst_corpus = list(bigrams_detector[lst_corpus])
lst_corpus = list(trigrams_detector[lst_corpus])

## text to sequence with the fitted tokenizer
lst_text2seq = tokenizer.texts_to_sequences(lst_corpus)

## padding sequence
X_test = kprocessing.sequence.pad_sequences(lst_text2seq, maxlen=15,
             padding="post", truncating="post")

X_test (14,697 x 15)

We now have X_train and X_test, so the next step is to create the embedding matrix that will be used as the weight matrix of the neural network classifier.

## start the matrix (length of vocabulary x vector size) with all 0s
embeddings = np.zeros((len(dic_vocabulary)+1, 300))
for word,idx in dic_vocabulary.items():
    ## update the row with vector
    try:
        embeddings[idx] = nlp[word]
    ## if word not in model then skip and the row stays all 0s
    except:
        pass

This code generates a matrix of shape 22,338 x 300 (length of the vocabulary extracted from the corpus x vector size). It can be navigated by word id, which can be obtained from the vocabulary.

word = "data"
print("dic[word]:", dic_vocabulary[word], "|idx")
print("embeddings[idx]:", embeddings[dic_vocabulary[word]].shape,
      "|vector")

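It's finally time to build the deep learning model! I am going to use the embedding matrix in the first Embedding layer of the neural network, which will then be trained to classify the news. Each id in the input sequence is used as the index to access the embedding matrix. The output of this Embedding layer is a 2D matrix with a word vector for each word id in the input sequence (sequence length x vector size). Take the sentence "I like this article" as an example:

My neural network is structured as follows:

an Embedding layer that, as described above, takes the sequences as input and the word vectors as weights;
a simple Attention layer that is not needed for the predictions, but that captures the weights of each instance and lets us build a nice explainer (it only provides explainability, so you can skip it). The Attention mechanism was presented in this paper (2014) as a solution to the problem of sequence models (i.e. LSTM) in understanding which parts of a long text are actually relevant;
two layers of Bidirectional LSTM to model the order of words in a sequence in both directions;
two final dense layers that predict the probability of each news category.

## code attention layer
def attention_layer(inputs, neurons):
    x = layers.Permute((2,1))(inputs)
    x = layers.Dense(neurons, activation="softmax")(x)
    x = layers.Permute((2,1), name="attention")(x)
    x = layers.multiply([inputs, x])
    return x

## input
x_in = layers.Input(shape=(15,))

## embedding
x = layers.Embedding(input_dim=embeddings.shape[0],
                     output_dim=embeddings.shape[1],
                     weights=[embeddings],
                     input_length=15, trainable=False)(x_in)

## apply attention
x = attention_layer(x, neurons=15)

## 2 layers of bidirectional lstm
x = layers.Bidirectional(layers.LSTM(units=15, dropout=0.2,
                         return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(units=15, dropout=0.2))(x)

## final dense layers
x = layers.Dense(64, activation='relu')(x)
y_out = layers.Dense(3, activation='softmax')(x)

## compile
model = models.Model(x_in, y_out)
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
model.summary()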

Now we can train the model. Before testing it on the actual test set, I will hold out a small portion of the training set as a validation set to check the model's performance.

## encode y
dic_y_mapping = {n:label for n,label in
                 enumerate(np.unique(y_train))}
inverse_dic = {v:k for k,v in dic_y_mapping.items()}
y_train = np.array([inverse_dic[y] for y in y_train])

## train
training = model.fit(x=X_train, y=y_train, batch_size=256,
                     epochs=10, shuffle=True, verbose=0,
                     validation_split=0.3)

## plot loss and accuracy
## (named lst_metrics so that it does not shadow the sklearn `metrics` module)
lst_metrics = [k for k in training.history.keys() if ("loss" not in k) and ("val" not in k)]
fig, ax = plt.subplots(nrows=1, ncols=2, sharey=True)

ax[0].set(title="Training")
ax11 = ax[0].twinx()
ax[0].plot(training.history['loss'], color='black')
ax[0].set_xlabel('Epochs')
ax[0].set_ylabel('Loss', color='black')
for metric in lst_metrics:
    ax11.plot(training.history[metric], label=metric)
ax11.set_ylabel("Score", color='steelblue')
ax11.legend()

ax[1].set(title="Validation")
ax22 = ax[1].twinx()
ax[1].plot(training.history['val_loss'], color='black')
ax[1].set_xlabel('Epochs')
ax[1].set_ylabel('Loss', color='black')
for metric in lst_metrics:
    ax22.plot(training.history['val_'+metric], label=metric)
ax22.set_ylabel("Score", color="steelblue")
plt.show()

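Nice! In some epochs the accuracy reached 0.89. To complete the evaluation of the Word Embedding model, let's predict the test set and compare the same metrics used before (the code for the metrics is the same as before).

## test
predicted_prob = model.predict(X_test)
predicted = [dic_y_mapping[np.argmax(pred)] for pred in
             predicted_prob]

The model performs about as well as the previous one. In fact, it also struggles to classify Tech news.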

But is it explainable as well? Yes, it is! I put an Attention layer in the neural network to extract the weight of each word and understand how much each contributed to classifying an instance. So I will try to use the Attention weights to build an explainer (similar to the one from the previous section):

## select observation
i = 0
txt_instance = dtf_test["text"].iloc[i]

## check true value and predicted value
print("True:", y_test[i], "--> Pred:", predicted[i], "| Prob:", round(np.max(predicted_prob[i]),2))

## show explanation
### 1. preprocess input
lst_corpus = []
for string in [re.sub(r'[^\w\s]','', txt_instance.lower().strip())]:
    lst_words = string.split()
    lst_grams = [" ".join(lst_words[i:i+1]) for i in range(0,
                 len(lst_words), 1)]
    lst_corpus.append(lst_grams)
lst_corpus = list(bigrams_detector[lst_corpus])
lst_corpus = list(trigrams_detector[lst_corpus])
## use the instance just preprocessed (lst_corpus), not the whole test corpus
X_instance = kprocessing.sequence.pad_sequences(
              tokenizer.texts_to_sequences(lst_corpus), maxlen=15,
              padding="post", truncating="post")

### 2. get attention weights
layer = [layer for layer in model.layers if "attention" in
         layer.name][0]
func = K.function([model.input], [layer.output])
weights = func(X_instance)[0]
weights = np.mean(weights, axis=2).flatten()

### 3. rescale weights, remove null vector, map word-weight
weights = preprocessing.MinMaxScaler(feature_range=(0,1)).fit_transform(np.array(weights).reshape(-1,1)).reshape(-1)
weights = [weights[n] for n,idx in enumerate(X_instance[0]) if idx
           != 0]
dic_word_weigth = {word:weights[n] for n,word in
                   enumerate(lst_corpus[0]) if word in
                   tokenizer.word_index.keys()}

### 4. barplot
top = 3   ## number of words to plot; assumed value, `top` is not defined in the source snippet
if len(dic_word_weigth) > 0:
    dtf_weights = pd.DataFrame.from_dict(dic_word_weigth, orient='index',
                                         columns=["score"])
    dtf_weights.sort_values(by="score",
           ascending=True).tail(top).plot(kind="barh",
           legend=False).grid(axis='x')
    plt.show()
else:
    print("--- No word recognized ---")

### 5. produce html visualization
text = []
for word in lst_corpus[0]:
    weight = dic_word_weigth.get(word)
    if weight is not None:
        ## the highlight markup is missing in the source; bold tags are an assumed stand-in
        text.append('<b>' + word + '</b>')
    else:
        text.append(word)
text = ' '.join(text)

### 6. visualize on notebook
print("\033[1m"+"Text with highlighted words")
from IPython.display import display, HTML
display(HTML(text))

Just like before, the words "clinton" and "gop" activated the neurons of the model, but this time "high" and "benghazi" were also considered slightly relevant for the prediction.

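Language Models

Language Models, or Contextualized/Dynamic Word Embeddings, overcome the biggest limitation of the classic Word Embedding approach: polysemy disambiguation. With classic embeddings, a word with different meanings (e.g. "bank" or "stick") is identified by just one vector. One of the first popular models was ELMO (2018), which does not apply a fixed embedding but looks at the entire sentence with a bidirectional LSTM and then assigns an embedding to each word.

Enter Transformers: a new modeling technique presented in Google's paper Attention is All You Need (2017), which showed that sequence models (like LSTM) can be replaced entirely by Attention mechanisms, even obtaining better performance.

Google's BERT (Bidirectional Encoder Representations from Transformers, 2018) then combined ELMO-style context embeddings with several Transformers, and it is bidirectional (a big novelty for Transformers). The vector BERT assigns to a word is a function of the entire sentence, so a word can have different vectors depending on the context. Let's try it with the input "bank river":

txt = "bank river"

## bert tokenizer
tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

## bert model
nlp = transformers.TFBertModel.from_pretrained('bert-base-uncased')

## return hidden layer with embeddings
input_ids = np.array(tokenizer.encode(txt))[None,:]
embedding = nlp(input_ids)
embedding[0][0]

If we change the input text to "bank money", we get this instead: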

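To complete a text classification task, you can use BERT in 3 different ways:

train it from scratch and use it as a classifier;
extract the word embeddings and use them in an embedding layer (as I did with Word2Vec);
fine-tune the pre-trained model (transfer learning).

I am going with the third option and doing transfer learning from a pre-trained lighter version of BERT, called Distil-BERT (66 million parameters instead of 110 million).

## distil-bert tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained('distilbert-base-uncased', do_lower_case=True)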

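As usual, some feature engineering is needed before fitting the model, but this time it is a little trickier. To illustrate what needs to be done, take the sentence "I like this article" again: it has to be transformed into 3 vectors (Ids, Mask, Segment):

Shape: 3 x sequence length

First of all, we need to choose the maximum sequence length. This time I will pick a much larger number (i.e. 50), because BERT splits unknown words into sub-tokens until it finds known pieces. For example, given a made-up word like "zzdata", BERT would split it into ["z", "##z", "##data"]. Moreover, we have to insert special tokens into the input text and then generate the masks and segments vectors. Finally, we put them all together in a tensor to get the feature matrix, which has the shape of 3 (ids, masks, segments) x number of documents in the corpus x sequence length.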
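The sub-token splitting can be illustrated with a greedy longest-match-first procedure, which is how WordPiece tokenization is usually described. This is a simplified sketch under a toy vocabulary that I made up so that it reproduces the "zzdata" example; the real BERT tokenizer uses a vocabulary of roughly 30,000 pieces and extra normalization steps.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split; non-initial pieces carry a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                  # try a shorter piece
        if piece is None:             # nothing matched: the whole word is unknown
            return [unk]
        pieces.append(piece)
        start = end
    return pieces

# hypothetical toy vocabulary
vocab = {"z", "##z", "##data", "data"}
print(wordpiece("zzdata", vocab))
print(wordpiece("data", vocab))
```

Because a single unknown word can expand into several pieces, the sequence of ids is often longer than the number of words, which is the reason for choosing a larger maximum length.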

Here I am using the raw text as the corpus (so far I have been using the text_clean column).

corpus = dtf_train["text"]
maxlen = 50

## add special tokens
maxqnans = int((maxlen-20)/2)   ## np.int was removed in recent NumPy versions
corpus_tokenized = ["[CLS] "+
             " ".join(tokenizer.tokenize(re.sub(r'[^\w\s]+|\n', '',
             str(txt).lower().strip()))[:maxqnans])+
             " [SEP] " for txt in corpus]

## generate masks
masks = [[1]*len(txt.split(" ")) + [0]*(maxlen - len(
           txt.split(" "))) for txt in corpus_tokenized]

## padding
txt2seq = [txt + " [PAD]"*(maxlen-len(txt.split(" "))) if len(txt.split(" ")) != maxlen else txt for txt in corpus_tokenized]

## generate idx
idx = [tokenizer.encode(seq.split(" ")) for seq in txt2seq]

## generate segments
segments = []
for seq in txt2seq:
    temp, i = [], 0
    for token in seq.split(" "):
        temp.append(i)
        if token == "[SEP]":
            i += 1
    segments.append(temp)

## feature matrix
X_train = [np.asarray(idx, dtype='int32'),
           np.asarray(masks, dtype='int32'),
           np.asarray(segments, dtype='int32')]

The feature matrix X_train has a shape of 3 x 34,265 x 50. We can check one example from the feature matrix:

i = 0
print("txt: ", dtf_train["text"].iloc[0])
print("tokenized:", [tokenizer.convert_ids_to_tokens(idx) for idx in X_train[0][i].tolist()])
print("idx: ", X_train[0][i])
print("mask: ", X_train[1][i])
print("segment: ", X_train[2][i])

Running the same code on dtf_test["text"] gives X_test.

Now I am going to build the deep learning model with transfer learning from the pre-trained BERT. Basically, I will summarize the output of BERT into one vector with Average Pooling and then add two final Dense layers to predict the probability of each news category.

If you want to use the original version of BERT, here is the code (remember to redo the feature engineering with the right tokenizer):

## inputs
idx = layers.Input((50,), dtype="int32", name="input_idx")
masks = layers.Input((50,), dtype="int32", name="input_masks")
segments = layers.Input((50,), dtype="int32", name="input_segments")

## pre-trained bert
nlp = transformers.TFBertModel.from_pretrained("bert-base-uncased")
bert_out, _ = nlp([idx, masks, segments])

## fine-tuning
x = layers.GlobalAveragePooling1D()(bert_out)
x = layers.Dense(64, activation="relu")(x)
y_out = layers.Dense(len(np.unique(y_train)),
                     activation='softmax')(x)

## compile
model = models.Model([idx, masks, segments], y_out)
for layer in model.layers[:4]:
    layer.trainable = False
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
model.summary()

Here I use the lighter Distil-BERT in place of BERT:

## inputs
idx = layers.Input((50,), dtype="int32", name="input_idx")
masks = layers.Input((50,), dtype="int32", name="input_masks")

## pre-trained bert with config
config = transformers.DistilBertConfig(dropout=0.2,
           attention_dropout=0.2)
config.output_hidden_states = False
nlp = transformers.TFDistilBertModel.from_pretrained('distilbert-base-uncased', config=config)
bert_out = nlp(idx, attention_mask=masks)[0]

## fine-tuning
x = layers.GlobalAveragePooling1D()(bert_out)
x = layers.Dense(64, activation="relu")(x)
y_out = layers.Dense(len(np.unique(y_train)),
                     activation='softmax')(x)

## compile
model = models.Model([idx, masks], y_out)
for layer in model.layers[:3]:
    layer.trainable = False
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
model.summary()

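Finally, let's train, test and evaluate the model (the evaluation code is the same as before):

## encode y
dic_y_mapping = {n:label for n,label in
                 enumerate(np.unique(y_train))}
inverse_dic = {v:k for k,v in dic_y_mapping.items()}
y_train = np.array([inverse_dic[y] for y in y_train])

## train
training = model.fit(x=X_train, y=y_train, batch_size=64,
                     epochs=1, shuffle=True, verbose=1,
                     validation_split=0.3)

## test
predicted_prob = model.predict(X_test)
predicted = [dic_y_mapping[np.argmax(pred)] for pred in
             predicted_prob]

BERT performs slightly better than the previous models, and it recognizes more Tech news than the others.

Conclusion

This article is a tutorial showing how to apply different NLP models to a multiclass classification task. I compared 3 popular approaches: Bag-of-Words with Tf-Idf, Word Embedding with Word2Vec, and Language Models with BERT. For each model I went through feature engineering and selection, model design and testing, and evaluation and explainability, comparing the 3 models at each step where possible.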


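Preface

Text is one of the most basic elements of a web page. Making content easier to read, so that readers get more information in less time, is the goal of every page author. This article introduces how to use the various text-formatting tags.

This tutorial is aimed mainly at beginners; if you are already very familiar with HTML, feel free to skip it.

Heading text

While browsing the web you often see headings that mark off sections. They are displayed at fixed font sizes, and there are 6 heading levels in total, decreasing in size from h1 to h6, as shown below:

html code:

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Headings</title>
</head>
<body>
<h1>This is heading 1</h1>
<h2>This is heading 2</h2>
<h3>This is heading 3</h3>
<h4>This is heading 4</h4>
<h5>This is heading 5</h5>
<h6>This is heading 6</h6>
</body>
</html>

Headings can be aligned with the align attribute, which takes three values (note that align is obsolete in HTML5, where the CSS text-align property is preferred):

left: left-aligned
center: centered
right: right-aligned

As shown below:

html code:

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Headings</title>
</head>
<body>
<h1>This is heading 1</h1>
<h2 align="left">This is heading 2</h2>
<h3 align="center">This is heading 3</h3>
<h4 align="right">This is heading 4</h4>
<h5>This is heading 5</h5>
<h6>This is heading 6</h6>
</body>
</html>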

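Text formatting tags

Besides headings, ordinary text is an indispensable part of a web page, and text effects can make a page look better.

Any text typed between <body> and </body> is displayed directly on the page. To control how that text is formatted, we use tags; the sections below introduce the text-formatting tags one by one.

1. Font, size and color: the font tag

In HTML 4 the <font> tag specifies the font face, font size and text color, but it is not supported in HTML5.

face attribute: font family
size attribute: font size
color attribute: font color

html code:

<html>
<body>
<div><font face="宋體">Font face</font></div>
<div><font size="5">Size 5 text</font></div>
<div><font color="red">Color</font></div>
<div><font size="5" face="arial" color="blue">All together</font></div>
</body>
</html>

Its use is discouraged in HTML5; use CSS styles instead.

2. Bold, italic, underline and strikethrough: strong, em, u, del

The result looks like this:

html code:

<!DOCTYPE html>
<html>
<body>
<p>This is normal text - <strong>this is bold text</strong>.</p>
<p>This is normal text - <em>this is italic text</em>.</p>
<p>This is normal text - <u>this is underlined text</u>.</p>
<p>This is normal text - <del>this is struck-through text</del>.</p>
</body>
</html>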

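Note: several pairs of tags render the same by default, such as strong and b, del and s, em and i. The b, s and i tags are still valid in HTML5, but they are defined by meaning rather than appearance: strong marks importance while b is only a stylistic offset, em marks emphasis while i marks an alternate voice, and del marks deleted text while s marks text that is no longer accurate. For purely visual styling, CSS is preferred. The remaining differences between HTML 4 and HTML5 are worth looking up on your own.

3. Superscript and subscript: sup, sub

The result looks like this:

html code:

<html>
<body>
<p>
Normal text <sup>superscript</sup>
</p>
<p>
Normal text <sub>subscript</sub>
</p>
<p>
Formula X<sup>3</sup> + 5X<sup>2</sup> - 5=0
</p>
<p>
Formula X<sub>1</sub> - 2X<sub>1</sub>=0
</p>
</body>
</html>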

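4. Spaces: &nbsp;

When typing text for a web page, you may add several spaces inside a paragraph and then find that they do not show up on the page. This is because in HTML the browser collapses any run of consecutive half-width whitespace into a single space. To get around this, use the space entity &nbsp; instead: each one represents one half-width space, and it can be repeated for multiple spaces.

html code:

Because Toutiao does not display the space entity, an image is used here instead.

Result:

5. Other special characters

Besides the space entity, some other special characters on a web page also have to be written as codes. In general, a special character consists of the prefix "&", the character name and the suffix ";", just like the space entity. See the table below.

There are many special characters and only a few examples are listed here; look up the full list as needed.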
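As a quick way to experiment with these codes outside the browser, Python's standard `html` module converts in both directions. This is only an illustration of the "&name;" pattern; the sample strings are made up.

```python
import html

# characters with special meaning in HTML are replaced by named entities
escaped = html.escape("<p>Tom & Jerry</p>")
print(escaped)

# named entities are turned back into the characters they stand for
copyright_sign = html.unescape("&copy;")
nbsp = html.unescape("&nbsp;")
print(copyright_sign, repr(nbsp))
```

Note that `&nbsp;` decodes to the non-breaking space character (U+00A0), not to an ordinary space, which is why browsers do not collapse it.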

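Paragraphs

To present text on a page in an orderly way you need paragraph tags. The tags related to paragraphs are introduced below.

Paragraph tag: p

On a web page, a paragraph is defined with <p>.

html code:

<html>
<body>
<p>This is a paragraph.</p>
<p>This is a paragraph.</p>
<p>This is a paragraph.</p>
<p>Paragraph elements are defined by the p tag.</p> 
</body>
</html>

Result:

Line break tag: br

Besides automatic wrapping, you can force a line break with the <br /> tag. This differs from the p tag: paragraphs are separated by a blank line, whereas br adds none, so consecutive lines sit closer together.

html code:

<html>
<body>
<p>
First paragraph<br />line 1<br />line 2<br />line 3<br />last line.
</p>
<p>
Second paragraph <br />line 1<br />line 2<br />line 3<br />last line.
</p>
</body>
</html>

The result looks like this:

If you do not want the browser to wrap a piece of text automatically, you can wrap it in a no-wrap tag, as shown below:

That line of text is not wrapped automatically, and a horizontal scrollbar appears.

Preserving the original layout: pre

Sometimes a page needs to keep a special layout, and controlling it with individual tags would be tedious. The <pre> tag preserves the formatting of the text as typed. As shown below:

html code:

<html>
<body>
<pre>
This is
preformatted text.
It preserves      spaces
and line breaks.
</pre>
<p>The pre tag is well suited to displaying computer code:</p>
<pre>
for i=1 to 10
     print i
next i
</pre>
<p>This is an "ok" effect</p>
<pre>
  O O    k  K
 O   O   K K
  O O    K  K
</pre>
</body>
</html>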

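Other tags

Indentation: blockquote

<blockquote> indents a block of text. Each use indents the block one more level, and the tag can be nested.

Example code:

<html>
<body>
Here comes a long quotation:
<blockquote>
This is a long quotation. This is a long quotation. This is a long quotation. This is a long quotation. This is a long quotation.
</blockquote>
Note that the browser adds line breaks before and after the blockquote element and increases the margins.
</body>
</html>

The result looks like this:

Note that the browser adds line breaks before and after the blockquote element and increases the margins.

Horizontal rule: hr

A horizontal line placed between paragraphs separates them. The effect is shown below:

html code:

<html>
<body>
<p>The hr tag defines a horizontal rule:</p>
<hr />
<p>This is a paragraph.</p>
<hr />
<p>This is a paragraph.</p>
<hr />
<p>This is a paragraph.</p>
</body>
</html>

Text annotation: ruby

On a web page you can annotate a piece of text to explain it.

The result looks like this:

html code:

<!DOCTYPE HTML>
<html>
<body>
<p>ruby syntax:</p>
<ruby>
 annotated text <rt> annotation </rt>
</ruby>
</body>
</html>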

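Other tags: var, code, kbd and more

<dfn>	Defines a definition term.
<code>	Defines computer code text.
<samp>	Defines sample output text.
<kbd>	Defines keyboard text, i.e. text typed on the keyboard. It is often used in computer-related documentation and manuals.
<var>	Defines a variable. It can be combined with the <pre> and <code> tags.
<cite>	Defines a citation. Use it to mark references such as the title of a book or magazine.

Summary

This article covered most of the commonly used text-formatting tags, which come up constantly when building web pages. Mastering them is simple: use a text editor or an online tool with live preview such as w3cschool, write the tags out by hand and get familiar with what each one does. There is no need for rote memorization; the key is understanding.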


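Preface

I previously wrote about short text classification with pytorch, where the data handling was rather crude. Natural language processing covers many tasks, and handling every dataset that way quickly becomes tedious and time-consuming. In pytorch, the well-known data processing package is torchvision, which handles images; packages for text are mentioned less often, but one does exist for quickly processing text data: torchtext[1]. Building on the previous case study, 【深度學習】textCNN論文與原理——短文本分類(基于pytorch)[2], this post uses torchtext to preprocess the text data and then feeds the torchtext data to the classification model.

For the basics of torchtext, besides the official documentation, you can also read this article: TorchText用法示例及完整代碼[3].

Let's see how the processing is done.

1 Data processing

First, import the packages:

import random
import torch
from torchtext import data   # legacy API: lives in torchtext.legacy.data in torchtext 0.9 to 0.11

# project-local modules shown later in this post
import config
import dataloader
import utils

The corpus we are processing involves two things: the text and the category of each text. We use torchtext to define these two fields:

# text content: custom tokenizer, lowercasing, fixed maximum length, etc.
TEXT = data.Field(tokenize=utils.en_seg, lower=True, fix_length=config.MAX_SENTENCE_SIZE, batch_first=True)
# label of each text
LABEL = data.LabelField(dtype=torch.float)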

Some of the parameters live in a config.py file, shown below:

# model parameters
RANDOM_SEED = 1000  # random seed
BATCH_SIZE = 128    # batch size
LEARNING_RATE = 1e-3   # learning rate
EMBEDDING_SIZE = 200   # word vector dimension
MAX_SENTENCE_SIZE = 50  # maximum sentence length
EPOCH = 20            # number of training epochs

# corpus paths
NEG_CORPUS_PATH = './corpus/neg.txt'
POS_CORPUS_PATH = './corpus/pos.txt'

utils.en_seg is a custom tokenization function:

def en_seg(sentence):
    """
    A simple English tokenizer.
    :param sentence: the sentence to tokenize
    :return: the list of tokens
    """
    return sentence.split()

You could of course write something more elaborate, or use spacy. Next comes the function that reads the text data into torchtext objects, so that the torchtext methods can be used on it:

def get_dataset(corpus_path, text_field, label_field, datatype):
    """
    Build the torchtext examples.
    :param corpus_path: path to the data
    :param text_field: torchtext field for the text
    :param label_field: torchtext field for the label
    :param datatype: category of the texts ('pos' or anything else for negative)
    :return: the list of torchtext examples and the fields
    """
    fields = [('text', text_field), ('label', label_field)]
    examples = []
    with open(corpus_path, encoding='utf8') as reader:
        for line in reader:
            content = line.rstrip()
            if datatype == 'pos':
                label = 1
            else:
                label = 0
            # content[:-2] drops the trailing space and period of the raw text,
            # then the data is matched to the fields
            examples.append(data.Example.fromlist([content[:-2], label], fields))

    return examples, fields

現在就可以獲取torchtext格式的數據了，如下：

# 構建data數據
pos_examples, pos_fields = dataloader.get_dataset(config.POS_CORPUS_PATH, TEXT, LABEL, 'pos')
neg_examples, neg_fields = dataloader.get_dataset(config.NEG_CORPUS_PATH, TEXT, LABEL, 'neg')
all_examples, all_fields = pos_examples + neg_examples, pos_fields + neg_fields

# 構建torchtext類型的數據集
total_data = data.Dataset(all_examples, all_fields)

有了上面的數據，下面就可以快速地為準備模型需要的數據了，如切分，構造批次數據，獲取字典等，如下：


# 數據集切分
train_data, test_data = total_data.split(random_state=random.seed(config.RANDOM_SEED), split_ratio=0.8)

# 切分后的數據查看
# # 數據維度查看
print('len of train data: %r' % len(train_data))  # len of train data: 8530
print('len of test data: %r' % len(test_data))  # len of test data: 2132

# # 抽一條數據查看
print(train_data.examples[100].text)
# ['never', 'engaging', ',', 'utterly', 'predictable', 'and', 'completely', 'void', 'of', 'anything', 'remotely',
# 'interesting', 'or', 'suspenseful']
print(train_data.examples[100].label)
# 0

# 為該樣本數據構建字典，并將子每個單詞映射到對應數字
TEXT.build_vocab(train_data)
LABEL.build_vocab(train_data)

# 查看字典長度
print(len(TEXT.vocab))  # 19206
# 查看字典中前10個詞語
print(TEXT.vocab.itos[:10])  # ['<unk>', '<pad>', ',', 'the', 'a', 'and', 'of', 'to', '.', 'is']
# 查找'name'這個詞對應的詞典序號, 本質是一個dict
print(TEXT.vocab.stoi['name'])  # 2063

# 構建迭代(iterator)類型的數據
train_iterator, test_iterator = data.BucketIterator.splits((train_data, test_data),
                                                           batch_size=config.BATCH_SIZE,
                                                           sort=False)
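
The sizes printed by the split step above (8530 and 2132) are what an 80/20 cut of 10662 examples gives. A minimal sketch of a reproducible split in plain Python follows; `split_indices` and the seed value 1234 are illustrative choices, and this is not torchtext's own implementation.

```python
import random


def split_indices(n, split_ratio=0.8, seed=1234):
    """Shuffle range(n) reproducibly and cut it into train/test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)      # private RNG, global state untouched
    train_len = int(round(split_ratio * n))   # 0.8 * 10662 = 8529.6, rounded up
    return indices[:train_len], indices[train_len:]


train_idx, test_idx = split_indices(10662)
print(len(train_idx), len(test_idx))          # 8530 2132

# The same seed yields the same partition on every call.
again_train, _ = split_indices(10662)
print(train_idx == again_train)               # True
```

Using a private `random.Random(seed)` instance keeps the split independent of whatever else in the program touches the global random state.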

Seen this way, torchtext saves us a good deal of code. What follows is the familiar routine of model prediction and evaluation.
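
For a sense of what is being saved, the vocabulary and batching steps could be written by hand roughly as below. This is a simplified sketch in plain Python, not torchtext's code: the function names, the alphabetical tie-breaking, and padding to the longest sentence in the batch are all assumptions made for the example.

```python
from collections import Counter


def build_vocab(tokenized_texts, specials=('<unk>', '<pad>')):
    """Return (itos, stoi): specials first, then tokens by descending frequency."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    # Sort by frequency; break ties alphabetically so the order is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi


def numericalize(batch, stoi, pad_token='<pad>', unk_token='<unk>'):
    """Map tokens to ids and right-pad each sentence to the longest in the batch."""
    max_len = max(len(text) for text in batch)
    unk, pad = stoi[unk_token], stoi[pad_token]
    return [[stoi.get(tok, unk) for tok in text] + [pad] * (max_len - len(text))
            for text in batch]


# Toy training corpus, made up for the example.
train_texts = [['the', 'film', 'is', 'good'],
               ['the', 'plot', 'is', 'thin'],
               ['the', 'end']]
itos, stoi = build_vocab(train_texts)
print(itos[:4])   # ['<unk>', '<pad>', 'the', 'is']

# 'ends' never appeared in training, so it maps to <unk> (index 0).
ids = numericalize([['the', 'film', 'ends'], ['good']], stoi)
print(ids)        # [[2, 5, 0], [6, 1, 1]]
```

As in the `TEXT.vocab.itos` output shown earlier, the special tokens occupy the first two indices and frequent tokens come next.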

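2 Building and training the model

The theory behind the model was covered in the earlier post; look back at it if you need a refresher. The model itself is unchanged: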

import torch
from torch import nn

import config


class TextCNN(nn.Module):
    # output_size is the number of output classes (2 here: 0 and 1); there are three kernel sizes (3, 4, 5) with 100 filters each
    def __init__(self, vocab_size, embedding_dim, output_size, filter_num=100, kernel_list=(3, 4, 5), dropout=0.5):
        super(TextCNN, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # 1 is the number of input channels, filter_num the number of output channels; each kernel has shape (kernel, embedding_dim)
        self.convs = nn.ModuleList([
            nn.Sequential(nn.Conv2d(1, filter_num, (kernel, embedding_dim)),
                          nn.LeakyReLU(),
                          nn.MaxPool2d((config.MAX_SENTENCE_SIZE - kernel + 1, 1)))
            for kernel in kernel_list
        ])
        self.fc = nn.Linear(filter_num * len(kernel_list), output_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        x = self.embedding(x)  # [128, 50, 200] (batch, seq_len, embedding_dim)
        x = x.unsqueeze(1)  # [128, 1, 50, 200], i.e. (batch, channel_num, seq_len, embedding_dim)
        out = [conv(x) for conv in self.convs]
        out = torch.cat(out, dim=1)  # [128, 300, 1, 1]: concatenate the branches along the channel axis
        out = out.view(x.size(0), -1)  # flatten to [128, 300]
        out = self.dropout(out)  # apply dropout
        logits = self.fc(out)  # output logits, [128, 2]
        return logits
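
Because the pooling window is built from `config.MAX_SENTENCE_SIZE`, the network only produces the 300 features that `self.fc` expects when every input is padded or truncated to exactly that length. The sketch below traces the shapes in plain Python, without PyTorch. `textcnn_shapes` is a hypothetical helper, and the pooling formula assumes `nn.MaxPool2d` defaults (stride equal to the window, no padding, floor rounding).

```python
def textcnn_shapes(batch, seq_len, kernel_list=(3, 4, 5), filter_num=100, pool_len=None):
    """Trace tensor shapes through the conv/pool branches of the TextCNN above.

    pool_len is the sentence length baked into MaxPool2d
    (config.MAX_SENTENCE_SIZE in the class); it defaults to seq_len.
    """
    pool_len = seq_len if pool_len is None else pool_len
    branches = []
    for kernel in kernel_list:
        conv_h = seq_len - kernel + 1                # valid convolution, stride 1
        window = pool_len - kernel + 1               # MaxPool2d window height
        pooled_h = (conv_h - window) // window + 1   # stride defaults to the window
        branches.append((batch, filter_num, pooled_h, 1))
    flat = sum(c * h * w for _, c, h, w in branches)  # features after cat + view
    return branches, flat


# Inputs of the expected length: every branch pools down to height 1.
branches, flat = textcnn_shapes(batch=128, seq_len=50)
print(branches)    # [(128, 100, 1, 1), (128, 100, 1, 1), (128, 100, 1, 1)]
print(flat)        # 300, matching filter_num * len(kernel_list)

# Inputs twice as long as the pooling window was built for.
_, flat_long = textcnn_shapes(batch=128, seq_len=100, pool_len=50)
print(flat_long)   # 600, which no longer matches the 300 inputs of self.fc
```

A common alternative, not used in this post, is to pool over the whole time axis with something like `nn.AdaptiveMaxPool2d((1, 1))`, which removes the dependence on a fixed sentence length.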

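For training and testing I wrote two functions, `train` and `evaluate`, built on a small accuracy helper; these are also the same as before: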

def binary_acc(pred, y):
    """
    Compute the model's accuracy
    :param pred: predicted labels
    :param y: true labels
    :return: the accuracy
    """
    correct = torch.eq(pred, y).float()
    acc = correct.sum() / len(correct)
    return acc


def train(model, train_data, optimizer, criterion):
    """
    Train the model for one epoch
    :param model: the model to train
    :param train_data: training data (batch iterator)
    :param optimizer: optimizer
    :param criterion: loss function
    :return: mean of the per-batch accuracies for this epoch
    """
    avg_acc = []
    model.train()       # switch to training mode
    for i, batch in enumerate(train_data):
        pred = model(batch.text)
        loss = criterion(pred, batch.label.long())
        acc = binary_acc(torch.max(pred, dim=1)[1], batch.label)
        avg_acc.append(acc.item())  # store a Python float so np.array also works with GPU tensors
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Average over all batches
    avg_acc = np.array(avg_acc).mean()
    return avg_acc


def evaluate(model, test_data):
    """
    Evaluate the model on the test data
    :param model: the model
    :param test_data: test data (batch iterator)
    :return: mean of the per-batch accuracies on the test data
    """
    avg_acc = []
    model.eval()  # switch to evaluation mode
    with torch.no_grad():
        for i, batch in enumerate(test_data):
            pred = model(batch.text)
            acc = binary_acc(torch.max(pred, dim=1)[1], batch.label)
            avg_acc.append(acc.item())  # store a Python float so np.array also works with GPU tensors
    return np.array(avg_acc).mean()
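
Both functions report the mean of the per-batch accuracies. That equals the accuracy over the whole dataset only when every batch has the same size; a smaller final batch gets the same weight as a full one. The toy example below, with made-up predictions and labels, shows the difference.

```python
def accuracy(preds, labels):
    """Fraction of positions where the prediction equals the label."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Two batches of unequal size, as (predictions, labels). Values are made up.
batches = [
    ([1, 0, 1, 1], [1, 0, 1, 1]),   # 4 of 4 correct
    ([0, 1], [0, 0]),               # 1 of 2 correct
]

# What train() and evaluate() compute: the unweighted mean over batches.
mean_of_batches = sum(accuracy(p, y) for p, y in batches) / len(batches)

# Accuracy over all examples: count correct predictions, divide by the total.
correct = sum(p == y for preds, labels in batches for p, y in zip(preds, labels))
total = sum(len(labels) for _, labels in batches)
overall = correct / total

print(mean_of_batches)     # 0.75
print(round(overall, 4))   # 0.8333
```

With thousands of examples and only one short final batch the gap is small, so the per-batch mean is a reasonable approximation for monitoring training.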

Import whatever packages these functions need. Next we create the model, then train and test it. Nerve-racking as ever, here we go again.

# Create the model
text_cnn = model.TextCNN(len(TEXT.vocab), config.EMBEDDING_SIZE, len(LABEL.vocab))
# Choose the optimizer
optimizer = optim.Adam(text_cnn.parameters(), lr=config.LEARNING_RATE)
# Choose the loss function
criterion = nn.CrossEntropyLoss()

# Accuracy history, used for plotting
model_train_acc, model_test_acc = [], []

# Train the model
for epoch in range(config.EPOCH):
    train_acc = utils.train(text_cnn, train_iterator, optimizer, criterion)
    print("epoch = {}, train accuracy = {}".format(epoch + 1, train_acc))

    test_acc = utils.evaluate(text_cnn, test_iterator)
    print("epoch = {}, test accuracy = {}".format(epoch + 1, test_acc))

    model_train_acc.append(train_acc)
    model_test_acc.append(test_acc)

# Plot the training process
plt.plot(model_train_acc)
plt.plot(model_test_acc)
plt.ylim(0.5, 1.01)  # positional limits; the ymin/ymax keywords are gone from recent matplotlib
plt.title("The accuracy of textCNN model")
plt.legend(['train', 'test'])
plt.show()

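The final result is shown below:

Figure: model training process

The result is not much different from before, but the data processing took far less time and is more standardized. So torchtext is worth taking the time to learn.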

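3 Summary

torchtext supports a fair number of natural language processing tasks and also ships with some datasets of its own. I have recently been working on named entity recognition with a bi-lstm+crf model. That task is essentially sequence labeling, which torchtext can also handle; I will cover it when time allows, so stay tuned. All the code and corpora for this post are on GitHub: https://github.com/Htring/NLP_Applications[4]. Code for other related applications will be added over time; stars and suggestions are welcome.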

References

[1] torchtext: https://pytorch.org/text/stable/index.html

[2] [Deep Learning] textCNN paper and principles: short text classification (PyTorch-based): https://piqiandong.blog.csdn.net/article/details/110149143

[3] TorchText usage examples and complete code: https://blog.csdn.net/nlpuser/article/details/88067167

[4] NLP_Applications: https://github.com/Htring/NLP_Applications

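First published on the WeChat official account 【AIAS編程有道】, also posted on Toutiao.

Original writing takes effort, so 科皮子菊 would appreciate a follow, a share, or a comment. Thank you for your criticism and guidance; your support is what keeps me publishing on Toutiao. I am 科皮子菊, a lover of programming and algorithms. See you in the next post!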
