Word Embedding and Text Classification Performance
This report investigates using a recurrent neural network (RNN) model for classifying sentences extracted from Chinese Wikipedia articles. We evaluate the classification performance across various Byte-Pair Encoding (BPE) settings and RNN architectures, finding that BPE settings significantly influence classification outcomes.
Classification Problem and Data
Inspired by an example from Bernard’s Introduction to Machine Learning, this study applies RNNs to classify sentences from Chinese Wikipedia articles into three topics: 物理学 (physics), 数学 (mathematics), and 生物学 (biology). For training, we use three articles corresponding to these topics, while three additional articles on 万有引力 (gravity), 群论 (group theory), and 细胞 (cell) serve as the validation set. Each article is split into sentences using the Chinese period “。”.
We employ the pre-trained subword embeddings provided by BPEmb, which are BPE-based embeddings trained on Wikipedia data for many languages, including Chinese. Of the available configurations, we use vocabulary sizes of 50000, 100000, and 200000 tokens and embedding vector dimensions of 25, 100, and 300.
The articles are processed into tokens using BPEmb’s encoder. The frequency distribution of these tokens aligns with a power law and fits well with the Zipf distribution, as illustrated in Figure 1.
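This frequency analysis can be sketched as follows. The snippet is an illustrative reconstruction rather than the original analysis code: it reuses the SplitToSentences helper from the code appendix and assumes that NetExtract can pull the subword tokenizer from the BPEmb net’s input port and that applying it to a sentence returns integer token codes.
(*a minimal sketch: tally BPE token frequencies for one article and inspect the rank-frequency curve*)
bpe = NetModel[{"BPEmb Subword Embeddings Trained on Wikipedia Data",
    "Language" -> "Chinese", "Dimensions" -> 100, "VocabularySize" -> 100000}];
tokenizer = NetExtract[bpe, "Input"]; (*assumed: the input NetEncoder acts as the subword tokenizer*)
sentences = SplitToSentences[WikipediaData["物理学", Language -> "Chinese"]];
tokenCounts = ReverseSort[Counts[Flatten[tokenizer /@ sentences]]]; (*token code -> frequency, most frequent first*)
ListLogLogPlot[Values[tokenCounts], AxesLabel -> {"rank", "frequency"}] (*a roughly straight line suggests Zipf-like behavior*)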
Tokens appearing more frequently in the articles are likely to be more representative of their topics. We select tokens at or above the 90th percentile frequency for each topic. Figure 2 shows that most selected tokens are topic-specific, although some common tokens, such as punctuation marks, are included.
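A minimal sketch of this selection step, reusing the tokenCounts association from the snippet above; the 90th-percentile cutoff is taken over the token frequencies with Quantile.
(*keep token codes whose frequency is at or above the 90th percentile*)
cutoff = Quantile[Values[tokenCounts], 0.9];
frequentTokens = Keys[Select[tokenCounts, # >= cutoff &]];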
The selected tokens are also visualized in a two-dimensional scatter plot in Figure 3. While several clusters are visible, they do not correspond to distinct topics, indicating that a more sophisticated classification approach may be necessary.
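A layout like Figure 3 can be approximated along the following lines. This is a sketch: embeddingsByTopic is an assumed association mapping each topic to the embedding vectors of its selected tokens, and DimensionReduce stands in for whatever projection produced the original figure.
(*embeddingsByTopic is assumed: topic -> list of embedding vectors of its frequent tokens*)
reduced = DimensionReduce[Flatten[Values[embeddingsByTopic], 1], 2]; (*project all vectors to 2D*)
grouped = TakeList[reduced, Length /@ Values[embeddingsByTopic]]; (*regroup the projected points by topic*)
ListPlot[grouped, PlotLegends -> Keys[embeddingsByTopic], AspectRatio -> 1]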
RNN Model
Figure 4 presents a summary of the RNN model for the classification task, using a long short-term memory (LSTM) layer.
The training dynamics, including loss and error rate over epochs, are depicted in Figure 5.
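Curves like those in Figure 5 can be regenerated from the NetTrain results object returned by runmodel in the code appendix. The sketch below shows the loss only and assumes the "RoundLossList" and "ValidationLossList" properties of the results object.
(*train one setting and plot its loss curves; the property names are assumptions*)
run = runmodel[rnn, 100, 100000];
ListLinePlot[{run["result"]["RoundLossList"], run["result"]["ValidationLossList"]},
  PlotLegends -> {"training loss", "validation loss"}, AxesLabel -> {"round", "loss"}]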
Effect of Embedding Setting and RNN Model
We assessed the effects of vocabulary sizes and embedding vector dimensions using BPEmb with our RNN model. The error rates obtained on the validation set are summarized in the table below. Notably, the embedding dimension has a more pronounced effect on performance than vocabulary size. Smaller embedding dimensions yield higher error rates, while the largest dimensions do not necessarily enhance performance. This could be attributed to the limited size of the training data, which may benefit from a moderate embedding dimension. The combination of a vocabulary size of 100000 and an embedding dimension of 100 resulted in the lowest error rate.
| | Dimension=25 | Dimension=100 | Dimension=300 |
|---|---|---|---|
| Vocabulary Size=50K | 23.4% | 15.5% | 17.6% |
| Vocabulary Size=100K | 23.8% | 10.3% | 17.9% |
| Vocabulary Size=200K | 22.1% | 17.6% | 14.1% |
Figure 6 illustrates the validation set’s error rate across different vocabulary sizes and embedding dimensions, highlighting that the best performance is achieved with a vocabulary size of 100000 and an embedding dimension of 100. Figure 7 presents the confusion matrix for this optimal BPEmb setting.
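Given the result grid produced by the code in the appendix, the error rates above and the confusion matrix can be extracted roughly as follows. This is a sketch that assumes the "ValidationMeasurements" and "TrainedNet" properties of NetTrainResultsObject and the "ConfusionMatrixPlot" measurement of NetMeasurements.
(*validation error rate for each (dimension, vocabulary size) combination in the result grid*)
errorRates = Map[#["result"]["ValidationMeasurements"]["ErrorRate"] &, result, {2}];
(*confusion matrix for the best setting: dimension 100 and vocabulary size 100000, i.e. result[[2, 2]]*)
data = makedata[100, 100000];
NetMeasurements[result[[2, 2]]["result"]["TrainedNet"], data["validation"], "ConfusionMatrixPlot"]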
We further evaluated different RNN models under the optimal BPEmb setting. Figure 8 summarizes the architectures of four different models evaluated.
The differences in model architecture and their corresponding validation error rates are detailed in the table below. While the BPEmb settings had a significant impact, variations among the RNN models showed only a mild effect on performance.
| Model | LSTM Dimension | Dropout Layer | Validation Error Rate (%) |
|---|---|---|---|
| Model1 | 50 | No | 10.3 |
| Model2 | 10 | No | 16.2 |
| Model3 | 100 | No | 12.4 |
| Model4 | 50 | Yes | 11.4 |
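The four variants can be written as small modifications of the base network in the code appendix. The sketch below is one possible formulation: the helper makeRNN and the placement of the DropoutLayer after the sequence-pooling layer are assumptions, not taken from the original code, and classes is the list defined in the appendix.
(*makeRNN is an assumed helper: build an LSTM classifier with a given state dimension, optionally with dropout*)
makeRNN[lstmDim_, dropout_] := NetChain[
  Join[
    <|"LSTM" -> LongShortTermMemoryLayer[lstmDim], "Last" -> SequenceLastLayer[]|>,
    If[dropout, <|"Dropout" -> DropoutLayer[]|>, <||>],
    <|"Linear" -> LinearLayer[3], "Softmax" -> SoftmaxLayer[]|>],
  "Output" -> NetDecoder[{"Class", classes}]]
{model1, model2, model3, model4} = {makeRNN[50, False], makeRNN[10, False], makeRNN[100, False], makeRNN[50, True]};
Each variant can then be trained under the optimal BPEmb setting with the appendix function runmodel, for example runmodel[model2, 100, 100000].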
Conclusion
This report has examined the influence of various Byte-Pair Encoding (BPE) settings and recurrent neural network (RNN) architectures on the classification of sentences from Chinese Wikipedia articles across three distinct topics: physics, mathematics, and biology. Our findings highlight the critical role of BPE configurations in determining classification performance, particularly the vocabulary size and embedding dimensionality. This emphasizes the necessity of carefully selecting embedding parameters when applying RNN models for text classification, especially in contexts with limited data resources.
Code
The following is the Wolfram Language code used in this study.
(*Split string in Chinese into sentences*)
SplitToSentences[string_]:= StringSplit[string, "。"]
(*function to generate data for training and validation*)
makedata[dimensions_, vocsize_]:= Module[{bpe, train, validation},
classes = {"数学", "物理学", "生物学"};
bpe = NetModel[{"BPEmb Subword Embeddings Trained on Wikipedia Data",
"Language" -> "Chinese",
"Dimensions"->dimensions,
"VocabularySize"->vocsize}]; (*BPEmb embedding*)
train = SplitToSentences[WikipediaData[#, Language->"Chinese"]] & /@ classes;
train = Map[bpe, train, {2}]; (*embedding*)
train = Flatten[Thread /@ Thread[train -> classes]];
validation = SplitToSentences[WikipediaData[#, Language->"Chinese"]] & /@ {"群论","万有引力", "细胞"};
validation = Map[bpe, validation, {2}]; (*embedding*)
validation = Flatten[Thread /@ Thread[validation -> classes]];
<|"training"->train, "validation"->validation|>
]
(*function to train a network model with a given embedding dimension and vocabulary size*)
runmodel[net_, dimensions_, vocsize_]:=Module[{data},
data = makedata[dimensions, vocsize];
<|"Dimensions"->dimensions, "VocabularySize"->vocsize,
"result"->NetTrain[net, data["training"], All, ValidationSet->data["validation"], MaxTrainingRounds->30]|>
]
(*RNN model with LSTM*)
classes = {"数学", "物理学", "生物学"};
rnn = NetChain[<|"LSTM"->LongShortTermMemoryLayer[50], "Last"->SequenceLastLayer[], "Linear"->LinearLayer[3], "Softmax"->SoftmaxLayer[]|>,
"Output"->NetDecoder[{"Class", classes}]]
(* classification using the RNN model with various embedding dimensions and vocabulary sizes *)
vocsize = {50000, 100000, 200000};
dim = {25, 100, 300};
result = Outer[runmodel[rnn, #1, #2]&, dim, vocsize];
References
Bernard, E. (2021). Introduction to Machine Learning. Wolfram Media, Inc.
Heinzerling, B. and Strube, M. BPEmb: Subword Embeddings in 275 Languages. https://bpemb.h-its.org