This report investigates using a recurrent neural network (RNN) model for classifying sentences extracted from Chinese Wikipedia articles. We evaluate the classification performance across various Byte-Pair Encoding (BPE) settings and RNN architectures, finding that BPE settings significantly influence classification outcomes.

Classification Problem and Data

Inspired by an example from Bernard’s Introduction to Machine Learning, this study applies RNNs to classify sentences from Chinese Wikipedia articles into three topics: 物理学 (physics), 数学 (mathematics), and 生物学 (biology). For training, we use three articles corresponding to these topics, while three additional articles on 万有引力 (gravity), 群论 (group theory), and 细胞 (cell) serve as the validation set. Each article is split into sentences at the Chinese period “。”.

We employ the pre-trained subword embeddings provided by BPEmb, which applies BPE to Wikipedia text in many languages, including Chinese. Among the available configurations, we use vocabulary sizes of 50000, 100000, and 200000 tokens combined with embedding vector dimensions of 25, 100, and 300.
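As a concrete illustration, the following minimal sketch mirrors the makedata function given in the Code section at the end of this report: it fetches one training article, splits it into sentences at the Chinese period, and embeds each sentence with one BPEmb configuration. The variable names are illustrative only.

(*fetch, split, and embed one training article; NetModel name and options as in the Code section*)
bpe = NetModel[{"BPEmb Subword Embeddings Trained on Wikipedia Data",
    "Language" -> "Chinese", "Dimensions" -> 100, "VocabularySize" -> 100000}];
sentences = DeleteCases[StringSplit[WikipediaData["物理学", Language -> "Chinese"], "。"], ""]; (*drop empty strings left by trailing periods*)
embedded = bpe /@ sentences; (*each sentence becomes a sequence of 100-dimensional vectors*)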

The articles are processed into tokens using BPEmb’s encoder. The frequency distribution of these tokens aligns with a power law and fits well with the Zipf distribution, as illustrated in Figure 1.

Figure 1. Token frequency distribution for the training set articles, fitted to a Zipf distribution. The tokens are generated with a vocabulary size of 100000 and an embedding vector dimension of 100.
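The rank-frequency plot of Figure 1 can be reproduced along the following lines. This is a sketch rather than the original plotting code: it assumes the subword token indices are obtained by extracting the net’s input NetEncoder with NetExtract, and it fits the Zipf form c/r^s by nonlinear regression.

(*rank-frequency plot of BPE token counts with a fitted Zipf curve; "bpe" and "sentences" as in the sketch above*)
enc = NetExtract[bpe, "Input"]; (*NetEncoder mapping a string to subword token indices*)
counts = ReverseSort[Counts[Flatten[enc /@ sentences]]];
freqs = N[Values[counts]];
zipf = NonlinearModelFit[Transpose[{Range[Length[freqs]], freqs}], c/r^s, {{c, First[freqs]}, {s, 1}}, r];
Show[ListLogLogPlot[freqs], LogLogPlot[zipf[r], {r, 1, Length[freqs]}, PlotStyle -> Red]]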

Tokens appearing more frequently in the articles are likely to be more representative of their topics. We select tokens at or above the 90th percentile frequency for each topic. Figure 2 shows that most selected tokens are topic-specific, although some common tokens, such as punctuation marks, are included.

Figure 2. Word cloud of BPE tokens from the training set, displaying only those at or above the 90th percentile frequency.
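The percentile filter itself is short. The sketch below assumes topicCounts is an association from each topic to an association of token -> frequency for that topic’s article; building it requires the BPE tokenizer as in the previous sketch.

(*keep, for each topic, the tokens at or above the 90th-percentile frequency and draw word clouds as in Figure 2*)
selectTop[counts_] := With[{cut = Quantile[Values[counts], 0.9]}, Select[counts, # >= cut &]];
topTokens = selectTop /@ topicCounts;
WordCloud /@ topTokens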

The selected tokens are also visualized in a two-dimensional scatter plot in Figure 3. While several clusters are visible, they do not correspond to distinct topics, indicating that a more sophisticated classification approach may be necessary.

Figure 3. Two-dimensional representation of the BPE tokens at or above the 90th percentile frequency.
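A plot like Figure 3 can be produced by projecting the embedding vector of each selected token to two dimensions. The use of DimensionReduce, and averaging the embedding sequence that bpe returns for a token string, are assumptions; any two-dimensional projection would serve.

(*two-dimensional projection of the embeddings of the selected tokens, grouped by topic; "topTokens" from the sketch above*)
topics = Keys[topTokens];
tokenLists = Keys /@ Values[topTokens]; (*token strings per topic*)
points = DimensionReduce[Mean[bpe[#]] & /@ Flatten[tokenLists], 2];
grouped = TakeList[points, Length /@ tokenLists]; (*regroup the projected points by topic*)
ListPlot[grouped, PlotLegends -> topics]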

RNN Model

Figure 4 presents a summary of the RNN model for the classification task, using a long short-term memory (LSTM) layer.

Figure 4. RNN architecture for text topic classification.

The training dynamics, including loss and error rate over epochs, are depicted in Figure 5.

Figure 5. Loss and error rate of the RNN model over training epochs. BPEmb parameters: embedding vector dimension = 100, vocabulary size = 100000.
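The curves in Figure 5 are read from the NetTrain results object returned by runmodel in the Code section. A minimal sketch of retrieving the validation error-rate history is shown below; the "ValidationMeasurementsLists" property name is an assumption about the results-object interface.

(*validation error rate per training round for the dimension-100 / vocabulary-100000 setting; "rnn" and "runmodel" as in the Code section*)
res = runmodel[rnn, 100, 100000];
ListLinePlot[res["result"]["ValidationMeasurementsLists"]["ErrorRate"],
  AxesLabel -> {"round", "validation error rate"}]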

Effect of Embedding Setting and RNN Model

We assessed the effects of vocabulary size and embedding vector dimension using BPEmb with our RNN model. The validation-set error rates are summarized in the table below. Notably, the embedding dimension has a more pronounced effect on performance than the vocabulary size. Small embedding dimensions yield high error rates, while the largest dimension does not necessarily improve performance; this could be attributed to the limited size of the training data, for which a moderate embedding dimension may suffice. The combination of a vocabulary size of 100000 and an embedding dimension of 100 resulted in the lowest error rate.

                          Dimension = 25    Dimension = 100    Dimension = 300
Vocabulary size = 50K          23.4%             15.5%              17.6%
Vocabulary size = 100K         23.8%             10.3%              17.9%
Vocabulary size = 200K         22.1%             17.6%              14.1%
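These error rates can be tabulated directly from the grid of training results produced in the Code section. The sketch below assumes the results object exposes its final validation measurements through a "ValidationMeasurements" property.

(*tabulate the validation error rate for each (dimension, vocabulary size) pair; "result" as in the Code section, rows = embedding dimensions, columns = vocabulary sizes*)
errorRate[run_] := run["result"]["ValidationMeasurements"]["ErrorRate"];
TableForm[Map[errorRate, result, {2}],
  TableHeadings -> {{"dim 25", "dim 100", "dim 300"}, {"vocab 50K", "vocab 100K", "vocab 200K"}}]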

Figure 6 illustrates the validation set’s error rate across different vocabulary sizes and embedding dimensions, highlighting that the best performance is achieved with a vocabulary size of 100000 and an embedding dimension of 100. Figure 7 presents the confusion matrix for this optimal BPEmb setting.

Figure 6. Validation set error rate over training epochs for various BPEmb configurations. The setting with vocabulary size 100000 and embedding dimension 100 yields the lowest error rate.
Figure 7. Classification confusion matrix for the optimal BPEmb setting: vocabulary size = 100000, embedding dimension = 100.
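A confusion matrix such as the one in Figure 7 can be obtained with NetMeasurements on the validation data. The sketch below reuses makedata and rnn from the Code section and is illustrative rather than the exact original code.

(*train once with the optimal setting and measure the confusion matrix on the validation set*)
data = makedata[100, 100000];
trained = NetTrain[rnn, data["training"], ValidationSet -> data["validation"], MaxTrainingRounds -> 30];
NetMeasurements[trained, data["validation"], "ConfusionMatrixPlot"]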

We further evaluated different RNN models under the optimal BPEmb setting. Figure 8 summarizes the architectures of the four models compared.

Figure 8. Summary graphs of the RNN models whose performances are compared.

The differences in model architecture and their corresponding validation error rates are detailed in the table below. While the BPEmb settings had a significant impact, variations among the RNN models showed only a mild effect on performance.

Model     LSTM Dimension    Dropout Layer    Validation Error Rate (%)
Model1          50               No                   10.3
Model2          10               No                   16.2
Model3         100               No                   12.4
Model4          50              Yes                   11.4
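A minimal sketch of how these four variants can be defined follows, reusing the layer layout, classes, and runmodel from the Code section. The dropout rate (0.5) and its placement after the sequence-pooling layer are assumptions, since the table only records whether a dropout layer is present.

(*LSTM classifier with a configurable LSTM dimension and an optional dropout layer*)
makeRNN[lstmDim_, dropout_: False] := NetChain[
  Join[
    <|"LSTM" -> LongShortTermMemoryLayer[lstmDim], "Last" -> SequenceLastLayer[]|>,
    If[dropout, <|"Dropout" -> DropoutLayer[0.5]|>, <||>],
    <|"Linear" -> LinearLayer[3], "Softmax" -> SoftmaxLayer[]|>],
  "Output" -> NetDecoder[{"Class", classes}]];
models = {makeRNN[50], makeRNN[10], makeRNN[100], makeRNN[50, True]}; (*Model1 through Model4*)
comparison = runmodel[#, 100, 100000] & /@ models;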

Conclusion

This report has examined the influence of various Byte-Pair Encoding (BPE) settings and recurrent neural network (RNN) architectures on the classification of sentences from Chinese Wikipedia articles across three topics: physics, mathematics, and biology. Our findings highlight the critical role of the BPE configuration, particularly the vocabulary size and embedding dimension, in determining classification performance, while differences among the RNN architectures had only a mild effect. This underscores the need to select embedding parameters carefully when applying RNN models to text classification, especially in contexts with limited data resources.

Code

The following is the Wolfram Language code used in this study.

(*Split string in Chinese into sentences*)
SplitToSentences[string_]:= StringSplit[string, "。"]

(*function to generate labeled data for training and validation*)
makedata[dimensions_, vocsize_]:= Module[{bpe, train, validation, classes},
    classes = {"数学", "物理学", "生物学"};
    bpe = NetModel[{"BPEmb Subword Embeddings Trained on Wikipedia Data",
        "Language" -> "Chinese",
        "Dimensions" -> dimensions,
        "VocabularySize" -> vocsize}]; (*BPEmb embedding*)
    train = SplitToSentences[WikipediaData[#, Language -> "Chinese"]] & /@ classes;
    train = Map[bpe, train, {2}]; (*embed each sentence as a sequence of vectors*)
    train = Flatten[Thread /@ Thread[train -> classes]]; (*pair each sentence with its topic label*)

    validation = SplitToSentences[WikipediaData[#, Language -> "Chinese"]] & /@ {"群论", "万有引力", "细胞"};
    validation = Map[bpe, validation, {2}]; (*embedding*)
    validation = Flatten[Thread /@ Thread[validation -> classes]]; (*group theory -> math, gravity -> physics, cell -> biology*)
    <|"training" -> train, "validation" -> validation|>
    ]

(*function to train a network model with the given embedding dimension and vocabulary size*)
runmodel[net_, dimensions_, vocsize_]:= Module[{data},
    data = makedata[dimensions, vocsize];
    <|"Dimensions" -> dimensions, "VocabularySize" -> vocsize,
      "result" -> NetTrain[net, data["training"], All, ValidationSet -> data["validation"], MaxTrainingRounds -> 30]|>
    ]

(*RNN model with LSTM*)
classes = {"数学", "物理学", "生物学"};
rnn = NetChain[<|"LSTM"->LongShortTermMemoryLayer[50], "Last"->SequenceLastLayer[], "Linear"->LinearLayer[3], "Softmax"->SoftmaxLayer[]|>,
"Output"->NetDecoder[{"Class", classes}]]

(* classification using the RNN model with various embedding dimensions and vocabulary sizes *)
vocsize = {50000, 100000, 200000};
dim = {25, 100, 300};
result = Outer[runmodel[rnn, #1, #2]&, dim, vocsize]; (*rows: embedding dimensions, columns: vocabulary sizes*)

References

Bernard, E. (2021). Introduction to Machine Learning. Wolfram Media, Inc.

Heinzerling, B., & Strube, M. (2018). BPEmb: Subword Embeddings in 275 Languages. https://bpemb.h-its.org