Named Entity Recognition with Classical ML
Project for the course INFO 521 "Introduction to Machine Learning"
Highlights
We present a brief survey of classical machine learning algorithms in the context of the CoNLL-2003 Shared Task. A named entity recognition (NER) system is built to recognize and classify entities in a body of text into predefined categories. While the task is certainly not new, we approach it naively¹ by creating a baseline to gather inspiration and gauge the complexity of the problem, which motivates the use of machine learning. Finally, we include a comparison of generative and discriminative machine learning algorithms.
A secondary objective for the project required each algorithm to be written in pure Python with numpy. The scikit-learn library was used for comparison purposes during evaluation.
Introduction
The Named Entity Recognition and Classification (NERC) task refers to the process of identifying phrases in text and classifying them into predefined categories. Common entity types include ORGANIZATION (ORG), PERSON (PER), LOCATION (LOC), and MISCELLANEOUS (MISC). Tagging is the process of labeling each word in a sentence with its respective named entity (NE). As an example (an illustrative sentence), a tagger might produce:

    [Arizona]ORG beat [Oregon State]ORG in [Tucson]LOC
Note the categorization is subjective, as one can argue Arizona is also a LOCATION. Approaches to NERC generally use BIO notation, which separates the beginning (B-) and inside (I-) of entities; tokens outside any entity receive the label O. The tagged sentence above is now:

    Arizona/B-ORG beat/O Oregon/B-ORG State/I-ORG in/O Tucson/B-LOC
NER remains a relevant natural language processing task that benefits several extrinsic tasks such as question answering and information extraction. The traditional schema largely ignores nested entities in favor of flat structures that are easier to identify, but this framing discards information. State-of-the-art (SOTA) methods make use of bidirectional Long Short-Term Memory (LSTM) networks with a Conditional Random Field (CRF) layer [1].
Our contributions:
- We tackle the NER task with rule-based methods and create two baseline systems that motivate the use of machine learning.
- We provide a comparison of generative and discriminative machine learning algorithms.
CoNLL-2002/03
The project describes a baseline and benchmark methodology for developing a NER system in the context of the CoNLL-2003 Shared Task [2]. We focus on the English dataset and compare against the Spanish corpus from the CoNLL-2002 Shared Task [3]².
The English dataset was preprocessed into training, development, and test sets. Additionally, the number of tokens per entity type highlights the imbalanced nature of the CoNLL-2003 dataset and of NER in general.


Methods
Named entity recognition is treated as a sequence labeling task: given a sequence of words \(W = (w_1, ..., w_n)\), we are interested in finding the most probable sequence of tags:

\[\hat{t}_{1:n} \approx \arg\max_{t_{1:n}} P(t_{1:n} \mid w_{1:n}) \tag{1}\]

We introduce two classes of sequence labeling algorithms: the generative Hidden Markov Model (HMM) and the discriminative Maximum Entropy Markov Model (MEMM). We encourage the reader interested in the mathematical derivation of each algorithm to read our paper; otherwise, the equation above is sufficient for intuition.
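For intuition, the two models factor Eq. (1) differently. In the standard formulation (the full derivation is in our paper):

\[\text{HMM (generative):}\quad \hat{t}_{1:n} \approx \arg\max_{t_{1:n}} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})\]

\[\text{MEMM (discriminative):}\quad \hat{t}_{1:n} = \arg\max_{t_{1:n}} \prod_{i=1}^{n} P(t_i \mid w_i, t_{i-1})\]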
Key differences:
- Generative: probabilistic modeling may lead to greater interpretability
- Discriminative: easy to implement feature functions
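Both models can be decoded with dynamic programming over tag sequences. Below is a minimal numpy sketch of Viterbi decoding for a first-order HMM; the matrix layout and function names are illustrative rather than our exact implementation, and the probabilities are assumed to be smoothed (nonzero).

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most probable tag sequence for one sentence under a first-order HMM.

    obs: (n,) int array of word indices
    pi:  (K,) initial tag probabilities
    A:   (K, K) transitions, A[i, j] = P(tag_j | tag_i)
    B:   (K, V) emissions,   B[j, w] = P(word_w | tag_j)
    """
    n, K = len(obs), len(pi)
    logA, logB = np.log(A), np.log(B)       # assumes smoothed, nonzero probabilities
    delta = np.empty((n, K))                # log-prob of best path ending in each tag
    psi = np.zeros((n, K), dtype=int)       # back-pointers
    delta[0] = np.log(pi) + logB[:, obs[0]]
    for i in range(1, n):
        scores = delta[i - 1][:, None] + logA   # (prev tag, current tag)
        psi[i] = scores.argmax(axis=0)
        delta[i] = scores.max(axis=0) + logB[:, obs[i]]
    tags = [int(delta[-1].argmax())]            # best final tag, then walk back
    for i in range(n - 1, 0, -1):
        tags.append(int(psi[i, tags[-1]]))
    return tags[::-1]
```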
Baselines
Two rule-based systems were evaluated on the English and Spanish corpora with the seqeval package [4]. AlwaysNonEntity is a naive baseline that labels all tokens as non-entities (O); see Table 1. The less naive SingleEntity system is based on a lookup table to which only the beginning (B-) tags of entities seen in training are added; see Table 2. A sketch of the latter follows.
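A minimal sketch of the SingleEntity lookup baseline, assuming sentences arrive as (tokens, labels) pairs; the helper names are hypothetical, and tie-breaking may differ from our implementation.

```python
def build_lookup(train_sents):
    """Remember the B- label each token carried in training (first one wins)."""
    lookup = {}
    for tokens, labels in train_sents:
        for tok, lab in zip(tokens, labels):
            if lab.startswith("B-") and tok not in lookup:
                lookup[tok] = lab
    return lookup

def single_entity_tag(tokens, lookup):
    """SingleEntity baseline: a token's remembered B- tag if seen, else O."""
    return [lookup.get(tok, "O") for tok in tokens]
```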

As evidenced by the AlwaysNonEntity baseline, token-level accuracy reaches a modest 80% but is meaningless for NEs. The improved SingleEntity baseline labels entities only if they appear in the training data and is the default baseline for our system to improve upon.
As noted earlier, the CoNLL-2003 dataset is imbalanced: the distribution of named entities is skewed, with outside (O) as the majority class. We therefore use the micro-averaged F1 score for evaluation, since it does not favor the majority class. Other evaluation methods may be appropriate depending on the domain; we do not consider them here.
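Entity-level scores are computed with seqeval; its f1_score is micro-averaged by default. A small usage example on toy data:

```python
from seqeval.metrics import classification_report, f1_score

# One list of BIO labels per sentence (toy data for illustration).
y_true = [["B-ORG", "I-ORG", "O", "B-LOC"], ["B-PER", "O", "O"]]
y_pred = [["B-ORG", "O",     "O", "B-LOC"], ["B-PER", "O", "O"]]

print(f1_score(y_true, y_pred))            # entity-level, micro-averaged by default
print(classification_report(y_true, y_pred))
```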
Experiments
We compare the MEMM and HMM on the NERC task. The Spanish dataset was evaluated exclusively with an HMM; the results in Table 3 suggest a first-order model is an appropriate benchmark for the task. We obtained an F1 score of 67.5%, an improvement over the baselines and over the transformation-based model proposed by Villalba and Guzmán [5].

For the English dataset, we tested several variants with different add-α smoothing values (α = 1 or α > 1) and used pairwise grid search to find optimal values for regularization, epochs, and learning rate. Table 4 compares our MEMM implementation against one built on scikit-learn.
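As a sketch, add-α smoothing of an HMM emission table in numpy (α = 1 recovers Laplace smoothing); the count-matrix layout here is an assumption, not our exact code.

```python
import numpy as np

def smoothed_emissions(counts, alpha=1.0):
    """Add-alpha smoothed emission matrix from a (K tags x V words) count table."""
    K, V = counts.shape
    return (counts + alpha) / (counts.sum(axis=1, keepdims=True) + alpha * V)
```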

The MEMM clearly shows that discriminative modeling gives good results from feature functions and word embeddings. It easily outperforms the baseline and shows a 4% improvement over the HMM on the CoNLL-2003 dataset. The two implementations perform almost identically, suggesting ours is on par with scikit-learn's version except for runtime. The MEMM with greedy search performed and generalized the best, as the validation score did not drop drastically on the test data.
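A minimal sketch of greedy MEMM decoding: at each position the locally best tag is chosen given the previous prediction. The feature extractor and weight layout here are hypothetical.

```python
import numpy as np

def greedy_decode(tokens, tags, weights, features_fn):
    """Greedy MEMM decoding: commit to the locally best tag at each position.

    features_fn(tokens, i, prev_tag) -> (D,) feature vector (hypothetical helper)
    weights: (K, D), one row of learned weights per tag
    """
    prev, out = "<s>", []
    for i in range(len(tokens)):
        scores = weights @ features_fn(tokens, i, prev)  # unnormalized log-probs
        prev = tags[int(np.argmax(scores))]              # softmax argmax == raw argmax
        out.append(prev)
    return out
```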
Ablation Study
The motivation for maximum-entropy modeling is clear: global and local features are useful for prediction. The inclusion of feature functions and static word embeddings allowed the MEMM to recognize complex entities. This increase in features, and in effect parameters, improved the overall performance of the system. The ablation study below highlights how each category of features improved our system.

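To make the feature categories concrete, a hypothetical extractor combining lexical, orthographic, local-context, and static-embedding features might look as follows; it may differ from our exact feature set.

```python
def token_features(tokens, i, prev_tag, embeddings=None):
    """Hypothetical MEMM feature extractor; our exact feature set may differ."""
    w = tokens[i]
    feats = {
        "bias": 1.0,
        f"word={w.lower()}": 1.0,                       # lexical identity
        f"prev_tag={prev_tag}": 1.0,                    # local Markov feature
        f"suffix3={w[-3:].lower()}": 1.0,               # morphology
        f"prev_word={tokens[i - 1].lower() if i else '<s>'}": 1.0,
        "is_title": float(w.istitle()),                 # orthographic shape
        "is_upper": float(w.isupper()),
        "has_digit": float(any(c.isdigit() for c in w)),
    }
    if embeddings is not None and w.lower() in embeddings:
        for j, v in enumerate(embeddings[w.lower()]):   # static word vector dims
            feats[f"emb_{j}"] = float(v)
    return feats
```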
Conclusion
We described the performance of our algorithms and their variants in Tables 3 & 4. The ME approach clearly shows discriminative modeling gives good results; it easily outperforms the AlwaysNonEntity and SingleEntity baselines and beats a first-order HMM as well.
The discussion of generative and discriminative modeling by Ng and Jordan [6] also suggests logistic regression will eventually catch up and overtake the performance of a generative classifier given enough data. This implication may be extended to deep neural networks that already hold SOTA performance.

Therefore we conclude that advancements in NLP (e.g., embeddings) have improved our model (see Table 7) with little feature engineering, and that it can reach competitive performance with tuning or more careful modeling. We are not primarily interested in raw performance, however; we propose instead that the CoNLL-2003 dataset may be dated, and that future work should assess the generalization of these models or evaluate nested entities.
1. Naive in the sense that we should first attempt to solve the task with simple, appropriate tools; correctly built baseline systems also reveal the complexity of the problem. ↩
2. The transformation-based model is inspired by Brill's tagger: an inductive method for part-of-speech tagging. We used the model for comparison because it is intended to leverage linguistic intuition. ↩
References
2020
2018
- seqeval: A Python framework for sequence labeling evaluation. Jun 2018. Software available from https://github.com/chakki-works/seqeval
2003
- Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Jun 2003.
2002
- Introduction to the CoNLL-2002 Shared Task: Language-Independent Named Entity Recognition. In COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002), Jun 2002.
2001
- On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes. In Advances in Neural Information Processing Systems, Jun 2001.