Named Entity Recognition with Classical ML

Course Project — INFO 521 "Introduction to Machine Learning"

Highlights

This blog post present a brief survey on classical machine learning algorithms in the context of the CoNNL-2003 Shared Task. A named entity recognition (NER) system is built to recognize and classify objects in a body of text into predefined categories. While the task is certainly not new, I approach the task naively1 by creating a baseline to get some inspiration and determine the complexity of the problem which motivates the use of machine learning. Finally, an analysis is included between generative and discriminative machine learning algorithms.

NOTE Access to GitHub repository here and paper here.

A secondary objective for the project required each algorithm to be written in python. While not required, the scikit-learn library was used as comparision for a sanity-check during evaluation.2

  • Since the writing of this blog/paper, I’ve implemented various architectures from papers (e.g., deep averaging networks, lstms with attention) and can confidently cite the difference in runtime between the written algorithm and the imported algorithm due to inefficent matrix multiplication. That is, my inexperience was at fault. While the performance was nearly identical (suggesting a correctly implemented algorithm) that did not take advantage of the hardware, resulting is slower performance. I highly suggest newcomers to try implementing an architecture from scratch as it really teaches you the dirty details in machine learning.

Introduction

The Named Entity Recognition and Classification (NERC) task refers to the process that classifies phrases in text as various categories. Common entity types include ORGANIZATION (ORG), PERSON (PER), LOCATION (LOC), and MISCELLANEOUS (MISC). Tagging is the process of labeling each word in a sentence with its respective named entity (NE). As an example:

The University of Arizona ORG is a public land-grant university located in Tucson GPE , Arizona GPE .


The remaining tokens are classified as the label outside (O), meaning the token is simply outside the named entity phrase. Note the categorization is subjective as one can argue Arizona is also a LOCATION. Approaches to NERC generally use BIO notation which separates the beginning (B-) and inside (I-) of entities. The example above is now:

The B-ORG University I-ORG of I-ORG Arizona I-ORG is a public land-grant university located in Tucson B-GPE , Arizona B-GPE .


NER remains a relevant natural language processing task that benefits several extrinsic tasks such as question answering and information extraction. The traditional schema largely ignores nested entities in favor of flat structures that are easier to identity but frames the problem with missing information. State-of-the-art (SOTA) methods make use of bidirectional Long Short-Term Memory (LSTM) networks with a Conditional Random Field (CRF) layer alongside contextualized embeddings [1].

Contributions:

  • The NER task is approached with a simple rule-based baseline that is further developed into a more complex model, i.e., a hidden markov model. The purpose of this methodology was to motivate the use of machine learning for this particular task, and to motivate the use of Logistic Regression applied to a sequence-labeling problem.
  • An analysis is provided between generative and discriminative machine learning algorithms.

CoNNL-2002/-03

The project describes a baseline & benchmark methodology for developing a NER system in the context of the CoNNL-2003 Shared Task [2]. We will focus on the English dataset and compare3 the Spanish corpus from the CoNNL-2002 Shared [3].

The English dataset was preprocessed into training, development, and test sets. Additionally, the number of tokens per entity type highlight the imbalanced nature of the CoNNL-2003 dataset and of NER in general.

Comparison of the English split and number of tokens per entity type.

Methods

Named entity recognition is treated as a sequence labeling task, that is, given a sequence of words \(W = (w_1, ..., w_n)\), we are interested in maximizing the best sequence of tags:

\[\hat{t}_{1:n} \approx \arg\max_{\hat{t}_{1:n}} P(tag_{1:n}|word\_{1:n}) \tag{1}\]

We’ll introduce two class sequence labeling algorithms, the generative—Hidden Markov Model (HMM)—and the discriminative—Maximum Entropy Markov Model (MEMM). We encourage the reader who is interested in the mathematical derivation of each algorithm to read our paper. Otherwise the equation above is sufficient for intuition.

Key differences:

  • Generative: probabilistic modeling may lead to greater interpretability
  • Discriminative: easy to implement feature functions

Baselines

Two rule-based systems were computed for the English and Spanish corpora with the seqeval package [4]. The AlwaysNonEntity is a naive baseline that labels all tokens as non-entities (O), see Table 1. The less naive system SingleEntity is based on a lookup table where only the beginning (B-) of entities is added, see Table 2.

As evidenced with the AlwaysNonEntity baseline, token-level accuracy has a modest 80% but is meaninguless for NEs. The improved baseline SingleEntity labels entities only if they appear in the training data and is the default baseline for our system to improve upon.

Earlier I stated the CoNNL-2003 dataset is imbalanced. Largely the distribution of named entities is imbalanced where the majority class is outside (O). Therefore the micro-averaged F1 score will be used for evaluation because this doesn’t prefer the majority class. However other evaluation methods may be appropriate depending on the domain, but are not considered to focus on the general design process for machine learning.

Experiments

We compare the MEMM and HMM on the NERC task. The Spanish dataset was evaluated exclusively by an HMM, and the results in Table 4 suggest a first-order model an appropriate benchmark for the task. We obtained an 𝐹1 score of 67.5% which is an improvement over baselines and the transformation-based model proposed by Villalba and Guzmán [5].

For the English dataset, several variants were tested by using various add-𝛼 smoothing (𝛼 = 1 or > 1) values and used pairwise grid search to find optimal values for regularization, epochs, and learning rate. We compared the results of the MEMM imported from scikit-learn and our implementation in Table 4.

The MEMM clearly shows discriminative modeling gives good results from feature functions and word embeddings. It easily outperforms the baseline and shows a 4% improvement over the HMM on the CoNNL-2003 dataset. We are satisfied that the performance of these two models is almost identical suggesting our implementation is on par with scikit-learn’s version except for runtime. The MEMM performed and generalized the best with greedy search as the validation score did not drop drastically for the test data.

Ablation Study

The motivation for maximum-entropy modeling is clear: global and local features are useful for prediction. The inclusion of feature functions and static word embeddings allowed the MEMM to recognize complex entities. This increase in features, and in effect parameters, improved the overall performance of the system. The ablation study below highlights how each category of features improved our system.

Conclusion

We described the performance of our algorithms and their variants in Tables 3 & 4. The ME approach clearly shows discriminative modeling gives good results; the approach easily outperforms the stated baselines, AlwaysNonEntity and SingleEntity. It also achieves a ~10 point performance increase when compared to the first-order HMM on the dev set. Importantly, the generalization error is reduced with the ME approach, which is highly desirable.

The discussion of generative and discriminative modeling by Ng and Jordan [6] also suggests logistic regression will eventually catch up and overtake the performance of a generative classifier given enough data. This implication may be extended to deep neural networks that already hold SOTA performance.

Therefore recent advancements (approx. 10 years ago) in NLP (i.e., embeddings) have improved our model (see Table 7) with little feature engineering and can reach competitive performance with tuning or more careful modeling. Ultimately, the performance is a secondary interest,and propose that the CoNNL-2003 dataset may be dated and we should instead access the generalization of these models by iteratively included recent entities or perhaps tackle nested entities.

  1. Naive in the sense that one should attempt to solve the task with appropriate tools. This also exposes details in the data when one uses relevant baselines. 

  2. Debugging machine learning is a difficult but useful skill. Writing algorithms from scratch (in some capacity) is an excellent exercise. 

  3. The transformation-based model is inspired by Brill’s Tagger: an inductive method for part-of-speech tagging. We used the model for comparision because it’s intended to use linguistic intuition. 

References

2020

  1. Named Entity Extraction with Finite State Transducers
    Diego Alexander Huérfano Villalba, and Elizabeth León Guzmán
    May 2020

2018

  1. A Neural Layered Model for Nested Named Entity Recognition
    Meizhi Ju, Makoto Miwa, and Sophia Ananiadou
    In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Jun 2018
  2. seqeval: A Python framework for sequence labeling evaluation
    Hiroki Nakayama
    Jun 2018
    Software available from https://github.com/chakki-works/seqeval

2003

  1. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition
    Erik F. Tjong Kim Sang, and Fien De Meulder
    In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Jun 2003

2002

  1. Introduction to the CoNLL-2002 Shared Task: Language-Independent Named Entity Recognition
    Erik F. Tjong Kim Sang
    In COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002), Jun 2002

2001

  1. On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes
    Andrew Ng, and Michael Jordan
    In Advances in Neural Information Processing Systems, Jun 2001