Named Entity Recognition with Classical ML

Project for the course INFO 521 "Introduction to Machine Learning"

Highlights

We present a brief survey on classical machine learning algorithms in the context of the CoNNL-2003 Shared Task. A named entity recognition (NER) system is built to recognize and classify objects in a body of text into predefined categories. While the task is certainly not new, we approach the task naively1. by creating a baseline to get some inspiration and determine the complexity of the problem which motivates the use of machine learning. Finally, we include an analysis between generative and discriminative machine learning algorithms.

NOTE Access to GitHub repository here and paper here

A secondary objective for the project required each algorithm to be written in pure python with numpy. The scikit-learn library was used for comparision purposes during evaluation.

Introduction

The Named Entity Recognition and Classification (NERC) task refers to the process that identifies phrases in text into various categories. Common entity types include ORGANIZATION (ORG), PERSON (PER), LOCATION (LOC), and MISCELLANEOUS (MISC). Tagging is the process of labeling each word in a sentence with its respective named entity (NE). As an example:

The University of Arizona ORG is a public land-grant university located in Tucson GPE , Arizona GPE .


Note the categorization is subjective as one can argue Arizona is also a LOCATION. Approaches to NERC generally use BIO notation which separates the beginning (B-) and inside (I-) of entities. Otherwise the label outside (O) is used. The named entity above is now:

The B-ORG University I-ORG of I-ORG Arizona I-ORG is a public land-grant university located in Tucson B-GPE , Arizona B-GPE .


NER remains a relevant natural language processing task that benefits several extrinsic tasks such as question answering and information extraction. The traditional schema largely ignores nested entities in favor of flat structures that are easier to identity but frames the problem with missing information. State-of-the-art (SOTA) methods make use of bidirectional Long Short-Term Memory (LSTM) networks with a Conditional Random Field (CRF) layer [1].

Our contributions:

  • We approach the NER task through a rule-based approach and create two baseline systems that motivate the use of machine learning.
  • We provide an analysis between generative and discriminative machine learning algorithms.

CoNNL-2002/-03

The project describes a baseline & benchmark methodology for developing a NER system in the context of the CoNNL-2003 Shared Task [2]. We will focus on the English dataset and compare the Spanish corpus from the CoNNL-2002 Shared [3]2.

The English dataset was preprocessed into training, development, and test sets. Additionally, the number of tokens per entity type highlight the imbalanced nature of the CoNNL-2003 dataset and of NER in general.

Comparison of the English split and number of tokens per entity type.

Methods

Named entity recognition is treated as a sequence labeling task, that is, given a sequence of words \(W = (w_1, ..., w_n)\), we are interested in maximizing the best sequence of tags:

\[\hat{t}_{1:n} \approx \arg\max_{\hat{t}_{1:n}} P(tag_{1:n}|word\_{1:n}) \tag{1}\]

We’ll introduce two class sequence labeling algorithms, the generative—Hidden Markov Model (HMM)—and the discriminative—Maximum Entropy Markov Model (MEMM). We encourage the reader who is interested in the mathematical derivation of each algorithm to read our paper. Otherwise the equation above is sufficient for intuition.

Key differences:

  • Generative: probabilistic modeling may lead to greater interpretability
  • Discriminative: easy to implement feature functions

Baselines

Two rule-based systems were computed for the English and Spanish corpora with the seqeval package [4]. The AlwaysNonEntity is a naive baseline that labels all tokens as non-entities (O), see Table 1. The less naive system SingleEntity is based on a lookup table where only the beginning (B-) of entities is added, see Table 2.

As evidenced with the AlwaysNonEntity baseline, token-level accuracy has a modest 80% but is meaninguless for NEs. The improved baseline SingleEntity labels entities only if they appear in the training data and is the default baseline for our system to improve upon.

Earlier we noticed the CoNNL-2003 dataset is imbalanced. Largely the distribution of named entities is imbalanced where the majority class is outside (O). Therefore we will use the micro-averaged F1 score for evaluation because this doesn’t prefer the majority class. However other evaluation methods may be appropriate depending on the domain. We do not consider these.

Experiments

We compare the MEMM and HMM on the NERC task. The Spanish dataset was evaluated exclusively by an HMM, the results in Table 4 suggest a first-order model an appropriate benchmark for the task. We obtained an 𝐹1 score of 67.5% which is an improvement over baselines and the transformation-based model proposed by Villalba and Guzmán [5].

For the English dataset we tested several variants by using various add-𝛼 smoothing (𝛼 = 1 or > 1) values and used pairwise grid search to find optimal values for regularization, epochs, and learning rate. We compared the results of the MEMM imported from scikit-learn and our implementation in Table 4.

The MEMM clearly shows discriminative modeling gives good results from feature functions and word embeddings. It easily outperforms the baseline and shows a 4% improvement over the HMM on the CoNNL-2003 dataset. We are satisfied that the performance of these two models is almost identical suggesting our implementation is on par with scikit-learn’s version except for runtime. The MEMM performed and generalized the best with greedy search as the validation score did not drop drastically for the test data.

Ablation Study

The motivation for maximum-entropy modeling is clear: global and local features are useful for prediction. The inclusion of feature functions and static word embeddings allowed the MEMM to recognize complex entities. This increase in features, and in effect parameters, improved the overall performance of the system. The ablation study below highlights how each category of features improved our system.

Conclusion

We described the performance of our algorithms and their variants in Tables 3 & 4. The ME approach clearly shows discriminative modeling gives good results; it easily outperforms the AlwaysNonEntity and SingleEntity baselines and beats a first-order HMM as well.6

The discussion of generative and discriminative modeling by Ng and Jordan [6] also suggests logistic regression will eventually catch up and overtake the performance of a generative classifier given enough data. This implication may be extended to deep neural networks that already hold SOTA performance.

Therefore we conclude that advancements in NLP (i.e., embeddings) have improved our model (see Table 7) with little feature engineering and can reach competitive performance with tuning or more careful modeling. We are not interested in performance however, and propose that the CoNNL-2003 dataset may be dated and we should instead access the generalization of these models or evaluate nested entities.

  1. Naive in the sense that we should attempt to solve the task with appropriate tools. We may also note various when we baseline systems correctly. 

  2. The transformation-based model is inspired by Brill’s Tagger: an inductive method for part-of-speech tagging. We used the model for comparision because it’s intended to use linguistic intuition. 

References

2020

  1. Named Entity Extraction with Finite State Transducers
    Diego Alexander Huérfano Villalba, and Elizabeth León Guzmán
    2020

2018

  1. A Neural Layered Model for Nested Named Entity Recognition
    Meizhi Ju, Makoto Miwa, and Sophia Ananiadou
    In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Jun 2018
  2. seqeval: A Python framework for sequence labeling evaluation
    Hiroki Nakayama
    Jun 2018
    Software available from https://github.com/chakki-works/seqeval

2003

  1. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition
    Erik F. Tjong Kim Sang, and Fien De Meulder
    In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Jun 2003

2002

  1. Introduction to the CoNLL-2002 Shared Task: Language-Independent Named Entity Recognition
    Erik F. Tjong Kim Sang
    In COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002), Jun 2002

2001

  1. On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes
    Andrew Ng, and Michael Jordan
    In Advances in Neural Information Processing Systems, Jun 2001