Named Entity Recognition with Classical ML

Course Project — INFO 521 "Introduction to Machine Learning"

Highlights

This blog post present a brief survey on classical machine learning algorithms in the context of the CoNNL-2003 Shared Task. A named entity recognition (NER) system is built to recognize and classify objects in a body of text into predefined categories. While the task is certainly not new, I approach the task naively¹ by creating a baseline to get some inspiration and determine the complexity of the problem which motivates the use of machine learning. Finally, an analysis is included between generative and discriminative machine learning algorithms.

NOTE Access to GitHub repository here and paper here.

A secondary objective for the project required each algorithm to be written in python. While not required, the scikit-learn library was used as comparision for a sanity-check during evaluation.²

Since the writing of this blog/paper, I’ve implemented various architectures from papers (e.g., deep averaging networks, lstms with attention) and can confidently cite the difference in runtime between the written algorithm and the imported algorithm due to inefficent matrix multiplication. That is, my inexperience was at fault. While the performance was nearly identical (suggesting a correctly implemented algorithm) that did not take advantage of the hardware, resulting is slower performance. I highly suggest newcomers to try implementing an architecture from scratch as it really teaches you the dirty details in machine learning.

Introduction

The Named Entity Recognition and Classification (NERC) task refers to the process that classifies phrases in text as various categories. Common entity types include ORGANIZATION (ORG), PERSON (PER), LOCATION (LOC), and MISCELLANEOUS (MISC). Tagging is the process of labeling each word in a sentence with its respective named entity (NE). As an example:

The University of Arizona ORG is a public land-grant university located in Tucson GPE , Arizona GPE .

The remaining tokens are classified as the label outside (O), meaning the token is simply outside the named entity phrase. Note the categorization is subjective as one can argue Arizona is also a LOCATION. Approaches to NERC generally use BIO notation which separates the beginning (B-) and inside (I-) of entities. The example above is now:

The B-ORG University I-ORG of I-ORG Arizona I-ORG is a public land-grant university located in Tucson B-GPE , Arizona B-GPE .

NER remains a relevant natural language processing task that benefits several extrinsic tasks such as question answering and information extraction. The traditional schema largely ignores nested entities in favor of flat structures that are easier to identity but frames the problem with missing information. State-of-the-art (SOTA) methods make use of bidirectional Long Short-Term Memory (LSTM) networks with a Conditional Random Field (CRF) layer alongside contextualized embeddings [1].

Contributions:

The NER task is approached with a simple rule-based baseline that is further developed into a more complex model, i.e., a hidden markov model. The purpose of this methodology was to motivate the use of machine learning for this particular task, and to motivate the use of Logistic Regression applied to a sequence-labeling problem.
An analysis is provided between generative and discriminative machine learning algorithms.

CoNNL-2002/-03

The project describes a baseline & benchmark methodology for developing a NER system in the context of the CoNNL-2003 Shared Task [2]. We will focus on the English dataset and compare³ the Spanish corpus from the CoNNL-2002 Shared [3].

The English dataset was preprocessed into training, development, and test sets. Additionally, the number of tokens per entity type highlight the imbalanced nature of the CoNNL-2003 dataset and of NER in general.

Comparison of the English split and number of tokens per entity type.

Methods

Named entity recognition is treated as a sequence labeling task, that is, given a sequence of words \(W = (w_1, ..., w_n)\), we are interested in maximizing the best sequence of tags:

\[\hat{t}_{1:n} \approx \arg\max_{\hat{t}_{1:n}} P(tag_{1:n}|word_{1:n}) \tag{1}\]

We’ll introduce two class sequence labeling algorithms, the generative—Hidden Markov Model (HMM)—and the discriminative—Maximum Entropy Markov Model (MEMM). We encourage the reader who is interested in the mathematical derivation of each algorithm to read our paper. Otherwise the equation above is sufficient for intuition.

Key differences:

Generative: probabilistic modeling may lead to greater interpretability
Discriminative: easy to implement feature functions

Baselines

Two rule-based systems were computed for the English and Spanish corpora with the seqeval package [4]. The AlwaysNonEntity is a naive baseline that labels all tokens as non-entities (O), see Table 1. The less naive system SingleEntity is based on a lookup table where only the beginning (B-) of entities is added, see Table 2.

As evidenced with the AlwaysNonEntity baseline, token-level accuracy has a modest 80% but is meaningless for NEs. The improved baseline SingleEntity labels entities only if they appear in the training data and is the default baseline for our system to improve upon.

Earlier I stated the CoNNL-2003 dataset is imbalanced. Largely the distribution of named entities is imbalanced where the majority class is outside (O). Therefore the micro-averaged F1 score will be used for evaluation because this doesn’t prefer the majority class. However other evaluation methods may be appropriate depending on the domain, but are not considered to focus on the general design process for machine learning.

Experiments

We compare the MEMM and HMM on the NERC task. The Spanish dataset was evaluated exclusively by an HMM, and the results in Table 4 suggest a first-order model an appropriate benchmark for the task. We obtained an 𝐹1 score of 67.5% which is an improvement over baselines and the transformation-based model proposed by Villalba and Guzmán [5].

For the English dataset, several variants were tested by using various add-𝛼 smoothing (𝛼 = 1 or > 1) values and used pairwise grid search to find optimal values for regularization, epochs, and learning rate. We compared the results of the MEMM imported from scikit-learn and our implementation in Table 4.

The MEMM clearly shows discriminative modeling gives good results from feature functions and word embeddings. It easily outperforms the baseline and shows a 4% improvement over the HMM on the CoNNL-2003 dataset. We are satisfied that the performance of these two models is almost identical suggesting our implementation is on par with scikit-learn’s version except for runtime. The MEMM performed and generalized the best with greedy search as the validation score did not drop drastically for the test data.

Ablation Study

The motivation for maximum-entropy modeling is clear: global and local features are useful for prediction. The inclusion of feature functions and static word embeddings allowed the MEMM to recognize complex entities. This increase in features, and in effect parameters, improved the overall performance of the system. The ablation study below highlights how each category of features improved our system.

Conclusion

We described the performance of our algorithms and their variants in Tables 3 & 4. The ME approach clearly shows discriminative modeling gives good results; the approach easily outperforms the stated baselines, AlwaysNonEntity and SingleEntity. It also achieves a ~10 point performance increase when compared to the first-order HMM on the dev set. Importantly, the generalization error is reduced with the ME approach, which is highly desirable.

The discussion of generative and discriminative modeling by Ng and Jordan [6] also suggests logistic regression will eventually catch up and overtake the performance of a generative classifier given enough data. This implication may be extended to deep neural networks that already hold SOTA performance.

Therefore recent advancements (approx. 10 years ago) in NLP (i.e., embeddings) have improved our model (see Table 7) with little feature engineering and can reach competitive performance with tuning or more careful modeling. Ultimately, performance is a secondary interest, and we propose that the CoNNL-2003 dataset is dated; the generalization of these models should instead be the main focus (consider new entities), or handle nested entities.

Naive in the sense that one should attempt to solve the task with appropriate tools. This also exposes details in the data when one uses relevant baselines. ↩
Debugging machine learning is a difficult but useful skill. Writing algorithms from scratch (in some capacity) is an excellent exercise. ↩
The transformation-based model is inspired by Brill’s Tagger: an inductive method for part-of-speech tagging. We used the model for comparision because it’s intended to use linguistic intuition. ↩

References

2020

Named Entity Extraction with Finite State Transducers

Diego Alexander Huérfano Villalba, and Elizabeth León Guzmán

May 2020

Bib

@misc{villalba2020namedentityextractionfinite,
  title = {Named Entity Extraction with Finite State Transducers},
  author = {Villalba, Diego Alexander Huérfano and Guzmán, Elizabeth León},
  year = {2020},
  eprint = {2006.11548},
  archiveprefix = {arXiv},
  primaryclass = {cs.CL},
  url = {https://arxiv.org/abs/2006.11548}
}

2018

A Neural Layered Model for Nested Named Entity Recognition

Meizhi Ju, Makoto Miwa, and Sophia Ananiadou

In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Jun 2018

Abs DOI Bib

Entity mentions embedded in longer entity mentions are referred to as nested entities. Most named entity recognition (NER) systems deal only with the flat entities and ignore the inner nested ones, which fails to capture finer-grained semantic information in underlying texts. To address this issue, we propose a novel neural model to identify nested entities by dynamically stacking flat NER layers. Each flat NER layer is based on the state-of-the-art flat NER model that captures sequential context representation with bidirectional Long Short-Term Memory (LSTM) layer and feeds it to the cascaded CRF layer. Our model merges the output of the LSTM layer in the current flat NER layer to build new representation for detected entities and subsequently feeds them into the next flat NER layer. This allows our model to extract outer entities by taking full advantage of information encoded in their corresponding inner entities, in an inside-to-outside way. Our model dynamically stacks the flat NER layers until no outer entities are extracted. Extensive evaluation shows that our dynamic model outperforms state-of-the-art feature-based systems on nested NER, achieving 74.7% and 72.2% on GENIA and ACE2005 datasets, respectively, in terms of F-score.
@inproceedings{ju-etal-2018-neural, title = {A Neural Layered Model for Nested Named Entity Recognition}, author = {Ju, Meizhi and Miwa, Makoto and Ananiadou, Sophia}, editor = {Walker, Marilyn and Ji, Heng and Stent, Amanda}, booktitle = {Proceedings of the 2018 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)}, month = jun, year = {2018}, address = {New Orleans, Louisiana}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/N18-1131/}, doi = {10.18653/v1/N18-1131}, pages = {1446--1459} }
seqeval: A Python framework for sequence labeling evaluation

Hiroki Nakayama

Jun 2018

Software available from https://github.com/chakki-works/seqeval

2003

Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition

Erik F. Tjong Kim Sang, and Fien De Meulder

In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Jun 2003

Bib

@inproceedings{tjong-kim-sang-de-meulder-2003-introduction,
  title = {Introduction to the {C}o{NLL}-2003 Shared Task: Language-Independent Named Entity Recognition},
  author = {Tjong Kim Sang, Erik F. and De Meulder, Fien},
  booktitle = {Proceedings of the Seventh Conference on Natural Language Learning at {HLT}-{NAACL} 2003},
  year = {2003},
  url = {https://aclanthology.org/W03-0419/},
  pages = {142--147}
}

2002

Introduction to the CoNLL-2002 Shared Task: Language-Independent Named Entity Recognition

Erik F. Tjong Kim Sang

In COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002), Jun 2002

Bib

@inproceedings{tjong-kim-sang-2002-introduction,
  title = {Introduction to the {C}o{NLL}-2002 Shared Task: Language-Independent Named Entity Recognition},
  author = {Tjong Kim Sang, Erik F.},
  booktitle = {{COLING}-02: The 6th Conference on Natural Language Learning 2002 ({C}o{NLL}-2002)},
  year = {2002},
  url = {https://aclanthology.org/W02-2024/}
}

2001

On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes

Andrew Ng, and Michael Jordan

In Advances in Neural Information Processing Systems, Jun 2001