A bibliography for NLP beginners

Roughly one year ago I started to study Natural Language Processing. During my internship at OTCStreaming (that resulted in this paper) and my current position at Squarepoint, I had the occasion to read a great number of papers and to gather some useful knowledge on this topic. Here I will share some of the must-read papers of the field, which can get anybody with some knowledge in Machine Learning started!

I. Introduction

Our goal here is to perform binary text classification, or sentiment analysis, on different corpora. However, most of this article covers the basics of NLP, so it can be useful for many other applications as well.

II. Data Representation

There are not many ways to represent text for classification. We can roughly make 3 categories:

• Bag-of-Words (BoW, wiki), basically a matrix counting the number of occurrences of each word in a text, possibly normalized.
    • Variants include tf-idf and the use of n-grams (groups of words, sometimes of characters).
    • Obviously, the order of the words is completely forgotten, which is not as catastrophic as one might think.
• Word Embeddings, or distributed representations, associate with each word a vector based on its semantic properties. See section III, and Word2Vec.
• Neural Networks, based on word-level or character-level embeddings, possibly pretrained.
    • Unlike other models, Neural Networks generate a representation of the data that depends on the classification task. That is an obvious advantage, but it requires a lot more data. That’s why we can still feed them generic representations to skip this part of the task.
    • A good survey of the topic can be found in Lopez et al. (2017).
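To make the Bag-of-Words idea concrete, here is a minimal sketch in plain Python. The function names (`ngrams`, `bag_of_words`, `tf_idf`) are mine, the tokenizer is a naive whitespace split, and the tf-idf formula is only one common variant among several, so take it as an illustration rather than a reference implementation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the word n-grams of a token list, each joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_words(docs, max_n=1):
    """Count all 1-grams up to max_n-grams in each document.

    Returns (vocabulary, matrix) where matrix[i][j] is the number of
    occurrences of vocabulary[j] in docs[i].
    """
    counts = []
    for doc in docs:
        tokens = doc.lower().split()  # naive tokenizer, an assumption
        features = []
        for n in range(1, max_n + 1):
            features.extend(ngrams(tokens, n))
        counts.append(Counter(features))
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix

def tf_idf(matrix):
    """Reweight raw counts by tf * log(N / df), one common tf-idf variant."""
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    # df[j] = number of documents that contain term j
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    return [[row[j] * math.log(n_docs / df[j]) for j in range(n_terms)]
            for row in matrix]

docs = ["the movie was good", "the movie was not good"]
vocabulary, matrix = bag_of_words(docs, max_n=2)
weights = tf_idf(matrix)
```

With `max_n=2`, the bigram "not good" becomes a feature of its own, which recovers a little of the word order that plain word counts throw away. With this tf-idf variant, words that appear in every document (such as "the") get a weight of zero.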

On top of these representations, we can basically use any traditional machine learning algorithm to obtain a classifier (more details in section VI).

Here is a comparison of text classification algorithms on different corpora, as reported in this paper. The paper is about fastText, a Facebook algorithm that uses clever tricks and character n-grams to provide efficient and (guess what) fast classifiers and embeddings.

We can see that BoW models are often on par with much more complicated Neural Network models, which require a lot more time to train. Less fancy I know, but it works.

The table also shows one of the most influential recent papers on text classification, char-CNN by Zhang and LeCun (2015), which uses character-level 1D Convolutional Neural Networks (CNN) to perform this task. The encoding is done by prescribing an alphabet of size d for the input language, then quantizing each character using 1-of-d encoding (or “one-hot” encoding). The downside of CNNs is that, unlike recurrent networks for example, they need fixed-size inputs. That is why the sequence of characters is transformed into a fixed-length sequence of $l_0$ such d-sized vectors. Any character beyond position $l_0$ is ignored, and any character that is not in the alphabet, including blank characters, is quantized as an all-zero vector. In the paper, they use a 70-character alphabet including lowercase letters, digits and other common characters.
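The quantization step described above can be sketched in a few lines of Python. The alphabet below is a shortened, hypothetical one (not the 70 characters of the paper), and lowercasing the input and padding short texts with all-zero vectors are my own assumptions.

```python
import string

# Hypothetical alphabet for illustration; the paper uses 70 characters.
ALPHABET = string.ascii_lowercase + string.digits + ".,;!?'"
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def quantize(text, length):
    """Encode text as `length` one-hot vectors of size d = len(ALPHABET)."""
    d = len(ALPHABET)
    encoded = []
    for ch in text.lower()[:length]:  # characters beyond `length` are ignored
        vector = [0] * d
        index = CHAR_INDEX.get(ch)
        if index is not None:         # characters outside the alphabet stay all-zero
            vector[index] = 1
        encoded.append(vector)
    # Assumption: texts shorter than `length` are padded with all-zero vectors.
    encoded.extend([0] * d for _ in range(length - len(encoded)))
    return encoded

matrix = quantize("Hi NLP!", length=10)
```

Here `matrix` has 10 rows of size d, whatever the length of the input text, which is exactly what a CNN needs.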

III. Word embeddings

Word embeddings (also called distributed representations) have been a real breakthrough in NLP these past 5 years, so they deserve their own section.

The seminal paper of Collobert et al. (2011) presents a first idea of token embeddings, or word feature vectors, based on lookup tables in a fixed vocabulary and learned with neural networks. It also brings a general solution to problems such as Part-of-Speech (POS) tagging, Chunking and Named Entity Recognition (NER) (see IV). The work on word feature vectors continued with the classic Word2Vec paper, Mikolov et al. (2013), which is now one of the references on the topic and introduced the skip-gram model for text. There, the method consists in associating a vector with each word of a given dictionary, determined by self-supervised learning on large corpora of text.

More precisely, Word2Vec can use either of two model architectures to produce a distributed representation of words: continuous bag-of-words (CBOW) or continuous skip-gram. In the continuous bag-of-words architecture, the model predicts the current word from a window of surrounding context words, while in the continuous skip-gram architecture, the model uses the current word to predict the surrounding window of context words. This was demonstrated to lead to impressive results, with the vectors being positioned in space relative to their meaning and with their relationships correctly captured. That means that we can linearize the space of meaning and thus obtain vectorial relationships such as $\vec{Paris} - \vec{France} + \vec{England} = \vec{London}$ (although this is a classic example, not everything works as well). A very clear explanation of the Word2Vec algorithm has been written by Xin Rong (2016).
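To see the difference between the two architectures, it helps to look at the training examples each one extracts from a sentence. The sketch below only builds those examples (the function names are mine); it does not train anything, and real implementations add tricks such as subsampling and negative sampling that are left out here.

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: the center word predicts each neighbour."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def cbow_examples(tokens, window=2):
    """(context, center) examples: the neighbours jointly predict the center."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples

sentence = "the cat sat on the mat".split()
sg = skipgram_pairs(sentence, window=1)
cbow = cbow_examples(sentence, window=1)
```

With a window of 1, skip-gram produces one pair per (word, neighbour) couple, such as `("cat", "the")` and `("cat", "sat")`, while CBOW produces a single example per position, such as `(["the", "sat"], "cat")`.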

However, a problem with these approaches is that they rely on a dictionary of words, so out-of-vocabulary words such as misspellings get a generic representation. Moreover, each word has one and only one associated embedding, regardless of its context. This matters because the same word can have different meanings depending on its context: for example, Apple can be both a company and a fruit. It will also probably lead to Apples being largely unrelated to Apple, because the plural refers exclusively to the fruit. In problems such as information extraction, that is a major issue because the content consists mostly of names that are non-standard words and can evolve over time. Besides, closely related words such as even and uneven should be close in the feature space, which is not guaranteed by these methods. That is why the focus has recently shifted to models that work directly on characters, which mostly solve these issues. Examples can be found in Ling et al. (2015) and Lample et al. (2016) with LSTMs, or in Santos et al. (2014), Chiu et al. (2015) and Kim et al. (2015) with Convolutional Networks. There are other techniques to help with this problem, such as using Named Entity Recognition to separate common nouns and proper nouns beforehand (see V).
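A quick way to get an intuition of why characters help: the sketch below splits words into character n-grams, in the spirit of the fastText subwords mentioned in section II. To be clear, this is not what the LSTM or CNN papers above do, and the boundary markers, the n-gram sizes and the Jaccard overlap are my own choices for illustration.

```python
def char_ngrams(word, n_min=3, n_max=4):
    """Character n-grams of a word padded with boundary markers < and >."""
    padded = f"<{word}>"
    return {padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)}

def jaccard(a, b):
    """Share of n-grams that two words have in common."""
    return len(a & b) / len(a | b)

even, uneven, apple = (char_ngrams(w) for w in ("even", "uneven", "apple"))
```

Here `jaccard(even, uneven)` is well above `jaccard(even, apple)`, which is zero: a model that builds its representation from such pieces has a chance to place even and uneven close together, and it can still say something about a word it has never seen.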

Although there is a great variety of models, the seminal paper of Levy et al. (2015) shows that the parameters matter more than the model. We learn that Word2Vec models are not necessarily better than Bag of Words on every task, and they give great advice on how to choose hyperparameters for our models.

Representation of documents

Other techniques from Natural Language Processing aim at obtaining distributed, unsupervised representations of phrases, sentences, paragraphs or whole documents instead of limiting the models to words only. This is done with methods similar to those used to get word representations, only with whole sentences or paragraphs as the input.

Li et al. (2015) propose an autoencoder on paragraphs and documents that takes precomputed vector representations of tokens as inputs and treats sentences as sequences of these tokens, using LSTMs (Hochreiter et al. (1997)), as do Zhao et al. (2015). Mikolov et al. (2013) also propose skip-gram representations of phrases in the famous Word2Vec paper, as an extension of the treatment of words. But most importantly, Mikolov et al. (2014) propose Doc2Vec, an extension of Word2Vec to whole documents that uses a document embedding as an additional context vector in CBOW or skip-gram.

For a great review of word embeddings, check http://ruder.io/word-embeddings-1/

IV. Other tasks in NLP

Of course, NLP isn’t limited to sentiment analysis. Let’s explain a few other tasks here:

• Machine Translation: Sutskever et al. (2014) launched Seq2seq for this problem, and the results have kept improving ever since. Bahdanau et al. (2015) made a major contribution by introducing Attention Models in this paper. A recent and fascinating paper is Conneau et al. (2018), which gets state-of-the-art results in translation without parallel data.
• Named Entity Recognition (NER): As Wikipedia puts it, the goal is to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, etc. Two recent papers are Chiu et al. (2016) and Lample et al. (2016), which both use character-based networks.
• Part-of-Speech Tagging: Basically, finding the grammatical nature of each word (noun, verb…). It can also be extended to other “grammars”, that is, finding the function of each element of a “sentence”. My first paper deals with this particular problem, using both character and context embeddings for each individual token. Santos et al. (2014) also successfully use character-based networks for this task.
• Relation Extraction: This is the act of linking together different named entities (Paris is a city, but also the capital of France). This is a paramount task when it comes to text understanding or information extraction. Interesting papers are Mintz et al. (2009), which uses distant supervision, and Nguyen et al. (2015), which relies on CNNs.
• And many others…

V. Data Preprocessing

Before any data representation, we must clean the data. That means removing words that are too frequent to be informative (a, the, it…), called stop words, as well as URLs and everything that is not a letter. We also standardize all text to lowercase. scikit-learn can handle that itself: for example, the TfidfVectorizer class has a stop_words parameter.
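If you want to see what this cleaning looks like by hand, here is a minimal sketch using only the standard library. The stop list is a tiny hypothetical one, the URL pattern is deliberately simple, and the "not a letter" rule assumes plain English text (accented letters would be dropped too).

```python
import re

# Tiny hypothetical stop list for illustration; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "it", "is", "and", "of", "to", "this"}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
NON_LETTER_PATTERN = re.compile(r"[^a-z\s]")

def clean(text):
    """Lowercase, drop URLs and non-letters, then remove stop words."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)         # URLs first, before punctuation is stripped
    text = NON_LETTER_PATTERN.sub(" ", text)  # digits, punctuation, symbols
    return [token for token in text.split() if token not in STOP_WORDS]

tokens = clean("This is THE best movie of 2018!!! See http://example.com/review")
```

On this example, `tokens` is `["best", "movie", "see"]`. The order of the steps matters: removing URLs after stripping punctuation would leave pieces such as "http" and "example" in the text.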

Then, you can use the SpaCy library to perform Part-of-Speech tagging. It lets us keep only adjectives, verbs and adverbs in order to anonymize the data, and in particular to avoid using nouns.

For further generalization, we lemmatize the text using SpaCy, which means that we only keep the base form of each word (its lemma), for example the infinitive of a verb.

We can then create a bag of words based on this preprocessed text. You can find this process, as well as an interesting survey of sentiment analysis, in Medhat et al. (2014).

We can’t use all the words of the vocabulary to create the BoW, for memory and overfitting reasons, so we focus on the n most frequent ones that are not stop words.
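Restricting the vocabulary to the n most frequent words is a one-liner with `collections.Counter`. The sketch below assumes documents that are already tokenized and cleaned; the function names and the toy corpus are mine.

```python
from collections import Counter

def build_vocabulary(tokenized_docs, n, stop_words=frozenset()):
    """Keep the n most frequent tokens of the corpus that are not stop words."""
    counts = Counter(token
                     for doc in tokenized_docs
                     for token in doc
                     if token not in stop_words)
    return [token for token, _ in counts.most_common(n)]

def vectorize(doc, vocabulary):
    """Count vector of one tokenized document over the fixed vocabulary."""
    counts = Counter(doc)
    return [counts[token] for token in vocabulary]

# Toy corpus, invented for illustration.
corpus = [["good", "movie", "good", "plot"],
          ["bad", "movie", "the", "end"],
          ["good", "the", "acting"]]
vocabulary = build_vocabulary(corpus, n=2, stop_words={"the"})
vectors = [vectorize(doc, vocabulary) for doc in corpus]
```

Words outside the vocabulary are simply ignored at vectorization time, so every document ends up with a vector of the same size n.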

VI. Models

As said earlier, nearly all machine learning classification models can be used.

The most common ones, which have stood the test of time, are:

• Naive Bayes Classifiers (wiki), in particular Multinomial Naive Bayes (MNB)
• Tree Based Classifiers, such as Random Forests. We use XGBoost, an efficient gradient-boosted trees library
• ElasticNet

As well as more recent models:

• Unsupervised classification using Word2Vec
• fastText
• Ensemble models

You can use Gensim for Word2Vec, FastText and even Doc2Vec embeddings.

Don’t forget that most of these models have parameters to choose: first of all the number of words used for the Bag of Words, then the model-specific parameters (remember Levy et al. (2015)).
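As a starting point, here is what a Bag of Words plus Multinomial Naive Bayes classifier can look like with scikit-learn. This is a minimal sketch on a toy dataset that I made up; the value of `max_features` (the number of words kept for the Bag of Words) is arbitrary and should be tuned on real data, like the model-specific parameters.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training set, invented for illustration (1 = positive, 0 = negative).
train_texts = ["a great and moving film",
               "great acting, great story",
               "a boring and terrible film",
               "terrible acting, boring story"]
train_labels = [1, 1, 0, 0]

# CountVectorizer lowercases, removes English stop words and keeps at most
# max_features words; MultinomialNB is then fitted on the resulting counts.
model = make_pipeline(
    CountVectorizer(stop_words="english", max_features=1000),
    MultinomialNB(),
)
model.fit(train_texts, train_labels)
predictions = model.predict(["great film", "boring film"])
```

Swapping `CountVectorizer` for `TfidfVectorizer`, or `MultinomialNB` for another classifier, only changes one line, which makes this kind of pipeline convenient for comparing the models listed above.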

VII. Conclusion

In this article, I have tried to present many papers and basic techniques related to Natural Language Processing. Some really are classics, some are more obscure, but you have to read a lot to find exactly what suits your needs.

And most of all, implement and experiment as much as you can, so that you really understand what you read. In particular, char-CNN by Zhang and LeCun (2015) is very simple to implement and can be a great introduction to neural networks for beginners. You can find a lot of datasets on Wikipedia.

Voila! I hope this little guide will help you. I may have missed some important papers, so do not hesitate to add them in the comments!
