[Paper-NLP] URIEL and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors
Last Updated: 2020-06-24
This paper, “URIEL and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors”, was written by researchers from CMU and the University of Pittsburgh and accepted at EACL 2017. It is recommended because it introduces lang2vec, a source of information about languages that helps multilingual NLP research.
In this paper, the authors introduced the URIEL knowledge base for massively multilingual NLP and the lang2vec utility which provides information-rich vector identifications of languages drawn from typological, geographical, and phylogenetic databases that are normalized to have straightforward and consistent formats, naming, and semantics.
lang2vec features primarily represent binary language facts (e.g., that negation precedes the verb or is represented as a suffix, that the language is part of the Germanic family, etc.) and are sourced and predicted from a variety of linguistic resources including WALS (Dryer and Haspelmath, 2013), PHOIBLE (Moran et al., 2014), Ethnologue (Lewis et al., 2015), and Glottolog (Hammarström et al., 2015).
lang2vec takes as its input a list of ISO 639-3 codes and outputs a matrix of [0.0, 1.0] feature values (like those in Table 1):
The recent success of “polyglot” models (Hermann and Blunsom, 2014; Faruqui and Dyer, 2014; Ammar et al., 2016; Tsvetkov et al., 2016; Daiber et al., 2016), in which a language model is trained on multiple languages and shares representations across languages, represents a promising avenue for NLP, especially for less-resourced languages, as these models appear to be able to learn useful patterns from better-resourced languages even when training data in the target language is limited.
Tsvetkov et al. (2016) show that vectors that represent information about the language outperform a simple “one-hot” representation where each language is represented by a 1 in a single dimension. Sample results from Tsvetkov et al. (2016) are reproduced in Table 2.
We can see that training on a set of three similar languages, and a set of four similar and dissimilar languages, raises perplexity above the baseline monolingual model, even when the language is identified to the model by a one-hot (id) vector. However, perplexity is lowered by the introduction of phonological feature vectors for each language (the phonology and inventory vector types described in §3.1), giving consistently lower perplexity than even the monolingual baseline.
The initial motivation for the URIEL knowledge base and the lang2vec utility is to make such research easier, allowing different sources of information to be easily used together or as different experimental conditions (e.g., is it better to provide this model information about the syntactic features of the language, or the phylogenetic relationships between the languages?). Standardizing the use of this kind of information also makes it easier to replicate and expand on previous work, without needing to know how the authors processed, for example, WALS feature classes or PHOIBLE inventories into model input.
3. Vector types
General composition: binary vectors
lang2vec offers a variety of vector representations of languages, of different types and derived from different sources, but all reporting feature values between 0.0 (generally representing the absence of a phenomenon or non-membership in a class) and 1.0 (generally representing the presence of a phenomenon or membership in a class). This normalization makes vectors from different sources more easily interchangeable and more easily predictable from each other (§4).
Different features are not mutually exclusive
As in SSWL (Collins and Kayne, 2011), different features are not held to be mutually exclusive; the features S_SVO and S_SOV can both be 1 if both orders are normally encountered in the language.
Phylogeny, geography, and identity vectors are complete—they have no missing values.
The typological features (syntax, phonology, and inventory) have missing values, reflecting the coverage of the original sources; missing values are represented in the output as “–”. Predicted typological vectors (§4) attempt to impute these values based on related, neighboring, and typologically similar languages.
All vectors within the syntax, phonology, and inventory categories have the same dimensionality as other types of vectors in the same category, even though the sources themselves may only represent a subset of these values, to allow straightforward element-wise comparison of values.
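The conventions above (shared dimensionality, features that are not mutually exclusive, and missing values) can be sketched in a few lines of Python. This is only an illustration: the feature names and values are made up rather than taken from URIEL, and missing values are modeled as None instead of the “–” that lang2vec prints.

```python
# Illustrative feature order shared by every vector in the category.
FEATURES = ["S_SVO", "S_SOV", "S_VSO"]

# Made-up values; None marks a value that the source does not cover.
vectors = {
    "lang_a": [1.0, 1.0, 0.0],   # both SVO and SOV are marked as present
    "lang_b": [1.0, None, 0.0],  # S_SOV is missing for this language
}

def agreement(u, v):
    """Fraction of features on which two vectors agree, skipping missing ones."""
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    if not pairs:
        return None  # no feature is known for both languages
    return sum(a == b for a, b in pairs) / len(pairs)

shared = agreement(vectors["lang_a"], vectors["lang_b"])
```

Because every vector in a category uses the same dimensions, the comparison is a plain element-wise walk; only the missing entries need special handling.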
3.1. Typological vectors
The syntax features are adapted (after conversion to binary features) from the World Atlas of Language Structures (WALS) (Dryer and Haspelmath, 2013), directly from Syntactic Structures of World Languages (Collins and Kayne, 2011) (whose features are already binary), and indirectly by text-mining the short prose descriptions on typological features in Ethnologue (Lewis et al., 2015).
The phonology features are adapted in the same manner from WALS and Ethnologue.
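The “conversion to binary features” step can be pictured as follows. This is a hedged sketch and not the authors' actual conversion, which is not described in this summary: the category list, the `S_` name prefix, and the `binarize` helper are all assumptions made for illustration.

```python
# Hypothetical category list for a word-order feature (illustrative only).
WORD_ORDERS = ["SOV", "SVO", "VSO"]

def binarize(observed, categories=WORD_ORDERS):
    """Turn a set of observed categories into one 0.0/1.0 value per category.

    `observed` is a set so that more than one category can hold at once;
    None means the source has no entry, so every value stays missing.
    """
    if observed is None:
        return {f"S_{c}": None for c in categories}
    return {f"S_{c}": 1.0 if c in observed else 0.0 for c in categories}

both_orders = binarize({"SVO", "SOV"})
no_entry = binarize(None)
```

Representing the observation as a set keeps the binary features from being mutually exclusive, in line with the SSWL convention mentioned earlier.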
The phonetic inventory features are adapted from the PHOIBLE database, itself a collection and normalization of seven phonological databases (Moran et al., 2014; Chanard, 2006; Crothers et al., 1979; Hartell, 1993; Michael et al., 2012; Maddieson and Precoda, 1990; Ramaswami, 1999). The PHOIBLE-based features in lang2vec primarily represent the presence or absence of natural classes of features (e.g., interdental fricatives, voiced uvulars, etc.), with 1 representing the presence of at least one sound of that class and 0 representing absence. They are derived from PHOIBLE’s phonetic inventories by extracting each segment’s articulatory features using the PanPhon* feature extractor (Mortensen et al., 2016), and using these features to determine the presence or absence of the relevant natural classes.
* About PanPhon: https://github.com/dmort27/panphon
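The natural-class test described above can be sketched as below. The segment table is a small hand-written stand-in, not PanPhon output, and the class definitions and the `inventory_vector` helper are hypothetical names chosen for this example.

```python
# Toy articulatory features per segment (hand-written, not PanPhon output).
SEGMENT_FEATURES = {
    "θ": {"place": "interdental", "manner": "fricative", "voiced": False},
    "ð": {"place": "interdental", "manner": "fricative", "voiced": True},
    "ɢ": {"place": "uvular", "manner": "stop", "voiced": True},
    "p": {"place": "bilabial", "manner": "stop", "voiced": False},
    "s": {"place": "alveolar", "manner": "fricative", "voiced": False},
}

# Each natural class is the set of feature values a segment must have.
NATURAL_CLASSES = {
    "interdental_fricatives": {"place": "interdental", "manner": "fricative"},
    "voiced_uvulars": {"place": "uvular", "voiced": True},
}

def inventory_vector(inventory):
    """1.0 if at least one segment in the inventory belongs to the class, else 0.0."""
    vector = {}
    for name, required in NATURAL_CLASSES.items():
        present = any(
            all(SEGMENT_FEATURES[seg].get(k) == v for k, v in required.items())
            for seg in inventory
        )
        vector[name] = 1.0 if present else 0.0
    return vector

example = inventory_vector(["p", "s", "θ"])
```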
3.2. Phylogeny vectors
The fam vectors express shared membership in language families, according to the world language family tree in Glottolog (Hammarström et al., 2015). Each dimension represents a language family or branch thereof (such as “Indo-European” or “West Germanic” in Table 4).
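A fam-style vector can be illustrated with a tiny hand-written family tree. The ancestry lists below are simplified for the example and are not read from Glottolog; `ANCESTORS`, `DIMENSIONS`, and `fam_vector` are hypothetical names.

```python
# Simplified ancestry lists for illustration; a real tree would come from Glottolog.
ANCESTORS = {
    "eng": ["Indo-European", "Germanic", "West Germanic"],
    "deu": ["Indo-European", "Germanic", "West Germanic"],
    "fra": ["Indo-European", "Italic", "Romance"],
    "fin": ["Uralic", "Finnic"],
}

# One dimension per family or branch, in a fixed order.
DIMENSIONS = sorted({node for nodes in ANCESTORS.values() for node in nodes})

def fam_vector(lang):
    """1.0 for every family or branch the language belongs to, 0.0 otherwise."""
    members = set(ANCESTORS[lang])
    return [1.0 if dim in members else 0.0 for dim in DIMENSIONS]

def shared_groups(a, b):
    """Number of families or branches that two languages have in common."""
    return int(sum(x * y for x, y in zip(fam_vector(a), fam_vector(b))))
```

With this toy tree, two languages that share more branches have more dimensions in common, so a simple dot product already reflects how closely they are related.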
3.3. Geography vectors
Although another component of URIEL (to be described in a future publication) provides geographical distances between languages, geo vectors express geographical location with a fixed number of dimensions, each dimension representing the same feature even when different sets of languages are considered. Each dimension represents the orthodromic distance (that is, the “great circle” distance) from the language in question to a fixed point on the Earth’s surface. These distances are expressed as a fraction of the Earth’s antipodal distance, so that values will always be between 0.0 (directly at the fixed point) and 1.0 (at the antipode of the fixed point).
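The normalization can be worked out directly: on a sphere, the great-circle distance is the central angle times the radius, and the antipodal distance is π times the radius, so the ratio is simply the central angle divided by π. The sketch below uses the haversine formula on a spherical Earth; the fixed points are placeholders, since the actual points used by URIEL are not listed in this summary.

```python
import math

# Placeholder fixed points as (latitude, longitude) in degrees; not URIEL's actual points.
FIXED_POINTS = [(0.0, 0.0), (0.0, 90.0), (90.0, 0.0)]

def fraction_of_antipodal(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points as a fraction of the antipodal distance."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    # Haversine formula for the central angle between the two points.
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    central_angle = 2 * math.asin(min(1.0, math.sqrt(a)))
    return central_angle / math.pi  # the radius cancels out

def geo_vector(lat, lon, points=FIXED_POINTS):
    """One normalized distance per fixed point, always in the same order."""
    return [fraction_of_antipodal(lat, lon, p_lat, p_lon) for p_lat, p_lon in points]
```

Because the fixed points do not depend on which languages are being compared, the resulting vectors have the same dimensions for any set of languages.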
3.4. Identity vectors
The id vector is simply a one-hot vector identifying each language. These vectors can serve as simple identifiers of languages to a system, serve as the control in an experiment in introducing typological information to a system, as in Tsvetkov et al. (2016), or serve in combination with other vectors (such as fam) that do not always identify a language uniquely.
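For completeness, a one-hot id vector and its combination with another vector might look like this. The language list and the `with_id` helper are assumptions for the example; the extra features passed in are arbitrary numbers.

```python
LANGS = ["eng", "deu", "fra"]  # example language set; the order fixes the dimensions

def id_vector(lang, langs=LANGS):
    """One-hot vector: 1.0 in the position of `lang`, 0.0 elsewhere."""
    return [1.0 if other == lang else 0.0 for other in langs]

def with_id(lang, other_features, langs=LANGS):
    """Concatenate the id vector with another vector that may not be unique per language."""
    return id_vector(lang, langs) + list(other_features)

eng_id = id_vector("eng")
combined = with_id("deu", [1.0, 1.0, 0.0])
```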
4. Feature prediction
One of the major difficulties in using typological features in multilingual processing is that many languages, and many features of individual languages, happen to be missing from the databases.
The authors’ effort toward filling in missing values uses k-nearest neighbors (KNN):
The question of how we can best predict unknown typological features is a larger question (Daumé III and Campbell, 2007; Daumé III, 2009; Coke et al., 2016) than this article can capture in detail, but nonetheless we can offer a preliminary attempt at providing practically useful approximations of missing features by a k-nearest-neighbors approach.
By taking an average of genetic, geographical, and feature distances between languages, and calculating a weighted 10-nearest-neighbors classification, we can predict missing feature values with an accuracy of 92.93% in 10-fold cross-validation.
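The prediction step can be sketched as follows. Only the overall shape comes from the text (average the three distances, then take a weighted vote among the 10 nearest neighbors); the inverse-distance weighting, the 0.5 threshold, and the `predict_feature` name are assumptions, since the summary does not specify them.

```python
def predict_feature(known, distances, k=10):
    """Predict one binary feature for a target language.

    known:     {lang: 0.0 or 1.0} for languages where the feature is attested
    distances: {lang: (genetic, geographic, featural)} distances from the target
    """
    # Average the three distance types into a single distance per language.
    avg = {lang: sum(d) / len(d) for lang, d in distances.items() if lang in known}
    if not avg:
        return None  # nothing to predict from
    neighbors = sorted(avg, key=avg.get)[:k]
    # Assumed weighting: closer neighbors count more (inverse distance).
    weights = {n: 1.0 / (avg[n] + 1e-9) for n in neighbors}
    score = sum(weights[n] * known[n] for n in neighbors) / sum(weights.values())
    return 1.0 if score >= 0.5 else 0.0

# Toy data: two close neighbors have the feature, one distant neighbor does not.
toy_known = {"a": 1.0, "b": 1.0, "c": 0.0}
toy_distances = {"a": (0.1, 0.2, 0.3), "b": (0.2, 0.2, 0.2), "c": (0.9, 0.8, 1.0)}
prediction = predict_feature(toy_known, toy_distances)
```

In the toy data the two nearby languages outweigh the distant one, so the feature is predicted as present.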
While there are many language-information resources available to NLP, their heterogeneity in format, semantics, language naming, and feature naming makes it difficult to combine them, compare them, and use them to predict missing values from each other. lang2vec aims to make cross-source and cross-information-type experiments straightforward by providing standardized, normalized vectors representing a variety of information types.