A metric for textual content

Measuring the meaning of terms from simple statistics, with only very basic linguistic treatment, is certainly a painstaking task that will never perform better than the best hermeneutic work. But if one is only interested in defining a metric between two terms or phrases, given the distribution of their joint appearances in a corpus with other terms and phrases, then the literature is full of dedicated measures.

These measures have mostly been developed for Natural Language Processing tasks such as contextual disambiguation. Other fields such as scientometrics have also proposed to measure a “distance” between linguistic items, suggesting that co-occurrence patterns of terms could be used to map a field of science.

Yet co-word analysis has always used symmetric proximity measures between terms, which is a major drawback for producing maps. Ideally, two words should be close under a proximity measure if they are more or less synonymous. However, language is full of polysemous words, which may introduce ambiguity, so when measuring a semantic distance between two words, every possible meaning of each word should be taken into account. Now, two terms may have overlapping but not exactly equal meanings. Moreover, a word i can be a perfect synonym of another word j in its own contexts while j has different meanings in other contexts. In this case, it is reasonable to think that the proximity from i to j would be very high (as one can easily replace i by j in any sentence where i is mentioned) and that the proximity from j to i would be lower.

This is why a semantic proximity measure should be asymmetric. Such a measure then defines the capacity to replace a linguistic item by another item in its context. In the representation below, the key-phrase “forest management” is surrounded by its closest neighbors under such a proximity measure. They are arranged above and below the target according to their level of genericity. Key-phrases above the target, like “succession” or “timber”, are likely to appear frequently in the set of preferential contexts of “forest management”. This means that when “forest management” is replaced by one of these terms in a sentence, the general meaning of the sentence is unlikely to be deeply altered (at least from a lexical point of view). Accordingly, the distribution of “favorable” contexts of “forest management” includes contexts in which words like “wildfire”, “forest patches”, or “tree species richness” are more likely.

In this example, the corpus consisted of about 10,000 recent publications mentioning the term “sustainable development”. The proximity between two words w1 and w2 is computed from the frequency p(c,w) with which each word w is mentioned along with a context c, using a measure adapted from Weeds and Weir (2005). Another proximity measure is further discussed in:

  • David Chavalarias, Jean-Philippe Cointet (2008). Bottom-up scientific field detection for dynamical and hierarchical science mapping, methodology and case study. Scientometrics 75 (1), 37-50.
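
To make the asymmetry argument above concrete, here is a minimal sketch of one possible asymmetric proximity. It is not the exact measure used for the map: it assumes a simple overlap form in which the proximity from a source word to a target word is the share of the source's co-occurrences that the target also covers, context by context. The function name `proximity`, the context labels `c1` to `c4` and all counts are hypothetical toy values chosen only to reproduce the case of the words i and j discussed earlier.

```python
def proximity(source_counts, target_counts):
    """Asymmetric proximity from a source word to a target word.

    Both arguments map a context to the number of times the word was
    mentioned along with that context. The result is the fraction of the
    source's co-occurrences that is also covered by the target
    (an assumed overlap form, for illustration only).
    """
    total = sum(source_counts.values())
    if total == 0:
        return 0.0  # a word with no observed context is close to nothing
    shared = sum(
        min(n, target_counts.get(context, 0))
        for context, n in source_counts.items()
    )
    return shared / total


# Toy data: word i only appears in contexts where j also appears,
# while j is also used in contexts c3 and c4 that i never enters.
counts_i = {"c1": 4, "c2": 6}
counts_j = {"c1": 5, "c2": 7, "c3": 20, "c4": 8}

print(proximity(counts_i, counts_j))  # high: i can be replaced by j
print(proximity(counts_j, counts_i))  # lower: j often cannot be replaced by i
```

Dividing by the source word's own total is what breaks the symmetry here: the same overlap is compared with a small context set in one direction and with a much larger one in the other.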
