When analyzing a knowledge community, a key step is to identify the terms most pertinent to the domain, whether you are dealing with a media, academic, or blog corpus.
Automatic multi-term extraction is a classic NLP task, but it still requires fine-tuning because one needs to resolve the trade-off between the frequency of multi-terms and their specificity.
This presentation (lexical analysis presentation, MIT, May 2011) sums up the criteria that a lexical extraction over a textual corpus should meet:
- grammatical criterion: candidate terms are usually limited to noun phrases, which requires POS-tagging and defining salient grammatical patterns (chunking),
- unithood: phrases should represent a proper semantic unit; the C-value algorithm extracts multi-terms that are both reasonably frequent and not merely nested in longer terms,
- termhood: terms should be domain-specific in order to carry substantial information; a specificity measure can be computed to estimate how strongly a term's co-occurrence distribution is biased toward certain topics.
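To make the unithood criterion more concrete, here is a minimal sketch of a simplified C-value score in Python. It is not the code used for the extraction described here: the function names (`is_nested`, `c_value`), the toy frequencies, and the choice to represent terms as tuples of words are all assumptions made for illustration. It also assumes the candidate noun phrases have already been extracted and counted, and that each count includes the occurrences nested inside longer candidates.

```python
import math

def is_nested(short, long):
    """True if the word tuple `short` occurs contiguously inside the longer tuple `long`."""
    n = len(short)
    return n < len(long) and any(
        long[i:i + n] == short for i in range(len(long) - n + 1)
    )

def c_value(freq):
    """Simplified C-value for each candidate term.

    `freq` maps a term (tuple of words) to its total frequency in the corpus.
    A term that is nested in longer candidates has the mean frequency of
    those longer candidates subtracted from its own frequency; the result
    is weighted by log2 of the term length in words.
    """
    scores = {}
    for term, f in freq.items():
        # Frequencies of the longer candidates that contain this term
        containers = [f_b for other, f_b in freq.items() if is_nested(term, other)]
        if containers:
            f = f - sum(containers) / len(containers)
        scores[term] = math.log2(len(term)) * f
    return scores

# Hypothetical counts, invented for the example
freq = {
    ("fuel", "cell"): 10,
    ("solid", "oxide", "fuel", "cell"): 4,
    ("oxide", "fuel"): 4,
}

scores = c_value(freq)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(" ".join(term), round(score, 2))
```

In this toy example, "oxide fuel" never appears outside "solid oxide fuel cell", so its score drops to zero, while "fuel cell" keeps a positive score because it is also used on its own. Note that the log2 weighting gives single-word terms a score of zero, which is acceptable here since the goal is multi-term extraction.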
Below is a semantic map built from a lexical extraction based on European patents in 2010 (thanks to Patricia Laurens for the expert check of the list):