
Wednesday, June 25, 2008

NLP Process overview

NLP is not simply a set of applications; it is divided into technical methods and theories. The tools (technical methods) are always applied in layers: each layer receives as input the results of the processing performed by the previous layer.
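The layering idea can be sketched in a few lines of Python. This is only an illustration: the three layer functions and the `run_pipeline` helper are made-up names, and real layers would do far more than split and lowercase.

```python
def tokenize(text):
    """Layer 1: split raw text into tokens on whitespace."""
    return text.split()


def lowercase(tokens):
    """Layer 2: normalize the tokens produced by layer 1."""
    return [token.lower() for token in tokens]


def tag_length(tokens):
    """Layer 3: annotate each token (here, simply with its length)."""
    return [(token, len(token)) for token in tokens]


def run_pipeline(text, layers):
    """Feed the output of each layer into the next one, in order."""
    result = text
    for layer in layers:
        result = layer(result)
    return result


pipeline_result = run_pipeline("Dog barks", [tokenize, lowercase, tag_length])
print(pipeline_result)
```

The order of the list matters: each function only works on the shape of data that the layer before it produces.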

Main tools
Tokenizer: Segments the original text into words (tokens), sentences, paragraphs, etc. It typically includes the named entity extractor in order to resolve punctuation ambiguities such as “Mr. Burns”, which violates the rule that a period followed by whitespace and a capital letter ends a sentence.
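A minimal sketch of the “Mr. Burns” problem, assuming a small hand-written abbreviation list stands in for a real named entity extractor. `KNOWN_ABBREVIATIONS` and `split_sentences` are hypothetical names used only for this example.

```python
# Hypothetical stand-in for the named entity extractor: titles that end
# in a period but do not end a sentence.
KNOWN_ABBREVIATIONS = {"Mr.", "Mrs.", "Dr."}


def split_sentences(text):
    """Split on 'period + whitespace + capital', except after known abbreviations."""
    words = text.split()
    sentences = []
    current = []
    for index, word in enumerate(words):
        current.append(word)
        next_word = words[index + 1] if index + 1 < len(words) else None
        ends_with_period = word.endswith(".")
        next_is_capitalized = next_word is None or next_word[0].isupper()
        if ends_with_period and next_is_capitalized and word not in KNOWN_ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep trailing words that were not closed by a period.
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("Mr. Burns left. He was late."))
```

Without the abbreviation check, the naive rule would cut the first sentence right after “Mr.”.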

Named entity extractor: Detects and annotates or extracts named entities such as names of people, organizations, geographical locations, dates, numbers, currencies, measures, etc. This helps avoid ambiguities such as the “Mr. Burns” case described under the tokenizer.
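As a rough illustration, a pattern-based extractor can be written with regular expressions. The three patterns below are assumptions chosen for the example (ISO-style dates, dollar amounts, titled names); real extractors rely on much richer rules, gazetteers, or statistical models.

```python
import re

# Hypothetical, deliberately tiny pattern set.
ENTITY_PATTERNS = [
    ("DATE", re.compile(r"\b\d{4}-\d{2}-\d{2}\b")),
    ("MONEY", re.compile(r"\$\d+(?:\.\d{2})?")),
    ("PERSON", re.compile(r"\b(?:Mr|Mrs|Dr)\. [A-Z][a-z]+")),
]


def extract_entities(text):
    """Return (label, matched text) pairs in order of appearance."""
    found = []
    for label, pattern in ENTITY_PATTERNS:
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    # Sort by start offset so the output follows the text order.
    return [(label, value) for _, label, value in sorted(found)]


print(extract_entities("Mr. Burns paid $20.50 on 2008-06-25."))
```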

Morphological parser: Performs morphological analysis of inflected words in order to determine their stem. It can use either a stemming algorithm (such as Porter’s) that automatically strips suffixes from words, or a lexicon.
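Both strategies can be sketched together. Note that this is not Porter's algorithm, only a toy suffix stripper with a lexicon lookup in front of it; `LEXICON`, `SUFFIXES` and `find_stem` are names invented for the example.

```python
# Hypothetical lexicon of irregular forms that no suffix rule can handle.
LEXICON = {"went": "go", "mice": "mouse"}

# Suffixes are tried in this order, longest first.
SUFFIXES = ["ing", "ed", "es", "s"]


def find_stem(word, min_stem_length=3):
    """Look the word up in the lexicon, else strip one known suffix."""
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    for suffix in SUFFIXES:
        # Refuse to strip if the remaining stem would be too short.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word


for example in ["barks", "walked", "jumping", "went", "dog"]:
    print(example, "->", find_stem(example))
```

The minimum stem length is a crude guard against over-stripping short words; Porter's algorithm uses a more careful measure for the same purpose.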

Part‑of‑speech tagger: Annotates text with labels (tags) containing morphosyntactic information, such as part of speech, gender, tense, etc. (e.g. “dog barks” becomes “dog/nn barks/vb”). It is often combined with the morphological parser.
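The “dog/nn barks/vb” notation can be reproduced with a simple lexicon lookup. This is a sketch under strong assumptions: `TAG_LEXICON` is a made-up three-word dictionary, and unknown words simply fall back to a default tag, whereas real taggers use context.

```python
# Hypothetical word-to-tag lexicon.
TAG_LEXICON = {"the": "dt", "dog": "nn", "barks": "vb"}


def tag_tokens(tokens, default_tag="nn"):
    """Attach a tag to every token; unknown words get the default tag."""
    return [(token, TAG_LEXICON.get(token.lower(), default_tag)) for token in tokens]


def format_tagged(tagged):
    """Render (word, tag) pairs in the word/tag notation."""
    return " ".join(f"{word}/{tag}" for word, tag in tagged)


print(format_tagged(tag_tokens(["dog", "barks"])))
```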

Morphological disambiguator: Resolves ambiguous morphological and morphosyntactic interpretations assigned by the morphological parser and part‑of‑speech tagger. These ambiguities occur when a word form may belong to several base forms, like “saw”, which may be either the verb meaning to cut or the past tense of “see”. Morphological disambiguation may be either statistical or rule-based.
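A rule-based disambiguator for the “saw” example might look like the sketch below. The single rule (a preceding “to” or modal verb selects the base form, anything else selects the past tense of “see”) is an assumption made for illustration, as are the names `AMBIGUOUS_FORMS`, `BASE_FORM_TRIGGERS` and `disambiguate`.

```python
# Each ambiguous word form maps to its candidate (base form, tag) readings.
AMBIGUOUS_FORMS = {"saw": [("saw", "vb"), ("see", "vbd")]}

# Hypothetical rule: after these words a verb appears in its base form.
BASE_FORM_TRIGGERS = {"to", "will", "can", "must"}


def disambiguate(tokens):
    """Return (token, base form, tag); unambiguous tokens get None, None."""
    resolved = []
    for index, token in enumerate(tokens):
        candidates = AMBIGUOUS_FORMS.get(token.lower())
        if candidates is None:
            resolved.append((token, None, None))
            continue
        previous = tokens[index - 1].lower() if index > 0 else ""
        wanted_tag = "vb" if previous in BASE_FORM_TRIGGERS else "vbd"
        base_form, tag = next(c for c in candidates if c[1] == wanted_tag)
        resolved.append((token, base_form, tag))
    return resolved


print(disambiguate(["I", "saw", "it"]))
print(disambiguate(["to", "saw", "wood"]))
```

A statistical disambiguator would replace the hand-written rule with probabilities estimated from an annotated corpus.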

Chunk parser: Groups words into syntagms (chunks) of a particular type, such as noun phrases, verb phrases, etc. This grouping is done by applying numerous, often complex, generative rules that describe how these phrases are formed.
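To make the idea of a chunking rule concrete, here is a sketch with exactly one rule: a noun phrase is an optional determiner, any number of adjectives, and at least one noun. The tag names (`dt`, `jj`, `nn`) and the function `chunk_noun_phrases` are assumptions for this example; real chunkers combine many such rules.

```python
def chunk_noun_phrases(tagged):
    """Group (word, tag) pairs into NP chunks; other words are labeled 'O'."""
    chunks = []
    i = 0
    n = len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "dt":  # optional determiner
            j += 1
        while j < n and tagged[j][1] == "jj":  # any number of adjectives
            j += 1
        k = j
        while k < n and tagged[k][1] == "nn":  # one or more nouns
            k += 1
        if k > j:
            # The rule matched: emit the whole span as one noun phrase.
            chunks.append(("NP", [word for word, _ in tagged[i:k]]))
            i = k
        else:
            # No noun found: leave this word outside any chunk.
            chunks.append(("O", [tagged[i][0]]))
            i += 1
    return chunks


sentence = [("the", "dt"), ("big", "jj"), ("dog", "nn"), ("barks", "vb")]
print(chunk_noun_phrases(sentence))
```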

Shallow parser: Uses the chunk parser’s output to determine syntactic roles in a sentence (such as subject, predicate, object, etc.) and to divide complex sentences into clauses.
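Role assignment over chunks can be sketched as follows, assuming a simple active subject-verb-object English clause and chunk labels `NP` and `VP`. The function name `assign_roles` and the positional heuristic are illustrative only and would fail on passives, questions, or multi-clause sentences.

```python
def assign_roles(chunks):
    """Label chunks by position: NP before the verb is the subject, after it the object."""
    roles = []
    seen_verb = False
    for label, words in chunks:
        if label == "VP":
            roles.append(("predicate", words))
            seen_verb = True
        elif label == "NP":
            roles.append(("object" if seen_verb else "subject", words))
        else:
            roles.append(("other", words))
    return roles


chunks = [("NP", ["the", "dog"]), ("VP", ["chased"]), ("NP", ["the", "cat"])]
print(assign_roles(chunks))
```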

Language Resource (LR): Refers to data-only resources such as lexicons, corpora, thesauri or ontologies. http://www.proxem.com/Resources/tabid/54/Default.aspx#ress1 has one of the most complete LR collections I have ever seen.