
Wednesday, June 25, 2008

NLP Process overview

NLP is more than a set of applications; it comprises technical methods and theories. The tools (technical methods) are typically applied in layers: each layer takes as its input the results of the processing performed by the previous layer.
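The layering idea can be sketched in a few lines of Python. This is only an illustration: run_pipeline, split_words and lowercase are hypothetical names made up for this example, not part of any NLP library.

```python
from functools import reduce

def run_pipeline(text, layers):
    """Feed the output of each layer into the next one, in order."""
    return reduce(lambda data, layer: layer(data), layers, text)

# Hypothetical toy layers, for illustration only.
def split_words(text):
    return text.split()

def lowercase(tokens):
    return [token.lower() for token in tokens]

result = run_pipeline("Dogs bark", [split_words, lowercase])
print(result)
```

A real pipeline would chain the tools listed below in the same way, with richer data passed between layers.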

Main tools
Tokenizer: Segments the original text into words (tokens), sentences, paragraphs, etc. It typically relies on the named entity extractor to resolve punctuation ambiguities such as “Mr. Burns” (which breaks the rule that a period followed by whitespace and a capital letter ends a sentence).
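A minimal sketch of the “Mr. Burns” problem, assuming a tiny hand-written abbreviation list stands in for the named entity extractor. ABBREVIATIONS, split_sentences and tokenize are hypothetical names chosen for this example.

```python
import re

# Hypothetical, deliberately tiny abbreviation list; a real system would rely
# on a named entity extractor or a much larger lexicon.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr."}

def split_sentences(text):
    """Split on a period followed by whitespace and a capital letter,
    unless the period belongs to a known abbreviation."""
    sentences = []
    start = 0
    for match in re.finditer(r"\.\s+(?=[A-Z])", text):
        candidate = text[start:match.start() + 1]  # includes the period
        last_word = candidate.split()[-1]
        if last_word in ABBREVIATIONS:
            continue  # not a sentence boundary
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

def tokenize(sentence):
    """Split a sentence into word and punctuation tokens,
    keeping known abbreviations whole."""
    tokens = []
    for word in sentence.split():
        if word in ABBREVIATIONS:
            tokens.append(word)
        else:
            tokens.extend(re.findall(r"\w+|[^\w\s]", word))
    return tokens

sentences = split_sentences("Mr. Burns owns the plant. He is rich.")
print(sentences)
print(tokenize(sentences[0]))
```

Without the abbreviation check, the naive rule would cut the text after “Mr.”.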

Named entity extractor: Detects and annotates or extracts named entities such as the names of people, organizations, and geographical locations, as well as dates, numbers, currencies, measures, etc. This helps avoid ambiguities such as the one mentioned above.
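As a rough illustration, a pattern-based extractor could look like the sketch below. The patterns and labels (PERSON, MONEY, DATE) are assumptions made for this example; real extractors usually combine gazetteers, rules and statistical models.

```python
import re

MONTHS = (r"(?:January|February|March|April|May|June|July|August|"
          r"September|October|November|December)")

# Hypothetical patterns, far from complete.
PATTERNS = [
    ("PERSON", re.compile(r"\b(?:Mr|Mrs|Dr)\.\s+[A-Z][a-z]+")),
    ("MONEY", re.compile(r"\$\d+(?:\.\d+)?")),
    ("DATE", re.compile(r"\b" + MONTHS + r"\s+\d{1,2},\s+\d{4}\b")),
]

def extract_entities(text):
    """Return (entity text, label) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), label))
    return [(entity, label) for _, entity, label in sorted(found)]

entities = extract_entities("Mr. Burns paid $25.50 on June 25, 2008.")
print(entities)
```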

Morphological parser: Performs morphological analysis of inflected words to determine their stem. It can use either a stemming algorithm (such as Porter’s), which automatically strips inflectional affixes from words, or a lexicon.
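The two strategies (lexicon lookup and affix stripping) can be combined in a toy sketch. Note that this is NOT Porter’s algorithm, just a naive suffix stripper; LEXICON, SUFFIXES and stem are hypothetical names for illustration.

```python
# Hypothetical mini-lexicon for irregular forms.
LEXICON = {"went": "go", "mice": "mouse"}

# Naive suffix list, checked in this order. This is not Porter's algorithm.
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word):
    """Look the word up in the lexicon first, then fall back to
    stripping one suffix if at least three letters remain."""
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for form in ("barks", "walked", "went", "dog"):
    print(form, "->", stem(form))
```

A naive stripper like this over-stems and under-stems easily (“parsing” becomes “pars”), which is why real stemmers apply many more conditions.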

Part‑of‑speech tagger: Annotates text with labels (tags) containing morphosyntactic information such as part of speech, gender, tense, etc. (e.g. “dog barks” becomes “dog/nn barks/vb”). It is often combined with the morphological parser.
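A minimal lookup-based tagger that produces the word/tag notation used in the example above. TAG_LEXICON and the fallback tag are assumptions for this sketch; real taggers are usually statistical or rule-based and take context into account.

```python
# Hypothetical tag lexicon; tags follow the lowercase style of the example.
TAG_LEXICON = {"the": "dt", "dog": "nn", "barks": "vb"}

def tag(tokens, default="nn"):
    """Attach a tag to every token, falling back to a default tag."""
    return [(token, TAG_LEXICON.get(token.lower(), default)) for token in tokens]

def format_tagged(tagged):
    """Render (word, tag) pairs in word/tag notation."""
    return " ".join(f"{word}/{label}" for word, label in tagged)

print(format_tagged(tag(["dog", "barks"])))
```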

Morphological disambiguator: Resolves ambiguous morphological and morphosyntactic interpretations assigned by the morphological parser and part‑of‑speech tagger. These ambiguities arise when a word form may belong to more than one base form, like “saw”, which may be a form of “saw” (to cut) or the past tense of “see”. Morphological disambiguation may be either statistical or rule-based.
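A rule-based disambiguator for the “saw” case might look like the following sketch. The readings, word lists and the two rules are simplifying assumptions invented for this example, and they are easy to fool.

```python
# Hypothetical candidate readings as (base form, tag) pairs.
CANDIDATES = {
    "saw": [("saw", "nn"), ("saw", "vb"), ("see", "vbd")],
}
DETERMINERS = {"a", "an", "the"}
PRONOUNS = {"i", "you", "he", "she", "we", "they"}

def disambiguate(tokens, index):
    """Filter the readings of tokens[index] by looking at the previous word.
    Toy rules: after a determiner keep the noun reading, after a subject
    pronoun keep the past tense reading; otherwise keep everything."""
    word = tokens[index].lower()
    readings = CANDIDATES.get(word, [(word, "unk")])
    previous = tokens[index - 1].lower() if index > 0 else ""
    if previous in DETERMINERS:
        wanted = "nn"
    elif previous in PRONOUNS:
        wanted = "vbd"
    else:
        return readings
    return [r for r in readings if r[1] == wanted] or readings

print(disambiguate(["the", "saw"], 1))
print(disambiguate(["she", "saw", "it"], 1))
```

A statistical disambiguator would instead pick the reading that is most probable given the surrounding tags, as estimated from an annotated corpus.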

Chunk parser: Groups words into syntagms (chunks) of a particular type, such as noun phrases, verb phrases, etc. This grouping is done by applying the numerous and often complex generative rules by which these phrases are formed.
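To make the idea concrete, here is a sketch with a single noun phrase rule, NP = optional determiner, any number of adjectives, one or more nouns. The rule, the tag names (dt, jj, nn) and the "O" label for words outside any chunk are assumptions for this example.

```python
def match_np(tags, i):
    """Return the end index of a noun phrase starting at i, or i if none.
    Rule (assumed for this sketch): NP -> dt? jj* nn+"""
    j = i
    if j < len(tags) and tags[j] == "dt":
        j += 1
    while j < len(tags) and tags[j] == "jj":
        j += 1
    k = j
    while k < len(tags) and tags[k] == "nn":
        k += 1
    return k if k > j else i

def chunk(tagged):
    """Group (word, tag) pairs into ("NP", words) chunks; every other
    word becomes a one-word ("O", [word]) chunk."""
    tags = [label for _, label in tagged]
    chunks = []
    i = 0
    while i < len(tagged):
        end = match_np(tags, i)
        if end > i:
            chunks.append(("NP", [word for word, _ in tagged[i:end]]))
            i = end
        else:
            chunks.append(("O", [tagged[i][0]]))
            i += 1
    return chunks

tagged = [("the", "dt"), ("big", "jj"), ("dog", "nn"), ("barks", "vb")]
print(chunk(tagged))
```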

Shallow parser: Uses the chunk‑parsing output to determine syntactic roles in a sentence (such as subject, predicate, object, etc.) and to divide complex sentences into clauses.
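A very rough sketch of role assignment on top of chunker output, assuming a simple English subject-verb-object clause and chunks labelled "NP" and "VP". The function name and the positional heuristic are assumptions for illustration; clause splitting is not covered.

```python
def assign_roles(chunks):
    """Assign roles by position: the first NP before the verb phrase is the
    subject, the first VP is the predicate, the first NP after it is the
    object. Only meant for simple subject-verb-object clauses."""
    roles = {}
    seen_verb = False
    for label, words in chunks:
        text = " ".join(words)
        if label == "VP" and not seen_verb:
            roles["predicate"] = text
            seen_verb = True
        elif label == "NP" and not seen_verb and "subject" not in roles:
            roles["subject"] = text
        elif label == "NP" and seen_verb and "object" not in roles:
            roles["object"] = text
    return roles

chunks = [("NP", ["the", "dog"]), ("VP", ["chased"]), ("NP", ["a", "cat"])]
print(assign_roles(chunks))
```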

Language Resource (LR): Refers to data-only resources such as lexicons, corpora, thesauri, or ontologies. http://www.proxem.com/Resources/tabid/54/Default.aspx#ress1 has one of the most complete collections of LRs I have ever seen.

Tuesday, May 13, 2008

Eeyore

Eeyore, the gloomy donkey

Piglet was walking through the flowers when Eeyore said, “Be careful, someone might not see you because of your size, and you might end up killed by a van that just went for a stroll.”

There is something in each of us that wants us to be unhappy. It creates problems in our imagination that don’t yet exist, quite often causing them to come true. It exaggerates problems that are already there. It reinforces low self-esteem and low respect for others. It destroys pride in workmanship, order and cleanliness. It turns meetings into confrontations, opportunities into dangers, stepping stones into stumbling blocks. Its grimaces and frowns, which pull the muscles of the face forward and down, speed the aging process. It contaminates the mind behind the face and spreads outward like a disease.

There is a way to overcome the Eeyore effect within, and thereby begin to counteract its effects. The most important thing to know is that you need to be prepared for a very long and difficult journey.