Thursday, March 5, 2009

Introduction to Natural Language Processing

Information overload
It is great to have access to huge amounts of information, but since we do not read any faster than before, we cannot take full advantage of it. Hence the need for a discipline that helps human beings deal with all that data.

Natural language processing (NLP) is the field concerned with building computational models for understanding natural language. It studies the problems of automated generation and understanding of natural human languages. NLP includes natural-language-generation systems, which convert information from computer databases into normal human language, and natural-language-understanding systems, which convert samples of human language into more formal representations that are easier for computer programs to manipulate.

NLP also studies the information contained in human-generated texts, along with their language structure. NLP is a multidisciplinary field that draws on artificial intelligence techniques, multivariate statistics, linguistics and any other domain that can be used to process, generate or interpret language with computers.

Linguistics is the scientific and philosophical study of language, encompassing a number of sub-fields. At the core of theoretical linguistics is the study of language structure (grammar) and the study of meaning (semantics). The first of these encompasses morphology (the formation and composition of words) and syntax (the rules that determine how words combine into phrases and sentences).

A controlled vocabulary is a list of terms that have been enumerated explicitly. This list is controlled by and is available from a controlled vocabulary registration authority. All terms in a controlled vocabulary should have an unambiguous, non-redundant definition.

Named entity recognition is a subtask of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
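To make the idea concrete, here is a minimal sketch of named entity recognition. It assumes a tiny hand-made list of names (a gazetteer) and two regular expressions for money and percentages; all the names, labels and patterns are invented for this illustration. Real systems usually rely on statistical models trained on annotated text.

```python
import re

# Hypothetical gazetteer: known names mapped to their category.
GAZETTEER = {
    "Alice": "PERSON",
    "Acme Corp": "ORGANIZATION",
    "Paris": "LOCATION",
}

# Hypothetical patterns for entities that follow a regular shape.
PATTERNS = [
    ("MONEY", re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?")),
    ("PERCENT", re.compile(r"\d+(?:\.\d+)?%")),
]

def find_entities(text):
    """Locate entities in text and classify them into predefined categories."""
    found = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            found.append((match.start(), match.group(), label))
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), label))
    # Sort by position so entities come out in reading order.
    return [(entity, label) for _, entity, label in sorted(found)]

sentence = "Alice joined Acme Corp in Paris and got a 5% raise worth $1,200."
for entity, label in find_entities(sentence):
    print(entity, "->", label)
```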

A taxonomy is a collection of controlled vocabulary terms organized into a hierarchical (tree-shaped) structure. Each term in the taxonomy is in one or more parent-child relationships. By definition, the child type has the same constraints as its parent type plus one or more additional constraints. For example, car is a child of vehicle, so any car is also a vehicle, but not every vehicle is a car. There are also specific kinds of taxonomies, such as an “enterprise taxonomy”, which contains only terms related to that specific field. Taxonomies are seen as less broad than ontologies because ontologies include logical inference and allow a larger variety of relation types.
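The car and vehicle example can be sketched as a small tree in which every term points to its parent. This is only an illustration: the terms "bicycle" and "sports car" and the function names are made up for the example.

```python
# Toy taxonomy: each term maps to its parent (None marks the root).
# "bicycle" and "sports car" are invented terms for illustration.
PARENT = {
    "vehicle": None,
    "car": "vehicle",
    "bicycle": "vehicle",
    "sports car": "car",
}

def ancestors(term):
    """Return the chain of parents from term up to the root."""
    chain = []
    parent = PARENT[term]
    while parent is not None:
        chain.append(parent)
        parent = PARENT[parent]
    return chain

def is_a(child, ancestor):
    """True if child is the same term as ancestor or descends from it."""
    return child == ancestor or ancestor in ancestors(child)

print(is_a("car", "vehicle"))   # True: any car is also a vehicle
print(is_a("vehicle", "car"))   # False: not every vehicle is a car
```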

An ontology is a formal representation of a set of concepts within a domain and the relationships between those concepts. It is used to reason about the properties of that domain, and may be used to define the domain. Ontologies are a form of knowledge representation.
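One simple way to picture an ontology is as a set of (subject, relation, object) statements plus a rule for reasoning over them. The sketch below assumes three invented relation types ("is_a", "has_part", "operates") and one inference rule, that a concept inherits the parts of its superclasses. Real ontology languages are far richer than this.

```python
# Hypothetical statements about a tiny domain, written as triples.
TRIPLES = {
    ("car", "is_a", "vehicle"),
    ("vehicle", "has_part", "wheel"),
    ("car", "has_part", "engine"),
    ("driver", "operates", "vehicle"),
}

def superclasses(concept):
    """Follow is_a links upward and collect every superclass."""
    found = set()
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for subject, relation, obj in TRIPLES:
            if subject == current and relation == "is_a" and obj not in found:
                found.add(obj)
                frontier.append(obj)
    return found

def parts_of(concept):
    """Parts stated for the concept plus parts inherited through is_a."""
    concepts = {concept} | superclasses(concept)
    return {obj for subject, relation, obj in TRIPLES
            if relation == "has_part" and subject in concepts}

print(sorted(parts_of("car")))      # ['engine', 'wheel']
print(sorted(parts_of("vehicle")))  # ['wheel']
```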

Part-of-speech (POS) tagging is a process whereby tokens are sequentially labeled with syntactic labels, such as "finite verb" or "gerund" or "subordinating conjunction".
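As a rough illustration, the simplest possible tagger just looks each token up in a dictionary. The mini-lexicon below is hypothetical, and it ignores the fact that many words can take different tags depending on context, which is the hard part that real taggers solve with statistical models.

```python
# Hypothetical mini-lexicon mapping a token to a single syntactic label.
LEXICON = {
    "the": "determiner",
    "dog": "noun",
    "barks": "finite verb",
    "because": "subordinating conjunction",
    "it": "pronoun",
    "likes": "finite verb",
    "running": "gerund",
}

def pos_tag(sentence):
    """Label each whitespace-separated token, in order, with its tag."""
    tokens = sentence.lower().split()
    return [(token, LEXICON.get(token, "unknown")) for token in tokens]

for token, tag in pos_tag("The dog barks because it likes running"):
    print(token, "->", tag)
```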

Morphology is the study of the internal structure of words.

The distinction between two senses of "word" is arguably the most important one in morphology. In the first sense, the one in which dog and dogs are "the same word," the word is called a lexeme. In the second sense, each of them is called a word-form. We thus say that dog and dogs are different forms of the same lexeme. The lemma is the form that is chosen by convention as the canonical representative of a lexeme (here, dog). A stemmer reduces word-forms to a common stem (also called root), which often, but not always, coincides with the lemma. A lexicon is the collection of all the lexemes of a language.
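The following sketch groups word-forms such as dog and dogs under a shared base form by stripping a few common English suffixes. The suffix list and the minimum length of three letters are assumptions chosen for this example; this is a crude stemmer, not a real lemmatizer, and it will get irregular forms (such as went or mice) wrong.

```python
from collections import defaultdict

# Naive suffix rules, tried in this order (an assumption for this sketch).
SUFFIXES = ("ing", "es", "ed", "s")

def naive_stem(word_form):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word_form.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def group_word_forms(word_forms):
    """Map each base form to the word-forms that reduce to it."""
    groups = defaultdict(list)
    for form in word_forms:
        groups[naive_stem(form)].append(form)
    return dict(groups)

forms = ["dog", "dogs", "walk", "walks", "walked", "walking"]
print(group_word_forms(forms))
# {'dog': ['dog', 'dogs'], 'walk': ['walk', 'walks', 'walked', 'walking']}
```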

Grammar is the field of linguistics that covers the rules governing the use of any given natural language. It mainly includes morphology and syntax, but it can be complemented with other linguistic fields.

Syntax is the study of the principles and rules for constructing sentences in natural languages; the term syntax is also used to refer directly to the rules and principles that govern sentence structure. Semantics is the study of the meaning of signs. Meaning can be studied at the word level, the sentence level, the paragraph level, and even in larger units of discourse.

A corpus is a large and structured set of texts used for statistical analysis, text mining, validating linguistic rules, calculating document similarities, etc.
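Two of these uses, word statistics and document similarity, can be sketched with the standard library alone. The three-sentence "corpus" below is invented for the example, and cosine similarity over raw word counts is just one common choice among many similarity measures.

```python
import math
import re
from collections import Counter

# A tiny made-up corpus; real corpora contain thousands of documents or more.
CORPUS = {
    "doc1": "The dog chased the cat.",
    "doc2": "The cat chased the mouse.",
    "doc3": "Stock prices fell sharply today.",
}

def term_counts(text):
    """Count how often each lowercase word occurs in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(counts_a, counts_b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[term] * counts_b[term] for term in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

counts = {name: term_counts(text) for name, text in CORPUS.items()}
print(counts["doc1"].most_common(1))                                # [('the', 2)]
print(round(cosine_similarity(counts["doc1"], counts["doc2"]), 3))  # 0.857
print(cosine_similarity(counts["doc1"], counts["doc3"]))            # 0.0
```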


elastichica said...

My sister is a linguist. (lexeme of linguistics).
I'm not but I love the word "morphology".

Doroty J. said...

Your post is not only flowing with would probably benefit from what is on this one page more so than spending 1 month trying to understand a teacher or a professor!

Our English language has always left me in a bewilderment of confusion as to its many rules.

Mariana Soffer said...

Txs a lot D. Glad you liked it. I do not have many posts about NLP; in my math blog you can find 2 entries explaining techniques used for NLP. If you like this kind of popular science writing, you are welcome to check it out and read the posts you think you are interested in. Of course, feel free to comment on them.

christopher said...

You sent me here. This is indeed different (though weirdly related in the background). NLP as initials have been heavily used by specialists in Neurolinguistic Programming. NLP is a set of theories and techniques that fit in the "Brief Therapy" mode of psychological therapies. It overlaps the set of hypnotic therapies with the set of talking therapies and with a certain set of concepts that come out of what might be called systems theory. Thus linguistics is required study to a certain extent. And as your Natural Language Processing is intentional in nature, intending a kind of clarity that allows for computation, a Neurolinguistic Programming communication is designed to bypass the natural defensiveness of patients and so becomes an intentional communication aimed at effective change in dysfunction to another form of healthier function.

When it works, an NLP exchange is remarkable in two ways. It often completely bypasses the varieties of resistance to change and also it does this in remarkably few therapeutic sessions. That is why NLP is considered a brief therapy.