It is great to have access to huge amounts of information, but since we are not reading any faster than before, we cannot take full advantage of this new situation. The need for a discipline that helps human beings deal with all that data is therefore fundamental.
Natural language processing (NLP) is the discipline of building computational models for understanding natural language. It studies the problems of automated generation and understanding of natural human languages. NLP includes natural-language-generation systems, which convert information from computer databases into ordinary human language, and natural-language-understanding systems, which convert samples of human language into more formal representations that are easier for computer programs to manipulate.
NLP also studies the information contained in human-generated texts, along with their language structure. It is a multidisciplinary field that draws on artificial intelligence techniques, multivariate statistics, linguistics, and any other domain that can help process, generate, or interpret language with computers.
Linguistics is the scientific and philosophical study of language, encompassing a number of sub-fields. At the core of theoretical linguistics is the study of language structure (grammar) and the study of meaning (semantics). The first of these encompasses morphology (the formation and composition of words) and syntax (the rules that determine how words combine into phrases and sentences).
A controlled vocabulary is a list of terms that have been enumerated explicitly. This list is controlled by and is available from a controlled vocabulary registration authority. All terms in a controlled vocabulary should have an unambiguous, non-redundant definition.
Named entity recognition is a subtask of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
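As an illustrative sketch of the task, a rule-based recognizer for two of the easier entity types listed above (monetary values and percentages) can be written with regular expressions. The patterns and category labels here are assumptions chosen for the example; real NER systems rely on gazetteers and statistical models rather than a handful of regexes.

```python
import re

# Minimal rule-based NER sketch: only two easy entity types are covered.
# The patterns below are illustrative, not production-quality.
PATTERNS = {
    "MONEY": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?"),
    "PERCENT": re.compile(r"\d+(?:\.\d+)?%"),
}

def recognize_entities(text):
    """Return (entity_type, matched_text) pairs found in the text."""
    entities = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append((label, match.group()))
    return entities

print(recognize_entities("Revenue grew 12% to $1,500,000 last year."))
```

Names of persons or organizations cannot be captured this way, which is precisely why NER for those categories requires learned models or curated lists.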
A taxonomy is a collection of controlled vocabulary terms organized into a hierarchical (tree-shaped) structure. Each term in the taxonomy participates in one or more parent-child relationships. By definition, a child term satisfies all the constraints of its parent plus one or more additional constraints. For example, car is a child of vehicle: every car is also a vehicle, but not every vehicle is a car. There are also domain-specific taxonomies, such as an “enterprise taxonomy”, which contains only terms relevant to a particular organization or field. Taxonomies are seen as less expressive than ontologies, because ontologies support logical inference and allow a larger variety of relation types.
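The car/vehicle example above can be sketched as a parent-child map over which is-a questions are answered by walking up the tree. The term names and the is_a helper below are illustrative only, not part of any standard vocabulary.

```python
# A taxonomy as a parent-child map (each term points to its parent).
# Term names are invented for the example.
parent_of = {
    "car": "vehicle",
    "truck": "vehicle",
    "vehicle": "artifact",
}

def is_a(term, ancestor):
    """True if `term` equals `ancestor` or is a descendant of it."""
    while term is not None:
        if term == ancestor:
            return True
        term = parent_of.get(term)
    return False

print(is_a("car", "vehicle"))   # True: every car is a vehicle
print(is_a("vehicle", "car"))   # False: not every vehicle is a car
```

Because the structure is a plain tree with only one relation type (is-a), it supports exactly this kind of subsumption query and nothing more, which is the limitation ontologies lift.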
An ontology is a formal representation of a set of concepts within a domain and the relationships between those concepts. It is used to reason about the properties of that domain, and may be used to define the domain. They are a form of knowledge representation.
Part-of-speech (POS) tagging is a process whereby tokens are sequentially labeled with syntactic labels, such as "finite verb" or "gerund" or "subordinating conjunction".
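As a sketch of the idea (not of how production taggers work), a toy tagger can label each token by looking it up in a hand-built dictionary and falling back to a default tag. The tiny tag set and lexicon below are assumptions for illustration; real taggers use sentence context, for example hidden Markov models or neural networks, to disambiguate.

```python
# Toy lookup-based POS tagger. The lexicon and tag names are invented
# for this example; unknown tokens default to "NOUN".
TAG_LEXICON = {
    "the": "DET", "a": "DET",
    "dog": "NOUN", "dogs": "NOUN",
    "barks": "VERB", "runs": "VERB",
    "loudly": "ADV",
}

def pos_tag(tokens):
    """Label each token with a syntactic tag from the lexicon."""
    return [(tok, TAG_LEXICON.get(tok.lower(), "NOUN")) for tok in tokens]

print(pos_tag("The dog barks loudly".split()))
```

A lookup tagger fails on ambiguous words such as "runs" (verb or plural noun), which is exactly the problem contextual sequence models solve.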
Morphology is the study of the internal structure of words.
Lexeme and word-form capture the distinction between two senses of "word", arguably the most important distinction in morphology. In the first sense, dog and dogs are "the same word"; this abstract unit is called a lexeme. Each of its concrete realizations, such as dog or dogs, is called a word-form, so there are different word-forms of the same lexeme. The word-form chosen conventionally as the canonical representative of a lexeme is called its lemma (also called the root); we thus say that dog and dogs share a common lemma. A stemmer is a tool used to reduce word-forms to a common base form that approximates the lemma. A lexicon is the collection of all the lexemes of a language.
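A crude suffix-stripping stemmer, far simpler than real stemmers such as Porter's, can illustrate the idea of mapping several word-forms of a lexeme to one base form. The suffix list and minimum-stem-length rule below are assumptions for the example; note that the output is a stem, which may differ from the dictionary lemma.

```python
# Toy suffix-stripping stemmer. The suffix list is illustrative;
# real stemmers apply ordered, condition-guarded rewrite rules.
SUFFIXES = ["ing", "ed", "es", "s"]

def stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(stem("dogs"), stem("dog"))        # both reduce to "dog"
print(stem("walking"), stem("walked"))  # both reduce to "walk"
```

Such naive stripping overstems irregular forms (it cannot relate "ran" to "run"), which is why lemmatizers consult a lexicon instead of relying on suffixes alone.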
Grammar is the field of linguistics that covers the rules governing the use of any given language. It mainly includes morphology and syntax, but it can be complemented with other linguistic fields.
Syntax is the study of the principles and rules for constructing sentences in natural languages; the term is also used to refer directly to the rules and principles that govern sentence structure. Semantics is, essentially, the study of the meaning of signs. These studies can be performed at the word level, the sentence level, the paragraph level, and even over larger units of discourse.
A corpus is a large and structured set of texts used for statistical analysis, text mining, validation of linguistic rules, computation of document similarities, and other tasks.
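As a sketch of one such use, document similarity over a corpus can be computed with bag-of-words vectors and cosine similarity. The three-document corpus below is invented for the example, and real systems typically add term weighting such as tf-idf before comparing vectors.

```python
import math
from collections import Counter

# A tiny invented corpus; each document is one string.
corpus = [
    "the cat sat on the mat",
    "the cat lay on the mat",
    "stock prices fell sharply today",
]

def cosine(doc_a, doc_b):
    """Cosine similarity between two documents' word-count vectors."""
    a, b = Counter(doc_a.split()), Counter(doc_b.split())
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine(corpus[0], corpus[1]))  # high: near-identical sentences
print(cosine(corpus[0], corpus[2]))  # zero: no shared words
```

Raw counts make frequent function words like "the" dominate the score, which is the motivation for the tf-idf weighting used in practice.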