Computational linguistics — glossary

By Marie Lebert, 17 October 2018.

Here is a short, basic glossary based on definitions read on Wikipedia for my colleagues around the world. The simpler the better. Please see also the glossary on artificial intelligence (AI).



algorithm
unambiguous specification of how to solve a class of problems; used to perform calculation, data processing and automated reasoning tasks

allomorph
variant form of a morpheme

allophone
one of a set of multiple possible phones (or signs in a sign language) used to pronounce a single phoneme in a given natural language

application programming interface / API
set of clearly defined functions, protocols and tools through which software components communicate; used to create software applications

applied linguistics
study of language-related real-life problems and solutions in education, psychology, communication research, anthropology, sociology and more

artificial intelligence / AI
design of machines capable of intelligent behaviour, meaning behaviour capable of achieving objectives; field that originated in the 1950s and includes computational linguistics (which also originated in the 1950s); see also the glossary on artificial intelligence

augmented reality / AR
technology superimposing a computer-generated image on a user’s view of the real world; “augmentation” of the real-world environment with computer-generated perceptual information (visual, auditory, haptic, olfactory)

big data
datasets that are too complex for standard data-processing application software, for example big data obtained by social media mining from user-generated content on social media sites and apps

character encoding
encoding of textual data with an encoding system such as Unicode
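
As a minimal sketch in Python (the sample word is an arbitrary choice, not taken from the glossary), the following shows a short text being mapped to Unicode code points, encoded as UTF-8 bytes, and decoded back:

```python
# Illustrative only: the sample word is an arbitrary choice.
text = "caf\u00e9"  # "café", with é stored as the single code point U+00E9

# Each character corresponds to one Unicode code point (an integer).
code_points = [ord(ch) for ch in text]

# UTF-8 turns the code points into bytes; é needs two bytes.
utf8_bytes = text.encode("utf-8")

# Decoding reverses the process.
decoded = utf8_bytes.decode("utf-8")

print(code_points)      # [99, 97, 102, 233]
print(utf8_bytes)       # b'caf\xc3\xa9'
print(decoded == text)  # True
```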

chatbot
web or mobile interface used by a human being to ask questions through text, sound or video, and retrieve information from hard-coded answers or from a larger content base using machine learning

cloud database
database on a cloud computing platform

CMU Pronouncing Dictionary / CMUdict
open-source pronouncing dictionary originally created by the Speech Group at Carnegie Mellon University (CMU) for use in speech recognition research

code
system of rules used to convert information (letter, word, sound, image, gesture) into another form of representation for communication and storage

command-line interface / CLI
interface in which the user issues commands to a program in the form of successive lines of text

command shell
command-line interface program in an operating system

compiled language
programming language whose implementations are compilers (and not interpreters)

compiler
program that converts computer code written in one programming language into another programming language

computational linguistics
field that processes natural languages using computer science and mathematics for analysis and synthesis of language and speech; originated in the 1950s with machine translation; includes applications such as spell and grammar checkers, speech synthesis, speech recognition, virtual assistants and smart speakers

computational science
multidisciplinary field using computing capabilities for science

computer-assisted translation / CAT
language translation in which a human translator uses specific software to support and facilitate the translation process; includes translation memory, language search engines, terminology management, alignment, interactive machine translation and augmented translation

computer science
study of computers (hardware, software, networks) and computing concepts

computer vision
theory behind the artificial systems that extract data from digital images or videos in order to process, analyse and understand such data

content analysis
process of studying digital media (texts, images, audio, video) and communication patterns in a systematic manner

conversational interface
interface that uses natural language processing (NLP) and natural language understanding (NLU) to run a conversation with a human being, for example a voice assistant

corpus linguistics
study of language as expressed in corpora (bodies) of written text; in its modern form, originated in the 1960s with the Brown Corpus, and also used to advance discourse analysis

data analysis
process of inspecting, cleaning, transforming and modelling data to find useful information

data manipulation
process of inserting, deleting, modifying and updating data

data mining
process of turning raw data into useful information; used for example for machine learning and statistics programs

data model
abstract model that organises elements of data and standardises how they relate to each other

data modelling
process of creating a data model for an information system by applying formal techniques

data processing
collecting, storing, visualising, searching, querying, analysing, updating, sharing and transferring data

data science
field that uses statistics, data analysis and machine learning to extract knowledge from data

dataset
collection of data

decoding
process of converting code symbols back into information, for example information expressed in a plain natural language

deep learning
family of machine learning methods based on learning data representations, typically with multi-layer artificial neural networks, as opposed to task-specific algorithms

descriptive linguistics
field that analyses and describes how natural language is actually used by a group of people

dictionary
listing in alphabetical order of the lexicon of a natural language (or two or several natural languages), with definitions, usage, etymologies, pronunciations and translations

diphone
adjacent pair of phones; often used for recording the transition between two phones, with better resulting sounds in speech synthesis than if combining two phones

discourse analysis
study of language use in written language, vocal language and/or sign language

encoding
process of converting information into code symbols for communication and storage

expert system
system that emulates the decision-making ability of a human expert

glossary
alphabetical list of terms in a specific field with the definition of those terms

grammar
system of rules that allows for the combination of words into sentences; includes morphology (grammar of word forms) and syntax (grammar of sentence structure)

grammatology
study of the history and theory of writing and writing systems

grapheme
visual character that is the smallest unit of a writing system in a natural language

graphemics
linguistic study of writing systems and their graphemes

graphical user interface / GUI
interface that allows users to interact with electronic devices through graphical icons and visual indicators

hot word
word providing hands-free activation of a voice command device with an integrated virtual assistant; also called wake word

human-computer interaction / HCI
field defining and developing interfaces between users and computers

index
alphabetical list created in order to locate data in a dataset
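
A minimal sketch of the idea in Python, assuming a tiny hypothetical dataset of three records (the record texts and the `build_index` name are illustrative, not from the glossary). Each word points to the records that contain it, and the entries are sorted alphabetically:

```python
from collections import defaultdict

# Hypothetical mini-dataset: three short records.
records = [
    "speech recognition",
    "speech synthesis",
    "machine translation",
]

def build_index(rows):
    """Map each word to the sorted list of record numbers containing it."""
    index = defaultdict(set)
    for number, row in enumerate(rows):
        for word in row.split():
            index[word].add(number)
    # Sort the entries alphabetically, as in a printed index.
    return {word: sorted(index[word]) for word in sorted(index)}

index = build_index(records)
print(index["speech"])  # [0, 1]
print(list(index))      # ['machine', 'recognition', 'speech', 'synthesis', 'translation']
```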

inference engine
system component that applies logical rules to the knowledge base in order to deduce new information
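
One common way to do this is forward chaining, sketched below in Python. The facts, the rules and the `forward_chain` name are hypothetical examples; real inference engines are considerably more elaborate.

```python
def forward_chain(facts, rules):
    """Apply rules repeatedly until no new fact can be deduced.

    Each rule is a (premises, conclusion) pair: when every premise is
    known, the conclusion is added to the set of known facts.
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and set(premises) <= known:
                known.add(conclusion)
                changed = True
    return known

# Hypothetical knowledge base: two facts and two logical rules.
facts = {"bird", "flies"}
rules = [
    (("bird",), "has_feathers"),
    (("has_feathers", "flies"), "can_migrate"),
]

deduced = forward_chain(facts, rules)
print(sorted(deduced))  # ['bird', 'can_migrate', 'flies', 'has_feathers']
```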

information system
organised system for collecting, storing, classifying and communicating information

International Phonetic Alphabet / IPA
alphabetical system of phonetic notation based primarily on the Latin alphabet; created by the International Phonetic Association in the late 19th century to standardise the representation of the sounds in spoken language

Internet of Things / IoT
network of physical devices, home appliances, vehicles, wearable devices and other items embedded with electronics, software, sensors, actuators (movers) and connectivity in order to collect and exchange data

interpreter
(a) linguist who translates human speech into another language; (b) computer program that directly executes instructions written in a programming or scripting language

kernel
core of an operating system

knowledge base / KB
base that stores complex structured and unstructured information used by a computer system

knowledge-based system / KBS
computer system that reasons and uses the knowledge base

knowledge management / KM
process of creating, using, sharing and managing the knowledge (information) of an organisation

language change
variation over time in the features of a natural language (phonological, morphological, semantic, syntactic)

language for general purpose / LGP
term used for a general dictionary (called language-for-general-purpose dictionary or LGP dictionary) that provides a description of a natural language

language for specific purpose / LSP
term used for a specialised dictionary (called language-for-specific-purpose dictionary or LSP dictionary) that defines the specialised vocabulary used by experts in a subject field

language usage
manner in which natural language is used by a user or a group of users

lemma
dictionary form used for a set of words, for example “run” for the set of words “run”, “runs”, “ran” and “running”
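
The example above can be sketched as a simple lookup table in Python. This is a toy illustration only (the `lemmatise` name is made up here); real lemmatisers rely on morphological analysis and much larger lexicons.

```python
# Toy lookup table built from the example in the definition above.
LEMMAS = {"run": "run", "runs": "run", "ran": "run", "running": "run"}

def lemmatise(word):
    """Return the dictionary form of a word, or the word itself if unknown."""
    return LEMMAS.get(word.lower(), word)

print([lemmatise(w) for w in ["Ran", "running", "runs", "walked"]])
# ['run', 'run', 'run', 'walked']
```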

lexical resource
database offering one or several dictionaries (monolingual, bilingual, multilingual)

lexicalisation
process of adding items to a lexicon, for example words, set phrases and word patterns

lexicography
practice of compiling, writing and editing general or specialised dictionaries; study of the semantic relationships in the lexicon (vocabulary) of a natural language

lexicon
vocabulary of a user, a language or a branch of knowledge; inventory of lexemes

linguist
specialist studying natural language and other languages (artificial, constructed); in a broader sense, language professional such as a translator, interpreter, copy editor and/or proofreader

linguistic corpus (plural: linguistic corpora)
collection of linguistic data, either written text or transcriptions of recorded speech

linguistics
study of language, including language form, language meaning and language in context

linked data
structured data that are interlinked for more or better results in semantic queries

Linux
family of free and open-source software operating systems built around the Linux kernel

Linux kernel
open-source Unix-like operating system kernel, first released in 1991 by Linus Torvalds

locale
set of parameters that defines a user’s language (language identifier) and region (region identifier) in a user interface

localisation
adaptation of a translated product to a specific country, region or language community, in order to take into account its culture, market or customs

machine learning
field that uses statistical techniques for computer systems to learn from data

machine translation
translation of text or speech from one language to another by a computer program

metadata
data that provide information about other data

morpheme
smallest meaningful unit of a natural language; its sound may vary (as allomorphs) without its meaning changing

morphology
study of the internal structure of words (formation and composition)

natural intelligence
intelligence displayed by humans (and animals)

natural language
human language, which basically consists of a lexicon (list of words) and a grammar (for the combination of words into sentences)

natural language generation / NLG
process of generating natural language from a machine representation system such as a knowledge base

natural language processing / NLP
field that uses computer programs to process large amounts of data pertaining to natural language

natural language understanding / NLU
subfield of natural language processing (NLP) that deals with machine reading comprehension; applications include search engine optimisation, news gathering, text categorisation, voice activation, large-scale content analysis, automated customer service and online education

natural-language user interface
computer-human interface in which linguistic components (verbs, phrases, etc.) act as UI (user interface) controls for creating, selecting and modifying data in software applications

network science
study of complex networks such as telecommunications, computer, biological, cognitive, semantic and social networks, and study of the connections between their elements or actors

ontology
formal naming and definition of the categories, properties and relations between concepts, data and entities in a domain of discourse

open data
data that are freely available for everyone to use and republish without restrictions from copyright or patents

operating system / OS
system that manages computer hardware and software resources, and provides common services for computer programs

optical character recognition / OCR
electronic conversion of scans or photographs of text (printed, typed, handwritten) into machine-encoded text

paradigm
set of concepts or thought patterns such as theories, research methods, postulates and standards

parsing
analysing a string of symbols from large-scale empirical data in order to annotate the syntactic and/or semantic sentence structure and create a parsed corpus (or treebank)

part of speech / POS
category of words with similar grammatical properties, for example noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, interjection (nine main parts of speech) in English, with 50 to 150 sub-categories for part-of-speech tagging

part-of-speech tagging / POST
process of marking up each word in a text as corresponding to a particular part of speech, based on its definition and its context, i.e. its relationship with adjacent and related words in a phrase, sentence or paragraph
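
A toy sketch in Python of how context can decide a tag. The tiny lexicon, the tag names and the single context rule are assumptions made for illustration; production taggers are trained on annotated corpora.

```python
# Hypothetical mini-lexicon: each word's default part of speech.
LEXICON = {
    "the": "DET", "a": "DET",
    "dogs": "NOUN",
    "long": "ADJ",
    "run": "VERB",
}
# Words that can be either a noun or a verb.
AMBIGUOUS = {"run"}

def tag_tokens(tokens):
    """Tag each token, using the previous tag to resolve ambiguous words."""
    tags = []
    for token in tokens:
        word = token.lower()
        label = LEXICON.get(word, "UNK")
        # Context rule: after a determiner or adjective, an ambiguous
        # word is taken to be a noun rather than a verb.
        if word in AMBIGUOUS and tags and tags[-1] in {"DET", "ADJ"}:
            label = "NOUN"
        tags.append(label)
    return list(zip(tokens, tags))

print(tag_tokens(["the", "dogs", "run"]))
# [('the', 'DET'), ('dogs', 'NOUN'), ('run', 'VERB')]
print(tag_tokens(["a", "long", "run"]))
# [('a', 'DET'), ('long', 'ADJ'), ('run', 'NOUN')]
```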

phone
any distinct speech sound (or speech gesture for sign language)

phoneme
unit of sound (or gesture for sign language) that distinguishes one word from another

phonemic transcription
visual representation of phonemes with a phonetic alphabet such as the International Phonetic Alphabet (IPA)

phonetic transcription
visual representation of speech sounds (phones), usually by using a phonetic alphabet such as the International Phonetic Alphabet (IPA)

phonetics
study of human speech sounds; includes articulatory phonetics (production of speech sounds by the speech organs), auditory phonetics (reception of speech sounds from the ear to the brain) and acoustic phonetics (loudness, amplitude and frequency of speech sounds)

phonology
study of how sounds are used in natural language to convey meaning; includes for example stress (emphasis on a given syllable or word) and intonation (variations in spoken pitch)

phonotactics
branch of phonology that deals with restrictions in a natural language on the permissible combinations of phonemes

phrasebook
collection of ready-made phrases, often in the form of indexed questions and answers, for example phrases along with a translation to learn the basics of a foreign natural language

pragmatics
study of the way in which context contributes to meaning

programming
process of designing and building an executable program for a specific computing task

programming language
set of commands, instructions and other syntax used to write computer programs

psycholinguistics
study of the interrelation between linguistic factors and psychological aspects

question answering
system that automatically answers questions asked in a natural language

relational database
database based on a relational model of data, i.e. a model that manages data as a set of relations
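
A minimal sketch using Python's built-in `sqlite3` module with an in-memory database. The two tables (`lemma` and `word_form`) and their contents are hypothetical; they only illustrate data stored as relations and combined with a join.

```python
import sqlite3

# In-memory database, discarded when the connection is closed.
conn = sqlite3.connect(":memory:")

# Two relations (tables): lemmas, and the word forms that belong to them.
conn.execute("CREATE TABLE lemma (id INTEGER PRIMARY KEY, form TEXT)")
conn.execute(
    "CREATE TABLE word_form (form TEXT, lemma_id INTEGER REFERENCES lemma(id))"
)
conn.execute("INSERT INTO lemma VALUES (1, 'run')")
conn.executemany(
    "INSERT INTO word_form VALUES (?, ?)",
    [("runs", 1), ("ran", 1), ("running", 1)],
)

# Join the two relations to list every form of the lemma "run".
rows = conn.execute(
    "SELECT word_form.form FROM word_form "
    "JOIN lemma ON lemma.id = word_form.lemma_id "
    "WHERE lemma.form = ? ORDER BY word_form.form",
    ("run",),
).fetchall()
forms = [row[0] for row in rows]
conn.close()

print(forms)  # ['ran', 'running', 'runs']
```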

SAMPA / Speech Assessment Methods Phonetic Alphabet
computer-readable phonetic alphabet based on the International Phonetic Alphabet (IPA) and using 7-bit printable ASCII characters
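
As a small illustration in Python, the table below maps a handful of IPA symbols to their SAMPA equivalents. It is deliberately partial, and the `to_sampa` helper is a made-up name; symbols missing from the table are passed through unchanged, which is only correct for letters that are identical in both alphabets.

```python
# Small, partial IPA-to-SAMPA table (a handful of English symbols only).
IPA_TO_SAMPA = {
    "ʃ": "S",  # as in "ship"
    "θ": "T",  # as in "thin"
    "ð": "D",  # as in "this"
    "ŋ": "N",  # as in "sing"
    "ɪ": "I",  # as in "pit"
    "ə": "@",  # schwa
}

def to_sampa(ipa):
    """Replace each known IPA symbol; other symbols pass through unchanged."""
    return "".join(IPA_TO_SAMPA.get(symbol, symbol) for symbol in ipa)

print(to_sampa("ʃɪp"))  # SIp
print(to_sampa("θɪŋ"))  # TIN
```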

script
program written for a special run-time environment to automate the execution of tasks

scripting language
programming language that supports scripts to automate the execution of tasks

semantics
study of meaning in natural languages and programming languages

semiotics
study of meaning-making signs and sign processes

sentiment analysis
process that uses natural language processing (NLP) and text analysis to identify, extract, quantify and study subjective information such as users’ reviews and surveys

shell
user interface for access to the services of an operating system; outermost layer (hence its name shell) around the operating system kernel

shell script
program designed to be run by the Unix shell (a command-line interpreter)

smart speaker
voice command device (VCD) with an integrated virtual assistant, that offers hands-free activation with the help of one hot word (or wake word)

sociolinguistics
study of the effect of society on the way natural language is used by human beings; takes into account gender, age range, race, ethnicity, education, social status and other factors

sociology of language
study of the effect of natural language on society

speaker recognition
identification of users from their voice biometrics

speech
vocal communication using language

speech recognition
process that enables the recognition of spoken language by computers and its translation into text, for example in the built-in speech recognition software offered by most operating systems; early research dates back to the 1950s

speech synthesis
artificial production of human speech by a computer program, with such software included in operating systems since the early 1990s

standard library
library made available across implementations of a programming language

statistics
branch of mathematics dealing with data collection, organisation, analysis, interpretation and presentation

stylistics
study of linguistic factors that place a discourse in context

sublanguage
subset of a natural language, a computer language or a relational database

syntagma
elementary constituent segment within a text, for example a phoneme, word, phrase or sentence

syntax
study of language structure (formation and composition of phrases and sentences) in order to describe how structural relations between elements in a sentence (often depicted in parse tree format) contribute to its interpretation; set of rules that define a structured computer program

taxonomy
classification of items into categories, often hierarchical; used for example to improve relevance in vertical search for a web search query

terminology
study of terms (words and compound words) and their use

text corpus
structured set of texts for storage and processing; can be for example a monolingual corpus, a multilingual corpus, a translation corpus (texts and their translations), a parallel corpus (texts alongside their translations) or a comparable corpus (texts covering the same contents)

text processing
creation and manipulation of electronic text, for example reformatting or content change (search and replace, select and move, etc.)

theoretical linguistics
study of the nature of human language and its relation to cognitive processes; includes phonology, morphology, syntax and semantics

thesaurus
listing of words grouped according to similarity of meaning; controlled vocabulary organising semantic metadata for information storage and retrieval

triphone
sequence of three phones; used in natural language processing (NLP) to establish the various contexts of a phone in a given natural language
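
A short Python sketch of how adjacent pairs (diphones) and sequences of three (triphones) can be read off a phone sequence. The phone sequence for "speech" is hand-written in an ARPAbet-style notation with stress marks omitted, and the `phone_ngrams` name is an illustrative choice.

```python
def phone_ngrams(phones, n):
    """Return every run of n adjacent phones, in order."""
    return [tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)]

# Hand-written phone sequence for "speech" (assumption: stress marks omitted).
phones = ["S", "P", "IY", "CH"]

diphones = phone_ngrams(phones, 2)
triphones = phone_ngrams(phones, 3)

print(diphones)   # [('S', 'P'), ('P', 'IY'), ('IY', 'CH')]
print(triphones)  # [('S', 'P', 'IY'), ('P', 'IY', 'CH')]
```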

Unix
family of multitasking, multi-user computer operating systems launched in the 1970s

Unix shell
command-line interpreter providing a Unix-like command-line user interface

user interface / UI
space where interactions between a user and a computer occur, as studied in the field of human-computer interaction; can be a command-line interface (CLI) or a graphical user interface (GUI)

virtual assistant
software agent performing tasks and services for a user

virtual reality / VR
replacement of the user’s real-world environment with a computer-generated simulation of a three-dimensional environment that can be accessed with electronic equipment, for example a helmet with a screen or gloves with sensors

vocabulary
set of words for communication and knowledge acquisition; can be for example reading vocabulary, listening vocabulary, speaking vocabulary, writing vocabulary, native language vocabulary, second language vocabulary and foreign language vocabulary

voice command device / VCD
device controlled by the human voice, for example a mobile phone with voice-activated dialling or a remote controller

voice tag
short audio phrase used as a command to a voice command device or a voice user interface

voice user interface / VUI
voice/speech platform for computer-human interaction to initiate an automated service or process

wearable device
smart electronic device (device with micro-controllers) that can be incorporated into clothing, worn as an accessory (for example a smart watch or a fitness tracker) or worn on/in the body as an implant

wearable technology
technology behind smart devices and items worn on/in the body

web mining
data mining to discover patterns from the web

writing system
method of visually representing verbal communication by converting spoken language into visual symbols for a wider communication across space and time


Copyright © 2018 Marie Lebert
License CC BY-NC-SA version 4.0
