Computational linguistics — glossary
By Marie Lebert, 17 October 2018.
Here is a short, basic glossary for my colleagues around the world, based on definitions read on Wikipedia. The simpler the better. Please see also our glossary on artificial intelligence (AI).
algorithm
specification for performing calculation, data processing and automated reasoning tasks
allomorph
variant form of a morpheme
allophone
one of multiple possible phones (or signs in a sign language) used to pronounce a single phoneme in a given natural language
application programming interface / API
set of definitions, tools and resources, for example those provided by an operating system, used to create software applications
applied linguistics
study of language-related real-life problems and solutions in education, psychology, communication research, anthropology, sociology and more
artificial intelligence / AI
design of machines capable of intelligent behaviour, meaning behaviour capable of achieving objectives; field that originated in the 1950s and includes computational linguistics (which also originated in the 1950s) [glossary]
augmented reality / AR
technology superimposing a computer-generated image on a user’s view of the real world; “augmentation” of the real-world environment with computer-generated perceptual information (visual, auditory, haptic, olfactory)
big data
datasets that are too complex for standard data-processing application software, for example big data obtained by social media mining from user-generated content on social media sites and apps
character encoding
encoding of textual data with an encoding system such as Unicode
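As a minimal sketch of character encoding, the Python snippet below encodes a short string as UTF-8 (one of the Unicode encodings) and decodes it back. The sample word is an arbitrary choice for illustration.

```python
# Encode a Unicode string into bytes with UTF-8, then decode it back.
text = "na\u00efve"                 # "naive" written with i-diaeresis (U+00EF)
encoded = text.encode("utf-8")      # the accented letter takes two bytes in UTF-8
decoded = encoded.decode("utf-8")

print(len(text), len(encoded))      # number of characters, number of bytes
print(decoded == text)              # the round trip preserves the text
```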
chatbot
web or mobile interface used by a human being to ask questions through text, sound or video, and retrieve information from hard-coded answers or from a larger content base using machine learning
cloud database
database on a cloud computing platform
CMU Pronouncing Dictionary / CMUdict
open-source pronouncing dictionary originally created by the Speech Group at Carnegie Mellon University (CMU) for use in speech recognition research
code
algorithm used to convert information (letter, word, sound, image, gesture) into another form of representation for communication and storage
command-line interface / CLI
interface in which the user issues commands in the form of lines of text
command shell
command-line interface program in an operating system
compiled language
programming language whose implementations are compilers (and not interpreters)
compiler
program that converts computer code written in one programming language into another programming language
computational linguistics
field that processes natural languages using computer science and mathematics for analysis and synthesis of language and speech; originated in the 1950s with machine translation; includes applications such as spell and grammar checkers, speech synthesis, speech recognition, virtual assistants and smart speakers
computational science
multidisciplinary field using computing capabilities for science
computer-assisted translation / CAT
language translation in which a human translator uses specific software to support and facilitate the translation process; includes translation memory, language search engines, terminology management, alignment, interactive machine translation and augmented translation
computer science
study of computers (hardware, software, networks) and computing concepts
computer vision
theory behind the artificial systems that extract data from digital images or videos in order to process, analyse and understand such data
content analysis
process of studying digital media (texts, images, audio, video) and communication patterns in a systematic manner
conversational interface
interface that uses natural language processing (NLP) and natural language understanding (NLU) to run a conversation with a human being, for example a voice assistant
corpus linguistics
study of language as expressed in corpora (bodies) of written text; originated in the 1970s to advance discourse analysis
data analysis
process of inspecting, cleaning, transforming and modelling data to find useful information
data manipulation
process of inserting, deleting, modifying and updating data
data mining
process of turning raw data into useful information; used for example for machine learning and statistics programs
data model
abstract model that organises elements of data and standardises how they relate to each other
data modelling
process of creating a data model for an information system by applying formal techniques
data processing
collecting, storing, visualising, searching, querying, analysing, updating, sharing and transferring data
data science
field that uses statistics, data analysis and machine learning to extract knowledge from data
dataset
collection of data
decoding
process of converting code symbols back into information, for example information expressed in a plain natural language
deep learning
machine learning method based on learning data representations (typically with multi-layer neural networks) as opposed to task-specific algorithms
descriptive linguistics
field that analyses and describes how natural language is actually used by a group of people
dictionary
listing in alphabetical order of the lexicon of a natural language (or two or several natural languages), with definitions, usage, etymologies, pronunciations and translations
diphone
adjacent pair of phones; often used for recording the transition between two phones, with better resulting sounds in speech synthesis than if combining two phones
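As an illustration only, the following Python sketch lists the diphones (adjacent pairs) of a phone sequence. The function name and the sample sequence, written in SAMPA-like notation, are assumptions made for this example.

```python
def diphones(phones):
    """Return every adjacent pair of phones in the sequence."""
    return list(zip(phones, phones[1:]))

# Hypothetical phone sequence for the English word "cat".
print(diphones(["k", "{", "t"]))
```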
discourse analysis
study of language use in written language, vocal language and/or sign language
encoding
process of converting information into code symbols for communication and storage
expert system
system that emulates the decision-making ability of a human expert
glossary
alphabetical list of terms in a specific field with the definition of those terms
grammar
system of rules which allow for the combination of words into sentences; includes morphology (grammar of word forms) and syntax (grammar of sentence structure)
grammatology
study of the history and theory of writing and writing systems
grapheme
visual character that is the smallest unit of a writing system in a natural language
graphemics
linguistic study of writing systems and their graphemes
graphical user interface / GUI
interface which allows users to interact with electronic devices through graphical icons and visual indicators
hot word
word providing hands-free activation of a voice command device with an integrated virtual assistant; also called wake word
human-computer interaction / HCI
field defining and developing interfaces between users and computers
index
alphabetical list created in order to locate data in a dataset
inference engine
system component that applies logical rules to the knowledge base in order to deduce new information
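A very small sketch of how an inference engine might deduce new information is given below in Python. It uses forward chaining, which is only one possible strategy; the facts, the rules and the function name are hypothetical.

```python
def forward_chain(facts, rules):
    """Apply rules (premises -> conclusion) until no new fact is deduced."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and set(premises) <= known:
                known.add(conclusion)
                changed = True
    return known

# Hypothetical toy knowledge base: one fact and two logical rules.
facts = {"dog is a noun"}
rules = [
    (["dog is a noun"], "dog can take an article"),
    (["dog can take an article"], "dog can head a noun phrase"),
]
print(sorted(forward_chain(facts, rules)))
```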
information system
organised system for collecting, storing, classifying and communicating information
International Phonetic Alphabet / IPA
alphabetical system of phonetic notation based primarily on the Latin alphabet; created by the International Phonetic Association in the late 19th century to standardise the representation of the sounds in spoken language
Internet of Things / IoT
network of physical devices, home appliances, vehicles, wearable devices and other items embedded with electronics, software, sensors, actuators (movers) and connectivity in order to collect and exchange data
interpreter
(a) linguist who translates human speech into another language; (b) computer program that directly executes instructions written in a programming or scripting language
kernel
core of an operating system
knowledge base / KB
base that stores complex structured and unstructured information used by a computer system
knowledge-based system / KBS
computer system that reasons over a knowledge base in order to solve problems
knowledge management / KM
process of creating, using, sharing and managing the knowledge (information) of an organisation
language change
variation over time in the features of a natural language (phonological, morphological, semantic, syntactic)
language for general purpose / LGP
term used for a general dictionary (called language-for-general-purpose dictionary or LGP dictionary) that provides a description of a natural language
language for specific purpose / LSP
term used for a specialised dictionary (called language-for-specific-purpose dictionary or LSP dictionary) that defines the specialised vocabulary used by experts in a subject field
language usage
manner in which natural language is used by a user or a group of users
lemma
dictionary form used for a set of words, for example “run” for the set of words “run”, “runs”, “ran” and “running”
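The example above can be sketched in Python with a simple lookup table. The table and the function name are assumptions for illustration; real lemmatisers rely on much larger lexicons and on morphological rules.

```python
# Hypothetical lookup table mapping inflected forms to their lemma.
LEMMAS = {"run": "run", "runs": "run", "ran": "run", "running": "run"}

def lemmatise(word):
    """Return the lemma of a word, or the word itself if it is unknown."""
    return LEMMAS.get(word.lower(), word)

print([lemmatise(w) for w in ["Runs", "ran", "running", "walked"]])
```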
lexical resource
database offering one or several dictionaries (monolingual, bilingual, multilingual)
lexicalisation
process of adding items to a lexicon, for example words, set phrases and word patterns
lexicography
practice of compiling, writing and editing general or specialised dictionaries; study of the semantic relationships in the lexicon (vocabulary) of a natural language
lexicon
vocabulary of a user, a language or a branch of knowledge; inventory of lexemes
linguist
specialist studying natural language and other languages (artificial, constructed); in a broader sense, language professional such as a translator, interpreter, copy editor and/or proofreader
linguistic corpora
collections of linguistic data, either written texts or transcriptions of recorded speech
linguistics
study of language, including language form, language meaning and language in context
linked data
structured data that are interlinked for more or better results in semantic queries
Linux
family of free and open-source software operating systems built around the Linux kernel
Linux kernel
open-source Unix-like operating system kernel, first released in 1991 by Linus Torvalds
locale
set of parameters that defines a user’s language (language identifier) and region (region identifier) in a user interface
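As a rough sketch, a locale identifier such as "en_GB" can be split into its language and region parts in Python. The function name is hypothetical, and the sketch ignores other parameters that real identifiers may carry (script, encoding, variants).

```python
def parse_locale(identifier):
    """Split a locale identifier such as "en_GB" into (language, region)."""
    language, _, region = identifier.replace("-", "_").partition("_")
    return language.lower(), region.upper() or None

print(parse_locale("en_GB"))   # language and region
print(parse_locale("fr"))      # language only, so the region is None
```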
localisation
adaptation of a translated product to a specific country, region or language community, in order to take into account its culture, market or customs
machine learning
field that uses statistical techniques for computer systems to learn from data
machine translation
translation of text or speech from one language to another by a computer program
metadata
data that provide information about other data
morpheme
smallest meaningful unit of a natural language; can vary in sound (as allomorphs) without changing its meaning
morphology
study of the internal structure of words (formation and composition)
natural intelligence
intelligence displayed by humans (and animals)
natural language
human language, which basically consists of a lexicon (list of words) and a grammar (for the combination of words into sentences)
natural language generation / NLG
process of generating natural language from a machine representation system such as a knowledge base
natural language processing / NLP
field that uses computer programs to process large amounts of data pertaining to natural language
natural language understanding / NLU
subfield of natural language processing (NLP) for machine reading comprehension; includes search engine optimisation, news gathering, text categorisation, voice activation, large-scale content analysis, automated customer service and online education
natural-language user interface
computer-human interface in which linguistic components (verbs, phrases, etc.) act as UI (user interface) controls for creating, selecting and modifying data in software applications
network science
study of complex networks such as telecommunications, computer, biological, cognitive, semantic and social networks, and study of the connections between their elements or actors
ontology
formal naming and definition of the categories, properties and relations between concepts, data and entities in a domain of discourse
open data
data that are freely available for everyone to use and republish without restrictions from copyright or patents
operating system / OS
system that manages computer hardware and software resources, and provides common services for computer programs
optical character recognition / OCR
electronic conversion of scans or photographs of text (printed, typed, handwritten) into machine-encoded text
paradigm
set of concepts or thought patterns such as theories, research methods, postulates and standards
parsing
analysing a string of symbols from large-scale empirical data in order to annotate the syntactic and/or semantic sentence structure and create a parsed corpus (or treebank)
part of speech / POS
category of words with similar grammatical properties, for example noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, interjection (nine main parts of speech) in English, with 50 to 150 sub-categories for part-of-speech tagging
part-of-speech tagging / POST
process of marking up a word as belonging to a particular part of speech in order to study its use in context, i.e. in relationship with adjacent and related words in a phrase, sentence or paragraph
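A toy illustration of part-of-speech tagging in Python follows. It only looks each word up in a hypothetical mini-lexicon, whereas real taggers also use the surrounding context and much larger tag sets.

```python
# Hypothetical mini-lexicon mapping words to their part of speech.
LEXICON = {"the": "article", "dog": "noun", "barks": "verb", "loudly": "adverb"}

def tag(sentence):
    """Tag each word with its part of speech, or "unknown" if not in the lexicon."""
    return [(word, LEXICON.get(word.lower(), "unknown")) for word in sentence.split()]

print(tag("The dog barks loudly"))
```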
phone
any distinct speech sound (or speech gesture for sign language)
phoneme
unit of sound (or gesture for sign language) that distinguishes one word from another
phonemic transcription
visual representation of phonemes with a phonetic alphabet such as the International Phonetic Alphabet (IPA)
phonetic transcription
visual representation of speech sounds (phones), usually by using a phonetic alphabet such as the International Phonetic Alphabet (IPA)
phonetics
study of human speech sounds; includes articulatory phonetics (production of speech sounds by the speech organs), auditory phonetics (reception of speech sounds from the ear to the brain) and acoustic phonetics (loudness, amplitude and frequency of speech sounds)
phonology
study of how sounds are used in natural language to convey meaning; includes for example stress (emphasis on a given syllable or word) and intonation (variations in spoken pitch)
phonotactics
branch of phonology that deals with restrictions in a natural language on the permissible combinations of phonemes
phrasebook
collection of ready-made phrases, often in the form of indexed questions and answers, for example phrases along with a translation to learn the basics of a foreign natural language
pragmatics
study of the way in which context contributes to meaning
programming
process of designing and building an executable program for a specific computing task
programming language
set of commands, instructions and other syntax used to create software programs
psycholinguistics
study of the interrelation between linguistic factors and psychological aspects
question answering
system that automatically answers questions asked in a natural language
relational database
database based on a relational model of data, i.e. a model that manages data as a set of relations
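As a minimal sketch, the Python snippet below uses the standard `sqlite3` module to store data in two related tables and join them. The table and column names are assumptions chosen for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # in-memory database, nothing written to disk
conn.execute("CREATE TABLE lemma (id INTEGER PRIMARY KEY, form TEXT)")
conn.execute("CREATE TABLE word (form TEXT, lemma_id INTEGER REFERENCES lemma(id))")
conn.execute("INSERT INTO lemma VALUES (1, 'run')")
conn.executemany("INSERT INTO word VALUES (?, 1)", [("runs",), ("ran",), ("running",)])

# Relate each word form to its lemma through the shared identifier.
rows = conn.execute(
    "SELECT word.form, lemma.form FROM word "
    "JOIN lemma ON word.lemma_id = lemma.id ORDER BY word.form"
).fetchall()
print(rows)
conn.close()
```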
SAMPA / Speech Assessment Methods Phonetic Alphabet
computer-readable phonetic alphabet based on the International Phonetic Alphabet (IPA) and using 7-bit printable ASCII characters
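For illustration, the Python sketch below converts a few SAMPA symbols to their IPA equivalents. The table is a small excerpt rather than the full alphabet, and the function name is hypothetical.

```python
# A few SAMPA symbols and their IPA equivalents (small excerpt, not the full table).
SAMPA_TO_IPA = {
    "S": "\u0283",  # voiceless postalveolar fricative, as in "ship"
    "T": "\u03b8",  # voiceless dental fricative, as in "thing"
    "D": "\u00f0",  # voiced dental fricative, as in "this"
    "N": "\u014b",  # velar nasal, as in "sing"
    "I": "\u026a",  # near-close front vowel, as in "kit"
    "@": "\u0259",  # schwa
    "{": "\u00e6",  # near-open front vowel, as in "cat"
}

def sampa_to_ipa(symbols):
    """Convert a list of SAMPA symbols to an IPA string; unknown symbols pass through."""
    return "".join(SAMPA_TO_IPA.get(s, s) for s in symbols)

print(sampa_to_ipa(["T", "I", "N"]))   # the word "thing"
```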
script
program written for a special run-time environment to automate the execution of tasks
scripting language
programming language that supports scripts to automate the execution of tasks
semantics
study of meaning in natural languages and programming languages
semiotics
study of meaning-making signs and sign processes
sentiment analysis
process that uses natural language processing (NLP) and text analysis to identify, extract, quantify and study subjective information such as users’ reviews and surveys
shell
user interface for access to the services of an operating system; outermost layer (hence its name shell) around the operating system kernel
shell script
program designed to be run by the Unix shell (a command-line interpreter)
smart speaker
voice command device (VCD) with an integrated virtual assistant, that offers hands-free activation with the help of one hot word (or wake word)
sociolinguistics
study of the effect of society on the way natural language is used by human beings; takes into account gender, age range, race, ethnicity, education, social status and other factors
sociology of language
study of the effect of natural language on society
speaker recognition
identification of users from their voice biometrics
speech
vocal communication using language
speech recognition
process that enables the recognition, interpretation and translation of spoken language by computers, for example in the built-in speech recognition software offered by most operating systems; originated in the late 1970s
speech synthesis
artificial production of human speech by a computer program, with such software included in operating systems since the early 1990s
standard library
library made available across implementations of a programming language
statistics
branch of mathematics dealing with data collection, organisation, analysis, interpretation and presentation
stylistics
study of linguistic factors that place a discourse in context
sublanguage
subset of a natural language, a computer language or a relational database
syntagma
elementary constituent segment within a text, for example a phoneme, word, phrase or sentence
syntax
study of language structure (formation and composition of phrases and sentences) in order to describe how structural relations between elements in a sentence (often depicted in parse tree format) contribute to its interpretation; set of rules that define a structured computer program
taxonomy
classification that improves relevance in vertical search, for example for a web search query
terminology
study of terms (words and compound words) and their use
text corpus
structured set of texts for storage and processing; can be for example a monolingual corpus, a multilingual corpus, a translation corpus (texts and their translations), a parallel corpus (texts alongside their translations) or a comparable corpus (texts covering the same contents)
text processing
creation and manipulation of electronic text, for example reformatting or content change (search and replace, select and move, etc.)
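A search-and-replace operation of this kind can be sketched in Python with the standard `re` module. The sample sentence is an arbitrary choice for illustration.

```python
import re

text = "The colour of the harbour"
# Search and replace whole words only, using word boundaries.
result = re.sub(r"\bcolour\b", "color", text)
print(result)
```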
theoretical linguistics
study of the nature of human language and its relation to cognitive processes; includes phonology, morphology, syntax and semantics
thesaurus
listing of words grouped according to similarity of meaning; controlled vocabulary organising semantic metadata for information storage and retrieval
triphone
sequence of three phones; used in natural language processing (NLP) to establish the various contexts of a phone in a given natural language
Unix
family of multitasking, multi-user computer operating systems launched in the 1970s
Unix shell
command-line interpreter providing a Unix-like command-line user interface
user interface / UI
space where interactions between users and computers occur, as designed in the field of human-computer interaction; can be a command-line interface (CLI) or a graphical user interface (GUI)
virtual assistant
software agent performing tasks and services for a user
virtual reality / VR
replacement of the user’s real-world environment with a computer-generated simulation of a three-dimensional environment that can be accessed with electronic equipment, for example a helmet with a screen or gloves with sensors
vocabulary
set of words for communication and knowledge acquisition; can be for example reading vocabulary, listening vocabulary, speaking vocabulary, writing vocabulary, native language vocabulary, second language vocabulary and foreign language vocabulary
voice command device / VCD
device controlled by the human voice, for example a mobile phone with voice-activated dialling or a remote controller
voice tag
short audio phrase used as a command to a voice command device or a voice user interface
voice user interface / VUI
voice/speech platform for computer-human interaction to initiate an automated service or process
wearable device
smart electronic device (device with micro-controllers) that can be incorporated into clothing, worn as an accessory (for example a smart watch or a fitness tracker) or worn on/in the body as an implant
wearable technology
technology behind smart devices and items worn on/in the body
web mining
data mining to discover patterns from the web
writing system
method of visually representing verbal communication by converting spoken language into visual symbols for a wider communication across space and time
Copyright © 2018 Marie Lebert
License CC BY-NC-SA version 4.0