Talking to machines
Before I leave the car, I ask ‘Turing’ to remind me to stop off at the garage on the way to work tomorrow morning. At the door to my house, I continue, “Turing, open the door, and turn the heating onto the winter setting. And then can you order my favorite takeout for 8 pm, enough for 2.”
This isn’t the future; we are already here – well, almost. Just think of what we can do with Alexa, Siri, and Google Assistant. Natural language processing, combined with ‘The Internet of Things’ (IoT), has voice-activated much of our lives and connected all our tech to the web.
While there are many valid and valuable reasons for implementing natural language processing in computers, the following three are perhaps the most common. The first is to communicate successfully with humans – natural language is much easier to use than formal language when asking a computer to do something. The second is to learn: understanding language helps AIs to learn from the vast amount of human-created information captured on the web and in books. The third is to advance scientific knowledge, combining AI with cognitive science, linguistics, neuroscience, and genetics to better understand the process of natural language comprehension and production.
Language is complex
Formal languages are precise – think of logic ‘if a then b,’ or the maths golden oldie ‘a² + b² = c²,’ or a chemical equation ‘N₂(g) + 3H₂(g) → 2NH₃(g).’ They are neither messy, vague, nor ambiguous – rather, they are clean and clear and governed by strict rules that must be followed for a statement to make sense.
A grammar is the set of rules that defines the structure, or syntax, of a language, and it can generate a seemingly infinite set of sentences. The language’s semantic rules define the meaning behind the words. There is no room for doubt with formal languages, and that is why we use them to instruct computers. Take a command in the programming language Python:
‘>>> for i in range(10, 20): print(i)’ lists all the numbers from 10 to 19.
The statement can’t be misinterpreted. No one will misunderstand it as ‘what time is it in Karachi?’
But what about how we speak? Natural language is a challenge. No one can agree on what’s right or wrong – there is no single definitive structure or meaning. Is ‘To be not invited is sad’ grammatically correct? And what does ‘I opened the door in my dressing gown’ really mean?
Models of language
If natural language rules are ambiguous and vague, how can we teach computers to comprehend and produce well-structured, meaningful sentences? Well, if we can’t definitively say what sentences are right or wrong, grammatical and ungrammatical, we need a model that can tell us their likelihood.
The ‘naive Bayes’ or ‘bag-of-words’ model looks for specific words in a sentence and uses them to classify the sentence into a category. Once the category is known, the model can more easily predict likely words. And yet, it’s not easy; even with a large body, or ‘corpus,’ of text, generating words this way can produce gibberish, because each word is chosen independently of the one before it.
‘N-gram word models’ address the problem by making the probability of the next word dependent on the previous n − 1 words; the same idea also works at the level of characters. Predicted probabilities are typically high for common words or phrases, such as ‘the,’ lower for less frequently used ones, think ‘enigma,’ and hard to estimate at all for those not previously encountered.
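The classification step described above can be sketched in a few lines of Python. This is a minimal illustration, not a production classifier: the four training sentences, the two category names, and the function names `train` and `classify` are all made up for the example, and add-one smoothing is assumed so that unseen words do not produce a zero score.

```python
import math
from collections import Counter, defaultdict

# Toy labeled corpus (hypothetical data, for illustration only).
TRAINING = [
    ("the match ended with a late goal", "sports"),
    ("the team won the cup final", "sports"),
    ("the bank raised interest rates", "business"),
    ("shares fell as the market closed", "business"),
]

def train(examples):
    """Count words per category and how often each category occurs."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def classify(text, word_counts, class_counts, vocab):
    """Return the category with the highest log-probability for the text."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Start from the prior probability of the category.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            # Add-one smoothing so unseen words do not zero out the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

MODEL = train(TRAINING)
print(classify("the team scored a goal", *MODEL))
```

Word order plays no part in the score, which is exactly why the model is called a ‘bag’ of words.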
To avoid extremely low or zero probabilities, smoothing algorithms are sometimes applied.
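As a rough sketch of the idea, here is a bigram model (n = 2) with add-one smoothing, one of the simplest smoothing schemes. The three-sentence corpus, the `<s>` start marker, and the function names are assumptions made for the illustration; real models are trained on far larger corpora and usually use more refined smoothing.

```python
from collections import Counter

# Tiny corpus (hypothetical); real models use millions of sentences.
CORPUS = [
    "a white swan swam past",
    "a white swan flew away",
    "a black swan swam past",
]

def train_bigrams(sentences):
    """Count how often each word is followed by each other word."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = sentence.split()
        vocab.update(tokens)
        words = ["<s>"] + tokens          # "<s>" marks the sentence start
        contexts.update(words[:-1])       # words that have a successor
        bigrams.update(zip(words, words[1:]))
    return contexts, bigrams, vocab

def bigram_prob(prev, word, contexts, bigrams, vocab):
    """P(word | prev) with add-one (Laplace) smoothing."""
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))

def sequence_prob(text, contexts, bigrams, vocab):
    """Multiply the smoothed bigram probabilities along the sequence."""
    words = ["<s>"] + text.split()
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= bigram_prob(prev, word, contexts, bigrams, vocab)
    return prob

CONTEXTS, BIGRAMS, VOCAB = train_bigrams(CORPUS)
print(sequence_prob("a white swan", CONTEXTS, BIGRAMS, VOCAB))
print(sequence_prob("swan a white", CONTEXTS, BIGRAMS, VOCAB))
```

Because of the added count, a pair that never occurred in the corpus still receives a small, non-zero probability instead of wiping out the whole sequence.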
Predicting and categorizing language models
N-gram language models are powerful. Based on counts from a corpus, they can predict that word sequences such as ‘a white swan’ are more common than ‘swan a white.’ And yet, that’s not how it works for you and me. English speakers know instinctively that an article-adjective-noun pattern, and other more complex patterns, are likely even if certain words are unfamiliar.
Some structured open-source word models have been hand-created. And some of them are huge. WordNet, for example, includes more than 155,000 English words grouped into sets of synonyms, or ‘synsets,’ and wordnets now exist for more than 200 languages. It contains entries such as “kitten” (“young domestic cat”) IS A young_mammal.
And yet the AI still remains limited to what it has been told.
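To make the limitation concrete, here is a toy IS-A hierarchy in Python. Only the ‘kitten IS A young_mammal’ link comes from the text; the other entries and the function name `hypernym_chain` are invented for the illustration and are not real WordNet data or its API.

```python
# A tiny hand-built IS-A hierarchy (hypothetical entries, not real WordNet data).
IS_A = {
    "kitten": "young_mammal",
    "young_mammal": "mammal",
    "mammal": "animal",
}

def hypernym_chain(word):
    """Follow IS-A links upward and return every more general category."""
    chain = []
    while word in IS_A:
        word = IS_A[word]
        chain.append(word)
    return chain

print(hypernym_chain("kitten"))
print(hypernym_chain("whataboutism"))
```

A word that nobody has entered by hand simply returns an empty list: the model knows nothing beyond what it has been told.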
Another approach, known as ‘part of speech’ tagging, categorizes words according to their lexical category, such as verb, noun, or adjective, and allows for general rules, such as ‘an adverb typically occurs before a verb.’
But again, there are challenges, due to a lack of a definitive list of categories assigned to each word. To help, the Penn Treebank has been created and contains over 3 million words of text annotated with appropriate tags.
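One very simple way to use such annotated text is a baseline tagger that gives each word the tag it received most often in the training data. The sketch below assumes three hand-tagged sentences invented for the example; the tag names loosely follow Penn Treebank conventions, and the fallback tag for unknown words is an arbitrary choice.

```python
from collections import Counter, defaultdict

# A few hand-tagged sentences (hypothetical; tag names loosely follow
# Penn Treebank conventions).
TAGGED = [
    [("the", "DT"), ("dog", "NN"), ("quickly", "RB"), ("runs", "VBZ")],
    [("the", "DT"), ("runs", "NNS"), ("were", "VBD"), ("long", "JJ")],
    [("she", "PRP"), ("runs", "VBZ"), ("daily", "RB")],
]

def train_tagger(tagged_sentences):
    """Map each word to the tag it was given most often."""
    tag_counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tag_counts[word][tag] += 1
    return {word: counts.most_common(1)[0][0]
            for word, counts in tag_counts.items()}

def tag(sentence, model, default="NN"):
    """Tag each word; unknown words fall back to the default tag."""
    return [(word, model.get(word, default)) for word in sentence.split()]

TAGGER = train_tagger(TAGGED)
print(tag("the dog runs", TAGGER))
```

The word ‘runs’ shows the challenge: it was tagged as both a verb and a plural noun, and this baseline can only ever pick one of them, whatever the context.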
Language rule definitions
A grammar contains the set of rules, sometimes represented as a tree structure, that define allowable phrases, while a language includes the sentences that follow those rules.
However, it’s not always that simple. Informal, natural languages, such as English, Turkish, or Berber, don’t have a single definitive tree structure or a distinct boundary between what is allowable and what is not. Syntactic categories such as noun phrase (the smelly old cheese) and verb phrase (drops the smelly old cheese) offer some help – but are still often not enough.
A lexicon, like a dictionary, is often used to contain all permitted words, yet new words are continually being added to open-class categories such as nouns, verbs, and adjectives – they are always open to new members. For example, the word ‘whataboutism’ was only added to the dictionary in October 2021. By contrast, closed-class categories like prepositions (on, under, after, etc.), determiners (the, my, many, etc.), conjunctions (and, but, if, etc.), and pronouns (I, you, she, it, etc.) change very rarely, allowing the AI to deal with them more easily.
To overcome some of the problems of natural language, AI and language researchers sometimes adopt probabilistic context-free grammars, where each grammar rule, and therefore each string of words it generates, is assigned a probability. They can be good at handling grammatical mistakes, as these simply receive a low probability, and they allow parsers to learn, supervised, from parse trees created by human linguists.
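In a probabilistic context-free grammar, the probability of a parse tree is the product of the probabilities of the rules used to build it. The sketch below assumes a tiny grammar with made-up probabilities and represents a tree as nested tuples; both are choices made for the illustration.

```python
# Hypothetical rule probabilities; for each left-hand side they sum to 1.
RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Det", "Noun")): 0.7,
    ("NP", ("Noun",)): 0.3,
    ("VP", ("Verb", "NP")): 0.6,
    ("VP", ("Verb",)): 0.4,
    ("Det", ("the",)): 1.0,
    ("Noun", ("mouse",)): 0.5,
    ("Noun", ("cheese",)): 0.5,
    ("Verb", ("drops",)): 1.0,
}

def tree_prob(tree):
    """Probability of a parse tree: the product of the rules it uses."""
    label, *children = tree
    # The right-hand side is the sequence of child labels (or words).
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    prob = RULES.get((label, rhs), 0.0)   # unknown rules get probability 0
    for child in children:
        if not isinstance(child, str):
            prob *= tree_prob(child)
    return prob

TREE = ("S",
        ("NP", ("Det", "the"), ("Noun", "mouse")),
        ("VP", ("Verb", "drops"),
               ("NP", ("Det", "the"), ("Noun", "cheese"))))
print(tree_prob(TREE))
```

In this simplified version a rule missing from the table scores exactly zero; a grammar learned from data would typically smooth its probabilities so that unusual constructions are unlikely rather than impossible.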
Uncovering phrase structures
Parsing is essential to understanding natural language. It involves analyzing word strings using the rules of grammar to uncover their phrase structure, such as noun phrases, like ‘the red balloon,’ or verb phrases, like ‘let go of the red balloon.’
The ultimate aim is to break down a sentence into a valid parse tree where its ‘leaves’ are the words of the sentence.
And yet there are challenges. When reading from left to right, the AI may not know the category of a word until nearing the end of the sentence, and may have to go back and re-analyze what it has already read. Compare, for example, ‘Have the class revise algebra!’ with ‘Have the class revised algebra?’
‘Dynamic programming’ can help reduce or avoid the inefficiency of such re-reading by storing the results of analyzing earlier substrings so that they never have to be analyzed again.
Another widely used approach for syntactic analysis is called ‘dependency parsing’ – it assumes a binary relationship between lexical items without the need for syntactic constituents. Dependency parsing is particularly helpful for parsing languages with mostly free word order, such as Latin, while languages such as English, with a fixed word order, are better suited to phrase structure trees.
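A classic example of dynamic programming in parsing is the CYK algorithm, which fills a chart with the categories that can cover each span of words and reuses those entries when analyzing longer spans. The sketch below is a recognizer only (it says whether a sentence is grammatical, without building the tree), and its grammar and lexicon are hypothetical toy examples in Chomsky normal form.

```python
from itertools import product

# A toy grammar in Chomsky normal form (hypothetical, for illustration).
BINARY = {            # (B, C) -> set of A such that A -> B C
    ("NP", "VP"): {"S"},
    ("Det", "Noun"): {"NP"},
    ("Verb", "NP"): {"VP"},
}
LEXICON = {           # word -> set of possible categories
    "the": {"Det"},
    "mouse": {"Noun"},
    "cheese": {"Noun"},
    "drops": {"Verb", "Noun"},
}

def cyk_recognize(sentence, start="S"):
    """Return True if the grammar can derive the sentence from `start`."""
    words = sentence.split()
    n = len(words)
    # chart[(i, j)] holds every category that can span words[i:j].
    chart = {}
    for i, word in enumerate(words):
        chart[(i, i + 1)] = set(LEXICON.get(word, ()))
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            cell = set()
            for k in range(i + 1, j):
                # Reuse the stored results for the two shorter spans.
                for left, right in product(chart[(i, k)], chart[(k, j)]):
                    cell |= BINARY.get((left, right), set())
            chart[(i, j)] = cell
    return n > 0 and start in chart[(0, n)]

print(cyk_recognize("the mouse drops the cheese"))
print(cyk_recognize("drops the mouse the cheese"))
```

Note that ‘drops’ is listed as both a verb and a noun: the chart keeps both possibilities alive until the surrounding words settle the question, so nothing has to be re-analyzed.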
Learning from examples
Building a grammar by hand is a monumental task that will likely result in errors. Learning the grammatical rules and their probabilities from examples may be better.
The Penn Treebank contains over 100 thousand sentences and their parse trees and is a popular resource for the purposes of supervised learning. By counting the number of times each node type appears in a tree, we can create probabilistic context-free grammars.
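The counting step can be sketched as follows: each rule’s probability is estimated as the number of times it appears divided by the number of times its left-hand side appears. The three-tree ‘treebank’ and the nested-tuple tree format are assumptions made for this example and are not the Penn Treebank’s actual file format.

```python
from collections import Counter

# A miniature treebank (hypothetical); each tree is a nested tuple.
TREEBANK = [
    ("S", ("NP", ("Det", "the"), ("Noun", "mouse")), ("VP", ("Verb", "sleeps"))),
    ("S", ("NP", ("Noun", "cheese")), ("VP", ("Verb", "smells"))),
    ("S", ("NP", ("Det", "the"), ("Noun", "cheese")), ("VP", ("Verb", "smells"))),
]

def rules_in(tree):
    """Yield every (left-hand side, right-hand side) rule used in a tree."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        if not isinstance(child, str):
            yield from rules_in(child)

def estimate_pcfg(trees):
    """Estimate rule probabilities as relative frequencies."""
    rule_counts = Counter(rule for tree in trees for rule in rules_in(tree))
    lhs_totals = Counter()
    for (lhs, _), count in rule_counts.items():
        lhs_totals[lhs] += count
    return {rule: count / lhs_totals[rule[0]]
            for rule, count in rule_counts.items()}

PCFG = estimate_pcfg(TREEBANK)
print(PCFG[("NP", ("Det", "Noun"))])
```

Here two of the three noun phrases use a determiner followed by a noun, so that rule is estimated at two thirds.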
And yet, no treebank can be totally perfect or error-free. Another approach, called ‘unsupervised learning,’ also uses a large body of example sentences but, this time, without parse trees. And it is possible to combine both methods, beginning supervised with a small number of trees to build the initial grammar before going unsupervised.
We can improve our model by recognizing that noun phrases aren’t equally likely in every situation. ‘The bold monkey’ is not expected to appear in a sentence about quantum mechanics.
What’s so difficult about natural language?
Natural language offers a range of problems to the poor AI attempting to make sense of what we say or write.
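Even a single parse of a sentence can still leave us with semantic ambiguity, a problem that arises with ‘quantification.’ In ‘Every student read a book,’ did they all read the same one? Ambiguity also comes from structure: ‘He shot the alligator in his trousers.’ Was it only the man in the trousers, or did the alligator get in there too?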
Complete understanding of natural language involves more than semantic interpretation; it requires context-dependent information – or ‘pragmatics.’
Sentences often have gaps that result from ‘long-distance dependencies.’ We also use pronouns such as ‘him,’ ‘her,’ ‘them,’ and ‘it’ to refer back to subjects and objects mentioned earlier. And yet, it can be tough to work out who or what they are.
In the English language, we use verb tenses to represent the timing of events – ‘Sam loves horse riding’ versus ‘Sam loved horse riding.’ An AI must not only unpick the syntax and semantics of the sentence but also how one event sits relative to others.
Cutting-edge natural language analysis
Understanding natural language has far-reaching implications: speech recognition, machine translation, information retrieval, and question answering, to name a few.
Teams at DeepMind recognize that AIs that can predict and generate text offer great potential in summarizing information, offering expert advice, and following instructions given in natural language.
Gopher is a 280-billion-parameter language model trained on a 10.5 TB corpus called MassiveText, and it outperforms existing state-of-the-art models on 100 of 124 evaluation tasks. Such ‘foundation models’ are trained on a broad range of data and are having huge successes in large-scale deep learning across a wide range of tasks.
Gopher is particularly impressive because it offers surprising coherence when used in direct interaction or chat mode, making significant strides toward human expert performance. However, as with all models, it has its limitations, including a tendency for repetition, the risk of propagating incorrect information, and the potential for stereotypical bias. Indeed, AI can unintentionally pick up prejudice from online text and internalize it without anyone realizing its influence on decision-making.
In response to the risks posed by large language models, DeepMind has created a taxonomy of ethical and social areas of concern to support responsible decision-making.
Using natural language to create vast datasets for deep learning
While areas of AI are often tackled separately, there can be benefits from combining approaches.
After all, deep learning requires a lot of data. Indeed, creating the ImageNet database for training vision models took 25,000 workers to annotate 14 million images for 22,000 object categories.
So, CLIP (Contrastive Language–Image Pre-training) does something very different; it uses publicly available text-image pairs taken from the internet and combines the power of natural language processing and computer vision, so that the accompanying text, rather than a human annotator, labels each image.
CLIP aims for zero-shot performance: it is trained on images and their accompanying text collected from the internet, using the text as the supervision signal. Once trained, the system is tested on a new set of appropriately sampled images to predict those that meet a description, such as ‘a photo of a cat’ or ‘a photo of a dog.’
CLIP’s training set is broad and includes noisy, or meaningless, data; as a result, it is much more flexible than other existing trained models at recognizing everyday objects.
Natural language processing, therefore, offers incredible potential to make accessible a vast amount of training data for supervised learning based on what is already available online.
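The matching step can be illustrated with a small sketch. CLIP maps images and text descriptions into a shared vector space and compares them by similarity; here the learned encoders are replaced with made-up three-dimensional vectors, so the numbers, the dictionary `TEXT_EMBEDDINGS`, and the function `zero_shot_label` are purely hypothetical stand-ins and not CLIP’s actual API.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional embeddings standing in for text-encoder outputs.
TEXT_EMBEDDINGS = {
    "a photo of a cat": [0.9, 0.1, 0.2],
    "a photo of a dog": [0.1, 0.9, 0.2],
}

def zero_shot_label(image_embedding, text_embeddings=TEXT_EMBEDDINGS):
    """Pick the description whose embedding is most similar to the image's."""
    return max(text_embeddings,
               key=lambda d: cosine(image_embedding, text_embeddings[d]))

# A made-up image embedding that lies close to the "cat" description.
print(zero_shot_label([0.8, 0.2, 0.1]))
```

Because the candidate labels are just text, new categories can be tried by writing new descriptions, without retraining the model.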