Some basics and some terminology about generative AI (for now).
Have you been in conversation with someone who uses the term “AI” and felt a mounting sense of panic that you don’t know the first thing about it? Have you been tempted to explore it, perhaps in teaching and learning, but worry that you don’t know how it works or that your students will know more than you? How does the “black box” of a generative AI function? Why do they get things wrong? What is with the unnerving images they can sometimes make? Why does ChatGPT say “certainly” so much?
There’s a large void in most people’s understanding of generative AI, and this article isn’t going to fix that. No one article could. Rather, it provides a basic exploration of how these tools work and defines some key terms. This is not an exhaustive account, but it should give you a greater understanding of how these tools work and what their limitations are, and equip you with enough information to have an explorative conversation with someone more informed. However, I should offer a big caveat: this is not a fixed space by any means. The technology is changing very fast. There are already multimodal generative AIs, and new innovations are happening all of the time. You’ll likely find that this information becomes out of date quickly. Nevertheless, it will still serve as a good introduction, even if it soon becomes historical context.
AI-Powered Conversational Agents (ChatGPT, Claude, Gemini, etc):
When someone says “AI”, this is likely what you think of: a chat-based interactive system that can generate text and other outputs in response to written ‘prompts’ that you provide. An AI-powered conversational agent is a more specific term than “generative AI”, which also covers tools that produce images, audio, or video, and it appropriately differentiates these tools from chatbots more generally, which may include home assistants such as Alexa, or tools that post to social media automatically. However, calling an AI-powered conversational agent “an AI” is something of a simplification. In reality, there are several parts and processes involved in an AI-powered conversational agent.
All AI-powered conversational agents have been trained on huge volumes of data, often with a process of reinforcement learning guided by human contributions and emendations. Some of this data has a questionable pedigree, perhaps coming from online posts and interactions on social media sites, whilst other parts may come from more reputable sources, including peer-reviewed publications. In either case, one may question whether fair attribution has been given in the production of a data set and its subsequent use in the generation of outputs. Data sets tend to come from two main sources. The first is commercially produced or publicly available data sets, often made up of material gathered from users who have agreed to a platform’s terms and privacy conditions (eg Facebook, Reddit, some publications, etc). The second is web scraping, where a tool or script gathers the content of internet sites and parses it for particular forms of content, be that text, images, code, or something else.
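To make the scraping-and-parsing step concrete, here is a toy sketch in Python using the standard library’s `html.parser`. The page content and class name are invented for illustration; real scraping pipelines are far larger and must also handle robots.txt, rate limits, and deduplication.

```python
from html.parser import HTMLParser

class TextScraper(HTMLParser):
    """Collects visible text from an HTML page, skipping script/style blocks."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # keep only non-empty visible text
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

# a stand-in for a downloaded web page
page = ("<html><body><h1>AI and teaching</h1>"
        "<script>var x = 1;</script>"
        "<p>Generative AI basics.</p></body></html>")

scraper = TextScraper()
scraper.feed(page)
print(scraper.chunks)  # the text a scraper would add to a data set
```

Note that everything on the page ends up in the data set indiscriminately, which is exactly why questions of pedigree and attribution arise.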
This data is then passed through a process of Machine Learning (ML). ML can be understood as an AI’s first attempt to internalise the knowledge provided in a data set, often with guidance from a human agent. There are many forms of ML, so a broad definition is difficult, but they all provide a method for a computer or tool to execute specific tasks without having been explicitly programmed for them. Initially, this process of learning is limited in that it often relies upon data first being labelled and classified by a human agent, enabling the AI to classify and differentiate between forms of information. It is, to use a metaphor, like interpreting a sign or symbol you have never encountered before: the symbols have no meaning until meaning is provided by some outside source. With some guidance, you can come to understand and then apply your understanding of these symbols to others encountered in a broader context.
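The role of human labelling can be sketched with a deliberately tiny example: a nearest-centroid classifier. The points and labels below are invented; the point is that the labels (“cat”/“dog”) are the meaning supplied by a human agent, and the program was never explicitly given the rule it ends up applying.

```python
# labelled training data: (features, human-provided label)
labelled_data = [
    ((1.0, 1.0), "cat"), ((1.2, 0.8), "cat"),
    ((4.0, 4.2), "dog"), ((3.8, 4.0), "dog"),
]

def train(examples):
    """Average the points for each label into a centroid."""
    groups = {}
    for point, label in examples:
        groups.setdefault(label, []).append(point)
    return {label: tuple(sum(coord) / len(coord) for coord in zip(*points))
            for label, points in groups.items()}

def classify(model, point):
    """Assign the label of the nearest centroid."""
    return min(model, key=lambda label: sum((a - b) ** 2
                                            for a, b in zip(model[label], point)))

model = train(labelled_data)
print(classify(model, (1.1, 0.9)))  # -> cat
```

Once trained, the model generalises the human-supplied meaning to points it has never seen, which is the essence of the metaphor above.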
If Machine Learning (ML) is learning symbols as a pre-literate, then Deep Learning (DL) is an AI coming to play with those symbols independently, exploring novel combinations and finding patterns with little or no intervention or signalling. DL is an extension of ML, but one that is layered in complexity (deeply so), involving the use of deep neural networks to find and form representative patterns in information. As a consequence, DL algorithms can handle raw data (such as data sets or the prompts you write) without human intervention, finding and forming patterns in a manner comparable to an independent learner.
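The “layered” part can be illustrated with the smallest possible neural network: two layers computing XOR, a pattern no single-layer model can represent. In real deep learning the weights are found by training; here they are hand-set purely for illustration.

```python
import math

def sigmoid(x):
    """Squash any number into the range (0, 1)."""
    return 1 / (1 + math.exp(-x))

# hand-set weights for a 2-2-1 network computing XOR
hidden = [((20, 20), -10),    # unit fires if at least one input is on (OR)
          ((-20, -20), 30)]   # unit fires unless both inputs are on (NAND)
output = ((20, 20), -30)      # fires only when both hidden units fire (AND)

def forward(x1, x2):
    # layer 1: raw inputs become intermediate representations
    h = [sigmoid(w1 * x1 + w2 * x2 + b) for (w1, w2), b in hidden]
    # layer 2: those representations are combined into the answer
    (w1, w2), b = output
    return sigmoid(w1 * h[0] + w2 * h[1] + b)

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, round(forward(a, b)))  # XOR truth table: 0, 1, 1, 0
```

Each layer re-represents the data in a form the next layer can use, and stacking many such layers is what makes deep networks able to form patterns from raw data.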
Several architectures can be used in DL models, but one of the most influential and important is a ‘Transformer’: the “T” bit of ChatGPT, where GPT stands for Generative Pre-trained Transformer. First introduced by Vaswani et al. in their 2017 paper “Attention Is All You Need”, a transformer processes (transforms) data it receives by attending to sequences within that data and selecting aspects of greater importance to that overall sequence. That is to say, a transformer has an attention mechanism that enables the contextual assessment of information in relation to other information around it. This allows the AI to place “its attention” where it is needed. For example, what you write into a chat interface is interpreted as ‘tokens’ by the AI. A token may be a word, part of a word, or a punctuation mark. The importance or relevance of these tokens in relation to each other is assessed by a transformer to respond meaningfully to the prompt you entered. In turn, this influences the tone, focus, clarity, and nature of the output you receive in response.
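A heavily simplified sketch of the attention idea, with invented three-number “embeddings” standing in for the learned vectors a real transformer uses: each token’s relevance to a query token is scored by a dot product, and the scores are normalised into weights.

```python
import math

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# toy embeddings: each token is a small vector (real models learn these,
# and use far larger vectors)
tokens = {"the": [1.0, 0.0, 0.1],
          "cat": [0.2, 1.0, 0.0],
          "sat": [0.1, 0.9, 0.3]}

def attention_weights(query, keys):
    """Scaled dot-product scores, normalised into attention weights."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

sentence = ["the", "cat", "sat"]
weights = attention_weights(tokens["sat"], [tokens[t] for t in sentence])
for tok, w in zip(sentence, weights):
    print(f"{tok}: {w:.2f}")  # how much "sat" attends to each token
```

With these made-up vectors, “sat” attends more to “cat” than to “the”, which is the kind of contextual weighting that shapes the model’s response.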
There are two slightly different types of DL at work in most AI-powered conversational agents. The first, and most widely known, is a Large Language Model or LLM. An LLM can be considered the AI’s internalised knowledge rather than its ‘intellect’. It is a composite ‘model’ of language, stored not in a database but as a range of parameters and potentialities, all formed from the initial data sets the AI was trained on.
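The distinction between a database and a model of language can be illustrated with a deliberately crude stand-in: a bigram model that learns which word tends to follow which from a tiny invented corpus. Real LLMs encode this in billions of learned weights rather than a lookup table, but both store probabilities about language, not facts.

```python
from collections import Counter, defaultdict

# a tiny invented training corpus
corpus = "the cat sat on the mat the cat ate the fish".split()

# "training": count which word follows which
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_word_probs(word):
    """The model's learned distribution over what comes next."""
    total = sum(counts[word].values())
    return {w: c / total for w, c in counts[word].items()}

print(next_word_probs("the"))  # {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
```

Nothing here records that a cat sat anywhere; the model only knows that, after “the”, “cat” is the likeliest continuation. That is internalised pattern, not stored knowledge.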
The second kind of DL in a conversational agent is Natural Language Processing (NLP). This is the bit that puts the “Chat” in “ChatGPT”. NLP techniques process, understand, and then replicate human language, again using transformers as part of the core interactive and generative technology, allowing the AI both to interpret the prompts you give it and to appear human-like in the textual outputs it provides.
All of this is important to understand because it clarifies what an AI-powered chat agent does. It produces probabilistically generated outputs based on user inputs and trained data, both of which are interpreted through similarly probabilistic, comparative assessments. That is to say, AI chat agents are random and productive more than they are creative. They’re not entirely random, of course, because they’ve learned patterns and had reinforcement training from humans, but where they are creative it is based on probability rather than insight or instinct. So-called ‘hallucination’ is, then, a feature of how AI works rather than a bug, with probabilities occasionally leading to inaccurate or outright strange outputs. How frequent these are depends on how constrained the AI has been by subsequent human instruction or reinforcement training, and how successfully its DL techniques have formed meaningful patterns. Nevertheless, there is no critical capability and no truly creative capability without human input. This is crucial to remember when we are concerned with academic integrity, or with how AI may threaten a student’s (or our own) sense of creative and intellectual agency: these tools are incapable of higher-order thinking, limited in their capacity to analyse, and only capable of reproducing or transposing acts of critical thought.
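The probabilistic nature of generation, and why hallucination falls out of it, can be seen in a toy sampling sketch. The distribution below is invented, but the mechanism is the point: most draws pick the likely continuation, yet low-probability options are still drawn sometimes, producing plausible-but-wrong outputs.

```python
import random

# an invented distribution over possible next tokens
next_token_probs = {"Paris": 0.90, "Lyon": 0.07, "Mars": 0.03}

random.seed(0)  # fixed seed so the sketch is repeatable
draws = random.choices(list(next_token_probs),
                       weights=next_token_probs.values(),
                       k=1000)

for token in next_token_probs:
    print(token, draws.count(token))  # roughly 900 / 70 / 30
```

Even with a 3% probability, the strange continuation appears dozens of times in a thousand draws. No amount of reinforcement removes this entirely; it only reshapes the probabilities.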
If you’d like further information on how AI-powered Conversational Agents work, OpenAI provides a detailed breakdown of all of ChatGPT’s capabilities. Whilst this is specific to ChatGPT, many of the general principles apply to other AI Conversational Agents, too.
AI Image Generators (DALL-E, Stable Diffusion, ImageFX, etc.):
As the name would suggest, AI-powered image generators focus on generating visual content and, just like their literary counterparts, are trained on vast amounts of data to learn patterns and features. With image generators, of course, the data is visual rather than textual, including images, videos, and graphics. Still, the same questions can be raised about the provenance of the data and the potential lack of accountability and creative attribution. Equally, the parsing of this data works as it does in a conversational agent: there is an initial, governed process of Machine Learning (ML) followed by a more independent process of Deep Learning (DL), which again relies on transformers and attention mechanisms to parse, sort, and create patterns from the original data. However, where image generators differ is in the inclusion of additional or alternative forms of DL.
Firstly, they use Convolutional Neural Networks (CNNs) to capture spatial patterns and features from image data. There are several layers to this process, beginning with identifying the edges or boundaries between one thing and another in an image, and gradually building toward helping the AI classify images and objects (eg humans, paperweights, coffee cups, etc).
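The edge-finding first layer can be sketched directly. The “image” below is an invented grid of brightness values with a vertical boundary in the middle, and the 3×3 kernel responds wherever pixel values change horizontally; real CNNs learn thousands of such kernels rather than having them written by hand.

```python
# a tiny "image": dark pixels (0) on the left, bright pixels (1) on the right
image = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]

# a hand-written vertical-edge kernel (real CNNs learn these)
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

def convolve(img, ker):
    """Slide the kernel over the image, summing elementwise products."""
    kh, kw = len(ker), len(ker[0])
    out = []
    for i in range(len(img) - kh + 1):
        row = []
        for j in range(len(img[0]) - kw + 1):
            row.append(sum(img[i + di][j + dj] * ker[di][dj]
                           for di in range(kh) for dj in range(kw)))
        out.append(row)
    return out

print(convolve(image, kernel))  # [[0, 3, 3, 0]] -- strong response at the boundary
```

The output is flat where the image is uniform and peaks where dark meets bright; deeper layers combine many such responses into shapes and, eventually, recognisable objects.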
Two different networks, typically CNN-based, are combined to form a Generative Adversarial Network or GAN, which purposefully competes against itself through a process of generation and discrimination. The generator part does as you may expect: it generates. Initially, this produces very basic shapes, patterns, and vectors, but over time it works toward increasingly realistic images. The discriminator drives this development, taking information from the training data and comparing it with the generator’s outputs, distinguishing between generated and real images and so pushing the generator to improve.
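The adversarial loop can be sketched structurally by shrinking “images” down to single numbers. Everything here is invented for illustration: “real” data clusters around 5.0, the generator starts far away and nudges its output whenever the discriminator catches it, and the discriminator simply judges samples by distance from the real data. Real GANs train both networks by gradient descent.

```python
import random

random.seed(1)
real_mean = 5.0   # what "real" data looks like
gen_value = 0.0   # the generator's starting point

def discriminator(sample):
    """Judge a sample: True means 'looks real'."""
    return abs(sample - real_mean) < 1.0

# the adversarial loop: generate, get judged, adjust
for step in range(200):
    fake = gen_value + random.gauss(0, 0.1)   # generator's attempt, with noise
    if not discriminator(fake):
        # caught out: nudge the generator toward realism
        gen_value += 0.05 * (real_mean - gen_value)

print(round(gen_value, 2))  # has moved from 0 toward the real data
```

Notably, the generator stops improving once its outputs reliably fool the discriminator, not once they match reality; its realism is only ever as good as its adversary.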
Again, creativity here should be understood more as probabilistic productivity that responds to prompts and inputs from a human agent.