Images produced by DALL-E when given the text prompt "a professional high quality illustration of a giraffe dragon chimera. a giraffe imitating a dragon. a giraffe made of dragon."

| Original author(s) | OpenAI |
|---|---|
| Initial release | 5 January 2021 |
| Type | Transformer language model |
| Website | www |
DALL-E (stylized as DALL·E) is an artificial intelligence program that creates images from textual descriptions, revealed by OpenAI on January 5, 2021.[1][2] It uses a 12-billion parameter[3] version of the GPT-3 transformer model to interpret natural language inputs (such as "a green leather purse shaped like a pentagon" or "an isometric view of a sad capybara") and generate corresponding images.[1] It can create images of realistic objects ("a stained glass window with an image of a blue strawberry") as well as objects that do not exist in reality ("a cube with the texture of a porcupine").[3][4][5][6][7][8] DALL-E's name is a portmanteau of WALL-E and Dalí.[1][3]
Many neural networks from the 2000s onwards have been able to generate realistic images;[1] DALL-E, however, can generate them from natural language prompts, which it "understands [...] and rarely fails in any serious way".[1]
DALL-E was developed, and announced to the public, in conjunction with another model, CLIP (Contrastive Language-Image Pre-training),[2] whose role is to "understand and rank" its output.[1] DALL-E's raw output is curated by CLIP, which presents the highest-quality images for any given prompt.[2] OpenAI has refused to release source code for either model; a "controlled demo" of DALL-E is available on OpenAI's website, where output from a limited selection of sample prompts can be viewed.[3]
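A minimal sketch of this curation step in Python, assuming OpenAI's open-source CLIP package (github.com/openai/CLIP); the candidate files and counts below are hypothetical stand-ins, since DALL-E's own output is not publicly available:

```python
# Hedged sketch: rank hypothetical candidate images against a text prompt
# with CLIP and keep the highest-scoring ones, mirroring the curation step
# described for the public demo. File names and counts are illustrative.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

prompt = "an armchair in the shape of an avocado"
candidates = [Image.open(f"candidate_{i}.png") for i in range(64)]

with torch.no_grad():
    text_features = model.encode_text(clip.tokenize([prompt]).to(device))
    image_batch = torch.stack([preprocess(im) for im in candidates]).to(device)
    image_features = model.encode_image(image_batch)

    # Cosine similarity between the prompt and every candidate image
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    scores = (image_features @ text_features.T).squeeze(1)

# Present the best-ranked images for the prompt
top = scores.argsort(descending=True)[:8]
best_images = [candidates[i] for i in top.tolist()]
```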
According to MIT Technology Review, one of OpenAI's objectives through DALL-E's development was to "give language models a better grasp of the everyday concepts that humans use to make sense of things".[2]
Architecture
The Generative Pre-trained Transformer (GPT) model was first developed by OpenAI in 2018,[9] using the transformer architecture. The first iteration, GPT, was scaled up to produce GPT-2 in 2019.[10] In 2020, GPT-2 was augmented similarly to produce GPT-3,[11] of which DALL-E is an implementation.[3][12] It uses zero-shot learning to generate output from a description and cue without further training.[13]
DALL-E's model is a 12-billion parameter version of GPT-3[3] (scaled down from GPT-3's parameter size of 175 billion)[11] which "swaps text for pixels", trained on text-image pairs from the Internet.[2]
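The cited paper frames this as ordinary language modeling over a mixed sequence: a caption's text tokens followed by a grid of discrete image codes, trained with next-token prediction.[12] Below is a minimal PyTorch sketch of that idea; the tiny `TextToImageLM` is a hypothetical stand-in for the 12-billion-parameter model, though the vocabulary sizes and sequence lengths follow the paper.

```python
# Minimal sketch (not OpenAI's code): one autoregressive transformer over
# a shared vocabulary of text tokens followed by discrete image codes.
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB = 16384, 8192   # BPE text codes, discrete image codes
TEXT_LEN, IMAGE_LEN = 256, 32 * 32      # 1,024 image tokens = a 32x32 grid

class TextToImageLM(nn.Module):
    def __init__(self, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        vocab = TEXT_VOCAB + IMAGE_VOCAB          # one shared vocabulary
        self.embed = nn.Embedding(vocab, d_model)
        self.pos = nn.Embedding(TEXT_LEN + IMAGE_LEN, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens):                    # tokens: (batch, seq)
        seq = tokens.size(1)
        # Causal mask: each position attends only to earlier positions
        mask = torch.triu(torch.full((seq, seq), float("-inf"),
                                     device=tokens.device), diagonal=1)
        x = self.embed(tokens) + self.pos(torch.arange(seq, device=tokens.device))
        return self.head(self.blocks(x, mask=mask))

# Training pairs a caption with its image's codes in a single sequence;
# generation feeds only the caption and samples the 1,024 image tokens,
# which a separate decoder then turns back into pixels.
model = TextToImageLM()
caption = torch.randint(0, TEXT_VOCAB, (1, TEXT_LEN))
image_codes = TEXT_VOCAB + torch.randint(0, IMAGE_VOCAB, (1, IMAGE_LEN))
logits = model(torch.cat([caption, image_codes], dim=1))  # next-token logits
```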
DALL-E generates large numbers of images in response to prompts; another OpenAI model, CLIP, was developed in conjunction with DALL-E (and announced simultaneously) to "understand and rank" its output.[1] CLIP was trained on over 400 million pairs of images and text.[3] CLIP is an image recognition system;[2] however, unlike most classifier models, it was not trained on curated datasets of labeled images (such as ImageNet), but on images and descriptions scraped from the Internet.[2] Rather than learning from a single label, CLIP learns to associate images with entire captions.[2] CLIP was trained to predict which caption (out of a "random selection" of 32,768 possible captions) was most appropriate for an image, allowing it to subsequently identify objects in a wide variety of images outside its training set.[2]
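A minimal sketch of the contrastive objective described above, assuming hypothetical image and text encoders that emit fixed-size feature vectors; the training batch of N pairs supplies the pool of candidate captions:

```python
# Hedged sketch of a CLIP-style contrastive loss: within a batch of N
# image-text pairs, each image must identify its own caption among the
# N captions, and vice versa. (The "random selection" of 32,768
# candidates corresponds to the training batch size.)
import torch
import torch.nn.functional as F

def contrastive_loss(image_features, text_features, temperature=0.07):
    # Normalize so dot products are cosine similarities
    img = F.normalize(image_features, dim=-1)
    txt = F.normalize(text_features, dim=-1)
    logits = img @ txt.T / temperature            # (N, N) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: images -> captions and captions -> images
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

# Example with random stand-in features for a batch of 8 pairs
loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```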
Performance
DALL-E is capable of generating imagery in a variety of styles, from photorealistic imagery[3] to paintings and emoji. It can also "manipulate and rearrange" objects in its images.[3] One ability noted by its creators was the correct placement of design elements in novel compositions without explicit instruction: "For example, when asked to draw a daikon radish blowing its nose, sipping a latte, or riding a unicycle, DALL·E often draws the kerchief, hands, and feet in plausible locations."[14]
While DALL-E exhibited a wide variety of skills and abilities, on the release of its public demo, most coverage focused on a small subset of "surreal"[2] or "quirky"[15] output images. Specifically, DALL-E's output for "an illustration of a baby daikon radish in a tutu walking a dog" was mentioned in pieces from Input, NBC, Nature, VentureBeat, Wired, CNN, New Scientist and the BBC;[16][6][17][3][18][19][20][21] its output for "an armchair in the shape of an avocado" was reported on by Wired, VentureBeat, New Scientist, NBC, MIT Technology Review, CNBC, CNN and the BBC.[18][3][20][6][2][15][19][21] In contrast, DALL-E's unintentional development of visual reasoning skills sufficient to solve Raven's Matrices (visual tests often administered to humans to measure intelligence) was reported on by machine learning engineer Dale Markowitz in a piece for TheNextWeb.[22]
Nature introduced DALL-E as "an artificial-intelligence program that can draw pretty much anything you ask for".[17] TheNextWeb's Thomas Macaulay called its images "striking" and "seriously impressive", remarking on its ability to "create entirely new pictures by exploring the structure of a prompt — including fantastical objects combining unrelated ideas that it was never fed in training".[23] ExtremeTech said that "sometimes the renderings are little better than fingerpainting, but other times they’re startlingly accurate portrayals";[24] TechCrunch noted that, while DALL-E was "fabulously interesting and powerful work", it occasionally produced bizarre or incomprehensible output, and "many images it generates are more than a little… off":[1]
Saying “a green leather purse shaped like a pentagon” may produce what’s expected but “a blue suede purse shaped like a pentagon” might produce nightmare fuel. Why? It’s hard to say, given the black-box nature of these systems.[1]
Despite this, DALL-E was described as "remarkably robust to such changes" and reliable in producing images for a wide variety of arbitrary descriptions.[1] Sam Shead, reporting for CNBC, called its images "quirky" and quoted Neil Lawrence, a professor of machine learning at the University of Cambridge, describing it as an "inspirational demonstration of the capacity of these models to store information about our world and generalize in ways that humans find very natural". He also quoted Mark Riedl, an associate professor at the Georgia Tech School of Interactive Computing, as saying that DALL-E's demonstration results showed that it was able to "coherently blend concepts", a key element of human creativity, and that "the DALL-E demo is remarkable for producing illustrations that are much more coherent than other Text2Image systems I’ve seen in the past few years."[15] Riedl was also quoted by the BBC saying that he was "impressed by what the system could do".[21]
DALL-E's ability to "fill in the blanks" and infer appropriate details without specific prompts was remarked on as well. ExtremeTech noted that a prompt to draw a penguin wearing a Christmas sweater produced images of penguins wearing not only sweaters but also thematically related Santa hats,[24] and Engadget noted that appropriately placed shadows appeared in output for the prompt "a painting of a fox sitting in a field during winter".[13] Furthermore, DALL-E exhibits a broad understanding of visual and design trends; ExtremeTech said that "you can ask DALL-E for a picture of a phone or vacuum cleaner from a specified period of time, and it understands how those objects have changed".[24] Engadget also noted its unusual ability of "understanding how telephones and other objects change over time".[13]
Implications
OpenAI has refused to release the source code for DALL-E or to allow any use of it beyond a small number of sample prompts;[3] OpenAI stated that it planned to "analyze the societal impacts"[23] and "potential for bias" in models like DALL-E.[15] Despite this lack of access, at least one potential implication has been discussed: several journalists and content writers predicted that DALL-E could affect the fields of journalism and content writing. Sam Shead's CNBC piece noted that some had concerns about the then-lack of a published paper describing the system, and that DALL-E had not been "opened sourced" [sic].[15]
While TechCrunch said "don’t write stock photography and illustration’s obituaries just yet",[1] Engadget said that "if developed further, DALL-E has vast potential to disrupt fields like stock photography and illustration, with all the good and bad that entails".[13]
In a Forbes opinion piece, venture capitalist Rob Toews said that DALL-E "presaged the dawn of a new AI paradigm known as multimodal AI", in which systems would be capable of "interpreting, synthesizing and translating between multiple informational modalities"; he went on to say that DALL-E demonstrated "it is becoming harder and harder to deny that artificial intelligence is capable of creativity". Based on the sample prompts (which included clothed mannequins and items of furniture), he predicted that DALL-E might be used by fashion designers and furniture designers, adding that "the technology is going to continue to improve rapidly".[25]
References
- ^ a b c d e f g h i j k Coldewey, Devin (5 January 2021). "OpenAI's DALL-E creates plausible images of literally anything you ask it to". TechCrunch. Retrieved 5 January 2021.
- ^ a b c d e f g h i j k Heaven, Will Douglas (5 January 2021). "This avocado armchair could be the future of AI". MIT Technology Review. Retrieved 5 January 2021.
- ^ a b c d e f g h i j k l Johnson, Khari (5 January 2021). "OpenAI debuts DALL-E for generating images from text". VentureBeat. Retrieved 5 January 2021.
- ^ Grossman, Gary (16 January 2021). "OpenAI's text-to-image engine, DALL-E, is a powerful visual idea generator". VentureBeat. Retrieved 2 March 2021.
- ^ Andrei, Mihai (8 January 2021). "This AI module can create stunning images out of any text input". ZME Science. Retrieved 2 March 2021.
- ^ a b c Ehrenkranz, Melanie (27 January 2021). "Here's DALL-E: An algorithm learned to draw anything you tell it". NBC News. Retrieved 2 March 2021.
- ^ Walsh, Bryan (5 January 2021). "A new AI model draws images from text". Axios. Retrieved 2 March 2021.
- ^ "For Its Latest Trick, OpenAI's GPT-3 Generates Images From Text Captions". Synced. 5 January 2021. Retrieved 2 March 2021.
- ^ Radford, Alec; Narasimhan, Karthik; Salimans, Tim; Sutskever, Ilya (11 June 2018). "Improving Language Understanding by Generative Pre-Training" (PDF). OpenAI. p. 12. Retrieved 23 January 2021.
- ^ Radford, Alec; Wu, Jeffrey; Child, Rewon; Luan, David; Amodei, Dario; Sutskever, Ilya (14 February 2019). "Language models are unsupervised multitask learners" (PDF). 1 (8). Retrieved 19 December 2020.
- ^ a b Brown, Tom B.; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, Jared; Dhariwal, Prafulla; Neelakantan, Arvind; Shyam, Pranav; Sastry, Girish; Askell, Amanda; Agarwal, Sandhini; Herbert-Voss, Ariel; Krueger, Gretchen; Henighan, Tom; Child, Rewon; Ramesh, Aditya; Ziegler, Daniel M.; Wu, Jeffrey; Winter, Clemens; Hesse, Christopher; Chen, Mark; Sigler, Eric; Litwin, Mateusz; Gray, Scott; Chess, Benjamin; Clark, Jack; Berner, Christopher; McCandlish, Sam; Radford, Alec; Sutskever, Ilya; Amodei, Dario (July 22, 2020). "Language Models are Few-Shot Learners". arXiv:2005.14165 [cs.CL].
- ^ Ramesh, Aditya; Pavlov, Mikhail; Goh, Gabriel; Gray, Scott; Voss, Chelsea; Radford, Alec; Chen, Mark; Sutskever, Ilya (24 February 2021). "Zero-Shot Text-to-Image Generation". arXiv:2101.12092 [cs.LG].
- ^ a b c d Dent, Steve (6 January 2021). "OpenAI's DALL-E app generates images from just a description". Engadget. Retrieved 2 March 2021.
- ^ Dunn, Thom (10 February 2021). "This AI neural network transforms text captions into art, like a jellyfish Pikachu". BoingBoing. Retrieved 2 March 2021.
- ^ a b c d e Shead, Sam (8 January 2021). "Why everyone is talking about an image generator released by an Elon Musk-backed A.I. lab". CNBC. Retrieved 2 March 2021.
- ^ Kasana, Mehreen (7 January 2021). "This AI turns text into surreal, suggestion-driven art". Input. Retrieved 2 March 2021.
- ^ a b Stove, Emma (5 February 2021). "Tardigrade circus and a tree of life — January's best science images". Nature. Retrieved 2 March 2021.
- ^ a b Knight, Will (26 January 2021). "This AI Could Go From 'Art' to Steering a Self-Driving Car". Wired. Retrieved 2 March 2021.
- ^ a b Metz, Rachel (2 February 2021). "A radish in a tutu walking a dog? This AI can draw it really well". CNN. Retrieved 2 March 2021.
- ^ a b Stokel-Walker, Chris (5 January 2021). "AI illustrator draws imaginative pictures to go with text captions". New Scientist. Retrieved 4 March 2021.
- ^ a b c Wakefield, Jane (6 January 2021). "AI draws dog-walking baby radish in a tutu". British Broadcasting Corporation. Retrieved 3 March 2021.
- ^ Markowitz, Dale (10 January 2021). "Here's how OpenAI's magical DALL-E image generator works". TheNextWeb. Retrieved 2 March 2021.
- ^ a b Macaulay, Thomas (6 January 2021). "Say hello to OpenAI's DALL-E, a GPT-3-powered bot that creates weird images from text". TheNextWeb. Retrieved 2 March 2021.
- ^ a b c Whitwam, Ryan (6 January 2021). "OpenAI's 'DALL-E' Generates Images From Text Descriptions". ExtremeTech. Retrieved 2 March 2021.
- ^ Toews, Rob (18 January 2021). "AI And Creativity: Why OpenAI's Latest Model Matters". Forbes. Retrieved 2 March 2021.