A very preliminary analysis of DALL-E 2
Gary Marcus, New York University, gary.marcus@nyu.edu
Ernest Davis, New York University, davise@cs.nyu.edu
Scott Aaronson, University of Texas at Austin, scott@scottaaronson.com
25 April 2022
Abstract: The DALL-E 2 system generates original synthetic images corresponding to an input
text as caption. We report here on the outcome of fourteen tests of this system designed to assess
its common sense, reasoning and ability to understand complex texts. All of our prompts were
intentionally much more challenging than the typical ones that have been showcased in recent
weeks. Nevertheless, for 5 out of the 14 prompts, at least one of the ten images fully satisfied
our requests. On the other hand, on no prompt did all of the ten images satisfy our requests.
The DALL-E 2 system, created by OpenAI, generates original synthetic images corresponding to
an input text as caption (Ramesh, Dhariwal, Nichol, Chu, & Chen, 2022), where the system is
referred to as "unCLIP". One of us was recently granted access to DALL-E 2. We asked fourteen
questions, designed to assess common sense, reasoning, and its capacity to understand complex
utterances. We give the full results below.
Whether results of this kind should be considered as successes for the program – what is the
proper measure to use in evaluating success – depends on the intended use of the program. If the
goal is to generate candidate images that a graphic artist will choose from, or choose from and
edit, then the system can reasonably be measured in terms of the quality of the best result out of
ten or out of one hundred. To the extent that the goal is to develop artificial intelligence that can
be trusted in safety-critical applications (Marcus & Davis, 2019), a much higher standard must
be applied.
While our investigation was only preliminary, some tentative conclusions can be drawn:
• The visual quality of the images is stunning. We were particularly impressed with the ability to
capture the top-down perspective we requested in example 7. A commercial artist might have
trouble getting DALL-E 2 to deliver the exact results that they or their client require; an amateur
looking for clip-art-like images with less strict expectations may well get something striking and
close enough to what is needed with very little effort.
• DALL-E 2 is unquestionably extremely impressive in terms of image generation. The system
succeeds in applying many diverse artistic styles to the specified subject with extraordinary
fidelity and aptness, capturing their spirit: cartoons are light-hearted, impressionist paintings
are peaceful and evocative, photographs of everyday scenes are naturalistic, noir photographs are
subtly disturbing. Images in realistic styles are almost always physically plausible (we note
exceptions in examples 9 and 12 below); images in non-realistic styles conform to the particular
norms of the style. Many of the images that have been published demonstrate DALL-E 2's
remarkable ability to create striking surrealist images, such as the half-human, half-robotic face
of Salvador Dali (after whom the program is named) included in (Ramesh, Dhariwal, Nichol,
Chu, & Chen, 2022). (Our experiments did not include any captions that would particularly lend
themselves to a surrealistic portrayal.)
• Some aspects of the system's language abilities seem to be quite reliable. If a caption specifies
only two or three objects, the system almost always shows all of them. If a caption specifies a
feature of an object, then the image will generally show that feature somewhere, though, as we
will discuss below, not necessarily on the correct object. The examples that have been published
elsewhere demonstrate that DALL-E 2 can reliably follow stylistic instructions (we did this in
only one of our experiments). In examples 7 and 14 below, the system reliably follows viewpoint
specifications, even though 7 requires a non-canonical view of the scene.
• Compositionality, in the Fregean sense of meaning derived from parts, appears to be
particularly problematic.
• Results are often incomplete (examples 2, 3, 4, 7 below).
• Relationships between entities are particularly challenging (examples 1-5, 7, 9). The
failures of DALL-E 2 in correctly associating specified properties with objects and in placing
objects in their correct relation in phrases like "a red cube on top of a blue cube" have been
discussed in (Ramesh, Dhariwal, Nichol, Chu, & Chen, 2022). Similarly, Thrush et al. (2022)
tested the visuo-linguistic compositional reasoning of current state-of-the-art vision and language
models using the Winoground benchmark. They found that none of these, including CLIP
(Radford, et al., 2021), which forms part of the basis for DALL-E 2, does significantly better than
chance (a sketch of this kind of image-caption scoring is given below, after this list).
• Anaphora and phrases like "with a similar pattern" that require connecting parts of a
sentence across a discourse may pose particular problems (example 1).
• Numbers may be poorly understood (example 6).
• Negation has been problematic for large language models (Ettinger, 2020), and our preliminary
probing suggests a similar problem here (example 12).
• Common sense. Two of our examples were designed to probe DALL-E 2's commonsense
knowledge. It succeeded in example 9, but failed in example 10. One of the images in example
12 also shows a failure of common sense reasoning.
• The content filters need improvement; in example 5, we were unable to test an item with the
phrase tug-of-war (describing the children's game), apparently because the substring "war" was
flagged as a policy violation.
• Generalization is difficult to directly assess, because no details of the training set have been
made available. What might or might not count as distribution shift is not obvious.
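To make the Winoground-style test concrete, the sketch below scores a compositional caption pair
against two images with a publicly released CLIP checkpoint via the Hugging Face transformers
library. This is not code from this paper or from Thrush et al. (2022); the checkpoint name, image
file names, and caption pair are illustrative assumptions.

```python
# A minimal sketch, not the evaluation code of this paper or of Thrush et al. (2022):
# score two captions that use the same words in different compositions against two
# images with an off-the-shelf CLIP checkpoint. The image file names are hypothetical.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

captions = ["a red cube on top of a blue cube",
            "a blue cube on top of a red cube"]
images = [Image.open("red_on_blue.png"), Image.open("blue_on_red.png")]  # hypothetical files

inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    # logits_per_image[i, j] is the similarity of image i to caption j.
    logits = model(**inputs).logits_per_image

# A Winoground-style check: each image should prefer its own caption.
correct = bool(logits[0, 0] > logits[0, 1]) and bool(logits[1, 1] > logits[1, 0])
print(logits, correct)
```

A model that composed meanings correctly would pass this check on most such pairs; Thrush et al.
report roughly chance-level performance for CLIP-style models.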
Shortly after DALL-E 2 was released, OpenAI CEO Sam Altman tweeted "AGI is gonna be
wild" (https://twitter.com/sama/status/1511735572880011272). However, it is reasonable to
question whether DALL-E 2 constitutes progress toward
solving the deep challenges of commonsense reasoning, comprehension, reliability, and so forth
that would be needed for a truly general-purpose AI (Marcus & Davis, 2019). Our results
provide a clearer picture of what remains to be done before a system like DALL-E 2 could
genuinely be said to understand what it strikingly renders in images.
Methods
DALL-E 2 was run at default settings. Ten images (the default) were generated for each caption.
Examples 5 and 12 were run twice with changes to the caption because of technical issues
discussed below.
The examples were specifically designed to probe what we conjectured might be weaknesses in
DALL-E 2. We screened three proposed captions by running them through Google image search
to see how easily images could be retrieved; as a result we rejected two proposed captions and
modified another. It may be noted that the authors have no specific knowledge of the internals,
the training set, or the abilities of DALL-E 2 beyond what is published in (Ramesh, Dhariwal,
Nichol, Chu, & Chen, 2022) and the examples that have circulated in social media.
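For concreteness, the sketch below shows how a protocol of ten images per caption could be
approximated programmatically. This is an assumption-laden illustration using OpenAI's current
Images API; it is not how the experiments reported here were run, which used the DALL-E 2 web
interface at its default settings, and the prompt list and output handling are illustrative only.

```python
# A minimal sketch of the generation protocol (ten images per caption), assuming
# access to OpenAI's Images API via the openai Python client. This is NOT how the
# experiments in this paper were run; the authors used the DALL-E 2 web interface.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Two of the paper's captions, used here only as illustrative prompts.
prompts = [
    "a red basketball with flowers on it, in front of blue one with a similar pattern",
    "A pear cut into seven pieces arranged in a ring.",
]

for i, prompt in enumerate(prompts):
    result = client.images.generate(model="dall-e-2", prompt=prompt,
                                    n=10, size="1024x1024",
                                    response_format="url")
    for j, image in enumerate(result.data):
        print(f"caption {i + 1}, image {j + 1}: {image.url}")
```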
Example 1:
Caption: a red basketball with flowers on it, in front of blue one with a similar pattern
Images:
Discussion: Most of these have omitted some aspect of the description. Image 10 is correct. (In
all these discussions, images are numbered 1-5 left to right in the top row and 6-10 left to right in
the bottom row.)
Example 2:
Caption: a red ball on top of a blue pyramid with the pyramid behind a car that is above a toaster.
Images:
Discussion: None of these are correct. The inability of DALL-E 2 to deal properly with complex
specifications is discussed in (Ramesh, Dhariwal, Nichol, Chu, & Chen, 2022).
Example 3:
Caption: Abraham Lincoln touches his toes while George Washington does chin-ups. Lincoln is
barefoot. Washington is wearing boots.
Images:
Discussion: All of the images show two men. Several feature toes and/or boots. All show the
men in rather unusual postures, suggestive of exercise, but only image 3 shows (rather strangely)
a man touching his toes, and none show a man doing chin-ups. Images 1 and 6 show Lincoln;
none of the images seems to show Washington.
In all our examples, the ten images generated for a single caption were in a single artistic style,
though that was not specified in our caption. It is particularly conspicuous in this case, because
the particular style of early nineteenth-century cartoonists such as John Doyle and George
Cruikshank used here is quite distinctive and unusual. Apparently, the associations between
captions and images that DALL-E 2 uses (the CLIP representation (Radford, et al., 2021)) tend
to associate a caption with a collection of images of one particular genre.
Example 4:
Caption: Supreme Court Justices play a baseball game with the FBI. The FBI is at bat, the
justices are on the field.
Images:
Discussion: The images mostly show people, who could well be government officials, some
carrying bats, many near recognizable government buildings. But none comes very close to
showing the specified situation. The text captions that have been inserted are all gibberish; this
failing of DALL-E 2 is noted in (Ramesh, Dhariwal, Nichol, Chu, & Chen, 2022).
Example 5:
Caption: A donkey is playing tug-of-war against an octopus. The donkey holds the rope in its
mouth. A cat is jumping over the rope.
Discussion: DALL-E 2 refused to accept this caption and gave a warning that it violated policy,
probably because it contains the word “war”. We therefore rephrased it as example 5.A.
Example 5.A:
Caption: A donkey and an octopus are playing a game. The donkey is holding a rope on one end,
the octopus is holding onto the other. The donkey holds the rope in its mouth. A cat is jumping
over the rope.
Images:
Discussion: All of the images contain all of the components of the caption: a donkey, a cat, a
rope, and a creature that is multi-legged, though none of the images show an octopus with eight
legs. None gets more than one of the stated relations right.
Example 6:
Caption: A pear cut into seven pieces arranged in a ring.
Images:
Discussion: All of the images show pieces of a pear and all but 2 and 4 show those pieces
arranged as a ring. Image 9 successfully shows seven pieces, although it seems impossible for
them to be pieces of a single pear. (By contrast, image 10 is a single pear cut into eight pieces.)
When we gave this caption to Google Images, all of the results returned showed a ring with a
pear-cut diamond, and we thought that DALL-E 2 might make the same mistake, but it avoided
that trap.
Example 7:
Caption: A tomato has been put on top of a pumpkin on a kitchen stool. There is a fork sticking
into the pumpkin. The scene is viewed from above.
Images:
Discussion: DALL-E 2 got the viewpoint specification consistently, which we had thought would
be challenging. Only images 7 and 10 show both the tomato and the pumpkin, and only images 5
and 6 show the fork stuck into the pumpkin.
Example 8:
Caption: An elephant is behind a tree. You can see the trunk on one side and the back legs on the
other.
Images:
Discussion: This example was based on an actual photograph shown to the authors by Georgia
Giokxari. All the images correctly show an elephant and a tree, and in images 3, 5, 6, 8, and 10,
the elephant is behind the tree, or at least further from the viewer. Only image 5 shows it in the
specified position.
Example 9:
Caption: In late afternoon in January in New England, a man stands in the shadow of
a maple tree.
Images:
Discussion: DALL-E 2 correctly inferred that a maple tree in January has no leaves, which we
had thought might be challenging. All the images show a man, and shadows, and, except in
image 6, the shadows seem to align properly with one another and with the implicit lighting.
However, in none is the man in the shadow of the tree, as specified. Moreover, image 3 seems
unclear as to whether what is being shown is the man or his shadow; the coloring in the image in
part reflects the man and in part reflects his background, as if he were translucent.
Example 10:
Caption: An old man is talking to his parents.
Images:
Discussion: In none of these images did DALL-E 2 successfully infer that the image should show
an old man with two even older people (which was intended as a test of commonsense reasoning).
Example 11:
Caption: A grocery store refrigerator has pint cartons of milk on the top shelf, quart cartons on
the middle shelf, and gallon plastic jugs on the bottom shelf.
Images:
Discussion: The images all show a grocery store refrigerator, and most show three shelves, but
none has the correct relative sizes.
Example 12:
Caption: A couple in formal evening wear going home get caught in a heavy downpour with no
umbrellas
Discussion: DALL-E 2 chose to generate these in a photographic style, perhaps using images of
actual people. OpenAI’s policy prohibits publishing photographic images produced by DALL-E
2 that show people who might be recognizable. All ten images show a man and woman in formal
evening wear in a downpour. In two of the images the couple are underneath an umbrella, contrary
to the caption; in the remaining eight they are not. In one image the couple seem to be indoors,
which is a failure of commonsense physical reasoning. In one the streets are wet but it is not clear
that it is still raining. However, six of the ten images unquestionably conform to the specification.
We ran the example a second time, specifying that the image should be an oil painting, in order to
avoid the problem of showing recognizable people.
Example 12.A:
Caption: An oil painting of a couple in formal evening wear going home get caught in a heavy
downpour with no umbrellas.
Images:
Discussion: Five of the ten images are exactly correct. In the other five, however, DALL-E 2 has
missed the specification of 'no umbrellas'. (We also ran a third attempt at this, with a caption
that was probably identical and certainly very similar, but inadvertently not recorded. The
outcome of that attempt was very similar in style; two of the images were correct but the other
eight erroneously showed the couple with an umbrella.)
Example 13:
Caption: Paying for a quarter-sized pizza with a pizza-sized quarter.
Images:
Discussion: This caption is taken from an example of Rips (1989). (There is a charming line
drawing of it by Maayan Harel in (Marcus & Davis, 2019).) All the images here show a pizza
and a quarter, and the pizzas are mostly quite small; though certainly not the size of a quarter,
perhaps they are a quarter of the size of a regular pizza, in terms of diameter. But none shows a
pizza-sized quarter.
Example 14:
Caption: Wild turkeys in a garden seen from inside the house through a screen door.
Images:
Discussion: The challenge here was to indicate that the turkeys were being viewed through a
screen door. DALL-E 2 succeeded in images 1 and 2 in indicating the screen or something
similar. The remaining images look as though the turkeys were being viewed through glass.
None of the images clearly indicates that the scene is being viewed through a door rather than a
window.
Acknowledgements
Thanks to OpenAI for permission to use these images and to Aditya Ramesh for helpful
clarifications. Scott Aaronson’s research is supported by a Vannevar Bush Fellowship from the
US Department of Defense, a Simons Investigator Award, and the Simons 'It from Qubit'
collaboration.
References
Ettinger, A. (2020). What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for
language models. Transactions of the Association for Computational Linguistics, 8, 34-48.
Marcus, G. F., & Davis, E. (2019). Rebooting AI: Building Artificial Intelligence We Can Trust. Pantheon
Press.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., . . . Sutskever, I. (2021). Learning
transferable visual models from natural language supervision. International Conference on
Machine Learning, (pp. 8748-8763).
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. (2022). Hierarchical Text-Conditional Image
Generation with CLIP Latents. arXiv preprint arXiv:2204.06125. Retrieved from
https://arxiv.org/abs/2204.06125
Rips, L. (1989). Similarity, typicality, and categorization. In S. Vosniadou, & A. Ortony, Similarity and
Analogical Reasoning (pp. 21-59). Cambridge: Cambridge University Press.
Thrush, T., Jiang, R., Bartolo, M., Singh, A., Williams, A., Kiela, D., & Ross, C. (2022). Winoground: Probing
Vision and Language Models for Visio-Linguistic Compositionality. arXiv preprint
arXiv:2204.03162. Retrieved from https://arxiv.org/abs/2204.03162