International Journal of Research Publication and Reviews, Vol 4, no 4, pp 4932-4937 April 2023
International Journal of Research Publication and Reviews
Journal homepage: www.ijrpr.com ISSN 2582-7421
Text-To-Image Generation Using AI
Pavithra V¹, Rosy S¹, Srinishanthini R B¹, Prinslin L²
¹Computer Science and Engineering, Agni College of Technology
²Assistant Professor, Computer Science and Engineering Department, Agni College of Technology
DOI: https://doi.org/10.55248/gengpi.234.4.38568
ABSTRACT
This paper proposes a model that generates the pictures described in a given text, turning imagined scenes into realistic images; DALL-E is used for the conversion. Producing artistic, realism-based graphics from a description is an enjoyable process. The Android application for this project was built with Kotlin and Java. A natural-language description is used as the prompt, and the model generates visuals from that prompt, which helps incorporate our many ideas and thoughts into a diagrammatic presentation. DALL-E can be used for business endeavors such as advertising, publishing, and selling. It renders images of our choice, including anthropomorphic imagery and combinations of otherwise unconnected concepts, and it produces believable objects. With this Android application we can turn our inventive thoughts into reality. It is user-friendly software with no visual problems, and this kind of imaginative picture generator is not offered by an ordinary search engine. The application can generate an image in whatever size the user wants without affecting image quality: the supported sizes are 256×256, 512×512, and 1024×1024 pixels, and a size can be selected based on the quality of the network. The network infrastructure must therefore be in place before the application can be used.
Keywords: Diffusion Models, Energy-based Models, Visual generation.
I. INTRODUCTION
OpenAI's DALL-E is a ground-breaking artificial intelligence program that can generate high-quality photographs from textual descriptions. This
cutting-edge technology is a game changer in the world of picture production, with limitless applications in advertising, design, and even medicine.
The procedure for employing DALL-E is straightforward: the user provides a textual description, and DALL-E creates a corresponding picture. For example, if you wanted an image of a blue cat playing with a ball of yarn, you would describe it to DALL-E, and it would generate a unique image matching your parameters.
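As a minimal sketch of this procedure from the Android side (assuming access to OpenAI's hosted image-generation endpoint; the URL, request fields, and key handling below follow the publicly documented Images API and should be verified against the current documentation rather than taken as this project's exact code), a Kotlin client could send a prompt and receive the generated image's location roughly as follows:

import java.net.HttpURLConnection
import java.net.URL

// Send a text prompt to the image-generation endpoint and return the raw JSON
// response, which contains a URL (or base64 data) for the generated image.
// Endpoint and field names are assumptions based on OpenAI's documented Images API.
fun requestImage(prompt: String, apiKey: String, size: String = "512x512"): String {
    val url = URL("https://api.openai.com/v1/images/generations")
    val connection = url.openConnection() as HttpURLConnection
    connection.requestMethod = "POST"
    connection.doOutput = true
    connection.setRequestProperty("Content-Type", "application/json")
    connection.setRequestProperty("Authorization", "Bearer $apiKey")

    // Request body: the prompt, the number of images, and the desired resolution.
    // (A real client should JSON-escape the prompt instead of interpolating it directly.)
    val body = """{"prompt": "$prompt", "n": 1, "size": "$size"}"""
    connection.outputStream.use { it.write(body.toByteArray()) }

    return connection.inputStream.bufferedReader().use { it.readText() }
}

On Android, such a call would run off the main thread (for example, inside a coroutine), and the returned JSON would be parsed to obtain the image URL for display.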
DALL-E's technology is built on powerful neural networks and machine learning algorithms that allow it to analyze text and create related visuals in
seconds. The program is always learning and developing, so it will continue to create increasingly realistic and detailed photos over time.
Overall, DALL-E is a powerful tool that has the potential to change the way we think about image production and design. Because of its ability to generate high-quality pictures from simple text descriptions, it opens up new avenues for innovation and creative expression across several industries.
1.1 OBJECTIVE OF THE STUDY
Examine the accuracy and quality of the pictures produced by DALL-E using a variety of textual descriptions. Compare DALL-E's performance to that of other cutting-edge image synthesis models. Examine DALL-E's possibilities in a range of fields, including advertising, design, and the arts.
Discuss the moral ramifications of producing fake pictures that are difficult to tell apart from genuine ones.
II. LITERATURE SURVEY
[1]. "DALL-E: Creating Images from Text" by Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark
Chen, Ilya Sutskever, and OpenAI. This paper introduces DALL-E and describes its architecture and capabilities.
[2]. "Visualizing and Understanding DALL-E" by Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, and Ilya Sutskever. This paper presents an
analysis of DALL-E's performance and visualizations of its internal representations.
[3]. "Generating Images from Text using Invertible Generative Networks" by Swami Sankaranarayanan and Aravind Srinivasan. This paper
explores the use of invertible generative networks for image synthesis, including a comparison to DALL-E.
[4]. "Learning to Generate Images from Text with StyleGAN" by Vincent Dumoulin and Ethan Perez. This paper describes a modified version
of StyleGAN, a popular image synthesis model, that can be trained on textual descriptions using DALL-E as a benchmark.
[5]. "The Ethics of DALL-E and GPT-3" by Tim Hwang. This article discusses the ethical implications of DALL- E and other language and
image generation models, including issues of bias, ownership, and potential misuse.
[6]. "DALL-E 2 and the Future of AI Art" by Juliet Helmke. This article discusses the impact of DALL-E and its successor, DALL-E 2, on the
field of AI-generated art and the future of artistic expression.
III. ANALYSIS
A. EXISTING SYSTEM
At present, an imagined picture cannot readily be turned into a realistic one; image generation has so far been offered mainly through web applications. Our understanding of the world is highly compositional: we can rapidly understand new objects from their components or compose words into complex sentences to describe the world states we encounter. Existing text-conditioned diffusion models such as DALL-E 2 have recently made remarkable strides towards compositional generation and are capable of generating photorealistic images given textual descriptions. However, such systems are not fully compositional and do not always generate correct images.
B. PROPOSED SYSTEM
In the proposed system, we suggest factorizing the compositional generation problem and employing several diffusion models to capture different subsets of a compositional specification. These diffusion models are then explicitly composed together to generate an image. The application runs without lag and can be downloaded easily. Here, we use an Android application project to turn imagined descriptions into realistic pictures. Our method generates high-quality images containing all of the specified concepts and outperforms baselines by a large margin.
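As a sketch of how such an explicit composition can be carried out (following the general recipe in the compositional-diffusion literature; the notation and weights w_i below are illustrative assumptions rather than equations taken from this paper), the noise predictions of diffusion models conditioned on the individual concepts c_1, ..., c_n can be combined at every denoising step:

\[
\hat{\epsilon}(x_t, t) = \epsilon_\theta(x_t, t) + \sum_{i=1}^{n} w_i \left( \epsilon_\theta(x_t, t \mid c_i) - \epsilon_\theta(x_t, t) \right)
\]

The composed estimate \(\hat{\epsilon}(x_t, t)\) then replaces the single-model prediction in the reverse diffusion process, so that every concept in the specification contributes to the generated image.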
IV. ARCHITECTURE
Figure 4.1: Architecture and Block Diagram
Figure 4.1 shows the data flow and the working process: the process starts with the text encoder, moves to the CLIP component, finds the corresponding image representation, and then decodes it into the correct image format.
Figure 4.2: Architecture and Block Diagram
Figure 4.2: Given the input text, the encoded representation is decoded into an image at one of several resolutions, including 256×256, 512×512, and 1024×1024, depending on the format chosen.
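As a small illustrative sketch of how a client might pick among these resolutions (the bandwidth thresholds below are arbitrary assumptions, not values from this paper), the application could select the output size from the measured network quality, as the abstract suggests:

// Choose an output resolution based on estimated downlink bandwidth (in kbps).
// The thresholds are illustrative assumptions only.
fun chooseImageSize(downlinkKbps: Int): String = when {
    downlinkKbps >= 5_000 -> "1024x1024" // fast connection: highest resolution
    downlinkKbps >= 1_000 -> "512x512"   // moderate connection
    else -> "256x256"                    // slow connection: smallest image
}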
V. METHODOLOGY
1. Text Encoding
2. Image Generation
3. Contrastive Learning
4. Training Data
5. Fine Tuning
A. Text Encoding
This application's text encoding stage encodes the input text using a transformer-based language model, specifically GPT-3, a state-of-the-art language model that has been pre-trained on a large corpus of text data and can produce high-quality text output.
To encode the text, the application tokenizes the input into a sequence of sub-words, or tokens, which are then passed through the GPT-3 model. The GPT-3 model assigns each token in the input sequence a dense numerical representation that captures its semantic value based on its context.
To create an overall embedding for the whole input text sequence, these individual token embeddings are then added together using a weighted sum.
The image-generating portion of this application receives data from this text embedding, which captures the semantic meaning of the input text.
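The pooling step described above can be sketched in plain Kotlin as follows (the dimensions and weights are illustrative; in the real system the token embeddings and their weights would come from the language model rather than being supplied by hand):

// Combine per-token embeddings into a single sentence embedding via a weighted sum.
// Each entry of tokenEmbeddings is one token's dense vector; weights holds one
// importance weight per token.
fun weightedSumPool(tokenEmbeddings: List<DoubleArray>, weights: DoubleArray): DoubleArray {
    require(tokenEmbeddings.size == weights.size) { "one weight per token expected" }
    val dim = tokenEmbeddings.first().size
    val pooled = DoubleArray(dim)
    val total = weights.sum()
    for ((i, emb) in tokenEmbeddings.withIndex()) {
        for (d in 0 until dim) {
            pooled[d] += (weights[i] / total) * emb[d] // normalized weighted contribution
        }
    }
    return pooled
}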
B. Image Generation
The BigGAN generative adversarial network architecture serves as the foundation for this application's image generation. BigGAN is a modern GAN architecture capable of producing high-resolution images of up to 1024×1024 pixels.
To create an image from the encoded text representation, the application first runs the text embedding through a fully connected layer to produce a noise vector. This noise vector is then combined with the text embedding and passed through several convolutional and up-sampling layers to create a high-resolution image. During training, the application learns to create images that are both visually and semantically coherent with the supplied text. This is accomplished through adversarial training, in which the GAN's generator component is trained to deceive a discriminator component into believing that the images it generates are real.
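The adversarial objective described above can be illustrated with the standard (non-saturating) GAN losses; the following is a generic sketch of the loss arithmetic in plain Kotlin, not the application's actual BigGAN training code:

import kotlin.math.ln

// d(...) denotes the discriminator's estimated probability that its input is a
// real image, a value in (0, 1).
// Discriminator loss: push d(real) toward 1 and d(fake) toward 0.
fun discriminatorLoss(dReal: Double, dFake: Double): Double =
    -(ln(dReal) + ln(1.0 - dFake))

// Non-saturating generator loss: push d(fake) toward 1, i.e. fool the discriminator.
fun generatorLoss(dFake: Double): Double = -ln(dFake)

fun main() {
    println(discriminatorLoss(dReal = 0.9, dFake = 0.1)) // low loss: discriminator is accurate
    println(generatorLoss(dFake = 0.05))                 // high loss: fakes are easily spotted
    println(generatorLoss(dFake = 0.60))                 // lower loss: the generator is improving
}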
C. Contrastive Learning
The main idea behind contrastive learning is to learn a feature space where similar examples are brought closer together and dissimilar examples are
pushed further apart. This is typically achieved by learning a representation for each example such that the representations of similar examples are more
similar to each other than to the representations of dissimilar examples. Contrastive learning is employed in this application to guarantee that the output
image is semantically coherent with the input text. During training, the model is presented with pairs of images and text descriptions and is trained to distinguish pairs that are semantically compatible with each other from those that are not.
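A minimal sketch of such a contrastive objective is given below: an InfoNCE-style loss over paired text and image embeddings using cosine similarity, written in plain Kotlin. This is a generic illustration under assumed inputs, not the exact loss used by the application.

import kotlin.math.exp
import kotlin.math.ln
import kotlin.math.sqrt

// Cosine similarity between two embedding vectors.
fun cosine(a: DoubleArray, b: DoubleArray): Double {
    var dot = 0.0; var na = 0.0; var nb = 0.0
    for (k in a.indices) { dot += a[k] * b[k]; na += a[k] * a[k]; nb += b[k] * b[k] }
    return dot / (sqrt(na) * sqrt(nb))
}

// InfoNCE-style batch loss: each text embedding should be most similar to its own
// paired image embedding (same index) and dissimilar to the other images in the batch.
fun contrastiveLoss(textEmb: List<DoubleArray>, imageEmb: List<DoubleArray>, temperature: Double = 0.07): Double {
    var loss = 0.0
    for (i in textEmb.indices) {
        val logits = imageEmb.map { cosine(textEmb[i], it) / temperature }
        val denom = logits.sumOf { exp(it) }
        loss += -ln(exp(logits[i]) / denom) // cross-entropy with the matching pair as target
    }
    return loss / textEmb.size
}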
D. Training Data
A semi-automated workflow was used to gather the training data for this application. It comprised searching the internet for written descriptions,
eliminating any that were unimportant or of low quality, and then producing corresponding images using a combination of hand-built 2D models and
3D models. The final dataset is made up of millions of text-image pairs that depict a variety of ideas and settings.
To train the application's model, the text descriptions are first transformed into dense numerical representations using a transformer-based language model, specifically GPT-3. A generative adversarial network (GAN) is then trained on these text embeddings to produce corresponding images that are both visually and semantically coherent with the input text.
During training, the model is optimized to minimize a set of perceptual and adversarial losses, which encourages the output images to match the input text aesthetically and semantically. Additionally, the model is trained with contrastive learning, which encourages the generated images to be semantically consistent with the input text even for complicated and abstract notions that might not have been explicitly present in the training data.
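The collection workflow described above can be sketched as a simple filtering step over candidate text-image pairs; the data class, quality score, and thresholds below are hypothetical and serve only to illustrate the idea:

// Hypothetical representation of one training example: a caption, a path or URL to
// its paired image, and a quality score assigned during the semi-automated review.
data class TextImagePair(val caption: String, val imagePath: String, val qualityScore: Double)

// Keep only pairs whose captions are long enough to be informative and whose
// quality score clears an (assumed) threshold.
fun filterTrainingData(candidates: List<TextImagePair>, minQuality: Double = 0.5): List<TextImagePair> =
    candidates.filter { it.caption.split(" ").size >= 3 && it.qualityScore >= minQuality }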
E. Fine Tuning
Fine-tuning is the process of training a pre-trained model, in this case the application's model, on a fresh dataset or task. During fine-tuning, only the weights of the additional layers added for the new task are trained, while the pre-trained model's weights are frozen.
For this application, fine-tuning might be used to adapt the model to generate images for particular domains or tasks. For instance, the model could first be trained on a general dataset of text-image pairs and then fine-tuned on a smaller dataset of medical text descriptions and accompanying photos in order to generate medical images.
Fine-tuning the application's model for a new task requires a new dataset of text-image pairs tailored to that task. To guarantee that the model generalizes successfully to new examples, the text descriptions in the new dataset should be similar in style and language to those used in the original training data.
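A minimal sketch of the freezing logic described in this subsection is shown below; the Layer type and the update rule are hypothetical simplifications, since a real implementation would rely on a deep-learning framework's parameter-freezing facilities:

// Hypothetical layer: a named weight vector plus a flag marking whether it is trainable.
data class Layer(val name: String, val weights: DoubleArray, val trainable: Boolean)

// One simplified gradient-descent step that updates only the unfrozen (newly added)
// layers, leaving the pre-trained layers untouched, which is the essence of fine-tuning.
fun fineTuneStep(layers: List<Layer>, gradients: Map<String, DoubleArray>, learningRate: Double) {
    for (layer in layers) {
        if (!layer.trainable) continue // frozen pre-trained weights stay fixed
        val grad = gradients[layer.name] ?: continue
        for (i in layer.weights.indices) {
            layer.weights[i] -= learningRate * grad[i]
        }
    }
}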
VI. RESULT AND ANALYSIS
Figures A, B, and C show sample outputs generated by the application.
VII. FUTURE ENHANCEMENT
Although the photographs generated by this application frequently have a striking appearance, a higher level of image quality would be ideal. Future development could focus on new methods for producing images with improved clarity, color accuracy, and overall realism.
VIII. CONCLUSION
In conclusion, this application is a cutting-edge AI program that creates high-quality images from text descriptions by utilizing the newest developments in neural networks and machine learning. Because of its capacity to produce original and lifelike visuals in a matter of seconds, it has the potential to transform a variety of industries, from advertising to healthcare.
Even more advanced and powerful image-generation capabilities will probably emerge as this program continues to learn and develop. This could change the way we think about design and visual communication and open up new avenues for creative expression and innovation across a variety of industries. Overall, this application is an outstanding achievement in the field and a significant advance toward the creation of intelligent computers that can accurately comprehend and interpret human language.