face:LIFT: Using Text Descriptions to Create Photo-Realistic Faces

face:LIFT, a collaborative effort between UM and Threls Ltd., is a project exploring the generation and editing of face images from textual descriptions. THINK speaks with two of its team members, Prof. Adrian Muscat and Dr Marc Tanti, about the process and aims of this project.

face:LIFT is a research project focused on generating photo-realistic face images from textual descriptions. Following an EU-COST action, a research group named RiVaL (Research in Vision and Language) was established, initiated by Affiliate Senior Lecturer in Artificial Intelligence Mr Mike Rosner. ‘We started working on generating descriptions of faces, and at one point, some of us thought of considering the reverse direction, meaning generating images of faces from textual descriptions. This is how the face:LIFT project came about,’ explain Prof. Adrian Muscat (Department of Communications and Computer Engineering) and Dr Marc Tanti (Institute of Linguistics and Language Technology).

Behind the Code

face:LIFT uses Generative Adversarial Networks (GANs) to translate text into images. A GAN is a type of Artificial Intelligence (AI) model built from two competing neural networks, a generator and a discriminator. The discriminator learns to distinguish real images from the fakes produced by the generator, while the generator learns to produce images that the discriminator can no longer tell apart from the real ones. Through this competition, the two networks push each other to improve until the generator produces convincing synthetic images.
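The adversarial training loop can be illustrated with a short, self-contained sketch. The snippet below uses PyTorch with deliberately tiny fully connected networks and a random batch standing in for real photographs; it only demonstrates the generator-versus-discriminator idea described above and is not the face:LIFT code.

```python
# Minimal GAN training-loop sketch (illustrative only, not face:LIFT code).
import torch
import torch.nn as nn

latent_dim, img_dim = 100, 64 * 64  # assumed sizes for the toy example

# Generator: random numbers -> flattened fake image in [-1, 1].
generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, img_dim), nn.Tanh(),
)
# Discriminator: flattened image -> probability that it is real.
discriminator = nn.Sequential(
    nn.Linear(img_dim, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCELoss()

real_images = torch.rand(32, img_dim) * 2 - 1  # stand-in for a batch of real photos
real_labels = torch.ones(32, 1)
fake_labels = torch.zeros(32, 1)

for step in range(100):
    # Discriminator step: learn to separate real images from generated fakes.
    z = torch.randn(32, latent_dim)
    fake_images = generator(z)
    d_loss = (bce(discriminator(real_images), real_labels)
              + bce(discriminator(fake_images.detach()), fake_labels))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: learn to produce fakes the discriminator scores as real.
    g_loss = bce(discriminator(fake_images), real_labels)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```

In a real system, both networks would be far larger convolutional models trained on many thousands of images.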

The face:LIFT team compiled their own dataset of stock images with custom-written textual descriptions. (Image: An artistic collage of faces)

The initial stages of this project required the collection of a dataset of images and textual descriptions. Since such a dataset did not exist, the team collected their own by combining stock images with custom-written textual descriptions, totalling 13,000 examples. 

From there, two avenues were explored. The first approach involved experimentally exploring the factors influencing facial features by training a neural network directly on the dataset: feed it a description, generate an image, compare the resulting image with the corresponding image in the dataset, and then update the network to output a more similar image. However, this proved notably challenging because of the relatively small size of the team’s dataset.
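As a rough, hypothetical illustration of this first avenue, the sketch below trains a small text-to-image network on (description, photo) pairs by penalising the difference between each generated image and the real one. The network sizes, the use of pre-computed description embeddings, and the L1 loss are assumptions made for the example, not details of the team’s model.

```python
# Hypothetical supervised text-to-image training sketch (not face:LIFT's model).
import torch
import torch.nn as nn

text_dim, img_dim = 300, 64 * 64  # assumed embedding and image sizes

# Toy network mapping an embedded description to a flattened image in [0, 1].
text_to_image = nn.Sequential(
    nn.Linear(text_dim, 512), nn.ReLU(),
    nn.Linear(512, img_dim), nn.Sigmoid(),
)
optimizer = torch.optim.Adam(text_to_image.parameters(), lr=1e-4)
reconstruction_loss = nn.L1Loss()

# Stand-ins for one batch of embedded descriptions and the matching photos.
description_embeddings = torch.randn(16, text_dim)
target_images = torch.rand(16, img_dim)

for epoch in range(10):
    generated = text_to_image(description_embeddings)      # description -> image
    loss = reconstruction_loss(generated, target_images)   # compare with dataset photo
    optimizer.zero_grad()
    loss.backward()                                         # push towards a closer match
    optimizer.step()
```

With only around 13,000 examples, training such a network from scratch proved too demanding, which is why the team turned to existing pretrained models instead.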

Instead, the team opted for an alternative method. An existing neural network that generates realistic images of faces was combined with another existing neural network that measures the similarity between images and descriptions. For this combination, the team leveraged NVIDIA’s StyleGAN model and OpenAI’s CLIP. By measuring how well a generated image matched the description, they could adjust the generation process to improve the similarity.

StyleGAN works by turning a set of random numbers into a realistic image. As Muscat and Tanti note, ‘by repeatedly changing the random numbers slightly, such that CLIP’s matching score between the resulting image and given description is increased, we can find a set of numbers that give a desirable image. The trick is to find a way to optimise the random numbers so as to influence the direction of generation.’

In essence, the team wrote a program that takes a user’s description and tweaks the random values fed into the AI. Because these values represent the different features of the face, tweaking them improves how well the generated face matches the description. The advantage of this method is that CLIP is freely available to everyone, giving the project the resources it needed, unlike commercial models such as OpenAI’s DALL·E, which generate images from descriptions directly but are not free.
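A minimal sketch of this latent-optimisation idea is shown below. It assumes OpenAI’s open-source clip package, and a tiny placeholder network stands in for the pretrained face generator (loading an actual StyleGAN checkpoint is omitted); the learning rate, step count, and image handling are illustrative guesses rather than the project’s settings.

```python
# Sketch: nudge a latent vector so CLIP rates the generated face as a
# better match for the description (illustrative only, not face:LIFT code).
import torch
import torch.nn.functional as F
import clip  # OpenAI's open-source CLIP package

device = "cpu"  # kept on CPU to avoid fp16 handling in this sketch
clip_model, _ = clip.load("ViT-B/32", device=device)

# Placeholder for a pretrained face generator such as StyleGAN:
# it maps a 512-dimensional latent vector to an RGB image in [0, 1].
generator = torch.nn.Sequential(
    torch.nn.Linear(512, 3 * 64 * 64),
    torch.nn.Sigmoid(),
    torch.nn.Unflatten(1, (3, 64, 64)),
).to(device)

description = "an attractive Mediterranean woman looking happy"
with torch.no_grad():  # the text embedding stays fixed during optimisation
    text_features = clip_model.encode_text(clip.tokenize([description]).to(device))
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# The "random numbers" being optimised: one latent vector for one face.
latent = torch.randn(1, 512, device=device, requires_grad=True)
optimizer = torch.optim.Adam([latent], lr=0.05)

for step in range(200):
    image = generator(latent)                                   # latent -> face image
    # CLIP expects 224x224 inputs; its image normalisation is omitted for brevity.
    image_224 = F.interpolate(image, size=(224, 224), mode="bilinear")
    image_features = clip_model.encode_image(image_224)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    similarity = (image_features * text_features).sum()         # CLIP matching score
    loss = -similarity                                           # maximise the match
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                             # nudge the latent slightly
```

The key point is that only the latent vector is updated; the two pretrained networks stay frozen and simply provide the image and the matching score.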

Training their own GANs proved difficult because it required a lot of data, took a long time, and needed very powerful computers. Moreover, figuring out the best settings for training the models was a complex and time-consuming process.

The project was developed by UM in partnership with Threls Ltd., a software development company based in Gozo. Threls Ltd. developed the project’s application, FaceComposer, while the AI backend was developed at UM.

These people don’t exist. They were generated from the FaceComposer app using the description ‘An attractive Mediterranean woman/man looking happy’.

Editing with Text

The novelty of this project lies not only in generating images from textual descriptions but also in the ability to edit those images after they have been generated. ‘In addition to generating a face from a textual description, in face:LIFT, we also opted to go one step further and try to not only generate a face from a textual description but also to edit that initially generated face with another textual description,’ explains Muscat. Until recently, with the arrival of tools such as ChatGPT 4, other applications focused only on generating images from text.

The team found this particularly challenging since past efforts focused on modifying real images as opposed to fake-generated ones. ‘Since the initial image was already fake, getting an edited image based on a textual description with higher resolution and minimal object artefacts proved to be a big challenge. Here is where there is room for further improvement from the work we did up to now,’ notes Muscat.

In the early stages of this project, research on visual GANs for generating face images was just beginning, with minimal exploration into conditioning GANs on text. Further to this, face:LIFT was introduced as a model for generating and editing face images using free text, which was considered innovative and absent in any commercial product. The application also allows for modifications even after the initial image has been generated.

Face-lifting Aims

Despite its playful nod to forensic sketch artists, the application was initially developed purely for entertainment purposes and personal use. ‘Currently, typical end users would be professionals working in advertising and entertainment,’ says Muscat. However, as he goes on to explain, ‘with improved models, such systems should also be useful in forensics and education.’ That said, the application does offer an advantage for forensic image creation.

The forensic sketch artist’s process typically entails a witness describing the face of a suspect, after which the artist iteratively refines a sketch based on the witness’s description and feedback. Here, face:LIFT would provide a realistic-looking photograph rather than a sketch, which a witness may find easier to relate to, potentially helping to catch more criminals. Some sketch artists already use software to help create their sketches. With face:LIFT, they could also incorporate the witness’s feedback to make further edits to the application’s generated images.

Looking ahead, Muscat and Tanti expressed their vision for enhancing the application’s speed and refining its editing capabilities. They also aim to improve automatic evaluation metrics for further accuracy.

Thus, while initially designed for entertainment purposes, face:LIFT’s potential reaches far beyond. Advances in the underlying models could significantly impact fields like forensics and education. As the project continues to evolve, its contributions to various sectors promise to be both innovative and impactful. The application FaceComposer is currently available for personal entertainment purposes and can be freely accessed here.

The idea for the face:LIFT project was originally proposed by Associate Professor at the Institute of Linguistics and Language Technology Albert Gatt and Associate Professor in Communications and Computer Engineering Reuben Farrugia, whose roles were taken over by other members during the project’s initial stages. The research support officers who contributed to this project are Mohammed Abbas, Aaron Abela, and Asma Fejjari.

The face:LIFT project is funded by the Malta Council for Science and Technology’s MCST-FUSION R&I Grant.
