Mastering the Art of Deep Fakes: Crafting Realism with Open-Source Tools
Deep fakes are a form of synthetic media in which a person’s likeness is swapped with someone else’s, or in which a person’s behavior and speech in a video are altered. With the evolution of digital media, deep fakes are becoming increasingly common. This Medium article delves into the fascinating world of creating hyper-realistic deep fakes using accessible, open-source tools, all from the comfort of your computer.
The tutorial provides a comprehensive guide on how to select and use the best open-source software for generating deep fakes. It walks you through each step of the process, from finding suitable material to fine-tuning the details that bring these digital creations to life. This guide caters to various levels of technical proficiency, whether you’re a tech enthusiast, a creative professional, or simply curious about the mechanics behind these captivating illusions.
But as we embrace the power of this technology, it’s imperative to understand its implications and the necessity for responsible use. This is where ‘FakeCatcher’ enters the narrative. FakeCatcher is an innovative tool that detects deep fakes by analyzing PPG (photoplethysmography) signals—subtle changes in skin color that correspond to the heartbeat. This article will explore how FakeCatcher works as a critical countermeasure, ensuring authenticity and integrity in a world where seeing might no longer be believed.
As we refine the art of creating deep fakes, the line between reality and digital fabrication becomes increasingly blurred. Thus, the endgame of this exploration is a compelling paradox: improving deep fakes to such an extent that the only reliable way to distinguish them from reality is through sophisticated detection tools like FakeCatcher.
Join us on this fascinating journey as we navigate deep fakes’ technical, ethical, and artistic aspects and explore the frontier where technology meets creativity and authenticity.
Chapter 2: The Two Pillars of Deep Fakes — Voice Cloning and Lip Syncing
The Art of Voice Cloning
Voice cloning is the first critical component in creating a convincing deep fake. This fascinating process uses advanced algorithms to analyze and replicate a person’s voice with stunning accuracy. The technology examines aspects of the voice such as tone, pitch, cadence, and inflection, building a digital voice model that can articulate new sentences while retaining the original voice’s unique characteristics.
In this chapter, we delve into the technical nuances of voice cloning. We explore how open-source software has democratized access to this technology, making it possible for hobbyists and professionals to experiment with voice synthesis. The process typically involves training a neural network with a substantial dataset of the target voice. This training enables the software to ‘understand’ the nuances of the voice and generate new speech that sounds remarkably similar to the original.
Synchronizing Lip Movements
The second crucial aspect of creating a deep fake is lip-syncing. This process ensures that the cloned voice is perfectly synchronized with the movements of the person’s lips in the video. Achieving realistic lip-syncing is paramount; even minor discrepancies can break the illusion, making the deep fake easy to spot.
This chapter provides a step-by-step guide on using open-source tools to achieve precise lip synchronization. It covers techniques such as mapping phonemes (the distinct sound units of speech) to corresponding mouth shapes, and using machine learning models to predict those movements and align them with the audio.
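To make the first of these techniques concrete, here is a minimal, purely illustrative Python sketch of a phoneme-to-mouth-shape lookup. The phoneme symbols and viseme labels are simplified assumptions chosen for demonstration; real pipelines use far richer mappings and learned models that operate frame by frame.

# Toy phoneme-to-viseme lookup (illustrative only; not taken from any specific tool).
PHONEME_TO_VISEME = {
    "AA": "open",       # as in "father"
    "IY": "wide",       # as in "see"
    "UW": "rounded",    # as in "blue"
    "M": "closed",      # bilabial consonants close the lips
    "B": "closed",
    "P": "closed",
    "F": "lip-teeth",   # labiodental consonants
    "V": "lip-teeth",
}

def visemes_for(phonemes):
    """Map a phoneme sequence to a sequence of target mouth shapes."""
    return [PHONEME_TO_VISEME.get(p, "neutral") for p in phonemes]

# Example: the word "movie" roughly corresponds to M UW V IY.
print(visemes_for(["M", "UW", "V", "IY"]))  # ['closed', 'rounded', 'lip-teeth', 'wide']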
The ‘Hello World’ of Deep Fakes: A Case Study
To illustrate these concepts in action, we present a case study often called the ‘Hello World’ of deep fakes. This involves modifying a video of a well-known public figure, such as Barack Obama, to make him say things he never actually said, using his voice. This example is a powerful demonstration of the capabilities of deep fake technology, showing how voice cloning and lip-syncing come together to create a video that can easily be mistaken for genuine.
Through this example, we aim to highlight the potential of deep-fake technology as a tool for creativity and innovation. However, it’s also a stark reminder of the need for ethical considerations and the importance of tools like FakeCatcher in maintaining the integrity of digital media.
Chapter 3: A Technical Deep Dive into Tortoise-TTS for Voice Cloning
Unveiling the World of Tortoise-TTS
Tortoise-TTS is a beacon in the voice cloning domain, offering a powerful yet accessible platform for synthesizing human-like speech. This chapter takes a deep technical dive into Tortoise-TTS, exploring how this open-source tool harnesses the power of advanced machine-learning algorithms to replicate human voices with remarkable precision.
Understanding the Mechanics
The core of Tortoise-TTS is its sophisticated machine-learning models, which are trained on extensive datasets of human speech. These models are designed to capture the nuances of human vocal patterns, including intonation, emotion, and speech rhythm. The process begins with extracting voice features from the training data, which are then used to train the neural network.
We’ll explore the specifics of the model architecture used in Tortoise-TTS: an autoregressive Transformer that converts text into discrete speech tokens, followed by a diffusion-based decoder and a neural vocoder that turn those tokens into audio. Each of these stages plays a vital role in modeling different aspects of human speech, contributing to natural-sounding voice clones.
Training and Fine-Tuning
An essential aspect of working with Tortoise-TTS is the training process. This involves feeding the model with a large dataset of a specific voice to be cloned. The more varied and comprehensive the dataset, the more accurate the voice clone will be. We’ll delve into the best practices for collecting and preparing voice data, discussing balancing quantity and quality.
Fine-tuning the model for specific voices or accents is another critical step. This section will guide readers through the fine-tuning process, demonstrating how to adjust parameters to capture the unique characteristics of the target voice.
Practical Implementation
To put theory into practice, this chapter walks through a step-by-step tutorial on setting up and using Tortoise-TTS for voice cloning. We’ll cover the installation process, setting up the environment, loading the pre-trained models, and running the voice cloning pipeline. This hands-on guide will empower readers to experiment with their own voice cloning projects using Tortoise-TTS.
Ethical Considerations and Use Cases
While delving into the technicalities, we cannot overlook the ethical implications of voice-cloning technology. This section will discuss the responsible use of Tortoise-TTS, highlighting both the creative and potentially harmful applications of voice cloning.
We’ll also explore legitimate use cases for Tortoise-TTS, such as creating synthetic voices for people who have lost their ability to speak, generating voiceovers for content creation, and more, emphasizing the positive impact this technology can have when used ethically.
Chapter 4: Installing Tortoise-TTS: A Step-by-Step Guide
Getting Started with Installation
To begin using Tortoise-TTS, an open-source tool for voice cloning, you can install it directly using Python’s package manager pip. This is the simplest method:
pip install tortoise-tts
Alternatively, for the latest development version, install directly from the GitHub repository:
pip install git+https://github.com/neonbjb/tortoise-tts
Setting Up on a Local Machine
For a local setup, especially on Windows, it’s recommended to use a Conda environment because it handles dependency conflicts more gracefully. The process involves:
- Installing Miniconda.
- Creating a Conda environment with the specified dependencies.
- Activating the environment.
- Installing PyTorch.
- Cloning the Tortoise-TTS repository.
- Running the setup installation script.
The required commands are:
conda create --name tortoise python=3.9 numba inflect
conda activate tortoise
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
conda install transformers=4.29.2
git clone https://github.com/neonbjb/tortoise-tts.git
cd tortoise-tts
python setup.py install
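Before moving on, it can help to confirm that PyTorch can see your GPU from inside the new environment. This quick check is a sanity test of my own, not part of the official installation instructions:

# Run inside the activated 'tortoise' environment.
import torch
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())  # should print True on a CUDA-capable machine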
Docker Installation
For those preferring Docker, Tortoise-TTS provides a Dockerfile. This method offers an environment ready for text-to-speech tasks. The installation steps are:
- Cloning the repository.
- Building the Docker image.
- Running the Docker container with the necessary volume mappings.
Here are the commands:
git clone https://github.com/neonbjb/tortoise-tts.git
cd tortoise-tts
docker build . -t tts
docker run --gpus all -e TORTOISE_MODELS_DIR=/models -v /mnt/user/data/tortoise_tts/models:/models -v /mnt/user/data/tortoise_tts/results:/results -v /mnt/user/data/.cache/huggingface:/root/.cache/huggingface -v /root:/work -it tts
Installation on Apple Silicon
For macOS users with M1/M2 chips, the installation differs slightly due to the need for the nightly version of PyTorch. The steps are:
- Installing the nightly version of PyTorch.
- Setting up a Python environment.
- Installing the necessary Python packages.
- Cloning and installing Tortoise-TTS.
Commands for Apple Silicon:
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu
python3.10 -m venv .venv
source .venv/bin/activate
pip install numba inflect psutil
pip install transformers
git clone https://github.com/neonbjb/tortoise-tts.git
cd tortoise-tts
pip install .
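As a quick check on Apple Silicon, you can confirm that the nightly PyTorch build exposes the MPS backend. This verification step is my own suggestion rather than part of the upstream instructions, and in practice generation on M1/M2 still tends to be much slower than on a CUDA GPU.

# Optional sanity check inside the .venv created above.
import torch
print("MPS backend available:", torch.backends.mps.is_available())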
Chapter 5: Utilizing Tortoise-TTS: From Voice Selection to Cloning
Selecting a Voice for Cloning
The first step in using Tortoise-TTS is to select the voice you wish to clone. This involves finding a suitable audio sample of the desired voice. Several online repositories and databases offer voice samples across various languages and accents. It’s essential to choose a clear, high-quality audio sample for the best results.
Preparing the Audio File
Once you have your voice sample, it’s often necessary to convert it to a format compatible with Tortoise-TTS. This is where ffmpeg, a powerful multimedia framework, comes into play. You can use ffmpeg to convert your audio file into the required format. The conversion command typically looks like this:
ffmpeg -i input.mp3 -ar 22050 -ac 1 output.wav
This command converts an MP3 file to a WAV file, ensuring the correct sample rate and mono audio channel, which are often prerequisites for voice cloning software.
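If you have collected several clips of the target speaker, you can apply the same conversion in a loop. The sketch below is a convenience script of my own: the folder names raw_clips and my_voice are placeholders, and you should adapt them to however you organize your samples (the Tortoise-TTS documentation describes custom voices as a folder of short WAV clips).

# Convert every MP3 in ./raw_clips/ to 22.05 kHz mono WAV using ffmpeg.
# Folder names are placeholders; adjust them to your own layout.
import subprocess
from pathlib import Path

src_dir = Path("raw_clips")   # folder with your downloaded MP3 samples (assumed)
dst_dir = Path("my_voice")    # folder of converted clips for the cloned voice (assumed)
dst_dir.mkdir(exist_ok=True)

for mp3 in src_dir.glob("*.mp3"):
    wav = dst_dir / (mp3.stem + ".wav")
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(mp3), "-ar", "22050", "-ac", "1", str(wav)],
        check=True,
    )
    print("converted", mp3.name, "->", wav.name)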
Generating the Voice Clone
With the audio file ready, you can now use Tortoise-TTS to generate the voice clone. The command line interface of Tortoise-TTS offers various options and parameters to customize the voice cloning process. The basic command structure for generating a voice clone is:
python tortoise/do_tts.py --voice <voice_name> --text "Your text here"
Replace <voice_name> with the identifier of the voice model you are using, and "Your text here" with the text you want spoken in the cloned voice.
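If you prefer to stay in Python rather than on the command line, Tortoise-TTS also exposes a programmatic interface. The sketch below follows the pattern shown in the project’s README at the time of writing; the voice name my_voice and the output filename are placeholders, and the exact module layout may change between releases, so treat it as a starting point rather than a guaranteed API.

# Minimal programmatic voice cloning with Tortoise-TTS (based on the upstream README;
# module paths and function names may differ in newer releases).
import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice

tts = TextToSpeech()  # downloads and loads the pre-trained models on first run
voice_samples, conditioning_latents = load_voice("my_voice")  # folder of short WAV clips (placeholder name)

gen = tts.tts_with_preset(
    "Your text here",
    voice_samples=voice_samples,
    conditioning_latents=conditioning_latents,
    preset="fast",  # trades some quality for speed
)
torchaudio.save("cloned_output.wav", gen.squeeze(0).cpu(), 24000)  # Tortoise generates 24 kHz audio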
Using the Pre-made App on HuggingFace
For those who prefer a more user-friendly interface, there is a pre-made application available on HuggingFace Spaces under Tortoise-TTS. This web-based application provides a convenient way to use Tortoise-TTS without delving into command-line operations. It’s an excellent option for beginners or for quick demonstrations.
Chapter 6: Syncing Voice with Video Using Wav2Lip
Installation and Setup
To use Wav2Lip for lip-syncing your cloned voice to a video, start by setting up the environment:
Python Version: Ensure you have Python 3.6 installed.
FFmpeg Installation: Wav2Lip requires FFmpeg, which can be installed on Unix systems with:
sudo apt-get install ffmpeg
Repository: Download or clone the Wav2Lip repository from https://github.com/Rudrabha/Wav2Lip.
Dependency Installation: Install the required Python packages:
pip install -r requirements.txt
Face Detection Model: Download the pre-trained face detection model and place it at face_detection/detection/sfd/s3fd.pth.
Using Wav2Lip for Lip-Syncing
With Wav2Lip installed, you can synchronize the cloned voice with a video:
Prepare the Audio: Use the cloned voice audio file generated by Tortoise-TTS. Ensure it’s in a format supported by FFmpeg (e.g., .wav, .mp3).
Run the Lip-Syncing Command: Use Wav2Lip’s pre-trained models to sync the audio with the video. The basic command structure is:
python inference.py --checkpoint_path <ckpt> --face <video.mp4> --audio <audio-file>
Replace <ckpt> with the path to the Wav2Lip checkpoint, <video.mp4> with the path to the video file, and <audio-file> with the path to the cloned voice audio file.
Output: The result is saved in results/result_voice.mp4 by default, though this can be changed using additional command line arguments.
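To tie the two halves of the pipeline together, you can drive Wav2Lip from a short Python script. The checkpoint filename wav2lip_gan.pth and the input paths below are assumptions for illustration; substitute whichever checkpoint and files you actually downloaded, and run the script from the Wav2Lip repository root.

# Call Wav2Lip's inference script on the cloned audio (paths are placeholders).
import subprocess

subprocess.run(
    [
        "python", "inference.py",
        "--checkpoint_path", "checkpoints/wav2lip_gan.pth",  # assumed checkpoint location
        "--face", "source_clip.mp4",                         # the video to be re-lipped (placeholder)
        "--audio", "cloned_output.wav",                      # audio generated by Tortoise-TTS (placeholder)
    ],
    check=True,
)
# By default, the synced video is written to results/result_voice.mp4.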
Conclusion: The Ease of Creating Deepfakes and the Crucial Role of Detection Technologies
The chapters we’ve explored underscore how technological advancements have made it remarkably easy to create almost perfect deepfakes. With tools like Tortoise-TTS for voice cloning and Wav2Lip for lip synchronization, creating convincing deepfakes of anyone is becoming increasingly accessible. This ease of creation poses a significant challenge, highlighting the urgent need for effective deepfake detection mechanisms.
Deepfakes pose a growing threat in today’s world, and technologies like Intel’s FakeCatcher are essential to combat the potential harm and erosion of trust they cause. FakeCatcher is a prime example of responsible AI work, boasting a 96% accuracy rate in real-time deepfake detection. Unlike detectors that hunt for signs of inauthenticity, FakeCatcher looks for authentic human signals, specifically the subtle “blood flow” cues in the face. By utilizing Intel hardware and optimized software tools, this approach allows for near-instant differentiation between real and fake videos.
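FakeCatcher itself is proprietary, but the underlying idea of photoplethysmography can be illustrated with a toy example: tracking how the average green-channel intensity of a facial region fluctuates over time and looking for a periodic component in the plausible heart-rate band. The sketch below is a drastically simplified illustration of that concept (fixed region of interest, no face tracking or denoising), not a reimplementation of FakeCatcher, and the input filename is a placeholder.

# Toy photoplethysmography (PPG) illustration: extract the mean green-channel signal
# from a fixed patch of a video and find its dominant frequency in the heart-rate band.
import cv2
import numpy as np

cap = cv2.VideoCapture("face_video.mp4")  # placeholder input file
fps = cap.get(cv2.CAP_PROP_FPS)
signal = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    roi = frame[100:200, 100:200, 1]      # green channel of a fixed patch (no face tracking)
    signal.append(roi.mean())
cap.release()

signal = np.asarray(signal) - np.mean(signal)
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1.0 / fps)
band = (freqs > 0.7) & (freqs < 4.0)      # roughly 42-240 beats per minute
peak_hz = freqs[band][np.argmax(spectrum[band])]
print(f"Dominant pulse-band frequency: {peak_hz:.2f} Hz (~{60 * peak_hz:.0f} BPM)")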
FakeCatcher not only helps maintain the integrity of media content; it also lends itself to a range of applications. For instance, it can be used in social media moderation to vet the authenticity of user-generated content, and news outlets and non-profit organizations can use it to verify the media they publish or rely on.
Thus, as we navigate the complexities of digital authenticity, the balance between creation and detection becomes paramount, ensuring a safer and more trustworthy digital environment.