Bridging the Gap in Generative A.I. for Audio Generation


By: Pranav Battini, Kaley Chung, Navaneeth Dontuboyina, Sude Ozkaya, Kedaar Rentachintala

Mentors: Mert Pilanci, Rajarshi Saha, Zachary Shah, Indu Subramanian, Fangzhao Zhang

Abstract

Generative A.I. has come a long way in recent years, popularized by OpenAI’s ChatGPT and DALL-E, with advancements such as diffusion models paving the way forward. However, while much progress has been made in natural language processing and image generation, there is still much work to be done in A.I.-generated audio and video. Advancing generative A.I. for audio generation in particular would open the door to a plethora of innovation, from music generation to improving training data for autonomous cars.

Our project seeks to fill this gap in A.I.’s ability to generate audio by refining an easily modifiable model, built on the Stable Diffusion and Riffusion libraries, to produce unique media that can ultimately be used for additional model training and possible commercial use. We experimented and built our solution in Jupyter Notebooks, run either on Google Colab or locally with Anaconda, installing dependencies as needed. Our first milestone was running inference with Zachary Shah’s riff-pix2pix model to gain an understanding of the current progress in the field; via riff-pix2pix, we generated short audio clips based on our prompts. Our second milestone was to use the Jupyter Notebooks generate_splices.ipynb and splice_together.ipynb to generate longer audio than riff-pix2pix previously allowed, with the goal of producing a one-minute audio sample by seamlessly splicing together the roughly 5-second samples that Shah’s riff-cnet model can generate. Our third and final milestone was to curate an expanded dataset using technologies such as Spleeter and Librosa and to retrain our model with it, hypothetically resulting in a vast improvement in its audio generation capability.

Background

To introduce the concept of audio generation using A.I., text-to-image generation must be discussed first, as it is the basis upon which our solution is built. The most up-to-date audio generation technologies, such as Riffusion, reframe the task of generating audio from a text prompt as an image generation task: they generate spectrograms using methodology found in Stable Diffusion and then convert those spectrograms to audio.

The idea of providing a text input to produce an image output is not novel. However, the processes through which to execute such a pipeline are fascinating, with new developments from Generative Adversarial Models to Denoising Diffusion Model Pipelines leading to novel methods of text-to-image generation as research in generative A.I. matures as a whole.

Imagic and DreamBooth are both modern text-to-image models that utilize the diffusion pipeline, demonstrating their effectiveness in adjusting and altering the appearance of media to satisfy a user’s needs. However, these models have notable limitations: for optimal text-to-image generation they require extensive fine-tuning for every input image and detailed descriptions of the desired output, hindering the creativity and flexibility a text-to-image model should possess.

However, researchers at UC Berkeley (Brooks et al.) addressed these limitations by designing a model derived from Stable Diffusion, called InstructPix2Pix, that can receive a quick one-sentence instruction and generate a high-definition edited image from it. Where a user previously had to supply a detailed description to get their desired output, they can now enter a potentially vaguer instruction and still receive the desired image. Brooks et al.’s approach eliminates the need for extensive input, enhancing the user experience while expanding the creativity and uniqueness of the generated images.

On the topic of audio generation, KOCREE, a company based at the University of Illinois Urbana-Champaign, has made great progress in addressing the limitations of A.I.-generated music through the development of a platform known as Muosaic. Muosaic is a social media platform that “facilitates musical co-creation,” allowing users to cut arrangements, merge samples with different music snippets, and form innovative compositions. Through its efforts in refining human-A.I. interaction in creating original music, KOCREE’s Muosaic has the potential to revolutionize the music creation domain as a whole.

To build our solution, we use portions of Brooks et al.’s InstructPix2Pix tool to generate audio in a musical context, merging text-to-image and text-to-audio generation by storing audio data in spectrograms (more on that later). However, our model is further engineered to support the generation of media in all genres.

Methods

The data for the audio portion of this project is housed in spectrograms, two-dimensional graphs that represent the spectrum of frequencies in a signal over time (British Library 2018). By analyzing frequencies rather than purely a score or recording, researchers can identify and modify extremely small time and frequency ranges. This precision makes spectrograms optimal for our purposes, as we aim to add appropriate vocals to small snippets and splice them together into coherent, longer audio segments. Below (Figure 1) are two spectrograms, the first containing only the background audio and the second adding unique vocals to that audio. The difference is easily identifiable.

Figure 1: Side-by-Side Depiction of Audio With and Without Vocals – Sude Ozkaya
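As a rough illustration of how such a spectrogram can be computed, the minimal sketch below uses librosa; it is not the exact Riffusion or riff-pix2pix pipeline, and the file name and STFT parameters are placeholders:

import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

# Load a short clip at its native sample rate (file name is a placeholder)
y, sr = librosa.load("clip.wav", sr=None)
# Magnitude spectrogram via the short-time Fourier transform
S = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))
# Convert to decibels for display
S_db = librosa.amplitude_to_db(S, ref=np.max)

fig, ax = plt.subplots(figsize=(8, 3))
librosa.display.specshow(S_db, sr=sr, hop_length=512, x_axis="time", y_axis="hz", ax=ax)
ax.set_title("Spectrogram")
fig.savefig("clip_spectrogram.png", dpi=150)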

The riff-pix2pix model by Shah et al. uses ControlNet to isolate the background audio and prepare it for the addition of unique vocals. ControlNet, introduced by Stanford’s Zhang and Agrawala, conditions the model’s diffusion process and ensures that the output audio resembles the input audio. Riff-pix2pix needs a detailed spectrogram to guide text-conditioned diffusion; otherwise, the input image cannot be preserved in the output audio because of the noise added during the forward pass of its denoising diffusion model pipeline.

Milestone 1: Shah’s riff-pix2pix Model

Riff-pix2pix is Shah et al.’s implementation of Brooks et al.’s InstructPix2Pix tool in the context of generating audio. In the riff-pix2pix GitHub repository, Shah et al. state that the purpose of the model is to “suggest the feasibility of training an InstructPix2Pix model for the task of audio inpainting”. InstructPix2Pix, constructed by Tim Brooks and colleagues at UC Berkeley, builds on Stable Diffusion and GPT-3 for language processing and prompt-to-picture generation; fine-tuned from Stable Diffusion, it serves as a tool to easily revise images via prompts. It was from this InstructPix2Pix tool that Shah et al. built their riff-pix2pix model for audio generation. Riff-pix2pix is based on InstructPix2Pix but integrates the Riffusion library to edit (or audio-inpaint) spectrograms. Using MUSDB18, a dataset of royalty-free stemmed music, a generated edit prompt, and the Riffusion library to convert the MUSDB18 audio into pairs of spectrograms, sets of this data can be used to train InstructPix2Pix to inpaint any given audio file of approximately 5 seconds in length.

The training process for riff-pix2pix utilized GPU-accelerated computing and the MLOps platform Spell. Spell, founded by Serkan Piantino, is a platform designed to enable individuals to conduct machine learning experiments without needing advanced hardware, such as high-performance GPUs.

In brief, the training process for riff-pix2pix generated multi-modal training data using Stable Diffusion (SD) and GPT-3. At the end of setting up the training data, we had (1) paired input captions and generated instructions via GPT-3 and (2) generated pairs of images based on the captions using SD and prompt2prompt image editing. The model then took those generated example data pairs and performed edits in a forward pass, producing a new image in a matter of seconds from vague, general text instructions “as opposed to text labels, captions or descriptions of input/output images”. Essentially, riff-pix2pix is a denoising diffusion model pipeline, and as such, in the most basic sense, it was trained through the noising and denoising of images, in this case spectrograms.
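To make this concrete, below is a highly simplified, hedged sketch of what one noising-and-denoising training step looks like in PyTorch; the model, noise schedule, and tensors are placeholders, not the actual riff-pix2pix training code:

import torch
import torch.nn.functional as F

def diffusion_training_step(model, x0, text_emb, alphas_cumprod, optimizer):
    # x0: batch of clean spectrogram images; text_emb: conditioning embeddings
    # alphas_cumprod: 1-D tensor of cumulative noise-schedule products (assumed on x0's device)
    b = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,), device=x0.device)  # random timesteps
    noise = torch.randn_like(x0)                                       # Gaussian noise
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise               # forward (noising) pass
    pred_noise = model(x_t, t, text_emb)                               # model predicts the added noise
    loss = F.mse_loss(pred_noise, noise)                               # denoising objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()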

Shah’s riff-pix2pix model gave us fast, reliable audio files from vague text instructions, which is a great base for our project. However, there are few real-world applications for a model that can only generate up to 5-second snippets of audio; as such, we began devising a method for generating longer audio samples from this model, with the goal of producing at least a minute of generated audio. For this next milestone of the project, we utilized Jupyter Notebooks, run either through Google Colaboratory or Anaconda, and set up the environment by installing and importing packages and libraries as needed. Relevant libraries used throughout this project’s milestones include PyTorch, Matplotlib, PIL (Image), and OpenCV (cv2) for the model’s image processing and analysis, NumPy for basic array manipulation and reading, and Pydub to split and merge the audio files in .wav format.
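For instance, splitting and re-joining .wav files with Pydub looks roughly like the following minimal sketch (the file names and fade lengths are placeholders, not our exact processing code):

from pydub import AudioSegment

audio = AudioSegment.from_wav("sample.wav")   # placeholder file name
first_five = audio[:5000]                     # pydub slices are in milliseconds
rest = audio[5000:]
merged = first_five + rest                    # concatenation joins the segments back together
merged = merged.fade_in(50).fade_out(50)      # short fades soften the seams
merged.export("merged.wav", format="wav")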

Figure 2: Visualization of merging (Shah et al.)

Milestone 2: Generating Longer Audio

Figure 3: Visualization of splicing and merging (Shah et al.)

During our experimentation with generating snippets of audio longer than 5 seconds, we utilized Shah et al.’s riff-cnet model, a prototype of the riff-pix2pix model that we ran inference on in milestone 1. Riff-cnet had been trained on a more expansive dataset, so its generated audio samples have higher quality than those of riff-pix2pix, which was more of a proof-of-concept trained on a more limited dataset; this made it more practical to use snippets generated from riff-cnet in milestone 2 to yield the most sonically pleasing results. To collect splices from riff-cnet, we used the Jupyter Notebook “generate_splices.ipynb”. To implement “generate_splices”, we imported all the necessary libraries and set up the local environment, which provides the helper functions for image processing, image resizing, and generating samples from the riff-cnet model. Next, a pair of example samples is created from the dataset along with a generated sample from riff-cnet, and the two are compared to tune the model and produce a better result, or a clearer target image spectrogram. The last few frames of each audio sample are used as the first frames of a newly generated spectrogram, and so on. Even though these spectrograms are derived from one another, they still need to be stitched back together seamlessly with minimal choppiness. The spectrograms generated from “generate_splices” would be spliced together to create audio snippets longer than the 5-second duration cap of riff-cnet and riff-pix2pix. To undertake this task, we implemented another Jupyter Notebook, “splice_together.ipynb”, to experiment with methods for stitching riff-cnet’s generated audio samples together. This was done by iterating through the spectrograms, chopping off the last half of one spectrogram, then merging it back onto its predecessor spectrogram. We then used a variety of overlaying and fading methods to transition from one cut of a splice to the next. In the end, we found that adaptive fading worked best because it accounts for the spectral centroid difference (a measure of how “high” or “low” the audio is, as well as how loud each note is); the adaptive fade uses this to make a smooth transition from the centroid of one spectrogram to the next. We could also control the length of the audio produced by “splice_together” by adjusting the range over which the adaptive-fading function computed the number of blocks to stitch together. To get a desired duration in seconds, the upper limit of the Python range function that determines the number of blocks is obtained by dividing the duration by 5 and then doubling the result; for a one-minute sample, the upper limit is 24, since (60 / 5) * 2 = 24 blocks. With this information, we were able to generate audio samples from the riff-cnet model at any duration of our choosing.

Figure 4: Code used to generate a one-minute sample of audio (Shah et al.)
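As a hedged reconstruction of this block arithmetic (not the exact code shown in Figure 4, and with a simple linear crossfade standing in for the adaptive fade we actually used):

import numpy as np

def num_blocks(duration_seconds, clip_seconds=5):
    # e.g. a 60-second target -> (60 / 5) * 2 = 24 blocks
    return int(duration_seconds / clip_seconds) * 2

def crossfade(prev_block, next_block, overlap):
    # Linearly blend the overlapping samples of two audio blocks; the real notebook
    # adapts the fade to the spectral-centroid difference between the two blocks.
    fade_out = np.linspace(1.0, 0.0, overlap)
    fade_in = 1.0 - fade_out
    blended = prev_block[-overlap:] * fade_out + next_block[:overlap] * fade_in
    return np.concatenate([prev_block[:-overlap], blended, next_block[overlap:]])

print(num_blocks(60))  # prints 24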

Milestone 3: Curating a Dataset

While the duration limitation of our audio inpainting model was effectively negated through our efforts in milestone 2, the actual quality of the audio produced by the model left much to be desired. Since our audio-inpainting model is essentially an expansive neural network built on the latest denoising diffusion model pipelines, we hypothesized one surefire way to drastically improve the quality of the audio it can generate: feeding it a far more diverse and expansive dataset to learn from. By retraining our model on a much larger dataset, the model would in theory better understand how to create pleasing audio through its increased frame of reference. It would also better understand the similarities and differences between musical instruments, tempos, and even musical genres, allowing the retrained model to be truly capable of producing art as opposed to the mediocre audio snippets it currently produces.

To curate a novel and more expansive dataset, we first had to procure royalty-free audio files that we could safely use for research without fear of copyright infringement. One such source was Incompetech, a royalty-free music site offering a plethora of genres, from Latin to Reggae to more obscure genres like Afrostyle. In addition, we utilized MUSDB18, a dataset of 150 royalty-free songs from a variety of genres. From MUSDB18, we used portions of the train subset, a set of 100 royalty-free files intended for model training; the remaining 50 files in the test subset are intended for evaluating the outputs of generative audio models. We downloaded the audio files we procured to our local drives and eventually moved them all to one shared drive for model training. However, it was not enough to simply document each song and feed it into our dataset: while our model would then be able to generate a wider variety of music, it would not truly understand the individual components that go into making a song. To truly augment our dataset, we therefore turned to two technologies: Spleeter and Librosa. After augmenting our dataset, we then had to convert our audio files into spectrograms to make them compatible with model training.

Spleeter: Spleeter is an open-source project created by Deezer Research with the objective of creating a viable artificial intelligence that can parse audio files and separate them into their individual components, or stems. Written in Python and utilizing Google’s TensorFlow libraries, Spleeter has three main settings with which it can split audio files: 2stems, 4stems, and 5stems. When using the 2stems setting, Spleeter separates an audio file into its “vocals” and “accompaniment”. With 4stems, Spleeter increases the number of stems it generates to “vocals”, “drums”, “bass”, and “other”, the last being the rest of the accompaniment excluding the drums and bass. Spleeter’s 5stems setting further increases the number of stems generated, adding a “piano” stem to those of 4stems and allowing greater manipulation of the individual parts of an audio file when augmenting our dataset.

Since one of our objectives was to train our model to learn the differences between as many instruments as possible, and to truly understand how arrangements of different instruments create differing final products, we utilized the 5stems setting for the audio files we procured from Incompetech. However, since MUSDB18 was curated to be easily separable into 4 stems, with all of its files stored in the native stems (.mp4) format, we used 4stems when separating MUSDB18 files. In fact, Spleeter has direct support for MUSDB18 when using 4stems, with the official documentation stating that Spleeter’s performance is greatly increased when running 4stems on MUSDB18; in MUSDB18’s case, using 4stems instead of 5stems is therefore not a drawback due to lost stems but an advantage due to increased efficiency.

Spleeter is run on an audio file as such:

!spleeter separate -o output_5stem -p spleeter:5stems audio.mp3

It would be a grueling and tedious process to run this line for each of the hundreds of audio files collected for our dataset, so we devised a method to automate the process with a shell script. Below is an example of a shell script that we used to run Spleeter on the entirety of Incompetech’s “Latin” genre library:

Figure 5: An example shell script to run Spleeter – Pranav Battini

This shell script is a loop that passes each of the audio files we intended to run Spleeter on as a variable, allowing us to reuse this single piece of code for a plethora of audio files. For the code to work, we first had to specify the directory of the files to be Spleetered, SRC_DIR, and the directory in which to deposit the Spleetered files, OUT_DIR. The script then loops through all of the files in SRC_DIR, runs Spleeter on them, and writes the outputs to OUT_DIR. To run it, all one has to do is issue the command “!bash musdb_latin_pb.sh”, and the entire Spleeter pipeline of our project is fully automated.

Another way to automate this process can be done purely in Google Colaboratory without the use of shell scripts. As can be seen in Figure 6, this technique decomposes mixed audio into its distinct components: piano, drums, bass, vocals, and other. The code operates on a collection of audio files within a specified input folder, initializes the Spleeter separator for five-component separation, and processes each audio file. The filenames are transformed to lowercase and spaces are replaced with underscores for uniformity, and the output folders are named after the modified filenames. The Spleeter tool is then used to separate the audio components, saving them in their corresponding output folders. The code serves as a practical application of audio separation.

Figure 6: An example of running Spleeter in Google Colab – Sude Ozkaya
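A minimal sketch in the spirit of Figure 6 (the folder names below are placeholders, not our actual paths):

import os
from spleeter.separator import Separator

input_folder = "input_audio"      # placeholder input directory
output_folder = "output_5stem"    # placeholder output directory

separator = Separator("spleeter:5stems")   # five-component separation

for filename in os.listdir(input_folder):
    if not filename.lower().endswith((".mp3", ".wav")):
        continue
    clean_name = os.path.splitext(filename.lower().replace(" ", "_"))[0]  # normalize the filename
    out_dir = os.path.join(output_folder, clean_name)                     # output folder per file
    separator.separate_to_file(os.path.join(input_folder, filename), out_dir)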

Librosa: Librosa is a Python library for loading and manipulating audio, which can be used to mix different audio files into new combined files. In this project, we used it to mix and match the output stems produced by Spleeter within the same original audio file. As shown in the figure below, the steps to mix and match Spleetered components of an audio file into new files for the dataset began with loading the files in Librosa. We then took the sampling rates of the audio files we wanted to combine and compared them via the s variable. Next, we mixed the two signals by adding them together and normalized the combined signal by setting the maximum amplitude to 1.0. Finally, we exported the mixed audio using the write function from Soundfile, another Python audio library, and added it to the dataset.

Figure 7: Librosa implementation to create new files from Spleetered components (top panel), A portion of the dataset augmented with Librosa (bottom panel) – Navaneeth Dontuboyina
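The steps above might look roughly like the following minimal sketch (the stem file names are placeholders; Figure 7 shows our actual implementation):

import librosa
import numpy as np
import soundfile as sf

vocals, sr_vocals = librosa.load("vocals.wav", sr=None)   # placeholder stem files
drums, sr_drums = librosa.load("drums.wav", sr=None)
assert sr_vocals == sr_drums, "stems from the same source should share a sampling rate"

length = min(len(vocals), len(drums))        # align lengths before summing
mixed = vocals[:length] + drums[:length]     # mix by adding the signals
mixed = mixed / np.max(np.abs(mixed))        # normalize so the peak amplitude is 1.0

sf.write("vocals_plus_drums.wav", mixed, sr_vocals)   # export the new combination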

After utilizing Librosa to mix and match the Spleetered segments of our initial dataset, we now possess a vast array of isolated combinations on which a model can train to learn and emulate what different musical components sound like across different genres.

Cleaning the Data and Documentation: However, we must then clean this dataset, because not all audio files have all 5 output stems; some may lack drums or a piano component, for example. In those cases, Librosa will produce mixes containing components that are not actually present in the audio, translating to faulty results from the model’s predictions down the line when training. To handle this scenario, we used a dictionary containing the music descriptions for each audio file from Incompetech, which list the instruments used, to iterate through the Librosa-generated files and remove or rename them correctly. In addition, since we needed all of our data in .wav format to convert it into spectrograms for model training, we had to convert the files procured from Incompetech and MUSDB18 (stored as .mp3 and .mp4, respectively) into the correct .wav encoding (a brief conversion sketch follows Figure 9). With the dataset cleaned up, we then had to document our data with edit prompts so that our model could parse the dataset and train itself accordingly. To accomplish this, we made a CSV file for each prompt; each sheet contains three columns (the original file name, the Librosa edit file names, and an example edit prompt), as shown in Figure 8. As shown in Figure 9, we used a script to automate this process by iterating over the dataset and adding these details, with the edit prompts generated via ChatGPT.

Figure 8: Example of cleaned and documented data in a csv file – Pranav Battini

Figure 9: An example approach for paraphrasing edit prompts – Navaneeth Dontuboyina
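As a hedged illustration of the format-conversion step mentioned above (the directory names are placeholders, the real pipeline also handled MUSDB18’s stemmed .mp4 files, and Pydub requires FFmpeg for .mp3 decoding):

import os
from pydub import AudioSegment

src_dir = "incompetech_mp3"   # placeholder source directory
dst_dir = "dataset_wav"       # placeholder destination directory
os.makedirs(dst_dir, exist_ok=True)

for filename in os.listdir(src_dir):
    if filename.lower().endswith(".mp3"):
        audio = AudioSegment.from_mp3(os.path.join(src_dir, filename))   # decode the mp3
        wav_name = os.path.splitext(filename)[0] + ".wav"
        audio.export(os.path.join(dst_dir, wav_name), format="wav")      # re-encode as wav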

Converting the Data into Spectrograms: The final step of our dataset curation was to convert the audio files generated for the dataset into a visual form, with spectrograms as the representation of choice. The reason for this step is that our A.I. model, at its core, is an image generation model adapted for audio generation and needs to be trained using a denoising diffusion model pipeline; it slowly noises input images during training and regenerates new creations from the noised images to learn how to generate the audio. As such, we do not train the model on audio files directly, but on image files that contain the sonic information of the audio files, namely spectrograms.

The process of converting audio (.wav) files into spectrograms began with cloning Zachary Shah’s riff-pix2pix GitHub repository to gain access to the necessary libraries for conversion. From Shah’s riff-pix2pix repository, we implemented Python code that takes a .wav file, converts it to a NumPy array, and from that array creates a spectrogram containing the original file’s sonic information. Since our dataset consists of original files curated from sources such as Incompetech and edited files resulting from Spleeter and Librosa, we separated our spectrograms into different folders in our dataset directory based on whether each spectrogram came from an original or an edited file. In addition, we generated spectrograms of varying lengths for each audio file, containing 5, 10, 30, 60, 90, and so on seconds of audio, up to the duration of the file being converted. This both further augments our dataset by providing more spectrograms to feed into the model and allows the model to more readily recognize differences in audio duration when generating audio from prompts post-training.

Figure 10: Code to automate audio file to spectrogram conversion given a directory of wav files – Pranav Battini
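As a generic, hedged sketch of this kind of conversion using librosa (not the riff-pix2pix utilities shown in Figure 10; the mel parameters and directory name below are illustrative assumptions):

import os
import librosa
import numpy as np
import matplotlib.pyplot as plt

def wav_to_spectrogram_image(wav_path, out_path, duration=5.0):
    y, sr = librosa.load(wav_path, sr=None, duration=duration)       # first `duration` seconds
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=256)      # mel-scaled spectrogram
    mel_db = librosa.power_to_db(mel, ref=np.max)                     # convert power to decibels
    plt.imsave(out_path, mel_db, origin="lower", cmap="magma")        # save as an image

for wav_file in os.listdir("dataset_wav"):                            # placeholder directory
    if wav_file.endswith(".wav"):
        wav_to_spectrogram_image(os.path.join("dataset_wav", wav_file),
                                 wav_file.replace(".wav", "_5s.png"))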

Now that we had curated, augmented, and prepared our dataset, we were ready to begin model training.

Results

In our first milestone, we were able to produce 5-second audio files from Shah et al.’s riff-pix2pix model. Given an audio file serving as a background, a vocal melody could be added to this background audio, as prompted by the user. These audio inputs were first converted to spectrograms, as seen in Figure 1, which shows the original spectrogram and the edited one, respectively. The original audio image contains only the background audio, whereas in the edited image the model adds the prompted melody while preserving the original background audio. For example, the “background+vocals” panel of Figure 1 represents the audio image for the prompt “add a heavy metal scream vocal part, borderline satanic vocals”. Comparing the original image in Figure 11 (the given background audio) with the edited image spectrogram, the difference in these image pairs is clear: the model has added the prompted melody while preserving the given background audio structure. This approach worked 80% of the time “given the new input” as well as the edit instructions.

Figure 11: Original Image (left panel), Edited Image (right panel) – Sude Ozkaya

In our second milestone, the code we implemented in the “generate_splices” and “splice_together” notebooks demonstrated promising results: we were able to remove the 5-second duration cap on the model’s generative ability by successfully splicing together clearer edited audio pieces, up to a duration of our own choosing, based on text prompts as in our first milestone.

In the notebook “generate_splices.ipynb”, we used the final frames of each audio sample as the first frames of a newly generated spectrogram, looping this process until we had approximately 46 different splices of audio, converted into spectrograms, generated from the original prompt. These spectrograms were then stitched back together in the notebook “splice_together.ipynb” to create snippets longer than 5 seconds. To minimize dissonance in the transitions between splices, various techniques for smoothly stitching the spectrograms together were tested, including overlaying and fading methods. Adaptive fading was found to work best because it takes into account the spectral centroid difference, which measures the loudness and pitch of each note, resulting in smoother transitions. We could also control the duration of the audio produced by setting the upper limit of the Python range that determines the number of blocks: divide the intended duration in seconds by 5, then double the result; for a one-minute audio sample, this upper limit is 24.

The results demonstrated progress in audio generation and editing based on textual prompts. We incorporated prompted vocal melodies into given background audio files while preserving the original audio structure. These results show the potential of our approach for creating high-quality musical compositions based on text prompts.

Once we train our A.I. model with the newly augmented dataset created in milestone 3, its performance in generating prompt-based audio pieces should in theory be greatly improved. The retrained model would have an expanded understanding of the components of music and how they stack to create a cohesive audio piece, along with a greatly increased frame of reference thanks to our efforts with Spleeter and Librosa. Instead of producing lower-quality snippets built from a very limited dataset, the new model should be able to truly produce art, showcasing its generative audio capabilities and opening the door to a plethora of future applications.

Conclusions

Our audio inpainting system’s potential is vast, and while we are still in early development, its capacity to impact consumers is already apparent. By achieving our first two milestones, we have been able to generate small 5-second audio samples and splice them together into a cohesive one-minute audio sample. In and of itself, that technology is groundbreaking, as it allows anyone to generate an assortment of music snippets from any genre. In our third milestone, we curated a diverse and expanded dataset with which to retrain our model, in theory leading to a model that can produce higher-fidelity audio outputs. However, our model does not have to be limited to audio generation. With more varied data types, our technology could provide an even larger tangible impact. If, for example, an autonomous vehicle company wanted unique climate information about a certain terrain (such as how it would look when covered in snow), our model, if trained with examples of areas before and after snowfall, could produce a detailed rendering of that terrain from which researchers and developers could enhance the performance of their product. Furthermore, those developers could obtain that information through a straightforward text input, eliminating the hassle and barrier to entry that many people feel exists with A.I. technology.

We aim to collaborate with the KOCREE lab and delve deeper into the field of music audio splicing as we attempt to commercialize our tool and expand the realms in which we can make an impact. Through our second milestone, we have begun merging segments together to produce longer samples, but further down the line we aim to train the model from the top down, allowing for a wide range of output data that expands our use cases beyond music production. Our work therefore contributes to the advancement of diffusion models and projects related to pix2pix. By allowing for easy training and development of generative audio and video, we aim to make generation efficient and accessible for all, improving the quality of existing A.I. and expanding the possibilities for more models down the line.

Acknowledgments

The authors would like to thank and express their gratitude to Dr. Mert Pilanci, Rajarshi Saha, Zachary Shah, Indu Subramanian, and Fangzhao Zhang for their invaluable guidance and mentorship throughout the project. Additionally, the authors would like to thank Mason Wang for his assistance in implementing milestone 2 of this project.

References

Brooks, Tim, et al. “InstructPix2Pix: Learning to Follow Image Editing Instructions.” University of California, Berkeley, 18 Jan. 2023. Accessed 14 July 2023.

Creswell, Antonia et al. “Generative Adversarial Networks: An Overview.” IEEE Signal Processing Magazine 35 (2017): 53-65.

DreamBooth, https://dreambooth.github.io/. Accessed 14 July 2023.

Hatmaker, Taylor. “Reddit Is Buying Machine Learning Platform Spell.” TechCrunch, 17 June 2022, techcrunch.com/2022/06/16/reddit-spell-machine-learning/.

Hérault, Aurélien. “Deezer/Spleeter: Deezer Source Separation Library Including Pretrained Models.” GitHub, 3 Sept. 2021, github.com/deezer/Spleeter. Accessed 15 July 2023.

“InstructPix2Pix.” Tim Brooks, https://www.timothybrooks.com/instruct-pix2pix/. Accessed 14 July 2023.

MacLeod, Kevin. Incompetech, incompetech.com. Licensed under Creative Commons: By Attribution 3.0, http://creativecommons.org/licenses/by/3.0/.

“Kocree.” Research Park, https://researchpark.illinois.edu/tenant_directory/kocree/. Accessed 14 July 2023.

Liutkus, Antoine, and Fabian Robert Stöter. “MUSDB18.” SigSep, Creative Commons, 2019, sigsep.github.io/datasets/musdb.html. Accessed 15 July 2023.

Rafii, Zafar, et al. “MUSDB18 – A Corpus for Music Separation.” MUSDB18, 17 Dec. 2017, zenodo.org/record/1117372.

Shah, Zachary. “Zachary-Shah/Riff-Pix2pix: Train InstructPix2Pix Audio Editing Model on MUSDB18 Dataset, Fine-Tuned from Riffusion v1.1 SD Model.” GitHub, github.com/zachary-shah/riff-pix2pix. Accessed 14 July 2023.

Shah, Z., Ramachandran, N., & Wang, M. L. (2023). Audio Inpainting by Generative Decomposition. California. Accessed 14 July 2023.

“Text-to-Image Editing Evolves.” DeepLearning.AI, 17 June 2023, http://www.deeplearning.ai/the-batch/instructpix2pix-for-text-to-image-editing-explained/#:~:text=What%27s%20new%3A%20Tim%20Brooks%20and,the%20area%20that%20contained%20oranges.

Tipp, Cheryl. “Seeing sound: What is a spectrogram? – Sound and vision blog.” Blogs, 19 September 2018, https://blogs.bl.uk/sound-and-vision/2018/09/seeing-sound-what-is-a-spectrogram.html. Accessed 14 July 2023.

Yoo, Dongphil. “How to Train Pix2pix Model and Generating on the Web with Ml5.Js.” Medium, 20 Dec. 2018, medium.com/@dongphilyoo/how-to-train-pix2pix-model-and-generating-on-the-web-with-ml5-js-87d879fb4224.

Zhang, Lvmin, and Maneesh Agrawala. “[2302.05543] Adding Conditional Control to Text-to-Image Diffusion Models.” arXiv, 10 February 2023, https://arxiv.org/abs/2302.05543. Accessed 14 July 2023.

“[2210.09276] Imagic: Text-Based Real Image Editing with Diffusion Models.” arXiv, 17 October 2022, https://arxiv.org/abs/2210.09276. Accessed 14 July 2023.

 
