lunes, 3 de octubre de 2016

Implementación en TensorFlow de la red neuronal generativa (WaveNet) enfocada en la generación de texto

Enfocado en el reciente trabajo teórico de Google DeepMind (WaveNet), he desarrollado e implementado una versión en TensorFlow de dicha arquitectura, pero realizando modificaciones para enfocar el diseño en la generación automática de textos (el paper original es: https://arxiv.org/pdf/1609.03499.pdf).

El modelo ideado originalmente por DeepMind fue divulgado con detalle en este artículo: https://deepmind.com/blog/wavenet-generative-model-raw-audio/, y mi implementación del mismo (con las modificaciones necesarias para enfocarlo todo en la generación de texto), se encuentra en el siguiente repositorio de mi cuenta personal en GitHub: https://github.com/Zeta36/tensorflow-tex-wavenet. Se trata como podéis ver, de un desarrollo Python usando el framework de Google, TensorFlow.

Os dejo a continuación toda la información técnica sobre mi trabajo, aunque siento no tener tiempo para traducirlo y simplemente copiaré a continuación la introducción que hice del mismo en inglés (prometo más adelante escribir una entrada dedicada a divulgar de manera más sencilla el potencial de esta red neuronal, y el indicio que puede contener sobre el modo en que nuestro propio cerebro biológico realiza ciertas tareas relacionadas con la memorización):

A TensorFlow implementation of DeepMind's WaveNet paper for text generation.


This is a TensorFlow implementation of the WaveNet generative neural network architecture for text generation.

Previous Work

Originally, the WaveNet neural network architecture directly generates a raw audio waveform, showing excellent results in text-to-speech and general audio generation (see the DeepMind blog post and paper for details).
This network models the conditional probability to generate the next sample in the audio waveform, given all previous samples and possibly additional parameters.
After an audio preprocessing step, the input waveform is quantized to a fixed integer range. The integer amplitudes are then one-hot encoded to produce a tensor of shape (num_samples, num_channels).
A convolutional layer that only accesses the current and previous inputs then reduces the channel dimension.
The core of the network is constructed as a stack of causal dilated layers, each of which is a dilated convolution (convolution with holes), which only accesses the current and past audio samples.
The outputs of all layers are combined and extended back to the original number of channels by a series of dense postprocessing layers, followed by a softmax function to transform the outputs into a categorical distribution.
The loss function is the cross-entropy between the output for each timestep and the input at the next timestep.
In this repository, the network implementation can be found in wavenet.py.

New approach


This work is based in one implementation of the original WaveNet model (Wavenet), but applying some modifications.
In summary: we are going to use the WaveNet model as a text generator. We'll use raw text data (characters), instead of raw audio files, and once the network is trained, we'll use the conditional probability found to generate samples (characters) into an self-generating process.

Only printable ASCII characters (Dec. 0 up to 255) is supported right now.

Results


Pretty interesting results are reached!! Feeding the network with enough text and training, the model is able to memorize the probability of the characters disposition (in a lenguage), and generate later even a very similar text!!

For example, using the Penn Tree Bank (PTB) dataset, and only after 15000 steps of training (with low set of parameters setting) this was the self-generated output (the final loss was around 1.1):

"Prediction is: 300-servenns on the divide mushin attore and operations losers nis called him for investment it was with as pursicularly federal and sotheby d. reported firsts truckhe of the guarantees as paining at the available ransions i 'm new york for basicane as a facerement of its a set to the u.s. spected on install death about in the little there have a $ N million or N N bilot in closing is of a trading a congress of society or N cents for policy half feeling the does n't people of general and the crafted ended yesterday still also arjas trading an effectors that a can singaes about N bound who that mestituty was below for which unrecontimer 's have day simple d. frisons already earnings on the annual says had minority four-$ N sance for an advised in reclution by from $ N million morris selpiculations the not year break government these up why thief east down for his hobses weakness as equiped also plan amr. him loss appealle they operation after and the monthly spendings soa $ N million from cansident third-quarter loan was N pressure of new and the intended up he header because in luly of tept. N million crowd up lowers were to passed N while provision according to and canada said the 1980s defense reporters who west scheduled is a volume at broke also and national leader than N years on the sharing N million pro-m was our american piconmentalist profited himses but the measures from N in N N of social only announcistoner corp. say to average u.j. dey said he crew is vice phick-bar creating the drives will shares of customer with welm reporters involved in the continues after power good operationed retain medhay as the end consumer whitecs of the national inc. closed N million advanc"

This is really wonderful!! We can see that the original WaveNet model has a great capacity to learn and save long codified text information inside its nodes (and not only audio or image information). This "text generator" WaveNet was able to learn how to write English words and phrases just by predicting characters one by one, and sometimes was able even to learn what word to use based on context.

This output is far to be perfect, but It was trained in a only CPU machine (without GPU) using a low set of parameters configuration in just two hours!! I hope somebody with a better computer can explore the potential of this implementation.

If you want to check this results, you just have to type this in a command line terminal (this will use the trained model checkout I uploaded to the respository):
python generate.py --text_out_path=output.txt --samples 2000 
./logdir/train/2016-10-02T10-45-10/model.ckpt-14999

Requirements


TensorFlow needs to be installed before running the training script. TensorFlow 0.10 and the current master version are supported.

Training the network


You can use any text (.txt) file.

In order to train the network, execute
python train.py --data_dir=data

to train the network, where data is a directory containing .txt files. The script will recursively collect all .txt files in the directory.
You can see documentation on each of the training settings by running
python train.py --help

You can find the configuration of the model parameters in wavenet_params.json. These need to stay the same between training and generation.

Generating text


You can use the generate.py script to generate audio using a previously trained model.

Run
python generate.py --samples 16000 model.ckpt-1000
where model.ckpt-1000 needs to be a previously saved model. You can find these in the logdir. The --samples parameter specifies how many characters samples you would like to generate.
The generated waveform can be stored as a .txt file by using the --text_out_path parameter:
python generate.py --text_out_path=mytext.txt --samples 1500 model.ckpt-1000
Passing --save_every in addition to --text_out_path will save the in-progress wav file every n samples.

python generate.py --text_out_path=mytext.txt --save_every 2000 --samples 1500 model.ckpt-1000

Fast generation is enabled by default. It uses the implementation from the Fast Wavenet repository. You can follow the link for an explanation of how it works. This reduces the time needed to generate samples to a few minutes.

To disable fast generation:
python generate.py --samples 1500 model.ckpt-1000 --fast_generation=false

Missing features

Currently, there is no conditioning on extra information.

No hay comentarios:

Publicar un comentario