The core of this project is a simple task: performing sentiment analysis with the IMDB dataset given here:
There are 50,000 documents in the IMDB corpus. Split these into the following ratio for analysis:
We have already done a lot with this dataset, so this assignment should naturally allow you to move beyond our notes. Below I’ve set out a number of different comparisons that you are to perform. You can think of these as individual chunks of analysis; they do not necessarily build on each other, and it is up to you to decide whether they are done in isolation or not. The key point is that you understand what you are doing and that this understanding is reflected both in well-documented code and in your report.
Your models should be designed to minimize overfitting as appropriate. In all cases, you should record your results as graphs of training and validation performance, and report the test result after training has been completed. Please select whatever loss functions or metrics you think are appropriate, based on the notes provided in class or, more widely, on what is appropriate in a text classification task.
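One lightweight way to handle the "graphs plus final scores" requirement is to work from the `history.history` dict that Keras returns from `fit`. This is only a sketch: the function names and the `accuracy` metric are assumptions, so adapt them to whichever metrics you actually choose.

```python
# Sketch: summarize and plot training/validation curves from a Keras-style
# history dict (i.e. history.history). Metric names here are assumptions.

def final_scores(history):
    """Return the last-epoch value of every recorded metric."""
    return {metric: values[-1] for metric, values in history.items()}

def plot_curves(history, metric="accuracy"):
    """Plot training vs. validation curves for one metric (needs matplotlib)."""
    import matplotlib.pyplot as plt
    epochs = range(1, len(history[metric]) + 1)
    plt.plot(epochs, history[metric], label=f"training {metric}")
    plt.plot(epochs, history[f"val_{metric}"], label=f"validation {metric}")
    plt.xlabel("epoch")
    plt.ylabel(metric)
    plt.legend()
    plt.show()

# Example with a hand-made history dict:
hist = {"accuracy": [0.70, 0.85], "val_accuracy": [0.68, 0.8]}
print(final_scores(hist))  # {'accuracy': 0.85, 'val_accuracy': 0.8}
```

Collecting curves this way makes the Training/Validation graphs consistent across all the model comparisons below.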
The first sub-assignment is to compare performance on the classification task across recurrent network variants. Specifically, compare LSTM and basic RNN models. You are free to choose your own state size for the recurrent network; however, please use the same state size for both RNN variants. Also compare a single-layer LSTM implementation to a multi-layer LSTM implementation.
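A single builder function keeps the comparison fair, since both variants then share the same embedding, state size, and output head. This is a minimal sketch, assuming a vocabulary of 10,000 tokens and a state size of 64; all three numbers are assumptions you should tune.

```python
# Sketch: comparable recurrent classifiers sharing one state size.
# VOCAB_SIZE, embedding width, and STATE_SIZE are assumptions.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE, STATE_SIZE = 10_000, 64

def build_rnn(cell="lstm", num_layers=1):
    """cell: 'lstm' or 'simple'; num_layers > 1 stacks recurrent layers."""
    Cell = layers.LSTM if cell == "lstm" else layers.SimpleRNN
    model = keras.Sequential([layers.Embedding(VOCAB_SIZE, 32)])
    for i in range(num_layers):
        # every layer except the last must emit full sequences for stacking
        model.add(Cell(STATE_SIZE, return_sequences=(i < num_layers - 1)))
    model.add(layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

basic_rnn    = build_rnn("simple")
single_lstm  = build_rnn("lstm", num_layers=1)
stacked_lstm = build_rnn("lstm", num_layers=2)
```

Training each of these with the same `fit` call then isolates the effect of the cell type and the depth.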
Distributed embeddings provide a lot of power in text classification, but there are many different embedding types that can be used. Compare classification performance using embeddings learned on the fly against any pre-trained word embedding available from TensorFlow Hub.
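The two strategies differ in whether the embedding weights start random and are learned jointly, or are downloaded pre-trained. The sketch below shows the learned-on-the-fly case; the commented-out Hub variant uses one example module handle (`nnlm-en-dim50/2`) and requires the `tensorflow_hub` package and a network connection, so treat it as illustrative.

```python
# Sketch: two embedding strategies for the same classifier head.
# Vocabulary size, embedding width, and the Hub handle are assumptions.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 10_000

# (a) Embedding learned on the fly, trained jointly with the classifier.
learned = keras.Sequential([
    layers.Embedding(VOCAB_SIZE, 64),   # weights start random, learned end-to-end
    layers.GlobalAveragePooling1D(),
    layers.Dense(1, activation="sigmoid"),
])

# (b) Pre-trained embedding from TensorFlow Hub (needs tensorflow_hub and a
# network connection; note it consumes raw strings, not integer sequences):
# import tensorflow_hub as hub
# pretrained = keras.Sequential([
#     hub.KerasLayer("https://tfhub.dev/google/nnlm-en-dim50/2",
#                    input_shape=[], dtype=tf.string, trainable=False),
#     layers.Dense(1, activation="sigmoid"),
# ])
```

Keep the classifier head identical in both cases so the comparison measures the embeddings, not the rest of the architecture.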
As mentioned in the lecture notes, CNNs are designed to model local features while LSTMs are very good at handling long-range dependencies. Investigate the use of CNNs with multiple and heterogeneous kernel sizes, both as an alternative to an LSTM solution and as an additional layer before an LSTM solution.
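"Multiple and heterogeneous kernel sizes" is commonly implemented as parallel `Conv1D` branches whose outputs are concatenated. The sketch below builds both variants asked for; the kernel sizes (3, 5, 7), filter counts, and sequence length are assumptions.

```python
# Sketch: parallel Conv1D branches with different kernel sizes, used
# (a) alone with pooling, and (b) as a layer feeding an LSTM.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE, SEQ_LEN = 10_000, 200

inputs = keras.Input(shape=(SEQ_LEN,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, 64)(inputs)

# Parallel convolutions capture n-gram features of different widths.
branches = [layers.Conv1D(64, k, padding="same", activation="relu")(x)
            for k in (3, 5, 7)]
features = layers.Concatenate()(branches)    # (batch, SEQ_LEN, 192)

# (a) CNN-only variant: pool over time and classify.
cnn_out = layers.GlobalMaxPooling1D()(features)
cnn_model = keras.Model(inputs, layers.Dense(1, activation="sigmoid")(cnn_out))

# (b) CNN-then-LSTM variant: let an LSTM read the convolved sequence.
lstm_out = layers.LSTM(64)(features)
cnn_lstm_model = keras.Model(inputs, layers.Dense(1, activation="sigmoid")(lstm_out))
```

`padding="same"` keeps every branch the same length, which is what makes the concatenation valid.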
From the various models above, save the best model. A link to the best performing model should be included in your submission. You will also be using this saved model in Part 3 below. There are many ways in which models can be saved. I’m not prescribing a specific way this is to be done. You are free to use whichever method you find most suitable. As always clearly document your design.
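Since no specific saving method is prescribed, here is one minimal option: Keras's whole-model format, which stores architecture and weights together so the model can be restored in Part 3 with a single call. The tiny stand-in model and the filename are assumptions; substitute your real best model.

```python
# Sketch: save and restore a whole Keras model. The stand-in model and
# the filename "best_imdb_model.keras" are assumptions.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Stand-in for your best model from Part 1; replace with the real one.
best_model = keras.Sequential([keras.Input(shape=(4,)), layers.Dense(1)])

best_model.save("best_imdb_model.keras")                 # architecture + weights
restored = keras.models.load_model("best_imdb_model.keras")

# The restored model behaves identically to the saved one.
x = np.zeros((1, 4), dtype="float32")
print(np.allclose(best_model.predict(x, verbose=0),
                  restored.predict(x, verbose=0)))  # True
```

Whichever method you choose, document it so the marker can reload the model without guesswork.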
One problem with libraries that provide wrappers for well-known datasets is that they can make using the dataset so easy that we do not realise what is required in the construction and use of data in Deep Learning. Relatedly, in real-world problems you will have your own data and will often want to build on pre-trained models to make use of the learning that has already been achieved with an existing model; doing this is called Transfer Learning.
Given these issues, in this part of the assignment you will collect your own dataset and use it to train a model that is based on your own existing pre-trained model constructed in Part 1.
Your first task is to construct a labelled dataset and encode it so that it can be used again for processing. We will adopt the actual IMDB movie database as the source of information, not the dataset used in Part 1.
To do this, randomly select 30 movies from the year of your birth. You can use IMDB’s title-based search functionality to do this: https://www.imdb.com/search/title/
For each movie, select at least one good and one bad review. The reviews should be in English, and I’ll let you decide what counts as a good or a bad review; for instance, a good review might be 7/10 stars or higher and a bad review 4/10 stars or lower.
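The suggested thresholds can be captured in one small labelling rule. This is just a sketch of one possible choice: reviews between the two thresholds are treated as ambiguous and skipped.

```python
# Sketch of one possible labelling rule, using the thresholds suggested
# above (7+ stars positive, 4 or fewer negative, everything else skipped).

def label_review(stars):
    """Map an IMDB star rating (1-10) to 'positive', 'negative', or None."""
    if stars >= 7:
        return "positive"
    if stars <= 4:
        return "negative"
    return None  # 5-6 stars: ambiguous, don't use the review

print(label_review(8))  # positive
print(label_review(3))  # negative
print(label_review(5))  # None
```

Whatever rule you pick, state it explicitly in your report so your labels are reproducible.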
For each review that you select, record whether it was a positive or negative review, and also capture the text of that review. You will probably get better results if you include the title of the review in the document you record for that review. How you record your reviews initially is up to you; options include an Excel file, a CSV file, or a set of individual text files. Keep in mind, though, that the raw data will have to be supplied along with your source code.
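If you go the CSV route, the standard library is enough. The sketch below prepends the review title to the text, as suggested above; the column names, filename, and example rows are assumptions.

```python
# Sketch: record reviews in a CSV with the standard library. Filename,
# column names, and the example rows are assumptions.
import csv

reviews = [
    # (label, title, text) -- prepending the title to the text often helps
    ("positive", "A triumph", "Loved every minute of it."),
    ("negative", "A mess", "Two hours I will never get back."),
]

with open("my_reviews.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["label", "text"])
    for label, title, text in reviews:
        writer.writerow([label, f"{title}. {text}"])

# Reading it back for training:
with open("my_reviews.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
print(rows[0]["label"])  # positive
```

`newline=""` and an explicit `utf-8` encoding avoid the two most common CSV round-trip problems on Windows.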
Using the best performing model from Part 1, load the data, split 70/30 between training and validation (no testing data, as you don’t have enough). Then build a model for this new, novel data that is based on the previously trained model from Part 1. This means you need to fine-tune the existing model on your new data and test its performance; it does not mean that you add your data to the original IMDB data. The fine-tuned model should start from the model that you saved in Part 1.
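The 70/30 split itself is a few lines of pure Python; the fine-tuning step then proceeds from `keras.models.load_model` on your saved Part 1 model. In this sketch the fixed seed is an assumption, there only so the split is reproducible between runs.

```python
# Sketch: shuffle and split the new dataset 70/30. The seed is an
# assumption, fixed only so the split is reproducible.
import random

def train_val_split(examples, train_frac=0.7, seed=42):
    examples = list(examples)
    random.Random(seed).shuffle(examples)  # shuffle a copy, deterministically
    cut = int(len(examples) * train_frac)
    return examples[:cut], examples[cut:]

data = [(f"review {i}", i % 2) for i in range(60)]  # 60 toy (text, label) pairs
train, val = train_val_split(data)
print(len(train), len(val))  # 42 18
```

Shuffling before splitting matters here: if your raw file lists all positive reviews first, an unshuffled split would put almost no negatives in validation.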
Report Training and Validation scores for this fine-tuned model. Save this model. It needs to be supplied along with the original model it is based on as part of your assignment.
Finally, build a “from scratch” model for your novel data that uses the exact same architecture as your best performing model from Part 1. Compare the performance of this “from scratch” model to your fine-tuned pre-trained model.
Practical language processing tasks aren’t just about classification. In this part of the assignment you will put your skills with RNNs and related technologies to work to generate some original text and benchmark your model against a more classical implementation.
For this work, make use of the IMDB dataset, but split the data differently. For each of the examples below, build one model with negative reviews, one model with positive reviews, and one model with all reviews included. Keep in mind that we do not need a training/validation/testing split of the data in this case.
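Building the three corpora is a simple filter over (text, label) pairs. This sketch assumes the Keras IMDB convention of label 1 for positive and 0 for negative.

```python
# Sketch: build the three generation corpora from (text, label) pairs,
# assuming label 1 = positive and 0 = negative, as in the Keras IMDB data.

def generation_corpora(examples):
    positive = [text for text, label in examples if label == 1]
    negative = [text for text, label in examples if label == 0]
    return {"positive": positive, "negative": negative,
            "all": positive + negative}

toy = [("great film", 1), ("terrible film", 0), ("loved it", 1)]
corpora = generation_corpora(toy)
print(len(corpora["positive"]), len(corpora["negative"]), len(corpora["all"]))  # 2 1 3
```

Each of the three corpora then trains its own language model, so the positive and negative models pick up the distinct vocabulary of each class.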
Your core model should be based on the use of LSTMs, but beyond this you are free to explore whatever architecture and hyper-parameter variants you find result in the best performance on the language generation task.
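One hyper-parameter worth exploring regardless of architecture is the sampling temperature used when generating: it reshapes the model's next-token distribution between near-greedy and near-uniform. The sketch below shows temperature sampling in pure Python; the threshold values and seed in the example are assumptions for illustration.

```python
# Sketch: temperature sampling, a common way to draw the next token from
# the softmax output of an LSTM language model. Pure Python for clarity.
import math
import random

def sample_next(probs, temperature=1.0, rng=None):
    """Re-weight a probability distribution by temperature, sample an index."""
    rng = rng or random.Random()
    logits = [math.log(p + 1e-9) / temperature for p in probs]
    m = max(logits)                              # subtract max for stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    r, acc = rng.random(), 0.0
    for i, w in enumerate(weights):              # inverse-CDF sampling
        acc += w
        if r <= acc:
            return i
    return len(weights) - 1

# Low temperature -> nearly greedy; high temperature -> more diverse.
print(sample_next([0.1, 0.8, 0.1], temperature=0.2, rng=random.Random(0)))  # 1
```

Reporting generated samples at a few temperatures (e.g. 0.2, 0.7, 1.2) is a simple way to benchmark the diversity/coherence trade-off across your positive, negative, and combined models.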