# Introduction to Data Science
## Lab 9: Introduction to Natural Language Processing
### About the 'Sarcasm' data set
This dataset contains about 1 million sarcastic comments from the Internet commentary website [Reddit](https://www.reddit.com/).
The dataset was generated by the scientists [Mikhail Khodak, Nikunj Saunshi and Kiran Vodrahalli](https://arxiv.org/abs/1704.05579), who scraped comments containing the `/s` (sarcasm) tag.
This tag is often used by users of Reddit to indicate that their comment is in jest and not meant to be taken seriously, and is generally a reliable indicator of sarcastic comment content.

The dataset is balanced, i.e., it contains equal parts of sarcastic and non-sarcastic comments, while the true ratio is about 1:100.

The data can be found [here](https://nlp.cs.princeton.edu/SARC/0.0/). The notebook is based on [this source](https://www.kaggle.com/kashnitsky/a4-demo-sarcasm-detection-with-logit-solution).

### Part A: Downloading and importing the data set

Before we start, we import the necessary modules.

In [None]:
import os
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from matplotlib import pyplot as plt

**Task**: Download the files `train-balanced.csv.bz2` and `train-balanced_pol.csv.bz2` from the webpage.
You don't have to unzip the files manually; this can be done using, e.g., the `read_csv` function from `pandas`.
The file `train-balanced.csv.bz2` is fairly large, as it contains about 1 million samples.
The file `train-balanced_pol.csv.bz2` contains only a subset of the data.
You should use this file (`train-balanced_pol.csv.bz2`) to set up the options for the `pd.read_csv` function correctly.

Once you're sure that everything works as expected, you can switch to the other file (`train-balanced.csv.bz2`).
Then, import the file `train-balanced.csv.bz2` as `df`.

**Note**: The names of the columns are as follows:
 
 ['label', 'comment', 'author', 'subreddit', 'score', 'ups', 'downs', 'date', 'created_utc', 'parent_comment']
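
As a minimal sketch of how `pd.read_csv` handles a compressed file without a header row, the following writes a tiny made-up `.bz2` file and reads it back. The toy rows, the tab separator and the three column names are assumptions for illustration only; inspect the real file to choose the right options.

```python
import bz2
import os
import tempfile

import pandas as pd

# Hypothetical miniature file: tab-separated, no header row (both are assumptions).
rows = "1\tYeah, great idea.\talice\n0\tThanks for the link.\tbob\n"
path = os.path.join(tempfile.mkdtemp(), "toy.csv.bz2")
with bz2.open(path, "wt", encoding="utf-8") as f:
    f.write(rows)

# By default pandas infers the compression from the ".bz2" extension,
# so no manual decompression is needed.
toy = pd.read_csv(path, sep="\t", header=None,
                  names=["label", "comment", "author"])
print(toy)
```

Passing `header=None` together with `names` tells `pandas` that the first line is data and supplies the column labels explicitly.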

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

**Task**: Use the methods `head()` and `info()` to get an overview of the data set.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

You should find out that some comments are missing.

**Task**: Delete them using the `dropna` method with appropriate options.
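
The effect of the `subset` option of `dropna` can be seen on a small made-up data frame. This is only a sketch; the values below are invented and unrelated to the real data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "label": [1, 0, 1],
    "comment": ["Sure thing.", None, "Nice one."],
    "score": [2.0, 4.0, np.nan],
})

# Drop only the rows whose 'comment' is missing (keeps rows 0 and 2).
only_comment = toy.dropna(subset=["comment"])

# Without 'subset', a missing value in any column removes the row (keeps row 0).
any_missing = toy.dropna()

print(only_comment)
print(any_missing)
```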

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert df.shape == (1010772, 10)
assert abs(df.score.mean() - 6.88600396528594) < 1e-8

**Question**: How many sarcastic comments are now in the data set? Store your answer in the variable `ans_1`.
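
Two common ways to count entries with a given value are sketched here on a made-up label `Series`; the numbers are illustrative only.

```python
import pandas as pd

labels = pd.Series([1, 0, 1, 1, 0])

# A boolean comparison yields True/False, and summing counts the True entries.
n_ones = int((labels == 1).sum())

# value_counts() tabulates how often each distinct value occurs.
counts = labels.value_counts()

print(n_ones)
print(counts)
```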

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert 'ans_1' in locals()
assert ans_1 == 505368

Although we could use all the data provided to us in `df`, we only want to use the column containing the `'comment'`s.

**Task**: Extract the `'comment'` column as variable `X` and the `'label'` column as variable `y`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert abs(y.mean() - 0.4999821918296114) < 1e-8
assert X.dtype == 'O'
assert X.shape == (1010772,)
assert isinstance(X, pd.Series)

**Task**: Next, we want to split the data set into a training and validation set.
Use the function `train_test_split` to split the data into
- the training data set `Xtrain` with labels `ytrain`
- the validation data set `Xtest` with labels `ytest`

Use the option `random_state = 1`.
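
How `train_test_split` behaves can be checked on a small example. The sketch below uses eight invented comments and the function's default split ratio (25 % of the samples go to the second part); the variable names are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

X_toy = pd.Series([f"comment {i}" for i in range(8)])
y_toy = pd.Series([0, 1] * 4)

# Features and labels are shuffled together, so their indices stay aligned.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=1)

print(len(X_tr), len(X_te))
print(X_tr.index.equals(y_tr.index))
```

Fixing `random_state` makes the shuffle reproducible, which is what allows the assert cells in this lab to compare against fixed numbers.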

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

From now on, we only work with the training data `Xtrain` and `ytrain`, and keep the validation data set for testing after training.

Let's explore whether the length of the comment might already indicate if it's sarcastic or not.

With

 Xtrain.str.len()

we get a `Series` object which contains the lengths of the comments.

Unfortunately, plotting a histogram of the lengths is not insightful at all, even if you increase the number of bins.

In [None]:
Xtrain.str.len().hist()

The problem is that most of the comments are rather short, and only a few contain more than 1000 characters.
Fortunately, applying the logarithm to the lengths helps to represent the data more clearly.

You can do this with the method `apply(np.log)`.
In general, the method `apply(some_fun)` applies the function `some_fun` to all elements in the `Series`.
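
A short sketch of this transformation on three invented strings shows how the logarithm compresses the range of the lengths:

```python
import numpy as np
import pandas as pd

comments = pd.Series(["ok", "a" * 10, "b" * 1000])

lengths = comments.str.len()          # 2, 10, 1000
log_lengths = lengths.apply(np.log)   # natural logarithm of each length

print(lengths.tolist())
print(log_lengths.round(2).tolist())
```

Note that an empty string has length 0, for which `np.log` returns `-inf`, so such entries may need separate handling.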

**Task**: Generate one figure containing two histograms (one for the sarcastic and one for the non-sarcastic comments) of the log-lengths of the comments.
Use the option `label` to name your histograms as well as `alpha = 0.5` to draw the histograms semi-transparently.
Finally call `plt.legend()` to show the legend.
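
The mechanics of overlaying two labelled, semi-transparent histograms can be sketched on synthetic numbers. The two random samples and their names are assumptions and do not represent the comment data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
group_a = rng.normal(4.0, 0.8, 500)  # synthetic values standing in for log-lengths
group_b = rng.normal(3.7, 0.9, 500)

fig, ax = plt.subplots()
# alpha=0.5 lets the second histogram show through the first one.
ax.hist(group_a, bins=30, alpha=0.5, label="group A")
ax.hist(group_b, bins=30, alpha=0.5, label="group B")
ax.set_xlabel("log-length")
ax.legend()
```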

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Wordclouds

Next, we want to find out which words occur most often in the sarcastic and non-sarcastic comments.
We can do this using a word cloud.

The following code cell does this for the sarcastic comments:

In [None]:
# Import necessary stuff
from wordcloud import WordCloud, STOPWORDS

# Set up the word cloud generator
wordcloud = WordCloud(background_color='black',
 stopwords = STOPWORDS,
 max_words = 200,
 max_font_size = 100,
 random_state = 1,
 width=600,
 height=400)

# Generate wordcloud
plt.figure(figsize=(16, 12))
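# Join all sarcastic training comments into one text; str() of a Series would
# only yield its truncated preview (first and last rows plus metadata)
wordcloud.generate(' '.join(Xtrain[ytrain == 1]))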
plt.imshow(wordcloud);
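
A word cloud is essentially a picture of word frequencies. As a rough plain-text counterpart, the following sketch counts words with the standard library; the two sentences, the tiny stop word set and the simple regular-expression tokenizer are assumptions and differ from what the `wordcloud` package does internally.

```python
import re
from collections import Counter

comments = ["Oh great, another Monday.", "Great job, really great."]
stopwords = {"oh", "another", "really"}  # hypothetical stop word list

# Lowercase each comment, split it into words and drop the stop words.
tokens = [
    word
    for comment in comments
    for word in re.findall(r"[a-z']+", comment.lower())
    if word not in stopwords
]

counts = Counter(tokens)
print(counts.most_common(3))
```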

**Task**: Generate a second word cloud for the non-sarcastic comments.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Aggregation functions

Now, we want to investigate whether sarcastic comments are more likely to occur in particular `subreddit`s.

Here, we have to use our whole data frame `df` again.
The command 

 sub_df = df.groupby('subreddit')['label'].agg([np.size, np.sum, np.mean])
 
returns a data frame which contains the size, the sum and the mean of the `label`s grouped by `subreddit`.

Since the `'label'` column marks a sarcastic comment with a `1` and a non-sarcastic one with a `0`, the mean value gives the proportion of sarcastic comments.
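
The following sketch illustrates the aggregation on a five-row toy frame. The data are invented, and the aggregations are given by their string names `'size'`, `'sum'` and `'mean'`, which yield columns with those same names.

```python
import pandas as pd

toy = pd.DataFrame({
    "subreddit": ["a", "a", "a", "b", "b"],
    "label":     [1,   1,   0,   0,   1],
})

# One row per group: number of comments, number of 1-labels, share of 1-labels.
stats = toy.groupby("subreddit")["label"].agg(["size", "sum", "mean"])
print(stats)

# Sorting and filtering work on the aggregated frame like on any other.
by_mean = stats.sort_values(by="mean", ascending=False)
large = stats[stats["size"] > 2]
print(by_mean.head(1))
print(large)
```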

**Task**: Use the `sort_values()` method together with `head()` to display the ten `subreddits` with the highest number of sarcastic comments. Store this data frame as `agg_df`.

In [None]:
sub_df = df.groupby('subreddit')['label'].agg([np.size, np.sum, np.mean])
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert agg_df.shape == (10,3)
assert agg_df['size'].sum() == 250442
assert abs(agg_df['mean'].mean() - 0.5397175526479832) < 1e-8

**Task**: Generate a data frame which contains all `subreddit`'s with more than `1000` comments (both sarcastic and non-sarcastic), and sort it by its mean values in descending order.
Store this data frame as `large_df`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert abs(large_df['sum'].mean() - 3516.7) < 1e-8
assert abs(large_df['mean'].std() - 0.04852678950867806) < 1e-8

You should observe that there are indeed a lot of `subreddit`s in which significantly more than 50 % of the comments are sarcastic.

Now, instead of grouping by the `subreddit`, we want to group by the `author` to find out whether there are some extraordinarily sarcastic `author`s.

**Task**: Similar to the generation of the data frame `sub_df`, set up a data frame `author_df` which contains, grouped by `author`:
- the number of comments,
- the number of sarcastic comments as well as
- the proportion of sarcastic comments.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert abs(author_df['size'].mean() - 3.939710009354537) < 1e-8

**Task**: 
Let's analyse only the authors with more than 200 comments and print both the 10 authors with the highest proportion of sarcastic comments and the 10 authors with the lowest proportion.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Here you should find that the data seems to be pre-selected to contain equal fractions of sarcastic and non-sarcastic comments.

### Training a logistic regression model

In order to train a logistic regression model, we have to convert our `string`-valued data into some numerical values.
One way to accomplish this task is by using a [**term frequency–inverse document frequency** (short **tfidf**) measure](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).
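
To make the measure concrete, here is a from-scratch sketch on three made-up sentences. It uses the smoothed variant idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row, which is the default documented for `TfidfVectorizer` in `scikit-learn`; the whitespace tokenization and the function name `tfidf` are simplifying assumptions.

```python
import math
from collections import Counter

def tfidf(tokenized_docs):
    """Return (vocabulary, rows) with one L2-normalized tf-idf row per document."""
    n = len(tokenized_docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    vocab = sorted(df)
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized_docs:
        tf = Counter(doc)  # raw term counts within this document
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw))
        rows.append([v / norm for v in raw])
    return vocab, rows

docs = ["the cat sat", "the dog sat", "the cat ran home"]
vocab, rows = tfidf([d.split() for d in docs])
print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

A term that occurs in every document (here `the`) receives the smallest weight within a row, while a term that is rare across documents (here `dog`) receives a larger one.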

Fortunately, this method is already part of `scikit-learn`: we can use the `TfidfVectorizer` to convert an array of strings to a sparse matrix, i.e., a matrix stored in a special format that is often used for matrices containing mostly zeros.

Let's test the vectorizer on the array `x`:

 x = np.array(['This is the first document.',
 'This document is the second document.',
 'And this is the third one.',
 'Is this the first document?'])

You can set up a standard TfidfVectorizer by setting 

 vectorizer = TfidfVectorizer()

and then calling the `fit_transform()` method, i.e.

 s = vectorizer.fit_transform(x)
 
**Task**: Execute the commands from above.
Print the variable `s`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Since `s` is a sparse matrix, you will only see the position of each stored value as a tuple `(i,j)` together with its value $s_{i,j}$.
You can print the full array using the `toarray()` method.
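
The relation between the sparse result, its dense form and the column order can be sketched on two invented sentences. The column order is read here from the fitted `vocabulary_` attribute, which maps each term to its column index; the variable names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog sat down"]

vec = TfidfVectorizer()
m = vec.fit_transform(docs)   # sparse matrix: one row per document
dense = m.toarray()           # dense NumPy array of the same shape

# Terms ordered by their column index.
columns = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

print(m.shape)
print(columns)
print(dense.round(3))
```

Entries for terms that do not occur in a document are zero, which is why the sparse format only stores the remaining values.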

**Task**: Print `s` as an array!

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

**Task**: With `print(vectorizer.get_feature_names_out())`, you can see the names belonging to the columns of the array `s` (in `scikit-learn` versions before 1.0, the method is called `get_feature_names()`).

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

**Task**: Describe in your own words: What kind of information is contained in the i-th row and j-th column of the array `s`, i.e., in $s_{i,j}$?

YOUR ANSWER HERE

Now, we want to apply the vectorizer to our training data set `(Xtrain, ytrain)` and train a logistic regression model on the transformed data.

With

 tf_idf = TfidfVectorizer(ngram_range=(1, 2), max_features=50000, min_df=2)
 
we set up a `TfidfVectorizer`.
With
 
 logit = LogisticRegression(C=1, n_jobs=4, solver='lbfgs', random_state=1, verbose=1)
 
we set up a logistic regression model with $\ell^2$-regularization (inverse regularization strength `C = 1`).

Since the pre-processing is necessary for both the training and test data, we create a full model using a so-called `Pipeline`:

 full_model = Pipeline([('tf_idf', tf_idf), ('logit', logit)])

which consists of a list of $n$ `scikit-learn` objects, whose:
- first (n-1) elements have a built-in `fit_transform` method
- last element has a built-in `fit` method.
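
How such a pipeline chains the two steps can be tried on a tiny invented corpus. This is only a sketch: the six texts and their labels are made up, and default vectorizer settings are used instead of the options given above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

texts = ["good movie", "great film", "bad movie",
         "awful film", "good great", "bad awful"]
labels = [1, 1, 0, 0, 1, 0]

toy_model = Pipeline([
    ("tf_idf", TfidfVectorizer()),
    ("logit", LogisticRegression(C=1, solver="lbfgs", random_state=1)),
])

# fit() runs fit_transform on the vectorizer, then fit on the classifier.
toy_model.fit(texts, labels)

# predict() transforms the raw strings with the fitted vectorizer first.
pred = toy_model.predict(["good film", "awful movie"])
print(pred)
```

Because the vectorizer is part of the model, new raw strings can be passed directly to `predict` without a separate transformation step.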

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

**Task**: With
 
 %%time
 full_model.fit(Xtrain,ytrain)
 
we can train the model (this can take about a minute).

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Now we can use our trained model to predict the labels on the validation data `Xtest`:

 %%time
 ypred = full_model.predict(Xtest)

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

**Task**: Determine the accuracy of the model, i.e., the fraction of correct predictions, for the validation data set.
Implement a function by yourself, or use the function `accuracy_score()`.
Store the value as `acc_score`.
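
The definition of accuracy amounts to a one-line computation, sketched here on five invented labels and predictions:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])
y_hat = np.array([1, 0, 0, 1, 1])

# Element-wise comparison gives True for correct predictions;
# the mean of these booleans is the fraction of correct predictions.
toy_acc = float(np.mean(y_true == y_hat))
print(toy_acc)
```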

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert abs(acc_score - 0.7208707799582893) < 1e-8

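**Task**: Print the confusion matrix of our model for the validation data set. With rows for the true labels and columns for the predicted labels (the convention of the `confusion_matrix` function in `scikit-learn`), it contains the numbers of

 [[True negatives, False positives],
 [False negatives, True positives]].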
 
**Hint**: Use an appropriate function.
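
The layout returned by the `confusion_matrix` function from `scikit-learn` can be checked on a small invented example. For the labels 0 and 1, rows correspond to the true labels and columns to the predicted labels.

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_hat)
# Row 0 holds the samples whose true label is 0, row 1 those with true label 1.
tn, fp, fn, tp = cm.ravel()

print(cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```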

In [None]:
# YOUR CODE HERE
raise NotImplementedError()