CDS Insights - Fake News Challenge (FNC-1)

James Chen, Max Chen, Shalin Mehta, Brandon Truong, Danny Yang

1. Introduction


Detecting fake, inaccurate, and/or misleading news poses a huge challenge in today’s society. Far too many articles and blog posts are published for human screeners to review them all. Furthermore, not everyone agrees on the definition of fake news, which makes the task even harder to approach. One way to help human screeners use their time more efficiently is to build machine learning models that flag articles which are more likely to be misleading and therefore have a higher probability of reporting “fake news”.

More specifically, the problem presented in the Fake News Challenge is to determine the relevance and stance of a news article in relation to a specific article headline. The results of this project can be used to aid human screeners by flagging articles that disagree with known trustworthy and unbiased news sources.

More details on the FNC-1 can be found at: http://fakenewschallenge.org

2. Project Overview


The overall task is to classify a news article headline and body pair into one of four categories: unrelated, agree, disagree, or discuss.

We decided to split this problem into two parts — a relevance detection task and a stance detection task — and train different models for each task, with our final prediction derived from the output of both models. Most submissions to the competition trained a single classifier for both relevance and stance, so we hoped to achieve better results by training more specialized models for each task.

2.1 Relevance Detection

The relevance detection task is a binary classification problem: decide whether a given headline and body pair are correlated or similar in content (i.e. they are about the same topic). We obtained good results using simpler models trained on features derived from bag-of-words and dependency trees. To aid in this task, we wrote a custom implementation of C4.5 decision trees to compare against CART trees, and created a tool to visualize the trained random forests.

2.2 Stance Detection

The stance detection task is more complex, involving classifying a headline and body’s relative stance as agree, disagree, or discuss. We used a more intricate model trained on a custom word embedding (derived from GloVe embeddings) combined with part-of-speech, sentiment, and negation features. The training data for this task was the subset of pairs not labeled unrelated. Our results here were not as good as our relevance detection results, but they still surpassed the baseline accuracy.

3. Dataset Overview/Exploration


3.1 Dataset

The training set includes 1683 unique article bodies and 49772 headline-article body pairings. The distribution of the labels is fairly unbalanced as shown in the table below, so to help alleviate this imbalance we decided to create two datasets, one for each task.

The relevance classification dataset includes the whole dataset (n = 49772), but the agree, disagree, and discuss labels are all transformed into a single related label. The stance classification dataset only includes the subset of headline-body pairs that are related to each other (n = 13427), resulting in a less unbalanced training set for that task.

Label       Count    Percentage   Pct. Related
agree        3678         7.3%          27.4%
discuss      8909        17.8%         66.35%
disagree      840         1.7%          6.25%
unrelated   36545        73.1%            n/a
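
The two task-specific datasets can be derived from the FNC-1 training files roughly as sketched below. This is an illustrative sketch, not our actual loading code; it assumes the standard train_stances.csv / train_bodies.csv files and their released column names, and uses pandas.

    import pandas as pd

    # FNC-1 training files (released columns: Headline, Body ID, Stance, articleBody).
    stances = pd.read_csv("train_stances.csv")
    bodies = pd.read_csv("train_bodies.csv")
    data = stances.merge(bodies, on="Body ID")

    # Relevance dataset: collapse agree/disagree/discuss into a single "related" label.
    relevance_df = data.assign(
        label=data["Stance"].where(data["Stance"] == "unrelated", "related")
    )

    # Stance dataset: keep only the related pairs with their original labels (n = 13427).
    stance_df = data[data["Stance"] != "unrelated"]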

3.2 Exploratory Visualizations

A variety of exploratory data visualizations were created from the dataset to compare word distributions between articles that agree with the headline and articles that disagree with it. We plotted the log frequency of words, compared the 25 most frequent words, and examined the distribution of word lengths. None of the plots showed significant differences between the two classes, so we concluded that there were no obvious differences in structure or word distribution between articles belonging to those classes.

[Figure: disagree and agree word clouds]

Our exploratory visualizations indicated that our feature engineering should be focused on deriving features that directly compare the headline and the article body, as opposed to coming up with features for the headlines and bodies separately.

We plotted the distribution of word lengths for headlines and bodies and confirmed that both distributions are right-skewed, which meant that truncation was necessary to deal with outliers. The median headline length was 10 words and the median body length was 293 words, while the 90th percentiles were 16 and 639 words, respectively.

 

4. Feature Engineering


4.1 Relevance Features

For relevance detection, we preprocess the texts by removing all non-alphanumeric characters, lemmatizing each token, and removing stop words and tokens of length 1. The features we use for relevance detection are mostly derived from bag-of-words counts and the heuristic that the first sentence of a news article carries extra importance. Our feature engineering for this task follows the three-step pipeline outlined below.

  1. Build an IDF table from the article bodies in the training set. During lookup, any out-of-vocabulary word is assigned the average IDF score.
  2. Preprocess each article body and cache the results in a dictionary to save time. For each body we derive features such as the most frequently occurring tokens/bigrams, parts of speech, and the most significant words/sentences based on TF-IDF score.
  3. Process each headline-body pairing and generate similarity features by processing the headline and comparing it with the cached information about the body. Since each unique body is paired with many headlines, the caching step saves a lot of time.
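
The sketch below illustrates steps 1 and 3 of this pipeline: building the IDF lookup with an average-IDF fallback for out-of-vocabulary words, and computing one of the shared-token features. The function names and the exact IDF formula are our own illustration rather than the project's precise implementation.

    import math
    from collections import Counter

    def build_idf(tokenized_bodies):
        """Step 1: IDF over the training article bodies; OOV words fall back to the average IDF."""
        n = len(tokenized_bodies)
        doc_freq = Counter(tok for body in tokenized_bodies for tok in set(body))
        idf = {tok: math.log(n / df) for tok, df in doc_freq.items()}
        avg_idf = sum(idf.values()) / len(idf)
        return lambda tok: idf.get(tok, avg_idf)

    def shared_token_count(headline_tokens, top_body_tokens):
        """Step 3 (one feature): how many of the body's top-IDF tokens appear in the headline."""
        return len(set(headline_tokens) & set(top_body_tokens))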

Feature Name(s) and Description:

shared_nouns, shared_verbs, shared_tokens
    Number of the 10 words in the article body with the highest IDF scores that are present in the headline.

shared_bigrams
    Number of the 10 most common bigrams in the article body that are present in the headline.

shared_nouns_fst, shared_verbs_fst, shared_bigrams_fst, shared_tokens_fst
    Number of shared nouns/verbs/bigrams/tokens between the headline and the first sentence of the article body.

shared_nouns_sig, shared_verbs_sig, shared_bigrams_sig, shared_tokens_sig
    Number of shared nouns/verbs/bigrams/tokens between the headline and the most significant sentence of the article body. Significance is determined by the average IDF score of the tokens in a sentence.

svo_s_fst, svo_v_fst, svo_o_fst
    Binary value that is 1 if the first sentence of the article body and the headline share the same subject/verb/object, 0 otherwise. The subject, verb, and object of each sentence are extracted using spaCy’s dependency parser.

svo_s_sig, svo_v_sig, svo_o_sig
    Binary value that is 1 if the most significant sentence of the article body and the headline share the same subject/verb/object, 0 otherwise.

cos_nouns_sig, cos_bigrams_sig, cos_tokens_sig
    Cosine similarity between the bag of nouns/bigrams/tokens for the most significant sentence of the article body and that of the headline.

cos_nouns_fst, cos_bigrams_fst, cos_tokens_fst
    Cosine similarity between the bag of nouns/bigrams/tokens for the first sentence of the article body and that of the headline.
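
A hedged sketch of the subject/verb/object extraction behind the svo_* features, using spaCy's dependency parser. The specific dependency labels and the "first match wins" heuristic are simplifications of whatever rules a real extractor would need.

    import spacy

    nlp = spacy.load("en_core_web_sm")  # any spaCy model with a dependency parser

    def extract_svo(sentence):
        """Return a (subject, verb, object) triple from the dependency tree (None if missing)."""
        doc = nlp(sentence)
        subj = next((t.lemma_ for t in doc if t.dep_ in ("nsubj", "nsubjpass")), None)
        verb = next((t.lemma_ for t in doc if t.pos_ == "VERB" and t.dep_ == "ROOT"), None)
        obj = next((t.lemma_ for t in doc if t.dep_ in ("dobj", "pobj", "attr")), None)
        return subj, verb, obj

    # e.g. svo_s_fst = 1 if the headline and the first body sentence share the same subject:
    # int(extract_svo(headline)[0] == extract_svo(first_sentence)[0])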

4.2 Stance Features

The first of our stance detection models takes a 2x100x100 input: two 100x100 matrices, one for the headline and one for the body, each holding the 100-dimensional GloVe embeddings of the text’s first 100 words, truncated or padded as needed.

Feature          Size   Description
Word embedding   100    Pretrained GloVe embedding (Wikipedia/Gigaword)
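
A minimal sketch of how one of the two 100x100 input matrices can be built, assuming the GloVe vectors are available as a plain Python dict. Zero-padding and a zero vector for out-of-vocabulary words are assumptions; the report does not specify how those cases are handled.

    import numpy as np

    def glove_matrix(tokens, glove, dim=100, max_len=100):
        """Stack GloVe vectors for the first max_len tokens; shorter texts are zero-padded."""
        mat = np.zeros((max_len, dim), dtype=np.float32)
        for i, tok in enumerate(tokens[:max_len]):
            mat[i] = glove.get(tok.lower(), np.zeros(dim))
        return mat

    # Model input for one headline-body pair, shape (2, 100, 100):
    # np.stack([glove_matrix(headline_tokens, glove), glove_matrix(body_tokens, glove)])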

Our second stance detection model takes a 70x92 input: the first 20 tokens of the headline, 10 rows of padding, and the first 40 tokens of the body. Each token is encoded as a 92-dimensional vector as outlined below.

Feature           Size   Description
Word embedding    50     Pretrained GloVe embedding (Wikipedia/Gigaword)
Part of speech    36     One-hot encoding of the part-of-speech tag
Sentiment         4      ‘pos’, ‘neg’, ‘neu’, and ‘compound’ components of the token’s sentiment, derived from VADER
Negation          1      Boolean flag set if the word is negating (negating word bank from the qdap package for R)
Doubt             1      Boolean flag set if the word is doubting (doubting word set derived from the baseline FNC-1 implementation)
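
The per-token encoding can be assembled roughly as below. This sketch assumes spaCy tokens (for the .text and Penn Treebank .tag_ attributes), NLTK's VADER analyzer, and precomputed lookups (glove50, pos_index, negating_words, doubting_words) whose names are ours, not the project's.

    import numpy as np
    from nltk.sentiment.vader import SentimentIntensityAnalyzer  # requires nltk.download("vader_lexicon")

    sia = SentimentIntensityAnalyzer()

    def encode_token(tok, glove50, pos_index, negating_words, doubting_words):
        """92-dim vector: 50 GloVe + 36 one-hot POS + 4 VADER scores + negation + doubt flags."""
        vec = np.zeros(92, dtype=np.float32)
        vec[:50] = glove50.get(tok.text.lower(), 0.0)           # zero vector for OOV words
        if tok.tag_ in pos_index:                               # 36 Penn Treebank tags
            vec[50 + pos_index[tok.tag_]] = 1.0
        scores = sia.polarity_scores(tok.text)
        vec[86:90] = [scores["pos"], scores["neg"], scores["neu"], scores["compound"]]
        vec[90] = float(tok.text.lower() in negating_words)
        vec[91] = float(tok.text.lower() in doubting_words)
        return vec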

5. Models


5.1 Relevance Detection

We chose to implement a random forest of C4.5 trees for our relevance detection model, and evaluated its performance against the baselines of sklearn’s logistic regression and CART trees. The C4.5 algorithm is advantageous in that it can split on both discrete and continuous attributes, whereas sklearn’s CART implementation only handles numeric values. C4.5 also incorporates pruning, which reduces overfitting and makes it more robust to noise. The algorithm is outlined below.

  1. Check for the base cases:
       a. All training examples belong to the same class.
       b. The training set is empty.
       c. The attribute list is empty.
  2. Find the attribute with the highest information gain (based on entropy).
  3. Partition the set according to that attribute.
  4. Repeat steps 1-3 for each partition.
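
For reference, the entropy and information-gain computation behind step 2 looks roughly like this. It is a sketch for a single binary split on a continuous attribute; C4.5 proper also normalizes the gain (gain ratio) and handles categorical splits and pruning.

    import numpy as np

    def entropy(labels):
        """Shannon entropy of an integer label array."""
        counts = np.bincount(labels)
        p = counts[counts > 0] / len(labels)
        return -(p * np.log2(p)).sum()

    def information_gain(labels, feature_values, threshold):
        """Entropy reduction from splitting on feature_values <= threshold."""
        left = labels[feature_values <= threshold]
        right = labels[feature_values > threshold]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        return entropy(labels) - weighted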

5.2 Stance Detection

Our first stance detection model is a bidirectional conditionally encoded LSTM with two fully connected layers at the end. Each LSTM is identical with a hidden dimension of 64. We first encode the headline with the first LSTM. We then encode the body with the second LSTM, initialized with the hidden states of the first LSTM. The output of the second LSTM is passed through two fully connected layers, with ReLU and a dropout of 0.25 applied between them.
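
A sketch of this architecture in PyTorch. The framework choice and the width of the intermediate fully connected layer are assumptions; the report only fixes the hidden dimension of 64, the dropout of 0.25, and the conditional initialization.

    import torch
    import torch.nn as nn

    class ConditionalBiLSTM(nn.Module):
        def __init__(self, embed_dim=100, hidden=64, fc_dim=128, num_classes=3):
            super().__init__()
            self.headline_lstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
            self.body_lstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
            self.fc1 = nn.Linear(2 * hidden, fc_dim)   # fc_dim is an assumed width
            self.dropout = nn.Dropout(0.25)
            self.fc2 = nn.Linear(fc_dim, num_classes)

        def forward(self, headline, body):
            # Encode the headline and keep its final hidden/cell states.
            _, (h, c) = self.headline_lstm(headline)
            # Conditional encoding: initialize the body LSTM with the headline's states.
            out, _ = self.body_lstm(body, (h, c))
            feat = out[:, -1, :]                       # last time step of the body encoding
            return self.fc2(self.dropout(torch.relu(self.fc1(feat))))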

The second stance detection model is a simple CNN with filters for windows of 3-7 words (100 of each), and a single fully connected layer at the end, with dropout of 0.5. We used cross-entropy loss and the Adam optimizer and L2 regularization with alpha = 0.00001.
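
A corresponding PyTorch-style sketch of the CNN. The framework, max-over-time pooling, and mapping the L2 term to Adam's weight_decay are assumptions; the filter sizes, filter counts, dropout, loss, optimizer, and L2 strength are as described above.

    import torch
    import torch.nn as nn

    class StanceCNN(nn.Module):
        def __init__(self, in_dim=92, num_classes=3):
            super().__init__()
            # 100 filters for each window size 3-7, convolved over the token dimension.
            self.convs = nn.ModuleList(
                [nn.Conv1d(in_dim, 100, kernel_size=k) for k in range(3, 8)]
            )
            self.dropout = nn.Dropout(0.5)
            self.fc = nn.Linear(5 * 100, num_classes)

        def forward(self, x):                # x: (batch, 70, 92)
            x = x.transpose(1, 2)            # -> (batch, 92, 70) for Conv1d
            pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
            return self.fc(self.dropout(torch.cat(pooled, dim=1)))

    model = StanceCNN()
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), weight_decay=1e-5)  # L2 with alpha = 0.00001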

6. Results


6.1 Relevance Detection

For relevance detection, we achieved very good results with just simple models based on our derived features, with around 95-97% validation accuracy for logistic regression, random forest with CART trees, and our own implementation of C4.5 decision trees. On a relevance detection subset of 10000 observations, our implementation of C4.5 took 118.45 seconds to train a random forest of 25 trees. The random forest achieved a mean accuracy of 97.01%, while a single C4.5 decision tree achieved a mean accuracy of 96.33%. All of these results are much higher than the baseline of 73%. C4.5 decision trees had slightly better results than CART trees both when used alone and when used to build a random forest, with an improvement in accuracy of about 0.5-1%.

6.3 Stance Detection

For the stance detection task the baseline for the subset of data we focused on was 66%. We achieved results of 5-10% above baseline using a CNN trained on custom embeddings with filters of size 3-7 (100 of each) and a single fully connected layer at the end.

An image of stance detection results is on the following page.

6.4 Combined Model

In the combined model, the stance detection model is used to predict classes only for the data points that are classified as related by the relevance detection model.

Scoring in the FNC weights the stance and relevance detection components 0.75 and 0.25, respectively. When evaluating the combined model against the official competition test set, we got a composite score of ~9200, corresponding to a weighted accuracy of 79.0%. This is slightly more than 3% above the baseline implementation (a gradient boosting classifier) and would have placed us 13th had we competed in the actual competition.
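
The combination step itself is simple; a sketch is below, where the string labels and array representation are illustrative.

    import numpy as np

    def combine_predictions(relevance_preds, stance_preds):
        """Pairs the relevance model calls unrelated stay unrelated;
        everything else takes the stance model's agree/disagree/discuss label."""
        relevance_preds = np.asarray(relevance_preds)
        stance_preds = np.asarray(stance_preds)
        return np.where(relevance_preds == "unrelated", "unrelated", stance_preds)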

7. Analysis


Perhaps the most surprising result is how well simple models work for the task of relevance detection. Our model for relevance detection had excellent performance even on the competition test set, showing that it generalizes well.

Our stance detection model had very mixed results. Overall, it was better than guessing the most common class, but its performance leaves a lot of room for improvement. The model still classifies the majority of each class as ‘discuss’, though the results for ‘agree’ are also promising. We suspect the model is not sensitive enough to the connotation of words that express affirmation or doubt, which is something we hope to address next semester.

Our model was unsuccessful at correctly classifying instances of disagreement between headlines and article bodies, which was disappointing because that was the most interesting class. We attribute this to the unbalanced nature of the dataset, with only about 6% of the labels in the stance dataset belonging to that class. Classifying a pair as ‘disagree’ is also inherently difficult because the disagreement can go in either direction: the headline could be claiming that the body’s story is a hoax, or vice versa. These two situations would have very different distributions of negation and sentiment, yet they are not differentiated in the dataset. The key to improving our model’s performance on this class will likely be finding a way to encode and detect this distinction.

8. Interactive Visualization


We developed a visualization that reveals the structure of random forest and decision tree models for binary classification. A single tree can be displayed at once, and the user can select and highlight nodes that split on a particular feature.

The nodes are colored according to the distribution of training labels in the partition that they split on, and their size is proportional to the size of the partition. This lets the user inspect how useful each split was and helps determine whether the tree has an appropriate depth.

For example, if the leaf nodes alternate between red and blue, the important splits happened at a deeper layer of the tree. If the nodes in the higher levels of the tree are already almost entirely red or blue, then further splits provided little benefit; on the other hand, if the leaf nodes are purple, the tree is not deep enough.

There is also a heatmap-like feature on top that shows the accuracy scores of each tree based on the validation dataset, and allows for quick navigation by clicking on a box instead of going through a dropdown menu. This can also display which trees split on a particular feature, which is helpful for seeing how significant a particular feature is and how much it affects a tree’s performance.
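
As a sketch of the data behind the visualization, the per-node label distributions, sample counts, and split features can be pulled from a fitted sklearn tree and serialized to JSON for the D3 front end; our C4.5 trees expose equivalent information, and the JSON schema here is our own.

    import json
    from sklearn.tree import DecisionTreeClassifier

    def tree_to_json(clf: DecisionTreeClassifier) -> str:
        t = clf.tree_

        def node(i):
            d = {
                "samples": int(t.n_node_samples[i]),          # drives node size
                "distribution": t.value[i][0].tolist(),       # drives node color (label mix)
            }
            if t.children_left[i] != -1:                      # internal node: record the split
                d["feature"] = int(t.feature[i])
                d["threshold"] = float(t.threshold[i])
                d["children"] = [node(t.children_left[i]), node(t.children_right[i])]
            return d

        return json.dumps(node(0))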

The visualization can be found at: https://cornelldatascience.github.io/Insights-FakeNews/decision_tree_vis_test.html

9. Future Work


Our work next semester will focus on improving our stance detection model and coming up with better ways to visualize our methods and results.

To improve our deep learning model, we plan to investigate various feature engineering techniques and different architectures. We aim to create word embeddings that better capture the semantics of our text, and to incorporate encodings of other features derived from bag-of-words and dependency trees. Our current encoding is not very expressive for words with an affirmative or doubtful connotation, so we may incorporate better sentiment/polarity metrics or useful lexicons into our feature engineering.

In addition we want to further balance the classes in the stance detection dataset by either developing a way to synthesize new instances of the disagree class or by using stratified sampling techniques when batching and splitting our data.

This semester we experimented briefly with RNN and LSTM architectures; next semester we will examine them more in depth to see whether they are a better alternative to CNNs. In particular, they seem promising for the challenge of detecting disagreement mentioned earlier in this writeup.

The random forest visualization can be improved to give more aggregate information about the forest instead of just showing individual trees. A better interface is also needed, since the current one only supports trees of six layers or fewer (deeper trees end up looking very cluttered due to limitations of D3’s layout algorithms). One way to solve this is to collapse layers beyond the sixth by default, with the option to expand them; this would require a way to indicate the depth of collapsed subtrees to the user. Furthermore, we can use this visualization as a tool to see which features are most important across the project. If we can identify features that are more significant than others, we may be able to improve our feature engineering as well.

Finally, our goal for visualizing deep learning models is to map neuron weights and activations back to the source text, highlighting words according to the weight or output of a selected neuron, layer, or filter. This would give useful information about which words or embedding features contribute most to the final classification result.

10. References


1. Augenstein, I., Rocktäschel, T., Vlachos, A., & Bontcheva, K. (2016). Stance detection with bidirectional conditional encoding. arXiv preprint arXiv:1606.05464.

2. Baird, S., Pan, Y., & Sibley, D. (2017). Cisco-Talos Fake News Challenge. Github repository. https://github.com/Cisco-Talos/fnc-1

3. Ferreira, W., & Vlachos, A. (2016). Emergent: a novel data-set for stance classification. In Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: Human language technologies (pp. 1163-1168).

4. Kim, Y. (2014). Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

5. Schumacher, A. (2015, November 29). See Sklearn Trees with D3 [Blog post]. Retrieved from https://planspace.org/20151129-see_sklearn_trees_with_d3/

6. Sima, O. (2011, March 25). Decision Trees – C4.5 [Blog post]. Retrieved from https://octaviansima.wordpress.com/2011/03/25/decision-trees-c4-5/