Hey there! I'm a fifth year at the University of California, Berkeley. I'm double majoring in Statistics and Computer Science. I'm a Texas native, growing up in Port Arthur, Texas. My hobbies include breakdancing, video games, origami, and sports. I enjoy meeting new people and seeing new places. This summer, I am finishing up my last class and am the Head TA for Machine Structures. I welcome all full-time opportunities. My recent interests lie in machine learning, data mining, and statistical inference. Below is my journey at Cal and why I chose Data Science!
I entered Cal studying Mechanical Eng. and took intro level math, engineering, physics, chemistry, and programming courses. I found that I was more fascinated with the math and programming and switched to Applied Math!
After falling in love with Multivariable Calculus and Linear Algebra, I moved on to advanced courses. Though proofs and theorems were interesting, I found myself wanting something more "applied." Enter Statistics, prominent for solving many real world problems!
I found my niche in Stats, and it found a special place in my heart. I was intrigued with all you could do with data, and dived into statistical learning. As of Fall 2015, I have completed the Stats degree with a concentration in Applied Math. My lingering interest in coding and love for data drove me towards Computer Science!
Fast forward to the present, where I'm working towards a degree in CS. I enjoy coding and fusing CS with Stats. Areas of Computer Science that interest me are data structures, algorithms, databases, artificial intelligence, and machine learning. I intend to finish CS and graduate in Summer 2017, using all my skills to pursue Data Science!
Wait, which is it? Child? History? Religion? Science?
This project aimed to build a classifier that could correctly label which book category an excerpt came from. More specifically, we gathered data from Project Gutenberg, and split it into about 22,000 .txt files. Each .txt file contained a random excerpt from a Child, History, Religion, or Science book. From these .txt files, we constructed features (predictors) and trained different models. We settled on training and tuning SVM and Random Forest models. Our radial kernel SVM model performed best with an accuracy of 0.94191 on Kaggle. You can follow this project on Github.
The data consists of a total of 22,308 text files from Project Gutenberg. Each file contains an excerpt from a book. The files are separated into four different categories corresponding to the genre of the book, “Child,” “History,” Religion,” and “Science.” We began by observing the size of each category. Below is our frequency count:
Class | Child | History | Religion | Science |
---|---|---|---|---|
Observations | 7,164 | 5,352 | 2,361 | 7,431 |
We have the most observations in the Science category and the least in the Religion category. To get an idea of how the data is formatted, we looked at a few files in each category. Here is a preview of how a .txt file looked:
The files seem to contain excerpts from random points and endings in the book. Many of our files include table of contents, chapter titles, page numbers, and have text art. From our random selection of files in the child folder, we noticed talk of princes and princesses, animals, and names of well-known characters like Alice in Wonderland. In the religion folder, we identified many files with words pertaining to Christianity. In the history category, some files seem to have historical figures and places such as Rome and Cleopatra; it also seems to include legends and historical fiction. In the science category, there are both science fiction and nonfiction academic papers. We also noticed sparse distribution of foreign words; for example, French was in the history section (5903.txt) and Spanish in child section (1671.txt). Some text excerpts are significantly longer than others, which provoked curiosity about the word count of each text file. Also, we noticed that some Project Gutenberg licensing mentioning Project Gutenberg occurs at the beginning and end of the book observations. It is not always easy for humans to classify the differences unless it is obvious from word associations, prior knowledge of the text, or numbers because of some of the categories may overlap. For example, the child’s category contains a children's history book and the history category contains the history of the number “\(e\)”, primarily consisting of numbers (which might be misclassified as science).
Our method of creating features was broken into two steps. First, we created "word features." This was simply the process of finding all the unique words in our dataset, and making counts for each one's frequency. This is very much along the lines of the Bag-of-words model. Then we created "power features," which were custom features that we felt may be helpful in predicting the categories.
Since we were given .txt files, we had to figure out how to most efficiently convert them into data we could work with in R and Python. We used Python to import the files and parse out word features using the open and read functions.
To remove common word features (stop words), we created a text file containing all the words found here. Each time we parsed a word, we checked to see if it was in our remove file and, if it was, we did not include it.
After we had all the .txt files in Python as strings using the read function, we looped through each excerpt using the glob library. We converted all letters to lowercase, replaced all non-alphabet characters with spaces, and then split by whitespace to make a list of all the words in the file. We converted the list into a set and back to a list in order to remove duplicate words. We added the lists produced by the documents together, then applied the same procedure to remove duplicates.
Some excerpts involved licensing information for Project Gutenberg. To remove that information, we added more words to our remove file. The new words we added consisted of: "Project", "Gutenberg", "eBook", "Title", "Author", "Release", and "Chapter". We also added some additional words that we found were associated with copyright information. After removing all these words and the common stop words, we created a list of the unique words remaining and the total number of times they appear throughout the dataset. Then we divided by the number of words per .txt observation to standardize into relative frequencies.
With such a large dataset, we were bound to have a large feature space of word features. We definitely wished to reduce the feature space without losing too much information. In order to do so, we thought it might be helpful to group words with common stems. We did some research and found the Natural Language Toolkit (NLTK) in Python. There were two methods that we considered, basic stemming and lemmatization. Basic stemming in NLTK groups words with common roots (for example, car and cars would be collapsed into one feature). Lemmatization would combine words with the same meaning (collapsing car and automobile, for example). We chose to go with basic stemming because we worried that different choices of vocabulary for the same word may be indicative of class. We decided to test our models on both the stemmed and unstemmed matrices and later decide which to use based on error rates, as documented later.
In creating the word features above, we incurred several programming challenges. The script took a long time to remove duplicates in the appended list of approved words. As we added more words, the list grew larger and the loop became increasingly slower as more documents were parsed. If we only removed duplicates at the very end of the loop in the final list, the list became so long that it caused a memory error. This was resolved by only removing duplicates for every 100 documents, then adding the result of each of these lists together to form a final list, from which we removed the final duplicates.
We initially produced around 1.8 million word features. Another programming challenge we encountered was reading such a large number of initial features into R. We thought this seemed like too high of a number, and we noticed that an immensely large amount of the words seemed to be part of DNA sequences. We worried that we might have to get rid of these in our scripts, but later saw that they would disappear when filtering our histogram (explained in the section below). To reduce our feature space to something we could process, we needed to cut out stopwords and rare words. For the unstemmed word feature matrix, we ended up with 22,308 features with 22,308 observations. For the stemmed word feature matrix, we ended up with 9,000 features with 22,308.
As stated above, our word features matrix was too large for R to process. It was just too big in general. We needed a way to reduce it drastically. To do so, we decided to count the frequency of each word, and make a histogram of those counts. This told us how many words occured with certain number of counts. From this histogram, we could set a minimum and maximum threshold. Any words outside our threhold interval either occured too frequently or rarely. These words would obviously not help our classifier, and so we decided to remove them. To help decide where to set the thresholds, we produced an untrimmed histogram of the counts, as seen below:
This untrimmed, unstemmed word features histogram is not very informative. The vast majority of words occur under twenty times where the x-axis is number of times the word occurs and the y-axis is the number of words. The histogram has a long right tail, indicating that some words occur very frequently. The stemmed word feature matrix histogram looks similarly skewed.
In order to reduce the word feature matrix, we created a function makeMatrix. This function follows a procedure similar to producing the original word features. We looped through the .txt files using glob, parsed the file into a list of words, then created a dictionary that contained a key for each word that occurred in the document using the Counter function from the collections library. The value attached to the key is the number of times the word occurred in the document. Each of these values are mapped to the corresponding column and row of the matrix.
The mapping process was extremely slow since there were so many word features. This was resolved by sorting the word features alphabetically and using a binary search to find the correct column to map the value to.
We trimmed the histogram (changed the minimum threshold) to remove more rarely occurring words. We did not want to alter the maximum threshold yet because we thought that words occurring very commonly might have a different prominence among classes. For the unstemmed word features, we ended up with a minimum threshold of 2,508 and a maximum of 130,000 for a total of 2,495 predictors. For stemmed word features, we used a minimum threshold with 1,864 and a maximum threshold with 160,000 with 3,001 predictors. Below is the histogram after we trimmed for thresholds for the unstemmed words.
There is still a peak on the left side and a long right tail, but this histogram provides us with a much more reasonable number of word features. A similar histogram was produced for the stemmed words after cutting between the threshold.
Now that we got rid of super rare words, we needed to focus on the maximum threshold. The issue here was that if we didn't set the maximum just right, then we would lose words that occured frequently overall, but were also indicative of a particular category. For example, the word "Jesus" occured frequently, however mostly occured only in Religion books. Thus, removing "Jesus" as a word feature was detrimental, since it was a strong indicator of a Religion book. In order to find the right maximum threshold, we used a validation set to check the proportion breakdown for all frequent words. With a validation approach, the validation set was created with equal proportions by a stratified random sample of 300 of the .txt observations.
Set | Child | History | Religion | Science | Total |
---|---|---|---|---|---|
Entire Training | 7,164 | 5,352 | 2,361 | 7,431 | 22,308 |
Validation Set | 96 | 72 | 32 | 100 | 300 |
The first row of this table shows the number of observations in the full training set and the second row shows the number of observations in each category that were in our validation set.
The supervised feature selection was performed in R. The programming challenges included reading the data into R after unsupervised feature selection, with 22,308 rows and 2,495 word predictors for the unstemmed word features. Write.csv() could not read in such a large file and thus the package data.table with the function fread() was used to import the data into R.
From this validation set, the relative frequency in each category was summed and divided over the total relative frequency for all the classes. By taking the product of the four relative frequencies per class, we formed a threshold based on the minimization of area with four dimensions. Thus, through minimization, one class would be optimized closer to one hundred percent occurrence for a particular word predictor.
With a lower cutoff number for the threshold, we allowed fewer predictors to pass through. We settled on a threshold of 2E-3 from the validation set, which left 1,642 unstemmed word predictors. For the training set, we took out the validation set and were left with 22,008 observations; then we reduced the predictors space for the unstemmed matrix to 1,642 predictors.
Similarly, the supervised feature selection was also performed in R for the stemmed matrix, with the 22,308 rows and 3001 stemmed word predictors.
From the same validation index above, the relative frequency per stemmed word predictor in each class was also summed and divided over the total relative frequency for all the classes. The same validation index was used as for the unstemmed so that the stemmed and unstemmed matrix could be compared later when the models were made on the same observations.
With a lower cutoff number for the threshold, we allowed fewer predictors to pass through and we settled on a threshold of 1.5E-3 from the validation set, which left 1,291 stemmed word predictors. For the training set, we took out the validation set and were left with 22,008 observations; then we reduced to 1,291 predictors that passed the threshold.
Below are the "custom" power features we came up with.
Power Feature | Description |
---|---|
uniquecount | Number of unique words per book over number of words per book. |
sentence.length | Average number of words per sentence. |
avg.word.length | Average number of characters per word. |
digit.prop | Proportion of the document that is numbers. |
capital.prop | Ratio of capital to lowercase letters (only first letter of word). |
quotation | Number of quotation marks divided by number of words. |
question | Number of question marks divided by number of words. |
exclamation | Number of exclamation marks divided by number of words. |
noun | Percentage of the file that is nouns (using NLTK). |
adj | Percentage of the file that is adjectives (using NLTK). |
adv | Percentage of the file that is adverbs (using NLTK). |
verb | Percentage of the file that is verbs (using NLTK). |
foreign | Percentage of the file that is foreign words (using NLTK). |
preposition | Percentage of the file that is prepositions (using NLTK). |
pronoun | Percentage of the file that is pronouns (using NLTK). |
interjection | Percentage of the file that is interjections (using NLTK). |
childW | Percentage of words in our “child dictionary” that appear in a book. |
historyW | Percentage of words in our “history dictionary” that appear in a book. |
religionW | Percentage of words in our “religion dictionary” that appear in a book. |
scienceW | Percentage of words in our “science dictionary” that appear in a book. |
We utilized NLTK to classify words by part of speech. We also hypothesized that exclamation marks might occur more in the books for children. Using the validation set in part 3b, we made lists of words that appeared commonly in just one class. We had four of these lists, which we called our “dictionaries” to see what percentage of words in these dictionaries appeared in an excerpt.
Another thing we found in our research was that using bigrams might be helpful. For example, children’s stories often begin with “once upon a time” and end with “and they lived happily ever after.” Bigrams, or combinations of two words in a specified order, seemed to be a common method of text classification. We pursued this idea and created a list of bigrams following the exact same method we used for word features. This took a great deal of time, but when we tested our models using our custom features and our bigrams, we found that it performed worse than models including our custom features alone. Since bigrams did not improve any of our models, we decided to drop them.
After creating the word and power feature matricies seperately, we created a combined feature matrix. The dimensions of our combined word and power feature matrix was 22,008 rows (all the books minus our validation set, which we must exclude from here on out) by 1662 columns (1642 unstemmed word and 20 power features). For the stemmed matrix, there were 1,311 columns (1291 stemmed word and 20 power features). We wanted to train our models on these three feature matricies to see which worked best.
The first challenge was creating a random forest classifier in R as performing cross validation with the randomForest() function was not computationally feasible; the 10-fold cross validation code was run for more than 8 hours without one model being fully produced.
We considered changing some parameters like nodesize and maxnodes, but also researched alternative methods. Another package, bigrf (available on github) by Aloysius Lim, with the function bigrfc(), was able to train the random forest on large data sets quickly. For data sets too large to be stored in memory, it caches the files while building the forest through calling on another package, bigmemory, and also performs parallel tree growing. The only downside to this method was that the bigrf and bigmemory packages were only available on *nix systems.
With a comparison of the unstemmed and stemmed word features with the random forest, the out of bag error was lower for the unstemmed word features. Additionally, when we first tried to perform 5-fold cross validation with unstemmed features, we ran into computational issues. When we tried to parallelize across multiple computers, bigrf was not working due to memory or OS issues. Thus, we decided to move forward with a random forest model based on the unstemmed word features. Below is a comparison of the out of bag (OOB) errors for unstemmed vs. stemmed word features.
The OOB error for the unstemmed word features appears to be less than the OOB error for the stemmed word features. Although the stemmed matrix does indeed have less features, the matrix also may contain less information as more words were collapsed together.
The OOB error was used to justify the number of trees to be used when building random forest. 500 trees were grown, with the parameter \(m\) taken to be the default of the square root of the number of predictors. Finally, 200 trees was chosen for the next step of tuning the parameter \(m\).
As seen above, after growing 500 trees, the OOB appeared to be stable at 200 trees and thus was chosen for the next tuning step of the parameter \(m\).
After the number of trees was selected at 200, a 10-fold-CV was run with predictors \(m\) = 10, 40, 100, and 400. However, it took 12 hours to tune the parameter m. Finally, the predictor with \(m\) = 100 had the lowest average CV error rate over ten folds.
As seen above, the minimum average CV error rate came from \(m\) = 100. The x-axis is the number of predictors and the y-axis is the average 10-fold CV error rate
As 10-fold-CV was computationally intensive and we were unable to perform more narrowing values of \(m\) on a suitable time scale, we decided to use out of bag error to narrow the parameter \(m\). We also decided to use out of bag error to narrow down the parameter \(m\) for the remaining random forest classifiers to be produced in the later sections.
The same minimum parameter \(m\) = 100 was found with out of bag error.
The final word feature RF used \(m\) = 100 and with number of trees equal 200. Below are the total accuracy and accuracy per class based on the final word feature random forest.
Total Accuracy | Child(0) | History(1) | Religion(2) | Science(3) |
---|---|---|---|---|
0.8662759 | 0.8614884 | 0.8185606 | 0.8153714 | 0.9214295 |
Below is the confusion matrix for the actual labels (column) and the predicted labels (row) from the final word feature random forest classifier.
Actual/Predicted | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | 6,089 | 558 | 35 | 386 |
1 | 608 | 4,322 | 127 | 223 |
2 | 50 | 264 | 1,899 | 116 |
3 | 307 | 218 | 51 | 6,755 |
We can see that this model does an "okay" job of predicting the book categories, however we definitely wanted something with at least 90% accuracy. Furthermore, we're pretty concerned with in-class accuracy. Especially since History and Religion seem to conflict a lot more than other pairs. Let's see how the pairwise ROC curves look.
For the ROC curves below, note that the other values of the parameter \(m\) look negligible as the true positive rate and the false positive rate are close.
We can see that the Area of the Curve (AUC) is minimal for History and Religion. This showed us that we needed to do a better job of seperating those two classes.
At this point in the project, we were experimenting with with a Boosting model, SVM linear model, and a SVM radial model. Tuning for parameters in the models took an extremely long amount of time. The fact that some of these models had multiple parameters to tune for, also led us to require more computational power. In order to make the tuning process efficient, we used the SCF servers to run multiple cross validation processes at the same time. Since cross-validation on R took a long time, the sklearn library in Python was used to run cross validation on the boosting and SVM model. The SVM radial model was chosen because be it had the best CV.
The final SVM radial classifier for word features was chosen with cost=7.7 and gamma=80. Below are the cross-validation fold accuracy rates.
Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 |
---|---|---|---|---|
0.91720 | 0.93777 | 0.92193 | 0.92094 | 0.91981 |
Below are the in-class and cverall accuracy rates of the SVM radial model. For using only the word features, SVM performed better than RF, although it may be overfitting the training data.
Set | Total | Child | History | Religion | Science |
---|---|---|---|---|---|
word | 0.97099695 | 0.96677834 | 0.95310164 | 0.96315121 | 0.990445431 |
Again, we are concerned with in class accuracy. Below are the ROC curves for all pairs.
Again, we see that the AUC is lowest for the History and Religion pair. Overall though, our SVM model seems better than our Random Forest one.
The out of bag error was used to justify the number of trees to be used later on to tune for the predictor m with the default argument for m, the number of predictors.
After growing 500 trees, the OOB appeared to be stabled at 300 trees and was chosen for the next tuning step.
Four different predictors were tried including the square root of the number of predictions: 4, 5, 10 and 15. The best out of bag error (0.1155191) was found with \(m\) = 4 predictors and was selected as the final model.
Increasing the parameter m did not help as the predictor space was small with only 20 predictors, thus the m with the minimal out of bag error was \(m\) = 4.
The final word feature RF model used had hyperparameters of \(m\) = 4 and number of trees equal 300. Below is the total accuracy and accuracy per class based on the final power feature random forest versus the word feature random forest. Note how we not only had a higher overall accuracy, but also a higher in-class accuracy with the power feature random forest.
Model | Total Accuracy | Child(0) | History(1) | Religion(2) | Science(3) |
---|---|---|---|---|---|
Power RF | 0.8844809 | 0.8658571 | 0.8374439 | 0.8598052 | 0.9441529 |
Word RF | 0.8662759 | 0.8614884 | 0.8185606 | 0.8153714 | 0.9214295 |
We can see that our greatest increase in accuracy occured in the Religion class. This is really good since we were having issues distinguishing between History and Religion books.
To build on above, let's again see the ROC curves for all pairs.
The final SVM radial classifier chosen with power features was with cost=15 and gamma=8. The model performed worse with just power features than with just word features. The accuracy within each class went down from using word features to power. From all the classes, the Child class had the most dramatic decrease, losing almost 20% accuracy.
Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 |
---|---|---|---|---|
0.79200 | 0.81828 | 0.78193 | 0.82161 | 0.81548 |
Set | Total | Child | History | Religion | Science |
---|---|---|---|---|---|
power | 0.8781603 | 0.8803471 | 0.840994 | 0.765772 | 0.93850087 |
Here are the pairwise ROC Curves below. Again, note that this model is one of our worst, and does poorly between history and religion.
The out of bag error was used to justify the number of trees to be used later on to tune for the predictor \(m\) with the default argument for the parameter \(m\) (the square root of the number of predictors). The number of trees chosen was 200.
As seen above, after growing 500 trees, the OOB appeared to be stabled at 200 trees and was chosen for the next tuning step.
Six different predictors were tried including the square root of the number of predictions: 10, 40, 50, 100, 125, 150. The lowest out of bag error was found using \(m\) = 100 predictors.
The parameter \(m\) that produced the local minimal out of bag error was \(m\) = 100.
The final word and power feature RF used \(m\) = 100 and with number of trees equal to 200. Below is a comparison of the total accuracy and the accuracy per class of the random forest classifiers.
Model | Total Accuracy | Child(0) | History(1) | Religion(2) | Science(3) |
---|---|---|---|---|---|
Word & Power RF | 0.8817465 | 0.8625070 | 0.8409940 | 0.8504871 | 0.9395774 |
Power RF | 0.8844809 | 0.8658571 | 0.8374439 | 0.8598052 | 0.9441529 |
Word RF | 0.8662759 | 0.8614884 | 0.8185606 | 0.8153714 | 0.9214295 |
When compared with just the word feature RF, the overall total accuracy of the combined word and power RF performed better, with the best improvement in the history, religion, and science classes.
When compared with the power RF, the overall total accuracy of the combined word and power RF is lower; the only class that performed better with combined word and power RF was history. Thus, the best random forest classifiers out of the ones produced is the power random forest.
Below are the ROC curves for all pairs. As expected, the History and Religion combo is not the best, but we also don't do as well in the Child and History combo. Perhaps this is a sign that we should really stick with our SVM model.
The final radial SVM for combined word and power features was chosen with gamma=1 and cost=75. Below is the Cross-Validation fold accuracy.
Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 |
---|---|---|---|---|
0.88010 | 0.87419 | 0.86991 | 0.86294 | 0.86421 |
Below is the in class and cverall accuracy rates of SVM radial models for combined word and power features.
Set | Total | Child | History | Religion | Science |
---|---|---|---|---|---|
wordpower | 0.96203238 | 0.9435725 | 0.97246929 | 0.990714574 | 0.96826251 |
Comparison of all the SVM classifiers. From the CV score, the word feature SVM seems to perform the best.
Model | Cross Validation Score |
---|---|
Word & Power SVM Radial | 0.87027 |
Power SVM Radial | 0.80585 |
Word SVM Radial | 0.92353 |
Comparing our SVMs it is clear that the word only feature space does the best. Our results did not improve by using the combined feature space matrix.
Once more, let's see the ROC curves for our model.
After comparison of errors between all models, the best model submitted was the radial SVM created with just the word features with cost=7.7 and gamma=80. The overall accuracy and accuracy per class for the final submission to Kaggle is seen once more below.
Set | Total | Child | History | Religion | Science |
---|---|---|---|---|---|
word | 0.97099695 | 0.96677834 | 0.95310164 | 0.96315121 | 0.990445431 |
The final results had an accuracy of 0.94294 on the final Kaggle validation set (testing on 60% of the observations for the public leaderboard).
The learning process was iterative and involved tuning multiple parameters simultaneously to yield the best results. Once we realized that a radial SVM model was consistenly our most accurate classifer, we fine tuned it as much as possible using cross-validation. We set one of the paramters, either cost or gamma, constant and chose a range of values for the other paramterer to find the best combinations out of around 165 possible combinations.
A special thanks to Temi Lal, Athena Nghiem, and Charles Thorson! Without their help, this project wouldn't have been possible.
Hmm... what can we look for to predict the onset of diabetes?
This project is complete, but its web version writeup is still in progress. In the meantime, feel free to follow it on Github.
So, is it the Democrats or Republicans?
This project was one I did early in my years at Cal. It was a gentle introduction to machine learning, without going in to too much depth. For that reason, the classifiers used are rather basic and achieve mediocre results. The main idea of this project was to determine which political party each county in the US would vote for. We wish to determine the best method (using different combinations of predictors) by training on the 2004 election results. After we successfully trained the most precise model, we used the same model on 2012 data to see if it can predict accurate results. We tested out knn and recursive partitioning models. You can follow this project on Github.
Below are the following features we used in our model. The feature selection was not done in any statistical way (Lasso, PCA, Decision Trees) since this was a beginner project. However, we did choose them based off of what made the most sense intuitively. These are features that correspond to rows represented by a county in the US.
We are given 3 different types of data: 2012 Election data (XML), 2010 Census Data (CSV) and County latitude and longitude Geographic Data in (GML). To use that data, we used web scraping to import them into R as data frames. To properly use those data frames, we had to first clean the data up for each data set. Because some of the data were missing Alaska, while others contained Puerto Rico, we had to take them out from their respective data frames. We were able to merge the data together using the GEO.id variable. To merge all the data frames together, we specifically created a new column called Merger. Merger contained the county and the state for each row. The county and state names are squished together with no space or dash and is upper case to prevent any issues merging them together.
We hand-picked which predictors we thought would be useful, as shown above. We wanted to create a data frame with the numeric values of our predictors only. To do this, we removed all rows that contained "NA", "-", or any other non-numeric value. We created a data frame with the remaining rows and used the county names as row names so they would not interfere with the knn.cv function. Then we took the "winner04" column and stored it as "cl" to use as the cl argument in knn.cv. The cl argument is simply a factor vector with the true classifications of the rows in our training set data frame. We standardized the predictors in order to account for the fact that different variables have different ranges and variances. We then tested our knn.cv error rate with different values of k, finding that k = 5 consistently produced the best results with the 2004 data. For the 2004 election, our lowest error rate was 13.4523%. When we ran the knn.cv function using the 2012 data and k = 5, we got an error rate of 14.7382%. Thus the knn model seems like a stable method to predict election results without overfitting.
We chose k based on the 2004 election data. This graph below shows how our misclassification rate changes for different values of k.
We see that the error rate starts out relatively high for k = 1 and k = 2. After that, we observe a rapid decrease up to k = 5, which is the minimum. Beyond k = 5, the error rate increases gradually. It helps to include more than one or two neighbors because this provides us with more information. However, beyond a certain point, it no longer makes sense to let such dissimilar counties have a say in our prediction.
The knn misclassification graph below for the 2012 election data is slightly different.
Once again, the misclassification rate starts off relatively high. After k = 2, the error rate falls up until about k = 17, which is the minimum. Beyond this point we observe the same gradual increase. The difference here is that the optimal value of k is much higher (17 vs. 5). However, using k = 5 still produces a low error rate. As expected, our errors on this "test" set follows the typical U-shape.
We note that for the 2004 election data, the K Values Errors for the K Values 1 , 2 , and 3 follow a "scattered" pattern while the remaining K Values follow a linear trend. This can be seen below:
This leads us to fit the remaining K Values of 4, 5, 6, ... , 48, 49, 50. The remaining data fits a linear model very well with a correlation of r = 0.96053. What we conclude from this plot is that for all K Values after 3, there is a linear increase in K Value Error with each increment of the K Value.
For the 2012 election data, the same situation applies as it did for 2004, except that we truncate the K Values 1, 2, 3, 4, and 5. This can be seen below:
Again, the remaining data follows a linear trend as well with a correlation of r = 0.91254. Once again, we conclude that after we truncate the "scattered" K Values and their K Value Errors, we have a linear trend in which there is a linear increase in K Value Error with each increment of the K Value.
The Classification Tree of 2004 election results utilizes a set of 20 predictors selected from the 2010 U.S. census data. After modifying the rpart.control argument, we were able to minimize the classification error rate as low as 19.828%, using all predictors. The model generated an extensive and complicated classification tree with massive overfitting. In order to avoid overfitting the data, we pruned the tree according to the cross-validation error (xerror) given in the complexity parameter table. We located the value of the complexity parameter corresponding to the lowest xerror, and pruned our classification tree using this complexity parameter.
After training our predictors to the 2004 election results, we predicted 2012 election results using the same set of predictors together with the same rpart.control function arguments. The prediction generated a relatively accurate result and the error rate was as low as 13.636%. Implementing the same pruning procedure, we pruned the Classification Tree of 2012 election results based on the complexity parameter (cp) of the lowest cross-validation error (xerror). Then we finalized our Classification Tree of 2012 Election Result plot. The plots can be seen below:
This is the classification tree for the 2004 data. Out of our 20 predictors, rpart picks the most useful ones. This tree shows that variables such as U.S. citizenship, the x-coordinate location, language spoken at home, and race were used to make splits.
This is the classification tree for the 2012 data. Similar variables were used to make splits. Once again, U.S. citizenship, race, language spoken at home, and geography were important variables.
With all the predictions across the many counties in the US, it can be hard to make a sense of the nation as a whole. The map below can help us see the bigger picture.
Every arrow in the generated graph indicates a shift in how much the indicated county (the county from which the arrow originates) has changed in its political orientation between the 2004 and 2012 elections. A right-facing arrow voted Republican in the 2012 election and a left-facing arrow voted Democratic in 2012. A blue arrow means that the county voted more Democratic in 2012 than in 2004 and a red arrow means that county voted more Republican. The length of the arrow indicates the magnitude of the electoral changes. So a long leftward blue arrow indicates that a county voted Democratic in 2012 and was significantly more Democratic than in 2004.
When comparing the two methods of rpart and knn, we see that for 2004 election data, knn was a better approach while rpart was better for 2012 election data. This can be seen below:
We also see that from 2004 to 2012 election data, the knn error goes from 13.4523% to 14.7382% while the rpart error goes from 19.828% to 13.626%. This shows us that knn is a more consistent method, but that rpart can give better results, as seen for the 2012 data. What we conclude from this is that for these data sets, knn is a more stable method that we can use with more confidence, whereas rpart is less reliable due to its inconsistency.
A special thanks to Temi Lal, Dongping Zhang, Aloysius Lai, and Ankit Aggarwal! Without their help, this project wouldn't have been possible.
C's get degrees, but do degrees get monies?
The purpose of this study is to examine how expenditures made per household are related to the highest level of education obtained by the people living there. One might expect those with a higher education level to earn more money and thus spend more in general. The research question at hand, however, is also concerned with exploring how households with varying levels of education might be different in how they distribute their spending into categories like food and apparel. Every quarter, the Bureau of Labor Statistics (BLS) uses a multi-stage, rotating panel design survey to collect data on consumer expenditures. You can follow this project on Github.
The data shows, as expected, a substantial difference in the amount of money spent on food, housing, apparel, and in total per quarter among households with different highest levels of education. Households where at least one person has obtained a professional or doctorate degree spend around five times on average more per quarter than those in which no one has attended school. This gap is larger for goods like clothing and smaller for necessities like food. In general, there is a steady increase in spending in all categories as highest education level increases.
Samples are representative of the total US civilian population (non institutional)
First step: The selection of 91 areas (PSU’s or Primary Sampling Units), each of which are composed of counties or groups of counties.
Each PSU is one of 4 types:
An unclustered sample of Consumer Units (CUs) is chosen within each of the 91 PSU’s. A CU is represented as a group of people living in one household and can be visualized as below:
This list of houses(sampling frame) from which the CUs were picked came from the 2000 Census of Population 100-percent-detail file.
This survey uses 3 rotating panels. This diagram below depicts when panel A’s CU’s get the questionnaire. The CU then reports expenditures for the last 3 months. After 5 quarters, the CU is replaced.
For panel B, the CU’s receive questionnaires on Feb, May, Aug. and so on.
We began by creating weighted boxplots. The boxplots showed the expenses for a particular category such as food, by each education level. Below is the one for food expenses:
We see that median expenditures increase fairly linearly with education. The maximum spent by households in which no one has attended school is about equal to the seventy-fifth percentile of CU’s in which someone has a doctorate degree. Below are the exact numbers for comparison:
Education Level | Avg Food Cost Per Quarter | SE of Avg |
---|---|---|
0 | 684.4608 | 154.14609 |
10 | 813.2782 | 29.29602 |
11 | 899.8643 | 35.41176 |
12 | 966.4474 | 15.4189 |
13 | 1069.1435 | 18.74799 |
14 | 1235.8400 | 24.16602 |
15 | 1392.7496 | 22.78892 |
16 | 1677.0883 | 29.85763 |
17 | 1886.9310 | 108.33085 |
Below is the weighted boxplot for house expenses.
From above, we see that the four lowest levels have about the same median housing expenditures. We also see that expenditures increase non-linearly after 13 (some college). Below are the exact numbers:
Education Level | Avg Housing Cost Per Quarter | SE of Avg |
---|---|---|
0 | 1692.092 | 535.88908 |
10 | 1468.386 | 73.36080 |
11 | 1660.299 | 54.74892 |
12 | 1843.497 | 39.88800 |
13 | 2189.924 | 51.76436 |
14 | 2546.429 | 54.03962 |
15 | 3201.593 | 57.07351 |
16 | 4256.581 | 110.40218 |
17 | 4869.307 | 265.37950 |
Below is the weighted boxplot for apparel expenses.
We can see from above that this plot has the largest percent increase between highest and lowest levels. We also see that PhD level households spend 5 times more than those with no education. Below are the exact numbers for comparison:
Education Level | Avg Apparel Cost Per Quarter | SE of Avg |
---|---|---|
0 | 96.42134 | 61.109791 |
10 | 73.99194 | 6.294394 |
11 | 103.43423 | 10.471877 |
12 | 102.46369 | 3.853024 |
13 | 136.80003 | 4.546982 |
14 | 163.29004 | 6.950044 |
15 | 226.30929 | 6.354671 |
16 | 275.50286 | 9.590669 |
17 | 464.94509 | 45.839083 |
Lastly, below is the weighted boxplot for total expenses.
We se that the largest jump occurs from 15 (Bachelor's) to 16 (Master's). We also see that combining expenses widens the expenditure gap by education. Lastly, we see a nonlinear increase in expenditures after 13 (some college). Below are the exact numbers for comparison:
Education Level | Total Cost Per Quarter | SE of Total |
---|---|---|
0 | 3147.508 | 929.3599 |
10 | 3821.470 | 147.4508 |
11 | 4331.983 | 151.4866 |
12 | 5473.5952 | 110.8788 |
13 | 6553.773 | 148.0559 |
14 | 8086.071 | 200.8992 |
15 | 9930.513 | 176.2847 |
16 | 13279.432 | 299.9030 |
17 | 15219.252 | 688.8679 |
From our analysis above, we can conclude that higher education is associated with higher education expenditures. These households also vary more in terms of how much they spend in total and in every category separately. With a higher education, there are more expenses. Either people with a higher education are spending too much money they do or don't have. However, with a higher education, we're sure they're wise enough to spend within their limits. Thus, it seems that having a higher education really does mean you'll more money to live by.
As with every study, ours has a few caveats. A few data points had to be excluded from the boxplots, but were included in the numerical estimations. Also, respondents may not be able to accurately recall all expenditures.
A special thanks to Temi Lal and Andrey Mironyuk! Without their help, this project wouldn't have been possible.
Woah, how many people are in that crowd?! Let's use stratified sampling and find out!
In early 2003, many peaceful marches were held around the country in protest of the US involvement in Iraq. One of these marches had a huge turnout and occured in San Francisco. That day, the San Francisco police and march organizers estimated the total crowd at 200,000. The San Francisco Chronicle tried to use aerial photography to estimate the size of the protest. The Chronicle used an interesting technique of sampling on their photos, and came up with an estimate of 65,000 at the moment of the snapshots. My group and I decided to replicate The Chronicle's methods to see if our estimate was close. We obtained a crowd count of 61,732 with a 95% confidence interval of [57121, 66343] people. You can follow this project on Github.
The anti-war demonstration began with an 11 AM rally at Justin Herman Plaza at the Embarcadero in San Francisco. At around noon, the march began up Market Street toward the Civic Center Plaza. The march route was approximately 1.75 miles. In the photos, taken of the entire length of the march at 1:45 PM, the crowd had moved far up Market Street. The Embarcadero area is nearly empty. The march was densest near the Powell Street cable car turnaround. The front of the main march had reached McAllister Street, as another mass of protesters was gathering in the Civic Center Plaza area. Organizers said the crowd was largest between 1:30 and 2 PM. The Chronicle's estimate is 65,000 at the moment the photos were made. This figure does not account for marchers who left the group before that time or who joined the rally later.
The aerial photographs of Sunday's anti-war march were taken at a time believed by organizers to be close to the peak of the march. The plane made a pass over the length of Market Street and the Civic Center area, taking contiguous photographs.
The lens in the camera looked directly down from 2,000 feet. This process was repeated once to ensure that there would be a complete set of usable photos. An additional pass was made to allow a Chronicle staff photographer to shoot out the window of the plane.
Air Flight Service, a company that specializes in high-resolution aerial photography, produced 9-by-9-inch black and white negatives, which were enlarged into prints with enough clarity to define individuals on the street. Below is one of the pictures that was taken.
Air Flight Service, under contract to The Chronicle, then applied a grid pattern to the prints to provide units in which the marchers could be counted. The entire length of the 1.75-mile route is visible in the images.
Because the density of the crowd varied over the route, the grids were sorted into five categories - 100 percent, 75 percent, 50 percent, 25 percent and 10 percent filled with people. This was done by visual estimates. Each of the nearly 300 grids was given a colored dot representing these percentages.
At this point, the hand counting began. A representative sample of grids for each density was selected. For instance, eight 50 percent grids were counted, eight 25 percent grids were counted and so forth. The totals for each group were examined to ensure there were not extreme deviations. This tested the original visual estimate that had assigned grids to density groups. If necessary, a sample grid's category was changed to reflect the count.
An average number of people for each density grid was established using the sample groups. Air Flight Service then tallied the grids to come up with a total of nearly 65,000 people visible from the air at the moment the photographs were shot. The company said the margin of error is 10 percent.
Chronicle staff members double-checked the calculations counting multiple grids, confirming that the density assignments were correct. The Chronicle then counted individuals in a sampling of the grids. The newspaper's count in some cases was higher, in some cases lower than Air Flight Service's count. But overall the newspaper's total confirmed the Air Flight Service count.
Counting every single person one by one in these photographs would be a very cumbersome task. Hand counting would also lead to a significant amount of human error. We also noticed the same pattern as The Chronicle's across all the images. When analyzed in pieces, the grids appeared to follow a consistent classification of crowd density. For example, if a grid had a lot of people in it, then it was classified as a high density grid. Thus, we needed to apply a sampling method that took these classifications into account and to only count particular number of grids within these different classifications. Therefore, stratified sampling was the sampling method that was selected.
Our first step was to open the six PDF images in Adobe Photoshop. We needed to divide all of the images into reasonably sized grids from which we could sample. These grids serve as our sampling and observation units. After experimenting with different settings, we decided to split each of the six large photographs into 600 by 600 pixel grids. We looked through the grids and divided the ones most densely packed with people in half again (600 by 300 pixels) to make counting easier.
The next step in this process was to pick the strata. We looked carefully at the grids to decide how to split them up most logically. It seemed to us that the grids we drew were either quite full or quite empty. We decided that we would stratify by the density of the crowds in the grids. Our five strata were for grids 0-5%, 6-25%, 26-50%, 51-75%, and 76-100% full. At this point, we went through each grid and color coded it depending on which stratum we believed it belonged to (see image below). After this, we counted the number of grids in each stratum.
We wanted the proportion of grids from each stratum in our sample to be about equal to the proportion of grids of that stratum in the population (proportional allocation). For example, if 50 of 200 grids in our population were from stratum one, then 5 of 20 grids in our sample would be from stratum one. Once we decided how many grids we wanted to sample from each stratum, we used a random number generator in R to select the actual grids. Grids were numbered moving first from top to bottom and then from left to right.
After that, we meticulously counted the number of people in the selected grids. We used the paintbrush tool in Photoshop to mark off people in the images and counted as we went. Once we got an estimate for each grid, we averaged within the stratum (see calculations in the Results section). We multiplied the stratum average by the total number of grids in that stratum and added these five numbers together to get our estimate. The standard error calculation is also shown below.
Here is a quick overview of our methodology and how we handled the photos.
As we counted each grid, we stored its count and classification in a CSV file. You can download the data here.
We began by importing our data and seperating out the strata grid counts by color. Here is a comparison of the boxplots of grid counts by their stratum.
For the most part, each strata was quite distinct in size. However, the two largest strata are much closer since it became hard to distinguish how "big" a grid was with many dots. Nonetheless, we still found a way to seperate the two groups respectively.
Our main goal of this project was to calculate a 95% Confidence Interval of the total across the strata. In order to do this, we first calculated the estimate of the total across the strata.
The following is the formula to find the estimated total:
$$ \hat{t}_{strata} = \sum\limits_{h=1}^H \hat{t}_{h} $$
Recall that:
$$ \hat{t}_{h} = N_{h}\times \bar{y}_{h} $$
where \(N_{h}\) is population size of the stratum and \(\bar{y}_{h}\) is the mean of the stratum.
Let's substitute this in the equation for \(\hat{t}_{strata}\).
$$ \hat{t}_{strata} = \sum\limits_{h=1}^H N_{h}\bar{y}_{h} $$
Now we will compute \(\hat{t}_{strata}\) in R.
strata.total
## [1] 61732.2
To find the confidence interval of the total among all the strata, we also need to find the standard error of the total. To do this, we must first calculate the variance of the total across the strata. Because the total in each stratum is independent from all others, we can use independence to add up the variance of the total in each stratum to find the variance of the total among all strata.
$$ \begin{aligned} Var(\hat{t}_{strata}) & = Var(\hat{t}_{h=1}) + Var(\hat{t}_{h=2}) + \cdot\cdot\cdot + Var(\hat{t}_{h=H}) \\ & = \sum\limits_{h=1}^H Var(\hat{t}_{h}) \end{aligned} $$
We will now substitute for \(\hat{t}_{h}\) in the equation for \(Var(\hat{t}_{strata})\).
$$ \begin{aligned} Var(\hat{t}_{strata}) & = \sum\limits_{h=1}^H Var(N_{h}\bar{y}_{h}) \\ & = \sum\limits_{h=1}^H N_{h}^2Var(\bar{y}_{h}) \end{aligned} $$
Now recall the following approximation:
$$ Var(\bar{y}_{h}) \approx (1 - \frac{n_{h}}{N_{h}})\frac{s^2_{h}}{n_{h}} $$
where \(n_{h}\) is the size of the sample in stratum \(h\), \(s^2_{h}\) is the variance of the sample in stratum \(h\), and the first term in the product, \((1 - \frac{n_{h}}{N_{h}})\), is the finite population correction factor.
Because we do not know \(S^2_{h}\), the variance of the population in stratum \(h\), we must approximate the variance with \(s^2_{h}\). Thus, we will plug in the approximation above into the equation for \(Var(\hat{t}_{strata})\).
$$ Var(\hat{t}_{strata}) \approx \sum\limits_{h=1}^H N_{h}^2(1 - \frac{n_{h}}{N_{h}})\frac{s^2_{h}}{n_{h}} $$
Now we can solve for \(SE(\hat{t}_{strata})\), the standard error of the total across the strata.
$$ \begin{aligned} SE(\hat{t}_{strata}) & = \sqrt{Var(\hat{t}_{strata})} \\ & \approx \sqrt{\sum\limits_{h=1}^H N_{h}^2(1 - \frac{n_{h}}{N_{h}})\frac{s^2_{h}}{n_{h}}} \end{aligned} $$
Now we will compute \(SE(\hat{t}_{strata})\) in R.
strata.totalSE
## [1] 2352.209
To compute the 95% Confidence Interval, we need to solve for the following:
$$ 95\%\:\text{Confidence Interval} = [\hat{t}_{strata} - 1.96 \times SE(\hat{t}_{strata}),\:\hat{t}_{strata} + 1.96 \times SE(\hat{t}_{strata})] $$
Plugging in our values for \(\hat{t}_{strata}\) and \(SE(\hat{t}_{strata})\) gives us:
$$ \begin{aligned} 95\%\:\text{Confidence Interval} & = [61732.2 - 1.96 \times 2352.209,\: 61732.2 + 1.96 \times 2352.209] \\ & = [57121,\: 66343] \end{aligned} $$
Our estimate of 61,732 people seems reasonably close to The Chronicle's estimate of 65,000. The Chronicle's estimate is also contained in our 95% confidence interval. We put a lot of time and effort into designing the strata, sampling, and taking measurements. However, like all studies, ours is not perfect and has some potential limitations.
We defined our strata based on what seemed most fitting after studying the distribution of the crowd densities among the grids. Because these were qualitative preliminary judgements, they are subject to error. We tried to assign grids to the strata as carefully as possible, but this is somewhat subjective. Some of the grids were harder than others to classify, which introduces room for more error. For the most dense stratum, we realized that it was too overwhelming and difficult to count the number of people in an entire 600 by 600 grid. We divided each of the 600 by 600 grids in this stratum in half. Our estimates for the number of people in a 600 by 600 grid in this stratum were our measurements times two. This implicitly implies an assumption of complete uniformity within the grid.
One of our largest sources of error comes from when we had to count the number of people in the sampled grids. These are aerial photographs, so people just appear as black dots. The darkness of the pictures in some areas made it very difficult to tell what was a person as opposed to a tree, telephone poll, parking meter, etc. It was especially difficult to count people who were standing in close groups because they blended together in the image. Sometimes, we found it easier to count the number of shadows instead of dots.
Though all of these factors are sources of potential inaccuracy in our study, some of them are practically unavoidable. Thus, when we dealt with factors with which we could minimize error, we tried to be as precise as possible.
Our goal was to get an estimate for the number of people at the peace march in San Francisco and we ended up with a 95% confidence interval of [57121, 66343]. While our estimate has some variance, it gives us a good idea of the order of magnitude of the desired quantity. Our confidence interval is consistent with the estimate found by the San Francisco Chronicle. Sampling allowed us to answer our research question quickly and efficiently. Using stratified sampling with proportional allocation let us sample with more precision as we had a smaller variance of the total across the strata.
"Counting crowds / Using aerial photography to estimate the size of Sunday's peace march in S.F." SFGate. 21 Feb. 2003. Web. 10 Feb. 2015.
Lohr, Sharon L.. Sampling: Design and Analysis. 2nd ed. Boston: Cengage Learning, 2010. Print
A special thanks to Temi Lal and Andrey Mironyuk! Without their help, this project wouldn't have been possible.
Do distinct objects really make the brain light up differently?
Visual recognition is a complicated process in our brain. Studying how the brain performs during it helps us improve our understanding of brain functionality. Therefore, we want to use the dataset from Haxby et als study, Distributed and Overlapping Representations of Faces and Objects in Ventral Temporal Cortex to answer the following questions. If one presents different types of visual stimuli to a subject, would each stimulus evoke the same category-specific pattern of response in the ventral object vision pathway? Furthermore, can one fit a time series model to replicate the behavior of the BOLD signal from the responses? Also, can one use time series to match the correlation values of categories found in the study?
To study these questions, we performed linear modeling and ARIMA based times series analysis with this dataset. We found there is a consistent pattern for a subject to recognize one specific ob- ject. Even when we excluded maximally responding areas from the analysis, there was still a consistent pattern for visual recognition. Interestingly, our analysis showed that different subjects have dif- ferent visual recognition patterns. Although these patterns differed greatly across subjects, they were consistent within subjects. In other words, even though subject one has different patterns from subject two, they both had similar patterns among themselves when visually stimulated. You can follow this project on Github
In Haxby et al's study, Distributed and Overlapping Representations of Faces and Objects in Ventral Temporal Cortex[1], researchers collected data from six subjects, including five females and one male. Different categories of pictures were presented to subjects as visual stimuli. The categories are faces, cats, chairs, shoes, bottles, tools, houses, and a control category of phase-scrambled images. The study aimed to answer the following questions: if we present different categories of visual stimuli to a subject, will category-specific response be evoked in the ventral cortex?
Each subject was placed into a functional magnetic resonance imaging (fMRI) facility for 12 experiment runs. One complete experiment run lasted 300s. It began with 12s of rest, followed by 8 stimulus blocks of 24s duration, one for each category of visual stimuli. There were 12s intervals of rest between every two blocks, and the whole procedure ended with another 12s of rest. Each picture stimulus was presented for 0.5s followed by an inter-stimulus interval of 1.5ms. 12 stimuli were presented during each stimulus block, and a total of 96 stimuli in a complete experiment run.
In the study, the data collected for each subject were split into two runs: odd runs and even runs. Correlation was used as the indication of response similarity. Results from analyzing within-category and between-category correlations suggest that a lot of response patterns are overlapping. In the over- laps, there are parts of the cortex that respond more to a certain stimuli than others. The study defined these responses as maximal response. In order to gain more insights from the non-overlapping parts of the responses, the study tested whether the patterns of non-maximal responses carry category-related information. Voxels that responded maximally were excluded from this calculation of correlations. The results shown that the removal of maximally responsive voxels from correlation calculations barely diminishes the accuracy of identification. The study concluded that both the pattern of large and small responses and the location of large responses carry category-related information and small responses are an integral part of the representation.
The study's curated dataset can be found and downloaded on the OpenfMRI database with ds105's accession number. The ds105 sub-directory contains files detailing this study, including general information (README file), related research articles (references.txt), detail information and update for this released dataset (release history.txt), the MR repetition time (scan key.txt), the name (study key.txt), and the major task for this study (object viewing) (task key.txt). In addition, the models folder contains files with the key conditions (list of object categories) (condition key.txt) and the comparison setting in this study (tast contrasts.txt).
Subjects have individual directories storing their results. There are four sub-directories in each of the respective directories. The anatomy sub-directory contains high-resolution scans of the subject's head (highres001.nii.gz), mask for obtaining the \brain only" scans (highres001 brain mask.nii.gz), and the \brain only" anatomy result (high- res001 brain.nii.gz). The \behav" sub-directory is empty since subject's behavior is irrelevant to this study. The \model" sub-directory provides information such as the onset time (in seconds), and the duration and weighting for each conditions (object category) for the 12 task runs in this study. The \BOLD" sub-directory contains fMRI results for all 12 task runs for each subject respectively. In each task run directory, we can find the fMRI result (bold.nii.gz) and a QA sub-directory with that run's time series analysis report, fMRI results pre-processing and confound files, and visualization of the brain (nii files).
The preprocessed data for this ds105 dataset can be found and download from https://nipy.bic.berkeley.edu/rcsds/. The processing applied to these fMRI scans includes: high-pass filtering in time (to remove low frequency drifts / noise, and registration to a standard anatomical template (the MNI template).
For each subject, there are eight condition files corresponding to each of the eight objects for each of the twelve runs. Each condition file consists of time points of when the object was presented during the run. We started our analysis with subject 1, where we first wrote helper functions to identify and remove outliers. Then we ran the event2neural function on subject 1's condition files, which returned an array of 121 zeros or ones. These values indicate time intervals at which the subject was looking at the object. We then used the numpy convolve function to convolve the BOLD signal onto the specified time intervals needed. After repeating this process for all objects, we created regressors and built our design matrix. Lastly, we fitted a linear model to the predicted BOLD signals from convolution, and gathered a mean RSS value. We then applied the contrast function to investigate whether each object shows a response pattern in the brain.Unfortunately, initial images produced did not show suffcient contrast, making it difficult to observe patterns. Thus, we used smoothing techniques to produce better images. For smoothing, we applied Gaussian filters over an array of voxel intensity values to generate images that are more appealing to the human eye and easier to identify different parts of the brain. The downside of this technique is that the new patterns identified can potentially be false since the data is transformed. We then repeated our data analysis with the rest of the subjects.
We used various statistical methods and tests in our analysis. First, we built a general linear model to provide estimates for the magnitude of the response for respective stimulus. Then we applied this model to obtain beta values for different objects. We then ran correlation analysis on the beta values, which is further explained below. Moreover, the BOLD signal data are time series, and we performed appropriate analysis such as an object correlation table for each subject and an autoregressive integrated moving average model to predict and fit the BOLD signal responses. Finally, in order to validate our models and data analysis, we looked at the mean squared errors to judge how well our model performs.
For correlation analysis, we divided all the runs into two groups (odd and even), and we aggregated all the results from each object within the group. Then, we computed the correlations between the average beta values of each object of each group and all objects in the opposite group. Take all the even runs where the subject was shown an image of a face as an example, we took the compiled array of beta values, and found the correlation between this array and that generated from all the odd runs where the subject was shown the images of a face, bottle, cat, chair, house, scissors, scrambledpix, and shoe respectively. This gives us 8 correlation values. By repeating the same procedures with all the other objects, we created a 8x8 correlation matrix.
Since we started our analysis with subject 1, run001, we will first display those results. We performed initial analysis to attempt to identify specific brain region for recognizing specific object, such as face or house. Here are the outliers we identified and removed.
Then, we generated the task time course with the event2neural function.
Afterwards, we performed convolution to generate predicted BOLD signals for this dataset.
These convolved results for each objects were used as parameters in our design matrix for linear regression. To avoid the drifting problem, we also included two drift parameters in the design matrix (as shown i Fig. 4). The final design matrix was arranged as followed: bottle, cat, chair, face, house, scissors, scrambledpix, shoe, drift1, drift2, and ones (for background)
As for the dataset, after removing the outliers and smoothing the data, we subset the brain region to perform analysis on brain areas. We then used the design matrix to run linear modeling on the brain subset to get beta values. These beta bata values helped us to identify specific brain area and pattern for a subject under specific visual stimulation like house:
As mentioned above, we want to run correlation analysis on the beta values. We first focus on the correlation analysis on a 2D brain slice. To do that, we selected the most responding 2D brain area to perform the analysis. In the initial analysis, we tried to compare between run1 and run2 of subject1. We expected to see high correlation within category and low correlation between category. For instance, the run1 house and run2 house should have high correlation, whereas the run1 house and run2 face should have a low correlation. The 2D brain slice for the corresponding conditions are shown here:
Unfortunately, after performing this initial study, we didnt get what we expected. However, some comparison across different runs showed our expected results. These observations indicated that there might be a variation in the results across runs. Therefore, to reduce the impact of variation across runs, we compiled the even run results and odd run results separately and used them to perform correlation analysis. The mean results for even run and odd runs showed a similar pattern:
The within category correlation increased and is relatively higher than between category correlation (as shown in appendix). Excluding maximally responding areas shown the similar pattern indicating the both the pattern of large and small responses and the location of responses carry category-related information.
Since the brain is a 3D structure, the visual recognition is processed in the entire brain. Therefore, we next focus on the 3D brain slice to perform correlation analysis. The correlation table between even and odd runs are listed below and shown as bar plots in appendix:
As shown in the tables above, none of the correlation values are perfectly 1. The diagonal values are correlations between each object with their respective odd or even runs (within-category correlations). All within-category correlation values are larger than 0.6. For bottle, cat, face, house, scrambledpix, and shoe, the respective within-category correlations are higher than all of that object's between-category correlations with other objects. For chair and scissors, the within-category correlations are not higher than all of their between-category correlations. Removing the maximally response area gave us the similar analytic results. This is consistent with our findings on 2D correlation analysis.
Next, to further study whether the observation is consistent across subjects, we performed the same analysis in all subjects of this dataset. Interestingly, the brain pattern for visual recognition is different across subjects. This might increase the variation for cross subjects analysis. The mean correlation table was then created to statistically study the correlation pattern across subjects with and without maximally response areas:
Since there is high variation across subjects, all the correlation numbers increased. However, the within category correlation was still relatively higher than between category correlation. The correlation without maximally responded areas shown similar pattern. This again confirmed our and Haxby et als finding that both the pattern of large and small responses and the location of responses carry category-related information.
In time series correlation analysis, some questions we wanted to answer were would the time series correlation tables be similar to ones produced from the linear regression process if we looked at the raw nii data, how would the correlations differ between the same and different objects, and was the experiment designed in such a way so that correlations are the approximately the same with different runs and subjects. The first step in the time series analysis was retrieving a particular subset of the brain from the raw data. This was done by getting the entire 4-D data, applying a mask on the 4-D data to remove any noises, and using the same subsetted brain as the linear regression process. Then, we averaged the sub- setted brain into an array that represents the BOLD signal responses from the beginning to the end of each run. The array has a length of 121, which is the same as the length of each run. Next, based on the condition files of each subject and run, we segmented the array and created a time series array for each of the eight objects. The condition files specified when an object was shown and for how long; thus, giving us a particular range to subset precisely. Finally, we grouped the object time series arrays into either even or odd runs. This allows us to compare averaged even and odd object runs to one another and generate Pearson Correlation coefficients. At the end of this analysis, six correlation tables were generated.
On the time-side analysis, we were able to fit autoregressive integrated moving average (ARIMA) models to the BOLD signal responses. This allowed us to see how good of a fit a time series model is compared to the raw data and predict BOLD signal responses without having to know an object a subject is looking at. To see if we could initally fit a time-side model, we looked at subject one run one's data. An autocorrelation and partial autocorrelation plot of the run's raw BOLD signals was generated.
There is a significant lag cut-off after lag two in the partial autocorrelation plot, which means the autoregressive order should be at least two. The cut-off in the autocorrelation plot was at lag four, indiciating the moving average order should be four. To ensure these are the right orders for the ARIMA model, an auto.Arima() function was used to automate the parameter search process. In subject one run one, the best ARIMA model was ARIMA(2, 0, 0). Upon fitting the model, the residual's autocorrelation and partial autocorrelation plot were looked at to ensure a goodness of fit.
In both plots, the significant lag cuts off after lag one; thus, showing that the residuals have no serial correlation. In addition, the Ljung-Box test was used to validate this assumption. It has a null hypothesis that the residuals are independently distributed and uncorrelated, with an altnerative hypothesis that they are correlated. The p-value came out to be 0.6071, so we accept the null. One trend that came up was that there were no periods of volatility in the residuals plot; thus, we could not look into an ARIMA-GARCH hybrid model.
Most of the within-category correlations we found are fairly similar than the paper's. Some of the values from our analysis are higher than the paper's, which could be due to the fact that we only analyzed correlations between runs for one subject, while the original research paper studied all subjects. It would make sense that correlations are higher when only analyzing brain activity from one brain.
Correlations between different objects (between- category correlations) seem fairly random, which also seem to be in line with the original paper's finding. We observe that our between-category correlation values also tend to be higher than those from the paper.
To test the reproducibility of the study, we also redid our correlation analysis after excluding the maximally-responsive voxels. With the exception of some within-category correlations, most correlation values across all categories dropped after we re- moved the maximally-responsive voxels. This might be an indication that most of the highly correlated areas tend to have higher responses.
Looking at the correlation tables, the first thing that we noticed was that the time series correlation tables were much more different from the linear regression correlation tables. Even though the same subset of the brain was used, the time series tables exclusively had negative correlations, higher correlations with dissimilar objects against same objects, and even negative correlations with the same objects.
These differences could have come from how the two processes were executed. For example, the linear regression process involved convolving the data and transforming the data with beta values, whereas the time series analysis only looked at the raw data. Thus, this is the most likely reason as to why the tables differed. The second thing the time series correlation tables had was that across the six subjects, no two subjects had similar correlation results. This confirms one of our hypothesis that the brain is a complex structure that will never give consistent results even with a well designed experiment. What this means is that the BOLD signal responses are not exactly 100% comparable and it should be expected to have contrasting correlation coefficients. One last point to bring up is that the BOLD signals were not consistent across a subject's runs. In one run, a cat BOLD signal would come out at an intensity of 10,800 and then in another, it would come out to 10,500. One question we wanted to answer was given the objects, can we predict the BOLD signal responses? Given that an object's BOLD signal was never the same, this showed us that we couldn't answer that particular question. Thus, we were unable to conclude anything regarding predicting a BOLD signal given the objects.
In terms of time series analysis, now that there is a good model for one particular run, we can go about answering the original questions we had with time-side analysis. When we plotted the actual to the fitted values, we see that the ARIMA model was able to capture the actual values very well. This allows us to get into forecasting and predicting future responses. However, one limiation with the python StatsModel package is that there are no forecasting methods available. Thus, no further progress could be done unless R was available as a tool.
There are a lot of problems that came up during our analysis. To begin with, since most of us have limited neuroscience knowledge, it was very difficult just to understand the study and the data itself. The research paper of the study uses specific and technical terms that make it hard to comprehend. More time was spent at the earlier stages rereading the original paper and exploring the data folders than expected, which slowed down our progress for analysis. Moreover, the values from correlation analysis did not come out as clean as we hoped, which also set our progress back. We did not end up have time to apply machine learning techniques for predictions as we originally planned to.
[1] J. V. Haxby et al., Distributed and Overlapping Representations of Faces and Objects in Ventral Temporal Cortex, Science, 293 (2001), pp. 2425-2430.
A special thanks to Dongping Zhang, Edith Ho, Mike Tran, and Tzu-Chieh Chen! Without their help, this project wouldn't have been possible.
Sub001 2D Correlation: Bottle
Sub001 2D Correlation: Cat
Sub001 2D Correlation: Chair
Sub001 2D Correlation: Face
Sub001 2D Correlation: House
Sub001 2D Correlation: Scissor
Sub001 2D Correlation: Scram
Sub001 2D Correlation: Shoe
Sub001 3D Correlation: Bottle
Sub001 3D Correlation: Cat
Sub001 3D Correlation: Chair
Sub001 3D Correlation: Face
Sub001 3D Correlation: House
Sub001 3D Correlation: Scissor
Sub001 3D Correlation: Scram
Sub001 3D Correlation: Shoe
Cross Subject 2D Correlation: Bottle
Cross Subject 2D Correlation: Cat
Cross Subject 2D Correlation: Chair
Cross Subject 2D Correlation: Face
Cross Subject 2D Correlation: House
Cross Subject 2D Correlation: Scram
Cross Subject 2D Correlation: Shoe
Cross Subject 2D Correlation: Scissors
Cross Subject 3D Correlation: Bottle
Cross Subject 3D Correlation: Cat
Cross Subject 3D Correlation: Cat
Cross Subject 3D Correlation: face
Cross Subject 3D Correlation: House
Cross Subject 3D Correlation: Scram
Cross Subject 3D Correlation: shoe
Cross Subject 3D Correlation: Scissors
A special shoutout to those who made this website possible!
A Bootstrap template was used to create this website. It can be found here.
The background image for the intro was found here.
The logos in the About section were found here.
PrettyR was used for R code syntax highlighting. You can find the tool here.
Below are where the thumbnail images for each project were found: