Summary

Project Motivation

Having an advanced degree has become a necessity for securing a job in many fields. However, it is difficult for applicants to predict their likelihood of being accepted to these academic programs. Applying to graduate school can cost students thousands of dollars in fees that may not always result in a return on investment. We want to help prospective students estimate the likelihood that they will be accepted to graduate programs in order to help students plan the number of applications to send out given their current academic standing.

Training and Testing

We scraped data from GradCafe.com using a Python script. The numeric attributes are Student GPA, Student GRE Quantitative Score, Student GRE Verbal Score, and Student GRE Writing Score. Moreover, we consider the rank of the 16 schools with the most observations as a numeric attribute. The nominal attributes are Student Residential Status (A: American; U: International, with US degree; I: International, without US degree), Graduate Program Type (PhD, Masters), School Name, Program Name and Result (Rejected/Accepted).

We removed observations that did not have GRE scores and were left with 10571 observations out of an initial 22582 observations. We then converted GRE scores on the 200 to 700 scale to scores on the 130 to 170 scale (effective August 1, 2011). We also made the school and program names consistent as sometimes a school was referred to by different names (e.g. MIT and Massachusetts Institute of Technology), or a program name contained typographical errors. For our initial training we marked unknown values as such and replaced unknown numeric values with the average of the attribute for the data set. The unknown values for the residential status remained ‘?’. In order to create training and test data sets, we randomized rows in the spreadsheet by creating a column of random numbers and sorting on this column. We used the top 70% of the observations for training, and the remaining 30% for testing. We used an online tool to convert the .csv file into an .arff file. We classified using Weka.

Solution and Key Results

We decided to explore the following two settings:

Training considering only the schools with most observations, testing on 30 % of the data. Can we predict the likelihood of being accepted into one of 16 graduate schools for engineering given school rank, school name, GPA, status, degree type, and GRE scores? The baseline accuracy (ZeroR) for the dataset is 50.3 %. RotationForest provided us with an accuracy of 71.2 %.
Training on the entire data set, testing on 30 % of the data. Can we predict the likelihood of being accepted to graduate school for engineering given GPA, status, degree type, and GRE scores? We obtained a baseline accuracy of 56.9 %. RotationForest provided us with an accuracy of 63.1 %.

In both settings, the decision tree indicated that the attribute with the highest information gain is the degree type. In order to determine how representative the contributors to GradCafe are of all applicants, we compared some of the attributes of our dataset to the average values of college students. The students in our sample had higher GPA and GRE scores in all areas. Furthermore, there was a significantly higher percentage of international students in our dataset. Thus, our data may only be representative of those students who use GradCafe, who may overall be more concerned about getting into graduate school.

Learning curve for the top school data set, where we used 10-fold cross validation on the combined training and test sets. The underlying model is RotationForest.

Learning curve for the entire data set, where we used 10-fold cross validation on the combined training and test sets. The underlying model is RotationForest.

Like what you see? Download the final report (PDF) below or try the model online and predict whether you will be accepted to a graduate engineering program.

Download final report (PDF)

SuMMARY

Project Motivation

Training and Testing

Solution and Key Results