Friday, November 4, 2016

Project 3 Analyze Week - Using independent variables and OLS regression to predict methamphetamine lab locations

We were given extra time for this lab and I can see why! We began with 31 independent variables and ran 20 OLS regressions, each time removing one variable and analyzing its effect on the outcome. Before showing the final OLS regression result, I want to outline my methodology.

The first three of the six checks determined whether each independent variable was helping or hurting the model, whether its relationship to the dependent variable was in line with expectations, and whether any explanatory variables were redundant. The first item evaluated was the probability (p-value); ideally this value should be as small as possible, which indicates statistical significance. We used a cut-off of >0.4 for removing independent variables. The next check was the Variance Inflation Factor (VIF), which indicates whether multiple variables affect the model in similar ways (redundancy). Here we set the baseline for removal candidacy at >7.5. The third check was the variable's coefficient. A strongly positive or strongly negative coefficient indicates a strong relationship between the dependent and the independent variable. A coefficient near zero (absolute value less than 1) indicates that the variable has minimal effect on the model and may need to be removed. By analyzing these three criteria together at each iteration, I could determine the impact that removing each variable had on the model. In some cases a removed variable may need to be returned to the regression even if it first appeared unimportant, because each iteration produces new results that affect all of the variables.

After twenty iterations, the remaining three checks were applied. Check 4 determined whether the model was biased. Here bias means non-linear trends, outliers or skewed results. Conveniently, the results of each OLS run include the Jarque-Bera statistic, which tests whether the residuals are normally distributed. If its p-value was <0.05 and marked with an asterisk, it was an indicator of bias.
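As a rough illustration, checks 1 through 4 can be sketched in Python. The cut-offs (0.4, 7.5 and 1) are the ones we used in the lab, but the variable names and numbers are made up, and the per-variable diagnostics are assumed to have been copied out of the OLS report rather than computed here. The Jarque-Bera p-value relies on the statistic being approximately chi-square with 2 degrees of freedom, so treat it as a sketch and not as a reproduction of the OLS tool's output.

```python
import math

# Cut-offs from the lab; everything else in this block is hypothetical.
P_CUTOFF = 0.4
VIF_CUTOFF = 7.5
COEF_CUTOFF = 1.0


def removal_reasons(coef, p_value, vif):
    """Return which of checks 1-3 flag a variable as a removal candidate."""
    reasons = []
    if p_value > P_CUTOFF:
        reasons.append("probability")
    if vif > VIF_CUTOFF:
        reasons.append("VIF")
    if abs(coef) < COEF_CUTOFF:
        reasons.append("coefficient")
    return reasons


def jarque_bera(residuals):
    """Check 4: Jarque-Bera statistic and approximate p-value for residuals."""
    n = len(residuals)
    mean = sum(residuals) / n
    m2 = sum((r - mean) ** 2 for r in residuals) / n
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    m4 = sum((r - mean) ** 4 for r in residuals) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # Upper tail of a chi-square distribution with 2 degrees of freedom.
    p_value = math.exp(-jb / 2.0)
    return jb, p_value


# Hypothetical diagnostics: name -> (coefficient, probability, VIF)
diagnostics = {
    "pct_rural": (3.2, 0.01, 2.1),
    "median_income": (-0.2, 0.55, 1.8),
    "pct_vacant": (4.7, 0.03, 9.4),
}
candidates = {name: removal_reasons(*vals) for name, vals in diagnostics.items()}
candidates = {name: why for name, why in candidates.items() if why}

# Hypothetical residuals from one regression run.
jb, p = jarque_bera([-2.0, -1.0, 0.0, 1.0, 2.0])
biased = p < 0.05
```

A variable that shows up in `candidates` is only a candidate for removal; as noted above, it may still need to go back into the model after a later iteration.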
Scatter plots and graphs helped identify the independent variables that might be causing the bias so that they could be adjusted in the next regression; combined with histograms, these tools made potential issues quick to spot. Any bias found would suggest rerunning the OLS routines to improve the results. Check 5 confirmed that important independent variables had not been removed. By examining the standardized residuals on the map generated by the OLS routine, the range of results could be seen graphically. Ideally, a standardized residual between -0.5 and +0.5 indicates an accurate prediction. It is important to note that a positive standardized residual means the model predicted fewer meth labs than were identified in the original data; conversely, a negative standardized residual means the model predicted more meth labs than were identified in the original data. The final check, check 6, reviewed the model's ability to predict the dependent variable (in this case meth lab density). This was judged by the R-squared value: the higher the value, the more of the variation in the dependent variable the model explains.
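Checks 5 and 6 can be sketched the same way. The observed and predicted densities below are invented for illustration, the residual is assumed to be observed minus predicted, and the standardization simply divides each residual by the sample standard deviation of all residuals, which may differ slightly from what the OLS tool does internally.

```python
import statistics


def standardized_residuals(observed, predicted):
    """Residuals (observed minus predicted) scaled by their standard deviation."""
    residuals = [o - p for o, p in zip(observed, predicted)]
    spread = statistics.stdev(residuals)
    return [r / spread for r in residuals]


def r_squared(observed, predicted):
    """Share of the variation in the observed values explained by the model."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical meth lab densities for five areas.
observed = [1.0, 3.0, 5.0, 7.0, 9.0]
predicted = [1.0, 3.0, 5.0, 7.0, 7.0]

std_resid = standardized_residuals(observed, predicted)
# Check 5: areas whose standardized residual falls outside -0.5 to +0.5.
outside = [i for i, r in enumerate(std_resid) if abs(r) > 0.5]
# Check 6: overall fit.
r2 = r_squared(observed, predicted)
```

Here only the last area falls outside the -0.5 to +0.5 band, which is the kind of location that would stand out on the residual map.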


The resulting map from the final regression is:
