Statistical Learning and Applied Modeling in Education

Examples and Concerns

Jared Knowles

01-31-2014

Motivation

  • Of the two modeling cultures, we've tended to focus overwhelmingly on one
  • Increases in computational power are changing everything
  • Data is growing, and bigger data brings different kinds of problems
  • Prediction is underused and undervalued, and this undermines inference

The Data Modeling Culture

  • Starts philosophically from the idea that we can write down a set of predictors X that describe Y with a known functional form, which we then test
  • The black box between X and Y can be known because the data generating process (DGP) is some functional combination of predictors, parameters, and noise
  • Model fit is judged by goodness-of-fit and residual tests


The Algorithmic Modeling Culture

  • The black box is unknowable: we are not modeling nature, but seeking to use similar inputs to predict the outputs of the natural process
  • Model fit is measured by predictive accuracy


The Wisconsin Dropout Early Warning System

“To help keep all kids on a path to graduation, we just delivered - with no new funding - a new statewide Dropout Early Warning System, called DEWS, to all districts. DEWS makes it possible to identify kids who may be at risk, and allows districts to intervene as early as middle school.” ~ Tony Evers

DEWS

  • The Dropout Early Warning System for the Wisconsin Department of Public Instruction
  • Leverages DPI's administrative records to provide predictions on student high school completion while students are in middle grades
  • Communicates the results to school staff in all Wisconsin public schools serving students in the middle grades
  • Comes with an interpretive guide and strategies for success available online
  • Released in September of 2013, with twice-yearly updates in April and August
  • Serves as a good example of where social science and applied modeling intersect

Why DEWS?

  • Every child a graduate, college and career ready – Agenda 2017, DPI's strategic plan
  • DEWS focuses on giving schools and districts early notice of whether a student is likely to complete high school on time
  • DEWS uses data on historical cohorts of Wisconsin students to link middle grade outcome data with the long-term outcome of on-time graduation
  • DEWS provides a relatively accurate assessment of the likelihood of on-time graduation for individual students across the state

Graduation and Dropout By the Numbers

9,092 students in 2010-11 did not graduate with their cohort.

Group             Expected   Graduates   Rate    Gap vs. White (pct. pts.)
White               54,468      49,783   91.4%   -
American Indian      1,027         737   71.7%   19.7
Asian                2,517       2,225   88.4%    3.0
Black                6,889       4,395   63.8%   27.6
Hispanic             4,751       3,420   72.0%   19.4
Total               69,652      60,560   86.9%   -

What is DEWS?

DEWS is an applied statistical model that combines several major features:

  • Data import, filtering, and cleaning for analysis from the state longitudinal database
  • A machine learning algorithm to search for the best predictive model
  • A prediction routine to apply models to current students
  • An exporting feature to push predictions into the state business intelligence tool, WISEdash for Districts
  • A display layer available to schools and districts securely for exploring the results
  • In reality, it resembles software as much as a statistical analysis

Under the Hood of DEWS

DEWS consists of several sub-routines that can be thought of as stages of building a statistical model:

  1. Data acquisition
  2. Data cleaning, normalizing, and standardizing
  3. Model feature and model algorithm search
  4. Model testing
  5. Model selection
  6. New case scoring
  7. Prediction export for reporting

All modules are built in the free and open source statistical computing language, R.
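
To give a flavor of what stages 2 through 6 look like in R, here is a minimal, self-contained sketch on simulated data. It uses the caret package as one possible model-search tool; the variable names and the candidate model list are invented for illustration, not taken from the actual DEWS code.

```r
## Sketch of stages 2-6: clean data -> search models -> test -> select -> score
library(caret)  # method = "rf" below also requires the randomForest package

set.seed(42)
n <- 2000
students <- data.frame(
  attendance = runif(n, 0.6, 1),
  gpa        = rnorm(n, 2.8, 0.6),
  discipline = rpois(n, 0.5)
)
p_grad <- plogis(-10 + 10 * students$attendance + 1.2 * students$gpa -
                   0.8 * students$discipline)
students$grad <- factor(ifelse(runif(n) < p_grad, "grad", "nongrad"))

## Stage 2: split historical cohorts into training and test data
idx      <- createDataPartition(students$grad, p = 0.75, list = FALSE)
train_df <- students[idx, ]
test_df  <- students[-idx, ]

## Stages 3-5: search over candidate algorithms, select by held-out accuracy
ctrl <- trainControl(method = "cv", number = 5, classProbs = TRUE)
fits <- list(
  logit = train(grad ~ ., data = train_df, method = "glm", trControl = ctrl),
  rf    = train(grad ~ ., data = train_df, method = "rf",  trControl = ctrl)
)
acc    <- sapply(fits, function(f) mean(predict(f, test_df) == test_df$grad))
winner <- fits[[which.max(acc)]]

## Stage 6: score "new" students with the winning model
predict(winner, head(test_df), type = "prob")
```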

DEWS by the Numbers

  • Analyzes over 250,000 historical records of student graduation
  • Provides predictions on over 180,000 students in the state
  • Produces predictions on students in over 1,000 schools
  • Selects from over 50 candidate statistical models per grade
  • Hundreds of users have accessed thousands of individual student reports across nearly every Wisconsin school district
  • Working on open sourcing the code
  • Being explored in Michigan, New Jersey, and school districts in Kansas, Montana, and Minnesota

DEWS as an Applied Model

Data and Computing Trends

  • Available data in education is growing astronomically
  • People are talking about things like “data science” and “big data”, even at the NSF
  • Data sources are shifting from national surveys to administrative records
  • More data = more problems; more data + more sources = more problems\( ^{2} \)

Increased Computational Power


Increased data size and complexity leads to new problems that increased computational power often helps to solve.

Examples of Challenges and Solutions Posed by Computation

  • Bigger datasets have highly complex structures, such as deep hierarchies, cross-classifications, and high collinearity
    • Methods like HLM have difficulty scaling to the 3, 4, or 5 levels that may exist within a statewide data system
    • Cross-nested and cross-classified observations are common in observational data, and difficult for many approaches to handle
    • Alternative methods like Bayesian mixed-effect regression or regression trees are more CPU intensive, but more flexible (see the sketch below)
    • With 12 regions, 72 counties, 424 districts, 2,200 schools, tens of thousands of classrooms, and hundreds of thousands of students, the modeling data structure is complex
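
As a small illustration of the nesting problem, here is a sketch of a three-level mixed-effect model in lme4 on simulated data; the variable names are hypothetical, and at statewide scale with 4 or 5 levels the same call becomes far slower and more fragile.

```r
## Students within schools within districts within regions; each "/" in the
## random-effects term adds another variance component to estimate.
library(lme4)

set.seed(1)
d <- expand.grid(region   = factor(1:4),
                 district = factor(1:6),
                 school   = factor(1:5),
                 student  = 1:8)
d$econ_disadv <- rbinom(nrow(d), 1, 0.4)
d$score <- 250 - 8 * d$econ_disadv +
  rnorm(4,  0, 6)[as.integer(d$region)] +
  rnorm(24, 0, 4)[as.integer(interaction(d$region, d$district))] +
  rnorm(nrow(d), 0, 15)

## Three nested levels already take noticeably longer than a flat lm();
## deeper hierarchies and crossed terms compound the cost.
m <- lmer(score ~ econ_disadv + (1 | region/district/school), data = d)
summary(m)
```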

Straining our Generalized Linear Models

  • An increased number of predictors lets us build models of complex group interactions, which can completely separate the outcome
  • Parameter estimates of a demographic indicator are invalid when the indicator is not observed in each outcome category (it is perfectly collinear with the outcome)
  • Quasi-separation can occur when separation is nearly, but not exactly, complete
  • Corrections exist (Bayesian estimation, Firth bias-correction) to adjust for the fact that maximum likelihood estimates are invalid in this case
  • Again, leveraging computation to address a problem of increased data complexity (illustrated below)
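
A minimal simulated illustration of separation, using base R's glm() and the logistf package's Firth-penalized fit as one of the corrections mentioned above; the variable names and data are invented for the example.

```r
## Simulated (quasi-)separation: a rare indicator never appears with the
## positive outcome, so the ML estimate for its coefficient diverges.
library(logistf)  # Firth bias-corrected logistic regression

set.seed(7)
n <- 200
d <- data.frame(indicator = rbinom(n, 1, 0.05))
d$dropout <- ifelse(d$indicator == 1, 0, rbinom(n, 1, 0.3))

## Maximum likelihood: the coefficient heads toward -Inf, with an
## enormous standard error
fit_ml <- glm(dropout ~ indicator, family = binomial, data = d)
coef(summary(fit_ml))

## Firth's penalized likelihood keeps the estimate finite
fit_firth <- logistf(dropout ~ indicator, data = d)
coef(fit_firth)
## arm::bayesglm(), with weakly informative priors, is another available fix
```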

DEWS

  • DEWS data has a complicated hierarchical structure
  • DEWS data has rare cases that have to be addressed (e.g. blind students) across most indicators
  • Using CPU-intensive techniques can work, but is not limitless – some models are too slow to develop, modify, evaluate, and implement
  • As it is, DEWS takes about 48 hours to build data and models, test them, select the winners, and produce predictions for current students
  • But in the future… who knows?

Being a Modeling Pluralist

Schools of statistical thought are sometimes jokingly likened to religions. This analogy is not perfect - unlike religions, statistical methods have no supernatural content and make essentially no demands on our personal lives. Looking at the comparison from the other direction, it is possible to be agnostic, atheistic, or simply live one's life without religion, but it is not really possible to do statistics without some philosophy. ~ Andrew Gelman

What is a statistical model?

  • “All models are wrong, but some are useful” ~ George Box
  • Statistical models are mathematical summaries of correlations and probabilities of known data
  • Being wrong is a feature of a statistical model; the goal is to explain as much of the data as possible with as few variables as possible
  • The most common in the social sciences is the linear regression model
  • Sometimes the goal is inference and other times it is prediction

Statistical Modeling

It is useful to remember that in all statistical modeling we are looking at the following relationship:

\[ \hat{Y} = \hat{f}(X) \]

In this case \( \hat{f} \) represents our estimate of the function that links \( X \) and \( Y \). In traditional linear modeling, \( \hat{f} \) takes the form:

\[ \hat{Y} = \hat{\alpha} + \hat{\beta}X \]

However, limitless alternative forms of \( \hat{f} \) exist for us to explore. Applied modeling techniques help us expand the \( \hat{f} \) space we search within.
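
As a small sketch of what expanding that space looks like in R: two candidate \( \hat{f} \) fit to the same simulated data, one restricted to a straight line and one far more flexible. The data here are invented for illustration.

```r
## Two estimates of f-hat for the same X and Y
set.seed(3)
x <- runif(200, 0, 10)
y <- sin(x) + 0.1 * x^2 + rnorm(200, 0, 0.5)

f_linear <- lm(y ~ x)            # f-hat constrained to alpha + beta * X
f_smooth <- smooth.spline(x, y)  # a much wider search over smooth f-hats

## In-sample mean squared error: the flexible f-hat tracks the curvature
mean(residuals(f_linear)^2)
mean((y - predict(f_smooth, x)$y)^2)
```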

Functional forms


Figure adapted from James et al. 2013 (figure 2.7)

Buyer Beware

A big computer, a complex algorithm and a long time does not equal science. ~ Robert Gentleman

Statistical Learning or Statistical Inference?

The line between statistical learning and statistical inference has always been blurry. A few questions can help:

  • Am I interested in accurately predicting outcomes for cases I have not observed, based on what I have learned from my sample?
  • Am I interested in the relationships among the parameters in my sample because of a theory I am testing, or because of how they can explain an outcome I am interested in?
  • Is the data I am using common and relatively untransformed? Will new data be created regularly that I can fit the same model to and update?

Why the Difference?

Algorithmic Models:

  • Provide information to users about what to expect given certain data
  • Serve many goals including prediction of non-observed outcomes, summarizing large datasets, measuring uncertainty
  • Goals for the model are defined by explicit tradeoffs

Data Models:

  • Focused on understanding patterns in the current data
  • Seek to understand how current data extrapolates to a population
  • Estimate, from sample data, population parameters that describe relationships between inputs and outputs

Predicting Dropout

Algorithmic Models:

  • Data: Regularly collected at specific timepoints, standardized
  • Many cohorts with common data
  • Interested in learning which students today are likely to drop out in the future
  • Want: Confident predictions on likely graduation of new students, used to decide how to allocate resources and services to students

Data Models:

  • Data: national survey data, unlikely to be collected on future observations
  • One cohort is followed in the data set
  • Interested in learning if social and emotional concerns are more important than academic success in predicting graduation
  • Want: unbiased and precise estimates of parameters and, if possible, the ability to make causal claims

On Prediction

  • Note that prediction is important in both cases
  • In data models, making a good prediction is a sign that our theory has explanatory power
  • In algorithmic models, making a good prediction is a sign that we have approximated the natural process correctly
  • In both cases, we should care deeply about prediction and think carefully about measuring it

On Nails, Hammers, and Models

The best available solution to a data problem might be a data model; then again it might be an algorithmic model. The data and the problem guide the solution. To solve a wider range of data problems, a larger set of tools is needed. ~ Leo Breiman

Some Vocabulary

  • Training data: the data the model is fit to (the analytical sample)
  • Test data: the data the model predicts, used to evaluate model fit
  • Bias (error): the amount of error due to simplifying a complex process
  • Variance (error): the amount \( \hat{f} \) would change if fit to a different training set of data
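
These pieces combine in the bias-variance decomposition; following James et al. 2013, the expected test error at a new point \( x_0 \) is:

\[ E\left[ (y_0 - \hat{f}(x_0))^2 \right] = \mathrm{Var}(\hat{f}(x_0)) + \left[ \mathrm{Bias}(\hat{f}(x_0)) \right]^2 + \mathrm{Var}(\epsilon) \]

More flexible models lower bias but raise variance; since \( \mathrm{Var}(\epsilon) \) is irreducible, minimizing test error means balancing the first two terms.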

The Challenge

  • When using a statistical model to make predictions we have to think clearly about the data we use to build the model, and the data we will be making predictions about
  • We may build a model with high internal validity for the data at hand, but that data may not be representative of the data the model will apply to
  • Error on the data at hand is the training error; error on the data the model will apply to is the test error
  • In inferential statistics we often seek to reduce training error and not concern ourselves with test error
  • In applied modeling we focus on finding the optimal tradeoff between variance and bias in order to reduce test error
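
A minimal sketch of that tradeoff in R, using simulated data and polynomial fits of increasing flexibility; the data and the range of degrees are invented for illustration.

```r
## Training error always falls with flexibility; test error does not
set.seed(11)
n <- 300
x <- runif(n, -2, 2)
y <- x^3 - x + rnorm(n)  # the true f is a cubic
dat <- data.frame(x, y)
in_train <- runif(n) < 0.7
tr <- dat[in_train, ]
te <- dat[!in_train, ]

errs <- t(sapply(1:9, function(deg) {
  fit <- lm(y ~ poly(x, deg), data = tr)
  c(degree = deg,
    train  = mean((tr$y - predict(fit))^2),
    test   = mean((te$y - predict(fit, newdata = te))^2))
}))
round(errs, 3)  # past degree 3, extra flexibility fits noise, not signal
```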

A Simple Motivating Example

[Figure: time series of Apple stock prices, with the model-fitting portion shown in blue]

Forecasting Apple Stock Could be Useful

  • Fit a model on the earlier part of the data (in blue)