**Applying multiple regression analysis to my data:**

**Is this valid? Does the output tell me anything useful about the interrelationship between dyslexia and academic confidence?**

Tops, W., Callens, M., Lammertyn, J., Van Hees, V., Brysbaert, M., 2012, Identifying students with dyslexia in higher education, *Annals of Dyslexia*, 62(3), 186-203.

The paper above, which I have recently digested, reports research at a Belgian university that explored ways to better identify students with dyslexia. It caused me to reflect on the process of multiple regression.

Although I had been aware of the power of multiple regression as a prediction tool, I had not considered how it might be useful for gaining a further understanding of my data.

Multiple regression requires one dependent variable, which is explored in relation to multiple independent variables with a view to generating a model that *predicts* a value for that dependent variable from multiple inputs, and that is a *better* predictor than the mean model. I want to consider how the multi-dimensional model of Dyslexia Index that my project has generated might be linked, as input (i.e. independent) variables, to Academic Behavioural Confidence as the output.

Laerd Statistics has guided me through the process of setting up and running a multiple regression in SPSS. To become acquainted with the procedure, I first ran the regression on the complete datapool. The regression appears valid and meaningful and the results are reported below.

But my interest is in using the multiple regression process to add substance to my argument that students with previously unidentified apparently dyslexia-like study characteristics present a higher level of Academic Behavioural Confidence than their dyslexia-identified peers.

So my plan is to run the regression again but on the split-group datapool. In this way, distinct regression models will be produced for each of my main research subgroups, ND and DI: students with no disclosed or declared dyslexia and students with identified dyslexia respectively.

Then, taking just the model generated from the subgroup of dyslexic students (DI), I will use it as the predictor of ABC scores for those students in the non-dyslexic subgroup who presented a Dyslexia Index of Dx > 592.5. I have set this value as the critical point for establishing the real research subgroup of interest: students with unidentified dyslexia-like attributes (research subgroup DNI).

I can then use the model to calculate the expected ABC values for each of the students in this research subgroup to see how this compares against the *actual* ABC value that has been measured using the eQNR. Then I can think about what this may be telling me :-/
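The selection and comparison step described above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the record layout (`id`, `group`, `dx`, `abc`, `dimensions`), the function name `compare_dni`, the sample respondents and the placeholder predictor `toy_predict` are all hypothetical, and only the threshold Dx > 592.5 comes from the text.

```python
from typing import Callable, Dict, List, Sequence

DX_THRESHOLD = 592.5  # critical Dx value quoted in the text


def compare_dni(
    respondents: List[Dict],
    predict: Callable[[Sequence[float]], float],
) -> List[Dict]:
    """Return expected vs actual ABC for ND respondents with Dx above the threshold."""
    rows = []
    for r in respondents:
        # Research subgroup DNI: in subgroup ND, but with Dx above the critical value.
        if r["group"] == "ND" and r["dx"] > DX_THRESHOLD:
            expected = predict(r["dimensions"])
            rows.append(
                {
                    "id": r["id"],
                    "expected_abc": expected,
                    "actual_abc": r["abc"],
                    "difference": r["abc"] - expected,
                }
            )
    return rows


# Hypothetical respondents, invented purely to exercise the function.
sample = [
    {"id": "a", "group": "ND", "dx": 610.0, "abc": 62.0, "dimensions": [50.0] * 20},
    {"id": "b", "group": "ND", "dx": 400.0, "abc": 70.0, "dimensions": [30.0] * 20},
    {"id": "c", "group": "DI", "dx": 700.0, "abc": 55.0, "dimensions": [60.0] * 20},
]


def toy_predict(dimensions: Sequence[float]) -> float:
    """Placeholder predictor; a real run would use the DI regression equation."""
    return 50.0 + 0.1 * (sum(dimensions) / len(dimensions))


rows = compare_dni(sample, toy_predict)
for row in rows:
    print(row)
```

A positive `difference` would mean the respondent's measured ABC is higher than the value the dyslexic-subgroup model expects for that profile.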

I am guided that these are the assumptions that must apply in order for a multiple regression analysis output to be appropriate and meaningful:

- I have a continuous dependent variable – I do, it is Academic Behavioural Confidence and I have measured it across a continuous scale 0 – 100;
- I have two or more continuous independent variables – I do, these are the 20 dimensions of dyslexia that together constitute my Dyslexia Index, Dx;
- I have independence of observations (i.e. independence of residuals) – SPSS checks this using the Durbin-Watson statistic and apparently an ‘ideal’ value is close to 2;
- There needs to be a linear relationship between both a) the dependent variable and each independent variable AND ALSO b) between the dependent variable and the independent variables collectively. This is checked in SPSS by plotting the studentized residuals against the (unstandardized) predicted values – exactly what these are is not important to understand at this stage but I will explore the theory behind this later;
- The data needs to show *homoscedasticity of residuals* – which means that the variances along the line of best fit remain broadly similar;
- My data must NOT show multicollinearity – this is where two (or more) variables are highly correlated with each other. The test for this is by inspecting the matrix of correlation coefficients, looking for any values of > 0.7 (these *would* be indicating multicollinearity) and by inspecting the Tolerance/VIF values where we are looking for a value of < 0.1 for the tolerance (equivalent to VIF > 10 – they are reciprocals of each other) and if these occur, there exists an element of collinearity that needs dealing with;
- The data should present no significant outliers, high leverage points or highly influential points and these are all classifications of the effects that unusual points can have on the regression output. SPSS provides guidance about how to identify and deal with these cases should they occur.
- We need the distribution of residuals – that is, errors – to be approximately normally distributed. SPSS helps with this by mapping out a histogram with a superimposed normal curve, but also presents a P-P plot (too complicated and boring to explain here).
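Two of the checks in the list above reduce to simple arithmetic, and a small Python sketch may make them more concrete. It is not what SPSS runs internally; the function names are hypothetical, and taking the absolute value of each correlation (so that strong negative correlations are flagged too) is an assumption added here rather than something stated above.

```python
from typing import List, Sequence, Tuple


def durbin_watson(residuals: Sequence[float]) -> float:
    """Sum of squared successive differences divided by the sum of squared residuals."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den


def vif_from_tolerance(tolerance: float) -> float:
    """VIF is the reciprocal of Tolerance, so Tolerance < 0.1 corresponds to VIF > 10."""
    return 1.0 / tolerance


def high_correlations(
    corr: Sequence[Sequence[float]], limit: float = 0.7
) -> List[Tuple[int, int]]:
    """Return index pairs of distinct variables whose |correlation| exceeds the limit."""
    flagged = []
    for i in range(len(corr)):
        for j in range(i + 1, len(corr)):
            if abs(corr[i][j]) > limit:
                flagged.append((i, j))
    return flagged


# Hypothetical residuals and correlation matrix, invented for illustration only.
example_residuals = [1.0, -1.0, 1.0, -1.0]
example_corr = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]

print(durbin_watson(example_residuals))
print(vif_from_tolerance(0.1))
print(high_correlations(example_corr))
```

The Durbin-Watson statistic ranges from 0 to 4; residuals that alternate in sign, as in the toy example, push it above 2, while values near 2 suggest no first-order autocorrelation.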

So that’s it. What follows first is the output and reporting for the *second* multiple regression run, where I instructed SPSS to use the split-group data. Assumptions 1 and 2 both apply, so the results reported for each subgroup begin with Assumption 3:

- **Assumption 3: SATISFIED** – for research subgroup ND there was **independence of residuals** as assessed by a Durbin-Watson statistic of 1.913; for research subgroup DI, Durbin-Watson = 1.915, also indicating independence of residuals;
- **Assumption 4: SATISFIED** – by plotting scattergraphs of the standardized residuals against the unstandardized predicted values, both research subgroups, ND and DI, presented approximately linear relationships;
- **Assumption 5: SATISFIED** – pretty much; there was strong evidence of **homoscedasticity** as identified from the scatterplots constructed for assumption 4 above;
- **Assumption 6: SATISFIED** – for neither of the research subgroups ND nor DI were there any correlations between independent variables of > 0.7;
- **Assumption 7: SATISFIED** – for neither research subgroup were there any **Tolerance** values < 0.1 (equivalently, no **VIF** values > 10);
- **Assumption 8: partially SATISFIED** – firstly, for neither research subgroup did SPSS output a ‘Casewise Diagnostics’ table, which indicates that there were no residuals in either research subgroup of more than +/-3 standard deviations. Additionally, checking for values beyond +/-3 in the SPSS datacolumn SRE **confirmed no significant outliers**. Checking for **leverage points** by inspecting SPSS datacolumn LEV indicated some values between 0.2 and 0.5, considered to be ‘risky’ to include, with three above the critical 0.5 level, considered ‘dangerous’ to include. These were associated with respondents: #98294854; #54513534; #16517091. (**RUN THE REGRESSION AGAIN (later) WITH THESE RESPONDENTS REMOVED**: task to be included in the formal data analysis write-up.) **HOWEVER:** by inspecting SPSS datacolumn COO, **no values occurred for Cook’s > 1**, which suggests that even though some leverage points were identified, these are unlikely to have been significantly influential;
- The final ‘test’ is for **normality**: for research subgroup ND, a somewhat normal distribution was indicated on the plot for standardized residuals, although bi-modal peaks were evident. However, the P-P plot showed good alignment with the line of best fit. For research subgroup DI, the histogram plot for standardized residuals showed an approximately normal distribution, also indicated on the P-P plot where only small deviations from the line of best fit were shown.
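The screening rules applied under Assumption 8 can be written down as a few threshold checks. The sketch below uses the cut-offs quoted above (|SRE| > 3, leverage 0.2 and 0.5, Cook's distance > 1); the function names and the label "safe" for leverage below 0.2 are assumptions made for illustration, and the sample values are invented.

```python
def is_outlier(studentized_residual: float, limit: float = 3.0) -> bool:
    """Flag a case whose residual lies more than `limit` standard deviations from zero."""
    return abs(studentized_residual) > limit


def classify_leverage(leverage: float) -> str:
    """Band a leverage value using the 0.2 and 0.5 cut-offs quoted in the text."""
    if leverage > 0.5:
        return "dangerous"
    if leverage >= 0.2:
        return "risky"
    return "safe"  # label assumed here for values below 0.2


def is_influential(cooks_distance: float, limit: float = 1.0) -> bool:
    """Flag a case whose Cook's distance exceeds the limit."""
    return cooks_distance > limit


# Hypothetical (SRE, LEV, COO) triples, invented purely for illustration.
cases = [(0.4, 0.12, 0.01), (-1.8, 0.35, 0.09), (2.1, 0.61, 0.40)]
for sre, lev, coo in cases:
    print(is_outlier(sre), classify_leverage(lev), is_influential(coo))
```

A case can sit in the "dangerous" leverage band and still have a small Cook's distance, which is the situation reported for the three respondents listed above.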

**Determining how well the model fits the data:**

- For research subgroup ND:

The model summary table presented by SPSS indicated a **moderate to strong** multiple correlation of **0.738** between the scores predicted by the regression model and the actual datapoints.

The corresponding **coefficient of determination**, which measures the **proportion of variance** in the dependent variable which is explained by the independent variables, was **0.545**, indicating that 54.5% of the variance in Academic Behavioural Confidence in this sample is explained by the independent variables in the regression model. The **Adjusted R Square** value, which estimates the proportion of variance in the dependent variable that the model would be expected to explain in the background population, was **0.427** (that is, 42.7% of the ABC variance) and **this is also an indication of the EFFECT SIZE**.

- For research subgroup DI:

The model summary table presented by SPSS indicated a **moderate to strong** multiple correlation of **0.721** between the scores predicted by the regression model and the actual datapoints.

The corresponding **coefficient of determination**, which measures the **proportion of variance** in the dependent variable which is explained by the independent variables, was **0.520**, indicating that 52.0% of the variance in Academic Behavioural Confidence in this sample is explained by the independent variables in the regression model. The **Adjusted R Square** value, which estimates the proportion of variance in the dependent variable that the model would be expected to explain in the background population, was **0.316** (that is, 31.6% of the ABC variance) and **this is also an indication of the EFFECT SIZE**.
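The gap between R Square and Adjusted R Square comes from a standard correction for the number of predictors relative to the sample size: adjusted R² = 1 – (1 – R²)(n – 1)/(n – p – 1). A short Python sketch of that formula follows. The subgroup sample sizes are not given in this post, so the values n = 98 and n = 68 used below are assumptions, chosen only because with p = 20 predictors they give results close to the figures SPSS reported.

```python
def adjusted_r_squared(r_squared: float, n: int, p: int) -> float:
    """Adjust R squared for p predictors fitted on n cases."""
    if n - p - 1 <= 0:
        raise ValueError("need more cases than predictors plus one")
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - p - 1)


P = 20  # the 20 Dyslexia Index dimensions used as predictors

# Sample sizes are ASSUMED for illustration; they are not stated in the post.
nd_adjusted = adjusted_r_squared(0.545, n=98, p=P)
di_adjusted = adjusted_r_squared(0.520, n=68, p=P)

print(round(nd_adjusted, 3), round(di_adjusted, 3))
```

With 20 predictors, the penalty grows quickly as the subgroup gets smaller, which is consistent with the larger drop from R Square to Adjusted R Square in subgroup DI.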

Thus the multiple regression I have conducted through SPSS on the split-group datapool appears to be strong and valid.

**Prediction Model**

These are the linear regression model equations which can be used to predict values of Academic Behavioural Confidence based on inputs from Dyslexia Index dimensions 1 to 20.

Research Subgroup DI:

ABC = 53.332 + 0.074(Dx01) – 0.080(Dx02) – 0.017(Dx03) + 0.043(Dx04) + 0.170(Dx05) – 0.144(Dx06) + 0.058(Dx07) – 0.018(Dx08) + 0.110(Dx09) + 0.122(Dx10) + 0.046(Dx11) + 0.001(Dx12) + 0.071(Dx13) – 0.046(Dx14) + 0.006(Dx15) – 0.050(Dx16) + 0.003(Dx17) + 0.005(Dx18) – 0.168(Dx19) – 0.028(Dx20)

Research Subgroup ND:

ABC = 58.432 – 0.037(Dx01) – 0.071(Dx02) + 0.036(Dx03) + 0.020(Dx04) + 0.162(Dx05) – 0.085(Dx06) + 0.095(Dx07) – 0.136(Dx08) – 0.013(Dx09) – 0.040(Dx10) + 0.015(Dx11) + 0.027(Dx12) + 0.066(Dx13) – 0.039(Dx14) + 0.029(Dx15) + 0.079(Dx16) + 0.077(Dx17) + 0.045(Dx18) – 0.107(Dx19) + 0.008(Dx20)
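The two equations above can be evaluated outside SPSS with a few lines of Python. In this sketch the intercepts and coefficients are copied from the equations; the names `DI_MODEL`, `ND_MODEL` and `predict_abc`, and the dimension scores in the example profile, are hypothetical and included only to show the arithmetic.

```python
from typing import Sequence, Tuple

Model = Tuple[float, Sequence[float]]  # (intercept, coefficients for Dx01..Dx20)

DI_MODEL: Model = (53.332, [
    0.074, -0.080, -0.017, 0.043, 0.170, -0.144, 0.058, -0.018, 0.110, 0.122,
    0.046, 0.001, 0.071, -0.046, 0.006, -0.050, 0.003, 0.005, -0.168, -0.028,
])

ND_MODEL: Model = (58.432, [
    -0.037, -0.071, 0.036, 0.020, 0.162, -0.085, 0.095, -0.136, -0.013, -0.040,
    0.015, 0.027, 0.066, -0.039, 0.029, 0.079, 0.077, 0.045, -0.107, 0.008,
])


def predict_abc(model: Model, dimensions: Sequence[float]) -> float:
    """Evaluate intercept + sum(coefficient * dimension score) for one respondent."""
    intercept, coefficients = model
    if len(dimensions) != len(coefficients):
        raise ValueError("expected one score per Dyslexia Index dimension")
    return intercept + sum(b * x for b, x in zip(coefficients, dimensions))


# Hypothetical profile: every dimension scored 50 (invented for illustration).
profile = [50.0] * 20
print(round(predict_abc(DI_MODEL, profile), 3))
print(round(predict_abc(ND_MODEL, profile), 3))
```

Keeping each model as an intercept plus an ordered coefficient list makes it straightforward to apply the DI model to any respondent's 20 dimension scores and set the result beside the ABC value actually recorded.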

The next (exciting!) step is to apply the Regression Model for research subgroup DI to students in research subgroup DNI – that is, students in a subgroup of research subgroup ND who presented a Dyslexia Index of Dx > 592.5 – and compare the outcomes with the ACTUAL Academic Behavioural Confidence values recorded by each of these respondents in the QNR returns. The table below presents the results:
