Guidelines for the Accreditation Panel’s deliberations
The charge to the panel
In evaluating the program’s evidence for each component of
the TEAC system, the panel has two tasks: (1) to eliminate, if possible,
the plausible rival hypotheses for the interpretation of the evidence
that undermine its validity; and (2) to determine how much evidence
is sufficient to support the claim that the program satisfies the
system’s elements.
In this respect, TEAC panel members are like jurors
in the American judicial process, who must determine whether the
evidence rises to a level that satisfies a legal standard. Whereas
the legal standard may require, for example, evidence of intent,
the evidence that supports the claim of intent resists a clear-cut
standard in the traditional sense of some bright line between intention
and no intention. TEAC Accreditation Panel members, like jurors,
must weigh the evidence and decide if the evidence is sufficient
to certify that the program merits accreditation, provisional, new
program, preaccreditation status, or a continuation of candidate
status.
TEAC defines the standard for each element and
component of its system as the point, as determined by the Accreditation
Panel, at which competing and rival claims can be ruled out, the
point at which the evidence is conclusive, clear and convincing,
and the point below which the evidence is insufficient, flawed,
or inconsistent.
In practice, given the state of scholarship in
education, the TEAC standard of evidence is met when the evidence
cited in the Brief is consistent with the claims made about student
learning and when there is little or no credible evidence that is
inconsistent with the claims.
How the panel makes its decision
Although TEAC’s quality principles and standards for program
capacity suggest the characteristics of a quality program, they
do not offer sure rules or algorithms to follow that would determine
whether or not the evidence that a program has these characteristics
is trustworthy and sufficient.
For this reason, to establish that a program has
met TEAC’s principles and standards, TEAC employs heuristics
to guide the accreditation decision making and judgment about whether
or not the evidence of student learning is trustworthy (determined
by the audit team) and sufficient (determined by the Accreditation
Panel and Accreditation Committee).
TEAC’s heuristics guide the determination
of whether or not the cited evidence of student learning, for example,
is accurate and trustworthy; is, in fact, evidence of what it purports
to be; and is sufficient to support the program faculty’s
claims for student learning.
Ruling out rival hypotheses. The
panel members represent several roles in the profession because
their diversity makes it more likely that they can bring forward
alternative explanations of the evidence presented in the Brief.
The panel conceptually tests the evidence in the Brief to see if
these alternatives can be ruled out, or shown to be inconsistent
with the claims made in the Brief.
Sufficiency of the evidence. The
panel then determines whether the evidence that survives these tests
is of sufficient magnitude. It does this, in the absence of any
other guidance, by applying a heuristic of 75 percent.
The 75 percent heuristic is a guide to assist the
panel in its determination of evidentiary sufficiency in cases where
there are no other guides provided in the TEAC principles and standards
or by research standards or findings from the scholarship in education.
The 75 percent heuristic is applied to the evidence
that is presented in the Brief. It is applied, in other words, to
the evidence the faculty truly relies upon. It is also applied to
corroborating, or disconfirming, evidence that was uncovered by
the auditors and presented in the audit report.
TEAC elements. The panel must
determine whether or not the program meets TEAC’s quality
principles and standards of capacity for quality. For this decision,
TEAC has adopted a part/whole heuristic. This heuristic calls for
the panel to consider the components of each element, make a decision
about each, and move on successively to the consideration of each
element in the TEAC system until the panel can determine by vote
the program’s conformity to one of the TEAC accreditation
categories.
The heuristics that the panel uses to determine the sufficiency
of evidence, to determine that the program meets TEAC’s quality
principles and capacity standards, and to make the accreditation
recommendation are described below. (Also see the heuristic
tables.)
Ruling out rival hypotheses and determining sufficiency
of evidence
The panel begins its work by attempting to reduce the credibility
of the obvious rival hypothesis of chance--that the evidence the
program presents in the Brief is simply what would have been expected
by chance, and not the result of what the program faculty claim. Generally,
the role of unsystematic or random factors and “noise”
can be reduced, or substantially eliminated, when the Brief has
evidence supporting the reliability of the assessment procedures
used to generate the evidence in the first place. This is the logic
behind Quality Principle II (component
2.2).
Threats to reliability
The panel considers several threats to the reliability and validity
of the evidence in the Brief. One threat is from unsystematic factors
that introduce errors that plague much of the evidence in education.
- For example, if a program faculty were to claim that 20 percent
of the board-certified teachers in its state are graduates of
its program, the panelists would wonder whether or not this
was merely what would be expected by chance. If the program
had prepared 60 percent or more of the teachers in the state,
20 percent or more could be expected by chance alone. Had only
1 percent of the teachers in the state graduated from the program,
it would be unlikely that the 20 percent board-certified teacher
rate could be dismissed as just what would have been expected
by chance. Had the program faculty missed this point, incidentally,
the formative evaluation or the audit could be expected to have
examined it by way of corroborating the evidence in the Brief.
- To take another example, if the distribution of scores of
the program’s graduates on the state’s license examination
were on the order of the variation in scores that would be expected
by chance, the program faculty or the panel would make nothing
of them. There are, of course, several statistical techniques
for assessing the degree to which chance is a compelling rival
hypothesis that would account adequately for the evidence in
the Brief.
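One way to make the chance comparison in the first example concrete is an exact binomial tail probability. The sketch below is only an illustration under stated assumptions: the counts (500 board-certified teachers, 100 of them program graduates) and the function name are hypothetical, and the model treats each board-certified teacher as an independent draw from the state's teachers.

```python
from math import comb

def binomial_upper_tail(successes: int, trials: int, base_rate: float) -> float:
    """P(X >= successes) when X ~ Binomial(trials, base_rate)."""
    return sum(
        comb(trials, k) * base_rate**k * (1 - base_rate) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical counts: 500 board-certified teachers in the state,
# 100 of whom (20 percent) graduated from the program.
certified = 500
from_program = 100

# If the program prepared 1 percent of the state's teachers, chance alone
# predicts about 5 of the 500; a count of 100 is very unlikely by chance.
p_small_program = binomial_upper_tail(from_program, certified, 0.01)

# If the program prepared 60 percent of the state's teachers, chance alone
# predicts about 300 of the 500; a count of 100 or more is unremarkable.
p_large_program = binomial_upper_tail(from_program, certified, 0.60)
```

A very small tail probability suggests chance is not a compelling rival hypothesis; a probability near 1 suggests the reported share is no more than chance would produce.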
Regression to the mean is a statistical artifact
associated with the retesting of those who had extremely high or
low scores. These retested scores can be expected to shift by chance
towards the group’s average or mean score as a consequence
of the statistical error properties of extreme scores, and not as
a consequence of what might be claimed by the program faculty.
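Regression to the mean can be illustrated with a small simulation. The sketch below assumes a hypothetical measurement model (a stable true score plus independent error on each occasion, with made-up means and spreads); it is not drawn from any TEAC data.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical model: each observed score is a stable "true" score plus
# independent measurement error on each testing occasion.
true_scores = [rng.gauss(500, 50) for _ in range(5000)]
first_test = [t + rng.gauss(0, 50) for t in true_scores]
second_test = [t + rng.gauss(0, 50) for t in true_scores]

# Select the examinees with the lowest 10 percent of first-test scores.
cutoff = sorted(first_test)[len(first_test) // 10]
low_group = [i for i, s in enumerate(first_test) if s < cutoff]

low_first = mean(first_test[i] for i in low_group)
low_second = mean(second_test[i] for i in low_group)

# With no instruction at all between the two tests, the retest average of
# the low scorers still moves up toward the overall mean of about 500.
apparent_gain = low_second - low_first
```

In this sketch the low scorers appear to gain on retest even though nothing happened between the two occasions, which is why a gain reported only for an extreme group invites this rival hypothesis.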
Ruling out rival hypotheses
The next step in the deliberation calls for the panelists to attempt
to rule out rival hypotheses that are rooted in systematic errors
that might be embedded in the evidence cited in the Brief. Campbell
and Stanley have identified several sources of systematic error
that could reduce the validity of the evidence cited in a Brief.
Those potentially related to a Brief are recounted below.
For every data point (mean, count, frequency, etc.)
reported to advance the credibility of a claim associated with Quality
Principle I, the panel members should ask themselves the
following questions.
1. Representative data.
Are the measures reported truly representative of the program’s
students and graduates? At least two rival hypotheses or factors
come into play in deliberating on this question and each needs to
be ruled out:
a) Is there a “selection” factor? Is the
evidence in the Brief about only a select and unrepresentative
group of students and graduates? If a program reports 100 percent
pass rate on a license examination, or an average score at the
85th percentile, but it is only for some of its students, the
panelists cannot easily rule out the rival hypothesis that evidence
may have more to do with the selection of the students than with
accomplishments of the entire group about which the claims are
made. It may be that the evidence cited in the Brief is only about
full-time students when the majority of students are part-time
attendees, or it may be about only those who work in the state when
most of the graduates work elsewhere, or it may be about only
the in-state residents, when substantial portions were out-of-state
enrollees, or it may exclude transfer students, or it may exclude
dual majors, etc.
b) Is there a “dropout” factor? This
question is quite similar to the selection factor, because it
refers to the possibility that the evidence is restricted to a
particular select group--in this case, those who secured a teaching
position. This factor might show itself in gain score evidence.
Here a rival hypothesis for the gains reported in an Inquiry
Brief would be that the gains in average scores, for example,
were not really gains in accomplishment on anyone’s part,
but only evidence that the weaker students were not hired as teachers
and were not counted. Or it might be the case that the evidence
of accomplishment of the program’s graduates might only
be based on the more able graduates who gained employment immediately
upon graduation. It might not be evidence that was representative
of all of the students who completed the program.
The panel determines that the statistics and findings
are relevant to the populations about which the claims are advanced
and not just some part of the population that does not truly represent
the population of students or graduates.
2. Measurement errors and
influence. Are the procedures and assessments used by the
program faculty to collect the evidence reported in the Brief themselves
a factor in the evidence? Do they rival the claims the faculty seeks
to make about the evidence? Again, the panel members should take
at least three factors into consideration.
a) Is the assessment itself a factor? Do raters
get tired as they rate large numbers of students, so their discriminations
become less accurate over time? Is there “observer bias?”
Is care taken to shield raters and observers from having a bias
(positive or negative) toward the program or toward its graduates?
Are the reviewers “blind?” Are they disinterested
parties? Do they have the opportunity to rate students in the
program and those not in the program? Do they have the opportunity
to rate students near the finish of their program as well as those
just beginning?
Is there variation in the calibration of the
assessment instrument from one time to another so that a score
gain is nothing more than a recalibration effect (as in the new
SAT, for example)? Has the cut-score, or the scale range, been
changed so that gains in pass rate, or even absolute scores, are
meaningless? Is the true zero score known? A score of 170 out
of 190 may look impressive if the zero score is truly zero, but
not if the zero score (as in some Praxis tests) is set at 150.
Has there been grade inflation over the period of the program’s
reporting? Are grades given for reasons other than academic accomplishment,
such as attendance, punctuality, honesty, effort, or extra work?
The results from surveys, as noted earlier, are
known to be affected by the order in which questions were presented,
the context in which questions appeared, whether the questions
weed out those with no opinion (filtering), the range and order
of choices, whether middle categories were provided, whether the
format was open or closed, and so forth.
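The true-zero question raised above (a score of 170 out of 190 on a scale whose floor is 150) comes down to simple arithmetic. The helper below is a hypothetical illustration, not a TEAC procedure; it expresses a score as a fraction of the scale's actual range.

```python
def fraction_of_range(score: float, floor: float, ceiling: float) -> float:
    """Position of a score within the scale's actual range, from 0.0 to 1.0."""
    return (score - floor) / (ceiling - floor)

# The example in the text: a score of 170 on a scale that tops out at 190.
naive = fraction_of_range(170, 0, 190)       # treats zero as the floor
adjusted = fraction_of_range(170, 150, 190)  # floor of 150, as on some Praxis tests
```

Read against a floor of zero, 170 looks like roughly nine-tenths of the scale; read against a floor of 150, it sits at the midpoint of the 40-point range.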
b) Is there a testing factor? Testing
itself is a factor, for example, when the students taking the
test, or being rated with a checklist, have experienced the ratings
and received feedback many, many times prior to the occasion reported
in the Brief. Repeated testing, while perhaps a component of an
effective evaluation system, renders the measures hard to interpret
because the reported effects may be more parsimoniously accounted
for as practice effects, i.e., the result of the student’s
experience or practice with the test. Related to the testing
factor is the Hawthorne effect, namely the finding that testing
or observation itself, independently of what is being tested,
is a factor that affects the results of the test or observation
(i.e., the mere looking or measuring itself has an effect on what
is being measured).
Next, drawing on their professional expertise,
the panel members consider (and, presumably, reject) any other
rival hypotheses. For example, any number of events, and interactions
among those events, could have intervened between one measurement
and another. Many of these events are candidates for hypotheses
that rival the one the faculty has advanced in its Brief, and
the panel members should bring them forward in the discussion
and deliberations so that they may be eliminated.
Determining sufficiency
The final step in the deliberation comes after the panel has satisfied
itself that there are no surviving plausible rival hypotheses. At
this stage, the panel would also have concluded that the TEAC standard
of evidence is met because the evidence is consistent with the claims,
and there is little credible evidence in the Brief or in the audit
report that is inconsistent with the claims. The question that remains,
however, is whether the evidence, which has survived the challenges
cited above, is sufficient to support the claims that TEAC requires
to satisfy the quality principles and standards of capacity.
To determine sufficiency, the panel applies a 75
percent heuristic to the evidence as a guide. This heuristic is
applied in instances where there is no other guide provided by TEAC
or by the state-of-the-art practices and standards of contemporary
scholarship.
Why use the 75 percent
heuristic? The field has established very few metrics for
magnitude, but it has some, like the universally used, although
not uncontested, criterion for statistical significance. Among them:
- A probability less than .05 is the research standard used to
establish that an event probably happened for some reason other
than chance.
- Reliability coefficients for individually administered
standardized tests are generally found in the .90 range, and in
the .80 range for group-administered standardized tests.
- The best validity coefficients are about .50 (e.g., between IQ
and school grades).
- Universities and colleges typically require a 2.0 minimum index
out of 4.0 for graduation.
- States have set the Praxis I cut scores around 170 out of 190
(where the zero score is 150).
- The academic major is typically 30 credits, the academic minor
is usually 15 credits, the semester is 14 to 15 weeks, the BA or
BS degree is rarely less than 120 credits, the master’s degree is
about 30 graduate credits, and so forth.
By and large, however, the field has not committed
itself to a minimum magnitude for the measures it uses, and it has
rarely validated the few minimums it has set. So, the question remains
for the panelists: how much is enough to support the claim that
Quality Principle I has been satisfied, or how much stability
or consistency is enough to support the claim that a measure is
reliable, or how large does the association need to be between two
measures to support the claim that they are measuring more or less
the same thing, and so forth?
Therefore, in areas where there is no other guidance,
TEAC employs a 75 percent heuristic as a guide to solve these problems;
that is, 75 percent of whatever measure is cited in the Brief is
a good guide to the amount or magnitude that would be sufficient
to meet TEAC’s standard. The panel applies the 75 percent
heuristic to whatever measure the program cites as evidence.
When to use the 75 percent
heuristic. The panel should apply the 75 percent heuristic
to the empirical maximum, not the theoretical maximum.
- For example, one Praxis test has a ceiling score of 990, but,
in fact, no one out of 27,000 test takers scores higher than
790. The panelists would apply the 75 percent heuristic to this
empirical ceiling of 790, not to the theoretical maximum of 990.
Because the highest reliability coefficients in the literature
are about .90, the TEAC heuristic would accept .68 as the lowest
index of reliability; because the best validity coefficients are
about .50, it would accept about .38 as the lowest index of
validity. The lowest mean grade index on a four-point scale would
be 3.0 by the heuristic, but only if there were a reasonable
number of 4.0 scorers, for example. The empirical maximum, if it
is not otherwise known, may be established by determining the
average score (frequencies, counts, etc.) of the top 10 percent
of scorers.
- If the program reports the mean score on a standardized test,
the 75 percent heuristic would be applied to the maximum empirical
score. For example, if the program reported a mean score of 170
on Praxis I (math), which ranges from 150 to 190, the panelists
would take 75 percent of the 40 point spread (or 30 points) and
be guided not to accept mean scores less than 180 as sufficient
evidence (not 75 percent of 190, which would be the much lower score of about 142).
If, however, the program reported only pass rates (as currently
required under Title II), and not the mean score, then the panel
would determine sufficiency by considering 75 percent of the pass
rates for the top 10 percent of programs. Thus, if the average
pass rate of the top 10 percent of programs were 95 percent, a
program’s pass rate of about 71 percent would be sufficient.
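The arithmetic in the two examples above can be written out as a short sketch. The function names are hypothetical and chosen only for illustration; the figures are the ones quoted in the text, and the heuristic remains a guide for judgment, not an automatic rule.

```python
from statistics import mean

HEURISTIC = 0.75  # the 75 percent heuristic from the text

def empirical_max(scores: list[float], top_share: float = 0.10) -> float:
    """Average of the top `top_share` of scores (at least one score)."""
    ranked = sorted(scores, reverse=True)
    n_top = max(1, int(len(ranked) * top_share))
    return mean(ranked[:n_top])

def threshold_from_max(maximum: float) -> float:
    """75 percent of an empirical maximum on a scale whose floor is zero."""
    return HEURISTIC * maximum

def threshold_from_range(floor: float, ceiling: float) -> float:
    """Floor plus 75 percent of the spread, for scales with a nonzero floor."""
    return floor + HEURISTIC * (ceiling - floor)

praxis_ceiling_floor = threshold_from_max(790)   # 592.5, from the empirical ceiling
reliability_floor = threshold_from_max(0.90)     # about 0.675, reported as .68
validity_floor = threshold_from_max(0.50)        # 0.375, reported as about .38
grade_floor = threshold_from_max(4.0)            # 3.0
praxis_floor = threshold_from_range(150, 190)    # 150 + 30 = 180.0
pass_rate_floor = threshold_from_max(95)         # 71.25 percent
```

Applying the heuristic to the 40-point spread, and not to the raw ceiling of 190, is what separates the 180 guide from the much lower figure near 142.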
It would also be appropriate for the panelists to apply the 75
percent heuristic to the preponderance of the evidence standard,
as TEAC has left the determination of what constitutes “preponderance”
to the panel’s judgment. The panel, using the 75 percent heuristic,
would accept as sufficient evidence of commitment a case where at
least 75 percent of the program’s measures meet the parity
standard (no appreciable difference between the norms of the program
and the institution with regard to the standards of capacity).
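As a rough sketch of the preponderance check just described, the helper below counts how many measures meet the parity standard. The function name and the eight-measure example are hypothetical. Whether any single measure meets parity is decided separately, by the field's own procedures for judging differences; the 75 percent heuristic applies only to the share of measures that do.

```python
def meets_preponderance(parity_results: list[bool], share: float = 0.75) -> bool:
    """True when at least `share` of the measures meet the parity standard."""
    if not parity_results:
        return False  # no measures, nothing to support the claim
    return sum(parity_results) / len(parity_results) >= share

# Hypothetical program with eight capacity measures, six of which show
# no appreciable difference from the institutional norm.
example = [True, True, True, True, True, True, False, False]
result = meets_preponderance(example)  # six of eight is exactly 75 percent
```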
When not to use
the 75 percent heuristic. The panel employs the 75 percent
heuristic only in the absence of any other guidance with regard
to the magnitude of what would constitute a sufficient or adequate
amount for TEAC’s principles and standards.
- TEAC requires, for example, the program faculty to address in
its Brief all the components of the TEAC system (1.0-4.7), not
just 75 percent of them.
- TEAC requires that the preponderance of evidence for commitment
show no appreciable differences between the institutional norm
and the program norm. Because the field has established procedures
for determining if differences are trivial or significant, it
would not be appropriate for the panelists to apply the 75 percent
heuristic to the parity requirement. The panel would not accept
as evidence of commitment a case where the program norm was 75
percent of the institutional norm rather than, as TEAC requires,
only trivially different from it.
The 75 percent heuristic is not a rule or an algorithm; it is
only a guide to assist the panel in determining the sufficiency
of the evidence with regard to any claim made in the Brief. It
cannot be a rule or algorithm because, if it were applied
automatically to all the evidence, it could lead to serious errors.
For example:
- Some regions of the country have such teacher shortages that
nearly 100 percent of graduates who wish to teach will find
teaching positions. In such a region, a 75 percent hiring rate
might actually indicate a significant weakness in the program,
not the strength that the program faculty may be alleging. If a
program in a region with teacher shortages were to base a claim
of program quality on hiring rates, the panel would need to be
free to consider a more demanding standard than 75 percent. If
the panel did so, it would ensure that it applied its logic
even-handedly to all programs during the period in which there
was a teaching shortage in a region.
- If there were evidence of grade or score inflation, the panel
would need to be free to consider a higher magnitude than 75
percent of the top grade or score as a measure of sufficient
evidence. On the other hand, the panel needs to be free to
consider a lower magnitude for programs that have resisted grade
inflation pressures and held to an older standard in which the
modal grade at the institution and program for satisfactory work
is a C or 2.0. In other cases, the 75 percent guideline may not
reflect the grade index a program may have actually determined
through careful studies of predictive and concurrent validity.