Teacher Education Accreditation Council

Guidelines for the Accreditation Panel’s deliberations

The charge to the panel
In evaluating the program’s evidence for each component of the TEAC system, the panel has two tasks: (1) to eliminate, if possible, the plausible rival hypotheses for the interpretation of the evidence that undermine its validity; and (2) to determine how much evidence is sufficient to support the claim that the program satisfies the system’s elements.

In this respect, TEAC panel members are like jurors in the American judicial process, who must determine whether the evidence rises to a level that satisfies a legal standard. Whereas the legal standard may require, for example, evidence of intent, the evidence that supports the claim of intent resists a clear-cut standard in the traditional sense of some bright line between intention and no intention. TEAC Accreditation Panel members, like jurors, must weigh the evidence and decide if the evidence is sufficient to certify that the program merits accreditation, provisional, new program, preaccreditation status, or a continuation of candidate status.

TEAC defines the standard for each element and component of its system as the point, as determined by the Accreditation Panel, at which competing and rival claims can be ruled out, the point at which the evidence is conclusive, clear and convincing, and the point below which the evidence is insufficient, flawed, or inconsistent.

In practice, given the state of scholarship in education, the TEAC standard of evidence is met when the evidence cited in the Brief is consistent with the claims made about student learning and when there is little or no credible evidence that is inconsistent with the claims.

How the panel makes its decision
Although TEAC’s quality principles and standards for program capacity suggest the characteristics of a quality program, they do not offer sure rules or algorithms to follow that would determine whether or not the evidence that a program has these characteristics is trustworthy and sufficient.

For this reason, to establish that a program has met TEAC’s principles and standards, TEAC employs heuristics to guide the accreditation decision making and judgment about whether or not the evidence of student learning is trustworthy (determined by the audit team) and sufficient (determined by the Accreditation Panel and Accreditation Committee).

TEAC’s heuristics guide the determination of whether or not the cited evidence of student learning, for example, is accurate and trustworthy; is, in fact, evidence of what it purports to be; and is sufficient to support the program faculty’s claims for student learning.

Ruling out rival hypotheses. The panel members represent several roles in the profession because their diversity makes it more likely that they can bring forward alternative explanations of the evidence presented in the Brief. The panel conceptually tests the evidence in the Brief to see if these alternatives can be ruled out, or shown to be inconsistent with the claims made in the Brief.

Sufficiency of the evidence. The panel then determines whether the evidence that survives these tests is of sufficient magnitude. It does this, in the absence of any other guidance, by applying a heuristic of 75 percent.

The 75 percent heuristic is a guide to assist the panel in its determination of evidentiary sufficiency in cases where there are no other guides provided in the TEAC principles and standards or by research standards or findings from the scholarship in education.

The 75 percent heuristic is applied to the evidence that is presented in the Brief. It is applied, in other words, to the evidence the faculty truly relies upon. It is also applied to corroborating, or disconfirming, evidence that was uncovered by the auditors and presented in the audit report.

TEAC elements. The panel must determine whether or not the program meets TEAC’s quality principles and standards of capacity for quality. For this decision, TEAC has adopted a part/whole heuristic. This heuristic calls for the panel to consider the components of each element, make a decision about each, and move on successively to the consideration of each element in the TEAC system until the panel can determine by vote the program’s conformity to one of the TEAC accreditation categories.

Heuristics that the panel uses to determine the sufficiency of evidence, to determine that the program meets TEAC’s quality principles and capacity standards, and to make the accreditation recommendation are described below. (Also see heuristic tables.)

Ruling out rival hypotheses and determining sufficiency of evidence
The panel begins its work by attempting to reduce the credibility of the obvious rival hypothesis of chance--that the evidence the program presents in the Brief is simply what would have been expected by chance, and not by what the program faculty claim. Generally, the role of unsystematic or random factors and “noise” can be reduced, or substantially eliminated, when the Brief has evidence supporting the reliability of the assessment procedures used to generate the evidence in the first place. This is the logic behind Quality Principle II (component 2.2).

Threats to reliability
The panel considers several threats to the reliability and validity of the evidence in the Brief. One threat is from unsystematic factors that introduce errors that plague much of the evidence in education.

  • For example, if a program faculty were to claim that 20 percent of the board-certified teachers in its state are graduates of its program, the panelists would wonder whether or not this was merely what would be expected by chance. If the program had prepared 60 percent or more of the teachers in the state, 20 percent or more could be expected by chance alone. Had only 1 percent of the teachers in the state graduated from the program, it would be unlikely that the 20 percent board-certified teacher rate could be dismissed as just what would have been expected by chance. Had the program faculty missed this point, incidentally, the formative evaluation or the audit could be expected to have examined it by way of corroborating the evidence in the Brief.
  • To take another example, if the distribution of scores of the program’s graduates on the state’s license examination were on the order of the variation in scores that would be expected by chance, the program faculty or the panel would make nothing of them. There are, of course, several statistical techniques for assessing the degree to which chance is a compelling rival hypothesis that would account adequately for the evidence in the Brief.
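One way to put a number on the board-certification example above is a binomial tail probability. The sketch below is only an illustration, not a TEAC procedure: the function name and the counts (200 board-certified teachers, 40 of them program graduates) are assumptions made up for the example. It asks how likely it is that 20 percent or more of the board-certified teachers would be program graduates if certification were unrelated to the program, given the share of the state's teachers the program prepared.

```python
from math import comb


def chance_tail_probability(n_certified, n_from_program, program_share):
    """Return P(X >= n_from_program) for X ~ Binomial(n_certified, program_share).

    program_share is the fraction of the state's teachers the program prepared.
    A large result means chance alone could easily explain the observed count.
    """
    return sum(
        comb(n_certified, k)
        * program_share ** k
        * (1 - program_share) ** (n_certified - k)
        for k in range(n_from_program, n_certified + 1)
    )


# Hypothetical counts: 40 of 200 board-certified teachers (20 percent)
# graduated from the program.
for share in (0.60, 0.20, 0.01):
    p = chance_tail_probability(200, 40, share)
    print(f"program prepared {share:.0%} of teachers: P(chance) = {p:.3g}")
```

When the program prepared 60 percent of the state's teachers, the probability is close to 1 and chance cannot be ruled out; when it prepared only 1 percent, the probability is vanishingly small.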

    Regression to the mean is a statistical artifact associated with the retesting of those who had extremely high or low scores. These retested scores can be expected to shift by chance towards the group’s average or mean score as a consequence of the statistical error properties of extreme scores, and not as a consequence of what might be claimed by the program faculty.
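Regression to the mean can be seen in a small simulation. The sketch below is illustrative only: it assumes each observed score is a stable "true" score plus independent random error, and every number in it (mean of 170, standard deviations of 10, 5,000 students) is invented for the example. Nothing changes between the two testings, yet the extreme groups drift toward the average on retest.

```python
import random
from statistics import mean


def simulate_retest(n_students=5000, noise_sd=10.0, seed=0):
    """Test the same students twice with no intervention in between.

    Returns first-test and retest means for the bottom and top 10 percent
    of scorers on the first test, plus the overall first-test mean.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    true_scores = [rng.gauss(170, 10) for _ in range(n_students)]
    first = [t + rng.gauss(0, noise_sd) for t in true_scores]
    second = [t + rng.gauss(0, noise_sd) for t in true_scores]

    # Rank students by their first-test score and take the two extremes.
    order = sorted(range(n_students), key=lambda i: first[i])
    cut = n_students // 10
    low, high = order[:cut], order[-cut:]

    return {
        "overall_first": mean(first),
        "low_first": mean([first[i] for i in low]),
        "low_second": mean([second[i] for i in low]),
        "high_first": mean([first[i] for i in high]),
        "high_second": mean([second[i] for i in high]),
    }


results = simulate_retest()
for name, value in results.items():
    print(f"{name}: {value:.1f}")
```

A program that retested only its lowest scorers would see an apparent gain of this kind even if it had taught them nothing, which is why the panel treats it as a rival hypothesis.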

    Ruling out rival hypotheses
    The next step in the deliberation calls for the panelists to attempt to rule out rival hypotheses that are rooted in systematic errors that might be embedded in the evidence cited in the Brief. Campbell and Stanley have identified several sources of systematic error that could reduce the validity of the evidence cited in a Brief. Those potentially related to a Brief are recounted below.

    For every data point (mean, count, frequency, etc.) reported to advance the credibility of a claim associated with Quality Principle I, the panel members should ask themselves the following questions.

    1. Representative data. Are the measures reported truly representative of the program’s students and graduates? At least two rival hypotheses or factors come into play in deliberating on this question and each needs to be ruled out:

    a) Is there a “selection” factor? Is the evidence in the Brief about only a select and unrepresentative group of students and graduates? If a program reports a 100 percent pass rate on a license examination, or an average score at the 85th percentile, but only for some of its students, the panelists cannot easily rule out the rival hypothesis that the evidence may have more to do with the selection of the students than with the accomplishments of the entire group about which the claims are made. It may be that the evidence cited in the Brief is only about full-time students when the majority of students are part-time attendees, or it may be about only those who work in-state when most of the graduates work elsewhere, or it may be about only the in-state residents when substantial portions were out-of-state enrollees, or it may exclude transfer students, or it may exclude dual majors, etc.

    b) Is there a “dropout” factor? This question is quite similar to the selection factor, because it refers to the possibility that the evidence is restricted to a particular select group--in this case, those who secured a teaching position. This factor might show itself in gain score evidence. Here a rival hypothesis for the gains reported in an Inquiry Brief would be that the gains in average scores, for example, were not really gains in accomplishment on anyone’s part, but only evidence that the weaker students were not hired as teachers and were not counted. Or it might be the case that the evidence of accomplishment of the program’s graduates might only be based on the more able graduates who gained employment immediately upon graduation. It might not be evidence that was representative of all of the students who completed the program.

    The panel determines that the statistics and findings are relevant to the populations about which the claims are advanced and not just some part of the population that does not truly represent the population of students or graduates.

    2. Measurement errors and influence. Are the procedures and assessments used by the program faculty to collect the evidence reported in the Brief themselves a factor in the evidence? Do they rival the claims the faculty seeks to make about the evidence? Again, the panel members should take at least three factors into consideration.

    a) Is the assessment itself a factor? Do raters get tired as they rate large numbers of students, so their discriminations become less accurate over time? Is there “observer bias”? Is care taken to shield raters and observers from having a bias (positive or negative) toward the program or toward its graduates? Are the reviewers “blind”? Are they disinterested parties? Do they have the opportunity to rate students in the program and those not in the program? Do they have the opportunity to rate students near the finish of their program as well as those just beginning?

    Is there variation in the calibration of the assessment instrument from one time to another so that a score gain is nothing more than a recalibration effect (as in the new SAT, for example)? Has the cut-score, or the scale range, been changed so that gains in pass rate, or even absolute scores, are meaningless? Is the true zero score known? A score of 170 out of 190 may look impressive if the zero score is truly zero, but not if the zero score (as in some Praxis tests) is set at 150. Has there been grade inflation over the period of the program’s reporting? Are grades given for reasons other than academic accomplishment, such as attendance, punctuality, honesty, effort, or extra work?
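The point about the true zero score can be made concrete with a little arithmetic. The sketch below uses the figures from the paragraph above (a score of 170 on a scale that tops out at 190 and effectively starts at 150); the function name is hypothetical and chosen only for illustration.

```python
def fraction_of_scale(score, scale_min, scale_max):
    """Return where a score falls within the scale's real range (0.0 to 1.0)."""
    if scale_max <= scale_min:
        raise ValueError("scale_max must be greater than scale_min")
    return (score - scale_min) / (scale_max - scale_min)


# Treating zero as the floor makes 170 out of 190 look like roughly 89 percent.
naive = fraction_of_scale(170, 0, 190)

# Using the actual floor of 150, the same score sits at the midpoint of the scale.
rescaled = fraction_of_scale(170, 150, 190)

print(f"naive: {naive:.1%}, rescaled: {rescaled:.1%}")
```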

    The results from surveys, as noted earlier, are known to be affected by the order in which questions were presented, the context in which questions appeared, whether the questions weed out those with no opinion (filtering), the range and order of choices, whether middle categories were provided, whether the format was open or closed, and so forth.

    b) Is there a testing factor? Testing itself is a factor, for example, when the students taking the test, or being rated with a checklist, have experienced the ratings and received feedback many, many times prior to the occasion reported in the Brief. Repeated testing, while perhaps a component of an effective evaluation system, renders the measures hard to interpret because the reported effects may be more parsimoniously accounted for as practice effects, i.e., the result of the student’s experience or practice with the test. Related to the testing factor is the Hawthorne effect, namely the finding that testing or observation itself, independently of what is being tested, is a factor that affects the results of the test or observation (i.e., the mere looking or measuring itself has an effect on what is being measured).

    Next, drawing on their professional expertise, the panel members consider (and, presumably, reject) any other rival hypotheses. For example, any number of events, and the interaction of events, could have intervened between one measurement and another. Many of these events are candidates for hypotheses that rival the one the faculty has advanced in its Brief, and the panel members should bring them forward in the discussion and deliberations so that they may be eliminated.

    Determining sufficiency
    The final step in the deliberation comes after the panel has satisfied itself that there are no surviving plausible rival hypotheses. At this stage, the panel would also have concluded that the TEAC standard of evidence is met because the evidence is consistent with the claims, and there is little credible evidence in the Brief or in the audit report that is inconsistent with the claims. The question that remains, however, is whether the evidence, which has survived the challenges cited above, is sufficient to support the claims that TEAC requires to satisfy the quality principles and standards of capacity.

    To determine sufficiency, the panel applies a 75 percent heuristic to the evidence as a guide. This heuristic is applied in instances where there is no other guide provided by TEAC or by the state-of-the-art practices and standards of contemporary scholarship.

    Why use the 75 percent heuristic? The field has established very few metrics for magnitude, but it has some, like the universally used, although not uncontested, criterion for statistical significance:

    • A probability less than .05 is the research standard used to establish that an event probably happened for some reason other than chance.
    • Reliability coefficients for individually administered standardized tests are found generally in the .90 range, and in the .80 range for group administered standardized tests.
    • The best validity coefficients are about .50 (e.g., between IQ and school grades).
    • Universities and colleges typically require a 2.0 minimum index out of 4.0 for graduation.
    • States have set the Praxis I cut scores around 170 out of 190 (where the zero score is 150).
    • The academic major is typically 30 credits, the academic minor is usually 15 credits, the semester is 14 to 15 weeks, the BA or BS degree is rarely less than 120 credits, the master’s degree is about 30 graduate credits, and so forth.

    By and large, however, the field has not committed itself to a minimum magnitude for the measures it uses, and it has rarely validated the few minimums it has set. So, the question remains for the panelists: how much is enough to support the claim that Quality Principle I has been satisfied, or how much stability or consistency is enough to support the claim that a measure is reliable, or how large does the association need to be between two measures to support the claim that they are measuring more or less the same thing, and so forth?

    Therefore, in areas where there is no other guidance, TEAC employs a 75 percent heuristic as a guide to solve these problems; that is, 75 percent of whatever measure is cited in the Brief is a good guide to the amount or magnitude that would be sufficient to meet TEAC’s standard. The panel applies the 75 percent heuristic to whatever measure the program cites as evidence.

    When to use the 75 percent heuristic. The panel should apply the 75 percent heuristic to the empirical maximum, not the theoretical maximum.

    • For example, one Praxis test has a ceiling score of 990, but, in fact, no one out of 27,000 test takers scores higher than 790. The panelists would apply the 75 percent heuristic to the empirical ceiling of 790, not to the 990 maximum score. Because the highest reliability coefficients in the literature are about .90, the TEAC heuristic would accept .68 as the lowest index of reliability and, because the best validity coefficients are about .50, about .38 as the lowest index of validity. The lowest mean grade index on a four-point scale would be 3.0 by the heuristic, but only if there were a reasonable number of 4.0 scorers, for example. The empirical maximum, if it is not otherwise known, may be established by determining the average score (frequencies, counts, etc.) of the top 10 percent of scorers.

    • If the program reports the mean score on a standardized test, the 75 percent heuristic would be applied to the maximum empirical score. For example, if the program reported a mean score of 170 on Praxis I (math), which ranges from 150 to 190, the panelists would take 75 percent of the 40-point spread (or 30 points) and be guided not to accept mean scores less than 180 as sufficient evidence (not 75 percent of 190, which would give the much lower score of 142). If, however, the program reported only pass rates (as currently required under Title II), and not the mean score, then the panel would determine sufficiency by considering 75 percent of the pass rates for the top 10 percent of programs. Thus, if the average pass rate of the top 10 percent of programs were 95 percent, a program’s pass rate of about 71 percent would be sufficient.
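The arithmetic in the examples above can be sketched in a few lines. This is an illustration of the heuristic as described here, not an official TEAC calculator: the function names and the `scale_floor` parameter are assumptions introduced for the example, and the heuristic itself remains a guide rather than a rule.

```python
def sufficiency_threshold(empirical_max, scale_floor=0.0, fraction=0.75):
    """Return the point that lies `fraction` of the way from the scale floor
    to the empirical maximum (75 percent by default)."""
    return scale_floor + fraction * (empirical_max - scale_floor)


def empirical_maximum(scores, top_percent=10):
    """Estimate the empirical maximum as the mean of the top scorers
    (the top 10 percent by default), as suggested in the text."""
    if not scores:
        raise ValueError("scores must not be empty")
    ranked = sorted(scores, reverse=True)
    n_top = max(1, len(ranked) * top_percent // 100)
    top = ranked[:n_top]
    return sum(top) / len(top)


# Praxis I (math), which ranges from 150 to 190: 150 + 0.75 * 40
praxis_threshold = sufficiency_threshold(190, scale_floor=150)

# Highest reliability coefficients of about .90; the text rounds the result to .68.
reliability_threshold = sufficiency_threshold(0.90)

# Top programs averaging a 95 percent pass rate; the text rounds the result to 71.
pass_rate_threshold = sufficiency_threshold(95)

print(praxis_threshold, round(reliability_threshold, 3), pass_rate_threshold)
```

Passing the scale floor explicitly keeps the Praxis case (a 40-point spread starting at 150) distinct from measures such as pass rates or coefficients, where the floor is zero.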

    It would also be appropriate for the panelists to apply the 75 percent heuristic to the preponderance of the evidence standard, as TEAC has left the determination of what constitutes “preponderance” to the panel’s judgment. The panel, using the 75 percent heuristic, would accept as sufficient evidence of commitment a case where at least 75 percent of the program’s measures meet the parity standard (no appreciable difference between the norms of the program and the institution with regard to the standards of capacity).

    When not to use the 75 percent heuristic. The panel employs the 75 percent heuristic only in the absence of any other guidance with regard to the magnitude of what would constitute a sufficient or adequate amount for TEAC’s principles and standards.

    • TEAC requires, for example, the program faculty to address in its Brief all the components of the TEAC system (1.0-4.7), not just 75 percent of them.
    • TEAC requires that the preponderance of evidence for commitment show no appreciable differences between the institutional norm and the program norm. Because the field has established procedures for determining if differences are trivial or significant, it would not be appropriate for the panelists to apply the 75 percent heuristic to the parity requirement. The panel would not accept as evidence of commitment a case where the program norm was 75 percent of the institutional norm in place of TEAC’s requirement of it being trivially different from it.

    The 75 percent heuristic is not a rule or an algorithm; it is only a guide to assist the panel in determining the sufficiency of the evidence with regard to any claim made in the Brief. It cannot be a rule or algorithm because, if it were applied automatically to all the evidence, it could lead to serious errors. For example:

    • Some regions of the country have such teacher shortages that nearly 100 percent of graduates who wish to teach will find teaching positions. In such a region, a 75 percent hiring rate might actually indicate a significant weakness in the program, not the strength that the program faculty may be alleging. If a program in a region with teacher shortages were to base a claim of program quality on hiring rates, the panel would need to be free to consider a more demanding standard than 75 percent. If the panel did, it would ensure that it applied its logic even-handedly to all programs during the period in which there was a teaching shortage in a region.
    • If there were evidence of grade or score inflation, the panel would need to be free to consider a higher magnitude than 75 percent of the top grade or score as a measure of sufficient evidence. On the other hand, the panel needs to be free to consider a lower magnitude for programs that have resisted grade inflation pressures and held to an older standard in which the modal grade at the institution and program for satisfactory work is a C or 2.0. In other cases, the 75 percent guideline may not reflect the grade index a program may have actually determined through careful studies of predictive and concurrent validity.

    © 2006 TEAC. All rights reserved.
    One Dupont Circle, Suite 320
    Washington, DC 20036-0110
    202-466-7236
    fax: 302-831-3013