Automated essay scoring (AES) systems can unintentionally reinforce biases, disproportionately affecting marginalized students. For example, ETS's E-rater scored essays by African American students 0.81 points lower, and essays by students from mainland China 1.3 points higher, on average than human graders did. These systems often overemphasize superficial metrics like sentence length, penalizing unconventional writing styles or less common vocabulary, and training data that reflects societal biases further perpetuates unfair outcomes. To address this, diversify training datasets, audit systems regularly, and incorporate human oversight. Advanced text representations like Sentence-BERT and LASER can improve fairness by capturing deeper meaning. Exploring these strategies can help create more equitable AES systems and better support all students.
Understanding Bias in Automated Essay Scoring

Automated essay scoring (AES) systems are designed to evaluate writing quality, but they often carry hidden biases that can disadvantage specific groups. If you're relying on these systems for grading or assessment, it's critical to understand how these biases manifest and what they mean for fairness and accuracy.
For example, studies of ETS's E-rater have found that it scores essays by African American students 0.81 points lower, and essays by students from mainland China 1.3 points higher, on average than human graders do. This isn't just a minor discrepancy—it's a systemic issue rooted in the data and algorithms used to train these systems. Bias creeps in because AES tools often overemphasize superficial metrics like sentence length and vocabulary complexity, which may not align with true writing quality or creativity.
Here's how bias amplifies in AES systems:
- Training Data Bias: If the training data reflects societal grading biases, the algorithm will replicate them. For instance, essays from certain demographic groups might be consistently undervalued in the dataset, leading the system to perpetuate those biases.
- Overweighting Superficial Metrics: Algorithms might reward essays with longer sentences or more complex vocabulary, even if the content lacks depth or originality. This penalizes students who write concisely or use simpler language effectively (see the toy sketch after this list).
- Inconsistent Scoring Across Subgroups: Research using the ASAP dataset revealed that different text representation methods (like BOW, TF-IDF, Sentence-BERT, and LASER) produce inconsistent scores for essays from various demographic subgroups. This inconsistency undermines the fairness of the system.
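Here is the toy sketch referenced above: a scorer trained only on surface features (synthetic data, hypothetical feature set) ends up rewarding padding, because nothing in its inputs captures content quality.

```python
# Toy sketch: a scorer trained only on surface features (synthetic data).
# It illustrates how such a model rewards verbosity, not content quality.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic training set: scores loosely track word count and sentence length.
word_counts = rng.integers(150, 600, size=200)
avg_sentence_len = rng.uniform(8, 30, size=200)
scores = 1 + 0.008 * word_counts + 0.05 * avg_sentence_len + rng.normal(0, 0.3, 200)

X = np.column_stack([word_counts, avg_sentence_len])
model = LinearRegression().fit(X, scores)

concise = [[180, 12]]   # short, direct essay
padded = [[450, 26]]    # same ideas, padded with filler

print(model.predict(concise), model.predict(padded))
# The padded essay scores higher, even though no content was added.
```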
Even when algorithms are designed to ignore sensitive attributes like race or gender, they can still create discriminatory outcomes.
For example, in digital lending, algorithms use proxies for sensitive attributes to make decisions. Similarly, AES systems might indirectly penalize certain writing styles or cultural expressions that don't align with the "norm" encoded in the training data.
The consequences of these biases are far-reaching. Students from marginalized groups may receive lower scores not because their writing is inferior, but because the system is biased against their style or background. This can affect their academic opportunities, scholarships, and even future career prospects.
To address these issues, you need to critically evaluate the AES systems you use. Look for transparency in how the algorithms are trained and tested. Demand evidence that the system has been rigorously evaluated for bias across diverse demographic groups. And most importantly, consider supplementing automated scoring with human evaluation to ensure fairness and accuracy.
The stakes are high, and the time to act is now. By understanding and addressing bias in AES, you can help create a more equitable system for evaluating student writing.
Types of Bias in AES Systems
Bias in Automated Essay Scoring (AES) systems isn't just a theoretical concern—it's a real-world problem that can significantly impact students' academic trajectories. When you're evaluating AES systems, you need to understand the types of bias that can creep in and how they manifest. Let's break it down.
Demographic Bias
One of the most glaring issues in AES is demographic bias. Studies have shown that essays from some demographic groups receive machine scores up to 1.3 points higher than human graders give, while essays from others are scored 0.81 points lower. This isn't just a minor discrepancy—it's a systemic issue that can disadvantage students based on their background. For example, if an AES system is trained on data that disproportionately represents one demographic, it may struggle to fairly evaluate essays from underrepresented groups. This creates a feedback loop where bias is perpetuated, and certain students are consistently penalized.
Human Scoring Bias
Human bias doesn't disappear when you introduce automation—it often gets baked into the system. Take confirmation bias in peer evaluation, for instance. If human raters consistently favor certain writing styles or topics, the AES system learns to mimic those preferences. This means that essays deviating from the "norm" may be unfairly scored, even if they're well-written or creative. You're essentially codifying human biases into an algorithm, which can lead to systematic errors in scoring.
Algorithmic Bias
Algorithmic bias is where things get particularly tricky. AES systems may penalize essays that use unconventional writing styles or less common vocabulary. This disproportionately affects students from diverse linguistic or cultural backgrounds. For example, a student who uses African American Vernacular English (AAVE) might be scored lower, not because their essay lacks quality, but because the system isn't trained to recognize the value in their unique expression. This type of bias can stifle creativity and reinforce narrow definitions of "good writing."
Distribution Shifts
Another critical issue is distribution shifts in training data. These shifts can occur in two main ways:
- X-shifts: Changes in the input data, such as essays from new demographics or topics not well-represented in the training set.
- Y|X-shifts: Changes in the relationship between the input (essays) and the output (scores), such as evolving grading standards or cultural shifts in what's considered "good writing."
When these shifts happen, AES models can experience significant performance degradation. For instance, if a model is trained on essays from a specific time period or region, it may struggle to adapt to new contexts, leading to biased and inaccurate scoring.
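One practical way to check for an X-shift, sketched below under the assumption that you can compute simple features such as word count for both the training essays and incoming essays, is to compare feature distributions with a two-sample Kolmogorov-Smirnov test.

```python
# Minimal X-shift check: compare a feature's distribution in training data
# versus newly submitted essays using a two-sample Kolmogorov-Smirnov test.
from scipy.stats import ks_2samp

def word_count(essay: str) -> int:
    return len(essay.split())

def detect_length_shift(train_essays, new_essays, alpha=0.01):
    train_lengths = [word_count(e) for e in train_essays]
    new_lengths = [word_count(e) for e in new_essays]
    stat, p_value = ks_2samp(train_lengths, new_lengths)
    # A very small p-value suggests the incoming essays differ systematically
    # from the training distribution on this feature (a possible X-shift).
    return {"ks_statistic": stat, "p_value": p_value, "shift_suspected": p_value < alpha}
```

Word count is only one feature; the same comparison can be run on any interpretable feature or embedding dimension. Detecting a Y|X-shift additionally requires freshly human-scored essays to compare against.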
Real-World Implications
Let's look at prompt 7 of the ASAP dataset, which includes 1,569 essays scored by multiple human raters. Even with multiple raters, discrepancies in scoring create opportunities for algorithmic bias to take root. If the AES system is trained on this data without addressing these discrepancies, it will inherit and potentially amplify the biases present in the human scores. This isn't just a technical problem—it's an ethical one, with real consequences for students' futures.
What You Can Do
To mitigate these biases, you need to:
- Diversify training data to include a wide range of demographics, writing styles, and topics.
- Regularly audit AES systems for bias, using metrics that go beyond simple accuracy.
- Incorporate human oversight to catch and correct biases that the algorithm might miss.
Bias in AES isn't inevitable—it's a challenge that requires proactive, thoughtful solutions. By understanding the types of bias and their root causes, you can take steps to create fairer, more equitable systems that truly serve all students.
Impact of Training Data on Bias

When you're dealing with automated essay scoring (AES) systems, the training data is the foundation—and if that foundation is flawed, the entire system will be biased. Let's break this down so you can see exactly how training data impacts bias and why it's such a critical issue.
First, consider this: AES systems learn to score essays by analyzing patterns in the data they're trained on. If that data reflects historical inequalities or lacks representation from certain groups, the system will inevitably replicate those biases.
For example, studies from 1999 to 2018 revealed that ETS's E-rater consistently scored essays by students from mainland China 1.3 points higher on average than human graders did, while essays by African American students were scored 0.81 points lower.
Why? Because the training data likely overrepresented certain writing styles or demographics, skewing the algorithm's understanding of what constitutes a "good" essay.
Here's another example: Utah's 2017-2018 technical report flagged 348 ELA questions with mild differential item functioning (DIF) against minority or female students.
This means the questions—and by extension, the training data used to generate them—were biased. The algorithm learned to favor certain responses over others, disadvantaging entire groups of students. This isn't just a technical glitch; it's a systemic issue rooted in the data.
And it's not just about essays. Take Amazon's recruiting tool, which penalized resumes that included the word "women's." The bias in the training data—historical hiring patterns that favored men—led the algorithm to associate certain terms with lower scores. The same principle applies to AES systems. If the training data overrepresents one group or reflects societal biases, the algorithm will perpetuate those biases in its scoring.
Here's what you need to understand about the impact of training data on bias:
- Overrepresentation of specific groups: If the training data includes more essays from one demographic, the algorithm may learn to treat that group's writing conventions as the benchmark. Essays that depart from the dominant style are then unfairly penalized, even when they are just as strong.
- Historical inequalities: Training data often reflects past inequities. If historically marginalized groups were underrepresented in education or scored lower due to systemic bias, the algorithm will continue to disadvantage them unless the data is carefully curated.
- Unrepresentative samples: If the training data doesn't include enough examples from diverse groups, the algorithm won't learn to fairly evaluate essays from those groups. This leads to biased outcomes, as seen in the E-rater's scoring discrepancies.
The bottom line? Training data is the root of algorithmic bias in AES systems. If you don't address the biases in the data, you're setting the system up to fail—and worse, to perpetuate unfairness. The urgency here can't be overstated. Every day that biased systems are in use, they're impacting students' futures. It's time to take a hard look at the data and ensure it's as fair and representative as possible.
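To make that concrete, a minimal sketch of checking group representation before training is shown below; it assumes essays carry self-reported demographic labels, which many corpora do not provide.

```python
# Minimal sketch: inspect group representation in the training data.
# Assumes each record carries a demographic label; many corpora do not.
from collections import Counter

def representation_report(records, label_key="group"):
    counts = Counter(r[label_key] for r in records)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}

training_data = [
    {"essay": "text of essay 1", "group": "A"},
    {"essay": "text of essay 2", "group": "A"},
    {"essay": "text of essay 3", "group": "B"},
]
print(representation_report(training_data))
# e.g. {'A': 0.67, 'B': 0.33} -> group B is underrepresented; consider
# collecting more essays or reweighting before training.
```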
Detecting Bias in Essay Scoring Models
Detecting bias in automated essay scoring (AES) models is critical to ensuring fairness and accuracy in educational assessments. You need to understand how these biases manifest and the tools available to uncover them. Let's break it down.
First, consider the evidence from real-world studies.
For instance, ETS's E-rater system, widely used for scoring essays, has shown significant discrepancies. Between 1999 and 2018, it scored essays from students in mainland China 1.3 points higher and essays from African American students 0.81 points lower, on average, than human graders.
These patterns highlight systemic issues that can disproportionately affect certain groups.
Similarly, Utah's 2017-2018 technical report flagged 348 ELA questions with mild differential item functioning (DIF) against minority or female students, with three questions showing severe DIF.
These findings underscore the importance of scrutinizing AES systems for hidden biases.
To detect bias, researchers use advanced methods like DIF analysis, which compares how different demographic groups perform on specific test items.
For example, a study using the ASAP dataset (prompt 7) analyzed individual fairness by comparing scores of similar essays based on text representations and distance metrics. This approach helps identify whether the algorithm is unfairly penalizing or favoring certain groups.
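The core of that kind of individual-fairness check can be sketched in a few lines. This simplified version uses TF-IDF with cosine similarity only (the study also used Sentence-BERT, LASER, and other distance metrics) and flags pairs of essays that look very similar but received very different scores; the thresholds are illustrative.

```python
# Simplified individual-fairness probe: essays that are highly similar
# should receive similar scores. Uses TF-IDF + cosine similarity only;
# the study also used Sentence-BERT, LASER, and other distance metrics.
from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def flag_unfair_pairs(essays, scores, sim_threshold=0.8, score_gap=2):
    tfidf = TfidfVectorizer().fit_transform(essays)
    sims = cosine_similarity(tfidf)
    flagged = []
    for i, j in combinations(range(len(essays)), 2):
        if sims[i, j] >= sim_threshold and abs(scores[i] - scores[j]) >= score_gap:
            flagged.append((i, j, sims[i, j], abs(scores[i] - scores[j])))
    return flagged  # pairs worth sending to a human rater for review
```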
Here's how you can approach detecting bias in AES models:
- Analyze Algorithm Outputs: Look for anomalous results that deviate from expected patterns. For instance, if essays from a particular demographic consistently receive lower scores despite similar content quality, this could indicate bias.
- Compare Across Groups: Use demographic data to compare outcomes. Are certain groups consistently scoring higher or lower? Tools like DIF analysis can help quantify these disparities.
- Leverage Real-World Data: Studies like the one using writeAlizer, which analyzed 421 students' essays, found no evidence of bias in automated scores. However, this doesn't mean bias doesn't exist—it just means you need to dig deeper and use multiple methods to confirm.
The stakes are high. Biased AES systems can perpetuate inequities in education, affecting students' opportunities and outcomes. By using robust detection methods and staying vigilant, you can help ensure these systems are fair and reliable for everyone.
Mitigation Strategies for Bias in AES

Mitigating bias in Automated Essay Scoring (AES) systems isn't just a technical challenge—it's a moral imperative. If you're working with or developing AES, you need to implement strategies that ensure fairness and equity across all demographic groups. Let's break down the most effective approaches to tackle bias head-on.
Use Diverse and Representative Training Data
The foundation of any AES system is its training data. If your dataset lacks diversity or overrepresents certain groups, the system will inevitably amplify those biases. To avoid this:
- Curate datasets that include essays from a wide range of demographics, including race, gender, socioeconomic status, and geographic location.
- Balance the dataset to ensure no single group dominates the training process.
- Collaborate with educators to gather essays from underrepresented populations, ensuring the system learns to score equitably across all groups.
Preprocess Data to Remove Sensitive Attributes
Even with diverse data, sensitive attributes like names, locations, or cultural references can inadvertently influence scoring. To mitigate this:
- Anonymize essays by removing identifiable information before training the model (a minimal redaction sketch follows this list).
- Normalize language patterns that might correlate with specific demographics, such as dialects or colloquialisms, to prevent the system from penalizing non-standard English.
- Use adversarial debiasing techniques, where the model is trained to ignore features that correlate with sensitive attributes.
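Here is the redaction sketch mentioned in the list above. It uses spaCy's small English model for named-entity recognition, which is an assumption about tooling; a production pipeline would need a stronger model plus manual spot checks.

```python
# Minimal anonymization sketch: redact person and place names with spaCy NER.
# Real pipelines need stronger models plus manual spot checks.
import spacy

nlp = spacy.load("en_core_web_sm")
REDACT_LABELS = {"PERSON", "GPE", "LOC", "ORG"}

def redact(essay: str) -> str:
    out = essay
    doc = nlp(essay)
    # Replace entities right-to-left so character offsets stay valid.
    for ent in sorted(doc.ents, key=lambda e: e.start_char, reverse=True):
        if ent.label_ in REDACT_LABELS:
            out = out[:ent.start_char] + "[REDACTED]" + out[ent.end_char:]
    return out

print(redact("My teacher Ms. Johnson in Atlanta inspired this essay."))
```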
Implement Bias-Mitigating Algorithms
Algorithms can be designed to prioritize fairness. For example:
- Individual fairness algorithms ensure that essays with similar content receive similar scores, regardless of the writer's background.
- Group fairness metrics can be applied to ensure that scores are distributed equitably across demographic groups.
- Regularization techniques can penalize the model for making predictions that correlate with sensitive attributes.
Audit and Monitor for Bias
Bias can creep into AES systems over time, especially as new data is introduced. To stay ahead:
- Conduct regular audits using metrics like Differential Item Functioning (DIF) to compare how the system performs across different groups (see the audit sketch after this list).
- Set up continuous monitoring to flag any disparities in scoring patterns.
- Engage third-party auditors to provide an unbiased assessment of the system's fairness.
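As an illustration of the kind of audit described in the list above, here is a minimal group-level sketch; it assumes each essay record carries a demographic label plus both a machine score and a human score, which is an assumption rather than a given in most deployments.

```python
# Minimal group-level audit: compare the gap between machine and human
# scores across demographic groups.
from collections import defaultdict
from statistics import mean

def score_gaps_by_group(records):
    """records: iterable of dicts with 'group', 'machine_score', 'human_score'."""
    gaps = defaultdict(list)
    for r in records:
        gaps[r["group"]].append(r["machine_score"] - r["human_score"])
    return {group: mean(values) for group, values in gaps.items()}

# A consistently negative mean gap for one group (machine below human)
# is exactly the kind of disparity a DIF-style audit should surface.
```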
Prioritize Transparency and Human Oversight
Even the most advanced AES systems aren't infallible. To build trust and ensure fairness:
- Document the scoring process clearly, so educators and students understand how scores are determined.
- Incorporate human review for essays that receive outlier scores or are flagged for potential bias.
- Provide explanations for scores, allowing students to understand why they received a particular grade and how they can improve.
Role of Human Raters in Bias Propagation
You might think that automated essay scoring (AES) systems are purely objective, but the truth is, they inherit biases from the humans who train them. Human raters play a critical role in shaping these systems, and their biases—whether conscious or unconscious—can seep into the algorithms, perpetuating unfairness. Let's break down how this happens and why it's such a pressing issue.
When human raters grade essays, their personal beliefs and preferences can influence their scoring. For example, if a rater has a strong bias toward certain writing styles or cultural perspectives, they're more likely to favor essays that align with those preferences. This confirmation bias doesn't just affect individual scores—it gets baked into the training data used to develop AES systems. Over time, the system learns to replicate these biases, disadvantaging students whose writing doesn't fit the "preferred" mold.
- Subgroup inconsistencies: Studies reveal that human graders often score essays differently based on the writer's ethnicity, gender, or socioeconomic background. These inconsistencies create a ripple effect, as AES systems trained on this data inherit the same biases. For instance, African American students have been shown to receive lower scores on average compared to their peers, even when the quality of writing is comparable.
- The E-rater example: Take the E-rater system, one of the most widely used AES tools. Research has found that it scores essays from students in mainland China about 1.3 points higher, and essays from African American students about 0.81 points lower, than human graders do. This isn't just a technical glitch—it's a direct reflection of the biases present in the human-scored data used to train the system.
- The challenge of correction: Even when developers try to adjust for bias, the results are often counterproductive. For example, tweaking the system to reduce bias against one group can inadvertently increase bias against another. This highlights the complexity of the problem and underscores the need for more nuanced solutions.
The ASAP dataset, a benchmark for AES research, provides a stark illustration of this issue. While human-human agreement on essay scores is far from perfect, it serves as a baseline for evaluating AES fairness. However, when AES systems are trained on biased human scores, they struggle to achieve even this imperfect standard, further entrenching disparities.
Evaluating Fairness in Automated Scoring

When you're evaluating fairness in automated essay scoring (AES), you need to dig deep into how similar essays are treated.
A recent study using prompt 7 of the ASAP dataset (1,569 essays) measured individual fairness by analyzing essay similarity across multiple text representations and distance metrics.
- Text Representations: The study used BOW (Bag of Words), TF-IDF, Sentence-BERT, and LASER to represent essays. Each method captures different aspects of the text, from surface-level word counts to deep semantic meaning.
- Distance Metrics: Cosine, Euclidean, Manhattan, and Jaccard distances were used to measure how similar essays are. These metrics help quantify the "distance" between essays in the feature space.
- Lipschitz Mapping: To assess fairness, the study applied the Lipschitz mapping function to 1,230,096 essay pairs. This function checks whether similar essays receive similar scores—a key indicator of individual fairness.
The results? Sentence-BERT and LASER consistently outperformed simpler methods like BOW and TF-IDF, especially when paired with Gradient Boosting models. This combination achieved higher Quadratic Weighted Kappa (QWK) scores, showing better agreement with human graders.
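QWK itself is easy to compute; here is a minimal sketch using scikit-learn's cohen_kappa_score with quadratic weighting on hypothetical integer scores (not data from the study).

```python
# Quadratic Weighted Kappa between human and machine scores.
# QWK of 1.0 means perfect agreement; 0 means chance-level agreement.
from sklearn.metrics import cohen_kappa_score

human_scores = [2, 3, 4, 4, 1, 3, 2, 4]     # hypothetical integer ratings
machine_scores = [2, 3, 3, 4, 1, 2, 2, 4]

qwk = cohen_kappa_score(human_scores, machine_scores, weights="quadratic")
print(f"QWK: {qwk:.3f}")
```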
But fairness isn't just about overall performance—it's about consistency.
The study also analyzed 30,000 essay pairs across similarity groups. They found that essays with high similarity scores (using LASER or Sentence-BERT) had smaller score differences, indicating fairer treatment.
To test robustness, the researchers used paraphrased essays—100 pairs in total. These pairs achieved near-perfect NDCG (Normalized Discounted Cumulative Gain) scores, close to 1.0, proving that the system could handle subtle variations in text without penalizing students unfairly.
Finally, interpretable features from the EASE library—12 in total—were incorporated to analyze essay similarity.
While LASER generally outperformed other representations, it struggled slightly with language errors, highlighting the importance of combining deep semantic models with surface-level features for a balanced approach.
If you're working on AES systems, these findings are critical. They show that fairness isn't just a buzzword—it's measurable, actionable, and essential for building trust in automated scoring.
Challenges in Defining Essay Similarity
Defining "similar essays" for individual fairness assessment in Automated Essay Scoring (AES) is far more complex than it might seem at first glance.
Writing quality isn't just about grammar or vocabulary—it's a multifaceted construct that includes content depth, stylistic choices, and structural coherence.
This complexity makes it challenging to pin down a universal definition of essay similarity, which is critical for ensuring fairness in scoring.
To tackle this, researchers have experimented with various text representations and distance metrics.
For instance, one study on the ASAP dataset used four distinct text representations, Bag of Words (BOW), Term Frequency-Inverse Document Frequency (TF-IDF), Sentence-BERT, and LASER, alongside four distance metrics: cosine, Euclidean, Manhattan, and Jaccard.
Each combination offers a different lens through which to assess similarity, but here's the catch: there's no one-size-fits-all solution.
The choice of representation and metric can significantly influence the fairness assessment, and this variability underscores the lack of consensus in the field.
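To see how much those choices matter, the small sketch below computes the same essay pair's distance under several representation and metric combinations; BOW and TF-IDF stand in for the lighter-weight representations, and swapping in Sentence-BERT or LASER embeddings would follow the same pattern.

```python
# The same pair of essays can look quite different depending on how the
# text is represented and which distance metric is used.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from scipy.spatial.distance import cosine, euclidean, cityblock

essay_a = "The author argues that school uniforms improve focus in class."
essay_b = "School uniforms, the writer claims, help students concentrate."

for name, vectorizer in [("BOW", CountVectorizer()), ("TF-IDF", TfidfVectorizer())]:
    vecs = vectorizer.fit_transform([essay_a, essay_b]).toarray()
    print(name,
          "cosine:", round(cosine(vecs[0], vecs[1]), 3),
          "euclidean:", round(euclidean(vecs[0], vecs[1]), 3),
          "manhattan:", round(cityblock(vecs[0], vecs[1]), 3))
# Each representation/metric pair quantifies "how similar" the essays are
# differently, so the fairness conclusion can shift with the choice.
```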
Consider the computational challenges involved.
The study analyzed 1,230,096 unique essay pairs from the ASAP dataset.
That's over a million comparisons!
The sheer volume highlights the logistical hurdles in evaluating individual fairness at scale.
But it's not just about the numbers—it's about the nuances.
When researchers sampled 30,000 essay pairs from different similarity groups, they found that the average score differences varied depending on the similarity metric used.
This inconsistency is a red flag, signaling that fairness assessments can be highly sensitive to the methods employed.
Even when essays are paraphrased—essentially saying the same thing in different words—the results are telling.
In a test with 100 paraphrased essay pairs, near-perfect NDCG scores (close to 1.0) were achieved, but only with certain vector representations.
This suggests that not all methods are equally adept at capturing semantic similarity, which is crucial for a fair evaluation.
Key takeaways:
- Essay similarity is a multidimensional challenge, encompassing content, style, and grammar.
- The choice of text representation and distance metric can drastically impact fairness assessments.
- Computational demands are significant, with millions of comparisons required for robust analysis.
- Paraphrased essays reveal gaps in some methods' ability to capture semantic equivalence.
If you're working on AES systems, this is a critical area to focus on.
Without a reliable way to define and measure essay similarity, ensuring individual fairness remains an uphill battle.
The stakes are high, and the clock is ticking—students' futures depend on it.
Text Representation and Bias Detection

When you're dealing with automated essay scoring, the way you represent text can make or break the fairness of the system. Let's break it down: Bag-of-Words (BOW) and TF-IDF are the old-school methods you've probably heard of. They're straightforward—they count words or weigh them based on frequency—but here's the catch: they don't capture the deeper meaning behind the words.
This can lead to bias because they miss the nuances that make essays unique. For example, two essays might use the same words but convey entirely different ideas, and BOW or TF-IDF won't pick up on that. That's where things can go sideways.
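A quick way to see that blind spot, using CountVectorizer as the bag-of-words model and two made-up sentences, is shown below: statements built from the same words get identical BOW vectors even though they make opposite claims.

```python
# Bag-of-words ignores word order, so sentences with the same words but
# opposite meanings become indistinguishable to the scorer.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

a = "the evidence does not support the conclusion"
b = "the conclusion does not support the evidence"

vectors = CountVectorizer().fit_transform([a, b]).toarray()
print(np.array_equal(vectors[0], vectors[1]))  # True: identical representations
```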
Now, let's talk about the game-changers: deep learning methods like Sentence-BERT and LASER. These tools create vector representations of essays in higher-dimensional spaces, which means they can capture context and semantic meaning far better than BOW or TF-IDF.
Studies using the ASAP dataset have shown that LASER features outperform other methods when predicting essay scores across multiple regression models. Why does this matter? Because better text representation can help mitigate bias by ensuring the scoring system understands the essay's true meaning, not just its surface-level features.
But here's something you need to watch out for: the distance metric you use to compare essay vectors. Cosine similarity is a popular choice—it's been effective in NLP studies—but it's not a one-size-fits-all solution. Depending on the type of bias you're dealing with, cosine similarity might not be the best fit. You need to experiment and see what works for your specific use case.
And don't stop at just similarity scores. Pair them with interpretable essay features to get a fuller picture. For instance, combining Sentence-BERT embeddings with BOW features can help you pinpoint exactly which aspects of an essay are contributing to biased scoring. This approach gives you actionable insights, so you can tweak your system to be fairer and more accurate.
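One way to set up that combination, sketched under the assumption that the sentence-transformers package and a generic pretrained model are available and with an illustrative rather than study-specific feature set, is to concatenate dense embeddings with a few interpretable surface features before training the scorer.

```python
# Sketch: concatenate semantic embeddings with interpretable surface
# features so biased scoring can be traced back to specific feature groups.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed available

def build_features(essays):
    model = SentenceTransformer("all-MiniLM-L6-v2")     # generic pretrained model
    embeddings = model.encode(essays)                   # shape (n_essays, embedding_dim)
    surface = np.array([
        [len(e.split()),                                # word count
         np.mean([len(w) for w in e.split()])]          # average word length
        for e in essays
    ])
    return np.hstack([embeddings, surface])
```

With both groups in one feature matrix, you can inspect coefficients or feature importances per group to see whether surface cues or semantic content are driving a suspicious score.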
Key takeaways:
- BOW and TF-IDF are limited in capturing semantic meaning, which can introduce bias.
- Deep learning methods like Sentence-BERT and LASER offer more nuanced representations, improving fairness.
- The choice of distance metric (e.g., cosine similarity) can influence bias detection.
- Combining similarity scores with interpretable features provides a comprehensive approach to bias detection.
Ensuring Equitable Evaluation in Education
Automated Essay Scoring (AES) systems are transforming education, but they come with a critical caveat: bias. If you're relying on these systems to evaluate student writing, you need to understand the risks—and the urgency of addressing them. Studies show that AES systems can unfairly disadvantage certain groups: ETS's E-rater has scored essays from students in mainland China 1.3 points higher, and essays from African American students 0.81 points lower, than human graders on average. These discrepancies aren't just numbers; they're barriers to equitable education.
One of the biggest challenges in AES is defining what makes two essays "similar."
A study using prompt 7 of the ASAP dataset (1,569 essays) explored this issue by analyzing text representations and distance metrics. The findings? Even when essays are structurally similar, AES systems often fail to account for cultural, linguistic, or stylistic nuances. This means a student's unique voice or perspective could be penalized simply because it doesn't fit the system's narrow criteria.
- Example: A student using African American Vernacular English (AAVE) might be scored lower, even if their argument is compelling and well-structured.
- Impact: This bias can discourage students from expressing themselves authentically, stifling creativity and diversity in writing.
AES systems often prioritize surface-level features like sentence length, vocabulary complexity, and grammar over deeper aspects of writing quality.
While these metrics are easier to quantify, they don't capture the richness of a student's ideas or their ability to construct a persuasive argument.
- Example: A student with a concise, impactful essay might score lower than one with longer, more verbose sentences—even if the latter lacks substance.
- Impact: This approach disadvantages students who excel in critical thinking but may not conform to traditional writing norms.
The stakes are high. At least 21 states use AES systems for standardized testing, and in many cases, only 5-20% of essays receive human review. This means the majority of students' work is evaluated solely by algorithms that may perpetuate bias. If you're an educator or policymaker, this isn't just a technical issue—it's an ethical one.
- Example: A student's college admission or scholarship eligibility could hinge on an unfairly scored essay.
- Impact: Bias in AES can have lifelong consequences, reinforcing systemic inequities in education.
Addressing bias in AES requires a multi-pronged approach. Here's what you can do to ensure more equitable evaluation:
- Diversify Training Data: Ensure the datasets used to train AES systems include a wide range of writing styles, dialects, and cultural perspectives.
- Incorporate Human Oversight: Combine AI scoring with human review, especially for borderline cases or essays from underrepresented groups.
- Focus on Holistic Metrics: Develop algorithms that assess not just grammar and structure, but also creativity, argument strength, and originality.
The urgency to act is clear. Every day that bias in AES goes unaddressed is another day students are being unfairly judged—and another day the education system fails to live up to its promise of equity. You have the power to change this. Start by questioning the systems you use, advocating for transparency, and pushing for solutions that prioritize fairness above all else.
Questions and Answers
How Do You Evaluate Individual Fairness for an Automated Essay Scoring System?
You evaluate individual fairness by measuring if similar essays get similar scores using fairness metrics, analyzing group disparities, and applying counterfactual analysis. You assess bias mitigation strategies and conduct impact assessments to ensure equitable scoring.
Should You Fine-Tune BERT for Automated Essay Scoring?
You should fine-tune BERT if you've got enough labeled data to avoid data scarcity and overfitting risks. Incorporate human feedback for bias mitigation and ensure domain adaptation aligns with student needs for fair, accurate scoring.
How Does Automated Essay Scoring Work?
Automated essay scoring uses scoring algorithms to analyze text through feature engineering or deep learning. It incorporates human feedback for training, employs bias detection to ensure fairness, and conducts error analysis to refine accuracy over time.
What Is a Framework for Evaluation and Use of Automated Scoring?
You'll evaluate system reliability by comparing automated and human scores using metrics like QWK. Ensure score validity through human oversight, bias mitigation via fairness constraints, and conduct cost-benefit analysis to balance efficiency with educational outcomes.