August 12, 2022

Bias in text-mining and natural language processing (NLP)

Keywords, metadata and thesaurus terms may contribute to biases

Natural language processing (NLP) and text mining are used to analyze unstructured data, and the tools themselves may carry inherent biases.

Text-mining and natural language processing (NLP) tools use "seeds" (i.e., words) to develop their models and datasets. But the seeds may contain biases or stereotypes that carry through to the results. The seeds are typically undocumented or buried deep in the code, so if you are only using the tool, you would never know that the biases and stereotypes exist in the dataset. Keywords, metadata and thesaurus terms may factor into the biases. Researchers recommend tracing the origins of seed sets and manually examining and testing them to make sure the results are trustworthy. Read the full story at:

https://news.cornell.edu/stories/2021/10/words-used-text-mining-research-carry-bias-study-finds
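
As a rough illustration of what tracing and testing a seed set might look like in practice, here is a minimal Python sketch. It is not the method used in the study: the `SeedSet` class, its `source` field, the `audit_seed_set` function and the `min_count` threshold are all hypothetical names chosen for this example. The sketch only flags seed lists that have no documented origin, contain duplicates, or include words that rarely appear in the text being analyzed, so that a person can review them by hand.

```python
from collections import Counter
from dataclasses import dataclass
import re


@dataclass
class SeedSet:
    """A hypothetical record pairing a seed list with its documented origin."""
    name: str
    words: list
    source: str = ""  # assumed provenance field: where the list came from


def audit_seed_set(seed_set, corpus_texts, min_count=2):
    """Return human-readable warnings that a person should review manually."""
    # Count lowercase word tokens across the whole corpus.
    counts = Counter(
        token
        for text in corpus_texts
        for token in re.findall(r"[a-z']+", text.lower())
    )
    warnings = []
    # A seed list with no recorded origin cannot be traced.
    if not seed_set.source.strip():
        warnings.append(f"{seed_set.name}: no documented source")
    seen = set()
    for word in seed_set.words:
        key = word.lower()
        if key in seen:
            warnings.append(f"{seed_set.name}: duplicate seed '{key}'")
            continue
        seen.add(key)
        # Seeds that are rare or absent in the corpus may give unstable results.
        if counts[key] < min_count:
            warnings.append(
                f"{seed_set.name}: '{key}' appears {counts[key]} time(s) in corpus"
            )
    return warnings


# Toy data, invented for this example only.
corpus = [
    "The nurse spoke with the doctor about the patient.",
    "A doctor and a nurse reviewed the chart.",
]
seeds = SeedSet(name="occupations", words=["doctor", "nurse", "Doctor", "surgeon"])
report = audit_seed_set(seeds, corpus)
for line in report:
    print(line)
```

A check like this does not detect bias or stereotypes on its own. It only surfaces seed lists that deserve a closer look, which still has to be done by a person reading the words and asking where they came from.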