
New York Times vs OpenAI and the Battle against Bias

Written by Jack Cullinane | Apr 23, 2024 12:25:06 PM

The legal dispute between The New York Times, Microsoft, and OpenAI over the use of the newspaper’s articles to train a generative AI model is set to be a landmark case for the 21st century. The outcome will determine whether The New York Times is entitled to compensation for its content, and whether OpenAI has infringed its copyright. If the technical reality of the technology is fully understood by the court, the result is likely to go against OpenAI and Microsoft. This case will have vast consequences not only for these companies but for the AI industry as a whole. If OpenAI and other AI engineering firms learn from the mistakes the lawsuit highlights, they will be able to build future AI models that are less biased and less harmful.

One of the key complaints in the NYT lawsuit against OpenAI is that OpenAI has copied its content, producing AI-generated outputs that are nearly identical to NYT articles. The technical reality, however, is that generative models like OpenAI's do not store the data they are trained on. Instead, the model is trained on data, such as the NYT articles, only to learn patterns. Then, when prompted, it uses those patterns to produce a response.
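To make the pattern-learning point concrete, here is a toy character-level bigram model in Python. It is a deliberately minimal sketch, nothing like the neural networks behind ChatGPT, but it illustrates the same principle: training extracts statistics from a corpus, and generation samples from those statistics rather than retrieving stored text.

```python
import random
from collections import defaultdict

def train(corpus: str) -> dict:
    """Record, for each character, the characters observed to follow it."""
    transitions = defaultdict(list)
    for current, following in zip(corpus, corpus[1:]):
        transitions[current].append(following)
    return transitions

def generate(transitions: dict, seed: str, length: int) -> str:
    """Produce new text by repeatedly sampling a plausible next character."""
    out = seed
    for _ in range(length):
        choices = transitions.get(out[-1])
        if not choices:
            break
        out += random.choice(choices)
    return out

corpus = "the model learns statistical patterns from its training data"
model = train(corpus)
print(generate(model, "t", 40))  # new text shaped by, not copied from, the corpus
```

If the statistics such a model learns are dominated by one source, its output will naturally read like that source, which is exactly the imitation bias described next.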

Therefore, unlike a plagiarizer who copies the content of another, the model is creating an original response that happens to match the style of The New York Times on the same topic, a phenomenon best described as imitation bias. The imitation bias in ChatGPT is entirely due to a poorly designed training dataset. It is likely that the model is overly influenced by the style of The New York Times because it was trained on more NYT articles than articles from other sources.
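The effect of such an imbalance is easy to demonstrate. In the hypothetical mix below (the source names and proportions are invented for illustration), one source supplies 70% of the training examples, so roughly 70% of what the model sees, and therefore imitates, comes from that single source.

```python
import random

# Hypothetical training mix: one source is heavily over-represented.
training_mix = (
    ["source_a"] * 70   # e.g. a single newspaper's articles
    + ["source_b"] * 20
    + ["source_c"] * 10
)

# Uniform sampling of training examples reproduces the imbalance.
sample = [random.choice(training_mix) for _ in range(10_000)]
print(sample.count("source_a") / len(sample))  # ~0.7
```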

The real legal question is whether a data-driven imitation constitutes plagiarism. Ultimately, it seems likely that OpenAI will be found liable for copyright infringement because the legal system cannot yet distinguish between genuine creation and imitation driven by data. After all, the precedent simply does not exist.

In truth, that is an ideal outcome for the industry. This case is an important lesson for the AI industry, emphasizing the need for balanced datasets that avoid biases, including imitation bias and bigotry. To mitigate the harm to society, the courts' judgment should require these entities to correct their approach: machine intelligence models should be retrained on balanced datasets drawn from diverse sources in equal proportion. That way, the model can learn a more comprehensive set of patterns from which to generate content.
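One straightforward way to approximate "diverse sources in equal proportion" is to downsample every source to the size of the smallest one before training. The sketch below assumes a corpus of (source, text) pairs; it is an illustration of the idea, not a description of any lab's actual pipeline.

```python
import random
from collections import defaultdict

def rebalance(documents: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Given (source, text) pairs, return a mix with equal counts per source."""
    by_source = defaultdict(list)
    for source, text in documents:
        by_source[source].append((source, text))
    # Downsample every source to the size of the smallest one.
    quota = min(len(docs) for docs in by_source.values())
    balanced = []
    for docs in by_source.values():
        balanced.extend(random.sample(docs, quota))
    random.shuffle(balanced)
    return balanced
```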

As an advocate of AI Hominism, I think the biggest threat from AI and its potential sentience is the unethical or poor design of data models by human creators. These systems are prone to corruption from the start if AI engineers do not ensure holistic and ethical training programs. This lawsuit is crucial for preventing machine intelligence models from causing social harm. It shows how biased data can lead to undesirable outcomes, and it could mark the first time the judiciary intervenes in regulating AI engineering processes to protect the public interest.
