Knowledge Matters: The Value Of Free AI Training Data
The original Star Wars movie has an interesting scene illustrating the problem of being unable to verify training data and the knowledge that results from it. During the flight on the Millennium Falcon, C-3PO and R2-D2 play Dejarik, a chess-like game, against Chewbacca. Chewbacca is about to lose and gets angry. Han Solo then convinces the droids that Chewbacca might get violent when he loses. These words prompt C-3PO to suggest letting the Wookiee win.
The two droids are far ahead of today's AI systems. Yet the scene beautifully shows the danger of an AI acting on knowledge no one can verify. Han provides the two droids with questionable knowledge, and the machines act on it. In the old expanded universe, C-3PO later advises someone else always to let Wookiees win at virtual chess. He has fully internalized the lesson Han Solo taught him. Since no one has the original training material, it becomes impossible for anyone to verify what he has learned or the basis of the observation.
Today's AI might not be as sophisticated as these science-fiction droids. Yet faulty knowledge can still cause significant problems, as shown when Air Canada's chatbot gave wrong policy information. So let us dive into the fascinating world of AI training data and why we need to know what a model was trained on.
Open-Weights vs. Open-Source
If you look at "open-source" AI today, many projects don't share everything needed to reproduce or modify the AI. The code for running the model and its interfaces is available for us to study. Yet the actual AI consists of a file of weights. These weights are the brains of the model, the part that makes the decisions and tries to be intelligent. But a collection of weights is nothing more than machine code. No one in their right mind would say that you can read the code of a program if I hand you the binary. With enough time, I can piece together what each register shift in a program does, but that is in no way comparable to having the actual source code. Yet with open weights, AI developers say: here are the weights; there is no need for the training material. Just trust me. I didn't train my AI exclusively on propaganda material.
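To make the point concrete, here is a minimal sketch of everything an open-weights release actually gives you. The file name model.safetensors is a hypothetical stand-in for whatever a vendor ships; the safetensors library and PyTorch are assumed to be installed.

```python
# Sketch: inspect what an "open weights" release actually contains.
# Assumes a hypothetical file "model.safetensors"; requires `safetensors` and `torch`.
from safetensors import safe_open

with safe_open("model.safetensors", framework="pt") as f:
    for name in f.keys():
        tensor = f.get_tensor(name)
        # All we get are names, shapes, and raw numbers -- nothing about the
        # documents, queries, or filters that produced these values.
        print(f"{name}: shape={tuple(tensor.shape)}, dtype={tensor.dtype}")
```

Reading those numbers tells you about as much about the training corpus as a hex dump tells you about the original source code.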
The Impact of Missing Knowledge
Companies can easily hide hallucinations, misplaced DEI initiatives, conspiracy theories, and wrong answers. If the public doesn't know the training data, the only way to find these problems is to stumble over them. Instead of discussing such issues openly, organizations hide them to avoid making themselves vulnerable to criticism that might hurt their stock price. Yet, with AI being adopted in more and more areas of our economy and society, it becomes crucial that we start developing free, verifiable, and accountable AI.
Otherwise, we might end up like Air Canada. When their glorified chatbot gave wrong advice about their policies, they decided not to own up to it. First, they denied it had happened. Then they tried to hide behind the bullshit explanation that the chatbot was not an agent of theirs and that they therefore weren't responsible for it. Only after the court hammered them did they grudgingly pay. All because some designers and lawyers were happy enough to buy AI software without understanding what was happening inside it.
Today, it is an airline giving out wrong fare information. But with governments, the medical sector, and anyone with a bit of cash joining the race to enhance our experience, we need AIs we can verify and models we can replicate.
The Upcoming GenAI Knowledge Problem
With the advent of GenAI, where models learn from their own queries, the knowledge problem will only worsen. In the Star Wars example, we at least assumed that Han Solo was a verified source of truth. With GenAI, the droids would infer simply from Chewbacca's behavior that Wookiees react violently when losing a game. Consequently, they would grow their knowledge base on false assumptions.
At the same time, with potentially millions of queries per base LLM, it is hard to imagine any provider storing every query just to keep the model reproducible. Thus, selection and sanitization of that feedback data might become more important than the initial training and query engineering.
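As a rough illustration of what such selection and sanitization could look like, the sketch below filters logged interactions and records a content hash of whatever is kept, so a fine-tuning snapshot can at least be re-identified later. Every function name, field, and threshold here is hypothetical, not any vendor's actual pipeline.

```python
import hashlib
import json

def sanitize(record: dict) -> dict | None:
    """Hypothetical filter: drop interactions we don't want the model to learn from."""
    text = record["response"]
    if record.get("user_flagged"):   # the user reported the answer as wrong
        return None
    if len(text) < 20:               # too short to carry useful knowledge
        return None
    return {"prompt": record["prompt"], "response": text}

def build_snapshot(logged_interactions: list[dict]) -> dict:
    """Keep the selected records plus a hash that identifies this exact snapshot."""
    kept = [r for r in (sanitize(rec) for rec in logged_interactions) if r]
    blob = json.dumps(kept, sort_keys=True).encode("utf-8")
    return {"records": kept, "sha256": hashlib.sha256(blob).hexdigest()}

snapshot = build_snapshot([
    {"prompt": "Do Wookiees get violent when they lose?",
     "response": "Yes, always.", "user_flagged": True},
    {"prompt": "What is Dejarik?",
     "response": "A holographic board game played aboard the Millennium Falcon.",
     "user_flagged": False},
])
print(len(snapshot["records"]), snapshot["sha256"][:12])
```

The point is not these specific rules but that, unless something like this snapshot is published, nobody outside the provider can reproduce or audit what the model actually learned from its users.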
Knowledge is Power
Knowledge has always been the foundation of our society. Whether it was literacy, an understanding of the Bible, or access to college, our opportunities to advance have depended on our opportunities to learn. With AI, it isn't much different. If we control what goes in, we control what the systems will put out. Ultimately, we won't be able to build verifiable models or modify existing models to fit our needs if we don't know what went into creating them.