A week ago, a friend asked me whether most current “open-source” Large Language Models (LLMs) are genuinely open-source. Since we couldn’t reach a conclusion, I put the question on social media and in front of some founders and co-workers in our networks. After a lot of good feedback, the question remains open, but the discussion gave me a much better understanding of the thinking and pitfalls behind recent open-source developments. So let us look at the problems that drove the question, the different approaches, and how to fix the underlying issue.
What is Open-Source About?
The Open-Source Definition lists ten essential rights a license must grant for it to be considered Open-Source. In AI, four of these terms are of particular importance.
- Term 1 requires the license to allow usage and redistribution without restrictions or royalty payments.
- Term 2 requires the source code to be available.
- Term 3 requires that anyone is free to create derivative works. Commonly, these include both new features and bug fixes.
- Term 7 requires that the license applies to anyone using the software.
For many, the “available without cost” aspect is at the forefront of open-source. For many Free and Open-Source Software developers, however, the community, the ability to create derivative works, and the freedom to analyze the code are the driving factors.
Before AI, this difference led to discussions in another sector: Video Games. For many Linux Distributions and the Free Software Foundation, a video game would only be acceptable if all its content was available for modification. Under the Open-Source Definition, however, non-computer programming assets, such as images and videos, could be distributed under a proprietary license.
The State of Open-Source AI Licensing
LLMs are a combination of machine learning code, data retrieval code, and a model built on training data. Thus, a differentiation between the surrounding code and the actual model, similar to game engines and assets, might be possible. Currently, however, none of the LLMs take this route.
While some AI models, such as Dolly, are fully open-source, others, like Falcon and Llama, have put restrictions on the usage of the model in their licenses. Falcon, for example, disallows selling API access to a shared, hosted model. Llama, in contrast, limits its use to smaller companies and projects.
Looking back at the Open-Source Definition, both would violate term 1, while Llama might also violate term 7.
Bottom Line or Valid Concern?
The question remains: why would companies choose these kinds of restrictions? First, AI is expensive, especially since training the base model requires considerable resources. Thus, the fact that we have any open-source or close-to-open-source models at all is remarkable. Further, in the recent debate about open-source and licensing, disappointed companies and developers have repeatedly pointed out how certain cloud service providers utilize open-source without being community players.
Thus, I can understand Meta and other big players who don’t want to shoulder the costs of developing an open-source AI that is subsequently utilized by companies that don’t contribute. Naturally, there is a revenue component here as well.
Another reason is a liability concern, especially for the models that don’t allow commercial utilization. After OpenAI released ChatGPT, several companies trained their models on ChatGPT output. However, doing so violates OpenAI’s Terms and Conditions. While the fair use doctrine might apply to research projects, it almost certainly would not apply to commercial usage.
Liability protection might also be behind the restrictions on Falcon. Cross-poisoning becomes a more significant concern when a model is utilized in a shared environment. Thus, disallowing such use in the terms might protect TII from lawsuits down the road.
Lastly, anticipated regulatory constraints could also drive the restrictions. There is a good chance that shared AI offerings will receive higher levels of scrutiny, and even when the developers themselves don’t offer such services, liability might flow upstream to them.
AI Strategy Needs to Align
Consequently, any company utilizing AI needs to understand the license restrictions, underlying concerns, and risks before building their strategy and choosing their model.
Regulatory concerns are the easiest to understand, yet currently, they are among the most fluid problems. Keeping track of where customers are and where technology suppliers and upstream projects reside isn’t just crucial for AI but should be part of any supply chain management.
Awareness of the license, possible changes, and who the developers and backers are goes hand in hand with the regulatory framework. After all, license terms rely on local laws for enforcement. Likewise, state-sponsored research institutes might have much better lobbying capabilities than even multinationals.
Lastly, be aware of how the inbound technology aligns with the corporate norms, values, and culture. If most of your employees are heavily involved in open-source and open-source advocacy, a semi-open license might not be acceptable to them.
What to Do to Change the Open-Source Stance?
As the field develops, I expect AI models to consolidate and reappear under different licenses. To ensure open-source has a fighting chance and more of the models become genuine open-source software, there are three concrete steps any of us can take:
- First, we should contribute bug fixes, documentation, and even features. If the burden of a model shifts from a single pair of shoulders to many, it becomes harder to justify restrictions.
- Second, share your derivative models. Setting an example and being open is a powerful sign.
- Third, share your success stories and highlight how triumph wouldn’t have been possible without open-source.
Is the AI Future Open-Source?
While it is hard to predict the future, three types of entities will control AI models: the largest companies, government-backed groups, and open-source communities. If we play well as open-source users and developers, AI will give all of us a fighting chance to drive society forward. Otherwise, “open-source AI” will soon turn into a veneer for selling yet another kind of shareware.