AI Model Gives Data Owners Control
Source: wired.com
Researchers at the Allen Institute for AI (Ai2) have developed a new large language model that lets data owners control how their training data is used, even after the model is built. The model, called FlexOlmo, offers a potential alternative to the AI industry's standard practice of collecting data from wherever it can be found, with little regard for ownership, and then keeping full control of the resulting models.
Ali Farhadi, CEO of Ai2, says that extracting data from a trained model today is like trying to recover the eggs from a finished cake. Conventionally, he explains, data is either in or out: once training occurs, the owner loses control, and removing the data requires another expensive round of training.
Ai2's approach divides up training so that data owners retain control. Anyone contributing data to a FlexOlmo model first copies a publicly shared base model, the "anchor." They then train a second model on their own data, combine it with the anchor, and contribute the result back. The raw data never changes hands, and because the contribution remains a distinct sub-model, it can be extracted later. A magazine publisher, for instance, could contribute text and later remove its sub-model if needed; a rough sketch of the workflow appears below.
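The following Python sketch illustrates the contributor workflow described above. It is purely illustrative: the class and function names (`Expert`, `FlexModel`, `train_expert`) are hypothetical stand-ins, not Ai2's actual code, and the "training" here is a toy placeholder.

```python
from dataclasses import dataclass, field

@dataclass
class Expert:
    owner: str     # who contributed this sub-model
    weights: list  # placeholder for the trained parameters

@dataclass
class FlexModel:
    anchor: Expert                               # the publicly shared base model
    experts: dict = field(default_factory=dict)  # owner -> contributed Expert

    def contribute(self, expert: Expert) -> None:
        # Owners hand over only the trained sub-model, never the raw data.
        self.experts[expert.owner] = expert

    def withdraw(self, owner: str) -> None:
        # Each contribution stays a separate module, so it can be removed
        # later without retraining the rest of the model.
        self.experts.pop(owner, None)

def train_expert(owner: str, private_data: list, anchor: Expert) -> Expert:
    # Stand-in for local training; in FlexOlmo this would be a real
    # training run on the owner's data, starting from the anchor.
    toy_weights = [len(text) for text in private_data]
    return Expert(owner=owner, weights=toy_weights)

# A magazine publisher contributes text, then opts out after the fact.
model = FlexModel(anchor=Expert(owner="ai2-public", weights=[]))
contribution = train_expert("magazine", ["article one", "article two"], model.anchor)
model.contribute(contribution)
model.withdraw("magazine")  # no retraining of the combined model required
```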
Sewon Min, a research scientist at Ai2, notes that because the training is asynchronous, data owners can contribute on their own schedules, without coordinating with one another.
The FlexOlmo architecture is a "mixture of experts," a design that combines several sub-models into a larger one. Ai2's key innovation is a new scheme for representing the values in a model so that sub-models trained independently, on different data, can be merged when the final combined model runs. To test the approach, the FlexOlmo researchers created a dataset called Flexmix from sources including books and websites, built a model with 37 billion parameters, and compared it against others. The combined model outperformed each individual model and scored better than other model-merging approaches.
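To make the mixture-of-experts idea concrete, here is a minimal NumPy sketch of gated routing over separately built expert modules. The routing shown (a softmax gate over per-expert scores) is a generic textbook formulation, not Ai2's actual merging scheme; all names and dimensions are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden dimension

# Each expert is a separately built sub-model; a random linear map
# stands in for one here. In FlexOlmo, experts would come from
# different data owners and be merged without joint training.
experts = {name: rng.normal(size=(d, d)) for name in ("anchor", "news", "books")}

# The router scores each expert for a given input.
router = {name: rng.normal(size=d) for name in experts}

def forward(x: np.ndarray, active: dict) -> np.ndarray:
    names = list(active)
    scores = np.array([router[n] @ x for n in names])
    gates = np.exp(scores - scores.max())
    gates /= gates.sum()  # softmax over whichever experts are present
    outputs = np.stack([active[n] @ x for n in names])
    return gates @ outputs  # gate-weighted combination of expert outputs

x = rng.normal(size=d)
y_full = forward(x, experts)

# Opting out is just dropping that owner's expert; the gates simply
# renormalize over the experts that remain, with no retraining.
remaining = {k: v for k, v in experts.items() if k != "news"}
y_without_news = forward(x, remaining)
```

Because each expert stays a separate module behind the router, removing one contributor only changes which outputs get mixed, which is the property that lets data owners opt out after training.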
According to Farhadi, this offers a new way of training models, one that lets data owners opt out of the system later without causing significant damage.
Percy Liang, an AI researcher at Stanford, finds the Ai2 approach promising because it offers more modular control over data without requiring retraining. He also notes the importance of openness in the development process.
Farhadi and Min suggest that FlexOlmo could let AI firms tap sensitive private data in a more controlled way, since the data itself never needs to be disclosed. They caution, however, that it may be possible to reconstruct training data from a contributed sub-model, and that a technique such as differential privacy might be needed to guarantee data safety.
Ownership of the data used to train AI models has become a significant legal issue. Min suggests that FlexOlmo could enable a better kind of shared model, one that data owners co-develop without sacrificing data privacy or control.