Apertus: a fully open, transparent, multilingual language model

snikta@programming.dev · 3 months ago

A fully open-source LLM

As a fully open language model, Apertus allows researchers, professionals and enthusiasts to build upon the model and adapt it to their specific needs, as well as to inspect any part of the training process. This distinguishes Apertus from models that make only selected components accessible.

“With this release, we aim to provide a blueprint for how a trustworthy, sovereign, and inclusive AI model can be developed,” says Martin Jaggi, Professor of Machine Learning at EPFL and member of the Steering Committee of the Swiss AI Initiative. The model will be regularly updated by the development team which includes specialized engineers and a large number of researchers from CSCS, ETH Zurich and EPFL.

KubeRoot@discuss.tchncs.de · 3 months ago

Apertus was developed with due consideration to Swiss data protection laws, Swiss copyright laws, and the transparency obligations under the EU AI Act. Particular attention has been paid to data integrity and ethical standards: the training corpus builds only on data which is publicly available. It is filtered to respect machine-readable opt-out requests from websites, even retroactively, and to remove personal data, and other undesired content before training begins.

We probably won’t get better than this, but it sounds like it’s still being trained on scraped data unless you explicitly opt out, including anything that may be getting mirrored by third parties that don’t opt out. Also, they can remove data from the training material retroactively… But presumably they won’t be retraining the model from scratch, which means it will still have that data in its weights, and the official weights will still have a potential advantage over models trained later on their training data.

From the license:

SNAI will regularly provide a file with hash values for download which you can apply as an output filter to your use of our Apertus LLM. The file reflects data protection deletion requests which have been addressed to SNAI as the developer of the Apertus LLM. It allows you to remove Personal Data contained in the model output.

Oof, so they’re basically passing on data protection deletion requests to the users and telling them all to respectfully account for them.
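For illustration only: the license excerpt doesn’t say what the file looks like or how the hashes are computed, so everything below is an assumption (SHA-256 over lowercased text with surrounding punctuation stripped, one hex digest per line, matching on short word sequences). The function names are made up for this sketch. An output filter of that kind might look roughly like this:

```python
import hashlib
import string


def fingerprint(text: str) -> str:
    """Hash a normalized string (assumed scheme: lowercase, edge punctuation stripped)."""
    normalized = text.strip().strip(string.punctuation).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def load_blocklist(lines) -> set:
    """Read one hex digest per line (assumed file layout), skipping blank lines."""
    return {line.strip() for line in lines if line.strip()}


def redact(text: str, blocked: set, max_words: int = 3, mask: str = "[removed]") -> str:
    """Replace any run of up to max_words words whose fingerprint is blocked."""
    words = text.split()
    out = []
    i = 0
    while i < len(words):
        # Try the longest candidate first so multi-word names match as a unit.
        for n in range(min(max_words, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if fingerprint(candidate) in blocked:
                out.append(mask)
                i += n
                break
        else:
            # No candidate starting at this word matched; keep the word as-is.
            out.append(words[i])
            i += 1
    return " ".join(out)


# Hypothetical deletion request for the string "jane example".
blocked = load_blocklist([fingerprint("jane example")])
filtered = redact("Contact Jane Example, please.", blocked)
```

Whatever the real format is, this runs on the user’s side after generation, so the model itself is unchanged, which is the point made above.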

They also claim “open data”, but I’m having trouble finding the actual training data, only the “Training data reconstruction scripts”…

lime!@feddit.nu · edit-2 3 months ago

that’s the problem with deletion requests, the data isn’t in there. it can’t be, from a purely mathematical standpoint. statistically, with the amount of stuff that goes into training, any full work included in an llm is represented by less than one bit. but the model just… remakes sensitive information from scratch. it reconstructs infringing data based on patterns.

which of course highlights the big issue with data anonymization: it can’t really be done.
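a back-of-the-envelope version of that argument, as a sketch. the 8b and 70b sizes are the ones mentioned elsewhere in this thread; the 15 trillion training tokens and 16 bits per weight are assumptions for the sake of the arithmetic, not figures from this thread.

```python
def bits_per_token(n_params: float, bits_per_weight: float, n_tokens: float) -> float:
    """Average storage capacity of the weights per training token, in bits."""
    return n_params * bits_per_weight / n_tokens


ASSUMED_TOKENS = 15e12        # assumption: size of the training corpus in tokens
ASSUMED_BITS_PER_WEIGHT = 16  # assumption: 16-bit weights

small = bits_per_token(8e9, ASSUMED_BITS_PER_WEIGHT, ASSUMED_TOKENS)
large = bits_per_token(70e9, ASSUMED_BITS_PER_WEIGHT, ASSUMED_TOKENS)

print(f"8b:  {small:.4f} bits per training token")
print(f"70b: {large:.4f} bits per training token")
```

under these assumptions the budget is well under one bit per training token for both sizes; how that adds up for a whole work depends on how long the work is.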

dwt@feddit.org · 3 months ago

This raises the question: how good is it? Has anyone tested it yet?

hylaea@reddthat.com · 3 months ago

well i tested it and wasn’t impressed. my prompts were about Python 3: i said i’m working on a script and asked if i could show the AI my code so we could work together on it. it didn’t wait for my input, just gave me an endless python tutorial. i said wait till i show you my code but nope…

dwt@feddit.org · 3 months ago

Was that the small or the large variant?

hylaea@reddthat.com · 3 months ago

there are different ones?

dwt@feddit.org · 3 months ago

8b and 70b nets https://huggingface.co/collections/swiss-ai/apertus-llm-68b699e65415c231ace3b059

From 8b I wouldn’t expect much besides basic language skills, but 70b might be better than ChatGPT 3.5

hylaea@reddthat.com · 3 months ago

Thanks for the clarification. I simply tried https://chat.publicai.co/.

lascapi@jlai.lu · 3 months ago

Sounds good!

Is it the first LLM that is open like that (architecture, model weights, and training data and recipes)?

witty_username@feddit.nl · 3 months ago

But can it send me into a psychotic rage?

hperrin@lemmy.ca · 3 months ago

Is this an AI community? I thought this was about software.

icelimit@lemmy.ml · edit-2 3 months ago

No it’s about open source. Hardware can also be open source. As can AI.

hperrin@lemmy.ca · 3 months ago

Hmm. That’s a shame. I wonder if there’s a community that’s just about open source code, not all of the other things people feel like labeling open source.

lime!@feddit.nu · 3 months ago

…what? open source is a standardized term defined by the OSI. it’s a licencing term.

Sonalder@lemmy.ml · 3 months ago

You should check out some FOSS/FLOSS communities. Free(dom)/Libre and Open Source Software is more important than Open Source Software itself, in my honest opinion.

Open Source can be applied to hardware, medicine, lessons, construction blueprints, schematics, not only code.

Otter@lemmy.ca · 3 months ago

You might like [email protected], which is specific to software

hperrin@lemmy.ca · 3 months ago

Thank you. :)

Sonalder@lemmy.ml · edit-2 3 months ago

Open Source is a way of creating things, and it is not limited to software; it applies to many different things. LLMs are software. Most “open source” LLMs use open-washing to label themselves Open Source when they are not. What matters in Open Source is being able to study how something was made, and most open models have closed training datasets and training methods. Apertus is truly open in the sense that they published Open Data and full training details.

You have the right to be bothered by “AI”, but let Open Source enthusiasts be… well, enthusiasts when, in a field of open-washing, someone creates something truly Open Source, to the point of sharing it in an Open Source community on a FOSS platform.

Edit : Minor corrections

Zerush@lemmy.ml · edit-2 2 months ago

You can find it on Hugging Face.

Apertus is designed with transparency at its core, thereby ensuring full reproducibility of the training process. Alongside the models, the research team has published a range of resources: comprehensive documentation and source code of the training process and datasets used, model weights including intermediate checkpoints – all released under a permissive open-source license, which also allows for commercial use. The terms and conditions are available via Hugging Face. Apertus was developed with due consideration to Swiss data protection laws, Swiss copyright laws, and the transparency obligations under the EU AI Act. Particular attention has been paid to data integrity and ethical standards: the training corpus builds only on data which is publicly available. It is filtered to respect machine-readable opt-out requests from websites, even retroactively, and to remove personal data, and other undesired content before training begins.

You can use it here (optional free account).

Review:

Apertus truly delivers on its transparency promises, representing one of the most open and transparent LLM projects to date. The philosophy has been “open at every level,” backed by concrete actions that set new standards for AI transparency.
