flairNLP flair Mode File Loader clustering.py ClusteringModel code injection
Description
A vulnerability classified as critical was found in flairNLP flair 0.14.0. It affects the ClusteringModel class in the file flair\models\clustering.py of the component Mode File Loader. Manipulating a model file leads to code injection. The attack can be launched remotely. Attack complexity is rather high, and exploitation is reported to be difficult. The exploit has been disclosed to the public and may be used. The vendor was contacted early about this disclosure but did not respond in any way.
AI Insight
LLM-synthesized narrative grounded in this CVE's description and references.
CVE-2024-10073: Critical code injection vulnerability in flairNLP/flair 0.14.0 via the ClusteringModel class in flair/models/clustering.py.
Vulnerability
CVE-2024-10073 is a critical code injection vulnerability in flairNLP/flair version 0.14.0. The flaw resides in the ClusteringModel class in flair/models/clustering.py, which the advisory assigns to the "Mode File Loader" component. The code removed by the fix commit shows that the static method ClusteringModel.load returns pickle.loads(joblib.load(str(model_file))), so the contents of a model file are deserialized with pickle. A crafted model file can therefore inject arbitrary code when it is loaded, leading to remote code execution [1][2].
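The removed `load` method in the fix-commit diff ends with a call to `pickle.loads`. The following is a minimal, deliberately harmless sketch of why that pattern is risky on untrusted input; it is not the published PoC. The class name `Payload` is hypothetical, and `operator.add` stands in for whatever callable an attacker might choose. A pickled object can name any importable callable through `__reduce__`, and the unpickler calls it while loading.

```python
import operator
import pickle


class Payload:
    """Hypothetical object whose pickled form names a callable to run on load."""

    def __reduce__(self):
        # pickle stores (callable, args); the unpickler calls callable(*args).
        # operator.add is a harmless stand-in for an attacker-chosen function.
        return (operator.add, (40, 2))


# Bytes such as these could sit inside a crafted model file.
blob = pickle.dumps(Payload())

# Loading does not rebuild a Payload instance. It calls operator.add(40, 2)
# and hands back whatever that call returns.
result = pickle.loads(blob)
print(type(result).__name__, result)
```

The point of the sketch is that the caller of `pickle.loads` has no say in which function runs: the choice is encoded in the bytes being loaded.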
Exploitation
This vulnerability is exploitable remotely, meaning an attacker does not need local access to the system. However, the attack complexity is described as rather high, and exploitability is considered difficult [2]. The exploit has been publicly disclosed, which increases the risk of actual exploitation. The exact prerequisites for exploitation, such as authentication requirements or specific network positioning, are not detailed in the publicly available information. The attack is classified as critical due to the potential for code injection [1][2].
Impact
Successful exploitation allows an attacker to execute arbitrary code on the affected system. Because the injected code runs inside the process that loads the model file, the attacker could potentially gain full control over the vulnerable server, compromise data, or use it as a pivot point for further attacks. The severity is rated as critical [2].
Mitigation
The vendor, flairNLP, was contacted early about the disclosure but did not respond. The project later addressed the issue by removing the clustering module in version 0.15.0 [3][4]. The release notes for v0.15.0 explicitly state: "To acknowledge CVE-2024-10073, we decided to drop support for the flair.models.clustering module" [4]. Users should therefore upgrade to flair version 0.15.0 or later; the commit that removes clustering support has been merged [3]. Because the class was removed rather than patched, code that imports ClusteringModel will stop working after the upgrade. No other workarounds have been published.
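One way to check whether an environment already has the fixed release is sketched below. The helper names `parse_release`, `is_patched`, and `installed_flair_is_patched` are hypothetical, and the version parsing is intentionally simple: it reads only the leading numeric components, so a pre-release such as `0.15.0rc1` would be treated like `0.15.0`.

```python
import re
from importlib import metadata

PATCHED = (0, 15, 0)  # first release without flair.models.clustering


def parse_release(version: str) -> tuple:
    """Return the leading numeric components of a version string as a tuple."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def is_patched(version: str) -> bool:
    """True if the version is 0.15.0 or later (simplified comparison)."""
    release = parse_release(version)
    # Pad short versions so that "0.15" compares equal to "0.15.0".
    release += (0,) * (len(PATCHED) - len(release))
    return release >= PATCHED


def installed_flair_is_patched():
    """Return True or False for the installed flair, or None if it is absent."""
    try:
        return is_patched(metadata.version("flair"))
    except metadata.PackageNotFoundError:
        return None


print("flair patched:", installed_flair_is_patched())
```

A stricter check could use a full version parser instead of the regular expression; the tuple comparison here is only meant to distinguish releases before and after 0.15.0.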
AI Insight generated on May 20, 2026. Synthesized from this CVE's description and the cited reference URLs; citations are validated against the source bundle.
Affected packages
Versions sourced from the GitHub Security Advisory.
| Package | Affected versions | Patched versions |
|---|---|---|
| flair (PyPI) | < 0.15.0 | 0.15.0 |
Affected products (2)
Patches (1)
fb27c7eb1d92 Merge pull request #3567 from flairNLP/remove_clustering
3 files changed · +0 −302
flair/models/clustering.py +0 −120 removed
@@ -1,120 +0,0 @@
-import logging
-import pickle
-from collections import OrderedDict
-from pathlib import Path
-from typing import Optional, Union
-
-import joblib
-from sklearn.base import BaseEstimator, ClusterMixin
-from sklearn.metrics import normalized_mutual_info_score
-from tqdm import tqdm
-
-from flair.data import Corpus, _iter_dataset
-from flair.datasets import DataLoader
-from flair.embeddings import DocumentEmbeddings
-
-log = logging.getLogger("flair")
-
-
-class ClusteringModel:
-    """A wrapper class to apply sklearn clustering models on DocumentEmbeddings."""
-
-    def __init__(self, model: Union[ClusterMixin, BaseEstimator], embeddings: DocumentEmbeddings) -> None:
-        """Instantiate the ClusteringModel.
-
-        Args:
-            model: the clustering algorithm from sklearn this wrapper will use.
-            embeddings: the flair DocumentEmbedding this wrapper uses to calculate a vector for each sentence.
-        """
-        self.model = model
-        self.embeddings = embeddings
-
-    def fit(self, corpus: Corpus, **kwargs):
-        """Trains the model.
-
-        Args:
-            corpus: the flair corpus this wrapper will use for fitting the model.
-            **kwargs: parameters propagated to the models `.fit()` method.
-        """
-        X = self._convert_dataset(corpus)
-
-        log.info("Start clustering " + str(self.model) + " with " + str(len(X)) + " Datapoints.")
-        self.model.fit(X, **kwargs)
-        log.info("Finished clustering.")
-
-    def predict(self, corpus: Corpus):
-        """Predict labels given a list of sentences and returns the respective class indices.
-
-        Args:
-            corpus: the flair corpus this wrapper will use for predicting the labels.
-        """
-        X = self._convert_dataset(corpus)
-        log.info("Start the prediction " + str(self.model) + " with " + str(len(X)) + " Datapoints.")
-        predict = self.model.predict(X)
-
-        for idx, sentence in enumerate(_iter_dataset(corpus.get_all_sentences())):
-            sentence.set_label("cluster", str(predict[idx]))
-
-        log.info("Finished prediction and labeled all sentences.")
-        return predict
-
-    def save(self, model_file: Union[str, Path]):
-        """Saves current model.
-
-        Args:
-            model_file: path where to save the model.
-        """
-        joblib.dump(pickle.dumps(self), str(model_file))
-
-        log.info("Saved the model to: " + str(model_file))
-
-    @staticmethod
-    def load(model_file: Union[str, Path]):
-        """Loads a model from a given path.
-
-        Args:
-            model_file: path to the file where the model is saved.
-        """
-        log.info("Loading model from: " + str(model_file))
-        return pickle.loads(joblib.load(str(model_file)))
-
-    def _convert_dataset(
-        self, corpus, label_type: Optional[str] = None, batch_size: int = 32, return_label_dict: bool = False
-    ):
-        """Makes a flair-corpus sklearn compatible.
-
-        Turns the corpora into X, y datasets as required for most sklearn clustering models.
-        Ref.: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.cluster
-        """
-        log.info("Embed sentences...")
-        sentences = []
-        for batch in tqdm(DataLoader(corpus.get_all_sentences(), batch_size=batch_size)):
-            self.embeddings.embed(batch)
-            sentences.extend(batch)
-
-        X = [sentence.embedding.cpu().detach().numpy() for sentence in sentences]
-
-        if label_type is None:
-            return X
-
-        labels = [sentence.get_labels(label_type)[0].value for sentence in sentences]
-        label_dict = {v: k for k, v in enumerate(OrderedDict.fromkeys(labels))}
-        y = [label_dict.get(label) for label in labels]
-
-        if return_label_dict:
-            return X, y, label_dict
-
-        return X, y
-
-    def evaluate(self, corpus: Corpus, label_type: str):
-        """This method calculates some evaluation metrics for the clustering.
-
-        Also, the result of the evaluation is logged.
-
-        Args:
-            corpus: the flair corpus this wrapper will use for evaluation.
-            label_type: the label from the sentence will be used for the evaluation.
-        """
-        X, Y = self._convert_dataset(corpus, label_type=label_type)
-        predict = self.model.predict(X)
-        log.info("NMI - Score: " + str(normalized_mutual_info_score(predict, Y)))
flair/models/__init__.py +0 −2 modified
@@ -1,4 +1,3 @@
-from .clustering import ClusteringModel
 from .entity_linker_model import SpanClassifier
 from .entity_mention_linking import EntityMentionLinker
 from .language_model import LanguageModel
@@ -37,6 +36,5 @@
     "TARSTagger",
     "TextClassifier",
     "TextRegressor",
-    "ClusteringModel",
     "MultitaskModel",
 ]
resources/docs/TUTORIAL_12_CLUSTERING.md +0 −180 removed
@@ -1,180 +0,0 @@
-Text Clustering in flair
-----------
-
-In this package text clustering is implemented. This module has the following
-clustering algorithms implemented:
-- k-Means
-- BIRCH
-- Expectation Maximization
-
-Each of the implemented algorithm needs to have an instanced DocumentEmbedding. This embedding will
-transform each text/document to a vector. With these vectors the clustering algorithm can be performed.
-
----------------------------
-
-k-Means
-------
-k-Means is a classical and well known clustering algorithm. k-Means is a partitioning-based Clustering algorithm.
-The user defines with the parameter *k* how many clusters the given data has.
-So the choice of *k* is very important.
-More about k-Means can be read on the official [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html).
-
-
-```python
-from flair.models import ClusteringModel
-from flair.datasets import TREC_6
-from flair.embeddings import SentenceTransformerDocumentEmbeddings
-from sklearn.cluster import KMeans
-
-embeddings = SentenceTransformerDocumentEmbeddings()
-
-# store all embeddings in memory which is required to perform clustering
-corpus = TREC_6(memory_mode='full').downsample(0.05)
-
-model = KMeans(n_clusters=6)
-
-clustering_model = ClusteringModel(
-    model=model,
-    embeddings=embeddings
-)
-
-# fit the model on a corpus
-clustering_model.fit(corpus)
-
-# evaluate the model on a corpus with the given label
-clustering_model.evaluate(corpus, label_type="question_class")
-```
-
-BIRCH
----------
-BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is a hierarchical clustering algorithm.
-BIRCH is specialized to handle large amounts of data. BIRCH scans the data a single time and builds an internal data
-structure. This data structure contains the data but in a compressed way.
-More about BIRCH can be read on the official [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.Birch.html).
-
-```python
-from sklearn.cluster import Birch
-from flair.datasets import TREC_6
-from flair.embeddings import SentenceTransformerDocumentEmbeddings
-from flair.models import ClusteringModel
-
-embeddings = SentenceTransformerDocumentEmbeddings()
-
-# store all embeddings in memory which is required to perform clustering
-corpus = TREC_6(memory_mode='full').downsample(0.05)
-
-model = Birch(n_clusters=6)
-
-clustering_model = ClusteringModel(
-    model=model,
-    embeddings=embeddings
-)
-
-# fit the model on a corpus
-clustering_model.fit(corpus)
-
-# evaluate the model on a corpus with the given label
-clustering_model.evaluate(corpus, label_type="question_class")
-```
-
-
-Expectation Maximization
---------------------------
-Expectation Maximization (EM) is a different class of clustering algorithms called soft clustering algorithms.
-Here each point isn't directly assigned to a cluster by a hard decision.
-Each data point has a probability to which cluster the data point belongs. The Expectation Maximization (EM)
-algorithm is a soft clustering algorithm.
-More about EM can be read on the official [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html).
-
-
-```python
-from sklearn.mixture import GaussianMixture
-from flair.datasets import TREC_6
-from flair.embeddings import SentenceTransformerDocumentEmbeddings
-from flair.models import ClusteringModel
-
-embeddings = SentenceTransformerDocumentEmbeddings()
-
-# store all embeddings in memory which is required to perform clustering
-corpus = TREC_6(memory_mode='full').downsample(0.05)
-
-model = GaussianMixture(n_components=6)
-
-clustering_model = ClusteringModel(
-    model=model,
-    embeddings=embeddings
-)
-
-# fit the model on a corpus
-clustering_model.fit(corpus)
-
-# evaluate the model on a corpus with the given label
-clustering_model.evaluate(corpus, label_type="question_class")
-```
-
----------------------------
-
-Loading/Saving the model
------------
-
-The model can be saved and loaded. The code below shows how to save a model.
-```python
-from flair.models import ClusteringModel
-from flair.datasets import TREC_6
-from flair.embeddings import SentenceTransformerDocumentEmbeddings
-from sklearn.cluster import KMeans
-
-embeddings = SentenceTransformerDocumentEmbeddings()
-
-# store all embeddings in memory which is required to perform clustering
-corpus = TREC_6(memory_mode='full').downsample(0.05)
-
-model = KMeans(n_clusters=6)
-
-clustering_model = ClusteringModel(
-    model=model,
-    embeddings=embeddings
-)
-
-# fit the model on a corpus
-clustering_model.fit(corpus)
-
-# save the model
-clustering_model.save(model_file="clustering_model.pt")
-```
-
-The code for loading a model.
-
-````python
-# load saved clustering model
-model = ClusteringModel.load(model_file="clustering_model.pt")
-
-# load a corpus
-corpus = TREC_6(memory_mode='full').downsample(0.05)
-
-# predict the corpus
-model.predict(corpus)
-````
-
----------------------
-
-Evaluation
----------
-The result of the clustering can be evaluated. For this we will use the
-[NMI](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.normalized_mutual_info_score.html).
-(Normalized Mutual Info) score.
-
-````python
-# need to fit() the model first
-# evaluate the model on a corpus with the given label
-clustering_model.evaluate(corpus, label_type="question_class")
-````
-
-The result of the evaluation can be seen below with the SentenceTransformerDocumentEmbeddings:
-
-
-| Clustering Algorithm | Dataset | NMI |
-|--------------------------|:-------------:|--------:|
-| k Means | StackOverflow | ~0.2122 |
-| BIRCH | StackOverflow | ~0,2424 |
-| Expectation Maximization | 20News group | ~0,2222 |
Vulnerability mechanics
Generated on May 9, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.
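The mechanics can be illustrated by contrasting plain `pickle.loads`, as used in the removed `ClusteringModel.load`, with an unpickler that only resolves an allow-list of globals. This is a general pattern described in the Python `pickle` documentation under "Restricting Globals", not a published workaround for this CVE and not part of flair; the names `AllowListUnpickler`, `restricted_loads`, and `Payload` are hypothetical, and the allow-list is far too small for real sklearn models.

```python
import io
import operator
import pickle
from collections import OrderedDict


class AllowListUnpickler(pickle.Unpickler):
    """Unpickler that refuses every global not explicitly allowed."""

    ALLOWED = {("collections", "OrderedDict")}  # illustrative allow-list only

    def find_class(self, module, name):
        # Called whenever the pickle stream references an importable global.
        if (module, name) in self.ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data: bytes):
    """Counterpart to pickle.loads() that enforces the allow-list."""
    return AllowListUnpickler(io.BytesIO(data)).load()


class Payload:
    """Hypothetical object whose pickle asks the loader to call a function."""

    def __reduce__(self):
        return (operator.add, (40, 2))  # harmless stand-in callable


# Data built only from allowed globals and primitives still loads.
restored = restricted_loads(pickle.dumps(OrderedDict(a=1)))

# A stream that references any other callable is rejected before it runs.
try:
    restricted_loads(pickle.dumps(Payload()))
    blocked = False
except pickle.UnpicklingError:
    blocked = True

print(restored, blocked)
```

In the removed code the outer `joblib.load` call is itself pickle-based, so restricting only the inner `pickle.loads` would not have been sufficient; this is consistent with the project's choice to drop the module instead.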
References (8)
- github.com/bayuncao/vul-cve-20/blob/main/PoC.py (ghsa, exploit, WEB)
- github.com/advisories/GHSA-9rw2-jf8x-cgwm (ghsa, ADVISORY)
- nvd.nist.gov/vuln/detail/CVE-2024-10073 (ghsa, ADVISORY)
- vuldb.com (ghsa, third-party-advisory, WEB)
- github.com/flairNLP/flair/commit/fb27c7eb1d92855c27db820a108b17883a5d6fc1 (ghsa, WEB)
- github.com/flairNLP/flair/releases/tag/v0.15.0 (ghsa, WEB)
- vuldb.com (ghsa, signature, permissions-required, WEB)
- vuldb.com (ghsa, vdb-entry, technical-description, WEB)