Automatic Dendrogram Slicing for Mixed-Type Data Clustering
Palazzo, Lucio
2025-01-01
Abstract
Clustering is one of the most ubiquitous unsupervised learning tasks, with applications in a wide variety of domains. Genomics data analysis is no exception, yet genomics datasets combine features of different natures (continuous and categorical) and are, therefore, of mixed type. Hierarchical clustering of a set of genomics observations requires pairwise distances or dissimilarities, and it returns a sequence of nested partitions represented as a tree graph. The choice of the dissimilarity or distance measure affects the resulting sequence of partitions. Furthermore, selecting the reference partition from the nested sequence is left to the user: this is done by setting a single threshold, that is, by cutting the tree horizontally. A permutation-test-based procedure has been proposed in the literature to select the final partition using more than one threshold (a non-horizontal cut). This paper introduces a novel top-down implementation of this permutation-test-based procedure to identify the final partition from a hierarchy of solutions. Different approaches to distance computation are considered to extend the procedure’s applicability to the mixed-data case.

File | Access | Type | License | Size | Format
---|---|---|---|---|---
s00357-025-09516-3.pdf | Authorized users only | Post-print document | Not public (private/restricted access) | 1.49 MB | Adobe PDF
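
The baseline workflow that the abstract contrasts with (mixed-type dissimilarities followed by a single horizontal cut) can be illustrated with a minimal sketch. It is not the paper's procedure: the Gower-style dissimilarity, the single-linkage criterion, and the names `gower` and `horizontal_cut` are assumptions chosen for illustration, and the permutation-test-based non-horizontal cut is not implemented here. The sketch relies on the fact that cutting a single-linkage tree at height `h` yields the connected components of the graph linking every pair of observations whose dissimilarity is at most `h`.

```python
from itertools import combinations


def gower(a, b, numeric_ranges):
    """Gower-style dissimilarity between two mixed-type records.

    numeric_ranges maps the index of each numeric feature to its range
    (max - min) over the dataset; all other features are treated as
    categorical and compared by simple matching. (Illustrative helper.)
    """
    total = 0.0
    for j, (x, y) in enumerate(zip(a, b)):
        if j in numeric_ranges:
            r = numeric_ranges[j]
            total += abs(x - y) / r if r > 0 else 0.0
        else:
            total += 0.0 if x == y else 1.0
    return total / len(a)


def horizontal_cut(records, numeric_cols, height):
    """Cluster labels from a single-linkage tree cut at one height.

    Assumption: single linkage, so the cut equals the connected
    components of the graph with edges where dissimilarity <= height.
    """
    n = len(records)
    ranges = {
        j: max(r[j] for r in records) - min(r[j] for r in records)
        for j in numeric_cols
    }
    parent = list(range(n))  # union-find forest over observations

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if gower(records[i], records[j], ranges) <= height:
            parent[find(i)] = find(j)

    # Relabel components as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Toy mixed-type data: one continuous and one categorical feature.
records = [(1.0, "A"), (1.2, "A"), (5.0, "B"), (5.3, "B")]
labels = horizontal_cut(records, numeric_cols=[0], height=0.5)
```

Changing `height` moves the single threshold up or down the tree, so every cluster is split or merged at the same level. A non-horizontal cut, as described in the abstract, would instead allow different branches to be cut at different heights.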
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.