Automatic Dendrogram Slicing for Mixed-Type Data Clustering


Palazzo, Lucio; Iodice D'Enza, Alfonso; Vistocco, Domenico; Palumbo, Francesco

2025-01-01

Abstract

Clustering is one of the most widely used unsupervised learning tasks, with applications across many domains. Genomics data analysis is no exception, yet genomics datasets combine features of different natures (continuous and categorical) and are therefore of mixed type. Hierarchical clustering of a set of genomics observations requires pairwise distances or dissimilarities and returns a sequence of nested partitions represented as a tree graph (dendrogram). The choice of dissimilarity or distance measure affects the resulting sequence of partitions. Furthermore, selecting the reference partition from the nested sequence is left to the user: this is done by setting a threshold, that is, by cutting the tree horizontally. A permutation test-based procedure has been proposed in the literature to select the final partition using more than one threshold (a non-horizontal cut). This paper introduces a novel top-down implementation of this permutation test-based procedure to identify the final partition from a hierarchy of solutions. Different approaches to distance computation are considered to extend the procedure’s applicability to mixed data.
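To make the baseline setting concrete, the following is a minimal sketch of hierarchical clustering on mixed-type data followed by a conventional horizontal cut. It does not reproduce the permutation test-based procedure described in the abstract. The Gower-style dissimilarity, the average linkage, the toy data, the threshold of 0.5, and the helper name `gower_matrix` are all illustrative assumptions, not choices confirmed by the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def gower_matrix(num, cat):
    """Gower-style dissimilarity (hypothetical helper, for illustration).

    Numeric columns contribute range-scaled absolute differences,
    categorical columns contribute a 0/1 mismatch, and the result is
    the unweighted average over all features.
    """
    num = np.asarray(num, dtype=float)
    cat = np.asarray(cat)
    ranges = num.max(axis=0) - num.min(axis=0)
    ranges[ranges == 0] = 1.0  # avoid division by zero for constant columns
    d_num = np.abs(num[:, None, :] - num[None, :, :]) / ranges
    d_cat = (cat[:, None, :] != cat[None, :, :]).astype(float)
    n_features = num.shape[1] + cat.shape[1]
    return (d_num.sum(axis=2) + d_cat.sum(axis=2)) / n_features


# Toy mixed-type data (assumed): two continuous features, one categorical.
num = [[1.0, 0.2], [1.1, 0.1], [0.9, 0.3],
       [5.0, 2.0], [5.2, 2.1], [4.8, 1.9]]
cat = [["A"], ["A"], ["A"], ["B"], ["B"], ["B"]]

D = gower_matrix(num, cat)

# linkage expects the condensed (upper-triangular) form of the matrix.
Z = linkage(squareform(D, checks=False), method="average")

# Horizontal cut: every merge above the height threshold is undone.
labels = fcluster(Z, t=0.5, criterion="distance")
print(labels)
```

A single threshold `t` applies the same cut height to every branch of the tree. The procedure described in the abstract instead allows different branches to be cut at different heights.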


Year

2025

Appears in categories:

1.1 Journal article

Files in this item:

File	Size	Format
s00357-025-09516-3.pdf (authorized users only; type: post-print document; license: not public, private/restricted access)	1.49 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11574/247605
