updated: 2022-10-12
#tools/umap
UMAP is a great tool to visualize a high dimensional data. It reduces the dimension by preserving local proximity of data points.
The bottleneck of UMAP is the construction of the k-nearest neighbor graph. Luckly, UMAP can accept pre-computed k-nearest neighbor graph, and we have a faster library for the k-nearest neighbor graph, faiss.
First of all, let's find the k-nearest neighbors using faiss.
import faiss
def find_knn(emb, num_neighbors, metric="cosine", device=None):
if metric == "cosine":
index = faiss.IndexFlatIP(emb.shape[1])
else:
index = faiss.IndexFlatL2(emb.shape[1])
if device is None:
index.add(emb.astype(np.float32))
distances, indices = index.search(emb.astype(np.float32), k=num_neighbors)
else:
try:
gpu_id = int(device[-1])
res = faiss.StandardGpuResources()
index = faiss.index_cpu_to_gpu(res, gpu_id, index)
index.add(emb.astype(np.float32))
distances, indices = index.search(
emb.astype(np.float32), k=num_neighbors
)
except RuntimeError:
if metric == "cosine":
index = faiss.IndexFlatIP(emb.shape[1])
else:
index = faiss.IndexFlatL2(emb.shape[1])
index.add(emb.astype(np.float32))
distances, indices = index.search(
emb.astype(np.float32), k=num_neighbors
)
return indices, distances
indices, distances = find_knn(emb, num_neighbors = 5, metric = "cosine", device="cuda:0") # if you don't have gpu, set device='cpu'
Then, pass the identified nearest neighbors to UMAP
import umap
from pynndescent import NNDescent
index = NNDescent(emb[0, :].reshape((-1,1))) # We don't need the index but have to give something. So make up an index here.
knn = (indices, distances, index) #data about the knn graph
# UMAPing
xy = umap.UMAP(n_neighbors=10, min_dist=0.1,precomputed_knn=knn).fit_transform(emb)