
EmbeddingFramework

This repository contains an embedding framework to evaluate an RDF embedding technique on machine learning (ML) and semantic tasks. The implemented tasks are:

  • Machine Learning
    • Classification
    • Regression
    • Clustering
  • Semantic tasks
    • Entity Relatedness
    • Document similarity
    • Semantic analogies

How to run the code?

Environment:

  • Python version: Python 2.7.3
  • Libraries (output of pip freeze in my virtual environment):
    • certifi==2018.4.16
    • chardet==3.0.4
    • idna==2.7
    • numpy==1.14.0
    • pandas==0.22.0
    • python-dateutil==2.7.3
    • pytz==2018.5
    • requests==2.19.1
    • scikit-learn==0.19.2
    • scipy==1.1.0
    • six==1.11.0
    • sklearn==0.0
    • urllib3==1.23

Parameters:

  • --vectors_file: path of the file where your vectors are stored (mandatory). File format: one line per entity, containing the entity followed by its vector.
  • --vectors_size: length of each vector (default: 200).
  • --top_k: used in Semantic Analogies; the predicted vector is compared with the top k closest vectors to establish whether the prediction is correct (default: 2).

To run the code, execute main.py providing at least --vectors_file as a parameter.
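For example (the vectors file name below is illustrative):

```shell
# Run all tasks with default settings on an example vectors file
python main.py --vectors_file vectors.txt --vectors_size 200 --top_k 2
```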

Note: The tasks can be executed sequentially or in parallel. If the code raises a MemoryError, the tasks need more memory than is available; in that case, run all the tasks sequentially.
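The sequential/parallel switch can be pictured as follows. This is an illustrative sketch only, not the actual evaluator_manager implementation; `run_tasks` is a hypothetical helper:

```python
from multiprocessing import Process

def run_tasks(tasks, parallel=False):
    """Run a list of task callables either in parallel processes
    or one after the other (illustrative sketch, not the real code)."""
    if parallel:
        # Each task runs in its own process; memory usage adds up,
        # which is why a MemoryError may occur in this mode.
        procs = [Process(target=t) for t in tasks]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
    else:
        # Sequential fallback: one task at a time, lower peak memory.
        for t in tasks:
            t()
```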

Project structure

main.py instantiates the distance function used to measure how far apart two vectors are and the analogy function used in the Semantic Analogies task. It manages the parameters and instantiates the evaluator manager.

evaluator_manager.py reads the vectors file, runs all the tasks sequentially or in parallel, and creates the output directory, naming it results__.

Each task lives in a separate folder, and each consists of: a manager that supervises the work and organizes the output, a data_manager that reads the files used as gold standard and merges them with the actual vectors, and a model that computes the task and provides the output to the manager.

How to customize the distance and analogy functions

You have to redefine your own main function.

You can use any of the distance metrics accepted by scipy.spatial.distance.cdist.
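For example, a cosine distance between two sets of vectors can be computed with cdist (the data below is illustrative):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two small sets of 3-dimensional vectors (illustrative data).
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
B = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])

# Any metric accepted by cdist works here, e.g. "cosine" or "euclidean".
distances = cdist(A, B, metric="cosine")
# distances[i, j] is the cosine distance between A[i] and B[j];
# identical directions give 0.0, orthogonal vectors give 1.0.
```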

Your analogy function has to take:

  • 3 vectors (or matrices of vectors) used to forecast the fourth vector,
  • the index (or indices) of these vectors in the data matrix,
  • the data matrix that contains all the vectors,
  • the top_k, i.e., the number of vectors you want to use to check if the predicted vector is close to one in your dataset,

and it must return the indices of the top_k closest vectors to the predicted one.
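A minimal sketch of such an analogy function, assuming the common b − a + c prediction rule and cosine distance (the function name, signature details, and toy data are illustrative, not the framework's actual code):

```python
import numpy as np
from scipy.spatial.distance import cdist

def analogy_function(a, b, c, index_a, index_b, index_c, data, top_k):
    """Predict d = b - a + c and return, for each query row, the
    indices of the top_k vectors in `data` closest to the prediction.
    Illustrative sketch only; the framework's signature may differ."""
    a, b, c = np.atleast_2d(a), np.atleast_2d(b), np.atleast_2d(c)
    predicted = b - a + c                        # one predicted vector per row
    dist = cdist(predicted, data, metric="cosine")
    # Sort each row by distance and keep the top_k closest indices.
    return np.argsort(dist, axis=1)[:, :top_k]

# Toy data matrix with 4 two-dimensional vectors.
data = np.array([[1.0, 0.0],   # index 0
                 [0.0, 1.0],   # index 1
                 [1.0, 1.0],   # index 2
                 [0.9, 0.1]])  # index 3

# Predict data[2] - data[0] + data[1] = [0, 2] and find its 2 nearest vectors.
idx = analogy_function(data[0], data[2], data[1], 0, 2, 1, data, top_k=2)
```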