Commit 09c8b85a authored by Nishtha Jain

readme

parent 6cb22efc
# BIAS IN BIOS
This project was created by [Nishtha Jain](https://git.rwth-aachen.de/nishthajain1611) and [Sparsh Jauhari](https://git.rwth-aachen.de/sparsh.jauhari) under the guidance of our mentor [Markus Strohmaier](http://markusstrohmaier.info/) as a lab project for "Lab Course Data, Society and Algorithms",
Chair for Computational Social Sciences and Humanities, RWTH Aachen University.
In this project, we aim to find potential bias in the biographies of individuals ([BIOS.pkl](https://github.com/microsoft/biosbias)) and in the machine learning models trained on them to predict occupation.
## Installation and prerequisites
```bash
pip install -r requirements.txt
```
Add these directories
```bash
mkdir datasets models predicted_datasets word_embeddings
```
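The same directories can also be created from Python, which may be convenient on systems without a POSIX shell. This is a minimal sketch, not part of the repository: the helper name `ensure_project_dirs` is hypothetical, and only the directory names are taken from the `mkdir` command above.

```python
from pathlib import Path

# Directory names copied from the `mkdir` command in the README.
PROJECT_DIRS = ("datasets", "models", "predicted_datasets", "word_embeddings")


def ensure_project_dirs(root="."):
    """Create the working directories under `root`; existing ones are left untouched."""
    created = []
    for name in PROJECT_DIRS:
        path = Path(root) / name
        path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        created.append(path)
    return created


if __name__ == "__main__":
    for path in ensure_project_dirs():
        print(f"ready: {path}")
```

Calling the helper twice is harmless, so it can run at the start of any script that writes into these folders.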
Download the pre-trained Word2Vec embeddings into the `word_embeddings` directory.
```bash
cd word_embeddings
wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
gzip -d GoogleNews-vectors-negative300.bin.gz
```
To get the debiased version of the pretrained Word2Vec embeddings, use this [link](https://drive.google.com/file/d/1_PvT4ZvtZjhq4HPywA8-u06epht9ccOw/view?usp=sharing), or download them using [gdown](https://pypi.org/project/gdown/3.3.1/). Place them in the `word_embeddings` directory.
<br /> OR <br />
To get pretrained word2Vec embedding:
`cd word_embeddings` <br />
```bash
gdown https://drive.google.com/uc?id=0B5vZVlu2WoS5ZTBSekpUX0RSNDg
```
Download the debiased version of our self-trained Word2Vec embeddings into the `word_embeddings` directory.
```bash
gdown https://drive.google.com/uc?id=1Lj_RteEF_wAEsENZBLQYGBLMY99VfI5j
```
## File structure
>`config.py` - contains constants and paths
>
>`model.py` - contains model descriptions and training and prediction modules
>
>`preprocessing.py` - contains embeddings and preprocessing tasks
>
>`sampling.py` - contains data extraction and sampling tasks
>
>`train.py` - contains the runnable flow of the project
>
>`predict.py` - contains functions to predict the occupation of given bios
>
>`bios_bias.ipynb` - contains results, evaluation metrics and plots
## Usage
The runnable `train.py` can be used to train the models and run predictions on the test set, while
`predict.py` can be used for single predictions. All results, plots and evaluations can be seen in `bios_bias.ipynb`. `debiasing...............` takes care of the debiasing.
#### Training and Predictions
```bash
python train.py --no-load_data_from_saved --embedding_train --model_train --predict --masking --class_group medical --sampling balanced --embedding cv --model svm --test_size 0.2
```
>`--load_data_from_saved` - use previously saved data
>
>`--no-load_data_from_saved` - load new data
>
>`--embedding_train` - to train new embedding
>
>`--no-embedding_train` - to use the saved embedding
>
>`--model_train` - to train new model
>
>`--no-model_train` - to use the saved model
>
>`--predict` - to perform predictions on the test set
>
>`--no-predict` - to not perform predictions on the test set
>
>`--masking` - for 'bio' data
>
>`--no-masking` - for 'raw' data
>
>`--class_group` - choice of domain of occupations ('trial','medical')
>
>`--sampling` - choice of sampling / class weights ('random', 'balanced')
>
>`--embedding` - choice of embeddings to be used ('cv': count vectorize, 'w2v': word2vec, 'self_w2v': self-trained word2vec, 'elmo': elmo, 'd_w2v': debiased word2vec, 'd_self_w2v': debiased self-trained word2vec)
>
>`--model` - choice of models to be trained ('svm')
>
>`--test_size` - proportion of data to be used for testing
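The paired `--flag` / `--no-flag` options above map naturally onto `argparse`. The following is a hedged sketch of how such a parser could be declared, not the project's actual `train.py`: the helper `add_bool_flag`, the default values for the boolean flags, the use of `choices=`, and `type=float` for `--test_size` are all assumptions added for illustration. Without `type=float`, a value such as `0.2` passed on the command line arrives as a string.

```python
import argparse

# Embedding keys as listed in the README.
EMBEDDINGS = ("cv", "w2v", "self_w2v", "elmo", "d_w2v", "d_self_w2v")


def add_bool_flag(parser, name, default, help_on, help_off):
    """Register a --name / --no-name pair writing to a single destination."""
    parser.add_argument(f"--{name}", dest=name, action="store_true", help=help_on)
    parser.add_argument(f"--no-{name}", dest=name, action="store_false", help=help_off)
    parser.set_defaults(**{name: default})


def build_parser():
    parser = argparse.ArgumentParser(description="Illustrative mirror of the training options")
    # Defaults for the boolean pairs are assumptions, not taken from train.py.
    add_bool_flag(parser, "masking", True, "for 'bio' data", "for 'raw' data")
    add_bool_flag(parser, "predict", True,
                  "to perform predictions on the test set",
                  "to not perform predictions on the test set")
    parser.add_argument("--class_group", choices=("trial", "medical"), default="medical")
    parser.add_argument("--sampling", choices=("random", "balanced"), required=True)
    parser.add_argument("--embedding", choices=EMBEDDINGS, required=True)
    parser.add_argument("--model", choices=("svm",), default="svm")
    # type=float converts the command-line string into a number.
    parser.add_argument("--test_size", type=float, default=0.2)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--no-masking", "--sampling", "balanced", "--embedding", "cv", "--test_size", "0.3"]
    )
    print(args)
```

Declaring `choices` makes `argparse` reject unknown values (for example a misspelled embedding key) before any data is loaded.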
#### Single predictions
To predict the occupation for your own sentence using the trained models:
```bash
python predict.py --masking --sampling balanced --embedding cv --pred_all_models no
```
>`--masking` - for 'bio' data
>
>`--no-masking` - for 'raw' data
>
>`--sampling` - choice of sampling / class weights ('random', 'balanced')
>
>`--embedding` - choice of embeddings to be used ('cv': count vectorize, 'w2v': word2vec, 'self_w2v': self-trained word2vec, 'elmo': elmo, 'd_w2v': debiased word2vec, 'd_self_w2v': debiased self-trained word2vec)
>
>`--pred_all_models` - `yes` to predict with all models, `no` to predict only with the specified one
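To illustrate what `--masking` is likely aiming at: in the BIOS dataset the 'bio' field is a scrubbed biography without explicit gender indicators, while 'raw' is the unmodified text. The sketch below shows one plausible way to mask a free-text sentence before prediction. It is an assumption, not the project's own preprocessing: the function `mask_gender`, the term list, and the `_` placeholder are all hypothetical. The two example sentences are the sample inputs used in `predict.py`.

```python
import re

# Assumed, deliberately short list of gender indicators (illustrative only).
GENDERED_TERMS = ("he", "she", "him", "her", "his", "hers",
                  "himself", "herself", "mr", "mrs", "ms")

# \b anchors ensure only whole words match ("helps" or "shepherd" are untouched).
_PATTERN = re.compile(r"\b(?:" + "|".join(GENDERED_TERMS) + r")\b", re.IGNORECASE)


def mask_gender(text, placeholder="_"):
    """Replace whole-word gender indicators in `text` with `placeholder`."""
    return _PATTERN.sub(placeholder, text)


if __name__ == "__main__":
    for sentence in ["She works at the hospital", "He works at the hospital"]:
        print(mask_gender(sentence))
```

After masking, both sample sentences become identical, so any difference in the predicted occupation between the unmasked versions can be attributed to the gendered word alone.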
#### Debiasing of embeddings
........................................................write!!!!!!!!!!!!!!
@@ -38,9 +38,9 @@ if __name__ == "__main__":
parser.add_argument("--masking", dest='masking',action='store_true', help = "for 'bio' data")
parser.add_argument("--no-masking", dest='masking',action='store_false', help = "for 'raw' data")
parser.add_argument("--sampling", help = "choice of sampling ('random', 'balanced')")
parser.add_argument("--embedding", help = "choice of embeddings to be used ('cv': count_vectorize(self-trained), 'w2v': word2vec_embedding(pre-trained), 'self_w2v':w2v(self-trained), 'elmo':elmo(pre-trained))")
parser.add_argument("--pred_all_models", help = 'yes or no')
parser.add_argument("--sampling", help = "choice of sampling / class weights ('random', 'balanced')")
parser.add_argument("--embedding", help = "choice of embeddings to be used ('cv': count vectorize, 'w2v': word2vec, 'self_w2v': self-trained word2vec, 'elmo': elmo, 'd_w2v': debiased word2vec, 'd_self_w2v': debiased self-trained word2vec)")
parser.add_argument("--pred_all_models", help = 'yes for all or no for the specific one')
args = parser.parse_args()
X_test = ['She works at the hospital','He works at the hospital']
@@ -167,13 +167,6 @@ def elmo_transform(x_list,masking):
X.append(np.mean(embeddings['word_emb'],1).flatten())
return X
'''
def elmo_fit_transform(x_list):
elmo = hub.load("https://tfhub.dev/google/elmo/3")
X = elmo_transform(elmo,x_list)
return elmo , X
'''
@@ -49,12 +49,11 @@ load_data_from_saved -> True if saved data to be used and False if new data t
embedding_train -> True to train new embedding and False to use the saved one
model_train -> True to train new model and False to use the saved one
predict -> True to perform predictions on the test set and False otherwise
evaluate -> True to perform bias evaluations on the test set and False otherwise
class_group -> choice of domain of occupations ('trial','medical')
sampling -> choice of sampling ('random', 'balanced')
embedding -> choice of embeddings to be used ('cv': count_vectorize(self-trained), 'w2v': word2vec_embedding(pre-trained), 'self_w2v':w2v(self-trained))
model -> choice of models to be trained ('svm', 'rf', 'nn')
test_size -> proportion of data to be used for tesing
sampling -> choice of sampling/ class weights ('random', 'balanced')
embedding -> choice of embeddings to be used ('cv': count vectorize, 'w2v': word2vec, 'self_w2v': self-trained word2vec, 'elmo': elmo, 'd_w2v': debiased word2vec, 'd_self_w2v': debiased self-trained word2vec)
model -> choice of models to be trained ('svm')
test_size -> proportion of data to be used for testing
masking -> True for 'bio' data and False for 'raw' data
'''
if __name__ == "__main__":
@@ -66,9 +65,6 @@ if __name__ == "__main__":
parser = argparse.ArgumentParser()
# Adding optional argument
parser.add_argument('--feature', dest='feature', action='store_true')
parser.add_argument('--no-feature', dest='feature', action='store_false')
parser.set_defaults(feature=True)
parser.add_argument("--load_data_from_saved", dest = 'load_data_from_saved',action='store_true', help = "if saved data to be used ")
parser.add_argument("--no-load_data_from_saved", dest = 'load_data_from_saved',action='store_false', help = "if new data to be taken")
@@ -83,24 +79,22 @@ if __name__ == "__main__":
parser.set_defaults(model_train=True)
parser.add_argument("--predict", dest='predict',action='store_true', help = "to perform predictions on the test set")
parser.add_argument("--no-predict", dest='predict',action='store_false', help = "otherwise")
parser.add_argument("--no-predict", dest='predict',action='store_false', help = "to not perform predictions on the test set")
parser.set_defaults(predict=True)
parser.add_argument("--masking", dest='masking',action='store_true', help = "for 'bio' data")
parser.add_argument("--no-masking", dest='masking',action='store_false', help = "for 'raw' data")
parser.add_argument("--class_group", default='medical', required=True, help = "choice of domain of occupations ('trial','medical')")
parser.add_argument("--sampling", required=True, help = "choice of sampling ('random', 'balanced')")
parser.add_argument("--embedding", required=True, help = "choice of embeddings to be used ('cv': count_vectorize(self-trained), 'w2v': word2vec_embedding(pre-trained), 'self_w2v':w2v(self-trained), 'elmo':elmo(pre-trained))")
parser.add_argument("--model", default= 'svm', required=True, help = "choice of models to be trained ('svm', 'rf', 'nn')")
parser.add_argument("--test_size", default = 0.2, required=True, help = "proportion of data to be used for tesing")
parser.add_argument("--sampling", required=True, help = "choice of sampling / class weights ('random', 'balanced')")
parser.add_argument("--embedding", required=True, help = "choice of embeddings to be used ('cv': count vectorize, 'w2v': word2vec, 'self_w2v': self-trained word2vec, 'elmo': elmo, 'd_w2v': debiased word2vec, 'd_self_w2v': debiased self-trained word2vec)")
parser.add_argument("--model", default= 'svm', required=True, help = "choice of models to be trained ('svm')")
parser.add_argument("--test_size", default = 0.2, required=True, help = "proportion of data to be used for testing")
# Read arguments from command line
args = parser.parse_args()
if args.load_data_from_saved:
print("Displaying load_data_from_saved as: % s" % args.load_data_from_saved)
# Calling the main function
main(load_data_from_saved = args.load_data_from_saved,
embedding_train = args.embedding_train,
model_train = args.model_train,