Train
JSONL Format
The text field contains the text sample for the training case. The label field must be an int giving the ID of the label that the text belongs to.
{"text": "Sample train case ", "label": 0}
{"text": "Another sample train case ", "label": 1}
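Because a malformed record is easier to catch before training starts, it can help to check a file against the format described above. The following is a minimal sketch using only the standard library; validate_jsonl_lines is a hypothetical helper written for this illustration and is not part of erictransformer.

```python
import json

def validate_jsonl_lines(lines):
    """Return a list of (line_number, problem) tuples for malformed records.

    Hypothetical helper for illustration; not part of erictransformer.
    """
    problems = []
    for line_number, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue  # ignore blank lines
        try:
            record = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems.append((line_number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((line_number, "record is not a JSON object"))
            continue
        if not isinstance(record.get("text"), str):
            problems.append((line_number, '"text" must be a string'))
        label = record.get("label")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(label, bool) or not isinstance(label, int):
            problems.append((line_number, '"label" must be an int'))
    return problems

sample = [
    '{"text": "Sample train case", "label": 0}',
    '{"text": "Another sample train case", "label": "1"}',  # label is a string
]
problems = validate_jsonl_lines(sample)
print(problems)
```

To check a file on disk, pass the open file object instead of a list, for example validate_jsonl_lines(open("data.jsonl")).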
Label IDs
You can see which labels and IDs are available for your model with the following code:
from erictransformer import EricTextClassification
eric_tc_labels = EricTextClassification(model_name="bert-base-uncased", labels=["custom_0", "custom_1", "custom_3"])
print(eric_tc_labels.config.id2label) # {0: 'custom_0', 1: 'custom_1', 2: 'custom_3'}
print(eric_tc_labels.config.label2id) # {'custom_0': 0, 'custom_1': 1, 'custom_3': 2}
# When fine-tuning, provide the ID 0 for 'custom_0', 1 for 'custom_1' and 2 for 'custom_3'.
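If your raw data stores label names rather than IDs, the names need to be converted before the JSONL file is written. The sketch below assumes a label2id dictionary with the same contents as the one printed above; it is hard-coded here so the example runs without loading a model, and to_jsonl_record is a hypothetical helper, not an erictransformer function.

```python
import json

# Assumption: copied from the label2id output shown above.
label2id = {"custom_0": 0, "custom_1": 1, "custom_3": 2}

def to_jsonl_record(text, label_name, label2id):
    """Serialize one example as a JSONL line, replacing the label name with its int ID."""
    if label_name not in label2id:
        raise KeyError(f"unknown label: {label_name!r}")
    return json.dumps({"text": text, "label": label2id[label_name]})

named_examples = [
    ("Sample train case", "custom_0"),
    ("Another sample train case", "custom_3"),
]
jsonl_lines = [to_jsonl_record(text, name, label2id) for text, name in named_examples]
print("\n".join(jsonl_lines))
```

Raising on an unknown label name is a deliberate choice in this sketch: a silent default would put a wrong ID into the training data.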
train()
Inputs:
-
train_path (string) (required): a path to a train file or to a directory that contains train files.
-
eval_path (string) (optional): a path to an eval file or to a directory that contains eval files.
-
args (EricTrainArgs) (optional): a dataclass with the arguments found here.
import json
from erictransformer import EricTextClassification, EricTrainArgs
# Load a classifier with three labels, which map to the IDs 0, 1 and 2.
eric_tc = EricTextClassification(model_name="bert-base-uncased", labels=["LABEL_0", "LABEL_1", "LABEL_3"])
args = EricTrainArgs(out_dir="eric_transformer", lr=2e-5)
train_data = [
    {"text": "Train data 0", "label": 0},
    {"text": "Train data 1", "label": 1},
    {"text": "Train data 2", "label": 2},
]
# Write one JSON object per line (JSONL).
with open("data.jsonl", "w") as f:
    for td in train_data:
        f.write(json.dumps(td) + "\n")
# The same file is reused for evaluation here only to keep the example short.
result = eric_tc.train(train_path="data.jsonl", eval_path="data.jsonl", args=args)
print(result)
You can view the output in the eric_transformer/ directory.
eval()
Inputs:
-
train_path (string) (required): Same as train()'s train_path parameter.
-
args (EricEvalArgs) (optional): a dataclass with these arguments.
import json
from erictransformer import EricTextClassification, EricEvalArgs
eric_tc = EricTextClassification(model_name="bert-base-uncased")
args = EricEvalArgs(out_dir="eric_transformer")
eval_data = [
    {"text": "Eval data 0", "label": 0},
    {"text": "Eval data 1", "label": 1},
]
# Write one JSON object per line (JSONL).
with open("eval.jsonl", "w") as f:
    for ed in eval_data:
        f.write(json.dumps(ed) + "\n")
result = eric_tc.eval("eval.jsonl", args=args)
print(result.loss)
You can view the tokenized data in eric_transformer/