Mainly notes from reaading the Natural Language Processing with Transformers book

Really nice book! I have the urge to write down for myself some snippets so I can more easily refer to them later.

Read a dataset to pandas

import pandas as pd
from datasets import load_dataset
emotions = load_dataset("emotion")

# emotions["train"] # this is still a datasets.arrow_dataset.Dataset
emotions.set_format(type="pandas")
df = emotions["train"][:]  # but adding that "[:]" slice grants a DataFrame !
df.head()

# Go back to initial format
emotions.reset_format()

Cool Mini Tokenization example

from transformers import AutoTokenizer
model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
text = "Tokenizing text is a core task of NLP."

encoded_text = tokenizer(text)
print(encoded_text)
{'input_ids': [101, 19204, 6026, 3793, 2003, 1037, 4563, 4708, 1997, 17953,
2361, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
# oh cool and the tokenizer lets you convert back , 
tokens = tokenizer.convert_ids_to_tokens(encoded_text.input_ids)
print(tokens)
['[CLS]', 'token', '##izing', 'text', 'is', 'a', 'core', 'task', 'of', 'nl',
'##p', '.', '[SEP]']

And finally

print(tokenizer.convert_tokens_to_string(tokens))
[CLS] tokenizing text is a core task of nlp. [SEP]