📙 Notebook: Tokenizer basic examples. 📙 Notebook: Sarcasm detection.
Tokenizer
from tensorflow.keras.preprocessing.text import Tokenizer
sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog so much!'
]
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
# num_words: maximum number of words to keep when encoding;
# only the 100 most common words are used.
# A bigger vocabulary can give better accuracy but takes longer to train.
# oov_token: unseen (out-of-vocabulary) words are replaced by "<OOV>"
tokenizer.fit_on_texts(sentences) # build the word index from these texts
# indexing words
word_index = tokenizer.word_index
print(word_index)
# {'<OOV>': 1, 'love': 2, 'my': 3, 'i': 4, 'dog': 5, 'cat': 6, 'you': 7, 'so': 8, 'much': 9}
# "!", ",", capital, ... are removed
👉 tf.keras.preprocessing.text.Tokenizer
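The lowercasing and punctuation stripping come from the Tokenizer's filters and lower arguments (both enabled by default). A minimal sketch that overrides those defaults to keep the raw tokens — raw_tokenizer is just an illustrative name:
raw_tokenizer = Tokenizer(filters='', lower=False)  # keep punctuation and case
raw_tokenizer.fit_on_texts(['i love my dog', 'I, love my cat'])
print(raw_tokenizer.word_index)
# now 'I,' (with the comma) is a separate token from 'i'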
# encode sentences
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)
# [[4, 2, 3, 5],
# [4, 2, 3, 6],
# [7, 2, 3, 5, 8, 9]]
# with oov_token set, a word not in the word index is encoded as <OOV> (index 1);
# without it, the word would simply be dropped by texts_to_sequences()
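A quick sketch of encoding sentences containing words the tokenizer has never seen — because oov_token was set, they map to the <OOV> index (1) instead of being dropped; the test sentences below are illustrative:
test_sentences = [
    'i really love my dog',       # "really" is unseen
    'my dog loves my manatee'     # "loves", "manatee" are unseen
]
print(tokenizer.texts_to_sequences(test_sentences))
# [[4, 1, 2, 3, 5], [3, 5, 1, 3, 1]]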
👉 tf.keras.preprocessing.sequence.pad_sequences
# make encoded sentences equal
from tensorflow.keras.preprocessing.sequence import pad_sequences
padded = pad_sequences(sequences, value=-1,
                       maxlen=5, padding="post", truncating="post")
# maxlen: maximum length of each encoded sentence
# value: value used for padding (default 0)
# padding: pad at the beginning ("pre", default) or at the end ("post")
# truncating: if a sentence is longer than maxlen, cut at the beginning ("pre", default) or at the end ("post")
print(padded)
# [[ 4 2 3 5 -1]
# [ 4 2 3 6 -1]
# [ 7 2 3 5 8]]
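For contrast, a minimal sketch with the defaults — pad_sequences pads and truncates at the beginning ("pre") and fills with 0:
padded_default = pad_sequences(sequences, maxlen=5)
print(padded_default)
# [[0 4 2 3 5]
#  [0 4 2 3 6]
#  [2 3 5 8 9]]   <- the long sentence is cut at the beginning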
# read json text
import json
with open("/tmp/sarcasm.json", 'r') as f:
    datastore = json.load(f)

sentences = []
labels = []
urls = []
for item in datastore:
    sentences.append(item['headline'])
    labels.append(item['is_sarcastic'])
    urls.append(item['article_link'])
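A minimal sketch of applying the same Tokenizer + pad_sequences steps to the sarcasm headlines just loaded; the vocabulary size and maxlen below are illustrative choices, not values required by the dataset:
import numpy as np

sarcasm_tokenizer = Tokenizer(num_words=10000, oov_token="<OOV>")  # illustrative vocab size
sarcasm_tokenizer.fit_on_texts(sentences)                 # build the word index from the headlines
sarcasm_sequences = sarcasm_tokenizer.texts_to_sequences(sentences)
sarcasm_padded = pad_sequences(sarcasm_sequences, maxlen=40, padding='post', truncating='post')
labels = np.array(labels)         # numpy array so it can be fed to model.fit later
print(sarcasm_padded.shape)       # (number_of_headlines, 40)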
👉 Embedding projector - visualization of high-dimensional data
👉 Large Movie Review Dataset
📙 Notebook: Train IMDB review dataset. 👉 Video explaining the code.