Why do we need to subtract 3 from the word index in the IMDb dataset from TensorFlow?

1 minute read

As I was working with the IMDb dataset in TensorFlow/Keras, I noticed that while decoding a review, we need to subtract 3 from each word index to get the corresponding word. Let's understand why.

We subtract 3 because the IMDb dataset reserves the first three indices for special tokens, not real words.

(i - 3) shifts the index back so it correctly maps to the actual word.

What’s really happening under the hood

In the IMDB dataset, word indices start like this:

Index   Meaning
0       <PAD> (padding)
1       <START> (start of review)
2       <UNK> (unknown word)
3       First real word
4       Second real word

So actual vocabulary words start from index 3.
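To make this concrete, here is a minimal sketch of how the offset arises during encoding. The three-word vocabulary is made up for illustration; it is not the real IMDb index:

```python
# Hypothetical miniature vocabulary in the same format as imdb.get_word_index():
# real words start at 1, with no room left for special tokens.
word_index = {"the": 1, "movie": 2, "rocks": 3}

# When a review is encoded, every word index is shifted by +3 so that
# 0, 1 and 2 stay free for <PAD>, <START> and <UNK>.
START, OFFSET = 1, 3
encoded = [START] + [word_index[w] + OFFSET for w in ["the", "movie", "rocks"]]
print(encoded)  # [1, 4, 5, 6]
```

The leading 1 is the <START> token; every following number is the dictionary index plus 3.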


word_index vs encoded reviews

imdb.get_word_index()

Returns a dictionary like:

{
  "the": 1,
  "and": 2,
  "a": 3,
  ...
}

But encoded reviews look like:

[1, 14, 22, 16, 43, ...]

⚠️ These numbers are shifted by +3 to make room for special tokens.
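You can see the shift directly by comparing the dictionary value with the token that appears in an encoded review. A small sketch, using the first few real entries of the dictionary as shown above:

```python
# word_index as returned by imdb.get_word_index() maps "the" to 1 ...
word_index = {"the": 1, "and": 2, "a": 3}

# ... but in an encoded review, "the" shows up as 4, because the dataset
# shifts every index by +3 to reserve 0, 1 and 2 for special tokens.
encoded_token_for_the = word_index["the"] + 3
print(encoded_token_for_the)  # 4
```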


Why subtract 3 during decoding?

When decoding:

reverse_word_index.get(i - 3, '?')

We are saying:

This index was shifted to make room for the special tokens; remove that offset so I can map it back to the original word.


Example

Suppose:

i = 6

Then:

i - 3 = 3

And:

reverse_word_index[3]  # → "a"

Without subtracting 3:

  • You’d get the wrong word
  • Or '?' (key not found)
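Both failure modes are easy to reproduce. A quick sketch with a toy reverse dictionary (not the real index):

```python
# Hypothetical reverse dictionary: raw index → word.
reverse_word_index = {1: "the", 2: "and", 3: "a", 4: "of", 5: "to", 6: "is"}

i = 6  # a token from an encoded review

# Correct: undo the +3 offset before looking up.
print(reverse_word_index.get(i - 3, '?'))  # 'a'

# Wrong word: without the offset, 6 maps to whatever word holds raw index 6.
print(reverse_word_index.get(i, '?'))      # 'is' (not the word the review contained)

# Key not found: tokens near the top of the vocabulary overshoot the dictionary.
print(reverse_word_index.get(9, '?'))      # '?'
```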

Full decoding logic

reverse_word_index = {value: key for key, value in imdb.get_word_index().items()}

decoded_review = ' '.join(
    reverse_word_index.get(i - 3, '?') for i in sample_review
)
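Putting it all together, here is a self-contained sketch of the whole round trip. The vocabulary and review are made up for illustration; with the real dataset, word_index comes from imdb.get_word_index() and sample_review from imdb.load_data():

```python
# Hypothetical stand-ins for imdb.get_word_index() and one review from
# imdb.load_data(); the real dataset follows exactly the same convention.
word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}

# Encoded review: <START> token (1) followed by word indices shifted by +3.
sample_review = [1, 4, 5, 6, 7]

reverse_word_index = {value: key for key, value in word_index.items()}

decoded_review = ' '.join(
    reverse_word_index.get(i - 3, '?') for i in sample_review
)
print(decoded_review)  # "? the movie was great" (the <START> token decodes to '?')
```

The leading '?' is expected: index 1 is the <START> token, and 1 - 3 = -2 is not in the dictionary, so the fallback '?' is used.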