Why do we need to subtract 3 from the word index in the IMDb dataset from TensorFlow?

1 minute read

As I was working with the IMDb dataset in TensorFlow/Keras, I noticed that while decoding a review, we need to subtract 3 from each word index to get the corresponding word. Let's understand why.

We subtract 3 because the IMDb dataset reserves the first three indices for special tokens, not real words.

(i - 3) shifts the index back so it correctly maps to the actual word.

What’s really happening under the hood

In the IMDB dataset, word indices start like this:

Index   Meaning
0       <PAD> (padding)
1       <START> (start of review)
2       <UNK> (unknown word)
3       First real word
4       Second real word

So actual vocabulary words start from index 3.
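To make this concrete, here is a minimal sketch of how the offset arises during encoding. The three-word vocabulary is made up for illustration; it is not the real IMDb index:

```python
# Hypothetical miniature vocabulary in the same format as imdb.get_word_index():
# real words start at 1, with no room left for special tokens.
word_index = {"the": 1, "movie": 2, "rocks": 3}

# When a review is encoded, every word index is shifted by +3 so that
# 0, 1 and 2 stay free for <PAD>, <START> and <UNK>.
START, OFFSET = 1, 3
encoded = [START] + [word_index[w] + OFFSET for w in ["the", "movie", "rocks"]]
print(encoded)  # [1, 4, 5, 6]
```

The leading 1 is the <START> token; every following number is the dictionary index plus 3.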


word_index vs encoded reviews

imdb.get_word_index()

Returns a dictionary like:

{
  "the": 1,
  "and": 2,
  "a": 3,
  ...
}

But encoded reviews look like:

[1, 14, 22, 16, 43, ...]

⚠️ These numbers are shifted by +3 to make room for special tokens.
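You can see the shift directly by comparing the dictionary value with the token that appears in an encoded review. A small sketch, using the first few real entries of the dictionary as shown above:

```python
# word_index as returned by imdb.get_word_index() maps "the" to 1 ...
word_index = {"the": 1, "and": 2, "a": 3}

# ... but in an encoded review, "the" shows up as 4, because the dataset
# shifts every index by +3 to reserve 0, 1 and 2 for special tokens.
encoded_token_for_the = word_index["the"] + 3
print(encoded_token_for_the)  # 4
```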


Why subtract 3 during decoding?

When decoding:

reverse_word_index.get(i - 3, '?')

We are saying:

This index was shifted to make room for the special tokens; remove that offset so I can map it back to the original word.


Example

Suppose:

i = 6

Then:

i - 3 = 3

And:

reverse_word_index[3]  # → "a"

Without subtracting 3:

  • You’d get the wrong word
  • Or '?' (key not found)
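Both failure modes are easy to reproduce. A quick sketch with a toy reverse dictionary (not the real index):

```python
# Hypothetical reverse dictionary: raw index → word.
reverse_word_index = {1: "the", 2: "and", 3: "a", 4: "of", 5: "to", 6: "is"}

i = 6  # a token from an encoded review

# Correct: undo the +3 offset before looking up.
print(reverse_word_index.get(i - 3, '?'))  # 'a'

# Wrong word: without the offset, 6 maps to whatever word holds raw index 6.
print(reverse_word_index.get(i, '?'))      # 'is' (not the word the review contained)

# Key not found: tokens near the top of the vocabulary overshoot the dictionary.
print(reverse_word_index.get(9, '?'))      # '?'
```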

Full decoding logic

reverse_word_index = {value: key for key, value in imdb.get_word_index().items()}

decoded_review = ' '.join(
    reverse_word_index.get(i - 3, '?') for i in sample_review
)
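Putting it all together, here is a self-contained sketch of the whole round trip. The vocabulary and review are made up for illustration; with the real dataset, word_index comes from imdb.get_word_index() and sample_review from imdb.load_data():

```python
# Hypothetical stand-ins for imdb.get_word_index() and one review from
# imdb.load_data(); the real dataset follows exactly the same convention.
word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}

# Encoded review: <START> token (1) followed by word indices shifted by +3.
sample_review = [1, 4, 5, 6, 7]

reverse_word_index = {value: key for key, value in word_index.items()}

decoded_review = ' '.join(
    reverse_word_index.get(i - 3, '?') for i in sample_review
)
print(decoded_review)  # "? the movie was great" (the <START> token decodes to '?')
```

The leading '?' is expected: index 1 is the <START> token, and 1 - 3 = -2 is not in the dictionary, so the fallback '?' is used.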