DeepSpeech/data/lm

lm.binary was generated from the LibriSpeech normalized LM training text, available at http://www.openslr.org/resources/11/librispeech-lm-norm.txt.gz, following this recipe (Jupyter notebook code):

import gzip
import os

from urllib import request

# Download the gzipped, upper-case corpus.
url = 'http://www.openslr.org/resources/11/librispeech-lm-norm.txt.gz'
data_upper = '/tmp/upper.txt.gz'
request.urlretrieve(url, data_upper)

# Convert to lowercase, streaming line by line so the corpus never has to fit in memory.
data_lower = '/tmp/lower.txt'
with open(data_lower, 'w', encoding='utf-8') as lower:
    # Mode 'rt' makes gzip decode the stream as text.
    with gzip.open(data_upper, 'rt', encoding='utf-8') as upper:
        for line in upper:
            lower.write(line.lower())

# Build pruned LM.
lm_path = '/tmp/lm.arpa'
!lmplz --order 5 \
       --temp_prefix /tmp/ \
       --memory 50% \
       --text {data_lower} \
       --arpa {lm_path} \
       --prune 0 0 0 1

# Quantize and produce trie binary.
binary_path = '/tmp/lm.binary'
!build_binary -a 255 \
              -q 8 \
              trie \
              {lm_path} \
              {binary_path}
# The intermediate ARPA file is no longer needed once the binary exists.
os.remove(lm_path)
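
The `--prune 0 0 0 1` argument gives one count threshold per n-gram order. As we read the KenLM documentation, n-grams whose count is at or below the threshold for their order are dropped, and the last threshold is reused for any higher orders, so with `--order 5` this should remove singleton 4-grams and 5-grams. The snippet below is only a toy illustration of that rule, not KenLM's implementation; `expand_thresholds` and `is_kept` are hypothetical helper names.

```python
def expand_thresholds(thresholds, order):
    """Extend a --prune style threshold list to cover every n-gram order."""
    if not thresholds:
        raise ValueError("at least one threshold is required")
    if len(thresholds) > order:
        raise ValueError("more thresholds than n-gram orders")
    # Assumption: the last given threshold applies to all remaining higher orders.
    return list(thresholds) + [thresholds[-1]] * (order - len(thresholds))


def is_kept(ngram_order, count, thresholds):
    """Return True if an n-gram of the given order and count survives pruning."""
    # Orders are 1-based: unigrams use thresholds[0], bigrams thresholds[1], ...
    return count > thresholds[ngram_order - 1]


thresholds = expand_thresholds([0, 0, 0, 1], order=5)
print(thresholds)                 # [0, 0, 0, 1, 1]
print(is_kept(3, 1, thresholds))  # True: trigrams are not pruned
print(is_kept(5, 1, thresholds))  # False: singleton 5-grams are pruned
```

Under this reading, unigrams, bigrams and trigrams are all kept, which is why the first three thresholds are 0.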

The trie was then generated from the vocabulary of the language model:

./generate_trie ../data/alphabet.txt /tmp/lm.binary /tmp/trie
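
As a rough mental model of this step, a trie stores every vocabulary word as a path of characters from a shared root, so words with a common prefix share nodes. The sketch below is a conceptual Python illustration only; it does not reproduce the file format or internals of `generate_trie`, and `build_trie`, `has_prefix` and `contains_word` are hypothetical names.

```python
def build_trie(words):
    """Build a nested-dict prefix trie; the key None marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for char in word:
            # Reuse the child node for this character, or create it.
            node = node.setdefault(char, {})
        node[None] = True  # end-of-word marker
    return root


def has_prefix(trie, prefix):
    """Return True if at least one stored word starts with prefix."""
    node = trie
    for char in prefix:
        if char not in node:
            return False
        node = node[char]
    return True


def contains_word(trie, word):
    """Return True if word was inserted as a complete vocabulary entry."""
    node = trie
    for char in word:
        if char not in node:
            return False
        node = node[char]
    return None in node


# Toy vocabulary standing in for the words of the language model.
trie = build_trie(["the", "then", "trie"])
print(has_prefix(trie, "th"))     # True
print(has_prefix(trie, "x"))      # False
print(contains_word(trie, "the")) # True
print(contains_word(trie, "th"))  # False
```

A decoder could use this kind of prefix lookup to discard character sequences that cannot lead to any word in the vocabulary.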