
Optical Character Recognition for Low-Resource Languages

We are using machine learning methods to make legacy documentation more accessible for community-based language revitalization.


Kwak̕wala is an Indigenous language spoken on Northern Vancouver Island, nearby small islands, and the adjacent mainland. There is an active group of language learners and teachers revitalizing the language, working closely with Elder first-language speakers, many of whom are over 70 years old. The Kwak̕wala language includes 42 consonantal phonemes (twice as many as English) and a wide range of allophonic vowels. Several writing systems exist, and two orthographies are in widespread use by communities: the U’mista and Liq’wala systems.
However, most documentation in the Kwak̕wala language was originally written over a hundred years ago in a complex system developed by anthropologist Franz Boas working with George Hunt, an Indigenous ethnographer and scholar based in Tsax̱is (Fort Rupert). The Boas-Hunt writing system is an adaptation of the North American Phonetic Alphabet and uses Latin script characters as well as diacritics and digraphs to represent phonemic differences. The cultural and linguistic materials written in this system are of tremendous value to community-based researchers, but are minimally accessible in their current form as non-searchable scanned images in an orthography that few can read.
Rosenblum, Kwak'wala language learners and community members, and computational linguists at Carnegie Mellon are working to improve OCR for thousands of pages of legacy materials written in the Boas-Hunt orthography. We will then be able to produce searchable copies of these significant cultural resources and automate their transliteration into both community-preferred orthographies, vastly increasing their accessibility. You can read more about the project on Shruti Rijhwani's GitHub page. Shruti was named to the Forbes 30 Under 30 in Science for this work!

Project Funders