New software for searching words in digitized Yiddish books — many originally written in the 19th and early 20th centuries — is about to be unveiled.
The search tool will be available via the Yiddish Book Center in Amherst, Massachusetts. Its digital library includes more than 10,000 books in Yiddish — but the current ability to search them is limited.
Amber Clooney, the center’s digital librarian, recently demonstrated just how limited the current search software is. She pulled up a website, and a Yiddish keyboard appeared on the screen. She typed in “tsig” — the Yiddish word for goat.
Because it can search only titles and authors, the software delivered just three references.
Then Clooney pulled up a different screen that showed a beta version of the new search software.
“On the [new] site, we can search anywhere in the text of a book,” she said. “It’s great.”
Within just a few seconds, about 6,000 references came up.
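For readers curious why the two searches return such different counts, here is a minimal Python sketch of title-and-author search versus full-text search. It is only an illustration: the two-book catalog, the romanized sample text and the function names are all hypothetical and are not taken from the Yiddish Book Center's site or from the new software.

```python
from collections import defaultdict

# Hypothetical toy catalog; titles, authors and text are invented for illustration.
BOOKS = [
    {"id": 1, "title": "di tsig", "author": "a. shrayber",
     "text": "di tsig est groz"},
    {"id": 2, "title": "mayses", "author": "b. dertseyler",
     "text": "a mol iz geven a tsig un a ku"},
]


def search_metadata(books, word):
    """Return ids of books whose title or author contains the word."""
    return [
        b["id"]
        for b in books
        if word in b["title"].split() or word in b["author"].split()
    ]


def build_full_text_index(books):
    """Map each word to the set of book ids whose text contains it."""
    index = defaultdict(set)
    for b in books:
        for token in b["text"].split():
            index[token].add(b["id"])
    return index


def search_full_text(index, word):
    """Return sorted ids of books whose text contains the word."""
    return sorted(index.get(word, set()))


index = build_full_text_index(BOOKS)
print(search_metadata(BOOKS, "tsig"))   # finds only the book with "tsig" in its title
print(search_full_text(index, "tsig"))  # finds every book that mentions "tsig"
```

In this toy example the metadata search finds one book and the full-text search finds both, which mirrors on a tiny scale the gap between three references and about 6,000.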
A DECADE IN DEVELOPMENT
The search software is called Jochre, for Java Optical Character Recognition. To date, it’s the most comprehensive Yiddish word search tool available, according to several computer scientists and scholars.
The Yiddish Book Center, which is a repository for more than a million hard copy books in Yiddish, expects to have Jochre up and running on its site by the end of the year.
Aaron Lansky founded the Yiddish Book Center in 1980. He once thought writing optical character recognition (OCR) for Yiddish was a pipe dream.
“It was going to be inordinately complicated, and cost $10 million, minimum,” Lansky said. “We figured this was never going to happen.”
But it did happen, and by chance.
An email arrived 10 years ago, out of the blue, Lansky said, from a benevolent software engineer in France named Assaf Urieli.
Lansky said Urieli wrote that he was a computational linguist living in the French Pyrenees, and had just invented Yiddish OCR. And Urieli wanted to donate it to the Yiddish Book Center so its books could be searchable.
Urieli remembered telling Lansky he wanted to give him a demo.
“So we can talk about it, so we can see if we can maybe do a project together,” Urieli said, speaking from France.
Both Lansky and Urieli are on a mission to make Jochre and Yiddish available to the world — a world they say doesn’t know much about how many millions of people once spoke the 1,000-year-old language.
Yiddish was almost erased by Stalin and Hitler, then almost lost when many Jews left Europe after World War II for the U.S., and when Israel didn’t make Yiddish its national language.
Urieli’s work began with an interest in his own family history. He grew up between South Africa, Ohio and Israel, and knew almost nothing about Yiddish until he was a young adult. He was already multilingual when he learned his great-grandparents — and generations before them — spoke Yiddish. He decided to learn the language and research his family.
That’s how Urieli discovered the Yiddish Book Center’s digital library.
“Among other things, I was reading about the town where my grandmother was born in Lithuania, and I was thinking how nice it would be if I could actually perform a search among older books to find all of the references to this town,” Urieli said. “And I thought: well, why not write the software?”
Urieli figured it would take the summer. But after three months, he said he hadn’t made much progress.
“But I suppose I couldn’t abandon it, either,” he said. “The idea was too fascinating.”
OCR IS COMPLEX
Many Yiddish documents are old and have stray marks, which software can easily misread as characters. Jochre had to be taught, and retaught, otherwise.
Once Urieli got past some common code challenges, the Yiddish Book Center and other libraries began to build a dictionary of words and proper names.
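The article does not describe how Jochre uses that dictionary. As a rough illustration of one common way a word list can help clean up OCR output, the Python sketch below replaces an unrecognized word with its closest match from a list of known words. The romanized lexicon, the `correct_token` function and the similarity cutoff are all hypothetical, and this should not be read as Jochre's actual method.

```python
import difflib

# Hypothetical lexicon of known words, romanized and invented for illustration.
LEXICON = ["tsig", "bukh", "shtetl", "mame"]


def correct_token(token, lexicon, cutoff=0.7):
    """Return the token unchanged if it is a known word; otherwise return
    the closest known word, or the original token if nothing is close enough."""
    if token in lexicon:
        return token
    matches = difflib.get_close_matches(token, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else token


# "bukn" stands in for an OCR misreading of "bukh"; "xyz" has no close match.
print([correct_token(t, LEXICON) for t in ["tsig", "bukn", "xyz"]])
```

A real system would need to handle Hebrew-script text, proper names and dialect spellings, so a simple similarity cutoff like this one is only a starting point.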
Urieli has not been the only one trying to build a Yiddish OCR. In roughly the same time frame, computer scientist Raphael Finkel at the University of Kentucky had been developing his own version, one that requires more human editing than Jochre.
Finkel has already used Urieli’s program — and he’s impressed.
“It is searchable in a very fast way, which is nice,” Finkel said. “It makes about the same rate of OCR mistake as mine does. It’s the nature of the problem — mistakes in understanding the text.”
Finkel said one example of how a simple Yiddish phrase can be misunderstood is the phrase, “This bothers me.”
In different Yiddish dialects, the sound and letters change. In the word for “me,” the first Yiddish letter is the same, but the last letters are completely different — to the eye and ear, as well as to an OCR program.
While no software is perfect, Finkel said, Jochre is “a great advance.”
The expectation is that scholars, cultural anthropologists, families and others can dig deeper into Yiddish history and language. Digital libraries and researchers may be able to use the software on their own sites, with a little setup on their end.
Urieli is an enthusiast of open source software.
“You do something not because you want to get rich off of it,” he said, “but because it’s something that you’re passionate about, and you want to share with the world.”
Jochre is designed to get smarter over time through use and corrections.
The Yiddish Book Center reports it is getting about five or six corrections a day from Yiddish speakers living around the world.