One of the key challenges in document intelligence and collective intelligence is knowledge representation. For example, language models like ChatGPT create an experience that's primarily in English. What happens, then, to non-English speakers and their ability to participate in the AI revolution?
Tier 1 research labs, as well as funders, are well aware that they need to start digitizing and indexing content in other languages to improve how those languages are represented in these models. Otherwise, those communities risk losing part of their identity.
We’re all too familiar with the difficulties of parsing English language text, especially when you deal with handwriting or dense, cross-page tables. What if you want to parse Chinese, Japanese, or Korean? Or how about Arabic, where the direction of the reading order changes?
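Reading order is one of those details that's easy to overlook until it breaks. As a minimal sketch (not Datalab's actual pipeline), here's how you might classify a line of extracted text as right-to-left or left-to-right using Python's standard `unicodedata` module, which exposes each character's Unicode bidirectional class:

```python
import unicodedata

def dominant_direction(text: str) -> str:
    """Classify text as 'rtl' or 'ltr' by counting strongly-typed characters.

    'R' covers Hebrew-script characters and 'AL' covers Arabic-script
    characters; 'L' covers Latin and most other left-to-right scripts.
    """
    rtl = sum(1 for ch in text if unicodedata.bidirectional(ch) in ("R", "AL"))
    ltr = sum(1 for ch in text if unicodedata.bidirectional(ch) == "L")
    return "rtl" if rtl > ltr else "ltr"

print(dominant_direction("مرحبا بالعالم"))  # Arabic: rtl
print(dominant_direction("Hello, world"))   # English: ltr
```

A real layout engine has to handle much more than this (mixed-direction lines, digits, and punctuation all follow their own bidi rules), but even this simple majority count shows why parsers built only on English assumptions fall over.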
Datalab already supports over 90 languages, and we're constantly improving: we find new sources of training data and run internal evaluations to strengthen our support for these languages, enabling broader global participation and representation in AI systems.
We’re not afraid to tackle challenges like this, and once something is in our sights, you can bet we’re going to beat everyone else.