“Biased labeling undermines the credibility of AI applications”
1 August, 2022
Toloka, which provides data labeling services for AI model training and has amassed a diverse labeling audience from around the world, is entering Israel. Israeli tech companies are global, and it is important for them to base their AI applications on information relevant to their target markets
The global company Toloka, which provides large-scale data labeling services for training artificial intelligence applications, is launching operations in Israel. It aims to offer its services to Israeli technology companies developing AI applications such as computer vision, natural language understanding, search engines, e-commerce and more. In addition, labeling work on Toloka’s platform is done in a crowdsourcing format, and Toloka is also interested in expanding the Israeli audience on the platform.
In a conversation with Techtime, the company’s founder and CEO, Olga Magroskaya (pictured), explained the move. “We are entering the Israeli market because we see tremendous potential in the Israeli start-up community, many of whom are engaged in artificial intelligence. Israeli technology companies are global, with end customers worldwide, so it is important for them to base their AI applications on information relevant to their target markets.”
Biased training leads to biased application
The training phase is one of the main steps in developing an AI application. At this point, the system scans the information using deep-learning algorithms to find common patterns and model them. For example, to train an AI application to recognize a cat, the system must be fed a large number of examples of cats of different breeds, colors and sizes, in different visual contexts. The larger and higher quality the sample pool, the more effective the training will be, and the more accurately the finished application will be able to identify cats.
The information used for training the model must be tagged (i.e., marked to indicate where the cat appears), and this is done manually by humans. Usually, the labeling is done by an external company or internally by the company developing the application. Since the information is usually labeled by a homogeneous group of people, this often leads to gender, ethnic, socioeconomic, and linguistic biases.
Olga: “We all call this field artificial intelligence, but this technology relies heavily on human intelligence. The basis of many AI applications is labeling, and it is carried out by humans, whose labeling is influenced by their background.”
To avoid the bias problem, Toloka’s labeling platform relies on crowdsourcing. Anyone can sign up and perform labeling tasks on photos, text, or voice files, in various languages (including Hebrew), in exchange for a fee. In this way, Toloka can build a very diverse pool of labelers and match each labeling task to the relevant labelers in terms of background and language.
On the other side of the platform, development companies can purchase tagged data sets on demand. The platform currently provides over 80 million data annotations per week.
Olga stresses that bias is not just an ethical or social issue but a functional one, and that training a model on biased information will ultimately damage the application’s credibility. “In many cases, it matters who labels the information, for example, when it comes to voice samples in different accents and dialects for NLP training, where the meaning must be properly understood in the local context. Also, the way data is tagged today is very cumbersome and slow, which makes it difficult for the AI world to scale,” she said.
Representation of world population
Toloka recently published data on the distribution of active users on the platform by country of origin, socioeconomic status, religion, gender and more. As of 2022, the platform had approximately 250,000 monthly active users from over 100 countries. The users belong to about 600 different ethnic groups. The average age is 29.6 years and the gender distribution is about 60% male and 40% female.
Olga: “It’s an open platform. Anyone from anywhere in the world can sign up and perform labeling tasks. We managed to amass a very diverse audience of users. This gives companies developing AI applications access to people from all over the world and of all shades. Hundreds of high-tech companies worldwide use our tagged data, and each has different requirements.”
Toloka uses a range of statistical techniques and tools, such as user rankings, cross-checks, and majority determinations, to ensure that data is accurately and reliably tagged. “Our added value is the quality control processes. Ultimately, if the information is not properly tagged, the model will not work. This is a challenge, as the labeling is carried out remotely and by people worldwide. We apply a series of mathematical tools that allow us to reach statistically significant confidence in the reliability of the labeling.”
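To illustrate what a “majority determination” and a simple user ranking might look like in practice, here is a minimal sketch in Python. It is not Toloka’s actual method, which the company does not detail here. The function names, the data layout (task, worker, label) and the sample labels are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def majority_vote(labels):
    """Aggregate overlapping labels per task by simple majority.

    `labels` is a list of (task_id, worker_id, label) tuples (hypothetical layout).
    Returns {task_id: (winning_label, share_of_votes)}.
    Ties are broken in favor of the label that was submitted first.
    """
    votes_by_task = defaultdict(list)
    for task_id, _worker_id, label in labels:
        votes_by_task[task_id].append(label)

    consensus = {}
    for task_id, votes in votes_by_task.items():
        winner, count = Counter(votes).most_common(1)[0]
        consensus[task_id] = (winner, count / len(votes))
    return consensus


def worker_agreement(labels, consensus):
    """Rank workers by how often their label matches the majority label."""
    matched = defaultdict(int)
    total = defaultdict(int)
    for task_id, worker_id, label in labels:
        total[worker_id] += 1
        if label == consensus[task_id][0]:
            matched[worker_id] += 1
    return {worker: matched[worker] / total[worker] for worker in total}


# Made-up sample: three workers label the same two images.
example_labels = [
    ("img1", "worker_a", "cat"),
    ("img1", "worker_b", "cat"),
    ("img1", "worker_c", "dog"),
    ("img2", "worker_a", "dog"),
    ("img2", "worker_b", "dog"),
    ("img2", "worker_c", "dog"),
]

example_consensus = majority_vote(example_labels)
example_ranking = worker_agreement(example_labels, example_consensus)
```

In this sketch, the same item is shown to several labelers (a cross-check), the most common answer wins, and each labeler’s agreement rate with the majority can serve as a rough ranking signal. Production systems typically go further, for example by weighting votes according to each labeler’s track record.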