Data – The Bottleneck in Neural Network Training 

May 24, 2023

Status quo: unlimited high-quality data needed  

Robust neural network training involves ensuring that the network is resistant to noise, variations in input data, and other forms of perturbation. This is important for real-world applications, where the input data may be subject to variability or noise. 

To train a neural network robustly, a sufficient amount of diverse and high-quality data is needed. The exact amount and type of data required depend on the specific problem that the neural network is being trained to solve, as well as the complexity of the network architecture. 

In general, the more data available for training, the better the neural network is likely to perform. However, the quality of the data is just as important: the data should be representative of the problem domain and cover the range of input and output configurations the network will encounter in practice. 

Let’s say you are working on a project to develop an autonomous car that can detect and avoid obstacles on the road. To train the neural network that will control the car, you need to provide it with a large and diverse set of data that includes images of different types of roads, weather conditions, and obstacles. The neural network needs to learn how to recognize various objects on the road such as cars, pedestrians, traffic lights, and road signs. 

If you only provide the neural network with a limited amount of data, it may not be able to generalize well to new and unseen situations. For example, if the network has only been trained on images of roads during daylight, it may not be able to detect obstacles in low-light or nighttime conditions. 
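Data augmentation is one common way to broaden coverage without collecting new data. As a minimal sketch (the brightness factor and noise level below are illustrative choices, not values from this post), low-light variants can be synthesized from existing daylight images:

```python
import numpy as np

def simulate_low_light(image, brightness=0.3, noise_std=0.02, seed=0):
    """Darken an image and add sensor-like noise to mimic nighttime conditions.

    `image` is a float array with values in [0, 1]; the brightness factor
    and noise level are illustrative, not tuned for any real camera.
    """
    rng = np.random.default_rng(seed)
    dark = image * brightness
    noisy = dark + rng.normal(0.0, noise_std, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)

# A hypothetical batch of 8 daylight images (32x32 RGB, values in [0.4, 1.0]).
daylight = np.random.default_rng(1).uniform(0.4, 1.0, size=(8, 32, 32, 3))
low_light = simulate_low_light(daylight)

# Mix the synthetic low-light variants into the training batch.
augmented_batch = np.concatenate([daylight, low_light])
```

In practice, such augmented images would be mixed into the training set so the network also sees nighttime-like inputs it would otherwise never encounter.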


…correctly labeled and annotated by domain experts 


Additionally, it is important that the data be labeled correctly, as this is necessary for the network to learn the correct associations between inputs and outputs. The labeling process may require domain expertise or human annotation, which can be time-consuming and costly. 

Let’s say you are working on a project to develop a spam filter for an e-mail service. To train the neural network, you need to provide it with a large dataset of e-mails that are labeled as either spam or non-spam. The labeling process involves marking each e-mail in the dataset as spam or non-spam based on its content. 

If the labeling is incorrect, the neural network will learn the wrong associations between inputs (the content of the e-mail) and outputs (whether the e-mail is spam or not). For example, if an e-mail that should be labeled as spam is labeled as non-spam, the network may not be able to identify similar spam e-mails in the future. This can result in poor performance of the spam filter and frustration for users who still receive unwanted e-mails. 
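The effect can be demonstrated with a toy example (a sketch with made-up e-mails and a simple Naive Bayes classifier, not any production spam filter): flipping the labels of two spam e-mails to non-spam is enough to flip the prediction for a new spam message.

```python
from collections import Counter
import math

def train(emails):
    """emails: list of (text, label) pairs, label being 'spam' or 'ham'."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for text, label in emails:
        counts[label].update(text.split())
        totals[label] += 1
    return counts, totals

def classify(text, counts, totals):
    """Naive Bayes with Laplace smoothing: pick the higher-scoring class."""
    vocab = set(counts["spam"]) | set(counts["ham"])
    scores = {}
    for label in ("spam", "ham"):
        score = math.log(totals[label] / sum(totals.values()))  # class prior
        n_words = sum(counts[label].values())
        for w in text.split():
            score += math.log((counts[label][w] + 1) / (n_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# A tiny, hypothetical dataset with correct labels.
clean = [
    ("win free money now", "spam"),
    ("claim your free prize", "spam"),
    ("cheap loans approved now", "spam"),
    ("limited offer win big", "spam"),
    ("meeting moved to noon", "ham"),
    ("project status update attached", "ham"),
    ("lunch with the team tomorrow", "ham"),
    ("please review the draft report", "ham"),
]

# The same dataset with two spam e-mails mislabeled as non-spam.
noisy = [(t, "ham") if t.startswith(("win", "claim")) else (t, l)
         for t, l in clean]

test_email = "win a free prize"
pred_clean = classify(test_email, *train(clean))  # 'spam'
pred_noisy = classify(test_email, *train(noisy))  # 'ham': the flipped labels mask the spam signal
```

With correct labels the obvious spam message is caught; with just two mislabeled examples, the same message slips through.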

Labeling a large dataset of e-mails can be a time-consuming and costly process, especially if domain expertise or human annotation is required. Domain expertise may be needed to correctly identify certain types of spam e-mails, such as those that use sophisticated techniques to avoid detection. Human annotation may be needed to review and correct the labeling done by automated tools, to ensure that it is accurate and consistent across the dataset. This may require significant effort and expertise, but it is essential for achieving the desired performance of the system. 
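A standard way to quantify labeling consistency across annotators is inter-annotator agreement. The sketch below computes Cohen's kappa (observed agreement corrected for chance agreement) for two hypothetical annotators labeling the same ten e-mails; the data is invented for illustration:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance.

    Returns 1.0 for perfect agreement, 0.0 for chance-level agreement.
    (Assumes agreement is not already exactly at chance level.)
    """
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    classes = set(labels_a) | set(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in classes) / n**2
    return (observed - expected) / (1 - expected)

# Two hypothetical annotators labeling the same ten e-mails.
annotator_1 = ["spam", "spam", "ham", "ham", "spam", "ham", "ham", "spam", "ham", "ham"]
annotator_2 = ["spam", "spam", "ham", "spam", "spam", "ham", "ham", "spam", "ham", "ham"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

A low kappa flags labeling guidelines that annotators interpret differently, so disagreements can be reviewed and resolved before training.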

The specific amount of data required varies widely depending on the problem and network architecture. Deep neural networks, for example, may require hundreds of thousands or even millions of examples for effective training, while smaller networks may require fewer examples.  

In summary, robust neural network training requires sufficient and high-quality data that is representative of the problem domain and correctly labeled. The specific amount of data required depends on the complexity of the problem and the network architecture, and can vary widely. 


Data collection and annotation are costly 

Data collection for neural network training can be costly because it usually requires a team of skilled individuals, including: 

  • Subject matter experts who can identify and collect relevant data. 
  • Data scientists who can design data collection protocols and manage the data pipeline. 
  • Data annotators or labelers who can manually annotate or label data as needed. 
  • Quality assurance personnel who can ensure the accuracy and quality of the collected data. 
  • Legal and ethical experts who can ensure that the data collection process is compliant with relevant regulations and ethical considerations. 

In some cases, it may be possible to outsource certain aspects of the data collection process, such as annotation or labeling, to third-party providers. However, this can also introduce additional costs and challenges related to quality control and data ownership. 


Spiki has your back: robust training made cost-efficient 

Spiki offers a unique neural network training framework that specifies precisely which data points need to be measured, where, and how they feed into training. We guide our customers through the data collection process to ensure robust performance. At the same time, we limit the amount of data needed, making training your AI as effective and efficient as possible. 

Spiki offers this robust neural network training workflow as a SaaS, in the form of software (SW) or hardware (HW) IP licenses for safety-critical applications in fields such as intelligent control, autonomous driving, robotics, and aeronautics. 

Excited? Get in touch and learn how to unlock the full potential of your business with Spiki’s AI you can trust.