## Diverse, complex, and never enough

We have already learned that neural network training requires vast amounts of data to effectively capture complex patterns and generalize well. This need for extensive data is expensive for companies: collecting, storing, and labeling the data all carry costs, as do the computational resources required to train large-scale models, making the whole endeavor resource-intensive.

Machine learning and neural network training can work with various types of data, but the choice often depends on the specific problem and the physical nature of the data available. Here are some common types of data used:

- **Image and Video Data**: For example, object recognition in images, where you identify and classify objects within photographs.
- **Audio Data**: For example, speech recognition, where you transcribe spoken words into text.
- **Multi-modal Data**: For example, autonomous vehicle perception, where data from cameras, LIDAR, radar, and other sensors are combined to make driving decisions.
- **Sensor Data**: For example, predictive maintenance in manufacturing, where data from sensors on machinery is used to predict when maintenance is needed to avoid breakdowns.

The choice of data type and representation depends on the problem’s requirements and the information available.

## Discrete vs. Continuous Data in NN Training

Depending on the task at hand, two fundamentally different types of data can be used in neural network training:

- **Discrete Data**: Discrete data consists of distinct, separate, and countable values. These values often represent categories, counts, or labels with clear boundaries. Examples include categorical variables (e.g., types of animals, colors), ordinal variables (e.g., levels of satisfaction), and count data (e.g., the number of cars in a parking lot).
- **Continuous Data**: Continuous data, on the other hand, represents a continuum of values with no clear separation between them. This type of data can take on any value within a given range. Examples include numerical variables (e.g., temperature, height, weight) and real-valued measurements (e.g., time, distance).

The handling, representation, and preprocessing of these data types in neural network training differ based on their fundamental nature.
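To make the difference concrete, here is a minimal sketch of how the two data types are typically prepared for a network: discrete categories become one-hot vectors, while continuous measurements are standardized to zero mean and unit variance. The feature names and values are hypothetical, chosen only to mirror the examples above.

```python
def one_hot(value, categories):
    """Encode a discrete value as a one-hot vector over known categories."""
    return [1.0 if value == c else 0.0 for c in categories]

def standardize(values):
    """Scale continuous values to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = var ** 0.5 or 1.0  # guard against zero variance
    return [(v - mean) / std for v in values]

# Discrete feature: animal type -> one-hot encoding
animals = ["cat", "dog", "bird"]
print(one_hot("dog", animals))  # [0.0, 1.0, 0.0]

# Continuous feature: temperature readings -> standardized floats
temps = [20.5, 21.0, 19.8, 22.3]
print(standardize(temps))
```

Note that the discrete encoding is exact and finite, while the continuous one is an approximation whose quality depends on the sample it was computed from.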

## Continuous Data Complexity: Managing the Infinite Possibilities

Handling continuous data can be more challenging compared to discrete data due to several reasons:

- **Infinite Possible Values**: Continuous data can take on an infinite number of values within a given range. This makes it computationally intensive to work with, as you can’t store or process every possible value individually. In contrast, discrete data has a finite set of possible values, making it easier to manage.
- **Precision and Noise**: Continuous data often involves measurements and observations that come with varying degrees of precision and noise. This introduces uncertainty into the data and requires careful handling to account for measurement errors and variations.
- **Data Representation**: Discrete data can be easily represented using integers or categorical labels, while continuous data requires more complex representations, usually involving floating-point numbers. This adds complexity to processing and storage.
- **Granularity**: Continuous data can be extremely granular, requiring sophisticated techniques to capture meaningful patterns. Discrete data might already come in a more structured and understandable format.
- **Dimensionality**: Continuous data often leads to high-dimensional feature spaces, especially when dealing with multiple continuous variables. This can result in the “curse of dimensionality,” where distance-based methods struggle due to increased sparsity of data points.
- **Algorithm Sensitivity**: Many algorithms are designed for discrete data or work better with it. Adapting these algorithms to continuous data requires careful consideration and often additional mathematical techniques.
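The precision and representation points above can be seen in any language that stores continuous values as floating-point numbers. A tiny sketch in Python:

```python
import math

# Floating-point representation of continuous values is inherently
# approximate: 0.1 and 0.2 have no exact binary representation.
print(0.1 + 0.2)         # 0.30000000000000004, not exactly 0.3
print(0.1 + 0.2 == 0.3)  # False

# Comparisons on continuous data therefore need a tolerance:
print(math.isclose(0.1 + 0.2, 0.3))  # True
```

Discrete labels and integer counts have no such issue, which is one reason they are easier to store, compare, and debug.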

## Measuring the world?

In summary, handling continuous data requires a deeper understanding of the underlying mathematical properties, domain-specific considerations, and often the use of specialized algorithms and techniques to effectively process and extract meaningful insights from the data.

The goal of machine learning is to create models that generalize well to unseen data, which is termed robustness. Achieving good generalization depends partly on sheer data volume, but just as much on having diverse and representative data that captures the underlying patterns in the data distribution.

High-dimensional continuous data tends to result in a larger number of model parameters, especially if you have many continuous features such as movement in time and space. Because the volume of the input space grows exponentially with the number of features, it becomes ever harder to capture all the measurements and input data needed for robust training. This is where a tradeoff between local and global robustness comes into play when trying to solve the “never enough data” problem.
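A rough back-of-the-envelope sketch shows why the data demand explodes. Suppose we want each continuous feature's range covered by a fixed number of sample points; the total number of samples needed then grows exponentially with the number of features (the feature counts below are illustrative, not from a real system):

```python
def samples_to_cover(points_per_feature, num_features):
    """Samples needed to cover every combination of feature values
    at a fixed per-feature resolution."""
    return points_per_feature ** num_features

# With 10 sample points per feature:
for num_features in (1, 3, 6, 10):
    print(num_features, samples_to_cover(10, num_features))
# 1 feature needs 10 samples; 10 features already need 10 billion.
```

This is the curse of dimensionality in its simplest form: no realistic dataset can densely cover a high-dimensional continuous input space, which motivates the local-vs-global robustness tradeoff mentioned above.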

Find out how we approach this problem in our next articles!

**Excited? Get in touch and learn how to unlock the full potential of your business with Spiki’s AI you can trust.**