
Publishing Data for AI

Data Standardization

Standardized data is crucial for AI. This includes ensuring that data follows standard formats (e.g., ISO 8601 for dates and times), uses consistent naming conventions for similar variables across different datasets, uses standard encoding schemes, and is represented in standardized units of measurement (e.g., metric units for weight).
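
As a minimal sketch of what this looks like in practice (the field names, raw formats, and conversion shown here are hypothetical, not a prescribed schema), the Python snippet below converts a record to an ISO 8601 date, metric units, and consistent snake_case names:

from datetime import datetime

# Hypothetical raw record with a US-style date and imperial units.
raw_record = {"Patient_Weight_lbs": 154.0, "VisitDate": "03/15/2024"}

def standardize(record):
    """Convert an illustrative record to standard formats and units."""
    # ISO 8601 date (YYYY-MM-DD) instead of a locale-specific format.
    visit_date = datetime.strptime(record["VisitDate"], "%m/%d/%Y").date().isoformat()
    # Metric weight in kilograms instead of pounds (1 lb = 0.45359237 kg).
    weight_kg = round(record["Patient_Weight_lbs"] * 0.45359237, 2)
    # snake_case field names shared across datasets.
    return {"visit_date": visit_date, "weight_kg": weight_kg}

print(standardize(raw_record))  # {'visit_date': '2024-03-15', 'weight_kg': 69.85}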

Machine Understandability

It is not sufficient to just make data machine readable; ideally it should also be machine understandable. This allows machines to reason over disparate data, infer connections between datasets, and suggest new insights. This can be enabled by leveraging standard vocabularies, ontologies that capture relationships, and URL-like identifiers.
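
For instance, describing a dataset with a widely used vocabulary such as schema.org and URL-like identifiers makes its meaning resolvable by machines. The JSON-LD sketch below is only illustrative; the dataset and organization IRIs are made up:

import json

# A minimal JSON-LD sketch: the schema.org vocabulary and URL-like (IRI)
# identifiers let a machine resolve what "Dataset" and "creator" mean and
# link this record to other data about the same organization.
dataset_description = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "@id": "https://example.org/datasets/air-quality-2024",  # illustrative IRI
    "name": "City Air Quality Readings 2024",
    "creator": {"@type": "Organization", "@id": "https://example.org/org/env-agency"},
    "measurementTechnique": "Reference-grade PM2.5 sensors",
}

print(json.dumps(dataset_description, indent=2))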

Metadata

Machine-readable metadata enables a number of model-building activities, including: broad exploration of data attributes when creating features, building lineage from provenance metadata, checking data quality, deduplicating datasets before ingestion, and complying with approved uses of the data. For example, standard vocabularies can be used to document where the data came from (whether from a specific system of record, or from a transformation of multiple datasets, each with its own lineage information). This makes AI/ML models more explainable and also allows for more rapid model development.
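
A sketch of such a metadata record is shown below; the field names and URLs are illustrative (loosely modeled on W3C PROV and Dublin Core terms), not a prescribed schema:

# Illustrative machine-readable metadata: provenance, quality, and usage
# terms expressed so a pipeline can check lineage and approved use
# before training.
dataset_metadata = {
    "identifier": "https://example.org/datasets/sales-features-v3",   # hypothetical
    "prov:wasDerivedFrom": [
        "https://example.org/datasets/crm-orders",       # system of record
        "https://example.org/datasets/web-clickstream",  # upstream dataset
    ],
    "prov:wasGeneratedBy": "https://example.org/jobs/etl/sales-join",  # hypothetical job
    "dct:license": "https://creativecommons.org/licenses/by/4.0/",
    "approved_use": ["analytics", "model-training"],      # assumed field name
    "quality:completeness": 0.98,                         # assumed field name
}

def check_approved_use(metadata, purpose):
    """Gate ingestion on the declared approved uses in the metadata."""
    return purpose in metadata.get("approved_use", [])

assert check_approved_use(dataset_metadata, "model-training")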

Feature Engineering

Feature engineering involves selecting, transforming, and creating features (input variables) from raw data that are most relevant to the AI task. Well-engineered features can significantly enhance the performance of AI models.
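
As an illustrative sketch (the column names and aggregations are assumptions, not a recipe), the pandas snippet below derives per-customer spend and time-of-day features from raw transaction rows:

import pandas as pd

# Illustrative raw transaction data.
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [20.0, 35.0, 5.0, 12.0, 8.0],
    "timestamp": pd.to_datetime([
        "2024-06-01 09:00", "2024-06-03 18:30",
        "2024-06-02 12:00", "2024-06-02 21:15", "2024-06-04 08:45",
    ]),
})

# Derived features that are often more predictive than the raw rows:
# per-customer spend statistics and a simple time-of-day signal.
features = df.groupby("customer_id").agg(
    total_spend=("amount", "sum"),
    avg_spend=("amount", "mean"),
    n_transactions=("amount", "count"),
)
features["evening_ratio"] = (
    df.assign(evening=df["timestamp"].dt.hour >= 18)
      .groupby("customer_id")["evening"].mean()
)
print(features)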

Data Normalization and Augmentation

Normalizing data helps ensure that all features have a similar scale, which allows many machine learning algorithms to converge efficiently during training. Data augmentation techniques generate new training data by applying transformations such as rotation, translation, scaling, or adding noise to existing data. Augmentation can help improve model robustness and generalization.
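
The NumPy sketch below illustrates both ideas on synthetic data: min-max scaling to put features on a comparable scale, and a simple noise-based augmentation (the value ranges and noise level are arbitrary choices for the example):

import numpy as np

rng = np.random.default_rng(0)
# Synthetic features on very different scales.
X = rng.uniform(low=[0, 1000], high=[10, 50000], size=(100, 2))

# Min-max normalization: rescale each feature to [0, 1] so no feature
# dominates training simply because of its units.
X_min, X_max = X.min(axis=0), X.max(axis=0)
X_norm = (X - X_min) / (X_max - X_min)

# Simple augmentation: create extra training rows by adding small Gaussian
# noise to the normalized data (analogous to rotating or translating images).
noise = rng.normal(loc=0.0, scale=0.01, size=X_norm.shape)
X_augmented = np.vstack([X_norm, X_norm + noise])

print(X_norm.min(axis=0), X_norm.max(axis=0))  # approx [0, 0] and [1, 1]
print(X_augmented.shape)                       # (200, 2)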

Data Labeling

Supervised learning algorithms require labeled data for training. Data labeling involves annotating data instances with the correct output or target labels. High-quality labeling is essential for the performance of supervised AI models.
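
As a small illustration (the label set and examples are invented), labeled instances pair each input with a target label, and a basic check can guard against annotations that fall outside the agreed label set:

# Illustrative labeled examples for a sentiment classifier: each instance is
# paired with a target label drawn from a fixed label set.
LABELS = {"positive", "negative", "neutral"}

labeled_data = [
    {"text": "The delivery was fast and the product works great.", "label": "positive"},
    {"text": "Arrived broken and support never replied.", "label": "negative"},
    {"text": "The package came on Tuesday.", "label": "neutral"},
]

# A basic quality gate: reject annotations outside the agreed label set.
assert all(example["label"] in LABELS for example in labeled_data)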
