Data Selection

Predictions from an Ansys SimAI Pro model can only be as accurate as the data the model is trained on. Inaccurate training data leads to unreliable or flawed outputs.

Align data with goals

Ensure training data represent the desired problem space. A design space defines the nature, scope, and boundaries of your study. In the context of AI model training, it defines the problem the AI model is trying to solve.

Defining an end goal

Consider how the AI model will be used, and select training data that match the predictions you expect to make:
- If you intend to predict performance on geometrical design variations, train your model on geometrical variations.
- If you intend to predict performance on operating conditions, train your model on multiple operating conditions.

Input-Output Correspondence

The type of input data should correspond to the type of output you expect from the AI model, so curate the training data according to the nature of the problem you are trying to solve. Feeding the model selective training data (only the data necessary for your objective) helps ensure relevant predictions.

As your goal is to predict surface-level data (integral quantities, surface values, or point data on surfaces), the training data should include surface data and properties.

Practical examples:
- If you want to predict drag on a car, you need surface distributions of the pressure and wall shear stress variables.
- If you want to predict mass flow rate through a surface, you need surface distributions of velocity (incompressible flow) or of velocity and density (compressible flow).
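
To illustrate why these surface fields are needed, the sketch below integrates them over a discretized surface. It is not SimAI Pro code: the function names, the array layout (one row per surface cell), and the sign convention (unit normals point out of the body into the fluid) are assumptions made for this example.

```python
import numpy as np


def drag_force(pressure, wall_shear, normals, areas, flow_dir):
    """Approximate drag as the sum over cells of (-p * n + tau_w) . d * A.

    pressure:   (n,)   pressure per cell
    wall_shear: (n, 3) wall shear stress vector per cell
    normals:    (n, 3) unit normals, assumed to point from the body into the fluid
    areas:      (n,)   cell areas
    flow_dir:   (3,)   unit vector along the free-stream direction
    """
    # Force per unit area exerted by the fluid on each cell.
    traction = -pressure[:, None] * normals + wall_shear
    return float(np.sum((traction @ flow_dir) * areas))


def mass_flow_rate(velocity, normals, areas, density):
    """Approximate mass flow as the sum over cells of rho * (v . n) * A.

    density is a scalar for incompressible flow, or an (n,) array
    for compressible flow.
    """
    normal_velocity = np.einsum("ij,ij->i", velocity, normals)
    return float(np.sum(density * normal_velocity * areas))


# Hypothetical single-cell surfaces, for illustration only.
drag = drag_force(
    pressure=np.array([2.0]),
    wall_shear=np.zeros((1, 3)),
    normals=np.array([[-1.0, 0.0, 0.0]]),  # cell faces the incoming flow
    areas=np.array([0.5]),
    flow_dir=np.array([1.0, 0.0, 0.0]),
)
mdot = mass_flow_rate(
    velocity=np.array([[3.0, 0.0, 0.0]]),
    normals=np.array([[1.0, 0.0, 0.0]]),
    areas=np.array([2.0]),
    density=np.array([1.2]),  # per-cell density, as in the compressible case
)
```

Both integrals use only surface quantities, which is why a drag prediction needs pressure and wall shear stress, and a compressible mass flow prediction needs density in addition to velocity.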

Consider the size of your Design of Experiments (DoE)

Determine if your design of experiments is generalized or specialized in the context of your use case.

A “large” DoE aims for generalization across diverse use cases and is used to observe broad trends. It generally comprises a substantial amount of data with high variability. It is more scalable, but its performance can be diluted in specific cases.
- A large DoE would be a mix of many different car models (ranging from coupes to SUVs) with different specifications.

A “small” DoE is specialized. It focuses on smaller-scale variations and is used to build models that are more precise and narrower in scope. It is efficient but performs less well outside the narrowly defined problem space.
- A small DoE would be only sedans.
- An even smaller DoE would be only variations of car mirrors.

Consider the amount of data available

Select training data according to how much data you have, and refine the selection iteratively by trial and error, as in a standard optimization process.

- You have more than 50 simulations available: Start with a minimum of 10 simulations as training data to build a first AI model. Then iterate by adding data that diversify or specialize your model, depending on the goal of your study. To maintain model performance, select in the Model configuration only meaningful variables that are varied in the training data.
- You have between 10 and 50 simulations: Use all the data that are relevant to your study.
- You have fewer than 10 simulations for your study: Generate new data corresponding to your study to enrich your model. Consider the nature of the data, and consider a smaller, more segmented design space for improved performance.
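
The three cases above can be summarized as a small decision helper. This is only a sketch: the function name and the returned labels are hypothetical, and treating "between 10 and 50" as inclusive at both ends is an assumption, since the text does not say which side the boundaries fall on.

```python
def training_data_strategy(n_available: int) -> str:
    """Map the number of available simulations to the suggested approach.

    Assumption: the 10 and 50 boundaries are treated as inclusive for the
    middle case.
    """
    if n_available < 0:
        raise ValueError("n_available must be non-negative")
    if n_available > 50:
        # Build a first model on at least 10 simulations, then iterate.
        return "start_small_and_iterate"
    if n_available >= 10:
        # Use everything that is relevant to the study.
        return "use_all_relevant_data"
    # Too few simulations: generate more, or narrow the design space.
    return "generate_more_data"


for n in (5, 30, 200):
    print(n, training_data_strategy(n))
```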

Note

Depending on the complexity of the design space, some models reach good performance with as few as 20 simulations, while others need at least 500. In any case, build AI models progressively: start with a small amount of data to build confidence, then increase the number of simulations until the performance is satisfactory.
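
The progressive approach can be sketched as a simple loop. Everything here is illustrative: `build_progressively` and the `train_and_score` callback are hypothetical names, not part of any Ansys API, and the sketch assumes the simulations are already ordered by how useful you expect them to be and that a higher score means better performance.

```python
from typing import Callable, List, Sequence, Tuple


def build_progressively(
    simulations: Sequence,
    train_and_score: Callable[[Sequence], float],
    target_score: float,
    initial_size: int = 10,
    step: int = 10,
) -> List[Tuple[int, float]]:
    """Grow the training set until the score reaches the target or data runs out.

    train_and_score is a user-supplied (hypothetical) callback that builds a
    model on the given subset and returns a score where higher is better.
    Returns the list of (training set size, score) pairs that were tried.
    """
    history: List[Tuple[int, float]] = []
    size = min(initial_size, len(simulations))
    while size > 0:
        score = train_and_score(simulations[:size])
        history.append((size, score))
        if score >= target_score or size == len(simulations):
            break
        size = min(size + step, len(simulations))
    return history


# Toy scorer for demonstration only: the score is simply the subset size.
history = build_progressively(list(range(45)), lambda subset: float(len(subset)), 30.0)
```

Keeping the history of sizes and scores makes it easy to see whether adding simulations still improves the model or whether the design space should be narrowed instead.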

For more information on how to use the data for model training, see Adding Data.

Handle surface files: Data Location and Normals

AI models are sensitive to the information contained in surface files, namely:

- Surface variables should be located either at cells or at points.
- Normals are optional and should always be based on cells.

Location of Surface Variables

For surface files, the Ansys SimAI Pro application supports field data located at points or at cells.

The platform guides you if discrepancies are found, for example, if the same field is located both at cells and at points.
It is only possible to build a model with training data that have a consistent data location. For example, an AI model cannot be built on some training data with Pressure at cells and other training data with Pressure at points. Should you need to use both variables, give them different names.
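
Before uploading, you may want to check this consistency yourself. The sketch below assumes you have already extracted, for each training sample, a mapping from field name to location ("cells" or "points"); how you read that from your surface files depends on your tooling and is not shown. The function name and the sample names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def find_inconsistent_locations(
    samples: Dict[str, Dict[str, str]]
) -> Dict[str, Dict[str, List[str]]]:
    """Report fields that appear at different locations across samples.

    samples maps a sample name to {field name: "cells" or "points"}.
    The result maps each inconsistent field to {location: [sample names]}.
    """
    by_field = defaultdict(lambda: defaultdict(list))
    for sample, fields in samples.items():
        for field, location in fields.items():
            by_field[field][location].append(sample)
    return {
        field: {loc: sorted(names) for loc, names in locations.items()}
        for field, locations in by_field.items()
        if len(locations) > 1
    }


# Hypothetical training set: Pressure is at cells in two runs, at points in one.
samples = {
    "run_01": {"Pressure": "cells"},
    "run_02": {"Pressure": "points"},
    "run_03": {"Pressure": "cells", "WallShearStress": "cells"},
}
issues = find_inconsistent_locations(samples)
```

An empty result means every field uses a single location across the training data. This check covers differences between samples only, not a field stored at both locations within one file.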