3. Causal Discovery Configuration

The following parameters can be used to control our Causal Discovery algorithm.

aggregation

The aggregation is an optional parameter of a variable space definition. By default a count aggregation is used, i.e. the algorithm counts cases in each cell of the defined group-by. You may use any aggregation. For example, use “SUM(dosage)” to form variables like

“total dosage of pain killers prescribed to the patient in the last 3 months …”

See https://docs.xplain-data.de/xplaindoc/interfaces/genartefacts.html for how to code aggregations in JSON.
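As a rough illustration of the difference between the default count aggregation and a SUM aggregation, the following Python sketch aggregates a handful of made-up prescription records per group-by cell. The record layout and all names in it are assumptions for illustration only and are not part of the Xplain interfaces.

```python
from collections import defaultdict

# Hypothetical prescription events: (patient, drug class, dosage)
events = [
    ("p1", "pain killer", 200.0),
    ("p1", "pain killer", 400.0),
    ("p1", "antibiotic", 50.0),
    ("p2", "pain killer", 100.0),
]

count_per_cell = defaultdict(int)   # default behaviour: count aggregation
sum_per_cell = defaultdict(float)   # alternative: something like SUM(dosage)

for patient, drug_class, dosage in events:
    cell = (patient, drug_class)    # one group-by cell per patient
    count_per_cell[cell] += 1
    sum_per_cell[cell] += dosage
```

With a count aggregation the cell for patient p1 and “pain killer” holds the number of prescriptions, with the sum aggregation it holds the total dosage.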

aggregations

An optional array of aggregations which are used for building independent variables. (See https://docs.xplain-data.de/xplaindoc/interfaces/genartefacts.html for how to code an aggregation in JSON.)

allowNegations

The aggregated variables are compared to thresholds to form binary variables. If the flag allowNegations is set to true, the negated variables are also evaluated during the search for independent variables. Default is

allowNegations = false.
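The following Python sketch illustrates how one aggregated value might be turned into binary variables, with and without negations. It assumes that a variable is 1 if the value is greater than or equal to the threshold; the exact comparison used by the algorithm is not specified here, and the function name is hypothetical.

```python
def binary_variables(value, thresholds, allow_negations=False):
    """Map one aggregated value to named binary variables.

    Assumption: a variable is 1 if value >= threshold.
    """
    variables = {}
    for t in thresholds:
        variables[f">= {t}"] = int(value >= t)
        if allow_negations:
            # The negated variable is simply the complement
            variables[f"NOT >= {t}"] = int(value < t)
    return variables


# A patient with 3 pneumonia cases, compared against thresholds 1 and 5
result = binary_variables(3, thresholds=[1, 5], allow_negations=True)
```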

allowedToPrune

If you set allowedToPrune = false, the algorithm will not prune those variables, even if they are non-significant. Default is true - be careful when setting this to false.

combinations

A list of group-bys. For each resulting group-by of this variable definition, the group-bys given in this property are added to form multi-dimensional group-bys.

computeAllAlternatives

If this parameter is set to true, the algorithm computes “alternatives” for each variable at the end of the model building process, by removing this variable and searching for alternatives in the defined variable space (see also leaveOneOutSearch). The variables which come in as a replacement for the one left out are presented as alternatives to that variable. Replacing this variable by its alternatives will result in a negative likelihood gain, i.e. the variable itself is the better explanation.

computeRelatedFactors

If this parameter is given and, for example, set to computeRelatedFactors = 5, then at the end of the model building process the algorithm tries to find at most 5 other factors for each variable which are correlated to the variable - as a kind of additional qualification of the variable.

constrainModelScopePriorToTarget

If constrainModelScopePriorToTarget = true, the model scope will be restricted to objects with no target events prior to a potentially existing (absolute / non-relative) selection on the time dimension of the target object. An example: The selected target is breast cancer in the years 2019 and 2020. There might be patients with a breast cancer diagnosis already prior to that period (with no such diagnosis in 2019 and 2020, or with a second diagnosis in 2019 and 2020). Those patients will be excluded from the model scope.

This parameter should in general be set to true. (Basically there is no scenario known where this should be false. Therefore the parameter might soon be deprecated.)
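The effect of constrainModelScopePriorToTarget = true on the breast cancer example can be sketched in Python as follows. The data layout, the patient IDs and the helper in_model_scope are made up for illustration; this is not the actual implementation.

```python
from datetime import date

# Hypothetical target events (e.g. breast cancer diagnoses) per patient
target_events = {
    "p1": [date(2019, 5, 1)],                    # first diagnosis within the period
    "p2": [date(2017, 3, 1)],                    # diagnosis only before the period
    "p3": [date(2016, 1, 1), date(2020, 2, 1)],  # before and within the period
    "p4": [],                                    # no target event at all
}
period_start = date(2019, 1, 1)  # start of the selected target time frame


def in_model_scope(events, period_start):
    # Out of scope as soon as any target event lies before the period
    return not any(event < period_start for event in events)


model_scope = sorted(
    patient
    for patient, events in target_events.items()
    if in_model_scope(events, period_start)
)
```

Patients p2 and p3 drop out of the model scope because they already had a target event before 2019.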

constrainSurrogateTargetsToTargetTimeFrame

If this parameter is set to true and the target time frame is constrained to, e.g., the years 2020 and 2021, the surrogate targets are also constrained to this time frame (for surrogate time models which are bound to an event, e.g., equal_lifetime_event). This helps to ensure that non-target cases are better comparable to target cases. In particular, information about target/non-target might otherwise sneak into independent variables simply because non-targets are from a different time frame. Therefore it is recommended to set this parameter to true.

description

An optional additional description, e.g. on the intention of the model and how to use it.

exclusionPattern

The property exclusionPattern is an optional array of strings, each one interpreted as a pattern. If the argument exclusionPattern is given, attributes of the defined object which match one of these patterns are excluded from building independent variables.

exhaustiveOneAttributes

These attributes need to be attributes on the predictiveModelObject (or from parent objects). The joint combinatorial space (the multi-dimensional group-by) defines the set of variables which form the “default explanation”. Those variables cannot be pruned, and they are “exhaustive but exclusive”, meaning that for each “Patient” exactly one of them is “1” and the others are “0”. If exhaustiveOneAttributes are not given, a default explanation variable (named “Default Cause”) is used.

finalVariableSet

An explicit list of variables as the result of the algorithm. See the definition of VariableSetUIModel for details.

generateInteractions

If generateInteractions is set to true, the algorithm will also test “interactions” between two variables. I.e. if there are two binary variables A and B, the algorithm evaluates a third binary variable “A and B” as a potential independent variable.

groupBy

A variable space definition primarily has the mandatory property “groupBy”. In the simplest case this is the only property of a variable space. It may be a one-dimensional or a multi-dimensional group-by. See the generic documentation at https://docs.xplain-data.de/xplaindoc/interfaces/genartefacts.html for how to code a group-by in JSON.

Each cell in this group-by results in one or a number of variables. For example, there might be the sub-object “Diagnoses” attached to the root object “Patient”, and there is a categorical dimension “Diagnosis Code” with typically a couple of thousand diagnosis codes, e.g. “pneumonia”, “broken finger”, “depression”, … With that group-by in a variable definition the algorithm will form a couple of thousand variables, one for each of the diagnosis codes, for example

“number of pneumonia cases of the patient”, “number of broken fingers”, …

Technically each of these variables is an aggregation dimension - it aggregates information from deeper level objects onto the level of the root object (or predictive model object) to form a variable attached to the root object (a variable which characterises the patient and is used as a potential independent variable of the model).

Each of these aggregation variables is compared against the defined thresholds to finally form a number of binary variables as input to the model (NOR-model).

The group-by may be any multi-dimensional group-by. Typically there is a second dimension in the group-by, the time or the relative time. Based on that, variables can be built like

“number of pneumonia cases of the patient in the last month”, “number of pneumonia cases of the patient in the last 3 months”, …

See also the property relativeTimeAttributeConfiguration for more information.

inclusionPattern

The property inclusionPattern is an optional array of strings, each one interpreted as a pattern. If the argument inclusionPattern is given, only attributes of the defined object which match one of these patterns are used for building independent variables.
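How inclusionPattern and exclusionPattern could interact can be sketched as below. The sketch assumes that patterns are regular expressions and that exclusion is applied after inclusion; neither the pattern syntax nor the order of evaluation is stated in this documentation, and select_attributes is a hypothetical helper.

```python
import re


def select_attributes(attributes, inclusion_pattern=None, exclusion_pattern=None):
    """Return the attribute names that would be used for building variables."""
    selected = []
    for name in attributes:
        # Assumption: patterns are regular expressions searched in the name
        if inclusion_pattern and not any(re.search(p, name) for p in inclusion_pattern):
            continue
        if exclusion_pattern and any(re.search(p, name) for p in exclusion_pattern):
            continue
        selected.append(name)
    return selected


attributes = ["Diagnosis Code", "Diagnosis Date", "Internal ID", "Drug Class"]
selected = select_attributes(
    attributes,
    inclusion_pattern=["^Diagnosis"],
    exclusion_pattern=["Date$"],
)
```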

independentAggregationConstraints

These selections can be used to restrict the aggregations when building independent variables. I.e. any selections provided here will be added as a where-clause when building an independent variable (an aggregation dimension). Events prior to the target outside of this selection are effectively “invisible” to the algorithm - they are treated as if they did not exist.

The parameter is a list of selections, and different components in the array are combined with an AND logic. A typical scenario is that there is a list of types of events A and B which should be ignored, i.e. the events to be ignored are expressed as “notA AND notB”. Therefore the negate flag has to be set in the individual elements. (In a future release we might provide a more comfortable version of this, where you can directly specify the events to be ignored, for example as the property toBeIgnoredEvents.)

This parameter is important as the functionality “ignore factor = show next causal factor behind” is built on it. (See also the property rejectAsExplanation, which does not ignore events, but just tries to find a different characterisation.)

independentEventConstraints

Time constraints specific to objects: the first key is the object, and the second map holds the keys "from" and "to". Example: "Events": {"from": "2014-01", "to": "2017-01"}. This parameter will soon be deprecated.

independentEventFromDate

Global time constraints (for all objects) for aggregation of independent variables. See also independentEventConstraints. This parameter will soon be deprecated.

independentEventToDate

Global time constraints (for all objects) for aggregation of independent variables. See also independentEventConstraints. This parameter will soon be deprecated.

independentVariableSets

The independentVariableSets define the search space in the underlying object model from which independent variables are formed. independentVariableSets (plural) is an array of variable sets. Typically, however, there is only one variable set, i.e. the property in JSON usually looks like:

"independentVariableSets": [
  {
    "defaultThresholds": [],
    "autoSpaceDefinitions": [],
    "variableSpaceDefinitions": []
  }
]

Each variable set may have the properties “defaultThresholds”, “autoSpaceDefinitions” and “variableSpaceDefinitions”. defaultThresholds is a set of thresholds which will be used as a default in all variable sets if they are not given explicitly. autoSpaceDefinitions define search spaces in a simplified, less verbose way - many details are automatically generated here. variableSpaceDefinitions contain explicit and more detailed definitions of how to search the object model for independent variables.

For more details on autoSpaceDefinitions and variableSpaceDefinitions see the corresponding files.
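If the configuration is assembled programmatically, the structure described above can be built as a plain dictionary and serialized to JSON. The following Python sketch uses only the property names mentioned in this section; the threshold values are purely illustrative, and the definitions are left empty because their content is documented elsewhere.

```python
import json

configuration = {
    "independentVariableSets": [
        {
            "defaultThresholds": [1, 2, 5],   # illustrative values only
            "autoSpaceDefinitions": [],       # to be filled, see the corresponding files
            "variableSpaceDefinitions": [],   # to be filled, see the corresponding files
        }
    ]
}

# Serialize with straight double quotes, as required by JSON
config_json = json.dumps(configuration, indent=2)
```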

injectVariablesStepwise

If true, a very superficial model is built first, which primarily serves to prune a large set of independent variables relative to that model, so as to quickly get rid of a large number of irrelevant variables. After that the bootstrapping of models starts in terms of multiple variableSelectionIterations (see this parameter). It is recommended to keep this parameter set to true in general.

leaveOneOutSearch

If true, in a final model building step each variable is pruned to search for alternatives. If the likelihood thereby increases, the alternative variables found will be used in the model (instead of the one left out). The process is repeated until there are no more changes in the model - which is a very expensive process. Therefore, this should be set to false by default.

marginIncluded

See the related parameter predictOnRelativeTimeWithMarginOf for an explanation.

maxDegreeInteractions

For two variables “A” and “B” the interaction variable “A and B” is an interaction of degree 2. The variable “A and B and C” is an interaction of degree 3. The parameter defines the maximum degree of interactions. With this parameter set to 2, only second-order interaction variables are built.
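The notion of interaction degree can be made concrete with a short Python sketch. It enumerates all AND-combinations of binary variables up to a maximum degree for one object instance; the function name and the data are hypothetical, and the real algorithm will of course evaluate such candidates statistically rather than just enumerate them.

```python
from itertools import combinations


def interaction_variables(variables, max_degree=2):
    """Build AND-interactions of degree 2..max_degree.

    variables maps a variable name to its 0/1 value for one instance.
    """
    result = {}
    names = sorted(variables)
    for degree in range(2, max_degree + 1):
        for combo in combinations(names, degree):
            # The interaction is 1 only if all participating variables are 1
            result[" and ".join(combo)] = int(all(variables[n] for n in combo))
    return result


patient = {"A": 1, "B": 1, "C": 0}
degree2 = interaction_variables(patient, max_degree=2)
degree3 = interaction_variables(patient, max_degree=3)
```

With three base variables, a maximum degree of 2 yields three interaction variables, and a maximum degree of 3 adds “A and B and C” as a fourth.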

maxIndependentVariables

Upper limit on the number of variables the final model will have. Variables will be pruned until the number of variables is less than this limit.

maxNumberVariablesPerAttribute

maxNumberVariablesPerAttribute is an optional limit for the maximum number of variables which are built based on each attribute. With auto-space definitions you quickly arrive at multiple millions of potential independent variables. You can use this parameter (at least during development of a model) to ensure that you don’t unintentionally create huge variable spaces which result in a long runtime. maxNumberVariablesPerAttribute is set to 5000 by default.

minBeta

Variables which after model fitting have beta < minBeta will be removed (even if they have a significant p-value). The model will be re-fitted till all variables have beta >= minBeta. beta is also termed the contribution of the variable.

minImpact

The impact is the product of the contribution of a variable (beta) times the support of the variable (an approximate measure of the relevance of the variable in terms of how many target cases this variable generates). Variables which have an impact less than minImpact will be pruned.

minSupport

Variables with support less than minSupport will be pruned. Support means the number of cases (samples) for which the corresponding variable = 1.
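The interplay of minBeta, minImpact and minSupport can be sketched as a simple filter. The sketch below only applies the three limits once to made-up (beta, support) pairs; the actual algorithm re-fits the model after pruning, which is not reproduced here, and the function name is hypothetical.

```python
def prune(variables, min_beta=0.0, min_support=0, min_impact=0.0):
    """Keep variables that pass all three limits; returns name -> impact."""
    kept = {}
    for name, (beta, support) in variables.items():
        impact = beta * support  # impact = contribution times support
        if beta >= min_beta and support >= min_support and impact >= min_impact:
            kept[name] = impact
    return kept


# Hypothetical fitted variables: name -> (beta, support)
fitted = {
    "pneumonia >= 1": (0.20, 500),
    "broken finger >= 1": (0.01, 2000),  # fails minBeta
    "rare code >= 1": (0.50, 3),         # fails minSupport
}
kept = prune(fitted, min_beta=0.05, min_support=10, min_impact=5.0)
```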

modelScopeAttributes

This is less a parameter of the model configuration, but rather defines how to set the model scope from within a UI (from current selections in the Xplain Object Explorer).

The optional parameter modelScopeAttributes is a list of attributes.

It serves for defining the model scope in an alternative or additional way to explicitly given selections in the property modelScopeSelections. Any selections which are currently set as the globalSelections in the current session and which are on attributes configured in the modelScopeAttributes are added to the model scope. As the model scope is a selection on the predictiveModelObject (root object), the modelScopeAttributes also need to be attributes from the predictiveModelObject.

An example: The attribute “Gender” is listed in the modelScopeAttributes, and there is currently a selection set on this attribute, e.g., Gender = female. When starting the causal discovery algorithms, this selection is interpreted as model scope, i.e., the model is built only for female patients.
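The way current selections are picked up via modelScopeAttributes can be sketched as follows. Selections are represented here as a simple mapping from attribute to selected states, which is an assumption for illustration and not the actual selection format.

```python
# Hypothetical current global selections in the session
global_selections = {"Gender": ["female"], "Region": ["north"]}

# Attributes configured in modelScopeAttributes
model_scope_attributes = ["Gender"]

# Hypothetical explicit selections from modelScopeSelections
explicit_selections = {"Age Group": ["60+"]}

# Only global selections on configured attributes are added to the model scope
model_scope = dict(explicit_selections)
model_scope.update(
    {
        attribute: states
        for attribute, states in global_selections.items()
        if attribute in model_scope_attributes
    }
)
```

The selection on “Region” is ignored because that attribute is not listed in modelScopeAttributes.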

modelScopeDimension

Name of modelScopeDimension - set when exporting a model and used during import to pick this dimension up again by name. Don’t use this in configurations.

modelScopeSelections

These selections define the explicit model scope (an array of selections). They need to be selections on the level of the predictiveModelObject.

In addition - if constrainModelScopePriorToTarget is set to true and there is a time selection for the target events - the model scope is automatically amended / restricted to those instances (patients) which did not have the exact target event prior to the time frame of the target selections. This for example means that - if the target is “Diagnosis = X” AND “Type of diagnosis = confirmed”, there will be patients in the model scope which had a non-confirmed diagnosis X prior to the selected time frame. (This behaviour may be changed - see the parameter primaryTargetSelectionAttributes).

Furthermore, note that this implicit part of the model scope is not displayed in the returned configuration JSON (initial UI model).

name

Name of the model - it also serves as an ID. When calling the method buildPredictiveModel you may provide the parameter modelName, which then overrides the name explicitly given in the configuration.

nonTargetReferenceEventTimeDimensions

The property nonTargetReferenceEventTimeDimensions is a list of time dimensions. This list defines which objects, and based on which time dimension (i.e. which sort of events), are considered as surrogate target events. If no targetAndSurrogateTargetEventSpace is given, all events for objects with a nonTargetReferenceEventTimeDimension are considered as surrogate target events based on the corresponding time. If there is a targetAndSurrogateTargetEventSpace, only objects and events in that space are considered as surrogate targets (even if there are additional nonTargetReferenceEventTimeDimensions).

nonTargetRelativeTimeReferenceType

The relative time is defined as the time relative to the target event which is to be predicted. For object instances of the predictive model object (e.g. patients or machines) which serve as a reference group to the target group there is no such target event which can be used as the reference point. A surrogate event (or surrogate point in time) is used instead. There are different strategies available to choose this surrogate event, which can be set by the property nonTargetRelativeTimeReferenceType. It may have one of the following values:

random_event: A surrogate point is randomly chosen in the admitted time frame (the most basic strategy - but not recommended)

last_event: The last available event in the data or, respectively, in the allowed event space

first_event: The first available event in the data or, respectively, in the allowed event space

equal_distribution: A surrogate point is randomly chosen in the admitted time frame, however, from the same distribution as that of the target events

equal_lifetime_distribution: For each object instance (patient, machine, …) a “start of life” is determined first, e.g. the first event available (according to nonTargetReferenceEventTimeDimensions if available) or the first event within optionally configured startOfEpisode selections. The surrogate point for the non-targets is then chosen randomly from the same distribution as that of the “life times” of the target events. The life time thereby is the time difference between start of life and the corresponding event.

equal_lifetime_event: Identical to the strategy equal_lifetime_distribution, except that - after randomly choosing the surrogate time - the closest next event to that time is determined. This one is chosen as the surrogate event.

All strategies use only those dimensions/objects configured in nonTargetReferenceEventTimeDimensions to find a surrogate point in time / surrogate event. Also, if the strategy is tied to an event, the targetAndSurrogateTargetEventSpace will also be applied to restrict the admissible reference events.

The strategy equal_lifetime_event is the one which is usually recommended.
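The idea behind equal_lifetime_event can be sketched in a few lines of Python. The sketch makes several assumptions that are not confirmed by this documentation: times are plain numbers (e.g. days), the life time distribution of the targets is represented by a list of observed life times, and “closest next event” is read as the first event at or after the drawn surrogate time. The function name is hypothetical.

```python
import random


def surrogate_event(start_of_life, events, target_lifetimes, rng):
    """Pick a surrogate event time for one non-target instance."""
    # Draw a life time from the observed life times of the target events
    lifetime = rng.choice(target_lifetimes)
    surrogate_time = start_of_life + lifetime
    # Assumption: take the first event at or after the surrogate time
    later = [event for event in sorted(events) if event >= surrogate_time]
    return later[0] if later else None


rng = random.Random(42)
picked = surrogate_event(
    start_of_life=10,
    events=[10, 120, 300],
    target_lifetimes=[100, 250],
    rng=rng,
)
```

In this sketch an instance without any event after the drawn surrogate time gets no surrogate event at all; how the algorithm treats such instances is not covered here.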

object

The object for which an auto-space definition is defined.

pairwiseScoringByFalseNegativePredictions

This parameter is experimental. For the moment, keep this parameter set to true in general.

The exact meaning of this parameter has to be re-evaluated (in particular as soon as the AND gate / inhibitor effects are introduced). As of now it seems: Marginal scoring relative to a model should always / implicitly be based on false negatives. After that, still lots of identical / correlated pairs exist. To get rid of those, it probably does not make a difference whether this is done based on false negatives or based on simple correlations - the primary goal is to get rid of correlated pairs (no matter how they correlate to the target), and if two variables are correlated, the missing improvement in false negatives in a model with both variables will indicate that correlation probably equally well. Indeed, practical evaluations show that there seems to be not much of a difference between pairwiseScoringByFalseNegativePredictions = false and true.

predictOnRelativeTimeWithMarginOfDays

See the new parameter “predictOnRelativeTimeWithMarginOf”. This parameter is deprecated.

predictiveModelObject

The object which is in focus of the statistical analysis (where you want to make predictions for or understand causes for properties or events of this object). Usually this is the root object (indeed in other cases some features might not be available). In health care it is typically the patient and we predict events of the patient. We might, however, as well predict events within hospital cases or any other sorts of episodes of the patient. In this case the predictive model object needs to be the corresponding object.

predictiveModelType

Defines the type of predictive model which is to be built. At the moment basically only NOR-models are supported.

primaryTargetSelectionAttributes

In some settings, certain parts of the target selection play a particular role. For example the target might be diagnosis = “cancer” AND type = “confirmed”. Events exactly corresponding to the target selection cannot appear prior to the target event and therefore cannot go into independent variables if predictOnRelativeTimeWithMarginOf > 0. The primary part of the target selection is “cancer”, and typically one might want to keep not just the events exactly matching the target selection out of the independent variables, but “cancer” events in general (i.e. also cancer events with type = “suspicion”).

The list of primaryTargetSelectionAttributes can be used to tag attributes such that target selections in those attributes go into the independentAggregationConstraints (the joint selection - joint on potentially multiple attributes found). The method call in XplainSession may override the primaryTargetSelectionAttributes, however, only if those are null. If the primaryTargetSelectionAttributes are given in the configuration (non-null), then those will prevail. In particular, if primaryTargetSelectionAttributes are given as an empty array, this will disable “injecting” primaryTargetSelectionAttributes from the UI - which is wanted in case the margin <= 0 (i.e. we want equal time events to go into the independent variables - the SW case).

Depending on the settings in primaryTargetSelectionAttributes, there is a difference between selecting the target with two selections and clicking the bulb on one, or generating a pin and clicking the bulb on the pin. In Emanuel’s case the target is “RP” + “Schneideinheit” (cutting unit), but we likely do not want to exclude all “Schneideinheit” events in general as independent events.

processTimeScale

See the related parameter predictOnRelativeTimeWithMarginOf for an explanation.

rejectAsExplanation

See also independentAggregationConstraints. In contrast to independentAggregationConstraints, events defined in these selections are not ignored, but the attributes and states specified in the given selections are not used for building independent variables.

This tag is experimental! It is currently available in the Object Explorer UI as “find alternative”. The feature might be removed soon, as it is too difficult to understand relative to other features (ignore / independentEventConstraints or correlated variables).

relativeIndependentEventsPeriod

See also the parameter “predictOnRelativeTimeWithMarginOf” which is the counterpart to this parameter. predictOnRelativeTimeWithMarginOf and relativeIndependentEventsPeriod together define the interval relative to the target or surrogate target event from which information enters into the explanatory variables.

relativeTimeAttributeConfiguration

If you use a relative time axis (the parameter predictOnRelativeTimeWithMarginOf) you may optionally also define the time ranges which the algorithm uses to build independent variables. This is done by configuring an attribute (ElapsedTimeAttribute) - the ranges in this attribute are used to constrain variables / build different variables. See the method addElapsedTimeAttribute in the generic documentation. An example of such an attribute configuration looks like:

"relativeTimeAttributeConfiguration": {
  "attribute": "Custom Relative Time Ranges",
  "creationMode": "IRREGULAR",
  "binBoundariesInBaseUnits": [[-200, -100, -50, -10, 0]],
  "upperBinBoundaryIncluded": false,
  "baseUnit": "DAY",
  "dimensionTimeUnit": "MILLISECOND"
},

The property relativeTimeAttributeConfiguration is optional - if you do not provide this property, standard exponential ranges are used.

The “Time to target of model” tag in the variable space definitions serves as a placeholder - the relative time attribute will be used there.
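To illustrate how irregular bin boundaries such as those in binBoundariesInBaseUnits translate into time ranges, the following Python sketch assigns a relative time (in days before the target event) to its range. The interpretation of upperBinBoundaryIncluded as “half-open interval with the upper boundary excluded or included” is an assumption, and bin_of is a hypothetical helper.

```python
boundaries = [-200, -100, -50, -10, 0]  # days relative to the target event


def bin_of(days, boundaries, upper_included=False):
    """Return the (lower, upper) range containing days, or None."""
    for lower, upper in zip(boundaries, boundaries[1:]):
        if upper_included:
            if lower < days <= upper:
                return (lower, upper)
        elif lower <= days < upper:
            return (lower, upper)
    return None


# With the upper boundary excluded, -100 belongs to the range starting at -100
range_excluded = bin_of(-100, boundaries, upper_included=False)
# With the upper boundary included, -100 closes the range ending at -100
range_included = bin_of(-100, boundaries, upper_included=True)
```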

saveResultsToCSV

If saveResultsToCSV = true, results of building a predictive model will be saved to csv files (including predicted probabilities). Those files will end up in the results folder.

significance

This parameter determines the required significance of the independent variables. More precisely, the given significance translates into a required likelihood gain. If, when removing one degree of freedom (pruning a variable) and re-fitting the model, the likelihood drops by more than the required likelihood gain, then this variable will not be removed. The default value is significance = 0.9.

The significance can be given as a parameter of the method buildPredictiveModel, in which case it will override the significance given in the model configuration.
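How a significance level can translate into a required likelihood gain is sketched below using the textbook likelihood-ratio test with one degree of freedom. This is an assumption for illustration: this documentation does not state which mapping the algorithm actually uses, and the function name is hypothetical.

```python
from statistics import NormalDist


def required_log_likelihood_gain(significance):
    """Log-likelihood gain required by a likelihood-ratio test with 1 degree of freedom.

    Assumption: twice the log-likelihood difference is compared against the
    chi-square quantile, which for 1 degree of freedom equals the squared
    normal quantile at (1 + significance) / 2.
    """
    z = NormalDist().inv_cdf((1.0 + significance) / 2.0)
    return 0.5 * z * z


gain_default = required_log_likelihood_gain(0.9)   # roughly 1.35
gain_strict = required_log_likelihood_gain(0.99)   # larger gain required
```

Under this reading, a higher significance demands a larger drop in likelihood before a variable is considered worth keeping.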

startOfEpisodeAttributes

Less a model configuration parameter than a parameter which determines how the startOfEpisodeSelections are guessed based on current global selections. The approach, however, is “wired”, should not be used any more and has been deprecated since July 2022.

startOfEpisodeSelections

See also startOfLifeDimension - the difference to startOfLifeDimension is that the first startOfEpisode event defines the start of life.

startOfLifeDimension

In case of surrogate relative time strategies like equal_lifetime_distribution or equal_lifetime_event this parameter may optionally be given to define which date dimension defines the “start of life”. In case of patients, for example, it might be a dimension like “enrolled since”. In general only target events after the start of life are considered - the first one after it is the target. If there is no such event, then the instance is out of scope of the model.

subSamplePeerGroup

In some cases there might be a very small target group, but a large peer group which serves as a reference (non-target group). This happens for example if we have 10 million patients, but want to understand a very rare disease in this population. To avoid long computation time, you may sub-sample the peer group.

targetAndSurrogateTargetEventSpace

targetAndSurrogateTargetEventSpace is an optional array of selections.

In case of prediction on a relative time: Typically the targetSelections define the zero point in time for the relative time axis (and the zero point for non target cases (patients) is defined according to the defined strategy, e.g. equal_lifetime_event). In some cases, however, we may want to set the zero point according to an explicitly given selection. An example: We have two products in the market and we want to predict which one is the first to be prescribed, depending on the prior history. I.e. target and non-target patients both have one or the other event which defines the zero point, and relative to the prior history we want to predict which of the two is prescribed. The selections given in targetAndSurrogateTargetEventSpace may be used to define what the “market” is.

In general, if this member field is provided: This selection defines (as the name says) the space of target and surrogate target events. The targetSelections and targetAndSurrogateTargetEventSpace need to be defined on the same object, and (as of July 2022) this tag is allowed only in combination with nonTargetRelativeTimeReferenceType = first_event.

Target events are restricted to that space (selection). Surrogate targets come from the rest of that space. As the model scope is restricted to cases with non-null (relative) times (restrictModelScopeToNonNullMaterializedReferenceTimes()), the model scope is implicitly restricted to cases which had at least one event in targetAndSurrogateTargetEventSpace.

Also, if the target selections have a time period defined and constrainModelScopePriorToTarget = true, then - in general - the model scope is restricted to “Patients” which did not have that event prior to the defined period. The same applies to non-target Patients: Patients which had surrogate target events prior to the defined period also go out of model scope.

targetDimension

Name of targetDimension - set when exporting a model and used during import to pick this dimension up again by name. Don’t use this in configurations.

targetSelections

Defines the target to be predicted. It may be a selection on a sub-object of the predictiveModelObject (typically an event to be predicted) or a selection immediately on the predictiveModelObject. Typically the targetSelections are not given explicitly in the model configuration, but are handed over as a parameter of the method buildPredictiveModel.

thresholds

Optionally thresholds individual to the variable space may be given as an array of numbers.

timeDimensions

In case sub-objects of the root object have multiple time dimensions, this list of time dimensions can be used to specify which of the time dimensions is to be used in model building.

variableSelectionIterations

The number of “bootstrap loops”. With each loop the required likelihood gain is increased so as to finally arrive at the required significance. In those loops each previous model serves as a basis relative to which new variables are injected.

variableSetSplitLimit

If - after marginal and pairwise pruning and before explicitly fitting models - the size of a variable set (number of variables) exceeds this limit, the variable set is split into multiple sets. Model fitting and pruning is done for each set, then the sets are merged, fitted and pruned again. This avoids building e.g. NOR-models on a set of very many independent variables (which might be numerically challenging).
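The split, fit, merge and re-fit sequence can be sketched in Python as follows. How the algorithm actually partitions the variable set is not described here; the sketch assumes simple consecutive chunks of at most the limit size, and fit_and_prune is a placeholder that stands in for the real model fitting and pruning.

```python
def split_variable_set(variables, split_limit):
    """Split into consecutive chunks of at most split_limit variables."""
    if len(variables) <= split_limit:
        return [list(variables)]
    return [
        list(variables[i:i + split_limit])
        for i in range(0, len(variables), split_limit)
    ]


def fit_and_prune(variables):
    # Placeholder for model fitting and pruning: keeps every second variable
    return variables[::2]


variables = [f"v{i}" for i in range(10)]
subsets = split_variable_set(variables, split_limit=4)

# Fit and prune each subset, merge the survivors, then fit and prune once more
merged = [v for subset in subsets for v in fit_and_prune(subset)]
final = fit_and_prune(merged)
```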