Advanced Ledger Format¶

Graphene can perform better if the parquet input is organized with some constraints in mind. Some of the basic requirements of the parquet input are also more flexible if certain alternate conditions can be met. This section describes these considerations in detail.

Columns¶

Note

Column names are case-sensitive and required columns must match those specified in the table.

Additional details on the input columns¶
Column Name	Data Type	Required	Nullable?	Description
Trial	INT64	Yes	No	Also known as Year. In the general case of a multi-year analysis Trial is a more appropriate term to avoid confusion.
Time	DOUBLE	Yes	No	In models that represent the time dimension of a scenario as day-of-year, this must be converted to an absolute timestamp.
Type	STRING	Yes	No	Any choice may be made for Type values that distinguish records from each other, as long as they are consistent across all inputs. Choosing standard values like “Loss” and “Premium” will prevent having to further customize modelling, since built-in Graphene features will be expecting standard values.
Value	DOUBLE	Yes	No	The actual monetary value associated with the transaction in the currency defined for the Ledger. If the Ledger currency is set to USD, the value in this field is US dollars. Also, Ledgers use a Profit & Loss perspective of the insured which means losses are represented as negative values and income as positive values.
Metadata Field 1…n	INT64 FLOAT32 DOUBLE[note] STRING	No	Yes	Each Ledger can contain any number of metadata fields. For a loss model, this would typically be information about the event that caused the loss (e.g. EventId, Peril, Severity) or information about the exposure affected (e.g. LineOfBusiness, TIV, Location, etc.).

Note

DOUBLE format Metadata columns will be handled as single precision floating point values (32 bit)

Since we support DOUBLE data types, and since a number of special values can appear in these columns, it is important to note that NaN is converted and treated as a null value in columns where it values can be null. When columns such as Time or Value do not allow null, NaN is not permitted.

Since Graphene is case sensitive to column names (i.e. "Trials" and "trials" are different columns), be sure provide the correct column name in parquet input. AWS Athena treats columns as case insensitive in parquet, so when using data directly with Athena or when requesting SQL queries via the Graphene Analysis API (which uses Athena), columns that only differ by case.

Data types¶

When multiple ledger inputs are provided, Graphene expects column data types to be defined identically for columns with the same name. An exception is made for integer columns. 32-bit integer metadata columns will be automatically converted to 64-bit integers when mixed inputs are detected. It is recommended that all input ledgers use the same integer data types for any given metadata column. Sub-optimal performance will result from relying on automatic handling of mixed integer types.

Note

When multiple ledger inputs are provided, Graphene will not attempt to handle a column that has differing types among multiple files. Instead, a user-facing error will be returned.

Occurrence Key¶

Since independent events in models can occur at the same time, an otherwise optional metadata column is required. This is known as the Occurrence Key, and it is provided as a metadata column in the Ledger. It does not need to have a specific name, but it must consist of exactly one column, of integer type that uniquely separates individual events within a Trial.

Note

If every event is guaranteed to have a distinct Time, then providing an Occurrence Key is optional.

An illustration of the handling of this column is as follows:

Trial	Time	Type	Value	Event ID	Location
1	3842394.0	Loss	10M	32497	North Carolina
1	3842394.0	Loss	20M	32497	South Carolina
1	3842394.0	Loss	5M	55206	North Carolina

During application of financial terms, the above Ledger uses an Occurrence Key of "Event ID", so that these two records together represent one event that causes different losses in two different locations. The ground-up loss for event 32497 is 30M, and for event 55206, it is 5M in the above example. Since these losses are distinguished by the Occurrence Key, they are processed as two distinct events for financial modelling, even though there are three records with identical Time. Without the Occurrence Key, this would be treated as a single event with a 35M ground-up loss for the purposes of financial modelling.

Currency column¶

It is possible to specify the currency for each individual record in the input ledger if necessary. This can be achieved by providing a metadata column representing Currency in the ledger. The Currency column can then be defined by the parameter currency_column in the ledger load template, such as ScaledLoad. It does not need to have a specific name, but it must consist of exactly one column of string type with values that match the currencies specified in the exchange rate table. If a currency column is used, each value in the ledger is converted from the specified currency to the base currency.

Note

it is not allowed to specify both currency and currency_column at the same time.

Data Organization¶

Graphene assumes some ordering of the rows within parquet files for performance. Other internal details of the format are suggested to further enhance performance of the system.

Row Order¶

Records need to be ordered by Trial, Time, Occurrence Key. There are exceptions when using filtering parameters at the input file level, but in general, this is essential to producing correct metrics output in all cases. Ordering all records by Trial is preferred for performance reasons, regardless of special exceptions.

Column Order¶

No specific column ordering is required.

Row Group Size¶

Splitting data across row groups is recommmended such that each row group contains records from exactly one trial. Otherwise, row groups of 1000-10000 records are appropriate.

Compression¶

The parquet format includes built in support for data compression. We support snappy compression which is one of the original Apache Parquet format compression schemes.

File Partitioning¶

Ledgers may be provided as collections of files. Graphene expects Ledger data to be accessible via Amazon S3 storage, and data may be partitioned into multiple files within a single folder.

Some restrictions are:

a folder may contain only one independent set of losses (for all its trials)
trials do not span multiple files

File extensions can follow a prescribed naming convention to enable the system to handle them efficiently without additional metadata being provided. Validation errors will result when trying to run an analysis on a data set that contains ambiguous partitions.

For example:

s3://example_bucket/uploads/ledgers/example_ledger_upload/
  |-- ground_up_loss.parquet.1of4
  |-- ground_up_loss.parquet.2of4
  |-- ground_up_loss.parquet.3of4
  |-- ground_up_loss.parquet.4of4

When providing a reference to the partitioned files as in the above example, only the folder URL is used (it must include the trailing slash). This is in contrast to referencing the full file URL when data is not partitioned. In non-partitioned cases, the XofY extension is omitted. Examples of malformed partitioned data sets include folders containing gaps (missing the .2of4), or inconsistent total counts (.3of5, .4of6).

Partitioned parquet files in the same folder do not require using the prescibed file extension above. The name of each file must be unique within the folder it is stored.

For example:

s3://example_bucket/uploads/ledgers/example_ledger_upload/
  |-- ground_up_loss_1.parquet
  |-- ground_up_loss_2.parquet
  |-- ground_up_loss_3.parquet
  |-- ground_up_loss_4.parquet

When using multiple partitioned parquet files without the prescibed file extension above the system will not have the information to check for any malformed partitioned data containing gaps or missing files.

Advanced Ledger Format¶

Columns¶

Data types¶

Occurrence Key¶

Currency column¶

Data Organization¶

Row Order¶

Column Order¶

Row Group Size¶

Compression¶

File Partitioning¶

Table of Contents

Previous topic

Next topic

This Page