Advanced Ledger Format¶
Graphene can perform better if the parquet input is organized with some constraints in mind. Some of the basic requirements of the parquet input are also more flexible if certain alternate conditions can be met. This section describes these considerations in detail.
Columns¶
Note
Column names are case-sensitive and required columns must match those specified in the table.
Column Name |
Data Type |
Required |
Nullable? |
Description |
---|---|---|---|---|
Trial |
INT64 |
Yes |
No |
Also known as Year. In the general case of a multi-year analysis Trial is a more appropriate term to avoid confusion. |
Time |
DOUBLE |
Yes |
No |
In models that represent the time dimension of a scenario as day-of-year, this must be converted to an absolute timestamp. |
Type |
STRING |
Yes |
No |
Any choice may be made for Type values that distinguish records from each other, as long as they are consistent across all inputs. Choosing standard values like “Loss” and “Premium” will prevent having to further customize modelling, since built-in Graphene features will be expecting standard values. |
Value |
DOUBLE |
Yes |
No |
The actual monetary value associated with the transaction in the currency defined for the Ledger. If the Ledger currency is set to USD, the value in this field is US dollars. Also, Ledgers use a Profit & Loss perspective of the insured which means losses are represented as negative values and income as positive values. |
Metadata Field 1…n |
INT64 FLOAT32 DOUBLE[note] STRING |
No |
Yes |
Each Ledger can contain any number of metadata fields. For a loss model, this would typically be information about the event that caused the loss (e.g. EventId, Peril, Severity) or information about the exposure affected (e.g. LineOfBusiness, TIV, Location, etc.). |
Note
DOUBLE format Metadata columns will be handled as single precision floating point values (32 bit)
Since we support DOUBLE
data types, and since a number of special values can appear
in these columns, it is important to note that NaN
is converted and treated as a
null
value in columns where it values can be null
. When columns such as Time
or Value do not allow null
, NaN
is not permitted.
Since Graphene is case sensitive to column names (i.e. "Trials"
and "trials"
are different columns),
be sure provide the correct column name in parquet input. AWS Athena treats columns as case insensitive
in parquet, so when using data directly with Athena or when requesting SQL queries via the Graphene
Analysis API (which uses Athena), columns that only differ by case.
Data types¶
When multiple ledger inputs are provided, Graphene expects column data types to be defined identically for columns with the same name. An exception is made for integer columns. 32-bit integer metadata columns will be automatically converted to 64-bit integers when mixed inputs are detected. It is recommended that all input ledgers use the same integer data types for any given metadata column. Sub-optimal performance will result from relying on automatic handling of mixed integer types.
Note
When multiple ledger inputs are provided, Graphene will not attempt to handle a column that has differing types among multiple files. Instead, a user-facing error will be returned.
Occurrence Key¶
Since independent events in models can occur at the same time, an otherwise optional
metadata column is required. This is known as the Occurrence Key
, and it is provided
as a metadata column in the Ledger. It does not need to have a specific name, but it
must consist of exactly one column, of integer
type that uniquely separates individual
events within a Trial.
Note
If every event is guaranteed to have a distinct Time, then providing an Occurrence Key is optional.
An illustration of the handling of this column is as follows:
Trial |
Time |
Type |
Value |
Event ID |
Location |
---|---|---|---|---|---|
1 |
3842394.0 |
Loss |
10M |
32497 |
North Carolina |
1 |
3842394.0 |
Loss |
20M |
32497 |
South Carolina |
1 |
3842394.0 |
Loss |
5M |
55206 |
North Carolina |
During application of financial terms, the above Ledger uses an Occurrence Key of
"Event ID"
, so that these two records together represent one event that causes
different losses in two different locations. The ground-up loss for event
32497 is 30M, and for event 55206, it is 5M in the above example. Since these losses
are distinguished by the Occurrence Key, they are processed as two distinct events
for financial modelling, even though there are three records with identical Time.
Without the Occurrence Key, this would be treated as a single event with
a 35M ground-up loss for the purposes of financial modelling.
Currency column¶
It is possible to specify the currency for each individual record in the input ledger if necessary.
This can be achieved by providing a metadata column representing Currency
in the ledger.
The Currency
column can then be defined by the parameter currency_column
in the
ledger load template, such as ScaledLoad. It does not need to have a specific name,
but it must consist of exactly one column of string
type with values that match
the currencies specified in the exchange rate table. If a currency column is used,
each value in the ledger is converted from the specified currency to the base currency.
Note
it is not allowed to specify both currency
and currency_column
at the same time.
Data Organization¶
Graphene assumes some ordering of the rows within parquet files for performance. Other internal details of the format are suggested to further enhance performance of the system.
Row Order¶
Records need to be ordered by Trial, Time, Occurrence Key. There are exceptions when using filtering parameters at the input file level, but in general, this is essential to producing correct metrics output in all cases. Ordering all records by Trial is preferred for performance reasons, regardless of special exceptions.
Column Order¶
No specific column ordering is required.
Row Group Size¶
Splitting data across row groups is recommmended such that each row group contains records from exactly one trial. Otherwise, row groups of 1000-10000 records are appropriate.
Compression¶
The parquet format includes built in support for data compression. We support snappy compression which is one of the original Apache Parquet format compression schemes.
File Partitioning¶
Ledgers may be provided as collections of files. Graphene expects Ledger data to be accessible via Amazon S3 storage, and data may be partitioned into multiple files within a single folder.
- Some restrictions are:
a folder may contain only one independent set of losses (for all its trials)
trials do not span multiple files
File extensions can follow a prescribed naming convention to enable the system to handle them efficiently without additional metadata being provided. Validation errors will result when trying to run an analysis on a data set that contains ambiguous partitions.
For example:
s3://example_bucket/uploads/ledgers/example_ledger_upload/
|-- ground_up_loss.parquet.1of4
|-- ground_up_loss.parquet.2of4
|-- ground_up_loss.parquet.3of4
|-- ground_up_loss.parquet.4of4
When providing a reference to the partitioned files as in the above example, only the folder URL is used (it must include the trailing slash). This is in contrast to referencing the full file URL when data is not partitioned. In non-partitioned cases, the XofY extension is omitted. Examples of malformed partitioned data sets include folders containing gaps (missing the .2of4), or inconsistent total counts (.3of5, .4of6).
Partitioned parquet files in the same folder do not require using the prescibed file extension above. The name of each file must be unique within the folder it is stored.
For example:
s3://example_bucket/uploads/ledgers/example_ledger_upload/
|-- ground_up_loss_1.parquet
|-- ground_up_loss_2.parquet
|-- ground_up_loss_3.parquet
|-- ground_up_loss_4.parquet
When using multiple partitioned parquet files without the prescibed file extension above the system will not have the information to check for any malformed partitioned data containing gaps or missing files.