.. index:: Ledger Format; Advanced

.. _advanced-ledger:

Advanced Ledger Format
======================

Graphene can perform better if the parquet input is organized with some
constraints in mind. Some of the basic requirements of the parquet input are
also more flexible if certain alternate conditions can be met. This section
describes these considerations in detail.

Columns
-------

.. note::

   Column names are case-sensitive and required columns must match those
   specified in the table.

.. list-table:: Additional details on the input columns
   :widths: 10 10 10 10 60
   :header-rows: 1

   * - Column Name
     - Data Type
     - Required
     - Nullable?
     - Description
   * - Trial
     - INT64
     - Yes
     - No
     - Also known as Year. In the general case of a multi-year analysis,
       Trial is a more appropriate term to avoid confusion.
   * - Time
     - DOUBLE
     - Yes
     - No
     - In models that represent the time dimension of a scenario as
       day-of-year, this must be converted to an absolute timestamp.
   * - Type
     - STRING
     - Yes
     - No
     - Any choice may be made for Type values that distinguish records from
       each other, as long as they are consistent across all inputs.
       Choosing standard values like "Loss" and "Premium" will prevent
       having to further customize modelling, since built-in Graphene
       features will be expecting standard values.
   * - Value
     - DOUBLE
     - Yes
     - No
     - The actual monetary value associated with the transaction, in the
       currency defined for the Ledger. If the Ledger currency is set to
       USD, the value in this field is US dollars. Also, Ledgers use a
       Profit & Loss perspective of the insured, which means losses are
       represented as negative values and income as positive values.
   * - Metadata Field 1...n
     - INT64, FLOAT32, DOUBLE (see note below), STRING
     - No
     - Yes
     - Each Ledger can contain any number of metadata fields. For a loss
       model, this would typically be information about the event that
       caused the loss (e.g. EventId, Peril, Severity) or information about
       the exposure affected (e.g. LineOfBusiness, TIV, Location, etc.).
.. note::

   **DOUBLE format**: Metadata columns will be handled as single precision
   floating point values (32 bit).

Since we support ``DOUBLE`` data types, and since a number of special values
can appear in these columns, it is important to note that ``NaN`` is
converted to and treated as a ``null`` value in columns whose values can be
``null``. When columns such as Time or Value do not allow ``null``, ``NaN``
is not permitted.

Since Graphene is case-sensitive to column names (i.e. ``"Trials"`` and
``"trials"`` are different columns), be sure to provide the correct column
names in the parquet input. AWS Athena treats parquet columns as
case-insensitive, so when using data directly with Athena or when requesting
SQL queries via the Graphene Analysis API (which uses Athena), columns whose
names differ only by case cannot be distinguished.

Data types
----------

When multiple ledger inputs are provided, Graphene expects column data types
to be defined identically for columns with the same name. An exception is
made for integer columns: 32-bit integer metadata columns will be
automatically converted to 64-bit integers when mixed inputs are detected.
It is recommended that all input ledgers use the same integer data types for
any given metadata column, since relying on automatic handling of mixed
integer types results in sub-optimal performance.

.. note::

   When multiple ledger inputs are provided, Graphene will not attempt to
   handle a column that has differing types among multiple files (beyond the
   integer conversion described above). Instead, a user-facing error will be
   returned.

.. _occurrence-key:

Occurrence Key
~~~~~~~~~~~~~~

Since independent events in models can occur at the same time, an otherwise
optional metadata column is required. This is known as the ``Occurrence
Key``, and it is provided as a metadata column in the Ledger. It does not
need to have a specific name, but it must consist of exactly one column of
``integer`` type that uniquely separates individual events within a Trial.
.. note::

   If every event is guaranteed to have a distinct Time, then providing an
   Occurrence Key is optional.

An illustration of the handling of this column is as follows:

+-------+-----------+------+-------+----------+----------------+
| Trial | Time      | Type | Value | Event ID | Location       |
+=======+===========+======+=======+==========+================+
| 1     | 3842394.0 | Loss | 10M   | 32497    | North Carolina |
+-------+-----------+------+-------+----------+----------------+
| 1     | 3842394.0 | Loss | 20M   | 32497    | South Carolina |
+-------+-----------+------+-------+----------+----------------+
| 1     | 3842394.0 | Loss | 5M    | 55206    | North Carolina |
+-------+-----------+------+-------+----------+----------------+

During application of financial terms, the above Ledger uses an Occurrence
Key of ``"Event ID"``, so that the first two records together represent one
event that causes different losses in two different locations. The ground-up
loss for event 32497 is 30M, and for event 55206 it is 5M in the above
example. Since these losses are distinguished by the Occurrence Key, they are
processed as two distinct events for financial modelling, even though there
are three records with identical Time. Without the Occurrence Key, this would
be treated as a single event with a 35M ground-up loss for the purposes of
financial modelling.

.. _currency-column:

Currency column
~~~~~~~~~~~~~~~

It is possible to specify the currency for each individual record in the
input ledger if necessary. This can be achieved by providing a metadata
column representing ``Currency`` in the ledger. The ``Currency`` column can
then be defined by the parameter ``currency_column`` in the ledger load
template, such as ScaledLoad. It does not need to have a specific name, but
it must consist of exactly one column of ``string`` type with values that
match the currencies specified in the exchange rate table.
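As an illustrative sketch of what per-record conversion via a currency column means (this is not Graphene's implementation, and the rates and names below are invented for illustration):

```python
# Hypothetical sketch of per-record currency conversion driven by a Currency
# metadata column. Graphene performs conversion internally when
# `currency_column` is configured; the rate table here is invented.
rates_to_usd = {"USD": 1.0, "EUR": 1.25, "GBP": 1.5}  # assumed rate table

def convert_value(value: float, currency: str, base: str = "USD") -> float:
    """Convert a ledger Value from its record's currency to the base currency."""
    usd = value * rates_to_usd[currency]
    return usd / rates_to_usd[base]

records = [
    {"Value": -1_000_000.0, "Currency": "EUR"},  # a loss booked in EUR
    {"Value": 5_000.0, "Currency": "USD"},       # income already in base
]
converted = [convert_value(r["Value"], r["Currency"]) for r in records]
```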
If a currency column is used, each value in the ledger is converted from the
specified currency to the base currency.

.. note::

   It is not allowed to specify both ``currency`` and ``currency_column`` at
   the same time.

.. index:: Parquet; Recommendations, Input; Data Organization

Data Organization
-----------------

Graphene assumes some ordering of the rows within parquet files for
performance. Other internal details of the format are suggested to further
enhance performance of the system.

.. index:: Input; Row Order, Parquet; Order

Row Order
~~~~~~~~~

Records need to be ordered by Trial, Time, Occurrence Key. There are
exceptions when using filtering parameters at the input file level, but in
general this ordering is essential to producing correct metrics output.
Ordering all records by Trial is preferred for performance reasons,
regardless of special exceptions.

Column Order
~~~~~~~~~~~~

No specific column ordering is required.

Row Group Size
~~~~~~~~~~~~~~

Splitting data across row groups is recommended such that each row group
contains records from exactly one trial. Otherwise, row groups of 1000-10000
records are appropriate.

Compression
~~~~~~~~~~~

The parquet format includes built-in support for data compression. We
support snappy compression, which is one of the original Apache Parquet
format compression schemes.

.. index:: Partitioning, Input; Partitioning, Parquet; Partitioning

File Partitioning
~~~~~~~~~~~~~~~~~

Ledgers may be provided as collections of files. Graphene expects Ledger
data to be accessible via Amazon S3 storage, and data may be partitioned
into multiple files within a single folder. Some restrictions are:

- a folder may contain only one independent set of losses (for all its
  trials)
- trials do not span multiple files

File extensions can follow a prescribed naming convention to enable the
system to handle them efficiently without additional metadata being
provided.
Validation errors will result when trying to run an analysis on a data set
that contains ambiguous partitions. For example::

    s3://example_bucket/uploads/ledgers/example_ledger_upload/
    |-- ground_up_loss.parquet.1of4
    |-- ground_up_loss.parquet.2of4
    |-- ground_up_loss.parquet.3of4
    |-- ground_up_loss.parquet.4of4

When providing a reference to the partitioned files as in the above example,
only the folder URL is used (it must include the trailing slash). This is in
contrast to referencing the full file URL when data is not partitioned. In
non-partitioned cases, the XofY extension is omitted. Examples of malformed
partitioned data sets include folders containing gaps (missing the .2of4) or
inconsistent total counts (.3of5, .4of6).

Partitioned parquet files in the same folder are not required to use the
prescribed file extension above. The name of each file must be unique within
the folder in which it is stored. For example::

    s3://example_bucket/uploads/ledgers/example_ledger_upload/
    |-- ground_up_loss_1.parquet
    |-- ground_up_loss_2.parquet
    |-- ground_up_loss_3.parquet
    |-- ground_up_loss_4.parquet

When using multiple partitioned parquet files without the prescribed file
extension, the system will not have the information needed to check for
malformed partitioned data containing gaps or missing files.
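The consistency rules implied by the XofY convention can be sketched as follows. This is an illustration of the checks described above, not Graphene's actual validation code:

```python
import re

# Illustrative check of the XofY partition naming convention: every file in
# a folder must agree on the total Y, and parts 1..Y must all be present.
# This mirrors the validation errors described above; it is not Graphene code.
PART_RE = re.compile(r"\.(\d+)of(\d+)$")

def validate_partitions(filenames: list[str]) -> None:
    parts, totals = set(), set()
    for name in filenames:
        m = PART_RE.search(name)
        if not m:
            raise ValueError(f"{name} does not follow the XofY convention")
        parts.add(int(m.group(1)))
        totals.add(int(m.group(2)))
    if len(totals) != 1:
        # e.g. mixing .3of5 and .4of6 in one folder
        raise ValueError(f"inconsistent totals: {sorted(totals)}")
    missing = set(range(1, totals.pop() + 1)) - parts
    if missing:
        # e.g. a folder with .1of4, .3of4, .4of4 but no .2of4
        raise ValueError(f"missing parts: {sorted(missing)}")

# A complete 1of4..4of4 set passes without raising:
validate_partitions([f"ground_up_loss.parquet.{i}of4" for i in range(1, 5)])
```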