Preparing a dataset for upload

Once you choose a natural language source, include all the information you’ll need in your analysis so you can get the best results from Daylight. Invest time now, since Daylight learns directly from the data you upload and data can’t be modified after uploading.  

Daylight’s Create a Project page accepts comma-separated value (CSV) files with appropriately formatted columns. At minimum, every uploaded data file must be a CSV file and have a column titled text.

...

Key vocabulary

  • A verbatim is the conversational text component of the sample you have collected. 

  • A document is a row of your source data, including the conversational text and any associated metadata.

  • Metadata is structured data that creates context for text responses. Metadata may include demographics, dates, scores, or product details. 

  • A CSV file, or comma-separated value file, is a plain-text file format used to organize data. Unlike Excel XLS or XLSX files, CSV files contain no styling information. You can create a CSV file with most spreadsheet editors.

Supported languages and multilingual datasets 

Daylight includes 15 natural language processing pipelines that analyze unstructured text in one language at a time. For best results with a multilingual dataset, split your data into one language per CSV file. Then, select the appropriate language when uploading each file.
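One way to do the split is sketched below. It assumes your source file already has a column that records each row's language; the column name string_Language and the function name split_by_language are hypothetical, chosen only for this illustration. The sketch reads rows by position rather than by header name, so columns that repeat a header are preserved.

```python
import csv
import io
from collections import defaultdict

def split_by_language(csv_file, language_column="string_Language"):
    """Group CSV rows by language.

    Returns {language: CSV text containing the original header
    followed by the rows for that language}.
    """
    reader = csv.reader(csv_file)
    header = next(reader)
    lang_idx = header.index(language_column)

    # Collect rows under the language value found in each row.
    groups = defaultdict(list)
    for row in reader:
        groups[row[lang_idx]].append(row)

    # Serialize each group as its own CSV document.
    outputs = {}
    for language, rows in groups.items():
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows)
        outputs[language] = buffer.getvalue()
    return outputs

source = io.StringIO(
    "text,string_Language\n"
    "Great stay,en\n"
    "Très bien,fr\n"
    "Loved it,en\n"
)
per_language = split_by_language(source)
```

Each value in per_language can then be saved as its own file and uploaded with the matching language selected.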

Metadata

To add metadata, designate specific headers as you create your CSV file. These headers tell Daylight how to treat the contents of a column. Columns without a Daylight-compatible header will be ignored.
If a column header carries more than one data type prefix, Daylight uses the first one.

Each column can have only one data type. Data type prefixes are dropped from field names after upload. For instance, string_Name1, number_Name2, and string_number_Name3 become Name1, Name2, and Name3.
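The naming rule can be illustrated with a small sketch. This is not Daylight's own parser; parse_header and KNOWN_TYPES are hypothetical names, and the sketch assumes that headers are matched case-sensitively and that the only recognized prefixes are the data types described on this page.

```python
# Assumed set of prefixes, taken from the data types on this page.
KNOWN_TYPES = ("string", "number", "score", "date")

def parse_header(header):
    """Return (data_type, field_name), or None for a header that
    has no recognized data type (such a column would be ignored)."""
    header = header.strip()
    if header in ("text", "title"):
        return (header, header)

    parts = header.split("_")
    data_type = None
    # Strip leading type prefixes; only the first one is kept as the type.
    while len(parts) > 1 and parts[0] in KNOWN_TYPES:
        if data_type is None:
            data_type = parts[0]
        parts = parts[1:]

    if data_type is None:
        return None
    return (data_type, "_".join(parts))
```

Under these assumptions, string_number_Name3 is treated as a string field named Name3, and a header such as Comments is ignored.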

...

Data type

...

Examples

...

Text (Required)

...

Column header: text

...

  • The natural language samples (verbatims) for Daylight to analyze

  • Only one text column is permitted per file

  • Each piece of text may not exceed 500,000 characters in length
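Before uploading, you may want to check these constraints locally. The following is a minimal sketch, not part of Daylight; check_text_column is a hypothetical helper. Python's csv module rejects fields longer than 131,072 characters by default, so the sketch raises that limit first in order to measure long verbatims.

```python
import csv
import io

MAX_TEXT_CHARS = 500_000  # per-verbatim limit stated above

# Raise the csv module's default field limit so long fields can be read.
csv.field_size_limit(MAX_TEXT_CHARS * 4)

def check_text_column(csv_file):
    """Return a list of problems; an empty list means the checks passed."""
    reader = csv.reader(csv_file)
    header = next(reader, None)
    if header is None:
        return ["file is empty"]

    count = header.count("text")
    if count != 1:
        return [f"expected exactly one 'text' column, found {count}"]

    idx = header.index("text")
    problems = []
    for doc_number, row in enumerate(reader, start=1):
        value = row[idx] if idx < len(row) else ""
        if len(value) > MAX_TEXT_CHARS:
            problems.append(
                f"document {doc_number}: text has {len(value)} characters"
            )
    return problems

sample = io.StringIO("text,string_MemberLevel\nWe loved it,Business\n")
result = check_text_column(sample)
```

Here result is an empty list because the sample has a single text column and a short verbatim.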

...

Example: text

  • I loved the free coffee and the room was very clean, but it smelled strongly of cigarette smoke.

  • I booked this room last-minute when my travel plans changed. The price was ok considering it was last-minute but it was way out of the way. 

  • We come to this hotel every year, and we appreciate the consistently top-notch experience!

...

Title

...

Column header: title

...

  • Any identifier that is associated with text

  • Only one column permitted per data file

  • Isn’t analyzed as part of the language sample, but can help organize text

...

Example: title

  • Recent stay

  • Hotel visit

  • We’ll definitely be back!

...

String

...

Column header: string_[FieldName] 

...

  • Information that helps categorize text

  • Include as many string columns as needed

  • Daylight can filter only on fields that have up to 10,000 values

  • Helps filter your data by category

...

Example: string_MemberLevel

  • None

  • Business

  • Loyalty

...

Number

...

Column header: number_[FieldName] 

...

  • Any numeric-only data associated with text

  • Include as many number columns as needed

  • Can optionally be used in the Drivers feature

...

Example: number_MemberSince

  • 2015

  • 2018

  • 1998

...

Score

...

Column header: score_[FieldName] 

...

  • Any score or rating data associated with text

  • Include as many score columns as needed

  • Recommended if you plan to use the Drivers feature

...

Example: score_OverallExperience

  • 7

  • 10

...

Date

...

Column header: date_[FieldName] 

...

  • Any date or time associated with text

  • Include as many date columns as needed

  • Daylight assumes that all dates are in UTC unless you supply an ISO 8601 date that specifies a time zone

  • Accepts ISO 8601 strings, Unix timestamps, or US-style formats

  • Helps you filter your project, especially if you upload data more than once

...

Example: date_CheckoutDate

ISO 8601 formatted dates:

  • 2018-04-10

  • 2018-04-10T13:45

  • 2018-04-10T13:45:00Z

US-style dates:

  • 04/10/2018

  • 04/10/2018 13:45:15

  • 4/10/18 1:45 PM
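To preview how values like these might be interpreted, the sketch below normalizes each accepted style to a timezone-aware datetime, defaulting to UTC when no time zone is given. It is an illustration only, not Daylight's parser: parse_date and US_FORMATS are hypothetical names, only the US-style patterns shown above are covered, and Unix timestamps are assumed to be in seconds.

```python
import re
from datetime import datetime, timezone

# US-style patterns matching the examples above (month first).
US_FORMATS = ("%m/%d/%Y", "%m/%d/%Y %H:%M:%S", "%m/%d/%y %I:%M %p")

def parse_date(value):
    """Parse an ISO 8601 string, Unix timestamp, or US-style date.

    Values without an explicit time zone are treated as UTC.
    """
    value = value.strip()

    # Unix timestamp (assumed to be seconds since the epoch).
    if re.fullmatch(r"\d+(\.\d+)?", value):
        return datetime.fromtimestamp(float(value), tz=timezone.utc)

    parsed = None
    try:
        # Older Python versions do not accept a trailing "Z", so rewrite it.
        iso = value[:-1] + "+00:00" if value.endswith("Z") else value
        parsed = datetime.fromisoformat(iso)
    except ValueError:
        for fmt in US_FORMATS:
            try:
                parsed = datetime.strptime(value, fmt)
                break
            except ValueError:
                continue

    if parsed is None:
        raise ValueError(f"unrecognized date: {value!r}")
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed
```

An ISO 8601 value that carries its own offset, such as 2018-04-10T13:45:00+02:00, keeps that offset instead of being reinterpreted as UTC.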

...

String (where more than one value can be selected)

...

Column header: string_[FieldName] (string_[FieldName], …)

...

A string-based metadata field can hold multiple values when more than one response can be selected for a single question. There are two ways to format such data:

  1. Record the multiple values across multiple columns with the same string_[FieldName] column header.

  2. Record the multiple values within a single column with the values separated by the pipe character (“|”).

Both options produce a metadata field named [FieldName], with all of the supplied values applied to the corresponding document.

...

Example: Which products have you tried? [Choose all that apply]

  1. Use string_ProductsUsed as the column header for as many columns as needed

  2. Use one column with the header string_ProductsUsed and format the value in this column as ProductA | ProductB | …
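The two layouts carry the same information, as the following sketch shows. It is only an illustration of the equivalence, not Daylight's implementation; collect_values is a hypothetical helper, and trimming the spaces around the pipe character is an assumption.

```python
def collect_values(header, row, column="string_ProductsUsed"):
    """Gather every value recorded for one multi-value string field.

    Handles both layouts: repeated columns that share the same header,
    and a single column whose cell separates values with "|".
    """
    values = []
    for name, cell in zip(header, row):
        if name != column:
            continue
        for part in cell.split("|"):
            part = part.strip()  # assumption: surrounding spaces are not significant
            if part:
                values.append(part)
    return values

# Option 1: the same header repeated across columns.
repeated = collect_values(
    ["text", "string_ProductsUsed", "string_ProductsUsed"],
    ["Works well", "ProductA", "ProductB"],
)

# Option 2: one column with pipe-separated values.
piped = collect_values(
    ["text", "string_ProductsUsed"],
    ["Works well", "ProductA | ProductB"],
)
```

Both calls yield the same list of values for the ProductsUsed field.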