Details
-
New Feature
-
Resolution: Done
-
Major
-
None
-
None
-
All
-
Small
Description
To reach this goal: https://jira.talendforge.org/browse/TDC-3876
1. In TCK/Record, add a new accessor on the schema: getLabel()
2.a AvroSchema is a TCK schema that wraps an Avro schema. It needs to set the label metadata if it exists, so setLabel(String l) must also be added.
https://github.com/Talend/component-runtime/blob/7a4f24e0c876d6b0cd9d2686b6740cb960bc39ce/component-runtime-beam/src/main/java/org/talend/sdk/component/runtime/beam/spi/record/AvroSchema.java#L102
2.b Which Avro property will hold the label (the original column name)?
3. TCK AvroRecord will also need some adjustments to retrieve the Avro property, for example "talend.component.label", that will contain the real name of the column.
The property can be retrieved using: https://github.com/Talend/component-runtime/blob/7a4f24e0c876d6b0cd9d2686b6740cb960bc39ce/component-runtime-beam/src/main/java/org/talend/sdk/component/runtime/beam/spi/record/AvroPropertyMapper.java
4. Fileio first generates an Avro IndexedRecord, and the field names are already computed at that point: https://github.com/Talend/cloud-components/blob/3ac7d1c50bccab3bc38eaefc7c200ba115166c3a/fileio/src/main/java/org/talend/components/fileio/runtime/SimpleRecordFormatCsvIO.java#L557
:warning: We should add metadata to also keep the original name.
5. Should we integrate the Avro naming limitations into TCK? For example, should we enforce that technical field names are Avro compliant?
We currently sanitize names in TCK Record : https://github.com/Talend/component-runtime/blob/7a4f24e0c876d6b0cd9d2686b6740cb960bc39ce/component-runtime-impl/src/main/java/org/talend/sdk/component/runtime/record/SchemaImpl.java#L132
But it is not fully Avro compliant.
Since we will have a "label" metadata, the technical field name could be fully Avro compliant. This means the name given for a field will be stored in "label" and then sanitized to produce the technical name.
/!\ @undx must validate this update of TCK
6. If the technical name already exists, e.g. "foo", an incremental index could be appended: "foo_2".
7. To retrieve a value in TCK, e.g. getString(name), we still use the technical name, and so loop on the schema. The label (original column name) should be used in outputs to get the real field names (for example to update a column name).
8. We have some TCK/Record->Avro/IndexedRecord and Avro/IndexedRecord->TCK/Record converters that should handle this "label". Some checks are needed.
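Points 5 and 6 can be sketched as follows. This is only an illustration under stated assumptions, not the existing SchemaImpl code: the class AvroNameSketch and its methods are hypothetical, invalid characters are replaced by '_' (dropping them would be an equally valid choice), and the collision index starts at 2 to match the "foo_2" example. The Avro rule applied is that a name must start with [A-Za-z_] and continue with [A-Za-z0-9_].

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of TCK: derives an Avro-compliant technical
// name from a label and keeps the label so it can be restored in outputs.
final class AvroNameSketch {

    // technical name -> label (original column name), in registration order
    private final Map<String, String> labelsByName = new LinkedHashMap<>();

    // Replaces every character that Avro does not accept with '_'.
    // Assumption: replacement rather than removal of invalid characters.
    static String sanitize(String label) {
        if (label == null || label.isEmpty()) {
            return "_";
        }
        StringBuilder out = new StringBuilder(label.length());
        for (int i = 0; i < label.length(); i++) {
            char c = label.charAt(i);
            boolean letterOrUnderscore =
                    (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_';
            boolean digit = c >= '0' && c <= '9';
            // Digits are valid everywhere except in first position.
            if (letterOrUnderscore || (digit && i > 0)) {
                out.append(c);
            } else {
                out.append('_');
            }
        }
        return out.toString();
    }

    // Registers a column: the label is kept as-is, the technical name is
    // sanitized and suffixed with an index if it is already taken.
    String register(String label) {
        String base = sanitize(label);
        String name = base;
        int index = 2;
        while (labelsByName.containsKey(name)) {
            name = base + "_" + index++;
        }
        labelsByName.put(name, label);
        return name;
    }

    // Returns the original column name for a technical name, or null.
    String labelOf(String name) {
        return labelsByName.get(name);
    }
}
```

With this sketch, registering "a b" and then "a$b" would give the technical names "a_b" and "a_b_2", while labelOf still returns the two original column names.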
Actions:
A/ Add get/setLabel in TCK API
B/ Improve the sanitize name method (/!\ it has been copied to several places) to be fully Avro compliant
C/ Work on fileio to make it "label" compliant
D/ Fix the TCK/Record<-->Avro/IndexedRecord converters, if not already done for fileIO
- /!\ The full feature (cross team) is due for the end of May, so we should deliver it ASAP.
- /!\ This will impact TCK; TPD must be bumped to the latest version of TCK
- OK: B/ will generate new column names that could break studio jobs (only if they have column names containing '$', which is allowed in Java variables but not in Avro) => :heavy_check_mark: Not really, since studio doesn't support '$' in schema fields! The studio name limitation is the same as Avro's.
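For action A/ and point 7, the intended split between technical name and label could look like the sketch below. LabeledEntry and LabeledRecord are hypothetical stand-ins, not the TCK Schema or Record types, and falling back to the technical name when no label is set is an assumption. Values are looked up by technical name; the label is only used when producing output column names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical schema entry carrying both names.
final class LabeledEntry {
    private final String name; // technical, Avro-compliant name
    private String label;      // original column name, may be unset

    LabeledEntry(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }

    // Assumption: fall back to the technical name when no label was set.
    String getLabel() {
        return label != null ? label : name;
    }

    LabeledEntry setLabel(String l) {
        this.label = l;
        return this;
    }
}

// Hypothetical record: lookups use the technical name only.
final class LabeledRecord {
    private final Map<String, LabeledEntry> entries = new LinkedHashMap<>();
    private final Map<String, String> values = new LinkedHashMap<>();

    LabeledRecord with(LabeledEntry entry, String value) {
        entries.put(entry.getName(), entry);
        values.put(entry.getName(), value);
        return this;
    }

    // Value access by technical name, as in getString(name).
    String getString(String name) {
        return values.get(name);
    }

    // Column names as an output would write them: labels, in schema order.
    List<String> outputHeader() {
        List<String> header = new ArrayList<>();
        for (LabeledEntry e : entries.values()) {
            header.add(e.getLabel());
        }
        return header;
    }
}
```

A column read as "unit price" would then be stored under the technical name "unit_price", fetched with getString("unit_price"), and written back by an output under its label "unit price".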
Issue Links
- is related to: TCOMP-1198 Tacokit beam tests. SchemaParseException => drop unsupported characters (Done)