Parsing EDI to XML (and vice versa)
I've been meaning to provide a brief overview of the EDI format from the perspective of parsing and interpretation. I have an affinity for data formats and parsers. My first encounter with EDI was when I worked for an airline services provider. I was tasked with engineering the EDI communication for the company's new line of business applications. Information on the subject was scarce and scattered across multiple resources on the web. I had to traverse numerous websites in order to assemble a finite set of rules, sufficient to be unambiguously understood by a normal human being and eventually automated. It was like solving a giant puzzle.
Most of the articles related to EDI revolved around business controversies and comparisons between the different formats and dialects, all completely irrelevant to my research. I still don't understand why so many EDI formats (> 5000) still co-exist today. It appeared to me that EDI was veiled in mystery, and that the lack of information and cooperation could hardly be put down to chance...
I will leap over the entertaining side of EDI, like the conspiracy behind the multiple formats, the rebellious movement against VANs, and the ever ongoing discussion on whether XML will eventually bury EDI (with UBL being the latest contender). My goal here is to share my knowledge on the basics of parsing an EDI message, and hope that someone else may find that useful.
I'll refrain from the trivial and won't go into details on the format itself. Anyone interested, please have a look at the resources below, which proved useful to me (in no particular order):
Let's just point out a few interesting excerpts:
- A single group can only contain messages of the same type and version. EDIFACT has a very limited use of groups and an interchange usually contains messages of the same type, therefore the group segments in EDIFACT are optional.
- A Group is identified by its first segment, which is mandatory and always has a repetition of one.
- Although the EDI format may seem loose enough to allow ambiguous structures, e.g. the same segment repeated on the same level, it's up to the owner of the definitions to ensure the structure is valid. However, when writing a parser, make sure you don't end up in an endless loop because of a faulty structure.
- Group is an EDIFACT term. In X12 they are called Loops and their visual representation is slightly different.
- Loops can be bounded or unbounded. A bounded loop has an explicit start and end segment, whereas an unbounded loop repeats according to a count, with its first segment being unique. The corresponding EDIFACT terms are explicit loop and Group, respectively. Bounded loops are very rarely used in EDIFACT, and in X12 only a few transactions support them.
Right, assuming that at this stage we are well familiar with the format, let's get down to something real. Let's parse an EDI message!
Parsing the message
For this purpose I'll need two samples - one for EDIFACT and one for X12:
UNA:+.? '
UNB+UNOB:1+102096559TEST:16:ZZUK+PARTNERID:01:ZZUK+071101:1701+131++INVOIC++1++1'
UNH+509010117+INVOIC:D:00A:UN'
BGM+380::*380::+12345678:9+AP'
DTM+137:19980610:102'
TAX+7+VAT:::+++:::16.00'
MOA+124:16.00'
MOA+125:100.00'
UNT+40+0001'
UNZ+1+1'
ISA*00*          *00*          *16*102096559TEST  *14*PARTNERTEST    *071214*1406*U*00204*810000263*1*T*>~
GS*IN*102096559TEST*PARTNER*20071214*1406*810000263*X*004010~
ST*810*166061414~
BIG**0013833070**V8748745***DI~
NTE*GEN~
CTT*1~
SE*44*166061414~
GE*1*810000263~
IEA*1*810000263~
Details on the communication channels widely used to exchange EDI messages, with AS2 and FTP being the most popular, are also out of the scope of this article. The story starts after a message has been physically received.
The steps I took to parse the message are:
1. Identify the message format - is it EDIFACT, X12 or something else? An EDIFACT interchange always starts with a UNA or UNB segment. An X12 interchange always starts with an ISA segment. Make sure you get rid of the BOM before you proceed, as we'll be counting characters here and any stray character will throw the positions off. No BOM, no leading blanks, no extra spaces.
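A minimal sketch of this first step, written in Python rather than the C# used elsewhere in this article. The function names `clean` and `detect_format` are made up for illustration and are not part of ediFabric; the only rules encoded are the ones stated above (strip the BOM and leading blanks, then look at the first segment tag).

```python
BOM = "\ufeff"


def clean(raw: str) -> str:
    """Remove a leading byte order mark and any leading blanks or line breaks."""
    return raw.lstrip(BOM + " \t\r\n")


def detect_format(raw: str) -> str:
    """Classify a message by its first segment tag."""
    text = clean(raw)
    if text.startswith(("UNA", "UNB")):
        return "EDIFACT"
    if text.startswith("ISA"):
        return "X12"
    # Any other format is out of scope for this sketch.
    return "UNKNOWN"
```

For example, `detect_format("\ufeffUNA:+.? 'UNB+UNOB:1'")` classifies the message as EDIFACT even though it was read with a BOM in front.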
2. Identify the separators - once the format is identified the relevant message properties must be extracted. These properties are the separators:
- data element separator
- component data element separator
- repetition separator
- segment terminator
- release indicator
X12 properties are extracted as follows:
// contents holds the inbound EDI message as a string
result.DataElementSeparator = contents[3].ToString();
string isa = string.Concat(contents.Take(106));
string[] isaElements = isa.Split(result.DataElementSeparator[0]);
// element 16 carries the component separator followed by the segment terminator
result.ComponentDataElementSeparator = string.Concat(isaElements[16].First());
result.RepetitionSeparator = isaElements[11] != "U" ? isaElements[11] : "^";
result.SegmentTerminator = string.Concat(isaElements[16].Skip(1).First());
if (result.SegmentTerminator == " " || string.IsNullOrEmpty(result.SegmentTerminator) || result.SegmentTerminator == "G")
    result.SegmentTerminator = Environment.NewLine;
which tells us that:
- data element separator is the 4th character in the message, *
- component data element separator is the first character of the 16th data element in the ISA segment (zero indexed, so the ISA tag itself is element 0), >
- repetition separator is the 11th data element in the ISA segment (zero indexed) if it's not U. If it's U, then the repetition separator is the default ^.
- segment terminator is the second character of the 16th data element in the ISA segment (zero indexed), ~
If the segment terminator is not present, then the segment terminator is a new line.
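The rules above can be tried out with a small Python sketch. It assumes the ISA segment is exactly 106 characters long with fixed-width fields, which is what the snippet above relies on. `X12Separators`, `x12_separators` and `sample_isa` are hypothetical names, and a plain "\n" stands in for the platform new line.

```python
from dataclasses import dataclass


@dataclass
class X12Separators:
    data_element: str
    component: str
    repetition: str
    segment: str


def x12_separators(contents: str) -> X12Separators:
    data_element = contents[3]                     # 4th character of the message
    elements = contents[:106].split(data_element)  # fixed-length ISA segment
    last = elements[16]                            # component separator + terminator
    repetition = elements[11] if elements[11] != "U" else "^"
    segment = last[1:2]                            # empty if nothing follows
    if segment in ("", " ", "G"):
        segment = "\n"                             # assumption: "\n" as the new line
    return X12Separators(data_element, last[0], repetition, segment)


def sample_isa() -> str:
    """Rebuild the ISA segment of the sample with its fixed-width padding."""
    fields = [
        "ISA", "00", " " * 10, "00", " " * 10,
        "16", "102096559TEST".ljust(15),
        "14", "PARTNERTEST".ljust(15),
        "071214", "1406", "U", "00204", "810000263", "1", "T", ">",
    ]
    return "*".join(fields) + "~"
```

Calling `x12_separators(sample_isa() + "GS*IN~")` yields `*`, `>`, `^` and `~` for the four properties. The padding matters: if the blanks inside the ISA segment are lost, position-based rules like these stop working.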
EDIFACT properties are extracted as follows:
If no UNA segment is present, then the default values are used:
result.ComponentDataElementSeparator = ":";
result.DataElementSeparator = "+";
result.ReleaseIndicator = "?";
result.RepetitionSeparator = "*";
result.SegmentTerminator = "'";
If a UNA segment exists, the properties are extracted as:
// una holds the six service characters that follow the UNA tag, e.g. :+.? '
result.ComponentDataElementSeparator = una[0].ToString();
result.DataElementSeparator = una[1].ToString();
result.ReleaseIndicator = una[3].ToString();
result.RepetitionSeparator = "*";
result.SegmentTerminator = una[5].ToString();
which tells us that (positions are zero indexed and counted from the first character after the UNA tag):
- component data element separator is character 0 of the UNA segment, :
- data element separator is character 1 of the UNA segment, +
- repetition separator is *
- segment terminator is character 5 of the UNA segment, '
- release indicator is character 3 of the UNA segment, ?
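The same extraction as a Python sketch. It assumes that the positions are counted from the first character after the UNA tag, which is the only reading that matches the sample `UNA:+.? '`. `edifact_separators` and `DEFAULTS` are illustrative names only.

```python
DEFAULTS = {
    "component": ":",
    "data_element": "+",
    "release": "?",
    "repetition": "*",
    "segment": "'",
}


def edifact_separators(contents: str) -> dict:
    """Return the separators of an EDIFACT interchange."""
    if not contents.startswith("UNA"):
        return dict(DEFAULTS)          # no UNA segment: fall back to the defaults
    una = contents[3:9]                # six service characters after the tag
    if len(una) < 6:
        raise ValueError("truncated UNA segment")
    return {
        "component": una[0],
        "data_element": una[1],
        "release": una[3],
        "repetition": "*",             # not read from UNA in this article
        "segment": una[5],
    }
```

A message that starts with `UNB` gets the defaults, while one that starts with a non-standard UNA such as `UNA|^.\ ~` gets `|`, `^`, `\` and `~`.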
3. Iterate through the interchange groups - once the separators are known we can proceed with the interchange. The parser should traverse the interchange structure and iterate through the interchange groups/loops.
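One possible way to do this, sketched in Python under two assumptions: the segment terminator is a single character, and a release indicator (EDIFACT only) escapes whatever character follows it. `split_segments` and `iter_groups` are hypothetical helpers, not ediFabric functions.

```python
def split_segments(text, terminator, release=None):
    """Split a message into segments, honouring the release indicator."""
    segments, current, i = [], [], 0
    while i < len(text):
        ch = text[i]
        if release is not None and ch == release and i + 1 < len(text):
            current.append(text[i:i + 2])   # keep the escaped pair untouched
            i += 2
            continue
        if ch == terminator:
            segments.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    if current:
        segments.append("".join(current))
    cleaned = (s.strip("\r\n") for s in segments)  # drop line breaks between segments
    return [s for s in cleaned if s]


def iter_groups(segments, data_sep="*", start="GS", end="GE"):
    """Yield each run of segments from a group start tag to its end tag."""
    group = None
    for segment in segments:
        tag = segment.split(data_sep, 1)[0]
        if tag == start:
            group = [segment]
        elif group is not None:
            group.append(segment)
            if tag == end:
                yield group
                group = None
```

For X12 the defaults pick out every GS...GE functional group. For EDIFACT the same helper could be called with `start="UNG", end="UNE"`, keeping in mind that those segments are optional there.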
4. Identify the message type and version - for every group the parser needs to identify the message type and version of the transaction. X12 contains the version information in the group start segment (GS). The parser will loop through all transactions and start parsing them one by one.
X12 message type and version are extracted as follows:
- version is the 8th data element in the GS segment, first 6 characters (004010)
- message type is the first data element in the ST segment (810)
It should be noted that X12 comes with two different versions - one for the message and one for the ISA segment. The latter is contained in the ISA segment itself and is used to parse the interchange header. This means that an X12 interchange can have its header and its transaction messages in two different versions. In our example the ISA version is 00204.
EDIFACT message type and version are extracted as follows:
- version is the second data element in the UNH segment, second and third component data elements (D00A)
- message type is the second data element in the UNH segment, first component data element (INVOIC)
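A short Python sketch of both lookups, applied to the GS, ST and UNH segments of the two samples. It assumes the segments have already been split and that the separators are the ones used in the samples; the function names are illustrative.

```python
def x12_type_and_version(segments, data_sep="*"):
    """Return (message type, version) from the ST and GS segments."""
    gs = next(s for s in segments if s.startswith("GS" + data_sep))
    st = next(s for s in segments if s.startswith("ST" + data_sep))
    version = gs.split(data_sep)[8][:6]   # 8th data element, first 6 characters
    message_type = st.split(data_sep)[1]  # 1st data element
    return message_type, version


def edifact_type_and_version(segments, data_sep="+", component_sep=":"):
    """Return (message type, version) from the UNH segment."""
    unh = next(s for s in segments if s.startswith("UNH" + data_sep))
    components = unh.split(data_sep)[2].split(component_sep)  # 2nd data element
    return components[0], components[1] + components[2]


X12_SEGMENTS = [
    "GS*IN*102096559TEST*PARTNER*20071214*1406*810000263*X*004010",
    "ST*810*166061414",
]
EDIFACT_SEGMENTS = ["UNH+509010117+INVOIC:D:00A:UN"]
```

With these inputs the X12 lookup gives message type 810 in version 004010, and the EDIFACT lookup gives INVOIC in version D00A.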
5. Parse the transaction according to a formal grammar - once we know exactly what message we've got, the real parsing begins. I'll assume that most of you are familiar with the terminology and techniques of parsing (otherwise why would you still be reading). In order to parse an EDI message we need a formal grammar, which is the actual definition of the EDI rules and in our case is in the form of an XML schema or a .NET class.
ediFabric has a predefined set of definitions, which can be extended or amended to suit every dialect or requirement. I also added an additional property, called Origin, which together with message type and version forms a unique key identifying the definition. This allows you to cater for multiple customer versions of the same message at the same time. It's an old pattern to combine the two halves in a single key - one is part of the external content and the other is under our control.
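To illustrate the idea of the composite key only (this is not how ediFabric stores its definitions), here is a small Python sketch. `DefinitionKey`, `DEFINITIONS` and `find_definition` are invented names, and the grammar values are placeholders.

```python
from typing import NamedTuple


class DefinitionKey(NamedTuple):
    message_type: str   # taken from the message itself
    version: str        # taken from the message itself
    origin: str         # chosen by us, e.g. per trading partner


# Placeholder grammars; a real entry would be a schema or a class.
DEFINITIONS = {
    DefinitionKey("INVOIC", "D00A", "STANDARD"): "standard INVOIC D00A grammar",
    DefinitionKey("INVOIC", "D00A", "PARTNERID"): "partner-specific INVOIC D00A grammar",
    DefinitionKey("810", "004010", "STANDARD"): "standard 810 004010 grammar",
}


def find_definition(message_type: str, version: str, origin: str) -> str:
    """Look up the grammar for one message type, version and origin."""
    key = DefinitionKey(message_type, version, origin)
    if key not in DEFINITIONS:
        raise KeyError(f"no definition for {key}")
    return DEFINITIONS[key]
```

Two customers can then send the same INVOIC D00A message and still be parsed with different grammars, simply because their origins differ.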
Anyway, regardless of how the grammar is retrieved, it is used to uniquely parse the EDI message according to the rules of that same grammar. It defines the form we would like to see at the end of the parsing process.
What is the main challenge in parsing an EDI message? Undoubtedly it's the conversion from a linear structure of segments to a hierarchical one. An EDI message contains a simple flat list of segments. It's the parser's function to transform that flat structure into a hierarchical tree, where every node is either a parent or a child.
The goal is to build a structure, every element of which is connected to one or more other elements and is aware of three things - what its parent is, what its children are and what its position is among its siblings.
The parser will process the message according to the grammar and will produce a parse tree, which is a well ordered hierarchical set. Our EDI message has been converted into an object and we know how to manipulate that object. Here comes XML.
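A compact Python sketch of such a conversion, using recursive descent over the flat segment list. The grammar below is a toy one written for the EDIFACT sample, not the real INVOIC definition, and `Rule`, `Node` and `parse` are hypothetical names. A group is recognised by its first segment, and every repetition is capped by `max_repeat`, so a faulty structure cannot send the parser into an endless loop.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Rule:
    name: str                                             # segment tag or group name
    children: List["Rule"] = field(default_factory=list)  # empty for a segment
    max_repeat: int = 1


@dataclass
class Node:
    name: str
    value: str = ""
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)

    def add(self, child: "Node") -> None:
        child.parent = self
        self.children.append(child)   # list position is the order among siblings


def parse(rule, segments, pos, data_sep="+") -> Tuple[Optional[Node], int]:
    """Match rule at segments[pos]; return (node or None, next position)."""
    if not rule.children:             # a plain segment
        if pos < len(segments) and segments[pos].split(data_sep, 1)[0] == rule.name:
            return Node(rule.name, segments[pos]), pos + 1
        return None, pos
    # A group exists only if its mandatory first segment is present.
    first, next_pos = parse(rule.children[0], segments, pos, data_sep)
    if first is None:
        return None, pos
    node = Node(rule.name)
    node.add(first)
    pos = next_pos
    for child in rule.children[1:]:
        for _ in range(child.max_repeat):
            parsed, next_pos = parse(child, segments, pos, data_sep)
            if parsed is None:
                break
            node.add(parsed)
            pos = next_pos
    return node, pos


GRAMMAR = Rule("INVOIC", [
    Rule("UNH"), Rule("BGM"), Rule("DTM", max_repeat=35),
    Rule("TAX_GROUP", [Rule("TAX"), Rule("MOA", max_repeat=2)], max_repeat=5),
    Rule("UNT"),
])
SEGMENTS = [
    "UNH+509010117+INVOIC:D:00A:UN", "BGM+380::*380::+12345678:9+AP",
    "DTM+137:19980610:102", "TAX+7+VAT:::+++:::16.00",
    "MOA+124:16.00", "MOA+125:100.00", "UNT+40+0001",
]
```

Running `parse(GRAMMAR, SEGMENTS, 0)` consumes all seven segments and nests TAX and both MOA segments under a single TAX_GROUP node, while UNH, BGM, DTM and UNT stay directly under the message root.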
The resulting object model
The resulting object model conforms to ISO/TS 20625. It can generate XML or be instantiated from XML, which adds the necessary cross-platform flavor and allows for easy transformation (XSLT, XQuery, etc.).
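The round trip between an object and XML can be sketched with Python's standard `xml.etree.ElementTree` module. The model here is deliberately tiny: a node is a `(name, content)` pair whose content is either the raw segment text or a list of child nodes. The element names are made up for the example and do not follow the ISO/TS 20625 naming rules.

```python
import xml.etree.ElementTree as ET


def to_xml(name, content):
    """Turn a (name, content) node into an XML element."""
    element = ET.Element(name)
    if isinstance(content, str):
        element.text = content
    else:
        for child_name, child_content in content:
            element.append(to_xml(child_name, child_content))
    return element


def from_xml(element):
    """Rebuild the (name, content) node from an XML element."""
    children = list(element)
    if not children:
        return element.tag, element.text or ""
    return element.tag, [from_xml(child) for child in children]


MESSAGE = ("INVOIC", [
    ("UNH", "UNH+509010117+INVOIC:D:00A:UN"),
    ("TAX_GROUP", [
        ("TAX", "TAX+7+VAT:::+++:::16.00"),
        ("MOA", "MOA+124:16.00"),
    ]),
    ("UNT", "UNT+40+0001"),
])

xml_text = ET.tostring(to_xml(*MESSAGE), encoding="unicode")
restored = from_xml(ET.fromstring(xml_text))
```

Here `restored` equals `MESSAGE`, so nothing is lost on the way to XML and back. One limitation of this simple model is that a group with no children cannot be told apart from an empty segment.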
To close this already stretched narrative, I'd like to express my opinion on the use of parsing EDI to XML. What would be the use of it?
I'm far from the heretical thought that a product like ediFabric can compete with a full-blown, commercial and costly EDI parser. But at the same time I couldn't find an open source or low budget solution to offer me the flexibility I needed. It is the alternative I was after.
My interest was not only to design an EDI to XML parser. As a software professional I wanted it to be robust, extendable and to require very little maintenance. It was designed to cater for multiple custom formats and to easily change existing or add new definitions.
I don't look at XML as an alternative to EDI. I believe the two are complementary - EDI is popular in its own business domain, lightweight in size, and comes with established communication channels. XML is standard and natively supported by almost every programming language. I felt there was a gap and I had to connect the dots. It was an isomorphism, which was supposed to make EDI more application-friendly.
It's already too late. Another time, another place. That's it for me.