Need to build an interface for XML to SQL Datadata

This topic has 37 replies, 4 voices, and was last updated 5 years, 8 months ago by Jeff Reed.

Viewing 15 posts - 1 through 15 (of 38 total)

1 2 3 →

Author

Posts
June 25, 2018 at 12:16 pm #7241

Jeff Reed
Spectator

We are kicking off a project to design/build/test an XML conversion interface for the NAACCR ‘I’ files that we are currently converting from fixed length format.

This will be a solution delivered with a combination of C# class libraries and Oracle/SQL Server stored procedures and potentially third party tools like XML Spy. Currently we store the NAACCR Layout(s) to include fixed length field positions and fields types in the database which we use to dynamically build insert statements. At this point I am not sure if we will need to build a XML to Flat file converter or directly write the XML records into the database.

Looking for recommendations on how to proceed with designing an XML interface. Understanding the complexity of the XML hierarchy as well as pointing us to any existing code/applications/test files would be a great help.

June 25, 2018 at 1:42 pm #7247

Rich Pinder
Moderator

Thanks… Jeff this a long needed item, and one the community will be grateful for you to work on !! Rich Pinder here, programmer in earlier life and a long time NAACCR member.

No matter which tools/platforms/approach you take, your work will help for sure! And as a resource for the Vendor community, I’m sure many will be eager to learn from your success.

An open/generic approach which might work on a number of platforms WOULD be optimum. SQLServer and MySQL come to mind as ones more commonly seen these days by folks who might be interested in loading into SQL. Oracle was our platform of choice long ago, before it became too pricey for us.

One thought be optimize a C# tool to unpack the XML (using XMLSpy to get the XML structure?), via a Class Library which might be up’gradable as the next Version of our XML def comes into play – but then perhaps look to an open source DB tool, maybe using the slow (but workable) ODBC generic driver approach, to spit out the insert code – that way it may be usable by more systems. Just a thought.

While back I believe Isaac/Kentucky built something to load directly into relation database (MySQL perhaps?)… perhaps reviewing their approach would be good.

Thats it for a quick comment – would love to chat (or e’chat) more on the thread as it unfolds. I’m always reachable @ 323) 442-1640

Rich Pinder
USC School of Medicine

June 25, 2018 at 7:30 pm #7271

Kathleen Beaumont
Spectator

Hi Jeff,

NPCR produced XML Exchange Plus as a working example of an implementation of the NAACCR XML v1.3 specification. Whether you choose to use the XMLPlus.dll or simply use its API as a launch-point for your own implementation, you might find the information in the XMLExchangePlus help file a good place to start.

The installer (includes XMLPlus.dll, a Windows application demonstrating the library’s features, and a comprehensive help file) can be found at the NPCR web site:
https://www.cdc.gov/cancer/npcr/tools/registryplus/xml-exchange-plus.htm

Kathleen Beaumont
retired programmer for EDITS50 and XML Exchange Plus

June 26, 2018 at 10:33 am #7273

Fabian Depry
Moderator

Hi Jeff,

Another resource you might find useful is the Java implementation of the NAACCR XML specifications: https://github.com/imsweb/naaccr-xml

You probably won’t be able to use any of the code (Java doesn’t play well with C# and vice versa) but you can certainly see many examples and get some ideas.

The project home page contains links to the specifications and to a wiki page. I strongly encourage you to take a look at the wiki, it has all kind of information about NAACCR XML, including a (Java-based) standalone tool allowing you to go from XML to flat and flat to XML, and also a bunch of valid and invalid XML files that can be used to test your implementation.

I hope this helps!

June 27, 2018 at 8:57 am #7281

Jeff Reed
Spectator

Many thanks to all for the great resource links provided. I have always been a big fan of “borrowing” code where available and will be sure to develop as generically as possible to help share the wealth.

My next steps will be to review open source tools as Rich suggested, review XMLExchangePlus, thank you Kathleen, and dig into the Java logic, thank you Fabian. Should be able to get some good design and architecture ideas from those sources.

At this point I am thinking of developing C# class libraries for generic table definitions and integrating the code for converting XML to flat file’s for “bulk loading”. I am thinking the flat files could be “|” delimited or fixed length format with wrappers to load the flat files into the respective databases. This not not rule out using the tools suggested above.

The second step of post load ETL processing would be more DB specific, Oracle in my case, as the data is validated and normalized for relational DB storage. We should be able to stick to ANSI standard SQL to help with different DB platforms.

Many thanks to all, will keep you posted, I feel the excitement building!

Jeff

June 27, 2018 at 12:19 pm #7282

Kathleen Beaumont
Spectator

converting XML to flat file’s for “bulk loading”

This is just a thought, but what if you instead created “workspace” tables in your database that followed the structure of the XML file, i.e., a table for Patient data and another for Tumor data. Each would have primary key columns, and the Tumor tuples would have foreign key relationships to the Patient tuple.

This would make it easier to perform case-specific and batch-update “fixes” to data on the way into the production tables. (This may not be something you do at your registry, but over the years I’ve heard lots of programmers say they need to fix misspellings and other such tweaks.) It would also be easy to write a “reviewer” app to look up cases, filter an incoming data file on any criteria, generate reports, whatever your analysts are clamoring for.

Then you could run a handful of SQL insert/update queries to pull the data from the workspace tables into the production tables, and zap the workspace tables for next time.

I just sort of hate to see you convert XML to flat ASCII files.

I’ll butt out now.

Kathleen

June 27, 2018 at 12:36 pm #7283

Jeff Reed
Spectator

Good Feedback thank you Kathleen. Part of my initial design considerations were to minimize re-coding of existing systems that dynamically loaded files based on NAACCR version from flat files. but it would make sense to load the XML directly and transform from there. Last time I had to load XML data we had to revert back to flat or delimited files due to performance issues that may not be as big of a problem with current SW/HW. And no butting out your in too deep now …

June 27, 2018 at 12:38 pm #7284

Rich Pinder
Moderator

…butt out… oh no! I’d suggest a ‘retired programmer’ is just what’s needed here !!! (how do I insert a grinning emoje here !)

This approach that you’re all talking about (i think) parallels what some of us talked about long back in the WG history, where it became apparent that the most sexy way to handle schmoozin the XML structured data into relational model would be via an intermediate step where you essentially build relations to emulate whatever structure our NAACCR XML flavor goes to (we have record/pt/tumor now…but down the line… LOTS more comin, me thinks). I think this philosophy of processing our stuff is along same lines you did Fabian for getting XML into SAS (datasets).

Using the native SQL flavor that a vendor has installed makes a lot of sense. Extra/temp tables also makes sense, where review and manipulation of the data can be done before loading into the production tables.

Using an open source, file based solution (ie SQLite) would also be a way to offer an alternative to unpack the XML into a ANSI compliant SQL based datastructure, without using a full fledged RDBMS.

Great THREAD’n here folks… Fabian, mebe this little bunch should petition NAACCR to spin up another of those InPerson ‘workfest’ trips, like we did to initially grok out the XML structure! where did we do that …. was that Atlanta ?? (bad memory chips in this ‘probably should be retired’ programmer) [insert another emoji here, with fingers in ears, wavin hands!]

adios for now …
r

June 27, 2018 at 3:21 pm #7285
Kathleen Beaumont
Spectator
Hi Jeff,

And no butting out your in too deep now …

You may change your mind after this… 😉

I admit that I prefer to a fault to write my own utilities, especially when existing tools annoy me. So if I were doing this, I would
- read in a Patient record from the XML data file
- iterate through a list of all possible Patient-level data items *
- dynamically build an INSERT statement for just the items provided in the XML record
- and execute it
* You know what items matter to you, so if the XML contains additional/unexpected data items (i.e., ItemDefs) you are not interested in them anyway.

Then do the same for all of the Tumor data associated with this Patient. I don’t think it would take long to write a quick-and-dirty test app to see how the performance compares to using the XML importer built into Oracle. If performance is acceptable, you could make the tool easily configurable to modify the “list of items we care about”, so that future changes to the NAACCR dictionary can be accommodated.

Now remember, if you respond to me you’re just encouraging me to keep making suggestions!

Good luck, and I hope you’ll continue to post about what you’re planning. The subject is very interesting to me.

Kathleen
June 28, 2018 at 7:38 am #7286

Jeff Reed
Spectator

Thank you Kathleen,

What a nice dissection you gave in navigating a NAACCR XML record from a patient record point of view. That makes a lot of sense to adopt XML translations to break down the records as part of this loading process. I could see breaking out the different record type’s into separate XML load modules to support relational data models that would be useful. We only get the ‘Incident’ record type so our focus may be a little different as it centers on tumors with the classic struggle to uniquely identify a patient to weed out duplicates. (Pls use a separate thread for the duplicate topic 😉

My background is heavy on database architecture tied to ETL performance so I do have a classic bias of “I can do it faster” that I am working on. That strength/weakness makes me still lean towards producing files for bulk inserting rather than individual insert statements as we regularly need to process over 100 thousand files a day. As we already have a lot of business rules and validation tables in use in the database we are looking to leverage that in the end solution. That said, I will continue to try and keep a focus on a universal solution that will support a broader audience.

If nothing else my flat earth file mentality may keep “encouraging” you add light to this sea of data ….

June 28, 2018 at 10:26 am #7287

Kathleen Beaumont
Spectator

Jeff,

Going back to basics: You are presently bulk-importing from flat-format files, and said you hoped to minimize re-coding existing systems, so let’s loop back to that concept.

Keep in mind that a big motivation for using XML is to allow transmitting complete text data (e.g., physician’s notes, etc.) instead of truncating it to a couple of thousand characters, so you don’t want to use the current NAACCR undelimited format. (These data items are identifiable in the NAACCR XML dictionary with allowUnlimitedText=”true”.) Also note that the plan is to eliminate the “starting position” attribute of NAACCR data items in the not-distant future, so positioning in a flat file will be something you’ll have to maintain with your own “record layout”. Boy, this is getting ugly really fast, isn’t it?

When importing from flat ASCII, do delimit the fields so that you can know where the text fields begin and end. But I recommend not delimiting with the pipe character (‘|’). In fact, don’t use any keyboard character because somebody is bound to embed it in a text field. I like the guillemet for this purpose (‘»’, typed from the keyboard with Alt+0187) because it is really unlikely to be typed during data entry. And you’ll need to flatten the CRLFs in text data; I think SQL Server understands “~” to be a linefeed, but you’d have to look into what Oracle uses.

You’ll have to write the converter yourself; the freebie IMS and NPCR tools perform conversions between XML and the traditional undelimited NAACCR flat record.

BTW, when you import data now are you loading into temporary tables, or are you loading straight into your production tables? And how long does it take you to load 100 thousand data files of Incident records each day? Again, I’m asking simply because I am finding this whole topic really interesting… so if you get bored with amusing me, feel free to ignore me!

Kathleen

July 18, 2018 at 11:55 am #7446

Jeff Reed
Spectator

Status Update:

My first task was to test Oracle’s capabilities to store and process XML. I used their bulk Data Loader program to load the sample xml NAACCR version 140 file into a CLOB (Character Large Object Blob) field that could then be parsed with Oracles XML functions. The field size for a CLOB can be up to 128 Terabytes. Didn’t take me long to say forget this Oracle XML code given the CDC DLL’s capabilities to parse a NAACCR XML file.

Next task was to create a XMAL application, (Screen pic in attachment), with a back-end of C# for parsing an XML file using the CDC’s XMLPlus.DLL. I got the application to read the XML file and assemble patient/tumor records in a record/field delimited file suitable for bulk loading into our DB. I am adopting what I am calling the KB delimiter in honor of Kathleen B., &#187. (»)

Next Queued tasks
• Integration of the CDC’s EDIT50.dll for scoring and reporting.
• Identify/create better testing data. The Current XML sample file is a NAACCR version 140 ‘A’ record type. I need to test 160 and 180 versions for record type I.
• Continue refining/testing load to add error handling and reporting.

Attachments:
You must be logged in to view attached files.

July 18, 2018 at 3:45 pm #7458

Kathleen Beaumont
Spectator

Hi Jeff,

I’m glad you were able to use the CDC’s XMLPlus.dll to parse a sample data file.

Regarding the second bullet in your “Next Queued tasks” list: An easy way to create a NAACCR XML data file uses the GUI app in the CDC download, XMLExchange Plus. Open the Convert form to select a flat-format data file and an EDITS50-compatible NAACCR metafile (or, if your registry has a custom metafile, use that). Press F1 to bring up the Help pages for this process, and generate the data file(s) you need for your continuing development.

The point I’m trying to make is that you can use a flat data file of Record Type ‘160’ and generate a NAACCR XML data file that meets today’s requirements. Be sure to read the help for drop-down control “Record Type”, which allows you to specify the Record Type of the resulting data file.

Kathleen

July 18, 2018 at 5:56 pm #7459

Kathleen Beaumont
Spectator

Regarding this “Next Queued task”:

* Integration of the CDC’s EDIT50.dll for scoring and reporting.

If you are just starting with the EDITS50 API, you might want to invest a half hour watching the “Tour of the EDITS50 API documentation” WebEx (see https://staging.naaccr.org/forums/topic/edits50-tools-implementation-forum/ for the link).

Kathleen

July 20, 2018 at 10:39 am #7466

Jeff Reed
Spectator

Are there any sample large record sets files for V18 fixed and XML files out there? Also looking for recommendations/links for tools for helping generate sample files (SEER?).
Author

Posts