Using SAS with NAACCR XML

Viewing 15 posts - 1 through 15 (of 37 total)
    #6467
    Linda Coyle
    Spectator

    As a vendor, we are making significant progress toward full support of NAACCR XML. We are currently converting the SEER Abstracting Tool (SEER*Abs) so that all of its exports create NAACCR XML. SEER*DMS imports and exports NAACCR XML. We are evaluating all processes related to SEER and NAACCR submissions. We are dedicated to this project, but SAS is one of the biggest challenges that remains. Can the workgroup provide guidance or support in developing a plug-in, library, or anything else that would make it easier to import NAACCR XML into SAS?

    #6483
    Isaac Hands
    Moderator

    Thank you for bringing this to our attention; it has prompted a lot of good discussion within the workgroup.

    In the short term, you can convert a NAACCR XML file into a fixed-width file using the software here:
    https://github.com/imsweb/naaccr-xml
    or here:
    https://www.cdc.gov/cancer/npcr/tools/registryplus/xml-exchange-plus.htm

    However, that short-term solution will not help after the 2020 XML transition, when the fixed-width format is no longer maintained, so it will be important to have a longer-term plan.
    Any SAS developer can use the SAS XML Mapper to develop their own custom NAACCR XML parsing logic: https://support.sas.com/rnd/base/xmlengine/index.html
    However, as a NAACCR Workgroup, we would like to provide a great solution for all NAACCR SAS users instead of asking everyone to develop their own. Currently, we do not have any SAS users in the NAACCR XML Workgroup, so if you know any that would like to participate, please tell them to get in touch with me (isaac@kcr.uky.edu).
    Until we get some SAS expertise in our workgroup, we were wondering if one of the following options would work:

    1. I understand that R and SAS are similar in many respects, and SAS is very expensive, which makes it difficult to develop for without an existing license. If the NAACCR XML Workgroup provided a solution for processing NAACCR XML in R, would that help with a SAS solution?

    2. Given that SAS can connect to relational database servers easily, if the NAACCR XML Workgroup provided a standardized way to import NAACCR XML files into one or more databases, would that be a good solution for SAS?

    We recognize that SAS is a critical software tool for many NAACCR community members and want to make the transition to XML as easy as possible. We welcome additional feedback and discussion on this issue.

    #6544
    Fabian Depry
    Moderator

    Isaac,

    I will let Linda comment on your proposed solution.

    But just so you know, we (at IMS) have a few SAS experts and we are currently investigating the SAS XML Mapper (it seems to be the standard solution for that type of problem).

    Our first proof of concept went very well, but we are not ready to post any results yet. Once we are, we will report them to the NAACCR XML work group and update this forum.

    #6646
    Fabian Depry
    Moderator

    I think we are done looking at SAS for now.

    For reading NAACCR XML in SAS, it looks like an acceptable solution will involve an XMLMap file that tells SAS how to construct the data sets based on the different levels of data in the XML (one data set for NaaccrData, one for Patient and one for Tumor, although all of those data sets can be defined in a single XMLMap file). The XMLMap will define SAS variables based on their NAACCR ID attribute (using XPath); it will also define a few “ORDINAL” variables, which will be used as identifiers for every row of the data sets (they are called ORDINAL because they are counters incremented when a specific tag is found in the data file).

    A SAS program will then be able to “merge” the different data sets back together using the ORDINAL variables as a pivot (or linkage variable); the end result will be a single data set where the NaaccrData data is repeated for every Patient and Tumor, and the Patient data is repeated for every Tumor (which is the same behavior as reading flat files).

    There is one caveat to this solution: SAS will read and process every variable defined in the XMLMap, so using a mapping file that defines all variables won’t be practical for large data files (the processing will be too slow). Instead, a smaller mapping file should be used with just the variables that are needed for the program. Hopefully it will be possible to create those specialized XMLMap files using open-source software. I am attaching an example of a mapping file including only a few variables:
    – naaccr-xml-v16-data-sample.xml: a very simple NAACCR XML sample file
    – naaccr-xml-v16-sas-def-minimal.map: an XMLMap file containing the definition of one variable at each XML level (plus the ordinal variables)
    – readin.level2.sas: a simple SAS program that merges Patient and Tumor data from the sample files and prints frequencies of the defined variables.
    – readin.level2.output: the results of running the SAS program (I only copied the relevant frequencies)
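    To illustrate the ORDINAL idea described above outside of SAS, here is a minimal Python sketch. It is not the XMLMap itself: the sample data and the ordinal column names (NaaccrData_ORDINAL, Patient_ORDINAL, Tumor_ORDINAL) are made up for illustration. It builds one table per XML level, numbers the rows with counters, and then joins the levels back into one row per Tumor.

```python
import xml.etree.ElementTree as ET

NS = "{http://naaccr.org/naaccrxml}"  # NAACCR XML namespace

# Tiny illustrative file; the values are invented for this example.
SAMPLE = """<NaaccrData xmlns="http://naaccr.org/naaccrxml" recordType="I">
  <Item naaccrId="registryId">0000000001</Item>
  <Patient>
    <Item naaccrId="patientIdNumber">00000001</Item>
    <Tumor><Item naaccrId="primarySite">C680</Item></Tumor>
    <Tumor><Item naaccrId="primarySite">C619</Item></Tumor>
  </Patient>
  <Patient>
    <Item naaccrId="patientIdNumber">00000002</Item>
    <Tumor><Item naaccrId="primarySite">C509</Item></Tumor>
  </Patient>
</NaaccrData>"""


def items(element):
    """Return {naaccrId: value} for the Item children directly under element."""
    return {i.get("naaccrId"): (i.text or "") for i in element.findall(NS + "Item")}


def read_levels(xml_text):
    """Build one table per level; ordinals are counters bumped per tag found."""
    root = ET.fromstring(xml_text)
    root_table = [{"NaaccrData_ORDINAL": 1, **items(root)}]
    patients, tumors = [], []
    for patient in root.findall(NS + "Patient"):
        p_ord = len(patients) + 1
        patients.append({"NaaccrData_ORDINAL": 1, "Patient_ORDINAL": p_ord,
                         **items(patient)})
        for tumor in patient.findall(NS + "Tumor"):
            tumors.append({"Patient_ORDINAL": p_ord,
                           "Tumor_ORDINAL": len(tumors) + 1, **items(tumor)})
    return root_table, patients, tumors


def merge_levels(root_table, patients, tumors):
    """Join the three tables on the ordinals: one output row per Tumor."""
    by_root = {r["NaaccrData_ORDINAL"]: r for r in root_table}
    by_patient = {p["Patient_ORDINAL"]: p for p in patients}
    merged = []
    for t in tumors:
        p = by_patient[t["Patient_ORDINAL"]]
        merged.append({**by_root[p["NaaccrData_ORDINAL"]], **p, **t})
    return merged


rows = merge_levels(*read_levels(SAMPLE))
for row in rows:
    print(row["registryId"], row["patientIdNumber"], row["primarySite"])
```

    As in the flat-file layout, the NaaccrData and Patient values are repeated on every Tumor row.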

    For writing NAACCR XML from SAS, the conclusion would be “don’t do it”. We found no satisfactory way of using an XMLMap to re-create a valid NAACCR XML file. There are other solutions that don’t use an XMLMap, but they are very involved and require some type of coding that most people wouldn’t be willing to do. There are other tools and software that can recode variables and that will probably be updated to support NAACCR XML; the best approach for recoding XML files would be to switch to those tools.

    *** Update: it looks like I can’t upload the files in this post; all the files have been uploaded to the Java NAACCR XML project on GitHub:
    https://github.com/imsweb/naaccr-xml/tree/master/docs/sas

    #6649
    Isaac Hands
    Moderator

    This sounds like a great way to do it, thank you for posting the files and explanation. I wonder what is the best way to get other NAACCR Community SAS users to try it out and give feedback.

    #6676
    Valerie Yoder
    Spectator

    I work in informatics at the Utah Cancer Registry and use SAS frequently so I am very interested in this topic. I will try out the code soon.

    We read NAACCR files much more often than we write them, and we very rarely read all the variables in SAS. I think it could be reasonable to create a full mapping file for the community and recommend that users keep only the relevant variables. I can just delete variables from the .map file Fabian provided using Notepad++.
    Even though writing from SAS is problematic, so much of our work and analysis goes through SAS that I don’t see us moving away from it for a long time. We would likely want a different tool to convert SAS output to NAACCR XML (like SEER Data Viewer? or something in Python, if its XML packages can make a valid file).

    #6677
    Isaac Hands
    Moderator

    Valerie,
    Please post your results when you get a chance to try out the XML code in SAS. I know others in the NAACCR Community will be interested in your experience.
    You mentioned using a different tool to convert SAS output to NAACCR XML, possibly Python, and I noticed that there is an “official” way to access SAS datasets from Python with this library: https://github.com/sassoftware/saspy
    If someone created a Python script to create NAACCR XML from a SAS dataset, would that be something you would be interested in?
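    As a rough idea of what such a script could look like, here is a minimal sketch using only the Python standard library. It assumes the SAS data set has already been exported to flat rows (one per tumor) and loaded as Python dictionaries; it does not call saspy. The split of items into PATIENT_ITEMS and TUMOR_ITEMS and the sample values are hypothetical (in practice the level of each item would come from the NAACCR dictionary), and the output has not been validated against the NAACCR XML schema.

```python
import xml.etree.ElementTree as ET

NAACCR_NS = "http://naaccr.org/naaccrxml"

# Hypothetical level assignment for illustration only; the real assignment
# of each item to a level is defined by the NAACCR dictionary.
PATIENT_ITEMS = ["patientIdNumber", "sex"]
TUMOR_ITEMS = ["primarySite", "dateOfDiagnosis"]


def add_items(parent, row, names):
    """Append one <Item naaccrId="..."> child per non-blank value."""
    for name in names:
        value = row.get(name, "")
        if value.strip():
            item = ET.SubElement(parent, "Item", {"naaccrId": name})
            item.text = value


def rows_to_naaccr_xml(rows, root_attributes):
    """Group flat rows by patientIdNumber and nest them as Patient/Tumor."""
    root = ET.Element("NaaccrData", {"xmlns": NAACCR_NS, **root_attributes})
    patients = {}
    for row in rows:
        key = row["patientIdNumber"]
        if key not in patients:
            # Patient-level items are written once, before any Tumor element.
            patients[key] = ET.SubElement(root, "Patient")
            add_items(patients[key], row, PATIENT_ITEMS)
        tumor = ET.SubElement(patients[key], "Tumor")
        add_items(tumor, row, TUMOR_ITEMS)
    return ET.tostring(root, encoding="unicode")


# Invented sample rows, as they might look after export from a SAS data set.
sample_rows = [
    {"patientIdNumber": "00000001", "sex": "1",
     "primarySite": "C680", "dateOfDiagnosis": "20150101"},
    {"patientIdNumber": "00000001", "sex": "1",
     "primarySite": "C619", "dateOfDiagnosis": "20160315"},
    {"patientIdNumber": "00000002", "sex": "2",
     "primarySite": "C509", "dateOfDiagnosis": "20140620"},
]
xml_text = rows_to_naaccr_xml(sample_rows, {
    "baseDictionaryUri": "http://naaccr.org/naaccrxml/naaccr-dictionary-160.xml",
    "recordType": "I",
    "specificationVersion": "1.1",
})
print(xml_text)
```

    For production files, a library that knows the NAACCR dictionaries (such as the Java library linked earlier in this thread) would be a safer way to guarantee a valid file.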

    #6708
    Bruce Riddle
    Spectator

    Like many registries, we use SAS to pre-process all the incoming transmissions before loading them into a registry database. I need to separate RAPID reports from DEFINITIVE reports (they arrive mixed), clean up dates that are incorrect (MMDDYY, YYMMDD, YYDDMM, etc.), standardize hospital numbers (some reporters are required to send us blank hospital numbers), etc. So currently I read in all the variables, process them, and write out all the variables. I’ve tried to figure out a way to do this with XML and have come up empty so far. As part of the process, I also generate management reports that make registry processing of cases easier. Converting each transmission (sometimes as few as 2 cases) from XML to a flat file, processing it, and then converting it back to XML could be very labor intensive.

    My trials with the SAS XML Mapper have not been successful so far; SAS says my files are too large, and I’ve only used 8,000 cases.

    I have also discovered, as many have said, that XML files are very large. Given the charges for disk space and rapidly shrinking financial resources, I worry about whether we can afford XML. Those of us without a programming staff will need some very robust tools to make this all work.

    B

    #6719
    Isaac Hands
    Moderator

    Bruce,
    Thank you for trying out the XML Mapper in SAS. We will discuss the issues you bring up in our XML Workgroup calls and post on this thread when we have some ideas of how to move forward.
    The tasks you are describing are straightforward in a language that has first-class support for XML, such as Python, Java, or C#, but we are still trying to figure out the best way to deal with large XML files in SAS that need hundreds of variables. Writing out XML files seems to be an afterthought in SAS, so we are still trying to get to the bottom of that issue as well. As you probably know, SAS is a very expensive piece of software, and many of us on the XML Workgroup are not as familiar with it as we need to be. As a result, it is taking longer to find solutions to these problems than it did with our currently published NAACCR XML software tools and libraries, but we are working on it.
    -Isaac

    #6720
    Bruce Riddle
    Spectator

    I tried out the sample code Fabian posted on XML files I created using various tools and our data for one year. The good news is that I got identical results from the XML files exported by the tools, although they differed in size. For 8,000 cases, one file was 188,725 KB and one was 148,833 KB. The bad news is that it is very slow. SAS is provided under license to NPCR Registries and many take advantage of the opportunity. Few registries I know have any staff who know any Java, Python, or C++. If I knew C++ or Java well enough to write code to manipulate XML, I would get a much better-paying job.

    One suggestion here was to use the XML tools built into MS SQL. We will explore that idea.

    B

    #6952
    Bruce Riddle
    Spectator

    My experiments with SAS and XML have not been very successful. The loss of SAS eliminates a very powerful tool both for basic file processing prior to loading data into the registry database and for working with the data on export from the registry database. I have little hope that SAS will invest in a more advanced XML tool.
    Here is one idea for a solution to at least create analytical files. SAS PROC IMPORT will read delimited files with a header. This provides an option for two applications. One application would export selected variables from the main database in a pipe-delimited format with a header. To make this more user friendly, the application needs a configuration page where you can just check the variables you need and save that list as a file for future use. Some users will only need to set the configuration once. Then SAS PROC IMPORT can read in the delimited file and create the SAS data set.
    The second application would read an XML file and perform the same task as above.
    In both instances, there would be one line per patient/tumor. Very few exercises require the entire set of all NAACCR variables, so these analytic data sets should be fairly small.
    The major advantage of this method is that you do not need any input or format statement. The significant disadvantage is that PROC IMPORT selects the input format, so sometimes you get numeric when you want character, etc.
    Another version of the above is to write out two separate files: one file of pipe-delimited data and a second file with the input format. The input format could easily be dragged into a SAS program. The configuration page could allow for selection of formats. For example, I read in all dates as character since NAACCR allows dates with blanks. In SAS, I can fill in the blanks before creating a SAS date that can be manipulated.
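    The two-file variant described above could be sketched in Python roughly as follows. This is only an illustration: the SELECTED list stands in for the saved configuration (variable names and character lengths are made-up examples), and every variable is declared as character so that dates with blanks survive the import.

```python
import csv
import io

# Hypothetical saved configuration: selected variables and character lengths.
SELECTED = [("patientIdNumber", 8), ("dateOfDiagnosis", 8), ("primarySite", 4)]


def write_delimited(records, selected, out):
    """Write a pipe-delimited file with a header row, one line per record."""
    names = [name for name, _ in selected]
    writer = csv.writer(out, delimiter="|", lineterminator="\n")
    writer.writerow(names)
    for record in records:
        writer.writerow([record.get(name, "") for name in names])


def sas_input_statement(selected):
    """Build a SAS INPUT statement that reads every variable as character."""
    specs = " ".join(f"{name} :${length}." for name, length in selected)
    return f"input {specs};"


# Invented sample record; note the date with a blank day.
records = [{"patientIdNumber": "00000001", "dateOfDiagnosis": "201503  ",
            "primarySite": "C680"}]
data_file = io.StringIO()
write_delimited(records, SELECTED, data_file)
print(data_file.getvalue(), end="")
print(sas_input_statement(SELECTED))
```

    The generated INPUT statement is meant to be pasted into a DATA step whose INFILE statement uses dlm='|', dsd, and firstobs=2 to skip the header row.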
    The XML file for a standard time period, 1995 to 2018, will be very large. Few registries will have the storage capacity to keep a reasonable number of these files around. The ability to easily create analytic files is very important. Finding a very convenient way to unzip, run a tool or GenEdits, and re-zip will be important.

    #6955
    Isaac Hands
    Moderator

    I am not ready to give up on SAS yet.
    It sounds like your suggestions above are along the same lines as what I have been wondering: if SAS doesn’t really support large XML data files, is there an intermediate format that SAS could use instead of the XML directly?
    For example, here is a list of all “DBMS” formats that SAS can use natively in PROC IMPORT:
    https://support.sas.com/documentation/cdl/en/acpcref/63184/HTML/default/viewer.htm#a003094743.htm

    Is it possible/probable/straightforward to create some sort of library that SAS can call directly (Java, Python, etc.) that will take an XML file, create one of these intermediate formats, and then load into SAS datasets so that the rest of SAS is happy?

    #6985
    Isaac Hands
    Moderator

    Following up on my last post to this thread…
    I wonder if using a CSV formatted file would be a good intermediary between SAS and XML? The CSV format doesn’t suffer from many of the same limitations as the fixed-width file, such as needing to know the position and length of all variables beforehand, so translating between XML and CSV will not require maintenance of Volume II metadata to go along with every NAACCR Item. CSV will still be limited for conveying multi-tier data, such as Patient/Tumor/etc., but SAS does not understand multi-tier data models anyway, so maybe that’s OK for this use case.
    I have been playing around with some Java code running inside SAS that can generate CSV from NAACCR XML and then load the data as a SAS dataset. So far, it looks promising: it takes about 4.5 minutes to load a 6GB NAACCR XML file into a SAS dataset with this method, using a pretty basic Windows 10 desktop computer. I am not sure if that will be acceptable, but it might make a nice proof of concept.
    Here is what the SAS code looks like:

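    /* File references for the input NAACCR XML file and the CSV file to create */
    filename xmlfile 'C:\\Users\\isaac\\Documents\\ky9515v16.xml';
    filename csvfile 'C:\\Users\\isaac\\Documents\\ky9515v16.csv';
    
    /* Run the Java converter; the class must be on the SAS Java classpath */
    data _null_;
    	do;
    		declare JavaObj j1 ("edu/uky/kcr/naaccrxml/csv/ConvertXmlToCsv", xmlfile, csvfile); 
    		j1.callVoidMethod ("convert");
    		j1.delete();
    	end;
    run;
    
    /* Load the generated CSV, taking variable names from the header row */
    proc import datafile=csvfile
    	out=fromcsv
    	dbms=csv;
    	getnames=yes;
    run;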

    The Java code behind this uses the Java NAACCR XML library from IMS.

    #6986
    Fabian Depry
    Moderator

    I think this is a good idea.

    In the end, this is similar to what the NAACCR XML Utility tool does, except that it translates XML into NAACCR fixed-column format instead of CSV.

    Did you use specialized code to read the XML, or did you use the existing Java library to read the data “patient by patient”?

    #6989
    Isaac Hands
    Moderator

    I just used the existing Java library to read each patient as it occurred in the XML file, writing out an incremental “csvPatientId” number for each <Patient> element so that SAS would know where the unique patients were. If there is interest in this technique, I will post my Java code.
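    The Java code itself is not shown in this thread. As a stand-in, here is a minimal Python sketch of the same patient-by-patient idea using only the standard library: it streams the XML, increments a csvPatientId for each Patient element, and writes one CSV row per Tumor. The sample data and the column selection are invented for illustration, and patients without any Tumor element produce no rows in this sketch.

```python
import csv
import io
import xml.etree.ElementTree as ET

NS = "{http://naaccr.org/naaccrxml}"  # NAACCR XML namespace


def direct_items(element):
    """Return {naaccrId: value} for the Item children directly under element."""
    return {i.get("naaccrId"): (i.text or "") for i in element.findall(NS + "Item")}


def xml_to_csv(source, out, columns):
    """Stream a NAACCR XML file to CSV; returns the number of patients read."""
    writer = csv.DictWriter(out, fieldnames=["csvPatientId"] + columns,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    root, root_values, patient_id = None, None, 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if root is None:
            root = elem  # the first event is the start of NaaccrData
            continue
        if event == "end" and elem.tag == NS + "Patient":
            if root_values is None:
                # Root-level items precede the first Patient in the file.
                root_values = direct_items(root)
            patient_id += 1
            patient_values = direct_items(elem)
            for tumor in elem.findall(NS + "Tumor"):
                writer.writerow({"csvPatientId": patient_id, **root_values,
                                 **patient_values, **direct_items(tumor)})
            root.remove(elem)  # drop the finished patient to keep memory flat
    return patient_id


# Tiny invented sample; a real call would pass a file path instead.
SAMPLE = b"""<NaaccrData xmlns="http://naaccr.org/naaccrxml" recordType="I">
  <Item naaccrId="registryId">0000000001</Item>
  <Patient>
    <Item naaccrId="patientIdNumber">00000001</Item>
    <Tumor><Item naaccrId="primarySite">C680</Item></Tumor>
    <Tumor><Item naaccrId="primarySite">C619</Item></Tumor>
  </Patient>
  <Patient>
    <Item naaccrId="patientIdNumber">00000002</Item>
    <Tumor><Item naaccrId="primarySite">C509</Item></Tumor>
  </Patient>
</NaaccrData>"""

csv_out = io.StringIO()
patient_count = xml_to_csv(io.BytesIO(SAMPLE), csv_out,
                           ["registryId", "patientIdNumber", "primarySite"])
print(csv_out.getvalue(), end="")
```

    Because each Patient element is discarded once its rows are written, memory use should stay roughly constant regardless of file size, which matters for multi-gigabyte files.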

    As a fun exercise, I started this experiment by creating an Access Database instead of a CSV file from the XML, mostly because SAS has “native” support for Access databases, much better than SAS XML support, and Microsoft Access has been mentioned several times as a tool that some registries use. Unfortunately, I quickly ran into limitations of the Access database format, specifically the 2GB file size and the number of fields in a table:
    https://support.office.com/en-us/article/Access-specifications-0cf3c66f-9cf2-4e32-9568-98c1025bb47c
    From what I can tell, Access can load CSV files with some fiddling, so maybe this solution would help both Access and SAS users.


Copyright © 2018 NAACCR, Inc. All Rights Reserved