At the end of the previous part of this series, our SSIS job is able to read content of a file, filter out the unnecessary rows from it, and extract the most common terms of it along with their frequency points. Now, in this article, we will write the errors and results in files, and check the running of the finished job.
Writing data to files
As the last few steps of the data flow, we will write data to files.
Firstly, we can realize a warning sign on the Term extraction component. It says that the error outputs will be lost, so we should add some transformations to them. Taking its advice, we will lead the error output of the Term extraction component to an error file, so it will write the errors of the term extracting into this file.
The SSIS component, which is responsible for defining a simple file as the data destination, is the Flat File Destination. Drag-and-drop this component from the SSIS toolbox/Other Destinations to the design interface, next to the Term extraction, and name it Error file destination. Then, pull the red error arrow from the Term extraction to it, and click OK on the opening window, that you leave the error action as Redirect row.
(Bonus: you can format the data flow components to be equally spaced: click on the Format menu and choose Make Equal under Vertical Spacing.)
As you double-click on the Error file destination, you can see on the editor (on the screenshot below), that the Flat File Destination needs a flat file connection manager, too, similarly to the Flat File Source, to connect to the file where it will write the data. However, we don’t want to use the previously created manager from the first part of this series, because it reaches a lyrics file source, but we want to connect to a new, dedicated error file destination, so you have to create a flat file connection manager again.
After clicking the New… button, a window on the next picture comes up. Leave the default Delimited setting, because we don’t know how wide the error file columns will be, while the other options provide fixed column widths.
After that, on the General page of the Flat File Connection Manager Editor, set the manager similarly than the previous one, with the differences that the connection manager name will be Error file manager, and the path will be dataerror.txt in our solution. The columns are set automatically by the SSIS, you can look at them on the other pages than General. These columns are set from the input of the Flat File Destination, the error output of the Term Extraction, but you can change their properties, if you want (e.g. names), or delete some of them. However, I don’t recommend changing their DataType or column width, because you may have to do datatype conversions between input data and file output data, or it can lead to errors that it couldn’t match them, or it can lead to data loss.
After setting the manager, go to Mappings page on the Flat File Destination Editor because of the warning. As you can see there, the column mappings are already set by SSIS automatically, if you left every column the same as before in the Flat File Connection Manager Editor. If you didn’t, then order the inputs to the file output columns, and if everything is right, click OK.
After these, we would like to write the results of the term extraction to another output file, so, bind the default output (blue arrow) of the Term extraction component to another Flat File Destination component, whose name will be Lyrics scores file destination.
You have to create a new flat file connection manager for this Flat File Destination, too, to make a connection to a third file. We set this new manager like the Error file manager previously, but its name will be One lyrics scores file manager, and its file path will be data/output_scores_one_song.txt in the solution. The columns of the file will be Term and Score after the input columns, as you can see on the other pages. Realize, that the column delimiter is set to comma by default.
Change the names of the columns, if you want, but don’t change their types and sizes. You can change the column separator, too, if you like. After you are ready with the connection manager, look at the Mappings on the Flat File Destination Editor window. If you gave other names to the output columns, then order the input columns to them. After that, click OK.
Testing the job
Now, you are ready with your job, so you can try it out with the Start button in Visual Studio. The full data flow is on the screenshot below.
After running the job, you should see green checks on the components, like in the next picture, this shows, that the job was successful. When you open error.txt, you should see that it’s empty. The Conditional Split could filter 3 rows out, these contained the words “Chorus” or “Liedtext”. The flow has written 9 of the most frequent words into the results file.
Now, let’s look at this file, its content is on the next picture. Beautiful, isn’t it? All of them are simple nouns, shorter than 12 characters, and occur at least twice in the text. There are two columns in this file, separated by a comma, as it was set in the connection manager of the result file.
Comparing this list with the original lyrics, including the title of the song, I found that these words really occur as much times as it is written into the results. Interesting, that into the 4 occurrences of “hand”, the occurrences of “hands” were counted, too. This is because the term extraction handles the plural forms of the nouns and transforms them into their singular form by the counting.  However, the word “die” occurs in the lyrics 3 times, it isn’t on the list, because it’s just a verb. The word “you” comes to a similar fate, opposite its occurrence of 15, because it’s a pronoun.
On my GitHub repository, you can find the result files under the data directory. The output_scores_one_song.txt contains the result of analyzing Hand of Doom in case of the frequency scoring, while the output_scores_one_song_tfidf.txt file contains the result of TFIDF scoring of the same song (I changed the output file name in the connection manager of the result file, and set the Term Extraction to the TFIDF scoring instead of the Frequency scoring). As you can see, the same words are there, only the numbers and the ranking of the words are other.
In this article, we set an error file and a results file to write the data to, and then, we tested the whole job, and checked the results.
Summing up this article series, we built an SSIS job that analyzes one specific lyrics file, and returns a result file that contains the list of the most frequent words of the lyrics file, along with their points. In a later article series, I’d like to show you how to analyze more files at once.