The ability to format and manipulate Excel character strings is extremely important for data analysis. SQL String functions are useful for querying and manipulating data. Luckily for you, this article will solidify your string function knowledge in both applications.
Transactional SQL, T-SQL is Microsoft’s version of SQL. Its minor syntax differences from SQL won’t be covered in depth here, but we will encounter a couple of examples.
I’m very excited to share string functions with my audience because this function has so much carryover from Excel to SQL.
The ExceltoSQL series aims to help Excel users learn SQL. String functions effectively bridge the knowledge between these applications.
So What are String functions?
String functions are data handling functions used to format or manipulate data. They make data more approachable for users and are a crucial part of the data analyst’s toolbox.
Microsoft supplies a hefty toolbox of String functions for our disposal but we will review a small sample to solidify our foundation.
Check out the Comprehensive list from Microsoft Support below if you have extra time:
Here are six of the most frequently used string functions from Excel. Let’s see the effect of each one in Excel and in SQL.
CONCAT – Connects multiple text fields together
UPPER – Converts lower case string to Uppercase
LEFT RIGHT MID – Returns the leftmost, Rightmost, or a specified starting point from a data string
LEN – tells us how many characters are in a string
SUBSTITUTE – Replaces a specified character with another one
**In SQL Use replace for the same purpose
TRIM – Removes excessive Spaces in a text field
String Functions are most effective when combined with their other string function friends.
Now let’s check out a couple of string function combos!
LEFT/RIGHT/MID + LEN
What if we want to get the Name and title out of the string below? We can’t simply extract it with RIGHT since our target string varies in length.
477-251-6544, Jeremy, Chef |
However, the leading numbers are fixed so we can direct Excel to pull the rightmost contents of D2 after skipping the first fifteen characters with this string.
=RIGHT(D2,LEN(D2)-15)
Let’s Pretend we have an SQL table with Various contacts. This is the statement to ignore the first fifteen characters, just like we did in Excel.
SELECT RIGHT(TEXT,LEN(TEXT)-15) ABBRV_COL FROM dbo.Comments where PostId = 53071
Notice the text ABBRV_COL. This is simply an alias so that the results of our calculated String function don’t read (no column name).
Also, the full Statement reveals I specified table name and the Post ID number so that only one result was returned.
TEXT is the field name residing in the Comments table.
OK, let’s put it all together and combine even more string functions to clean up a really messy data string.
There are all kinds of Issues with this data:
To use this data meaningfully at any point some serious cleaning is required.
The first part of being an effective Problem Solver is identifying the problem. Data formatting issues are no exception.
List out each data issue individually:
Some Records run into a second column. Really? How gross is that?
Inconsistent Characters
Phone Numbers have Parenthesis, dashes, and asterisks
Comma Usage isn’t consistent
The mixture of Upper and Lowercase Characters
Extra Garbage spacing in our Data
Now the most important question you can ever ask yourself…
No, it’s not “What are we having for lunch?”
How about…
“Should we use Excel or SQL to clean up this data?”
How about both?
It’s time to attack all of our data issues one at a time.
Data Running Multiple Columns?
Use the CONCAT function to combine our two ugly columns into the one really ugly column.
(CONCAT(D12:E12)
Inconsistent Characters
(UPPER(CONCAT(D12:E12)), “(“, “”), “)”, “”), “*”, “”), “,”, “”),”-“,””)
Use the Substitute function for each unique character removed. To fix our example, SUBSTITUTE will be needed five times. ( “(”, “)”, “*”, “-“, “,”)
Substitute(“*”, “”) This replaces the asterisk with the contents between the quotes. In this, you are replacing an asterisk with Nothing.
Continue and Nest the rest of the characters you want to replace.
(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(UPPER(CONCAT(D12:E12)), “(“, “”), “)”, “”), “*”, “”), “,”, “”),”-“,””))
Many times our environment is not Case Sensitive so using the UPPER function is a matter of formatting preference. If the environment is case-sensitive, UPPER becomes much more important.
The environment we are dealing with is not Case Sensitive but we are aiming to get everything into the same format. Insert an UPPER command into our sequence to convert everything.
And finally, for those funky extra spaces, wrap the entire expression in a nice TRIM command.
Excel
=TRIM(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(UPPER(CONCAT(D12:E12)), “(“, “”), “)”, “”), “*”, “”), “,”, “”),”-“,””))
Lets see what we did?
That looks really clean
How do we accomplish the same feat in SQL? Let’s pull a SELECT QUERY STATEMENT and find out.
Start with:
SELECT TRIM(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(UPPER(CONCAT(TEXT, TEXT2)), ‘(‘, ”), ‘)’, ”), ‘*’, ”), ‘,’, ”),’-‘,”))
Remember the simple SELECT ALL FROM TABLE format that shows us all the fields in a table. For this example, we are only concerned with the two fields we are displaying, TEXT and TEXT2.
The rest of the String Function tells us how to format the results of our two text fields.
Did you notice the difference in our Statement from the Excel Version?
How about this part?
(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
And this part…
‘(‘, ”), ‘)’, ”), ‘*’, ”), ‘,’, ”),’-‘,”
Our SQL statement had two minor exceptions from the Excel syntax.
1)SQL uses single quotes to encapsulate text where Excel would use double-quotes.
2) REPLACE function is needed instead of SUBSTITUTE. Excel has a replace function but it works a little differently.
Here is our entire statement pointed towards a real table. Notice the FROM tells us the table name and the WHERE condition specifies we only want data from 3 different Posts.
SELECT TRIM(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(UPPER(CONCAT(TEXT,TEXT2)), ‘(‘, ”), ‘)’, ”), ‘*’, ”), ‘,’, ”),’-‘,”))
As FORMATTED_TXT
FROM dbo.comments WHERE PostID IN (‘53068’, ‘53070’, ‘53071’)
See what this does.
Holy smokes, it did the same thing as in Excel!
You just learned how to do something really awesome using two different languages. Now go tell all your friends.