
This is a software-generated (AWS) transcription, and it is not perfect.
When I was in college, I needed some kind of a job to help me pay for school and life, so I was willing to take just about anything, and I ended up finding a job at a company called Fusion-io. They make these really fast PCI Express storage drives for servers, and the job I landed with them was as a quality assurance technician. What that meant is basically that I would have to look at these PCI Express cards and make sure they had all their stickers and screws and everything like that before they were boxed up and sent to customers. That was a terrible job, I hated it. But it got me in a position where I saw some problems and thought, hey, I have the skills to be able to make a difference here.

The problem was that we were keeping track of all these inspections just with Excel spreadsheets, divided up by workweek: all right, this serial number, yes, it's got all its screws, yes, it's got its stickers, check, check, check. It was just so hard to maintain that if we ever wanted to go back and find the inspection for a specific serial number, or if one of us happened to be out that day and somebody needed inspection data, there was just no way to get what we needed. I was in school for computer science at the time and thought that I could build something that would solve this problem, and my boss at the time encouraged me to do that. So I just started building this little website with a PHP and MySQL back end to keep track of these inspections, and it grew over time; version 2.0 I built with Python and Django. Anyway, it got me into this world of collecting and managing data.

That team was within the operations group at Fusion-io, and the operations department had a lot of needs around analytics and reporting and things like that, so they started forming this team, and it worked out that I was able to join it as kind of a junior. I think my title was systems engineer, but it was just kind of, hey, we're going to give you these little things here and there, and we'll see how you do with them and grow from there. Our task was to build a data warehouse from the ground up and start reporting on the things that the company needed, and that was kind of my first exposure to the world of analytics. From there I moved over to my current company, Instructure, through a former co-worker. We did the same thing over here: we built the data warehouse from the ground up, and as it's grown, my role and responsibilities have grown with it. So at this point, I'm now the manager of Data Engineering here at Instructure. It started out with pretty humble beginnings, but it worked out okay for me.
There are basically two different categories that my responsibilities fit into as far as the technical side goes. Number one is servicing the analytics needs. We have a number of analysts who need data to be able to write reports, and so my job is to go and get that data for them and put it into a system where they can access it, so managing the data warehouse is the big responsibility there. On the other side of that, we have a number of different systems that we use for enterprise applications, our CRM system and our ERP system and a number of others, and we have a lot of needs in terms of moving data back and forth between these different systems, so system-to-system integrations are a big part of what I do as well. That's the technical side. On the managerial side, it's just typical managerial work: hiring and recruiting and working with the people and managing the workload for the team and all those kinds of things.

As for my weekly hours, I'm pretty lucky. I'm able to work just about 40 hours a week, and occasionally there will be things that need to be done outside of business hours, but for the most part, it's a really stable 9-to-5 type schedule. I do have the option to work from home if needed. I've got little kids at home, so I don't do that as often as I would like to, just because they have a hard time with it, but working from home is definitely an option. There is a little bit of travel involved. I typically go to one or more training conferences every year just to stay abreast of all the new technologies and options that are out there, and we have an office in Budapest, and we've started hiring a few engineers over there as well, so I've been over there to meet with them and work with them and get to know them a little better. So traveling and working outside of regular hours are kind of rare things; for the most part, it's really stable, typical work hours, and I don't have to do crazy hours or anything like that.
I'll try to keep this brief. We have a ton of tools that we use, so we'll start in the middle and kind of work our way out from there. In the middle, we have our data warehouse, and at the moment, that is a tool called Snowflake. It sits on top of Amazon's S3 offering, which is just cloud storage, and it allows you to put files out there on S3, and then it can connect to those files and read them, and you can just write simple SQL queries to access the data that you put out there. You pay according to the compute you use; they call it the "virtual warehouse," and you pay according to the size of the machine you're using for the queries and how long it runs and those types of things. So Snowflake is the heart of it. Prior to Snowflake, we were using Amazon Redshift, which is a cloud data warehouse offering, and that worked well for us, but we kind of pushed it to its limits, and we were looking for something that performed a little better, and Snowflake fit the bill in a number of different ways. First of all, it decouples the storage from the compute power, so we can store as much as we want and not have to pay extra for compute. For example, with Redshift, you buy a node at a time, and that node comes with, let's say, two terabytes of storage, and if you fill it up, you have to buy another node, and that comes with more RAM and more compute and everything. So by decoupling those, you can kind of control your costs a little bit better, so it's been advantageous that way. The other thing that's pretty good about it is that it really performs and meets our needs a little bit better than Redshift did. So that's kind of the heart of our systems, and then we've got tools to get data in and tools to get data out of that.

The main tool, the main language we use to get data in, is Python. We have a whole bunch of Python scripts that are connecting to various APIs and databases and systems, and we pull data out of those systems and load it into Snowflake, and from there, we can model the data and do whatever we need to manipulate it to get what we need for the reports. For the reporting, we use a tool called Tableau, which is a visualization tool, and you can bring in your data and make pretty charts and graphs out of it and go from there. That's kind of the heart of it. But we have a whole bunch of other ancillary tools that we use as well for different use cases. For example, we have certain data sets that are large enough that they'll bring down the server if we try to just run a simple Python job with them. Instructure makes Canvas, and the web log data for Canvas is just massive; it's something on the order of 2 to 3 billion rows per month, and we've gone back clear to 2010 or whatever, and we just have so much data there that it's hard to manage. So we're using a tool called Airflow, which allows us to kind of orchestrate that. Amazon has an offering called EMR, Elastic MapReduce, and this Airflow tool allows us to write a "Pythonish" script, and then it basically spins up a Spark cluster, runs the job through that, and dumps all the data out to Amazon S3, and from there we're able to copy it into Snowflake.
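As a rough illustration of that Airflow pattern (spin up an EMR cluster, run a Spark step, land the output on S3 for a later copy into Snowflake), a DAG along these lines is one way it could look. This is a sketch, not the actual pipeline: the bucket, job script, and cluster settings are placeholders, and the exact operator names and import paths vary between Airflow provider versions.

```python
# Sketch of an Airflow DAG that runs a Spark job on a transient EMR cluster.
# Bucket names, the job script, and the cluster config are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import (
    EmrAddStepsOperator,
    EmrCreateJobFlowOperator,
    EmrTerminateJobFlowOperator,
)
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

# The job flow id comes back from the create-cluster task via XCom.
CLUSTER_ID = "{{ task_instance.xcom_pull(task_ids='create_cluster', key='return_value') }}"

SPARK_STEPS = [
    {
        "Name": "aggregate_weblogs",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "s3://example-etl-bucket/jobs/aggregate_weblogs.py",  # hypothetical job script
            ],
        },
    }
]

with DAG(
    dag_id="canvas_weblog_aggregation",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    create_cluster = EmrCreateJobFlowOperator(
        task_id="create_cluster",
        # A real config would also specify the EMR release, instance types, etc.
        job_flow_overrides={"Name": "weblog-emr"},
    )
    add_step = EmrAddStepsOperator(
        task_id="add_step",
        job_flow_id=CLUSTER_ID,
        steps=SPARK_STEPS,
    )
    wait_for_step = EmrStepSensor(
        task_id="wait_for_step",
        job_flow_id=CLUSTER_ID,
        step_id="{{ task_instance.xcom_pull(task_ids='add_step', key='return_value')[0] }}",
    )
    terminate_cluster = EmrTerminateJobFlowOperator(
        task_id="terminate_cluster",
        job_flow_id=CLUSTER_ID,
    )

    create_cluster >> add_step >> wait_for_step >> terminate_cluster
```

A downstream task (not shown) would then issue the COPY INTO statement that pulls the S3 output into Snowflake.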
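The simpler Python loading scripts mentioned above, by contrast, tend to follow a plain pull-from-an-API, write-into-Snowflake shape. Here is a minimal sketch assuming the snowflake-connector-python package and a hypothetical inspections API; the endpoint, credentials, and table are placeholders rather than anything actually in use.

```python
# Rough sketch of the "pull from an API, load into Snowflake" pattern.
import requests
import snowflake.connector

# Pull the source records from a hypothetical REST API.
resp = requests.get("https://api.example.com/v1/inspections", timeout=30)
resp.raise_for_status()
rows = resp.json()

# Connect to Snowflake; the warehouse here is the "virtual warehouse"
# that provides the compute for the load.
conn = snowflake.connector.connect(
    account="my_account",
    user="etl_user",
    password="...",
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="RAW",
)
cur = conn.cursor()
cur.execute(
    """
    CREATE TABLE IF NOT EXISTS inspections (
        serial_number STRING,
        passed BOOLEAN,
        inspected_at TIMESTAMP_NTZ
    )
    """
)
# Small feeds can go straight in with INSERTs; the really large data sets
# get staged on S3 and loaded with COPY INTO instead.
cur.executemany(
    "INSERT INTO inspections (serial_number, passed, inspected_at) VALUES (%s, %s, %s)",
    [(r["serial_number"], r["passed"], r["inspected_at"]) for r in rows],
)
conn.close()
```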
AWS is a big part of what we do. We're using Lambda, which is a tool where you can essentially write a little script, and it will run on whatever trigger you define, and you don't have to worry about how much memory do I need, how much storage do I need on this machine, how much RAM do I have to buy. You just write your code and put it out there, and it figures out the rest, so it's really nice. For the system integrations, we have a tool that we use called Dell Boomi, and it's kind of a simple drag-and-drop tool where you say, I've got this source system over here, here's the query or whatever to get the data that I need out of that system, and then I can manipulate the data from there and then connect to a destination system, and it manages all the connections and data and everything like that. We probably could do Python or something similar for those types of jobs as well, but we've settled on Boomi in most of those cases because it makes more sense, and it's a little easier to maintain and things like that. So there are many different options and tools that we're using, and that gives it a lot of variety and keeps it interesting to get into those types of different systems. Choosing the best tool for the job has been one of the more rewarding parts, because I'm not shoehorned into, well, this is the Python way to do it, so we have to do it that way. It's really, what's the best tool for the circumstance, and then we go and use that tool, and that's been a really nice thing about working here.
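As a generic sketch of the Lambda model mentioned above (not any specific integration described here), a handler can be as small as this; the event payload depends entirely on whichever trigger you configure, such as an S3 upload, a schedule, or an API Gateway request.

```python
# handler.py -- a minimal AWS Lambda handler sketch.
import json


def lambda_handler(event, context):
    # 'event' carries the trigger payload; S3 and SQS triggers, for example,
    # deliver a list of items under the "Records" key.
    records = event.get("Records", [])
    print(f"received {len(records)} record(s)")  # ends up in CloudWatch Logs
    return {
        "statusCode": 200,
        "body": json.dumps({"processed": len(records)}),
    }
```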