
This is a software-generated (AWS) transcription and it is not perfect.
Sure. So my previous career was in higher education. I did some teaching, but a lot of administrative work: administering, managing, and coordinating a couple of academic programs, and doing some student advising. I just reached a point where I felt I had learned everything I was going to learn in that role, and I wanted something new. Then I learned what data science was, and it seemed to tap an interesting set of skills that I wasn't really developing or using in my career at the time. So I did about six months of research, spent about a year doing prep work, and then quit my job and did the boot camp. Cool.
So we're a large company, Clarivate. We have a centralized data science team, and I'll be assigned as the data science owner for two or three particular projects per quarter. Clarivate has several businesses, but one of them is scholarly publishing data. So we own something called Web of Science, which anyone who's done scientific research has probably heard of. It's a large database of basically every scientific journal article ever published, and I work on tools that make it easier to use that database. For example, if you're a researcher and you have a manuscript and you're not sure where the best place to submit it is, we can make recommendations based on historical data. Or if you're an editor at a journal and you've got 50 submissions in your queue, which three should you read today that are most likely to actually make it through your submission pipeline?

I work with the product teams directly. They often know what they want to accomplish, but they don't know how to do it, which is why they're coming to us. If they knew what the solution was, they would just write up a ticket and send it to the engineering team. But they don't, so they come to us, and we usually spend a long time hashing that part out. It can be months just figuring out what the requirements are, what the data are going to be, how we're going to get the data, what the maintenance requirements are, et cetera. And then usually, once that's squared away, the delivery part is a matter of weeks. So higher-up people will decide the project and it gets funded, but then it's everything from determining what the product requirements are, what the success requirements are, and what the metrics are, through to delivering it.

For my role, the top priority: everything is team based, so my priority is making sure that we're using the data in a way that's both effective and statistically rigorous. That's probably number one and number two. And then number three would be making sure that the user experience actually reflects the kind of thing that the product team wants to solve. So any time we put a machine learning model in production, I will go to the website and use it a bunch, just to make sure that it's actually doing the kind of thing it's supposed to be doing, that it makes sense how to use it, that it's intuitive, et cetera.

As for pain points, if you talk to the scientists you're going to get a lot of the same stuff: just getting data is the pain point. In a large organization like ours, there are seven or eight different databases that are maintained by different teams, with different strategies and different legacies and histories. So a lot of times, just navigating access to the things that you need is a pain point. Another pain point is deploying and managing our services in a way that is both relatively straightforward and viable in the long term. For example, there are a lot of new technologies, a lot of libraries and open source code, that solve a lot of the deployment issues around machine learning. But how do we know that in six months that package is still going to be maintained, or that the startup that has that cool thing that makes your life easier isn't going to be bought up and erased by Google or something like that?
So we have to be really careful, when we buy into a technology, that we're going to be able to use it for the long term.
So we have a lot of data. It's uncommon to train a model with less than millions of data points, and sometimes it's hundreds of millions, so we use Apache Spark a lot: for pretty much any ETL job, any ongoing processing job, and a lot of model training. If it's distributed training, we'll use Apache Spark. We try to write everything in Scala because it's faster. All of the deployment of our work is done in Java-based microservices, because our whole company's platform is a Java-based platform. The one big exception to that is any kind of deep learning model; all that stuff is written in Python. So we'll write and train models in Python. The team tends to prefer TensorFlow, but if someone uses PyTorch, that's fine too. And then we'll serialize that object and come up with some way to ingest it and deploy it as a Java-based microservice. So I'm using Spark, I'm using Python, I'm using Scala. We do use Databricks, which makes a lot of that easier. I don't know if you're familiar with it, but it's a notebook interface for cloud computing that makes using Spark pretty seamless. Other than that, all of the standard engineering technologies, like Git and so forth.

And then, what algorithms? We use whatever is best for the job. We don't have a preference for any particular algorithm; it's just based on the business requirement. Does it need to be explainable? Does it need to be fast? What does the data look like? We look at all those things and we pick the best solution.
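To make that workflow concrete, here is a minimal sketch of the kind of Spark/Scala job described above: an ETL step over historical submission data, followed by training a simple MLlib model and saving it for a downstream service. The S3 paths, column names, and features are hypothetical placeholders, not the team's actual schema; the API calls are standard Spark SQL and Spark MLlib.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler

object SubmissionOutcomeJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("submission-outcome-training") // hypothetical job name
      .getOrCreate()

    // ETL: load historical submission records (hypothetical path and schema),
    // keep only the columns the model needs, and drop incomplete rows.
    val submissions = spark.read.parquet("s3://example-bucket/submissions/")
      .select("citation_count", "reference_count", "abstract_length", "accepted")
      .na.drop()
      .withColumn("accepted", col("accepted").cast("double")) // MLlib expects a numeric label

    // Assemble the numeric columns into the single feature vector MLlib expects.
    val assembler = new VectorAssembler()
      .setInputCols(Array("citation_count", "reference_count", "abstract_length"))
      .setOutputCol("features")

    // A simple, explainable baseline; the actual algorithm is chosen per business requirement.
    val lr = new LogisticRegression()
      .setLabelCol("accepted")
      .setFeaturesCol("features")

    val pipeline = new Pipeline().setStages(Array(assembler, lr))

    // Train on 80% of the data and sanity-check on the held-out 20%.
    val Array(train, test) = submissions.randomSplit(Array(0.8, 0.2), seed = 42L)
    val model = pipeline.fit(train)

    val predictions = model.transform(test)
    val accuracy =
      predictions.filter(col("prediction") === col("accepted")).count().toDouble / test.count()
    println(s"hold-out accuracy: $accuracy")

    // Persist the fitted pipeline so a downstream (e.g. Java-based) service can load it.
    model.write.overwrite().save("s3://example-bucket/models/submission-outcome/")

    spark.stop()
  }
}

On Databricks, the notebook environment typically provides the SparkSession already, so roughly the same body would run there without the builder boilerplate.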