Top products from r/datascience

We found 103 product mentions on r/datascience. We ranked the 205 resulting products by number of redditors who mentioned them. Here are the top 20.

Top comments that mention products on r/datascience:

u/tpintsch · 2 points · r/datascience

Hello, I am an undergrad student. I am taking a Data Science course this semester. It's the first time the course has ever been run so it's a bit disorganized but I am very excited about this field and I have learned a lot on my own. I have read 3 Data Science books that are all fantastic and are suited to very different types of classes. I'd like to share my experience and book recommendations with you.

Target - 200 level Business/Marketing or Science departments without a programming/math focus. 
Textbook - Data Science for Business https://www.amazon.com/gp/product/1449361323/ref=ya_st_dp_summary
My Comments - This book provides a good overview of Data Science concepts with a focus on business related analysis. There is very little math or programming instruction which makes this ideal for students who would benefit from an understanding of Data Science but do not have math/cs experience. 
Pre-Reqs - None.

Target - 200 level Math/Cs or Physics/Engineering departments.
Textbook - Data Mining: Practical Machine Learning Tools and Techniques https://www.amazon.com/gp/aw/d/0123748569/ref=pd_aw_sim_14_3?ie=UTF8&dpID=6122EOEQhOL&dpSrc=sims&preST=_AC_UL100_SR100%2C100_&refRID=YPZ70F6SKHCE7BBFTN3H
My comments: This book is more in depth than my first recommendation. It focuses on math and computer science approaches with machine learning applications. There are many opportunities for projects from this book. The biggest strength is the instruction on the open source workbench Weka. As an instructor you can easily demonstrate data cleaning, analysis, visualization, machine learning, decision trees, and linear regression. The GUI makes it easy for students to jump right into playing with data in a meaningful way. They won't struggle with knowledge gaps in coding and statistics. As far as I can tell, Weka isn't used in industry, and it also fails on large data sets. However, for an Intro to Data Science without many pre-reqs this would be my choice.
Pre-Reqs - Basic Statistics, Computer Science 1 or Computer Applications.

Target - 300/400 level Math/Cs majors
Textbook - Data Science from Scratch: First Principles with Python
http://www.amazon.com/Data-Science-Scratch-Principles-Python/dp/149190142X
My comments: I am infatuated with this book. It delights me. I love math, and am quickly becoming enamored of computer science as well. This is the book I wish we used for my class. It quickly moves through some math and Python review into a thorough but captivating treatment of all things data science. If your goal is to prepare students for careers in Data Science this book is my top pick.
Pre-Reqs - Computer Science 1 and 2 (hopefully using Python as the language), Linear Algebra, Statistics (basic will do, advanced preferred), and Calculus.

Additional suggestions:
Look into using Tableau for visualization.  It's free for students, easy to get started with, and a popular tool. I like to use it for casual analysis and pictures for my presentations. 

Kaggle is a wonderful resource and you may even be able to have your class participate in projects on this website.

Quantified Self is another great resource. http://quantifiedself.com
One of my assignments that's a semester long project was to collect data I've created and analyze it. I'm using Sleep as Android to track my sleep patterns all semester and will be giving a presentation on the analysis. The Quantified Self website has active forums and a plethora of good ideas on personal data analytics.  It's been a really fun and fantastic learning experience so far.

As far as flow? Introduce visualization from the start before wrangling and analysis. Show or share videos of exciting Data Science presentations. Once your students have their curiosity sparked and have played around in Tableau or Weka then start in on the practicalities of really working with the data. To be honest, your example data sets are going to be pretty clean, small, and easy to work with. Wrangling won't really be necessary unless you are teaching advanced Data Science/Big Data techniques. You should focus more on Data Mining. The books I recommended are very easy to cover in a semester, so I would suggest that you model your course outline on the book. Good luck!

u/jakeporway · 7 points · r/datascience

Hey everyone! I’m seeing many questions from budding or new data scientists in the thread trying to figure out the best path ahead - How do I get started in a career in data science? What skills do I need? What should I major in?

As we all know, data science is becoming increasingly popular, yet the term is still hotly debated.

So to start us off, my view is that a data scientist is basically a statistician who can program. Data science is the art of using the latest computer science and statistical techniques to collect, analyze, visualize, and otherwise draw conclusions from data. Most of the thorny topics being discussed these days about bias, quality of data, modeling, learning, and data cleaning all come from the healthy body of statistics we've built over the last 100 years. The novelty of data science comes from a technical need to be able to handle the volume of data now available and to wrangle it from many disparate forms into a clean, usable format. Beyond that, all the other skills attributed to a data scientist - visual communication skills, good written skills, subject matter expertise - hold true for anyone doing science, from biology to anthropology.

What’s interesting to note is that the skills needed by a true “data scientist” are exceedingly rare. Using Drew Conway’s data science Venn diagram (yes, I still reference this one), one needs to have:

Hacking skills: These are programming and scripting skills, but are often not taught in universities or even in industry.

Statistical experience: Not many people are trained in formal statistics beyond simple linear regression. A good data scientist should be an expert in questions of bias, advanced modeling, and causal inference.

Machine learning: I’m going to single out machine learning chops, as not every hacker + stats person has these. It takes a special skill set to build efficient neural networks and to understand how they do/don’t work.

Substantive expertise: The data scientist may not need to be an expert in the field themselves but, if they’re not, they better learn enough of it from an expert to be able to interpret results or think creatively. At DataKind we solve this by teaming non-profits with the data scientists to bring their expertise.

The good news? With this diversity of skills needed, there are lots of pathways you can follow and no one way. For example, Drew himself was a computer science undergrad who went on to get a Ph.D. in political science. His graduate work drew him into the world of statistics and data, including machine learning concepts that inspired him to study the social networks of terrorists and do predictive analytics on voting behaviors. He also has great communication skills, picked up some basic visualization, and has strong business and management savvy from his time in government and intelligence. Other people I know have come from mathematics backgrounds and then picked up programming and computer science to be able to build more advanced models. No matter how you get there, you’ll need to build up your programming and stats skills and not lose sight of your soft skills of communication and creativity.

To learn more about the paths of data scientists, I also recommend a great book that Sebastian Gutierrez, one of the moderators of /r/datascience, put together called Data Scientists at Work - http://www.amazon.com/Data-Scientists-Work-Sebastian-Gutierrez/dp/1430265981

The bad news? There are lots of pathways you can follow so it can feel overwhelming to figure out how to get started.

The Internet is now littered with online courses to teach you data science. Check out Coursera first and foremost. There are also fellowships through Insight and the Data Incubator that will round out your data science training over about 12 weeks. I’m also a huge fan of John Foreman’s Data Smart for a good intro to data science algorithms and thinking if you’re more of the self-learning type. Of course the best way to learn is to do: Check out online competitions through Kaggle or DrivenData to take part in machine learning competitions. Start small and look at questions you’re genuinely interested in. Lastly, don’t underestimate the power of meeting people in person. Immerse yourself in the data science community as best you can. Attend local Meetups, check out webinars or local conferences, and keep posting questions on /r/datascience of course and you’ll soon be well on your own data science path. When you’re ready to start the job search, don’t forget that we do a monthly jobs round up over at DataKind to help you use your powers for good - check out our list for January!

No matter how you get there, enjoy the journey. Data science is a thrilling and exciting field and whether you know Linear Algebra backwards and forwards or not is not as important as rolling up your sleeves and having fun digging in wherever you’re at. Good luck!

u/sasquatch007 · 1 point · r/datascience

Just FYI, because this is not always made clear to people when talking about learning or transitioning to data science: this would be a massive undertaking for someone without a strong technical background.

You've got to learn some math, some statistics, how to write code, some machine learning, etc. Each of those is a big undertaking in itself. I am a person who is completely willing to spend 12 hours at a time sitting at a computer writing code... and it still took me a long time to learn how not to write awful code, to learn the tools around programming, etc.

I would strongly consider why you want to do this yourself rather than hire someone, and whether it's likely you'll be productive at this stuff in any reasonable time frame.

That said, if you still want to give this a try, I will answer your questions. For context: I am not (yet) employed as a data scientist. I am a mathematician who is in the process of leaving academia to become a data scientist in industry.


> Given the above, what do I begin learning to advance my role?

Learn to program in Python. (Python 3. Please do not start writing Python 2.) I wish I could recommend an introduction for you, but it's been a very long time since I learned Python.

Learn about Numpy and Scipy.

Learn some basic statistics. This book is acceptable. As you're reading the book, make sure you know how to calculate the various estimates and intervals and so on using Python (with Numpy and Scipy).

Learn some applied machine learning with Python, maybe from this book (which I've looked at some but not read thoroughly).

That will give you enough that it's possible you could do something useful. Ideally you would then go back and learn calculus and linear algebra and then learn about statistics and machine learning again from a more sophisticated perspective.
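
As a rough illustration of the statistics step above, here is a minimal sketch of a point estimate and a confidence interval. It sticks to Python's standard library so it runs without installing anything; Numpy and Scipy offer their own equivalents and are what you would normally reach for. The sample data and the helper name `mean_confidence_interval` are made up for illustration, and the interval uses a normal approximation, which is only reasonable for fairly large samples.

```python
import math
import statistics
from statistics import NormalDist


def mean_confidence_interval(sample, confidence=0.95):
    """Return (mean, lower, upper) using a normal approximation.

    Hypothetical helper for illustration only; for small samples a
    t-distribution based interval would be more appropriate.
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    # Standard error from the sample standard deviation (n - 1 denominator).
    std_err = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, mean - z * std_err, mean + z * std_err


# Made-up measurements, purely for demonstration.
heights_cm = [172.1, 168.4, 181.0, 175.3, 169.9, 177.8, 174.2, 170.5]
mean, low, high = mean_confidence_interval(heights_cm)
print(f"mean={mean:.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

Working an interval out by hand like this once or twice makes it easier to check that you understand what a library call is returning later.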

> What programming language do I start learning?

Learn Python. It's a general purpose programming language (so you can use it for lots of stuff other than data), it's easy to read, it's got lots of powerful libraries for data, and a big community of data scientists use it.

> What are the benefits to learning the programming languages associated with so-called 'data science'? How does learning any of this specifically help me?

If you want a computer to help you analyze data, and someone else hasn't created a program that does exactly what you want, you have to tell the computer exactly what you want it to do. That's what a programming language is for. Generally the languages associated with data science are not magically suited for data science: they just happen to have developed communities around them that have written a lot of libraries that are helpful to data scientists (R could be seen as an exception, but IMO, it's not). Python is not intrinsically the perfect language for data science (frankly, as far as the language itself, I am ambivalent about it), but people have written very useful Python libraries like Numpy and scikit-learn. And having a big community is also a real asset.

> What tools / platforms / etc can I get my hands on right now at a free or low cost that I can start tinkering with the huge data sets I have access to now? (i.e. code editors? no idea...)

Python along with libraries like Numpy, Pandas, scikit-learn, and Scipy. This stuff is free; there's probably nothing you should be paying for. You'll have to make your own decision regarding an editor. I use Emacs with evil-mode. This is probably not the right choice for you, but I don't know what would be.


> Without having to spend $20k on an entire graduate degree (I have way too much debt to go back to school. My best bet is to stay working and learn what I can), what paths or sequence of courses should I start taking? Links appreciated.

I personally don't know about courses because I don't like them. I like textbooks and doing things myself and talking to people.

u/hattivat · 11 points · r/datascience

IMHO,

step 1: Read https://www.amazon.com/Pragmatic-Programmer-20th-Anniversary-2nd/dp/0135957052/ref=dp_ob_title_bk

step 2: Read https://realpython.com/python-pep8/ and https://docs.python-guide.org/dev/virtualenvs/

step 3: Write a REST API which takes arguments from the URL, uses these arguments to run some predictive model of your creation, and then returns the result; since you already know Python, I'd recommend using Flask, there are many free tutorials, just google it. If using Python, I highly recommend using PyCharm (the free community edition is enough) over Jupyter or Anaconda, the latter will let you do many bad things which would trigger a red warning in PyCharm (such as doing import in the middle of the file).

step 4 (optional, but recommended): Learn the basics of Java (this tutorial should be more than enough https://www.tutorialspoint.com/java/index.htm ) and read https://www.amazon.com/Clean-Code-Handbook-Software-Craftsmanship/dp/0132350882

step 5: Write a publisher application which reads a csv, xml, or json file from disk (for bonus points: from someone else's public REST API for data, for example https://developer.walmartlabs.com/docs/read/Search_API ), and turns the data contained within into a list of python dictionaries or serializable objects (btw, read up on serializing, it's important), and then sends the results into a kafka or rabbitMQ queue. I would strongly recommend sending each item/record as a separate queue message instead of sending them all as one huge message.

step 6: Learn how to use cron (for bonus points: Airflow) to make the application from step 5 automatically run every second day at 8 am

step 7: read the closest thing in existence to being the data engineering book: https://dataintensive.net/

step 8: Write a consumer application which runs 24/7 waiting for something to appear in the queue, and when it does, it calls your REST API from step 3 using the data received from the queue, adds the returned result (predicted price, or whatever) to the data, then runs some validation / cleaning on the data, and saves it in some database (SQLite is the easiest to have running on your local computer) using an ORM (such as SQLAlchemy).

step 9: Add error handling - your applications should not crash if they encounter a data-related exception (TypeError, IndexError, etc.) but instead write it to a log file (as a minimum, print it to the console) and continue running. External problems (connection to the database, for example) should trigger a retry - sleep(1) - retry cycle, and after let's say 5 retries if it's still dead, only then the application should crash.

step 10: For bonus points, add process monitoring - every time your application processes a piece of data, record what category it was in a timeseries database, such as influxdb. Install grafana and connect it to influxdb to make a pretty real-time dashboard of your system in action. Whenever your application encounters a problem, record that in influxdb as well. Set grafana to send you an email alert whenever it records more than 10 errors in a minute.

step 11: More bonus points, add caching to your application from step 3, preferably in Redis (there are libraries with helpful decorators for that, e.g. https://pythonhosted.org/Flask-Cache/ )
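
To make the error-handling policy from step 9 a bit more concrete, here is a minimal sketch using only the standard library. The names `save_with_retry` and `process_all` are hypothetical, `save` stands in for whatever writes to your database, and it is an assumption that external failures surface as `ConnectionError`; your database driver may well raise something else. The count of 5 is treated here as 5 attempts in total.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("consumer")

MAX_ATTEMPTS = 5  # treated as 5 attempts in total in this sketch


def save_with_retry(save, record, retries=MAX_ATTEMPTS, delay=1.0):
    """Call save(record); on an external failure, sleep and try again.

    `save` is a hypothetical callable standing in for a database write.
    """
    for attempt in range(1, retries + 1):
        try:
            return save(record)
        except ConnectionError as exc:
            log.warning("attempt %d/%d failed: %s", attempt, retries, exc)
            if attempt == retries:
                raise  # still dead after the last attempt: let the app crash
            time.sleep(delay)


def process_all(records, transform, save, delay=1.0):
    """Process records one by one; return how many were saved."""
    saved = 0
    for record in records:
        try:
            cleaned = transform(record)
        except (TypeError, ValueError, KeyError, IndexError) as exc:
            # Data-related problem: log it and keep running.
            log.error("skipping bad record %r: %s", record, exc)
            continue
        save_with_retry(save, cleaned, delay=delay)
        saved += 1
    return saved


if __name__ == "__main__":
    store = []  # in-memory stand-in for a database table
    rows = [{"price": "10.5"}, {"price": "it's secret!"}, {"price": "3"}]
    count = process_all(rows, lambda r: float(r["price"]), store.append, delay=0)
    print(count, store)
```

The point of splitting the two cases is that a bad record is the data's fault and can be skipped, while a dead database is the environment's fault and is worth waiting for.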

I'm assuming you are familiar with Spark, if not, then add that to your learning list. A recommended intro project would be to run some aggregation on a big dataset and record the results into a dedicated database table allowing for fast and easy lookup (typical batch computing task). You could also rewrite the applications from points 5 and/or 8 to use spark streaming.

I also heavily recommend learning how to use docker and kubernetes (minikube for local development), this is not only super useful professionally, but also makes it much easier to do stuff such as running spark and airflow on your home computer - downloading and running docker images is way easier than installing any of those from scratch the traditional way.

One crucial piece of advice I can give is the mindset difference between data science and data engineering - unlike in data science, in data engineering you normally want to divide the process into as small units as possible - the ideal is to be processing just one [document / record / whatever word is appropriate to describe an atomic unit of your data] at a time. You of course process thousands of them per second, but each should be a separate full "cycle" of the system. This minimizes the impact of any crashes/problems and maximizes easy scalability¹. That is of course assuming that the aim is to do some sort of ETL, if you are running batch aggregations then that is of course not atomic.

¹ As an example, if your application from step 5 loaded all the data as one queue message, then the step 8 application would have to process it all in some giant loop, so to parallelize it you would have to get into multi-threaded programming, and trust me - you don't want that if you can avoid it (a great humorous tale on the topic http://thecodelesscode.com/case/121 ). You also have to run it all under one process, so you can't easily spread across multiple machines, and there is a risk that one error will crash the whole thing. If on the other hand you divide the data into the tiniest possible batches - just one item per message, then it's a breeze to scale it - all you need to do is to run more copies of the exact same application consuming from the same queue (queue systems support this use case very well, don't worry). Want to use all 8 CPU cores? Just run 8 instances of the consumer application. Have 3 machines sitting idle that you could use? Run a few instances of the application on each, no problem. Want the results really fast? Use serverless to run as many instances of your app as you have chunks of data and thus complete the job in an instant. One record unexpectedly had a string "it's secret!" in a float-only field and it made your app crash? No problem, you only lost that one record, the rest of your data is safe. Then you can sit back and watch your application work just fine while the colleague who decided to use multi-threading for his part is on his fifth day of overtime trying to debug it.
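
A small sketch of the one-record-per-message idea, with the caveat that `queue.Queue` is only an in-memory stand-in for a real Kafka topic or RabbitMQ queue, and that `publish` and `consume` are hypothetical names rather than any client library's API. This version drains the queue and stops; a real consumer would block and keep running.

```python
import json
import queue


def publish(records, q):
    """Send each record as its own serialized message."""
    for record in records:
        q.put(json.dumps(record))


def consume(q, handle):
    """Drain the queue, handling exactly one message per cycle.

    Returns (processed, failed). A failure costs only that one record.
    """
    processed, failed = 0, 0
    while True:
        try:
            message = q.get_nowait()
        except queue.Empty:
            break
        try:
            handle(json.loads(message))
            processed += 1
        except (TypeError, ValueError, KeyError) as exc:
            print(f"dropped one record: {exc}")
            failed += 1
    return processed, failed


q = queue.Queue()  # in-memory stand-in for the message broker
publish([{"price": "9.99"}, {"price": "it's secret!"}, {"price": "5"}], q)
prices = []
print(consume(q, lambda rec: prices.append(float(rec["price"]))))
```

Because every message is self-contained, scaling out is a matter of starting more copies of the consumer against the same broker queue; nothing in `consume` has to change.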

u/parts_of_speech · 12 points · r/datascience

Hey, DE here with lots of experience, and I was self taught. I can be pretty specific about the subfield and what is necessary to know and not know. In an inversion of the normal path I did a mid career M.Sc in CS so it was kind of amusing to see what was and was not relevant in traditional CS. Prestigious C.S. programs prepare you for an academic career in C.S. theory but the down and dirty of moving and processing data use only a specific subset. You can also get a lot done without the theory for a while.

If I had to transition now, I'd look into a bootcamp program like Insight Data Engineering. At least look at their syllabus. They put you in front of employers and force you to finish a demo project. In terms of CS fundamentals... https://teachyourselfcs.com/ offers a list of resources you can use over the years to fill in the blanks.

Data Engineering is more fundamentally operational in nature than most software engineering. You care a lot about things happening reliably across multiple systems, and when using many systems the fragility increases a lot. A typical pipeline can cross a hundred actual computers and 3 or 4 different frameworks. doesn't need a lot of it. (Also I'm doing the inverse transition as you... trying to understand multivariate time series right now)

I have trained jr coders to become data engineers and I focus a lot on Operating System fundamentals: network, memory, processes. Debugging systems is a different skill set than debugging code, it's often much more I/O centric. It's very useful to be quick on the command line too as you are often shelling in to diagnose what's happening on this computer or that. Checking 'top', 'netstat', grepping through logs. Distributed systems are a pain. Data Eng in production is like 1/4 linux sysadmin.

It's good to be a language polyglot. (python, bash commands, SQL, Java)

Those massive java stack traces are less intimidating when you know that Java's design encourages lots of deep class hierarchies, and every library you import introduces a few layers to the stack trace. But usually the meat and potatoes method you need to look at is at the top of a given thread. Scala is only useful because of Spark, and the level of Scala you need to know for Spark is small compared to the full extent of the language. Mostly you are programmatically configuring a computation graph.

Kleppmann's book is a great way to skip to relevant things in large system design.

https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321

It's very worth understanding how relational databases work because all the big distributed systems are basically subsets of relational database functionality, compromised for the sake of the distributed-ness. The fundamental concepts of how the data is partitioned, written to disk, caching, indexing, query optimization and transaction handling all apply. Whether the input is SQL or Spark, you are usually generating the same few fundamental operations (google Relational Algebra) and asking the system to execute them the best way it knows how. We face the same data issues now as we did in the 70s but at a larger scale.
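
To illustrate the relational algebra point, here is a small sketch that runs the same query twice: once as hand-written selection, projection and join operators over lists of dicts, and once as SQL through the standard library's `sqlite3`. The tables, column names and helper functions are all invented for the example, and a real engine's query plan is of course far more sophisticated than this.

```python
import sqlite3

orders = [
    {"order_id": 1, "customer_id": 10, "amount": 25.0},
    {"order_id": 2, "customer_id": 11, "amount": 5.0},
    {"order_id": 3, "customer_id": 10, "amount": 40.0},
]
customers = [
    {"customer_id": 10, "name": "Ada"},
    {"customer_id": 11, "name": "Grace"},
]


def select(rows, predicate):
    """Selection: keep the rows that satisfy a predicate."""
    return [row for row in rows if predicate(row)]


def project(rows, columns):
    """Projection: keep only the named columns."""
    return [{col: row[col] for col in columns} for row in rows]


def join(left, right, key):
    """Equi-join using a hash index built on the right-hand side."""
    index = {}
    for rrow in right:
        index.setdefault(rrow[key], []).append(rrow)
    return [{**lrow, **rrow} for lrow in left for rrow in index.get(lrow[key], [])]


# The query as composed operators ...
by_hand = project(
    select(join(orders, customers, "customer_id"), lambda r: r["amount"] > 10),
    ["name", "amount"],
)

# ... and the same query as SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
conn.executemany("INSERT INTO orders VALUES (:order_id, :customer_id, :amount)", orders)
conn.executemany("INSERT INTO customers VALUES (:customer_id, :name)", customers)
by_sql = conn.execute(
    "SELECT c.name, o.amount FROM orders o "
    "JOIN customers c ON c.customer_id = o.customer_id "
    "WHERE o.amount > 10 ORDER BY o.order_id"
).fetchall()

print(by_hand)
print(by_sql)
```

Both forms return the same rows; the SQL version just hands the choice of how to execute the join and filter over to the engine.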

Keeping up with the framework or storage product fashion show is a lot easier when you have these fundamentals. I used Ramakrishnan, Database Management Systems. But anything that puts you in the position of asking how database systems work from the inside is extremely relevant even for "big data" distributed systems.

https://www.amazon.com/Database-Management-Systems-Raghu-Ramakrishnan/dp/0072465638

I also saw this recently and by the ToC it covers lots of stuff.

https://www.amazon.com/Database-Internals-Deep-Distributed-Systems-ebook/dp/B07XW76VHZ/ref=sr_1_1?keywords=database+internals&qid=1568739274&s=gateway&sr=8-1

But keep in mind... the designers of these big data systems all had a thorough grounding in the issues of single node relational database systems. It's very clarifying to see things through that lens.

u/coffeecoffeecoffeee · 2 points · r/datascience

More important than any other programming tool you'll touch. One job might use SAS, another might use R, and a third might use Python. All three will use SQL. Maybe different dialects of SQL, but the syntax and the concepts are going to be identical.

SQL is how you query data from databases, which is where companies store data. It is the language for data querying. You might do some manipulation or more complicated work in a language like Pig or Hive, but you will always use SQL in some capacity.

Basic knowledge may help, but you should be comfortable with a variety of concepts. I'd be pretty nervous if you didn't know what an outer join was, but I'd probably cut you from the interview loop if you didn't know what grouping was. If you're not comfortable with query optimization that's probably fine because different dialects optimize queries differently and you're not going to be a DBA.
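
For anyone unsure what those two concepts look like in practice, here is a minimal sketch using Python's built-in `sqlite3` so it runs anywhere. The `users` and `purchases` tables and their contents are made up for the example, and other SQL dialects may differ in small details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE purchases (user_id INTEGER, amount REAL);
    INSERT INTO users VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
    INSERT INTO purchases VALUES (1, 10.0), (1, 15.0), (2, 7.5);
""")

# The outer join keeps every user, including those with no purchases;
# the grouping collapses the joined rows to one summary row per user.
rows = conn.execute("""
    SELECT u.name,
           COUNT(p.user_id)           AS n_purchases,
           COALESCE(SUM(p.amount), 0) AS total
    FROM users AS u
    LEFT OUTER JOIN purchases AS p ON p.user_id = u.user_id
    GROUP BY u.user_id, u.name
    ORDER BY u.user_id
""").fetchall()

for row in rows:
    print(row)
```

With an inner join instead, the user with no purchases would silently disappear from the result, which is exactly the kind of mistake an interviewer is probing for.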

If you don't know SQL, I've found that the best reference is Sams Teach Yourself SQL in 10 Minutes. It's a tiny book of individual lessons that take around ten minutes each. You'll get comfortable with all of the basics of writing SQL. SQL is a fundamental skill for a data scientist, but on the bright side, it's far easier to pick up than R or Python.

u/doddyk96 · 1 point · r/datascience

Thank you so much for your reply. I actually do plan on taking Andrew Ng's course just cause the book I am talking about is very limited to Python but I've heard great things about it. However, the Stanford course I was referring to was the Statistical Learning course based on the ISL book.

Yes I plan on doing some kaggle challenges once I feel comfortable with my skills to build up my portfolio or see if I can find some other novel projects to work on.


Ideally I'd like to be in a data science consultancy type role where I get to work on different kinds of projects and don't necessarily need very specialized domain knowledge. But at this point I think more direction as to what kind of roles exist would also be helpful. I just don't know what the field is actually like and I've never really met anyone doing data science for a living.

Thank you again for your reply. It was very helpful.

u/DataWave47 · 3 points · r/datascience

You're welcome. Thanks for providing some additional detail. This helps. I think if you read up on the CRISP-DM and use that framework to walk your way through some of these challenges it will be very beneficial to you. I'd recommend giving this document a read when you have the time. I think that if you show them that you are comfortable with these guidelines and know how to work your way through it to solve a problem it will go a long way. Model selection can be a bit tricky depending on the situation but I think most practitioners have a favorite model that they go to. Sounds like you're already familiar with Wolpert's "No Free Lunch Theorem" suggesting to try a wide variety of techniques. Personally, this is where I'd start digging deeper into tuning parameters (cross-validation, etc.) to help with that decision. Ultimately though, it's important to have a firm understanding of the strengths/weaknesses of the different models and their use cases so you can make an informed selection decision. Kuhn and Johnson's book Applied Predictive Modeling will be a good read to help you prepare.
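
As a rough sketch of what tuning a parameter with cross-validation means, the following pure-Python example scores a few settings of a k-nearest-neighbours regressor with 5-fold cross-validation and picks the one with the lowest held-out error. The data is synthetic, the helper names are hypothetical, and in practice you would likely use a library such as scikit-learn rather than hand-rolling this.

```python
def k_fold_indices(n, k):
    """Split range(n) into k interleaved folds (no shuffling, for determinism)."""
    return [list(range(i, n, k)) for i in range(k)]


def knn_predict(train, x, n_neighbors):
    """Average the targets of the n_neighbors training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:n_neighbors]
    return sum(y for _, y in nearest) / len(nearest)


def cv_mse(data, n_neighbors, k=5):
    """Mean squared error of k-NN regression, averaged over k held-out folds."""
    fold_errors = []
    for fold in k_fold_indices(len(data), k):
        held_out = set(fold)
        # Train only on points outside the current fold.
        train = [p for i, p in enumerate(data) if i not in held_out]
        errors = [
            (knn_predict(train, data[i][0], n_neighbors) - data[i][1]) ** 2
            for i in fold
        ]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / len(fold_errors)


# Synthetic data: a straight line plus a small alternating wiggle.
data = [(float(x), 2.0 * x + (1.0 if x % 2 else -1.0)) for x in range(30)]

scores = {n: cv_mse(data, n) for n in (1, 2, 5, 15)}
best = min(scores, key=scores.get)
print(scores, "best n_neighbors:", best)
```

The same loop works for comparing entirely different model families: swap `knn_predict` for another predictor and compare the held-out errors on equal footing.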

u/fieldcady · 1 point · r/datascience

First off, thank you for your service!

I hate to say it but you've got quite a lot of ground to make up. It's hard for me to gauge whether you have the coding skills needed. I get the impression that it's mostly sys admin stuff, which is good but not really sufficient (correct me if I'm wrong). You may want to teach yourself python if you don't use it yet.

The Coursera class on machine learning is something you should look into, since it will introduce you to a large body of knowledge that is critical for DS and probably all new to you.

I also encourage reading a book on data science, which would give you a good overview of the field as a whole and let you assess where the gaps are in your knowledge. I published one recently, which has great coverage of topics but has gotten mixed reviews so far. Here's another one which has better reviews, and is by a guy I know and respect.

u/briangodsey · 3 points · r/datascience

One of the best not-very-technical books on data science in business is Thinking With Data. It's quirky but gets at the core of what good data science is supposed to be.

Beyond that, Data Science for Business has some great stuff in it, but you would probably want to skip the more technical parts, which might end up being most of the book, depending on your interest in that. Same for Think Like a Data Scientist (apologies for the self-promotion).

Medium.com has some solid articles about data science and various aspects of business, but they are scattered and I haven't yet seen a collection of articles that broadly cover what you're looking for.

u/[deleted] · 1 point · r/datascience

I did now. Any way of getting a sticky/wiki/FAQ of useful materials / common questions for noobs like me? People can vote/review books and MOOCs / Kaggle competitions, and what was the best for them. Give us newbies something to get started on so we don't have to flood the sticky. That then gives more of a community support rather than one person's suggestion.


For instance

Applied Predictive Modeling

or the less theory version

Intro to Statistical Learning were two books that helped me with understanding statistical models and had applications and exercises in R

R for Data Science was decent enough and had updated packages for making tidy data.


I found the Data Science Coursera Specialization decently useful, but didn't go deep enough. It did give me enough of a taste to know this is the direction I want my career to go in. So I'm hesitant to do more MOOCs.




I also don't have experience in Data Science hiring, but have it for consulting/actuarial. I'd be happy to help critique resumes during my free time for all the graduating students.

u/onetwosex · 1 point · r/datascience

OP, so that you know, you mention uncle Bob's "Clean Code", but your link redirects to the book "Clean Coder". They're both great, but different.

I've ordered the book Practical Statistics for Data Scientists: 50 Essential Concepts. Looks great to brush up the basics of statistics and machine learning. Since I haven't actually read it yet, take my input with a grain of salt.

u/core_dumpd · 3 points · r/datascience

Jose Portilla on Udemy has some good python based courses (and also frequents this subreddit). There's regularly sales or some sort of coupon code available to get any of the courses for $10-$15, so it's very reasonable.

For books:

https://www.amazon.com/Python-Data-Analysis-Wrangling-IPython/dp/1491957662/ref=asap_bc?ie=UTF8 ... it's not out yet, but due any day. You can also get preview access on sites like Safari Online (which would also have all the books below).

https://www.amazon.com/Data-Science-Scratch-Principles-Python/dp/149190142X/ref=sr_1_1

For general python:

https://www.amazon.com/Python-Crash-Course-Hands-Project-Based/dp/1593276036/ref=sr_1_1

https://www.amazon.com/Automate-Boring-Stuff-Python-Programming/dp/1593275994/ref=sr_1_1

No Starch Press, OReilly, APress and Manning generally have pretty good quality publications. I'd usually skip anything from Packt, unless it's specifically received good reviews.

u/Zedmor · 1 point · r/datascience

I am in probably same boat. Agree with your thoughts on github. I fell in love with this book: https://www.amazon.com/Python-Machine-Learning-Sebastian-Raschka/dp/1783555130/ref=sr_1_1?ie=UTF8&qid=1474393986&sr=8-1&keywords=machine+learning+python

it's pretty much what you need - guidance through familiar topics with great notebooks as examples.

Take a look at seaborn package for visualization.

u/ianblu1 · 5 points · r/datascience

I usually recommend this book for this sort of problem: https://www.amazon.com/Data-Science-Scratch-Principles-Python/dp/149190142X

In it you'll get your feet wet with respect to basic python and be exposed to how you would implement some core algorithms from scratch. Once you know that it should be relatively straightforward to move to the higher level libraries.

It's important to note that there aren't really "equivalent functions" mapping R to Python. This is because R and Python optimize for different things. R is a declarative analysis language: you tell it what you want it to do, not how to do it. Python is a full-featured programming language also used for software development, so it supports many different paradigms (OO, functional, etc.). There are component libraries such as sklearn that implement declarative APIs that let you say things like "fit a model with these characteristics", or pandas, which lets you say things like "what is the average value in all of these columns". But in general Python itself doesn't really work that way. You build things from the bottom up.
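To make the "what" versus "how" contrast concrete, here is a minimal sketch that uses only the standard library. The toy `columns` data and both function names are made up for illustration. The second function stands in for the declarative style that libraries like pandas offer (there you would typically reach for something like `DataFrame.mean()`), while the first spells out every step by hand.

```python
from statistics import mean

# Hypothetical toy data: each key is a column name, each list is that column's values.
columns = {
    "height": [1.0, 2.0, 3.0],
    "weight": [10.0, 20.0, 30.0],
}


def column_means_bottom_up(table):
    """Spell out *how* to average each column, one step at a time."""
    result = {}
    for name, values in table.items():
        total = 0.0
        for value in values:
            total += value
        result[name] = total / len(values)
    return result


def column_means_declarative(table):
    """Say *what* we want (the mean) and let a library routine do the work."""
    return {name: mean(values) for name, values in table.items()}


bottom_up = column_means_bottom_up(columns)
declarative = column_means_declarative(columns)
print(bottom_up)
print(declarative)
```

Both calls produce the same averages; the difference is only in how much of the procedure you write yourself.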

u/Robin_Banx · 7 pointsr/datascience

Almost the exact same trajectory as you - graduated with a psych degree, learned a lot of stats and experiment design, then did the Coursera ML course.

Reading this book is probably the biggest thing that took me from knowing there to doing well in interviews (before that it was just scattered projects): https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291 A second edition is coming out pretty soon, so watch out for that.

If I were doing it today, this is probably the best material out there: https://www.dunderdata.com/ It starts from scratch and gives you an amazing tour of Pandas. Author's also working on a practical Machine Learning book.

u/ThatOtherBatman · 3 pointsr/datascience

Exactly what it sounds like. They're going to be testing your ability to design a clean and efficient solution to a problem.
You don't need to come up with the "correct" solution. They're going to be more interested in how you think through the problem, your communication skills, etc.
I highly recommend [this](https://www.amazon.com/Cracking-Coding-Interview-6th-Programming/dp/0984782850/ref=sr_1_1?ie=UTF8&qid=1465591474&sr=8-1&keywords=Cracking+the+coding+interview) book.

u/seabass · 3 pointsr/datascience

Here's the description of the book:

Data Scientists at Work is a collection of interviews with sixteen of the world's most influential and innovative data scientists from across the spectrum of this hot new profession...Through incisive in-depth interviews, this book mines the what, how, and why of the practice of data science from the stories, ideas, shop talk, and forecasts of its preeminent practitioners across diverse industries...

---
Here's the list of people in the book:
Table of Contents

u/cfors · 22 pointsr/datascience

Designing Data Intensive Applications is your ticket here. It takes you through a lot of the algorithms and architecture present in the distributed technologies out there.

In a data engineering role you will probably just be munging data through a pipeline making it useful for the analysts/scientists to use, so a book recommendation for that depends on the technology you will be using. Here are some of my favorite resources for the various tools I used in my experience as a Data Engineer:

u/wrtbwtrfasdf · 1 pointr/datascience

I'd focus on:

  • Networking (personal connections)
  • Coding interview practice: Cracking the Coding Interview
  • Create a portfolio from personal projects with various datasets: Kaggle is great for this. GitHub too.

Big Data stuff with pyspark or dask would probably be a boon too. MongoDB is also pretty easy to pick up.

u/JustJeezy · 1 pointr/datascience

SQL in 10 Minutes

This was the book used in an introductory course I took. It did a pretty good job of explaining everything and was pretty easy to follow.

u/nakun · 2 pointsr/datascience

Hi,

I am starting out studying Data Science myself (not employed in the field). Here is what has been useful to me:

Data Quest has some free lessons and they are good. (They also have a weekly newsletter that has learning resources/articles).

Practical Statistics for Data Scientists has been very helpful in getting me up to speed on statistics (note that the code here uses R, not Python).

u/Aidtor · 1 pointr/datascience

If you want to be valuable to companies post graduation you should learn more about programming (design templates, how to write tests, how to go from a paper to code). I recommend this book as a good starting place. Once you're comfortable with how the different methods work, pick up this book.

u/shaggorama · 1 pointr/datascience

You'll probably find this article and its references interesting: https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining

I also strongly recommend this book: http://www.amazon.com/Guerrilla-Analytics-Practical-Approach-Working/dp/0128002182

If you're looking for something more technical about actually doing analyses, this book is very accessible: http://www.amazon.com/Applied-Predictive-Modeling-Max-Kuhn/dp/1461468485

If you use R, this book is really great: http://www.dcc.fc.up.pt/~ltorgo/DataMiningWithR/

u/dewgazi · 1 pointr/datascience

Big Data: A Revolution that will Transform how We Live, Work, and Think by Mayer-Schonberger and Cukier (https://www.amazon.com/Big-Data-Revolution-Transform-Think/dp/0544227751)

Data Science for Business: What you Need to Know about Data Mining and Data Analytic Thinking by Provost and Fawcett (https://www.amazon.com/Data-Science-Business-Data-Analytic-Thinking/dp/1449361323/ref=pd_bxgy_14_img_3?_encoding=UTF8&psc=1&refRID=XSYTKYEVG8W52XART2BD)

Data and Goliath by Schneier (https://www.amazon.com/Data-Goliath-Battles-Collect-Control/dp/039335217X)

Cathy O'Neil's book is ok. It is worth reading, though I thought it could have been better.

Dataclysm is great.

u/codefying · 1 pointr/datascience

My top 3 are:

  1. [Machine Learning](https://www.cs.cmu.edu/~tom/mlbook.html) by Tom M. Mitchell. Ignore the publication date, the material is still relevant. A very good book.
  2. [Python Machine Learning](https://www.amazon.co.uk/dp/1783555130/ref=rdr_ext_sb_ti_sims_2) by Sebastian Raschka. The most valuable attribute of this book is that it is a good introduction to scikit-learn.
  3. Using Multivariate Statistics by Barbara G. Tabachnick and Linda S. Fidell. Not a machine learning book per se, but a very good source on regression, ANOVA, PCA, LDA, etc.

u/monkeyunited · 3 pointsr/datascience

Data Science from Scratch

Python Machine Learning

DSFS covers the basics of Python. If you're comfortable with that and want to dive into implementing algorithms (using TensorFlow 2, for example), then PML is a great book for that.

u/KingEnchiladas · 1 pointr/datascience

I'm a sophomore in college wanting to get into the data science field after I graduate. I'm currently learning Python in a class of mine and I'm looking to do some learning on my own. I've found two books, Data Science from Scratch: First Principles with Python and Data Science from Scratch: Practical Guide with Python. My roommate has a copy of the first book and I've looked through it some. I'm wondering if anyone has experience with either of these, or any other resources that would be helpful for me.

Thanks for your help!

u/starkiller1990 · 3 pointsr/datascience

Check out Data Smart. http://www.amazon.com/Data-Smart-Science-Transform-Information/dp/111866146X

It shows you how to perform linear regression in Excel as well as loads more Data Science techniques such as time series forecasting, clustering, prediction etc.

It assumes no background in maths/stats, and all you need is Excel.

u/KeyVisual · 1 pointr/datascience

What resources would you recommend for newbies? I'm currently reading Data Science from Scratch (Grus) and Python for Data Analysis (McKinney). Anything else I should check out?

Love the blog!

u/czlapka · 7 pointsr/datascience


Maybe not exactly typical Data Science, but as an introduction and background I recommend "Data Smart: Using Data Science to Transform Information into Insight"

https://www.amazon.com/Data-Smart-Science-Transform-Information/dp/111866146X/ref=asap_bc?ie=UTF8

u/WeoDude · 4 pointsr/datascience

I don't have a tutorial for TensorFlow, but Hands on Machine Learning with Scikit-Learn and TensorFlow (https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291/ref=sr_1_1?ie=UTF8&qid=1500494347&sr=8-1&keywords=hands+on+machine+learning) should basically be the bible of machine learning implementation.

As for XGBoost, the best way I learned it was by looking at Kaggles.

u/machinedunlearned · 1 pointr/datascience

I teach applied math, stats, and computation courses to B.S. degree seeking students. Two observations:

  • First, the theory is not pointless. I know, you didn't say or imply that. But I just have to get that out there.

  • Second, my observation is that many applied mathematics/statistics courses at the undergraduate level do an astoundingly poor job of connecting mathematics to real world data. I can go on about this for days, but I'll stop with saying that (a) this is an incredible disservice to students across all disciplines that require any math, and (b) there are mathematicians and statisticians who are aware of the problem and working to make it better.

Okay, rant concluded, book recommendations! First, try Doing Data Science by O'Neil and Schutt. This assumes some knowledge of linear algebra, stats, and programming. Examples are given in R. I think this book is very good at bringing out the idea that data science involves both theory and experience, and is good at bringing out the "feel" of working on data science problems.

Second, if your math background is calling for plenty of math, you might take a look at Machine Learning: A Probabilistic Perspective. This takes a closer look at the data modeling process, which is somewhat lacking in more CS-oriented texts on ML. Requires good knowledge of probability, obviously.

u/ultraliks · 16 pointsr/datascience

Sounds like you're looking for the statistical proofs behind all the hand waving commonly done by "machine learning" MOOCs. I recommend this book. It's very math heavy, but it covers the underlying theory well.

u/la727 · 4 pointsr/datascience

Are there any good resources for learning more about this?

I have a tech sales background and have an interest in analytics. I picked up this book as a springboard- https://www.amazon.com/Data-Science-Business-Data-Analytic-Thinking/dp/1449361323/ref=nodl_

u/DaveVoyles · 5 pointsr/datascience

Yes, this x100. I work with so many large companies, and you've described one of the largest problems I consistently run into.


"It's all in the data -- figure it out"


I often recommend this book: Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking


If businesses cannot describe their problem in two sentences, it means they do not understand the problem they are trying to solve.

u/Wafzig · 1 pointr/datascience

This. The book that accompanies these videos is one of my main go-tos. Very well put together. Great examples.

Another real good book is Practical Data Science with R.

I'm not sure what language the Johns Hopkins Coursera Data Science courses are done in, but I'd imagine either R or Python.

u/johny5w · 5 pointsr/datascience

This might be what you are looking for, The Visual Display of Quantitative Information By Edward Tufte. The book is a little older, but the principles still stand, and it is considered a pretty seminal work for data visualization.

u/velos · 3 pointsr/datascience

Don't want to be the devil's advocate here, but I think everyone interested in getting into this field must read the book Weapons of Math Destruction by Cathy O'Neil.
What 'good' DS can do has been well promoted everywhere... What 'disaster' it can bring, few would want to shine a spotlight on... Pursue this field knowing both its light and dark sides...

u/daturkel · 4 pointsr/datascience

The book Doing Data Science, cowritten by Cathy O'Neil (of Weapons of Math Destruction), may be of interest to you.

> In many of these chapter-long lectures, data scientists from companies such as Google, Microsoft, and eBay share new algorithms, methods, and models by presenting case studies and the code they use. If you’re familiar with linear algebra, probability, and statistics, and have programming experience, this book is an ideal introduction to data science.

I haven't read the whole thing yet, but it's well-written and has a nice survey of topics.

u/th3_gibs0n · 6 pointsr/datascience

Data Engineering is different everywhere and task dependent. The best advice I can give is have SQL be your second language. Then depending on your role or daily tasks you would be looking at extra materials.

General Insightful Reads:

u/AKGeef · 2 pointsr/datascience

I don't know of any MOOCs that use Keras, so your best bet might be going through their documentation.

If you are looking for a Data Science MOOC that uses Python, University of Michigan has one here.

Also, another great resource is Joel Grus's book called Data Science from Scratch.

u/ttelbarto · 1 pointr/datascience

Hi, There are so many resources out there I don't know where to start! I would work through some kind of beginner python book (recommendation below). Then maybe try Andrew Ng's Machine Learning Coursera course to get a taste of Machine Learning. Once you have completed both of those I would reassess what you would like to focus on. I will include some other books I would recommend below.

Beginner Python - https://www.amazon.co.uk/Python-Crash-Course-Hands-Project-Based/dp/1593276036/ref=sr_1_3?keywords=python+books&qid=1565035502&s=books&sr=1-3

Machine Learning Coursera - https://www.coursera.org/learn/machine-learning

Python Machine Learning - https://www.amazon.co.uk/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291/ref=sr_1_7?crid=2QF98N9Q9GCJ9&keywords=hands+on+data+science&qid=1565035593&s=books&sprefix=hands+on+data+sc%2Cstripbooks%2C183&sr=1-7

https://www.amazon.co.uk/Data-Science-Scratch-Joel-Grus/dp/1492041130/ref=sr_1_1?crid=PJEJNNUBNQ8N&keywords=data+science+from+scratch&qid=1565035617&s=books&sprefix=data+science+from+s%2Cstripbooks%2C140&sr=1-1

Statistics (intro) - https://www.amazon.co.uk/Naked-Statistics-Stripping-Dread-Data/dp/039334777X/ref=sr_1_1?keywords=naked+statistics&qid=1565035650&s=books&sr=1-1

More stats (I haven't read this but gets recommended) - https://www.amazon.co.uk/Think-Stats-Allen-B-Downey/dp/1491907339/ref=sr_1_1?keywords=think+stats&qid=1565035674&s=books&sr=1-1

u/nkk36 · 12 pointsr/datascience

I've never heard of that book before, but I took a look at their samples and they all seem legitimate.

I would just buy the Ebook for $59 and work through some problems. I'd also maybe purchase some books (or find free PDFs online). Given that you don't have a deep understanding of ML techniques I would suggest these books:

  1. Intro to Statistical Learning
  2. Data Science for Business

There are others as well, but those are two introductory-level textbooks that I am familiar with and that are often suggested by others.

u/Chrono803 · 2 pointsr/datascience

Just a heads up, Python for Data Analysis now has a second edition :)

u/spring_m · 2 pointsr/datascience

Also check out Applied Predictive Modeling - it's in a way the next book to read after ISLR - it goes a bit more in depth about good practices, plusses and minuses of different models, feature creation/extraction.

u/k5d12 · 2 pointsr/datascience

If OP doesn't have the possibility of taking a statistical learning class, ISL is a good introduction.

u/finitedimensions · 11 pointsr/datascience

I glanced at "Hands-On Machine Learning with Scikit-Learn and TensorFlow" by Aurelien Geron and thought it was quite good. But I have not had a chance to read it deeply yet.

https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291

u/halifaxdatageek · 4 pointsr/datascience

First off, SQL (at a basic level) is not that hard. Here's a good book that I used to teach myself.

Second, http://learnpythonthehardway.com is a good resource. It's not that hard either.

Finally, the stats part is where a lot of CS folks are weak, so that could actually be a strength for you. They know how to load and fire the gun, just not how to aim.

u/tzar1995 · 0 pointsr/datascience

"Comment your code" is one of the worst pieces of advice out there... I recommend you and the OP read this book before giving advice on what to do: Clean Code

PS: I agree with the other two points

u/alzho12 · 1 pointr/datascience

Read this book, Data Science for Business. It sounds like you don't need to code, but need to be able to converse.

u/awesome_hats · 4 pointsr/datascience

Well I'd recommend:

u/Dracontis · 2 pointsr/datascience

I'm a beginner too, so I can't give you end-to-end solution. I'll try to describe my path.

  1. You'll definitely need some statistics background. I've taken the free Inferential and Descriptive Statistics courses from Udacity.
  2. I've decided to go further in Machine Learning. There I had two choices: Machine Learning A-Z™: Hands-On Python & R In Data Science, and Machine Learning from Andrew Ng. I decided to take the second one and I'm on the fifth week now. It's really good for ML basics and theory, but the programming assignments are horrible. So I think I'll have a basic understanding of what's going on, but next to no practical skills. That's why I asked a question here about scientific advising.
  3. After I finish the course, I plan to read Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems to boost my knowledge of the algorithms in Python.

I have no idea what I'll do next. Maybe I'll take several courses and nanodegrees on Coursera. Maybe I'll find guidance and start getting hands-on experience on a real project. It's not so hard to start learning; it's hard to find a purpose and application for your knowledge.

u/DangerousDan1834 · 1 pointr/datascience

While I think Excel's professional use is limited to viewing small CSVs and running simple calculations, you can do a whole host of "data science" analysis using Excel alone. Check out Data Smart. It was the first book that introduced me to clustering.

u/namnnumbr · 2 pointsr/datascience

The Elements of Statistical Learning: Data Mining, Inference, and Prediction https://www.amazon.com/dp/0387848576/ref=cm_sw_r_cp_api_i_Q9hwCbKP3YFAR

u/Bayes_the_Lord · 2 pointsr/datascience

Hands-On Machine Learning

There's a new edition coming out in August though.

u/maxmoo · 1 pointr/datascience

Before OP worries about data engineering libraries, I'd be looking to see some more fundamental software engineering skills, i.e. things you might find in https://www.amazon.com/Clean-Code-Handbook-Software-Craftsmanship/dp/0132350882 like

  • variable names
  • formatting
  • functions(!)
  • tests
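As a rough illustration of those four points (not taken from the book), here is a small sketch. The function name, the order-total scenario, and the empty-list behavior are all assumptions made up for the example.

```python
# Instead of something like `def f(x): return sum(x) / len(x)`, use a
# descriptive name, a docstring, and explicit handling of the edge case.
def mean_order_value(order_totals):
    """Return the average order total, or 0.0 for an empty list."""
    if not order_totals:
        return 0.0
    return sum(order_totals) / len(order_totals)


def test_mean_order_value():
    """A tiny test that pins down the expected behavior."""
    assert mean_order_value([10.0, 20.0, 30.0]) == 20.0
    assert mean_order_value([]) == 0.0


test_mean_order_value()
```

Putting the logic in a small named function is what makes the test possible in the first place, which is probably why functions get the exclamation mark in the list above.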

u/svenhof · 3 pointsr/datascience

Good list of books.

I've also heard good things about Weapons of Math Destruction written by one of the authors of Doing Data Science. Haven't read it myself though.

https://www.amazon.com/Weapons-Math-Destruction-Increases-Inequality/dp/0553418815

u/draka1 · 4 pointsr/datascience

I highly recommend Weapons of Math Destruction to understand the impact of data science applied in the wrong way:
https://www.amazon.com/Weapons-Math-Destruction-Increases-Inequality/dp/0553418815

u/Booie2k1 · 2 pointsr/datascience

This was an interesting and thought provoking read. Not too long either.

Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy https://www.amazon.co.uk/dp/0553418815/ref=cm_sw_r_cp_apa_pp5UBbY4DGFYA