Jupyter Notebook basic course


Jupyter Notebook is a powerful tool for interactively developing and presenting data science projects. It integrates code and its output into a single document, combining narrative text, mathematical equations, and other rich media. This intuitive workflow promotes iterative, rapid development, which has made notebooks increasingly popular in contemporary data science, analytics, and scientific research. Best of all, as part of an open-source project, they are completely free.

The Jupyter project, the successor to the earlier IPython Notebook, was first released as a prototype in 2010. Although many different programming languages can be used in Jupyter Notebooks, this article will focus on Python, as it is by far the most common language used with them.

To get the most out of this tutorial, you should be familiar with programming, particularly Python and pandas (a data analysis library for Python). That said, if you have experience with another language, the Python in this article shouldn't feel too foreign, and pandas should be interpretable. Jupyter Notebooks can also act as a flexible platform for getting to grips with pandas and even Python, as this article will show.

We will answer a realistic question through a sample analysis, so you can see how the notebook workflow makes such a task intuitive, and how the notebook itself helps others understand our work when we share it with them.

Suppose you are a data analyst tasked with finding out how the profits of the largest companies in the United States have changed historically. You find a dataset of Fortune 500 companies spanning more than 50 years since the list's first publication in 1955, compiled from Fortune's public archive. We have created a CSV file of the available data (you can get it here).

As we shall demonstrate, Jupyter Notebooks are perfectly suited to this investigation. First, let's install Jupyter.

The easiest way for a beginner to get started with Jupyter Notebooks is to install Anaconda. Anaconda is the most widely used Python distribution for data science and comes pre-installed with the most commonly used libraries and tools. Besides Jupyter, the Python libraries bundled with Anaconda include NumPy, pandas, and Matplotlib, and the full list runs to more than 1,000 packages. This lets you hit the ground running in your own fully stocked data science workshop without the hassle of managing countless installations or worrying about dependencies and OS-specific installation issues.

To install Anaconda, download the latest version from its website and follow the installer's instructions.

If you are a more advanced user with Python already installed who prefers to manage your packages manually, you can just use pip:

```
pip3 install jupyter
```

Now let's create your first notebook. In this section, we'll see how to run and save notebooks, become familiar with their structure, and understand the interface. We'll define some core terminology that will steer you toward a practical understanding of how to use Jupyter Notebooks, and set the stage for the next section, which analyzes the sample data and brings everything we learn here to life.

On Windows, you can run Jupyter via the shortcut Anaconda adds to your start menu, which will open a new tab in your default web browser that should look like the screenshot below.

This is the Notebook Dashboard, specifically designed for managing your Jupyter Notebooks. Think of it as the launchpad for exploring, editing, and creating your notebooks.

Note that the dashboard will give you access only to the files and subfolders contained within Jupyter's startup directory; however, the startup directory can be changed. It is also possible to start the dashboard on any system via the command prompt (or terminal on Unix systems) by entering the command jupyter notebook; in this case, the current working directory becomes the startup directory.

The astute reader may have noticed that the dashboard's URL is something like http://localhost:8888/tree. Localhost is not a website; it indicates that the content is being served from your local machine: your own computer. Jupyter's notebooks and dashboard are web apps, and Jupyter starts up a local Python server to serve these apps to your web browser, making it essentially platform-independent and opening the door to easier sharing on the web.

The dashboard's interface is mostly self-explanatory, though we will come back to it briefly later. So what are we waiting for? Browse to the folder in which you would like to create your first notebook, click the "New" drop-down button in the top-right corner, and select "Python 3" (or the version of your choice).

You'll see the result right away! Your first Jupyter Notebook will open in a new tab; each notebook uses its own tab because you can open multiple notebooks simultaneously. If you switch back to the dashboard, you will see the new file Untitled.ipynb, and you should see some green text telling you that your notebook is running.

It is worth understanding what this file really is. Each .ipynb file is a text file that describes the contents of your notebook in a format called JSON. Each cell and its contents, including image attachments that have been converted into strings of text, is listed therein along with some metadata. You can edit this yourself, if you know what you are doing, by selecting "Edit > Edit Notebook Metadata" from the menu bar in the notebook.
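As a minimal sketch of this format, you can inspect the JSON with nothing more than Python's standard library. The field names below follow the notebook file format, but the notebook content itself is made up for illustration:

```python
import json

# A stripped-down example of what an .ipynb file contains (hypothetical
# content; real files carry more metadata per cell).
minimal_ipynb = '''
{
  "cells": [
    {"cell_type": "code", "source": ["print('Hello World!')"],
     "metadata": {}, "outputs": [], "execution_count": 1}
  ],
  "metadata": {},
  "nbformat": 4,
  "nbformat_minor": 5
}
'''

notebook = json.loads(minimal_ipynb)
print(notebook["nbformat"])               # the notebook format version
print(notebook["cells"][0]["cell_type"])  # "code"
```

Because it is plain JSON, any tool that can read JSON can read (or generate) a notebook.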

You can also view the contents of your notebook files by selecting "Edit" from the controls on the dashboard, but the key word there is can; there is no reason other than curiosity to do so unless you really know what you are doing.

Now that you have an open notebook in front of you, its interface will hopefully not look entirely alien; after all, Jupyter is essentially just an advanced word processor. Why not take a look around? Check out the menus to get a feel for it, and in particular take some time to scroll through the list of commands in the command palette, which is the small button with the keyboard icon (or press Ctrl + Shift + P).

There are two fairly prominent terms that you should notice, and which are probably new to you: cells and kernels. They are key both to understanding Jupyter and to what makes it more than just a word processor. Fortunately, these concepts are not difficult to understand.

We will return to kernels a little later, but first let's come to grips with cells. Cells form the body of a notebook. In the screenshot of a new notebook in the section above, the box with the green outline is an empty cell. There are two main cell types that we will cover: a code cell contains code to be executed in the kernel and displays its output below, while a Markdown cell contains text formatted using Markdown and displays its output in place when it is run.

The first cell in a new notebook is always a code cell. Let's test it out with a classic hello world example. Type print('Hello World!') into the cell, then click the Run button in the toolbar above, or press Ctrl + Enter. The result should look like this:

```
print('Hello World!')
Hello World!
```

When you run the cell, its output is displayed below it, and the label to its left changes from In [ ] to In [1]. The output of a code cell also forms part of the document, which is why you can see it in this article. You can always tell the difference between code and Markdown cells because code cells have that label on the left and Markdown cells do not. The "In" part of the label is simply short for "Input," while the number indicates when the cell was executed on the kernel; in this case, the cell was executed first. Run the cell again and the label will change to In [2], because it is now the second cell to run on the kernel. This will become very useful when we delve deeper into kernels.

From the menu bar, click Insert and select "Insert Cell Below" to create a new code cell below your first one, then try out some code of your own to see what happens. Do you notice any difference?

In general, the output of a cell comes from any text data printed during the cell's execution, together with the value of the last line in the cell, be it a lone variable, a function call, or something else. For example:

```
def say_hello(recipient):
    return 'Hello, {}!'.format(recipient)

say_hello('Tim')
'Hello, Tim!'
```

You will find yourself using this a lot in your own projects, and we will see more of it later on.

When you run a cell you may often notice its border turn blue, whereas it is green while you are editing it. There is always one active cell, highlighted with a border whose color denotes its current mode: green for edit mode and blue for command mode.

So far we have seen how to run a cell with Ctrl + Enter, but there is plenty more. Keyboard shortcuts are a very popular aspect of the Jupyter environment because they facilitate a speedy cell-based workflow. Many of these are actions you can carry out on the active cell while it is in command mode.

Below you'll find some of Jupyter's keyboard shortcuts. You don't need to memorize them immediately, but the list should give you a good idea of what's possible. For example: press Esc and Enter to switch between command mode and edit mode; in command mode, press A or B to insert a new cell above or below the active cell; press M to transform the active cell into a Markdown cell and Y to set it back to a code cell; press D twice to delete the active cell, and Z to undo the deletion.

Go ahead and try these out in your own notebook. Once you've had a play, create a new Markdown cell and we'll learn how to format the text in our notebooks.

Markdown is a lightweight, easy-to-learn markup language for formatting plain text. Its syntax has a one-to-one correspondence with HTML tags, so some prior knowledge of HTML would be helpful but is definitely not a prerequisite. Remember that this article was written in a Jupyter Notebook, so all of the narrative text and images you have seen so far were achieved in Markdown. Let's cover the basics with a quick example.

````markdown
# This is a level 1 heading

## This is a level 2 heading

This is some plain text that forms a paragraph.
Add emphasis via **bold** and __bold__, or *italic* and _italic_.

Paragraphs must be separated by an empty line.

* Sometimes we want to include lists.
* Which can be indented.

1. Lists can also be numbered.
2. For an ordered list.

[It is possible to include hyperlinks](https://www.example.com)

Inline code uses single backticks: `foo()`, and code blocks use triple
backticks:
```
bar()
```
Or can be indented by 4 spaces:

    foo()

And finally, adding images is easy: ![Alt text](https://www.example.com/image.jpg)
````

When attaching images, you have three options:

Use a URL to an image on the web.

Use a local URL to an image that you will keep alongside your notebook, such as in the same Git repository.

Add an attachment via "Edit > Insert Image"; this converts the image into a string and stores it inside your notebook's .ipynb file.

Note that this will make your .ipynb file bigger!

There is plenty more to Markdown, especially around hyperlinking, and it is also possible to simply include plain HTML. Once you find yourself pushing the limits of these basics, you can refer to the official guide from Markdown's creator, John Gruber.

Behind every notebook runs a kernel. When you run a code cell, that code is executed within the kernel, and any output is returned to the cell to be displayed. The kernel's state persists over time and between cells; it pertains to the document as a whole, not to individual cells.

For example, if you import libraries or declare variables in one cell, they will be available in other cells. In this way, you can think of a notebook document as being somewhat comparable to a script file, except that it is multimedia. Let's try this out to get a feel for it. First, we'll import a Python package and define a function:

```
import numpy as np

def square(x):
    return x * x
```

Once we have executed the cell above, we can reference np and square in any other cell.

```
x = np.random.randint(1, 10)
y = square(x)
print('%d squared is %d' % (x, y))
1 squared is 1
```

This will work regardless of the order of the cells in your notebook. You can try it yourself; let's print out our variables again.

```
print('Is %d squared %d?' % (x, y))
Is 1 squared 1?
```

No surprises there. Now let's try changing y.

```
y = 10
```

What do you think will happen if we run the cell containing the print statement again? We get the output Is 1 squared 10?

Most of the time, the flow in your notebook will be top-to-bottom, but it is common to go back and make changes. In that case, the order of execution stated to the left of each cell, such as In [6], will let you know whether any of your cells have stale output. And if you ever wish to reset things, there are several incredibly useful options in the Kernel menu: Restart (restarts the kernel, clearing all variables that were defined), Restart & Clear Output (the same, but also wipes the output displayed below your code cells), and Restart & Run All (the same, but then runs all your cells in order from first to last).

If your kernel is ever stuck on a computation and you wish to stop it, you can choose the Interrupt option.

As you may have noticed, Jupyter gives you the option to change the kernel, and in fact there are many different options to choose from. Back when you created a new notebook from the dashboard by selecting a Python version, you were actually choosing which kernel to use.

Not only are there kernels for different versions of Python, but also for over 100 other languages, including Java, C, and even Fortran. Data scientists may be particularly interested in the kernels for R and Julia, as well as imatlab and the Calysto MATLAB kernel. The SoS kernel provides multi-language support within a single notebook. Each kernel has its own installation instructions, but will likely require you to run some commands on your computer.

Now that we've looked at what a Jupyter Notebook is, it's time to see how they're used in practice, which should give you a clearer idea of why they are so popular. It's finally time to get started with the Fortune 500 dataset mentioned earlier. Remember, our goal is to find out how the profits of the largest companies in the US changed historically.

It's worth noting that everyone will develop their own preferences and style, but the general principles still apply. You can follow along with this section in your own notebook if you wish, which also gives you the freedom to play around.

Before you start writing your project, you'll probably want to give it a meaningful name. Somewhat confusingly, you cannot name or rename your notebook from the notebook app itself; instead, you must use either the dashboard or your file browser to rename the .ipynb file. We'll head back to the dashboard to rename the file we created earlier, which will have the default notebook file name Untitled.ipynb.

You cannot rename a notebook while it is running, so you will first have to shut it down. The easiest way to do this is to select "File > Close and Halt" from the notebook menu. However, you can also shut down the kernel by choosing "Kernel > Shutdown" from within the notebook app, or by selecting the notebook in the dashboard and clicking "Shutdown" (see the image below).

You can then select your notebook and click "Rename" in the dashboard controls.

Note that closing the notebook's tab in your browser will not "close" your notebook in the way that closing a document closes it in a traditional application. The notebook's kernel will continue to run in the background and needs to be shut down before it is truly "closed"; this is pretty handy, though, if you accidentally close your tab or browser! If the kernel has been shut down, you can close the tab without worrying about whether it is still running.

Once you've named your notebook, open it back up and we can get going.

It's common to start off with a code cell specifically for imports and setup, so that if you choose to add or change anything, you can simply edit and re-run that cell without causing any side effects.

```
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")
```

We import pandas to work with our data, Matplotlib to plot charts, and Seaborn to make our charts prettier. It's also common to import NumPy, but in this case, although we use it via pandas, we don't need to do so explicitly. The first line isn't a Python command; it uses something called a line magic to instruct Jupyter to capture Matplotlib plots and render them in the cell output; this is one of a range of advanced features that are beyond the scope of this article.

Let’s load the data.

```
df = pd.read_csv('fortune500.csv')
```

It's sensible to do this in its own cell, too, in case we need to reload it at any point.

Now that we're underway, it's best practice to save regularly. Pressing Ctrl + S will save your notebook by invoking the "Save and Checkpoint" command, but what is this checkpoint business?

Every time you create a new notebook, a checkpoint file is created alongside your notebook file; it lives in a hidden subdirectory of your save location called .ipynb_checkpoints, and it is also a .ipynb file. By default, Jupyter will autosave your notebook every 120 seconds to this checkpoint file without altering your primary notebook file. When you "Save and Checkpoint," both the notebook and checkpoint files are updated. Hence, the checkpoint lets you recover your unsaved work in the event of an unexpected issue. You can revert to the checkpoint from the menu via "File > Revert to Checkpoint".
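As a quick aside, here is a hypothetical helper (not part of Jupyter) for finding the checkpoint copies Jupyter has left on disk, assuming the default .ipynb_checkpoints layout described above:

```python
from pathlib import Path

def find_checkpoints(root):
    """Return paths of checkpoint notebooks found under `root`.

    Jupyter stores each checkpoint inside a hidden .ipynb_checkpoints
    directory next to the notebook it belongs to.
    """
    return sorted(str(p) for p in Path(root).glob('**/.ipynb_checkpoints/*.ipynb'))
```

For example, find_checkpoints('.') lists every checkpoint beneath the current directory.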

We're making good progress! Our notebook is saved safely, and we've loaded our dataset df into the most-used pandas data structure, which is called a DataFrame and looks like a table. So what does our dataset look like?

```
df.head()

   Year  Rank           Company  Revenue (in millions)  Profit (in millions)
0  1955     1    General Motors                 9823.5                   806
1  1955     2       Exxon Mobil                 5661.4                 584.8
2  1955     3        U.S. Steel                 3250.4                 195.4
3  1955     4  General Electric                 2959.1                 212.6
4  1955     5            Esmark                 2510.8                  19.1
```

```
df.tail()

       Year  Rank                Company  Revenue (in millions)  Profit (in millions)
25495  2005   496        Wm. Wrigley Jr.                 3648.6                   493
25496  2005   497         Peabody Energy                 3631.6                 175.4
25497  2005   498  Wendy's International                 3630.4                  57.8
25498  2005   499     Kindred Healthcare                 3616.6                  70.6
25499  2005   500   Cincinnati Financial                 3614.0                   584
```

Looking good. We have the columns we need, and each row corresponds to a single company's financials in a single year.

Let’s rename these columns so that we can reference them later.

```
df.columns = ['year', 'rank', 'company', 'revenue', 'profit']
```

Next, we need to explore our dataset. Is it complete? Did pandas read it as expected? Are any values missing?

```
len(df)
25500
```

That looks good: from 1955 to 2005, that's 500 rows per year.
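As a quick sanity check on that figure (a throwaway cell, not part of the original analysis):

```python
# 1955 through 2005 inclusive is 51 Fortune 500 lists of 500 rows each.
years = 2005 - 1955 + 1
print(years * 500)  # → 25500
```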

Let's check whether our dataset has been imported as we would expect. A simple check is to see whether the data types (or dtypes) have been correctly interpreted.

```
df.dtypes

year         int64
rank         int64
company     object
revenue    float64
profit      object
dtype: object
```

Uh oh. It looks like there's something wrong with the profit column; we would expect it to be float64 like the revenue column. This indicates that it probably contains some non-numeric values, so let's take a look.

```
non_numberic_profits = df.profit.str.contains('[^0-9.-]')
df.loc[non_numberic_profits].head()

     year  rank                company  revenue profit
228  1955   229                 Norton    135.0   N.A.
290  1955   291        Schlitz Brewing    100.0   N.A.
294  1955   295  Pacific Vegetable Oil     97.9   N.A.
296  1955   297     Liebmann Breweries     96.0   N.A.
352  1955   353     Minneapolis-Moline     77.4   N.A.
```

Just as we suspected! Some of the values are strings that have been used to indicate missing data. Are there any other missing values?

```
set(df.profit[non_numberic_profits])
{'N.A.'}
```

That's easy to interpret, but what should we do? That depends on how many values are missing.

```
len(df.profit[non_numberic_profits])
369
```

It's a small fraction of our dataset, though not completely inconsequential: it's still around 1.5%. If the rows containing N.A. are roughly uniformly distributed over the years, the easiest solution would just be to remove them. So let's have a quick look at the distribution.
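Verifying the "around 1.5%" claim above (again a throwaway check, not part of the original analysis):

```python
# 369 missing values out of 25,500 rows.
missing_share = 369 / 25500
print(round(missing_share * 100, 2))  # → 1.45
```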

```
bin_sizes, _, _ = plt.hist(df.year[non_numberic_profits], bins=range(1955, 2006))
```

[Histogram: missing value distribution by year]

We can see that the most invalid values in a single year is fewer than 25, and since there are 500 data points per year, removing these values would account for less than 4% of the data for the worst years. Indeed, other than a surge around the 1990s, most years have fewer than half the missing values of the peak. For our purposes, let's say this is acceptable and go ahead and remove those rows.

```
df = df.loc[~non_numberic_profits]
df.profit = df.profit.apply(pd.to_numeric)
```

Let's check that worked.

```
len(df)
25131

df.dtypes

year         int64
rank         int64
company     object
revenue    float64
profit     float64
dtype: object
```

Great! We have finished setting up our dataset.

If you were going to present your notebook as a report, you could get rid of the investigatory cells we created, which are included here as a demonstration of the flow of working with notebooks, merge relevant cells (see the advanced functionality mentioned below), and create a single dataset setup cell. This would mean that if we ever mangle our dataset elsewhere, we can just rerun the setup cell to restore it.

Next, we can get to addressing the question at hand by plotting the average profit by year. We might as well plot the revenue as well, so first we can define some variables and a method to reduce our code.

```
group_by_year = df.loc[:, ['year', 'revenue', 'profit']].groupby('year')
avgs = group_by_year.mean()
x = avgs.index
y1 = avgs.profit

def plot(x, y, ax, title, y_label):
    ax.set_title(title)
    ax.set_ylabel(y_label)
    ax.plot(x, y)
    ax.margins(x=0, y=0)
```

Now let's plot!

```
fig, ax = plt.subplots()
plot(x, y1, ax, 'Increase in mean Fortune 500 company profits from 1955 to 2005', 'Profit (millions)')
```

[Line chart: increase in mean Fortune 500 company profits from 1955 to 2005]

Wow, it looks like an exponential, but it has some huge dips. They must correspond to the early 1990s recession and the dot-com bubble. It's pretty interesting to see that in the data. But how come profits recovered to even higher levels after each recession?

Maybe the revenues can tell us more.

```
y2 = avgs.revenue
fig, ax = plt.subplots()
plot(x, y2, ax, 'Increase in mean Fortune 500 company revenues from 1955 to 2005', 'Revenue (millions)')
```

[Line chart: increase in mean Fortune 500 company revenues from 1955 to 2005]

That adds another side to the story. Revenues were not as badly hit; that's some great accounting work from the finance departments.

With a little help from Stack Overflow, we can superimpose these plots with +/- their standard deviations.

```
def plot_with_std(x, y, stds, ax, title, y_label):
    ax.fill_between(x, y - stds, y + stds, alpha=0.2)
    plot(x, y, ax, title, y_label)

fig, (ax1, ax2) = plt.subplots(ncols=2)
title = 'Increase in mean and std Fortune 500 company %s from 1955 to 2005'
stds1 = group_by_year.std().profit.as_matrix()
stds2 = group_by_year.std().revenue.as_matrix()
plot_with_std(x, y1.as_matrix(), stds1, ax1, title % 'profits', 'Profit (millions)')
plot_with_std(x, y2.as_matrix(), stds2, ax2, title % 'revenues', 'Revenue (millions)')
fig.set_size_inches(14, 4)
fig.tight_layout()
```

[Side-by-side charts: mean and standard deviation of profits and revenues, 1955 to 2005]

That's staggering: the standard deviations are huge! Some Fortune 500 companies make billions while others lose billions, and the risk has increased along with rising profits over the years. Perhaps some companies perform better than others; would the profits of the top 10% be more or less volatile than the bottom 10%?

There are plenty of questions we could look into next, and it's easy to see how the flow of working in a notebook matches your own thought process, so now it's time to bring this example to a close. This flow helped us to easily investigate our dataset in one place without context-switching between applications, and our work is immediately shareable and reproducible. If we wished to create a more concise report for a particular audience, we could quickly refactor our work by merging cells and removing intermediary code.

When people talk about sharing their notebooks, there are generally two paradigms they may be considering. Most often, individuals share the end result of their work, much like this article itself, which means sharing non-interactive, pre-rendered versions of their notebooks; however, it is also possible to collaborate on notebooks with the aid of version control systems such as Git.

That said, there are some emerging companies that offer the ability to run interactive Jupyter Notebooks in the cloud, on the web.

A shared notebook will appear exactly in the state it was in when you exported or saved it, including the output of any code cells. Therefore, to ensure that your notebook is share-ready, so to speak, there are a few steps you can take before sharing: click "Cell > All Output > Clear", then click "Kernel > Restart & Run All", and wait for your code cells to finish executing (checking that they did so as expected).

This will ensure your notebooks don't contain intermediary output, have a stale state, or execute out of order at the time of sharing.

Jupyter has built-in support for exporting to HTML and PDF, as well as several other formats, which you can find from the menu under "File > Download As". If you wish to share your notebooks with a small private group, this functionality may well be all you need. Indeed, as many researchers at academic institutions are given some public or internal web space, and because you can export a notebook to an HTML file, Jupyter Notebooks can be an especially convenient way for them to share their results with their peers.

But if sharing exported files doesn't cut it for you, there are also some immensely popular methods of sharing .ipynb files more directly on the web.

With over 1.8 million public notebooks on GitHub as of early 2018, it is undoubtedly the most popular independent platform for sharing Jupyter projects with the world. GitHub has integrated support for rendering .ipynb files directly, both in repositories and in gists on its website. If you aren't already aware, GitHub is a code-hosting platform for version control and collaboration with repositories created using Git. You'll need an account to use their services, but standard GitHub accounts are free.

Once you have a GitHub account, the easiest way to share a notebook on GitHub doesn't actually require Git at all. Since 2008, GitHub has provided its Gist service for hosting and sharing code snippets, each of which gets its own repository. To share a notebook using Gists: sign in and browse to gist.github.com; open your .ipynb file in a text editor, select all, and copy the JSON inside; paste that JSON into the gist; then give your Gist a filename ending in .ipynb and create the gist.

This should look like this:

If you created a public Gist, you can now share its URL with anyone, and others will be able to fork and clone your work.

Creating your own Git repository and sharing it on GitHub is beyond the scope of this tutorial, but GitHub provides a number of guidelines for your reference.

An extra tip for those using Git is to add an exception to your .gitignore for the hidden .ipynb_checkpoints directory that Jupyter creates, so that checkpoint files aren't committed unnecessarily to your repository.
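That exception is a single line in your .gitignore; a minimal sketch:

```
# Don't commit Jupyter's autosaved checkpoint copies
.ipynb_checkpoints/
```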

Having grown since 2015 to render many thousands of notebooks every week, NBViewer is the most popular notebook renderer on the web. If you already host your Jupyter Notebooks somewhere online, be it on GitHub or elsewhere, NBViewer will render your notebook and provide a shareable URL along with it. It is provided as a free service as part of Project Jupyter, and is available at nbviewer.jupyter.org.

Initially developed before GitHub's Jupyter Notebook integration, NBViewer allows anyone to enter a URL, Gist ID, or GitHub username/repo/filename and it will render the notebook as a web page. A Gist's ID is the unique string at the end of its URL; for example, the string after the last slash in https://gist.github.com/username/50896401c23e0bf417e89e1de. If you enter a GitHub username or username/repo, you will see a minimal file browser that lets you explore a user's repositories and their contents.
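As an illustration (a made-up helper, not part of NBViewer's API), extracting that ID is just a matter of taking the final path segment of the URL:

```python
def gist_id(url):
    """Return the Gist ID: the final path segment of a gist URL."""
    return url.rstrip('/').rsplit('/', 1)[-1]

print(gist_id('https://gist.github.com/username/50896401c23e0bf417e89e1de'))
# → 50896401c23e0bf417e89e1de
```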

The URL at which NBViewer displays a notebook is based on the URL of the notebook being rendered and will not change, so you can share it with anyone and it will work for as long as the original file remains online; NBViewer doesn't cache files for very long.

Starting from the basics, we have come to grips with the workflow of Jupyter Notebooks, delved into some of IPython's more advanced features, and finally learned how to share our work with friends, colleagues, and the world. And we accomplished all of this from a notebook itself!

It should be clear how notebooks promote a productive working experience by reducing context switching and emulating the natural development of thoughts during a project. The power of Jupyter Notebooks should also be evident, and we covered plenty of leads to get you started exploring more advanced features in your own projects.

If you'd like further inspiration for your own notebooks, the Jupyter project has put together a gallery of interesting Jupyter Notebooks that you may find helpful, and the Nbviewer homepage links to some genuinely high-quality examples. You can also check out our list of Jupyter Notebook tips.
