How to Read Each Row of Dataframe Python

How to iterate over rows in a DataFrame in Pandas?

Answer: DON'T*!

Iteration in Pandas is an anti-pattern and is something you should only do when you have exhausted every other option. You should not use any function with "iter" in its name for more than a few thousand rows or you will have to get used to a lot of waiting.

Do you want to print a DataFrame? Use DataFrame.to_string() .

Do you want to compute something? In that case, search for methods in this order (list modified from here):

  1. Vectorization
  2. Cython routines
  3. List Comprehensions (vanilla for loop)
  4. DataFrame.apply() : i) Reductions that can be performed in Cython, ii) Iteration in Python space
  5. DataFrame.itertuples() and iteritems()
  6. DataFrame.iterrows()

iterrows and itertuples (both receiving many votes in answers to this question) should be used in very rare circumstances, such as generating row objects/namedtuples for sequential processing, which is really the only thing these functions are useful for.
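For completeness, a minimal sketch of that rare legitimate use (column names here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'label': ['a', 'b', 'c'], 'value': [1, 2, 3]})

# itertuples() yields one namedtuple per row: (Index, label, value).
# Attribute access by column name keeps sequential processing readable.
rows = [(row.label, row.value) for row in df.itertuples()]
```

Even here, reach for this only when each row genuinely needs to be handled as its own object.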

Appeal to Authority

The documentation page on iteration has a huge red warning box that says:

Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed [...].

* It's actually a little more complicated than "don't". df.iterrows() is the correct answer to this question, but "vectorize your ops" is the better one. I will concede that there are circumstances where iteration cannot be avoided (for example, some operations where the result depends on the value computed for the previous row). However, it takes some familiarity with the library to know when. If you're not sure whether you need an iterative solution, you probably don't. PS: To know more about my rationale for writing this answer, skip to the very bottom.


Faster than Looping: Vectorization, Cython

A good number of basic operations and computations are "vectorised" by pandas (either through NumPy, or through Cythonized functions). This includes arithmetic, comparisons, (most) reductions, reshaping (such as pivoting), joins, and groupby operations. Look through the documentation on Essential Basic Functionality to find a suitable vectorised method for your problem.

If none exists, feel free to write your own using custom Cython extensions.
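As a minimal sketch (with made-up column names), here is what staying inside the vectorized API looks like for a comparison plus a reduction:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# Elementwise comparison - no Python-level loop, runs in compiled code
mask = df['a'] > 1

# Cythonized reduction over the filtered rows
total = df.loc[mask, 'b'].sum()
```

Both steps operate on whole columns at once; no row is ever touched from Python.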


Next Best Thing: List Comprehensions*

List comprehensions should be your next port of call if 1) there is no vectorized solution available, 2) performance is important, but not important enough to go through the hassle of cythonizing your code, and 3) you're trying to perform an elementwise transformation on your code. There is a good amount of evidence to suggest that list comprehensions are sufficiently fast (and even sometimes faster) for many common Pandas tasks.

The formula is simple,

          # Iterating over one column - `f` is some function that processes your data
          result = [f(x) for x in df['col']]

          # Iterating over two columns, use `zip`
          result = [f(x, y) for x, y in zip(df['col1'], df['col2'])]

          # Iterating over multiple columns - same data type
          result = [f(row[0], ..., row[n]) for row in df[['col1', ...,'coln']].to_numpy()]

          # Iterating over multiple columns - differing data type
          result = [f(row[0], ..., row[n]) for row in zip(df['col1'], ..., df['coln'])]

If you can encapsulate your business logic into a function, you can use a list comprehension that calls it. You can make arbitrarily complex things work through the simplicity and speed of raw Python code.
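A small sketch of that pattern (the function and column names are illustrative, not from the original):

```python
import pandas as pd

df = pd.DataFrame({'first': ['Jane', 'John'], 'last': ['Doe', 'Smith']})

def full_name(first, last):
    # Arbitrary business logic: plain Python, easy to test in isolation
    return f"{last}, {first}".upper()

# The comprehension just feeds each row's values to the function
df['display'] = [full_name(f, l) for f, l in zip(df['first'], df['last'])]
```

Because the logic lives in an ordinary function, it can be unit-tested without a DataFrame in sight.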

Caveats

List comprehensions assume that your data is easy to work with - what that means is your data types are consistent and you don't have NaNs, but this cannot always be guaranteed.

  1. The first one is more obvious, but when dealing with NaNs, prefer in-built pandas methods if they exist (because they have much better corner-case handling logic), or ensure your business logic includes appropriate NaN handling logic.
  2. When dealing with mixed data types you should iterate over zip(df['A'], df['B'], ...) instead of df[['A', 'B']].to_numpy() as the latter implicitly upcasts data to the most common type. As an example, if A is numeric and B is string, to_numpy() will cast the entire array to string, which may not be what you want. Fortunately zipping your columns together is the most straightforward workaround to this.
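The upcasting pitfall is easy to demonstrate; the int/float flavour of it is shown below (int/string columns upcast to object dtype instead, which is harder to eyeball):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [0.5, 1.5]})

# to_numpy() must pick one common dtype for both columns:
# the integer column A is silently upcast to float64
arr = df[['A', 'B']].to_numpy()

# zip iterates each column with its own dtype intact
kinds = [(type(a).__name__, type(b).__name__) for a, b in zip(df['A'], df['B'])]
```

With zip, `A`'s values stay integers and `B`'s stay floats; with to_numpy(), everything becomes float64.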

*Your mileage may vary for the reasons outlined in the Caveats section above.


An Obvious Example

Let's demonstrate the difference with a simple example of adding two pandas columns A + B. This is a vectorizable operation, so it will be easy to contrast the performance of the methods discussed above.
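A sketch of the contenders on that task (toy data; the original answer benchmarks these across a range of sizes):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# 1. Vectorized - the idiomatic choice
vec = df['A'] + df['B']

# 2. List comprehension over zipped columns
lc = pd.Series([a + b for a, b in zip(df['A'], df['B'])], index=df.index)

# 3. Row-wise apply - much slower: one Python function call per row
ap = df.apply(lambda row: row['A'] + row['B'], axis=1)

# 4. iterrows - slowest: materializes a Series for every row
ir = pd.Series([row['A'] + row['B'] for _, row in df.iterrows()], index=df.index)
```

All four produce the same result; only their cost per row differs.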

Benchmarking code, for your reference. The line at the bottom measures a function written in numpandas, a style of Pandas that mixes heavily with NumPy to squeeze out maximum performance. Writing numpandas code should be avoided unless you know what you're doing. Stick to the API where you can (i.e., prefer vec over vec_numpy).
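The original benchmarking harness did not survive the scrape; a rough stand-in using timeit (absolute numbers will vary by machine, and `vec`/`vec_numpy`/`loop` are the names the paragraph above alludes to) could look like:

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.random.rand(10_000), 'B': np.random.rand(10_000)})

def vec():
    return df['A'] + df['B']

def vec_numpy():
    # "numpandas": drop to raw NumPy arrays for a little extra speed
    return df['A'].to_numpy() + df['B'].to_numpy()

def loop():
    return [a + b for a, b in zip(df['A'], df['B'])]

# Time each approach over repeated runs
for fn in (vec, vec_numpy, loop):
    print(f"{fn.__name__:10s} {timeit.timeit(fn, number=50):.4f}s")
```

On typical hardware the two vectorized variants come out far ahead of the loop as the row count grows.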

I should mention, however, that it isn't always this cut and dried. Sometimes the answer to "what is the best method for an operation" is "it depends on your data". My advice is to test out different approaches on your data before settling on one.


My Personal Stance *

Most of the analyses performed on the various alternatives to the iter family have been through the lens of performance. However, in most situations you will typically be working on a reasonably sized dataset (nothing beyond a few thousand or 100K rows) and performance will come second to simplicity/readability of the solution.

Here is my personal preference when selecting a method to use for a problem.

For the novice:

Vectorization (when possible); apply(); List Comprehensions; itertuples()/iteritems(); iterrows(); Cython

For the more experienced:

Vectorization (when possible); apply(); List Comprehensions; Cython; itertuples()/iteritems(); iterrows()

Vectorization prevails as the most idiomatic method for any problem that can be vectorized. Always seek to vectorize! When in doubt, consult the docs, or look on Stack Overflow for an existing question on your particular task.

I do tend to go on about how bad apply is in a lot of my posts, but I do concede it is easier for a beginner to wrap their head around what it's doing. Additionally, there are quite a few use cases for apply, as explained in this post of mine.

Cython ranks lower down on the list because it takes more time and effort to pull off correctly. You will usually never need to write code with pandas that demands this level of performance that even a list comprehension cannot satisfy.

* As with any personal opinion, please take with heaps of salt!


Further Reading

  • 10 Minutes to pandas, and Essential Basic Functionality - Useful links that introduce you to Pandas and its library of vectorized*/cythonized functions.

  • Enhancing Performance - A primer from the documentation on enhancing standard Pandas operations

  • Are for-loops in pandas actually bad? When should I care? - a detailed writeup by me on list comprehensions and their suitability for various operations (mainly ones involving non-numeric data)

  • When should I (not) want to use pandas apply() in my code? - apply is slow (but not as slow as the iter* family). There are, however, situations where one can (or should) consider apply as a serious alternative, especially in some GroupBy operations.

* Pandas string methods are "vectorized" in the sense that they are specified on the series but operate on each element. The underlying mechanisms are still iterative, because string operations are inherently hard to vectorize.
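That is why a .str method and the equivalent list comprehension are often comparable in speed; a minimal sketch:

```python
import pandas as pd

s = pd.Series(['apple', 'banana', 'cherry'])

# "Vectorized" string method: convenient and NaN-aware, but loops internally
upper_str = s.str.upper()

# Equivalent list comprehension - often just as fast on clean data
upper_lc = pd.Series([x.upper() for x in s], index=s.index)
```

Prefer the .str method when NaNs may be present; it skips them instead of raising.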


Why I Wrote this Answer

A common trend I notice from new users is to ask questions of the form "How can I iterate over my df to do X?", showing code that calls iterrows() while doing something inside a for loop. Here is why. A new user to the library who has not been introduced to the concept of vectorization will likely envision the code that solves their problem as iterating over their data to do something. Not knowing how to iterate over a DataFrame, the first thing they do is Google it and end up here, at this question. They then see the accepted answer telling them how to, and they close their eyes and run this code without ever first questioning whether iteration is the right thing to do.

The aim of this answer is to help new users understand that iteration is not necessarily the solution to every problem, that better, faster and more idiomatic solutions could exist, and that it is worth investing time in exploring them. I'm not trying to start a war of iteration vs. vectorization, but I want new users to be informed when developing solutions to their problems with this library.

Source: https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas